[05:23] <jam> morning all
[07:35] <jam> rvba: morning
[07:35] <rvba> jam: Hi
[07:35] <jam> I don't know if you've seen this, but it appears that maas-region-controller gets *horribly* broken if you have your IP address changed.
[07:36] <jam> I'm trying to set up a 'large' region on EC2, and I shut it down while I went away, so when I started it again, it came back up with a different IP.
[07:36] <jam> lots of bits were broken, and so far, no luck actually getting it running.
[07:36] <jam> So I'm trying to just purge everything and reinstall.
[07:36] <rvba> jam: by horribly broken you mean that nothing tells you that the connection is broken?
[07:36] <jam> dpkg-reconfigure maas-region-controller doesn't actually seem to reset things to get them to the right IP addresses and URLs.
[07:37] <jam> rvba: well, #1 I filed a bug on, which is that the cluster spins forever trying to connect to the region
[07:37] <jam> but fails
[07:37] <jam> and fails before it starts celery
[07:37] <rvba> Right.
[07:37] <jam> so it doesn't even have task_logger hooked up.
[07:37] <jam> but the maas-region-controller itself sits spinning.
[07:37] <jam> because we now set BROKER_URL
[07:37] <jam> and that is no longer valid after the IP address changes.
[07:37] <jam> also, rabbitmq-server is... unhappy at best.
[07:38] <rvba> I see.
[07:38] <rvba> Well, not really rabbitmq
[07:38] <jam> like 'service rabbitmq-server stop' claims that it is stopping the broker
[07:38] <jam> but 'beam.smp' is still running.
[07:38] <rvba> Celery will be unhappy.
[07:38] <jam> rvba: celery was at least logging that there was a problem.
[07:38] <rvba> That looks like a problem in rabbitmq itself.
[07:39] <jam> rvba: sure
[07:39] <jam> I'm not saying it is Maas's fault that rabbit is broken, but it was a lot of stuff that was just 'not working', and I couldn't really figure out how to fix it.
[07:39] <jam> note that squid also ended up pointing to the wrong IP address, etc.
[07:40] <rvba> Interesting.
[07:40] <rvba> And scary :)
[07:40] <jam> rvba: also, there are some bits where it wasn't clear whether the dpkg-reconfigure wanted the URL or the IP Address.
[07:43] <rvba> jam: I'm surprised that restarting rabbitmq does not fix the problem (the problem with rabbitmq of course).
[07:44] <jam> rvba: well 'service rabbitmq-server stop' doesn't actually stop it, so the restart doesn't really work.
[07:46] <jam> rvba: the cluster controller also calls itself name='master', is there supposed to be a way to configure that?
[07:47] <rvba> jam: when the cluster controller connects to the region, it changes the name of the nodegroup using its UUID.
[07:47] <jam> rvba: http://ec2-23-23-14-48.compute-1.amazonaws.com/MAAS/api/1.0/nodegroups/?op=list
[07:47] <jam> not that I see
[07:48] <jam> rvba: though... it is still failing to startup, so there may be something else going on.
[07:49] <rvba> I can't get the page to load.  But I suppose you are still seeing the cluster named 'master' ... this simply means that the cluster controller was unable to connect to the region.
[07:49] <jam> on the plus side, EC2 c1.medium are a lot faster than my VM, and I can do 10,000 nodes in 3s, vs the 12s I was seeing locally.
[07:49] <jam> rvba: ah, you aren't in the security group, I imagine.
[07:50] <jam> rvba: so I see 2 nodegroups listed
[07:50] <jam> so it is somehow getting a message through, but not enough to change its name.
[07:51] <jam> rvba: http://paste.ubuntu.com/1280693/
[07:51] <jam> (if you want to help, you can give me your ip address and I can add you to the security group)
[07:53] <rvba> jam: I have access now.  ta
[07:57] <rvba> jam: having 2 clusters named 'master' is really a weird situation.  Not really scary, because the 'master' nodegroup is identified by picking the *first* nodegroup, but still, the situation is really bizarre.
[10:30] <jtv> mgz: my diff has updated.  Thanks for reviewing it.  :)
[10:33] <mgz> jtv: I will now review for real :)
[10:33] <jtv> Thanks!
[10:34] <jtv> I'll have to go catch a train in a minute.
[10:36] <jam> well, I got up to 10 cluster controllers all talking to 1 master (so 11 total) and 4k nodes on each, for 44,000 nodes total. Rebuilds tags in about 15s, but the 2-cpu region controller is definitely the bottleneck.
[10:38] <jtv> jam: great to hear that you got that working... would be very interesting to know where the bottlenecks at finer granularities will be.
[10:38] <rvba> jam: that's very good news.
[10:39] <jam> jtv: well, in the immediate term I'm going to restart the central machine with c1.xlarge instead of a c1.medium, and see what I can get it to do.
[10:39] <jtv> Heh -- you did 44K nodes without even scaling up the central server?  That is happy news indeed.  :)
[10:39] <jam> jtv: well, there are certainly aspects of the system that aren't scaling well.
[10:39] <jam> You can't really load "http://.../MAAS" or "http://.../MAAS/nodes/"
[10:39] <rvba> The whole page is definitely one of them :)
[10:40] <jtv> The main page in particular might be nice to have working......
[10:40] <rvba> jtv: this is not completely trivial, txlongpoll is the component that should be improved/replaced.
[10:40] <jtv> mgz: Raphael got in first
[10:40] <jam> rvba: the big issue is that whatever lag is introduced into the system makes it *very* hard to play with it. 10s lag spike without interaction means I miss the whole time the rebuild is happening.
[10:41] <mgz> jtv: I also posted, with a note you may want to look at
[10:41] <jtv> Willdo
[10:41] <mgz> well, the last sentence, the rest is just me thinking
[10:41] <mgz> basically, expression should use // not / I think
[10:42] <jam> on the plus side, I discovered that you can tell Amazon that everything in a given *security group* should be able to talk to each other on a given set of ports. So when you add new nodes, they automatically can talk to each other.
[10:42] <jtv> mgz: I think you're right, but also I think it's a separate issue.  I haven't tried it out but I suspect the test picks numbers that happen to come out to an even amount in megs.
[10:42] <mgz> jam: yup.
[10:42] <jtv> FWIW I reviewed that branch and said please to document these little things...
[10:42]  * jtv -> train
[10:43] <jtv> See you tomorrow, folks!
[10:43] <mgz> >>> n>>20<<20 == n
[10:43] <mgz> False
[10:43] <mgz> so, nope...
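mgz's interpreter check above shows the shift round-trip only survives when n is already MiB-aligned; a small self-contained illustration of why `//` (floor division) is the safer choice than `/` here (the values are illustrative, not from the branch under review):

```python
# n is just over 4 MiB: not a whole number of mebibytes.
n = (4 << 20) + 1

# In Python 3, / is true division and yields a float;
# // is floor division and yields an exact integer count of MiB.
assert n / (1 << 20) > 4         # a float slightly above 4
assert n // (1 << 20) == 4       # 4 whole MiB

# mgz's round-trip test: shifting down then up drops the low 20 bits,
# so the result only equals n when n was already MiB-aligned.
assert (n >> 20) << 20 != n
m = 5 << 20                      # exactly 5 MiB
assert (m >> 20) << 20 == m
```

So the `False` result above is expected for any size that isn't an even number of megabytes, which is why the test data likely picked round values.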
[10:44] <mgz> I didn't understand the security group that permits itself idiom for ages, but it's actually quite handy.
[10:45] <jam> mgz: the downside is that I don't have a way to 'ssh' to all the machines and poke at them... which is where Juju would be nice to have.
[10:46] <mgz> hm, you do, but probably not in a handy way without writing a wrapper script
[10:46] <mgz> you can always permit 22 in the same group, or add default and that group, and the api will give you the hostnames/ips
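The self-referencing security-group idiom jam and mgz describe can be sketched with today's AWS CLI (the tooling at the time was ec2-api-tools, and the group name "maas-cluster" is illustrative — this is a hedged sketch, not the commands actually used):

```shell
# Create a group, then authorize the group itself as a traffic source:
# every instance in "maas-cluster" can then reach every other member
# on the listed port, including instances added later.
aws ec2 create-security-group \
    --group-name maas-cluster \
    --description "MAAS load-test cluster"

aws ec2 authorize-security-group-ingress \
    --group-name maas-cluster \
    --protocol tcp --port 22 \
    --source-group maas-cluster
```

As mgz notes, adding port 22 to the same rule gives you ssh between members without opening it to the world.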
[10:48] <jam> mgz: sure, but managing 10+ machines manually is certainly at the point of a bit of pain.
[10:49] <jam> add to it the multi-second latency spikes
[10:52] <mgz> hm, is trunk make broken currently?
[10:53] <mgz> I probably just need to clean first I guess...
[10:54] <mgz> clean not removing the stuff in bin/ and distclean removing a local egg cache is annoying
[10:55] <mgz> maybe I need to add a global egg cache to my setup script, but it's really not needed apart from if you blow away the working tree files
[11:08] <mgz> okay, this is fun
[11:10] <mgz> today's machine can't run tests, because it can't start the db, because pg_ctl is not installed... because the packaging thinks that conflicts with postgres itself?!
[11:12] <mgz> probably the same as usual need to update my package lists locally...
[11:31] <mgz> okay, this latest bit is derived from there being no python-selenium package
[11:33] <jam> everyone say hello to our newest blue squad member, dimitern
[11:34] <mgz> need to add multiverse... why did this not break earlier I wonder
[11:34] <mgz> her dimitern
[11:34] <mgz> *hey
[11:34] <mgz> dammit, can't even greet without tyop
[11:35] <dimitern> mgz: hey :)
[11:36] <dimitern> glad to meet you guys!
[11:38] <mgz> gah, what is borked with postgres...
[11:39] <mgz> okay, rm -rf db && make sampledata worked
[11:39] <mgz> the error reporting for this stuff is painful, CalledProcessError is not very helpful
[11:58] <rvba> Hi dimitern, welcome aboard.
[12:00] <dimitern> rvba: hey :) 10x
[12:45] <jam> mgz, dimitern: so I'm currently load testing MAAS with 1 Region Controller, and 10 Cluster Controllers, each with 4000 node records (in EC2)
[12:46] <jam> I made the Region controller a C1.xlarge, so it has 8 cores, and the individual nodes are c1.mediums with 2 cores.
[12:46] <jam> If I just issue 1 tag to be rebuilt, it finishes in 9s.
[12:46] <jam> (as measured by the time for all requests to complete, the first one completes in a lot less time)
[12:46] <jam> If I rebuild 2 concurrently, it goes up to 12s.
[12:47] <jam> Anything higher than that, and it just serializes (presumably because the cluster nodes only have parallel = 2)
[12:47] <jam> I also think we are roughly at capacity for 1 MAAS controller, because it is going from 9s to 12s when running simultaneous requests.
[12:48] <jam> Certainly all 8 Apache WSGI processes are 'active' at >25% CPU during this time
[12:48] <jam> (there should be on the order of 20 concurrent requests.)
[12:48] <jam> I wonder about over-committing for Apache, though.
[12:48] <mgz> that sounds as expected
[12:51] <jam> mgz: well, even setting wsgi parallel = 20, I still only ever see ~8 apache processes consuming memory.
[12:51] <jam> sorry, consuming cpu
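A hedged illustration of the mod_wsgi knobs in play here (the daemon-group name, counts, and script path are hypothetical, not MAAS's actual config): in daemon mode, `processes` on `WSGIDaemonProcess` caps how many OS processes serve requests, which would explain seeing only ~8 busy apache processes no matter how high the request concurrency goes — extra concurrency lands on threads inside those processes.

```apache
# Illustrative mod_wsgi daemon-mode config (names/values hypothetical).
# With processes=8, at most 8 worker processes burn CPU; raising
# client concurrency beyond 8*threads just queues requests.
WSGIDaemonProcess maas processes=8 threads=4 display-name=%{GROUP}
WSGIProcessGroup maas
WSGIScriptAlias /MAAS /usr/share/maas/wsgi.py
```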
[12:53] <jam> mgz: I also only see postgres consume at most 4 CPUs and normally <3. Is there a good way to tell if postgres is bottlenecking, or do I need to move that onto another machine.
[12:54] <jam> rvba: what is a reasonable way from python/api to decommission 4k nodes?
[12:54] <jam> (I want to get rid of all nodes in the 'master' group)
[12:55] <rvba> jam: I guess you can simply delete() them.  From python directly I mean.
[12:55] <jam> rvba: Node.delete() ? Or NodeGroup.node_set.delete() ?
[12:55] <jam> or is the latter only deleting the linkage?
[12:55] <rvba> jam: the latter is fine.
[12:56] <rvba> It will delete the nodes.
[12:56] <jam> rvba: 'RelatedManager has no attribute delete'
[12:56] <rvba> jam: nodegroup.node_set.all().delete()
[12:57] <jam> rvba: right, vs nodegroup.node_set.clear() which would just remove the linkage.
[12:57] <rvba> Yep.
[12:57] <jam> rvba: interestingly, the bottleneck when deleting 4k nodes is actually 'django-admin'.
[12:58] <jam> Presumably lots of python hooks on the delete method?
[12:58] <jam> (It has to delete mac addresses for each node it deletes.)
[12:58] <jam> maybe we should have had a pre-join there :)
[12:58] <rvba> Yeah, very probably.
[13:07] <rvba> jam: deleting mac addresses should be handled at the db level (delete cascade).
[13:08] <jam> mgz: definitely seems to be a 'thundering herd' issue here. Many of the processes finish significantly before the rest. It appears that they all get their first request started on time, but they tend to get starved out for that first 'give me all the nodes to work on' request.
[13:08] <jam> rvba: well it should, but we have custom code in Node.delete
[13:08] <rvba> Yeah, to update the DNS config.
[13:08] <rvba> I mean we have signals hooked up to node.delete() to update the DNS config.
[13:08] <jam> ah
[13:08] <dimitern> jam: how many VMs do I need to set up to run maas locally? is 1 enough or will I need more?
[13:08] <rvba> That might be heavy.
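rvba's point that MAC-address cleanup belongs at the database level can be shown with a minimal self-contained sketch (sqlite3 stands in for postgres; the table and column names are illustrative, not MAAS's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this opt-in
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE macaddress (
        id INTEGER PRIMARY KEY,
        node_id INTEGER NOT NULL
            REFERENCES node(id) ON DELETE CASCADE
    )
""")

conn.execute("INSERT INTO node (id) VALUES (1)")
conn.execute("INSERT INTO macaddress (node_id) VALUES (1)")
conn.execute("INSERT INTO macaddress (node_id) VALUES (1)")

# One DELETE on the parent row; the database removes the dependent
# MAC rows itself, with no per-row Python work involved.
conn.execute("DELETE FROM node WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM macaddress").fetchone()[0]
print(remaining)  # 0
```

The flip side is the one rvba hints at next: Django signals hooked to delete() (like the DNS-config update) do not fire for rows the database cascades away on its own, so pushing cleanup down to the DB trades hook coverage for speed.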
[13:09] <jam> dimitern: I use just 1
[13:09] <dimitern> jam: 10x
[13:09] <jam> if you actually want to test the PXE booting, etc. You'll want more VMs for that.
[13:09] <jam> But just for running a server, and poking at the metadata, 1 is enough.
[13:09] <dimitern> yeah, that's what i figured
[13:09] <dimitern> i'll setup pxe later, once the webapp is running
[13:11] <allenap> rvba: FORCE_SCRIPT_NAME and STATIC_URL are driving me up the wall. Do you have time this afternoon to talk about it?
[13:11] <jam> rvba: so I might have been at fault for the first IP change mess-up. I did it again this time (rebooting to a larger instance), and all I needed was to do the dpkg-reconfigure on all the machines.
[13:11] <rvba> allenap: sure
[13:12] <rvba> jam: all right.  Having to do that is normal given the state we are in.
[13:13] <rvba> allenap: I can talk now if you're free.
[13:15] <allenap> rvba: Yeah
[13:43] <allenap> rvba: Fwiw, this is my solution: http://paste.ubuntu.com/1281102/
[13:45] <rvba> allenap: looks good to me.  It even fixes using settings.STATIC_URL in src/maasserver/middleware.py which is wrong.  Never actually exercised but still wrong.
[13:46] <rvba> allenap: even in debug mode on a prod instance, apache serves the static files btw, so this is actually never used in production.
[14:58] <flacoste> roaksoax, rvba, allenap: anyone looked at bug 1066421 yet?
[14:58] <roaksoax> flacoste: yeah, it is duplicate from bug 1065763
[15:00] <roaksoax> rvba: so, is there any way we can use the same queue from txlongpoll?
[15:00] <roaksoax> rvba: using say... a different vhost?
[15:01] <rvba> roaksoax: well, a queue belongs to a vhost.
[15:01] <rvba> roaksoax: can you remind me what you're trying to do again?
[15:02] <roaksoax> rvba: so during the installation from cd/preseed, we need to start rabbitmq server (in the chroot) to be able to create the queues (running a daemon in the installer is against policy).
[15:02] <roaksoax> rvba: so the queue for txlongpoll is being created correctly, but the one for the celery workers is not
[15:02] <roaksoax> so we need to find a workaround for that
[15:02] <roaksoax> either by using the same user/password and a different vhost
[15:02] <roaksoax> or something
[15:05] <rvba> roaksoax: AFAIK the queues are created dynamically.  We are only creating users and vhosts in the postinst scripts.
[15:05] <rvba> Am I missing something?
[15:06] <roaksoax> rvba: right, so that's what I meant
[15:06] <roaksoax> rvba: the users/vhosts are created for txlongpoll
[15:06] <roaksoax> rvba: but not for the workers
[15:06] <roaksoax> rvba: and that's during the installer only
[15:08] <rvba> roaksoax: how are the workers' user/vhost different from those of txlongpoll?
[15:09] <roaksoax> rvba: it is not a problem of the postinst
[15:10] <roaksoax> rvba: it is a problem with running the rabbitmq daemon in the installer (chroot), which goes against policy
[15:10] <roaksoax> but for some reason it does work for the first (txlongpoll user/vhost creation) but not for the second one (celery worker user/vhost creation)
[15:11] <roaksoax> rvba: so that's why I was wondering if we would be able to use the same username/password created for txlongpoll, with a different vhost
[15:11] <rvba> roaksoax: It would be nice to understand why it works for the first and not for the second.
[15:11] <rvba> roaksoax: there is no real reason why not.  But it's a bit ugly to use the same credentials for two completely unrelated things.
[15:12] <roaksoax> rvba: that's what I'm trying to figure out, but it seems that the first part gets created because rabbitmq is running correctly, but then gets killed and the rest of the stuff doesn't get done
[15:12] <roaksoax> i've tried to work around it without success
[15:13] <roaksoax> rvba: i'll further investigate now as I have a couple of other ideas on how to fix it
[15:13] <roaksoax> rvba: but the issue is basically rabbitmq not running during the installer
[15:14] <allenap> roaksoax, rvba: Another (better? harder?) solution might be to get the region to configure all this stuff at runtime instead of install-time.
[15:16] <rvba> allenap: it would be one step towards having less stuff in the packaging.  But obviously we don't really have the "framework" in place to do that properly.
[15:16] <roaksoax> indeed
[15:16] <roaksoax> i think it would be too risky to do it at this point
[15:16] <roaksoax> being so close to release
[15:16] <allenap> Fair point.
[15:22] <dimitern> guys, can you help with errors like this when running buildout on maas src root: Error: There is a version conflict.
[15:22] <dimitern> We already have: distribute 0.6.28dev-r0
[15:24] <allenap> dimitern: Are you on Quantal or Precise?
[15:24] <dimitern> quantal
[15:25] <dimitern> i did a vanilla install of 12.04 server, updated/upgraded, did release upgrade to 12.10 and called make install-dependencies, then i got this
[15:28] <dimitern> I read some forums about this error, which recommended listing the package = version in versions.cfg explicitly
[15:28] <roaksoax> rvba: http://pastebin.ubuntu.com/1281273/
[15:29] <dimitern> I did that, but then it complains about other packages further on (like amqplib)
[15:30] <roaksoax> rvba: i found out what's the issue... i'm such a dumbass
[15:30] <allenap> dimitern: Here, I have python-setuptools 0.6.28-1ubuntu2 installed, but virtualenv provides 0.6.24 when I create a new env. Can you try:
[15:31] <rvba> roaksoax: what's the issue?
[15:31] <allenap> dimitern: virtualenv foo && foo/bin/python -c 'import setuptools; print setuptools.__file__'
[15:32] <roaksoax> rvba: i just remembered that we work around it by creating the rabbitmq stuff in the upstart job to address this issue
[15:33] <dimitern> allenap: so I need to run buildout in a virtualenv?
[15:33] <roaksoax> rvba: btw... the worker stuff is for maas-region-celery.upstart
[15:33] <allenap> dimitern: Ah. Just type `make` and it'll sort it out for you.
[15:35] <allenap> dimitern: The virtualenv was a hack to make buildout behave a little better. I suspect we'll actually get rid of buildout eventually, or scale back its role.
[15:35] <dimitern> allenap: ok, now vbox issues... I'll dig into it some more (shared folders/read only filesystem error)
[15:36] <dimitern> allenap: it seems to have worked after chown -R :) now it's installing, 10x
[15:36] <allenap> Cool.
[15:46] <roaksoax> rvba: the worker stuff is used when running maas-region-celery right?
[15:47] <rvba> roaksoax: right.
[16:01] <dimitern> any help on how to resolve this?: maas@maas1:~/work/maas$ make syncdb
[16:01] <dimitern> bin/database --preserve run -- bin/maas syncdb --noinput
[16:01] <dimitern> initdb: Postgres-XC node name is mandatory
[16:40] <roaksoax_> rvba: ok so we are gonna need a wrapper to start maas-region-celery
[16:40] <roaksoax_> rvba: http://paste.ubuntu.com/1281418/
[17:09] <roaksoax> rvba: still around?
[17:09] <rvba> roaksoax: yeah, but I'll need to bugger off real soon.
[17:12] <roaksoax> rvba: could you please help me test this? http://paste.ubuntu.com/1281498/
[17:12] <roaksoax> rvba: i'm at ODS and network is crap
[17:12] <roaksoax> and can't really test it
[17:12] <roaksoax> rvba: start-region-celery -> create one on /usr/sbin
[17:13] <roaksoax> rvba: and just modify the upstart job accordingly
[17:14] <rvba> roaksoax: why a custom script instead of the exec /usr/bin/celeryd ?
[17:15] <roaksoax> rvba: we need to create rabbitmq credentials in the upstart job if they had not been created by the installer
[17:15] <roaksoax> rvba: which is the actual workaround we made for txlongpoll
[17:15] <rvba> roaksoax: yes, I understand that part, but why do you need another startup script?
[17:15] <roaksoax> rvba: but the upstart job cannot do so because we set the uid,gid to maas
[17:15] <roaksoax> rvba: cannot run rabbitmq
[17:16] <roaksoax> rvba: so the best approach is to simply make a "wrapper" that runs celeryd
[17:16] <rvba> Ah ok.  I take it it's not possible to setuid maas/setgid maas just before executing exec /usr/bin/celeryd…
[17:16] <roaksoax> rvba: nope unfortunately :(
[17:16] <roaksoax> rvba: it applies to everything
[17:18] <rvba> roaksoax: ok, I'll test that for you, but then I'll need to run.
[17:18] <roaksoax> rvba: awesome thank you
[17:18] <roaksoax> rvba: i'm tethering from my phone cause internet sucks here
[17:26] <rvba> roaksoax: looks like it works ok, the celeryd daemon gets started all right.
[17:26] <rvba> roaksoax: I've manually deleted the user maas_workers.
[17:26] <rvba> In that case it fails because the pre-start script tried to create the vhost which was already there.
[17:27] <rvba> roaksoax: if I also delete the vhost it is fine.
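A hypothetical reconstruction of the wrapper discussed above (the real script is in the pastebin links, which are not reproduced here; all names, paths, and the password file are assumptions). The key difference from what rvba tested is that each creation step is guarded, so re-running after a partial setup does not fail on an already-existing user or vhost:

```shell
#!/bin/sh
# Hedged sketch of a /usr/sbin/start-region-celery wrapper.
# Runs as root (upstart can't setuid here, since rabbitmqctl needs
# root), creates the credentials only if the installer couldn't,
# then drops privileges to the maas user before exec'ing celeryd.
set -e

USER_NAME="maas_workers"               # illustrative names
VHOST="/maas_workers"
PASS_FILE="/etc/maas/workers-password" # hypothetical secret location

# Idempotent: skip each step when the user/vhost already exists,
# so a re-run after a partial install succeeds instead of erroring.
if ! rabbitmqctl list_users | grep -q "^${USER_NAME}"; then
    rabbitmqctl add_user "$USER_NAME" "$(cat "$PASS_FILE")"
fi
if ! rabbitmqctl list_vhosts | grep -q "^${VHOST}\$"; then
    rabbitmqctl add_vhost "$VHOST"
fi
rabbitmqctl set_permissions -p "$VHOST" "$USER_NAME" '.*' '.*' '.*'

# Only now drop to the maas user and hand off to the daemon.
exec su -s /bin/sh -c 'exec /usr/bin/celeryd' maas
```

rvba's closing point stands: if an equivalent helper already exists (as it did for txlongpoll), the guarded-creation logic should be shared rather than duplicated.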
[17:28] <rvba> roaksoax: note that this /usr/sbin/start-region-celery utility is very similar to a helper function we already have so it would be better to share the code.