/srv/irclogs.ubuntu.com/2014/09/03/#maas.txt

=== jfarschman is now known as MilesDenver
jtv	rvba: is the user-data at the end of installation only for the fast-path installer then?	05:44
Valduare	hows it going guys	05:49
jtv	Hi	05:52
Valduare	any news on maas with arm devices	05:54
=== liam_ is now known as Guest81233
=== jfarschman is now known as MilesDenver
rvba	jtv: the user-data is requested before the f-p installation happens. Now I'm not sure it happens in the case of d-i.	06:35
=== CyberJacob|Away is now known as CyberJacob
jtv	rvba: I'm trying it out...	06:56
=== jfarschman is now known as MilesDenver
jtv	dimitern: hi there — when could you talk about networking?	07:35
dimitern	jtv, hey, how about tomorrow or on friday?	07:36
dimitern	jtv, what's a good time for you?	07:36
jtv	Anytime before 11:00 UTC. Except our standup is at 08:30 UTC.	07:37
dimitern	jtv, so how about tomorrow @ 10 UTC ?	07:38
jtv	Yes, great.	07:38
dimitern	jtv, i'll send an invite, cheers	07:38
dimitern	i'll invite roaksoax and jam if they want to join	07:39
jtv	Sure.	07:39
dimitern	jtv, actually, do you mind if we move it 30m earlier - 9:30 UTC, as it will overlap with our standup :)	07:40
jtv	Better for me actually!	07:40
dimitern	great! invite sent	07:43
ramonskie	i have upgraded to trusty on my cluster controller and also to maas 1.5.2 but now i stumbled on to this bug https://bugs.launchpad.net/maas/+bug/1307779	07:44
ubot5	Ubuntu bug 1307779 in MAAS "fallback from specific to generic subarch broken" [Critical,Fix released]	07:44
ramonskie	it seems to be fixed in 1.6 but i can't find that package. any ideas?	07:45
bigjools	ramonskie: ppa:maas-maintainers/stable	07:48
ramonskie	thanks	07:49
=== jfarschman is now known as MilesDenver
ramonskie	okay so i upgraded and it finds an image now. only it's now stuck on the screen where i see route-info	08:47
jtv	rvba: hey, how about I remove that restriction where you need at least 16 bits of netmask on a managed network? That was for the old generated zone files.	09:21
rvba	jtv: yep, we don't need that restriction anymore.	09:22
jtv	\o/	09:22
jtv	Easy karma.	09:22
ramonskie	when enlisting nodes it hangs on the following screen https://dl.dropboxusercontent.com/u/50671970/enlist-hang.jpg	09:33
jtv	ramonskie: that part looks OK to me in itself... how long did you watch them hang?	09:34
jtv	Because if it failed there, I'd probably expect some error output.	09:34
ramonskie	for about 10 minutes now	09:35
jtv	If this is all it shows on the console, I'd give it a bit longer.	09:36
=== jfarschman is now known as MilesDenver
ramonskie	jtv: thanks i will wait	09:37
=== CyberJacob is now known as CyberJacob|Away
ramonskie	jtv: how long should i wait....	09:52
jtv	ramonskie: any change on the screen?	09:52
ramonskie	nope nothing	09:52
jtv	Then I guess it's time to go trawling through the logs.	09:53
ramonskie	which log files do i need to check..	09:53
jtv	I'm looking along. It's on the maas server, in /var/log/maas.	09:53
jtv	To be honest, since we're not seeing any error message, I don't know what to look for in this case.	09:53
jtv	Oh, one thing that might also help: shift-PageUp on the node's console might show a bit more history.	09:54
jtv	(Might as well keep the node doing whatever it's doing for now, in case it changes its mind)	09:54
jtv	First thing to do is a quick scan for obvious errors:	09:55
jtv	/var/log/maas/apache2/error.log, /var/log/maas/celery.log, /var/log/maas/maas.log	09:55
jtv	If there's an error in there, chances are it'll jump right out at you.	09:56
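The quick scan jtv describes could be sketched roughly as below. This is only an illustration: the log paths are the ones named in the conversation, while the error markers (`ERROR`, `CRITICAL`, `Traceback`) and the function names are assumptions, not anything MAAS defines. Reading these files may require root.

```python
import re
from pathlib import Path

# Log paths as listed in the conversation; they may differ on other installs.
LOG_PATHS = [
    "/var/log/maas/apache2/error.log",
    "/var/log/maas/celery.log",
    "/var/log/maas/maas.log",
]

# Heuristic markers for "obvious errors" (an assumption, not a MAAS convention).
ERROR_PATTERN = re.compile(r"ERROR|CRITICAL|Traceback")


def find_error_lines(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


def scan_logs(paths=LOG_PATHS):
    """Scan each log file; also report files that are missing or empty."""
    report = {}
    for name in paths:
        path = Path(name)
        if not path.is_file():
            report[name] = "missing"
            continue
        with path.open(errors="replace") as handle:
            lines = handle.readlines()
        # An empty log is worth flagging in itself.
        report[name] = find_error_lines(lines) if lines else "empty"
    return report
```

Calling `scan_logs()` returns a dict keyed by path, so an empty `maas.log` shows up as `"empty"` rather than as a silent absence of errors.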
ramonskie	maas.log is empty nothing	09:56
jtv	That is really odd.	09:56
jtv	Even if there are no requests, it's supposed to have periodic jobs in there. Unless... which version is this?	09:57
jtv	(Roughly — e.g. "the one that came with 14.04")	09:57
ramonskie	first i was on ubuntu 12.04 with maas 1.4 then upgraded to ubuntu 14.04 and maas 1.5 and now upgraded to 1.6	09:57
jtv	OK, that's pretty recent. Good.	09:58
jtv	I'm not sure if you'll have /var/log/maas/apache2 then; but it's just a symlink to /var/log/apache2.	09:58
ramonskie	in celery i see some dhcp lease errors	09:59
jtv	Oh?	09:59
jtv	Can you paste one?	09:59
ramonskie	ERROR/MainProcess] Task provisioningserver.tasks.upload_dhcp_leases[e19c4353-92a2-499e-8f95-154b3b017950] raised unexpected: IOError()	09:59
ramonskie	also need the trace?	09:59
jtv	That'd be nice, thanks. Maybe use paste.ubuntu.com.	10:00
ramonskie	wait let me clean all the logs and restart the controller server and start all over	10:01
ramonskie	in case it's unrelated stuff	10:02
jtv	OK	10:02
jtv	Thanks for the review rvba.	10:03
rvba	jtv: np. I just put up for review https://code.launchpad.net/~rvb/maas/revert-2872/+merge/233191. A run in the CI shows that it fixes the problem introduced recently.	10:05
* jtv looks		10:05
ramonskie	jtv: these are the errors after a restart in celery.log http://pastebin.com/ayRxD5v0	10:06
jtv	Thanks.	10:06
jtv	rvba: done. :)	10:06
rvba	Ta	10:06
jtv	ramonskie: scant consolation but this is code that's already removed from the dev version. :)	10:08
jtv	Now, what seems to be going wrong is that the cluster controller is having trouble talking to the region-controller API.	10:08
ramonskie	its on the same server	10:09
ramonskie	i only have one	10:09
jtv	Yeah, it should be easy, shouldn't it?	10:09
ramonskie	:P	10:09
jtv	You may want to check that your DEFAULT_MAAS_URL is configured sensibly.	10:09
jtv	That's in several places. Just grep /etc/maas for it — but as root, or you won't be able to read some of those files.	10:10
jtv	(There's credentials in some of them.)	10:10
ramonskie	./maas_local_settings.py:DEFAULT_MAAS_URL = "http://172.21.42.1/MAAS" ./maas_local_settings.py.dpkg-dist:DEFAULT_MAAS_URL = "http://maas.internal.example.com/"	10:10
jtv	That looks sensible — assuming 172.21.42.1 is indeed your server's IP address, and the nodes will be able to reach it.	10:11
jtv	You could have a look to see if that same request shows up in the Apache error log, in case it did get through to the server.	10:12
jtv	(Or maybe even the Apache access log — but I doubt that)	10:12
jtv	Oh! Just in case, you may want to search the Apache access log for "/MAAS/MAAS"	10:13
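A rough sketch of that access-log search, assuming Apache's common/combined log format, where the request appears as a quoted `"METHOD path HTTP/x.y"` field. The function name and the idea of returning the matching paths are illustrative assumptions. It looks for request paths that begin with the `/MAAS` prefix twice in a row.

```python
import re

# Matches the quoted request field of a common/combined-format access log line.
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[0-9.]+"')


def doubled_prefix_requests(log_lines, prefix="/MAAS"):
    """Return request paths that start with the prefix repeated twice."""
    doubled = prefix + prefix
    hits = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path = match.group(1)
        if path == doubled or path.startswith(doubled + "/"):
            hits.append(path)
    return hits
```

An empty result corresponds to the "nothing found" case; any hit would point at a URL being built with the prefix applied twice.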
ramonskie	nope nothing found	10:14
ramonskie	only thing i see in the apache error log is this : No such file or directory: mod_wsgi (pid=2708): Unable to change working directory to '/home/maas'	10:15
ramonskie	but should not be the problem	10:15
jtv	No, shouldn't be.	10:15
jtv	It's dumb, but maybe you could just try making a wget request to http://172.21.42.1/MAAS from the server itself, just to make sure that gets through?	10:16
ramonskie	strange thing is i see now that i have 2 cluster controllers in clusters	10:16
jtv	Two clusters? That's interesting.	10:16
jtv	When it wakes up, the cluster registers itself with the region controller, and then just keeps polling for the region controller to say "sure, yeah, you're accepted."	10:17
ramonskie	yeah i think upgrade	10:17
jtv	The region controller should identify them by UUID.	10:17
ramonskie	i think it's an upgrade quirk that happened in 1.5	10:18
ramonskie	but i gave that another set of ips and dns zone	10:18
jtv	New to me... I guess they have different UUIDs? I guess one is "master"?	10:18
ramonskie	yes one is master and one is called maas	10:18
ramonskie	in dns zone	10:18
jtv	If they're both running on the same server, that spells trouble.	10:18
jtv	Because they control a DHCP server, a DNS server, iSCSI, a TFTP server, and so on.	10:19
ramonskie	ahh well that explains a lot	10:19
jtv	I'm still not sure _how_ it would cause the failure in the log, but it's definitely closer to the source.	10:20
ramonskie	the only problem is if i delete the newly created cluster it pops back up in pending state	10:20
jtv	Yeah.	10:20
ramonskie	and the other cluster still has a set of working nodes in it	10:20
jtv	You could try stopping the cluster controller, deleting the new cluster, and then in the UI updating the old one to look like the new one.	10:21
jtv	Mind you, they'll still have different UUIDs... so that may not be good enough.	10:21
jtv	I think this'll require some database surgery.	10:21
ramonskie	they have different uuids	10:21
jtv	Yeah.	10:21
jtv	So the upgrade generated a new one instead of reusing the old one.	10:22
ramonskie	also the old cluster has only 6 boot-images and the new one 126	10:22
jtv	Yeah, a lot has changed there.	10:22
ramonskie	so can i move the nodes from the old cluster to the new one	10:22
jtv	Let's start by getting a good view of the situation... if you grep /etc/maas for "UUID", do you get consistent UUIDs from the various config files?	10:23
ramonskie	and then delete the old cluster	10:23
jtv	The only way to move nodes is to delete them from one cluster and re-enlist them into the other.	10:23
jtv	If that is not a problem, then I think it's the easiest way out.	10:23
jtv	But it means that anything you've got running on those nodes is lost.	10:23
ramonskie	maas_local_celeryconfig_cluster.py:CLUSTER_UUID = '3d245a63-2b23-42be-8977-f36cb2218b9e'	10:24
ramonskie	and thats the new one	10:24
ramonskie	yeah i can't delete those nodes, openstack is running on them with a lot of vms :(	10:24
jtv	Blast.	10:24
ramonskie	otherwise i would already have done a clean install :)	10:25
jtv	Well, it's going to get tricky at any rate. First let me have a look for known bugs.	10:25
ramonskie	okay	10:26
jtv	Meanwhile, could you check that the cluster UUID in /etc/maas/maas_cluster.conf is consistent with the one you found in maas_local_celeryconfig_cluster.py?	10:26
ramonskie	yes they are the same	10:27
jtv	OK	10:27
ramonskie	both the new accepted cluster	10:27
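The UUID consistency check discussed above could be sketched along these lines. The helper names are hypothetical, and the exact line format of `maas_cluster.conf` is an assumption, so the sketch simply collects UUID-shaped strings from any line that mentions "UUID", whatever the quoting style.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def uuids_in_text(text):
    """Collect UUID-shaped strings from lines that mention 'UUID'."""
    found = set()
    for line in text.splitlines():
        if "UUID" in line.upper():
            found.update(match.lower() for match in UUID_RE.findall(line))
    return found


def check_consistency(configs):
    """configs maps file name -> file text.

    Returns (is_consistent, per_file), where is_consistent is True when at
    most one distinct UUID appears across all the files.
    """
    per_file = {name: uuids_in_text(text) for name, text in configs.items()}
    distinct = set().union(*per_file.values())
    return len(distinct) <= 1, per_file
```

Feeding it the contents of the files under /etc/maas (read as root) gives a per-file view, which makes it easy to see which file still carries the other cluster's UUID.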
ramonskie	can i disable the other cluster but still let dns work	10:28
jtv	ramonskie: safest thing to try I guess would be to set them to the old cluster. But... first a look for known bugs.	10:28
ramonskie	that should solve it	10:28
jtv	Well, plus a restart. :) And then you'd have to delete the new one.	10:28
ramonskie	okay but if i set it to the old cluster will also the new boot images be added?	10:29
jtv	Should be, yes. Because AFAICT the two actually share everything except a process.	10:31
jtv	It's the same files on disc, etc. It may take a few minutes for the remaining cluster controller to inform the region controller of what it has.	10:33
ramonskie	you already checked known bugs?	10:33
ramonskie	so should i try this?	10:33
jtv	I checked known unfixed bugs. Let me make one more round for ones that may have been fixed later.	10:34
=== jfarschman is now known as MilesDenver
jtv	ramonskie: I guess it's not bug 1344089, and that's the best candidate I found.	10:38
ubot5	bug 1344089 in MAAS 1.6 "IntegrityError after upgrading to 1.6beta5" [Critical,Fix released] https://launchpad.net/bugs/1344089	10:38
jtv	(I realise you hit your problem with an earlier version)	10:38
ramonskie	i'am on 1.6 now	10:39
jtv	Anyway, assuming that's not it, we'll have to make the change. I'd stop the cluster controllers first.	10:39
jtv	Then update the UUID entries in the config, to use your original cluster's UUID.	10:40
jtv	I'd also set the cluster interfaces to Unmanaged, just so you can re-enable the right one later.	10:41
ramonskie	what's the best and safest way to stop the cluster controller	10:42
jtv	Then restart, accept the right cluster controller if needed (it may be automatic), enable the right cluster interface, and see if that fixes things.	10:42
jtv	sudo service maas-cluster-celery stop	10:42
jtv	sudo service maas-pserv stop	10:42
jtv	Then I'd run a "ps -ef | grep maas" to check for lingering processes.	10:42
ramonskie	wow there is still a lot running	10:44
jtv	Yeah it's not a small thing.	10:44
ramonskie	several of these: /usr/bin/python /usr/bin/celeryd --logfile=/var/log/maas/celery-region.log --schedule=/var/lib/maas/celerybeat-region-schedule --loglevel=INFO --beat --queues=celery,master	10:45
ramonskie	should i kill them?	10:45
jtv	No, those are the region controller's celery.	10:45
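As a sketch of that check for lingering processes: the helpers below filter `ps -ef` output for commands mentioning "maas" and separate out the region controller's celery, which should be left running. Treating `celery-region.log` in the command line as the marker for region celery is a heuristic taken from the command line pasted above, not a documented rule, and all function names are hypothetical.

```python
import subprocess


def command_of(ps_line):
    """Return the CMD column of a `ps -ef` line (the 8th field onward)."""
    fields = ps_line.split(None, 7)
    return fields[7] if len(fields) == 8 else ""


def lingering_maas_processes(ps_lines):
    """Return ps lines whose command mentions maas, skipping grep itself."""
    found = []
    for line in ps_lines:
        command = command_of(line)
        if "maas" in command and not command.split()[0].endswith("grep"):
            found.append(line)
    return found


def is_region_celery(ps_line):
    """Heuristic: the region controller's celery logs to celery-region.log."""
    return "celery-region.log" in command_of(ps_line)


def current_ps_lines():
    """Run `ps -ef` and return its output lines."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Something like `[l for l in lingering_maas_processes(current_ps_lines()) if not is_region_celery(l)]` would then list only the processes worth a closer look after stopping the cluster services.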
ramonskie	and there should be 10 of them?	10:45
jtv	Probably not.	10:46
ramonskie	lol	10:46
jtv	But I don't know what might cause there to be more... I do hope you don't have two region controllers as well!	10:46
ramonskie	i certainly hope not	10:48
ramonskie	the only thing i did was what i thought was a simple upgrade	10:48
jtv	Yeah. This clearly shouldn't have happened.	10:49
ramonskie	okay edited both files maas_local_celeryconfig_cluster.py and maas_cluster.conf	10:49
ramonskie	with the old uuid	10:50
jtv	OK.	10:50
jtv	And you've set the cluster interfaces to Unmanaged in the UI?	10:50
ramonskie	the new cluster?	10:50
ramonskie	done for the new cluster	10:51
jtv	OK. I'd do the old one as well.	10:51
jtv	(The only drawback is your DHCP server will be down briefly — let's keep it short)	10:52
ramonskie	no dhcp entries will be deleted?	10:52
jtv	Not as such, though there may be more confusion that will only become clear later.	10:53
ramonskie	backed it up just in case	10:53
jtv	Good.	10:54
jtv	And then we get to restart. A reboot would be the most comprehensive.	10:55
ramonskie	reboot it is	10:55
ramonskie	okay rebooted	10:57
ramonskie	i removed the new cluster now	10:57
ramonskie	do i need to set managed dhcp on again?	10:59
ramonskie	what are the best next steps to take?	11:00
jtv	First: is the old cluster now Accepted?	11:00
jtv	If it is, then yes, re-enable DHCP management (and DNS management I guess — you mentioned using that)	11:01
ramonskie	yes the old cluster is accepted and the boot-images are now also 126 instead of 6	11:01
jtv	Excellent!	11:02
jtv	Want to try that node again?	11:02
ramonskie	yup	11:02
ramonskie	let me first check if everything is okay	11:03
ramonskie	and that the ip addresses haven't changed :P	11:03
jtv	Yeah. Anything you can check is a plus at this point. :)	11:03
jtv	If you feel up to it, maybe a fresh look at those logs in /var/log/maas.	11:03
ramonskie	okay check seems okay no error for now	11:05
ramonskie	will try a node now	11:05
* jtv bates breath		11:06
ramonskie	whooopppdidoooh	11:08
ramonskie	it works	11:08
ramonskie	muchos kudos to you!!!	11:08
jtv	Phew.	11:09
ramonskie	thanks for helping mate	11:09
jtv	Glad I could help — and glad it didn't come crashing down on us. :-)	11:09
ramonskie	yes i'm really glad i don't need to start over. this saved so much work	11:10
ramonskie	and finally the auto discover of ipmi is working :D	11:10
jtv	I'll have to go now, but I would really appreciate if you could file a bug about this — especially the part where you upgraded and got two cluster controllers. That might still be in the packaging somewhere.	11:10
ramonskie	okay will do thanks for all the help	11:12
ramonskie	where do you want me to file the bug?	11:12
jtv	https://bugs.launchpad.net/maas	11:13
jtv	(You have a Launchpad account, right?)	11:13
ramonskie	yup	11:14
ramonskie	okay creating one now	11:14
jtv	Thanks. If we can prevent this from happening to someone else, that's wonderful.	11:14
* jtv runs now		11:14
jtv	Good night!	11:14
ramonskie	i'm out too, bye	11:22
ramonskie	bug created https://bugs.launchpad.net/maas/+bug/1364903	11:22
ubot5	Ubuntu bug 1364903 in MAAS "2 cluster controllers after upgrade from 1.4 > 1.5" [Undecided,New]	11:22
rvba	blake_r: Hi Blake. I had to revert 2872. See https://code.launchpad.net/~rvb/maas/revert-2872/+merge/233191 for details.	11:31
=== jfarschman is now known as MilesDenver
rvba	blake_r: Now I'm thinking that revision 2871 also introduced a problem (but a different one): one CI run failed with the nodes failing to get the images they need to boot. This looks like the problem you diagnosed yesterday and said you were working on.	12:10
rvba	blake_r: I'd say this is a race condition as the CI test passed a couple of times.	12:11
rvba	blake_r: since you'll be up in less than an hour I'll refrain from reverting this one again. Let's talk when you come online.	12:12
=== jfarschman is now known as MilesDenver
blake_r	rvba: yes it is possible it will pass	13:05
blake_r	rvba: the issue is that RPC is used for the API call but not for the image selection when a node is booting	13:05
blake_r	rvba: so pxeconfig will fail	13:05
blake_r	rvba: I have a branch that fixes pxeconfig	13:06
rvba	blake_r: 2871 causes the images not to be present from time to time. 2872 (which I reverted) was causing the node to fail to enlist (see my paste on the revert MP).	13:07
rvba	noeds*	13:08
rvba	nodes*	13:08
rvba	arg	13:08
blake_r	rvba: it's not that the images are not present, it's that the images are not present in the BootImage model, which is going away	13:15
rvba	blake_r: right, what I meant is that, as far as the node can see (and this involves the BootImage model), the images are not there.	13:16
blake_r	rvba: yes correct, I was going to get that branch ready and land today, looks like I will have to do all of that again, :(	13:17
rvba	blake_r: I just reverted 2872 (which was causing failures all the time), not 2871.	13:17
blake_r	rvba: okay	13:18
blake_r	rvba: will look at the mp in a moment, getting through email this morning	13:18
rvba	blake_r: I'm sorry but I was in the middle of a QA and having trunk broken like that means a lot of time wasted for me.	13:18
blake_r	rvba: oh I see the reason	13:21
blake_r	rvba: yeah it's reporting the available architectures wrong, I will work on a fix	13:21
rvba	blake_r: cool, ta.	13:21
rvba	blake_r: if you can fix the breakage introduced by 2871 first that would be great. Because 2871 is still checked in.	13:24
blake_r	rvba: okay	13:24
rvba	Thanks.	13:24
=== jfarschman is now known as MilesDenver
newell	I am getting a 500 error when I go to acquire a commissioned node with latest trunk. Here is the stacktrace: http://paste.ubuntu.com/8223903/	14:25
newell	Anyone seen this before?	14:26
rvba	newell: let me have a look at this stacktrace…	14:27
rvba	newell: looks like a bug in gen_dynamic_ip_addresses_with_host_maps: it should skip the ngi with no static_ip_range_low/high	14:30
rvba	newell: can you file a critical bug about this?	14:30
newell	yeah	14:30
newell	https://bugs.launchpad.net/maas/+bug/1364993	14:35
ubot5	Ubuntu bug 1364993 in MAAS "gen_dynamic_ip_addresses_with_host_maps: it should skip the ngi with no static_ip_range_low/hig" [Critical,New]	14:35
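The "skip" rvba suggests can be illustrated with a generic sketch. This is not the MAAS code: `Interface` is a hypothetical stand-in for a nodegroup interface (ngi), and only the two attribute names come from the conversation. Later in this log rvba questions whether skipping is the right fix at all, so the sketch shows only the shape of the idea.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Interface:
    """Hypothetical stand-in for a nodegroup interface (ngi)."""
    name: str
    static_ip_range_low: Optional[str] = None
    static_ip_range_high: Optional[str] = None


def interfaces_with_static_range(
    interfaces: Iterable[Interface],
) -> Iterator[Interface]:
    """Yield only interfaces that define both ends of a static range."""
    for ngi in interfaces:
        if not ngi.static_ip_range_low or not ngi.static_ip_range_high:
            # No static range configured: skip rather than fail on it.
            continue
        yield ngi
```

Code that needs the static range would then iterate over `interfaces_with_static_range(...)` instead of over every interface.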
rvba	newell: I changed the title for this bug. We try to explain what the problem is in the title/descriptions. Suggestions on the possible cause or ideas on how to fix it should be put in the comments.	14:37
rvba	newell: this helps triaging and lets people come up with alternative solutions.	14:38
newell	ha I just changed the title as well before I just read this	14:39
newell	wonder where it stands now	14:39
newell	https://bugs.launchpad.net/maas/+bug/1364993	14:39
ubot5	Ubuntu bug 1364993 in MAAS "500 error when trying to acquire a commissioned node" [Critical,New]	14:39
newell	Is that better?	14:39
rvba	Yep, it describes the problem.	14:40
newell	k, we were both in the middle of changing it and my page wasn't refreshed, that is why I didn't see that you had modified it	14:40
rvba	I figured :)	14:41
=== roadmr is now known as roadmr_afk
=== roadmr_afk is now known as roadmr
=== magicrob1tmonkey is now known as magicrobotmonkey
=== ming is now known as Guest50512
newell	rvba, still around?	16:24
rvba	newell: yep	16:26
newell	rvba, we need a test for that bug or is it trivial enough to just push the change you mentioned?	16:26
rvba	newell: as with any non-trivial change, it's worth a test	16:28
rvba	newell: now I'm not so sure my solution is the right one as test__treats_undefined_static_range_as_zero_size_network seems to test the case where ngi has no static range.	16:30
newell	yeah I was looking at that too	16:31
newell	your change doesn't break any tests though	16:31
rvba	newell: there is a bug in the test :)	16:36
newell	ha	16:37
rvba	newell: can you spot it?	16:37
newell	let me take a look	16:37
rvba	newell: My fix is up for review. https://code.launchpad.net/~rvb/maas/bug-1364993/+merge/233244. And I need to step out now. ttyl.	16:43
newell	sounds good I will review it, sorry wife was asking me questions and got pulled away	16:43
newell	ha, just needed to save it	16:44
=== roadmr is now known as roadmr_afk
dpb2	Hi all -- I'm starting a server, but I don't see any log message for the power on attempt in celery log (just periodic dhcp refreshes, etc). What is up?	17:09
dpb2	(this install had been working fine)	17:09
dpb2	roaksoax: ^ any ideas?	17:11
roaksoax	dpb2: check whether maas-pserv is running	17:15
roaksoax	dpb2: are you importing images?	17:15
dpb2	roaksoax: all maas services are reported as running	17:16
dpb2	roaksoax: let me check on the images	17:16
dpb2	roaksoax: actually, I'm not sure how to check that. :)	17:16
roaksoax	dpb2: what MAAS version are you using?	17:17
dpb2	1.6.1+bzr2550-0ubuntu1~ppa2	17:17
roaksoax	dpb2: uhmmm	17:18
roaksoax	blake_r: ^^ any thoughts?	17:18
dpb2	roaksoax: I could restart all the services, but I didn't want to mask an issue	17:18
roaksoax	dpb2: do please restart the services. 1.7 will completely change in that area	17:19
roaksoax	dpb2: because of celery being silly and causing issues like this	17:19
roaksoax	dpb2: (we are getting rid of celery)	17:19
roaksoax	dpb2: can you see logs?	17:20
roaksoax	dpb2: maas.log celery.log	17:20
dpb2	hm	17:20
dpb2	roaksoax: I'm seeing the logs now	17:20
dpb2	(just now)	17:20
dpb2	roaksoax: so...	17:20
dpb2	roaksoax: if boot images are importing, does that block power up attempts	17:20
dpb2	?	17:20
roaksoax	dpb2: yes it can... celery blocks any other jobs if a bigger job is in progress	17:21
dpb2	yikes	17:21
dpb2	ok	17:21
dpb2	well, I think it's working now. the old "try it again" fixed it	17:22
dpb2	thanks	17:22
roaksoax	dpb2: np!	17:22
Valduare	hows it going guys	17:39
Valduare	any news on maas with arm devices?	17:39
newell	Valduare, there is currently some arm support (i.e. arm64/armhf etc.)	17:50
Valduare	I have a few mk808 devices here that would be fun to be able to spin them up etc	17:51
=== roadmr_afk is now known as roadmr
=== CyberJacob|Away is now known as CyberJacob
