=== jfarschman is now known as MilesDenver
[05:44] rvba: is the user-data at the end of installation only for the fast-path installer then?
[05:49] how's it going guys
[05:52] Hi
[05:54] any news on maas with arm devices
=== liam_ is now known as Guest81233
=== jfarschman is now known as MilesDenver
[06:35] jtv: the user-data is requested before the f-p installation happens. Now I'm not sure it happens in the case of d-i.
=== CyberJacob|Away is now known as CyberJacob
[06:56] rvba: I'm trying it out...
=== jfarschman is now known as MilesDenver
[07:35] dimitern: hi there — when could you talk about networking?
[07:36] jtv, hey, how about tomorrow or on friday?
[07:36] jtv, what's a good time for you?
[07:37] Anytime before 11:00 UTC. Except our standup is at 08:30 UTC.
[07:38] jtv, so how about tomorrow @ 10 UTC ?
[07:38] Yes, great.
[07:38] jtv, i'll send an invite, cheers
[07:39] i'll invite roaksoax and jam if they want to join
[07:39] Sure.
[07:40] jtv, actually, do you mind if we move it 30m earlier - 9:30 UTC, as it will overlap with our standup :)
[07:40] Better for me actually!
[07:43] great! invite sent
[07:44] i have upgraded to trusty on my cluster controller and also to maas 1.5.2 but now i stumbled on to this bug https://bugs.launchpad.net/maas/+bug/1307779
[07:44] Ubuntu bug 1307779 in MAAS "fallback from specific to generic subarch broken" [Critical,Fix released]
[07:45] it seems to be fixed in 1.6 but i can't find that package. any ideas?
[07:48] ramonskie: ppa:maas-maintainers/stable
[07:49] thanks
=== jfarschman is now known as MilesDenver
[08:47] okay so i upgraded and it finds an image now.
only it's now stuck on the screen where i see route-info
[09:21] rvba: hey, how about I remove that restriction where you need at least 16 bits of netmask on a managed network? That was for the old generated zone files.
[09:22] jtv: yep, we don't need that restriction anymore.
[09:22] \o/
[09:22] Easy karma.
[09:33] when enlisting nodes it hangs on the following screen https://dl.dropboxusercontent.com/u/50671970/enlist-hang.jpg
[09:34] ramonskie: that part looks OK to me in itself... how long did you watch them hang?
[09:34] Because if it failed there, I'd probably expect some error output.
[09:35] for about 10 minutes now
[09:36] If this is all it shows on the console, I'd give it a bit longer.
=== jfarschman is now known as MilesDenver
[09:37] jtv: thanks i will wait
=== CyberJacob is now known as CyberJacob|Away
[09:52] jtv: how long should i wait....
[09:52] ramonskie: any change on the screen?
[09:52] nope nothing
[09:53] Then I guess it's time to go trawling through the logs.
[09:53] which log files do i need to check..
[09:53] I'm looking along. It's on the maas server, in /var/log/maas.
[09:53] To be honest, since we're not seeing any error message, I don't know what to look for in this case.
[09:54] Oh, one thing that might also help: shift-PageUp on the node's console might show a bit more history.
[09:54] (Might as well keep the node doing whatever it's doing for now, in case it changes its mind)
[09:55] First thing to do is a quick scan for obvious errors:
[09:55] /var/log/maas/apache2/error.log, /var/log/maas/celery.log, /var/log/maas/maas.log
[09:56] If there's an error in there, chances are it'll jump right out at you.
[09:56] maas.log is empty, nothing
[09:56] That is really odd.
[09:57] Even if there are no requests, it's supposed to have periodic jobs in there. Unless... which version is this?
[09:57] (Roughly — e.g. "the one that came with 14.04")
[09:57] first i was on ubuntu 12.04 with maas 1.4 then upgraded to ubuntu 14.04 and maas 1.5 and now upgraded to 1.6
[09:58] OK, that's pretty recent. Good.
[09:58] I'm not sure if you'll have /var/log/maas/apache2 then; but it's just a symlink to /var/log/apache2.
[09:59] in celery i see some dhcp lease errors
[09:59] Oh?
[09:59] Can you paste one?
[09:59] ERROR/MainProcess] Task provisioningserver.tasks.upload_dhcp_leases[e19c4353-92a2-499e-8f95-154b3b017950] raised unexpected: IOError()
[09:59] also need the trace?
[10:00] That'd be nice, thanks. Maybe use paste.ubuntu.com.
[10:01] wait let me clean all the logs and restart the controller server and start all over
[10:02] in case it's unrelated stuff
[10:02] OK
[10:03] Thanks for the review rvba.
[10:05] jtv: np. I just put up for review https://code.launchpad.net/~rvb/maas/revert-2872/+merge/233191. A run in the CI shows that it fixes the problem introduced recently.
[10:05] * jtv looks
[10:06] jtv: these are the errors after a restart in celery.log http://pastebin.com/ayRxD5v0
[10:06] Thanks.
[10:06] rvba: done. :)
[10:06] Ta
[10:08] ramonskie: scant consolation but this is code that's already removed from the dev version. :)
[10:08] Now, what seems to be going wrong is that the cluster controller is having trouble talking to the region-controller API.
[10:09] it's on the same server
[10:09] i only have one
[10:09] Yeah, it should be easy, shouldn't it?
[10:09] :P
[10:09] You may want to check that your DEFAULT_MAAS_URL is configured sensibly.
[10:10] That's in several places. Just grep /etc/maas for it — but as root, or you won't be able to read some of those files.
[10:10] (There's credentials in some of them.)
[10:10] ./maas_local_settings.py:DEFAULT_MAAS_URL = "http://172.21.42.1/MAAS" ./maas_local_settings.py.dpkg-dist:DEFAULT_MAAS_URL = "http://maas.internal.example.com/"
[10:11] That looks sensible — assuming 172.21.42.1 is indeed your server's IP address, and the nodes will be able to reach it.
[10:12] You could have a look to see if that same request shows up in the Apache error log, in case it did get through to the server.
[10:12] (Or maybe even the Apache access log — but I doubt that)
[10:13] Oh! Just in case, you may want to search the Apache access log for "/MAAS/MAAS"
[10:14] nope nothing found
[10:15] only thing i see in the apache error log is this : No such file or directory: mod_wsgi (pid=2708): Unable to change working directory to '/home/maas'
[10:15] but should not be the problem
[10:15] No, shouldn't be.
[10:16] It's dumb, but maybe you could just try making a wget request to http://172.21.42.1/MAAS from the server itself, just to make sure that gets through?
[10:16] strange thing is i see now that i have 2 cluster controllers in clusters
[10:16] Two clusters? That's interesting.
[10:17] When it wakes up, the cluster registers itself with the region controller, and then just keeps polling for the region controller to say "sure, yeah, you're accepted."
[10:17] yeah i think upgrade
[10:17] The region controller should identify them by UUID.
[10:18] i think it's an upgrade quirk that happened in 1.5
[10:18] but i gave that another set of ips and dns zone
[10:18] New to me... I guess they have different UUIDs? I guess one is "master"?
[10:18] yes one is master and one is called maas
[10:18] in dns zone
[10:18] If they're both running on the same server, that spells trouble.
[10:19] Because they control a DHCP server, a DNS server, iSCSI, a TFTP server, and so on.
[10:19] ahh well that explains a lot
[10:20] I'm still not sure _how_ it would cause the failure in the log, but it's definitely closer to the source.
[10:20] the only problem is if i delete the newly created cluster it pops back up in pending state
[10:20] Yeah.
[10:20] and the other cluster still has a set of working nodes in it
[10:21] You could try stopping the cluster controller, deleting the new cluster, and then in the UI updating the old one to look like the new one.
[10:21] Mind you, they'll still have different UUIDs... so that may not be good enough.
[10:21] I think this'll require some database surgery.
[10:21] they have different uuids
[10:21] Yeah.
[10:22] So the upgrade generated a new one instead of reusing the old one.
[10:22] also the old cluster has only 6 boot-images and the new one 126
[10:22] Yeah, a lot has changed there.
[10:22] so can i move the nodes from the old cluster to the new one
[10:23] Let's start by getting a good view of the situation... if you grep /etc/maas for "UUID", do you get consistent UUIDs from the various config files?
[10:23] and then delete the old cluster
[10:23] The only way to move nodes is to delete them from one cluster and re-enlist them into the other.
[10:23] If that is not a problem, then I think it's the easiest way out.
[10:23] But it means that anything you've got running on those nodes is lost.
[10:24] maas_local_celeryconfig_cluster.py:CLUSTER_UUID = '3d245a63-2b23-42be-8977-f36cb2218b9e'
[10:24] and that's the new one
[10:24] yeah i can't delete those nodes, openstack is running on them with a lot of vms :(
[10:24] Blast.
[10:25] otherwise i would already have done a clean install :)
[10:25] Well, it's going to get tricky at any rate. First let me have a look for known bugs.
[10:26] okay
[10:26] Meanwhile, could you check that the cluster UUID in /etc/maas/maas_cluster.conf is consistent with the one you found in maas_local_celeryconfig_cluster.py?
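Both config checks in this session, DEFAULT_MAAS_URL earlier and the cluster UUID here, come down to a recursive grep over /etc/maas. A minimal shell sketch of that follows. The function names are made up for illustration and are not part of MAAS, and the UUID pattern assumes the lowercase hex form seen in the CLUSTER_UUID paste.

```shell
# Illustrative helpers (hypothetical names, not shipped with MAAS).

# Print every line under a config directory that mentions a setting.
# On a real server, run from a root shell: some files hold credentials.
find_maas_setting() {
    setting="$1"
    config_dir="${2:-/etc/maas}"
    # -r recurse, -n show line numbers, -s stay quiet about unreadable files
    grep -rns -- "$setting" "$config_dir"
}

# Print the distinct UUID values found on lines that mention "UUID".
# More than one line of output means the config files disagree.
list_cluster_uuids() {
    config_dir="${1:-/etc/maas}"
    grep -rhs -- "UUID" "$config_dir" \
        | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
        | sort -u
}
```

From a root shell, `find_maas_setting DEFAULT_MAAS_URL` lists every file that sets the URL, and `list_cluster_uuids` should print exactly one UUID when the files agree. Leftover files such as `*.dpkg-dist` are searched too, so an extra value may just come from one of those.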
[10:27] yes they are the same
[10:27] OK
[10:27] both the new accepted cluster
[10:28] can i disable the other cluster but still let dns work
[10:28] ramonskie: safest thing to try I guess would be to set them to the old cluster. But... first a look for known bugs.
[10:28] that should solve it
[10:28] Well, plus a restart. :) And then you'd have to delete the new one.
[10:29] okay but if i set it to the old cluster will the new boot images also be added?
[10:31] Should be, yes. Because AFAICT the two actually share everything except a process.
[10:33] It's the same files on disc, etc. It may take a few minutes for the remaining cluster controller to inform the region controller of what it has.
[10:33] you already checked known bugs?
[10:33] so should i try this?
[10:34] I checked known unfixed bugs. Let me make one more round for ones that may have been fixed later.
=== jfarschman is now known as MilesDenver
[10:38] ramonskie: I guess it's not bug 1344089, and that's the best candidate I found.
[10:38] bug 1344089 in MAAS 1.6 "IntegrityError after upgrading to 1.6beta5" [Critical,Fix released] https://launchpad.net/bugs/1344089
[10:38] (I realise you hit your problem with an earlier version)
[10:39] i'm on 1.6 now
[10:39] Anyway, assuming that's not it, we'll have to make the change. I'd stop the cluster controllers first.
[10:40] Then update the UUID entries in the config, to use your original cluster's UUID.
[10:41] I'd also set the cluster interfaces to Unmanaged, just so you can re-enable the right one later.
[10:42] what's the best and safest way to stop the cluster controller
[10:42] Then restart, accept the right cluster controller if needed (it may be automatic), enable the right cluster interface, and see if that fixes things.
[10:42] sudo service maas-cluster-celery stop
[10:42] sudo service maas-pserv stop
[10:42] Then I'd run a “ps -ef | grep maas” to check for lingering processes.
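The stop-and-check sequence above can be collected into two small shell functions. This is only a sketch: the service names are the ones quoted in the chat for this MAAS 1.6 install, the function names are invented for illustration, and the `[m]aas` pattern is a common grep idiom added here so that grep does not list its own process.

```shell
# Illustrative helpers (hypothetical names, not shipped with MAAS).

# Stop the two cluster-controller services named in the chat.
stop_cluster_controller() {
    sudo service maas-cluster-celery stop
    sudo service maas-pserv stop
}

# List processes whose command line mentions "maas".
# The bracket in '[m]aas' matches the text "maas" but not grep's own
# command line, which contains "[m]aas" instead.
list_maas_processes() {
    ps -ef | grep '[m]aas'
}
```

After `stop_cluster_controller`, anything still printed by `list_maas_processes` is either a lingering cluster process or, as turns out below, part of the region controller, which is expected to keep running.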
[10:44] wow there is still a lot running
[10:44] Yeah it's not a small thing.
[10:45] several of these: /usr/bin/python /usr/bin/celeryd --logfile=/var/log/maas/celery-region.log --schedule=/var/lib/maas/celerybeat-region-schedule --loglevel=INFO --beat --queues=celery,master
[10:45] should i kill them?
[10:45] No, those are the region controller's celery.
[10:45] and there should be 10 of them?
[10:46] Probably not.
[10:46] lol
[10:46] But I don't know what might cause there to be more... I do hope you don't have two region controllers as well!
[10:48] i certainly hope not
[10:48] the only thing i did was what i thought was a simple upgrade
[10:49] Yeah. This clearly shouldn't have happened.
[10:49] okay edited both files maas_local_celeryconfig_cluster.py and maas_cluster.conf
[10:50] with the old uuid
[10:50] OK.
[10:50] And you've set the cluster interfaces to Unmanaged in the UI?
[10:50] the new cluster?
[10:51] done for the new cluster
[10:51] OK. I'd do the old one as well.
[10:52] (The only drawback is your DHCP server will be down briefly — let's keep it short)
[10:52] no dhcp entries will be deleted?
[10:53] Not as such, though there may be more confusion that will only become clear later.
[10:53] backed it up just in case
[10:54] Good.
[10:55] And then we get to restart. A reboot would be the most comprehensive.
[10:55] reboot it is
[10:57] okay rebooted
[10:57] i removed the new cluster now
[10:59] do i need to set managed dhcp on again?
[11:00] what are the best next steps to take?
[11:00] First: is the old cluster now Accepted?
[11:01] If it is, then yes, re-enable DHCP management (and DNS management I guess — you mentioned using that)
[11:01] yes the old cluster is accepted and the boot-images are now also 126 instead of 6
[11:02] Excellent!
[11:02] Want to try that node again?
[11:02] yup
[11:03] let me first check if everything is okay
[11:03] and that the ip address hasn't changed :P
[11:03] Yeah. Anything you can check is a plus at this point. :)
[11:03] If you feel up to it, maybe a fresh look at those logs in /var/log/maas.
[11:05] okay check seems okay no error for now
[11:05] will try a node now
[11:06] * jtv bates breath
[11:08] whooopppdidoooh
[11:08] it works
[11:08] muchos kudos to you!!!
[11:09] Phew.
[11:09] thanks for helping mate
[11:09] Glad I could help — and glad it didn't come crashing down on us. :-)
[11:10] yes i'm really glad i don't need to start over. this saved so much work
[11:10] and finally the auto discover of ipmi is working :D
[11:10] I'll have to go now, but I would really appreciate if you could file a bug about this — especially the part where you upgraded and got two cluster controllers. That might still be in the packaging somewhere.
[11:12] okay will do thanks for all the help
[11:12] where do you want me to fill in the bug?
[11:13] https://bugs.launchpad.net/maas
[11:13] (You have a Launchpad account, right?)
[11:14] yup
[11:14] okay creating one now
[11:14] Thanks. If we can prevent this from happening to someone else, that's wonderful.
[11:14] * jtv runs now
[11:14] Good night!
[11:22] i'm out to bye
[11:22] bug created https://bugs.launchpad.net/maas/+bug/1364903
[11:22] Ubuntu bug 1364903 in MAAS "2 cluster controllers after upgrade from 1.4 > 1.5" [Undecided,New]
[11:31] blake_r: Hi Blake. I had to revert 2872. See https://code.launchpad.net/~rvb/maas/revert-2872/+merge/233191 for details.
=== jfarschman is now known as MilesDenver
[12:10] blake_r: Now I'm thinking that revision 2871 also introduced a problem (but a different one): one CI run failed with the nodes failing to get the images they need to boot. This looks like the problem you diagnosed yesterday and said you were working on.
[12:11] blake_r: I'd say this is a race condition as the CI test passed a couple of times.
[12:12] blake_r: since you'll be up in less than an hour I'll refrain from reverting this one again. Let's talk when you come online.
=== jfarschman is now known as MilesDenver
[13:05] rvba: yes it is possible it will pass
[13:05] rvba: the issue is that RPC is used for the API call but not for the image selection when a node is booting
[13:05] rvba: so pxeconfig will fail
[13:06] rvba: I have a branch that fixes pxeconfig
[13:07] blake_r: 2871 causes the images not to be present from time to time. 2872 (which I reverted) was causing the node to fail to enlist (see my paste on the revert MP).
[13:08] noeds*
[13:08] nodes*
[13:08] arg
[13:15] rvba: it's not that the images are not present, it's that the images are not present in the BootImage model, which is going away
[13:16] blake_r: right, what I meant is that, as far as the node can see (and this involves the BootImage model), the images are not there.
[13:17] rvba: yes correct, I was going to get that branch ready and land today, looks like I will have to do all of that again, :(
[13:17] blake_r: I just reverted 2872 (which was causing failures all the time), not 2871.
[13:18] rvba: okay
[13:18] rvba: will look at the mp in a moment, getting through email this morning
[13:18] blake_r: I'm sorry but I was in the middle of a QA and having trunk broken like that means a lot of time wasted for me.
[13:21] rvba: oh I see the reason
[13:21] rvba: yeah it's reporting the available architectures wrong, I will work on a fix
[13:21] blake_r: cool, ta.
[13:24] blake_r: if you can fix the breakage introduced by 2871 first that would be great. Because 2871 is still checked in.
[13:24] rvba: okay
[13:24] Thanks.
=== jfarschman is now known as MilesDenver
[14:25] I am getting a 500 error when I go to acquire a commissioned node with latest trunk. Here is the stacktrace: http://paste.ubuntu.com/8223903/
[14:26] Anyone seen this before?
[14:27] newell: let me have a look at this stacktrace…
[14:30] newell: looks like a bug in gen_dynamic_ip_addresses_with_host_maps: it should skip the ngi with no static_ip_range_low/hig
[14:30] newell: can you file a critical bug about this?
[14:30] yeah
[14:35] https://bugs.launchpad.net/maas/+bug/1364993
[14:35] Ubuntu bug 1364993 in MAAS "gen_dynamic_ip_addresses_with_host_maps: it should skip the ngi with no static_ip_range_low/hig" [Critical,New]
[14:37] newell: I changed the title for this bug. We try to explain what the problem is in the title/descriptions. Suggestions on the possible cause or ideas on how to fix it should be put in the comments.
[14:38] newell: this helps triaging and lets people come up with alternative solutions.
[14:39] ha I just changed the title as well before I read this
[14:39] wonder where it stands now
[14:39] https://bugs.launchpad.net/maas/+bug/1364993
[14:39] Ubuntu bug 1364993 in MAAS "500 error when trying to acquire a commissioned node" [Critical,New]
[14:39] Is that better?
[14:40] Yep, it describes the problem.
[14:40] k, we were both in the middle of changing it and my page wasn't refreshed, that is why I didn't see that you had modified it
[14:41] I figured :)
=== roadmr is now known as roadmr_afk
=== roadmr_afk is now known as roadmr
=== magicrob1tmonkey is now known as magicrobotmonkey
=== ming is now known as Guest50512
[16:24] rvba, still around?
[16:26] newell: yep
[16:26] rvba, do we need a test for that bug or is it trivial enough to just push the change you mentioned?
[16:28] newell: as with any non-trivial change, it's worth a test
[16:30] newell: now I'm not so sure my solution is the right one as test__treats_undefined_static_range_as_zero_size_network seems to test the case where ngi has no static range.
[16:31] yeah I was looking at that too
[16:31] your change doesn't break any tests though
[16:36] newell: there is a bug in the test :)
[16:37] ha
[16:37] newell: can you spot it?
[16:37] let me take a look
[16:43] newell: My fix is up for review. https://code.launchpad.net/~rvb/maas/bug-1364993/+merge/233244. And I need to step out now. ttyl.
[16:43] sounds good I will review it, sorry wife was asking me questions and got pulled away
[16:44] ha, just needed to save it
=== roadmr is now known as roadmr_afk
[17:09] Hi all -- I'm starting a server, but I don't see any log message for the power on attempt in celery log (just periodic dhcp refreshes, etc). What is up?
[17:09] (this install had been working fine)
[17:11] roaksoax: ^ any ideas?
[17:15] dpb2: check whether maas-pserv is running
[17:15] dpb2: are you importing images?
[17:16] roaksoax: all maas services are reported as running
[17:16] roaksoax: let me check on the images
[17:16] roaksoax: actually, I'm not sure how to check that. :)
[17:17] dpb2: what MAAS version are you using?
[17:17] 1.6.1+bzr2550-0ubuntu1~ppa2
[17:18] dpb2: uhmmm
[17:18] blake_r: ^^ any thoughts?
[17:18] roaksoax: I *could* restart all the services, but I didn't want to mask an issue
[17:19] dpb2: do please restart the issue. 1.7 will completely change in that area
[17:19] dpb2: because of celery being silly and causing issues like this
[17:19] dpb2: (we are getting rid of celery)
[17:20] dpb2: can you see logs?
[17:20] dpb2: maas.log celery.log
[17:20] hm
[17:20] roaksoax: I'm seeing the logs now
[17:20] (just now)
[17:20] roaksoax: so...
[17:20] roaksoax: if boot images are importing, does that block power up attempts
[17:20] ?
[17:21] dpb2: yes it can... celery blocks any other jobs if a bigger job is in progress
[17:21] yikes
[17:21] ok
[17:22] well, I think it's working now. the old "try it again" fixed it
[17:22] thanks
[17:22] dpb2: np!
[17:39] hows it going guys
[17:39] any news on maas with arm devices?
[17:50] Valduare, there is currently some arm support (i.e. arm64/armhf etc.)
[17:51] I have a few mk808 devices here that would be fun to be able to spin them up etc
=== roadmr_afk is now known as roadmr
=== CyberJacob|Away is now known as CyberJacob