[00:01] roaksoax: we are going to keep working from the hotel now
[00:01] catch you later
[00:02] alright!
=== matsubara-afk is now known as matsubara
[12:53] hi =)
[12:54] how do I provision a new node for maas? can I create a VM with libvirt and "enlist" it as a node?
=== kentb-out is now known as kentb
[15:47] hello, is it expected that today's build shows an 'It Works' message instead of the maas interface? thanks
[15:51] rvba: howdy!!
[15:51] njin: maybe you are missing the /MAAS ?
[15:51] roaksoax: Hi
[15:51] rvba: so I tested the django thing... seems to be working just fine
[15:51] rvba: have you been able to test it too?
[15:52] roaksoax: yeah, all fine. Tested on precise and quantal.
[15:56] roaksoax, with /MAAS it returns error 200
[16:00] rvba: cool, then i'll move that to /stable
[16:00] njin: check your apache2 logs? and maas logs?
=== andreas__ is now known as ahasenack
[16:19] roaksoax, thanks, the apache2 log is full of errors: /usr/share/maas/start_up... Lock timeout, missing mod_wsgi, and others
=== matsubara is now known as matsubara-lunch
[16:49] rvba: thanks for looking at lp 1131418
[16:49] Launchpad bug 1131418 in MAAS "Nodes don't go to ready, after commissioning they get a 500 error when reporting back to maas" [Critical,Triaged] https://launchpad.net/bugs/1131418
[16:50] we reinstalled everything last night as we were onsite and had to finish a delivery
[16:51] racedo: np. As I said on the bug, I suspect the tag definition is buggy.
[16:51] we don't seem to be hitting this after reproducing exactly the same environment two times
[16:51] That's very weird.
[16:52] yes, it all started when using --constraint maas-name=name-of-the-host wasn't recognising the host, and the juju log in zookeeper said "no such name" or "no matching" or something similar
[16:52] with a name-of-the-host that existed
[16:52] if we see anything today we'll update the lp
[16:53] Ok, cool.
[17:04] roaksoax: melmoth: have you seen in your engagements maas nodes rebooting after being deployed with a service, or even after being commissioned, and going to a grub rescue prompt?
[17:06] racedo, not really, but i never understood the power management thingy with maas
[17:06] i guess when you deploy a service, it picks a ready node and boots it up automatically
[17:06] i never experienced it, but i guess this is what should happen, right?
[17:07] racedo: never
[17:07] roaksoax: melmoth: it's exactly what Destreyf explains here at 19:20: http://irclogs.ubuntu.com/2012/05/15/%23juju.txt
[17:08] i won't have time to read today.
[17:09] we reported the pattern in lp 1131737
[17:09] Launchpad bug 1131737 in MAAS "Nodes stay at grub rescue prompt after being redeployed with juju" [Undecided,New] https://launchpad.net/bugs/1131737
[17:11] racedo: can you pastebin the celery logs?
[17:11] /var/log/maas/celery.log and region-celery.log
[17:12] racedo: what it looks like to me is that when you tell it to deploy again, it is not actually telling it to PXE boot
[17:12] roaksoax: it picks a pxe image and the cfg says to boot from the disk apparently
[17:12] that's what we understand from the boot screen
[17:13] the funny thing is that it doesn't happen always
[17:13] * racedo working on getting the logs
[17:13] racedo: might it be that when you juju destroy and then juju deploy to that machine, it is not being set to use the PXE image to deploy?
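
A minimal sketch of gathering the logs roaksoax asks for above, assuming the default paths named in the channel (/var/log/maas/celery.log and region-celery.log) plus Apache's standard error log; adjust paths if your install differs:

```sh
# Collect the tails of the MAAS celery logs and the Apache error log
# so they can be pastebinned for debugging.
tail -n 200 /var/log/maas/celery.log
tail -n 200 /var/log/maas/region-celery.log
tail -n 200 /var/log/apache2/error.log
```
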
[17:14] roaksoax: https://pastebin.canonical.com/85444/ celery.log
[17:15] rvba: ^^
[17:17] roaksoax: well, but the last pattern was: delete zookeeper from maas, reboot, enlist, accept and commission, node reboots fine, juju bootstrap to node, bootstraps fine, reboots after it finishes, pxe boots, gets a pxe config that tells it to boot from disk, goes to grub rescue
[17:17] racedo: ahh I see
[17:18] racedo: so that's a grub issue
[17:18] it seems
[17:18] roaksoax: it could well be
[17:18] racedo: what's the disk it should be booting from?
[17:18] the funny thing is that it's kind of random
[17:19] racedo: do you happen to know what it is being told by PXE when it tells it to PXE boot?
[17:19] say LOCALBOOT 0
[17:19] or KERNEL chain.c32
[17:20] racedo: ok, you are gonna have to do something to test
[17:20] roaksoax: we have 4 disks, 3 in raid 5 and one as a spare disk, hw raid
[17:20] racedo: right, so maybe that's the problem
[17:20] but since this is HW raid
[17:20] it should
[17:20] not affect
[17:20] they are presented as 1 disk to the os
[17:21] ok
[17:21] racedo: ok, so go to /usr/share/pyshared/provisioningserver/pxe
[17:21] roaksoax: ok, we will test that if we hit it this time (rebuilding)
[17:21] racedo: ok, but go there
[17:21] yep
=== matsubara-lunch is now known as matsubara
[17:21] racedo: you will find 2 files of interest
[17:22] config.local.template
[17:22] config.local.x86.template
[17:22] you need to figure out which one it is using
[17:22] ok
[17:22] oh i see
[17:22] i'm guessing it is using config.local.x86.template since that should be used for most of the hw
[17:23] ok, once i figure out which one, what do i do
[17:23] racedo: something like this: http://paste.ubuntu.com/5555710/
[17:23] racedo: so LOCALBOOT -1 if that is the one being used
[17:23] racedo: or APPEND hd0, or APPEND hd0,1
[17:23] perfect
[17:24] I'm documenting this now
[17:24] thanks roaksoax
[17:24] racedo: the latter, APPEND hd0, is basically telling it to boot from hd0
[17:24] or boot from hd0,1
[17:24] that *might* be the issue
[17:24] got you
[17:24] racedo: if this doesn't solve it, maybe grub messed things up
[17:24] racedo: but someone who might help with grub-related stuff is cjwatson
[17:26] rvba: this is a weird thing: https://pastebin.canonical.com/85444/ (omshell issues)
[17:27] i've seen it over and over ^^
[18:02] racedo: did it work? the boot thing?
[19:06] roaksoax: we haven't hit it again, it's 10 nodes so far and nothing
[19:07] roaksoax: it's very random... but i fully agree with you, it looks like a grub issue with their hw raid setup
[19:36] roaksoax: we are hitting the issue right now
[19:36] we are at the grub rescue prompt
[19:38] we don't really know how to deal with the grub rescue prompt, but negronjl is changing what you suggested in maas: http://paste.ubuntu.com/5555710/
[19:39] roaksoax, where (in the maas server) are the files that need modifying... I want to try changing that
[19:41] roaksoax: we got it, /usr/share/pyshared/provisioningserver/pxe
[19:43] yeah
[19:43] sorry
[19:43] racedo: so do that and just reboot the machine manually and tell it to PXE boot
[19:44] roaksoax: ok, we are on it now, just modified the grub templates and we are going to pxe boot a node that fails
[19:45] cool
[19:45] racedo: try to see what the output of the PXE boot says
[19:45] we will see it
[19:47] roaksoax: when you say to tell it to pxe boot
[19:47] racedo: ipmi
[19:47] do you mean pxe boot it by ipmi, not re-enlist it and start again? will pxeboot apply the new grub commands just by rebooting?
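
The paste at http://paste.ubuntu.com/5555710/ is no longer readable, so this is only a hedged reconstruction of the local-boot stanza being discussed, built from the two options named in the channel (LOCALBOOT -1, or chain.c32 with an explicit APPEND hd0); the real config.local*.template files also carry template substitution variables not shown here:

```
# Sketch of a pxelinux local-boot stanza (not the actual paste contents).
DEFAULT local
LABEL local
    SAY Booting from local disk...
    # Option discussed above: report failure to the PXE stack so the
    # firmware falls through to the next boot device (the local disk):
    LOCALBOOT -1
    # Alternative discussed above: chainload explicitly from the first disk:
    #   KERNEL chain.c32
    #   APPEND hd0
```

And to force a node to PXE boot again over IPMI, as roaksoax suggests, something along these lines would do (BMC address and credentials are placeholders):

```sh
# Force PXE on the next boot, then power-cycle the node
# (hypothetical BMC address/credentials; requires ipmitool).
ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret chassis bootdev pxe
ipmitool -I lanplus -H 10.0.0.50 -U admin -P secret power cycle
```
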
[19:48] racedo: yes
[19:48] racedo: so you need to do what I told you yesterday
[19:48] yes
[19:48] we got that scripted :)
[19:48] cool
[19:51] roaksoax: does this mean that every time a node pxe boots, grub gets installed through these templates then?
[19:52] racedo: nope, this means that when the node PXE boots, it is telling the node "boot from your local disk, rather than tftp"
[19:53] racedo: so chain.c32 should have automatically determined where to boot from, but since it didn't, we are telling it to specifically look to hd0 to boot from
[19:54] oh, yes, but that's happening i think, and then it goes to grub rescue when trying to boot from the disk
[19:54] we are trying to grab a screenshot
[19:57] racedo: ok cool
[19:57] racedo: or you can do a video
[19:58] racedo: with recordmydesktop
=== matsubara is now known as matsubara-afk
[19:58] only our customer has the access, from a windows desktop :(
[19:59] racedo: bummer
[19:59] yep
[19:59] yeah, just try to get screenshots
[20:00] racedo: did you check with cjwatson?
[20:01] not yet; we have a plan b if this fails. we have like 2 hours to get this working, and the plan is to work with the customer remotely on this if needed
[20:01] but today we have a plan b which is to just not use these boxes
[20:02] ack
[20:07] roaksoax: we couldn't catch the screen but i saw: SAY Booting under MAAS direction... which seems to be in config.install.template
[20:08] but as we deleted it and started again, it might be part of this randomness...
[20:11] racedo: yeah, that's installation
[20:11] or commissioning
[21:26] racedo: how are things going?
[21:55] roaksoax: we deployed everything with what we had up and running, and we might try to debug and support the grub issue with the customer next week if the customer has the time
[21:56] roaksoax: thanks again man, it's been an interesting week ;)
[22:18] roaksoax: btw, if you are interested, i captured the tcpdump of yesterday's internal server error and today extracted the xml file that the node posts to maas with wireshark: https://launchpadlibrarian.net/132080481/maas_xml_post
[22:19] this is lp 1131418
[22:19] Launchpad bug 1131418 in MAAS "Nodes don't go to ready, after commissioning they get a 500 error when reporting back to maas" [Critical,Triaged] https://launchpad.net/bugs/1131418
=== kentb is now known as kentb-afk
[23:06] I have found a bug on the MAAS site. Should I report it on the Ubuntu issue tracker? I ask because it is not actually a bug with an Ubuntu distro.
[23:07] Basically, the search feature for the online documentation is broken. It simply hangs when you do a search.
[23:07] Search works on the rest of the MAAS site.
[23:08] But searches on other parts of the MAAS site don't seem to report matches in the MAAS docs.
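
For anyone wanting to reproduce the capture racedo describes above for lp 1131418, a minimal sketch, assuming the region controller serves HTTP on port 80 and the capture interface is eth0; the resulting pcap can then be opened in Wireshark to extract the node's XML POST:

```sh
# Capture traffic to/from the MAAS region server with full packets
# (-s 0) so the commissioning POST and the 500 response survive intact.
tcpdump -i eth0 -s 0 -w maas-500.pcap port 80
```
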