=== frankban|afk is now known as frankban
[09:21] /me waves hello
[09:31] * michael____ waves hello
[09:36] hi everybody!
[09:36] how to check the command sent from MAAS to PXE boot a certain machine?
[09:37] is it logged anywhere?
[14:35] Bug #1732920 opened: Editing node interfaces require multiple attempts
[14:42] roaksoax: Looking for a quick clarification. We are seeing hosts hang right after initrd and kernel are downloaded via tftp. Once the download completes is there a last communication to/from the client to the rack or region controller?
=== jac_ is now known as jac_cplane
[14:46] vogelc_: the machine PXE boots, downloads initrd/kernel, loads into the ephemeral environment and does things
[14:46] if the IP given as a kernel parameter is unreachable, the machine will never communicate with the MAAS region
[14:46] vogelc_: that said, check your kernel params and verify that the IP being passed is the correct one
[14:47] ips are correct. I am wondering if bootp traffic is getting blocked somewhare
[14:47] where
[14:48] vogelc_: could be, if it is hanging while loading kernel/initrd
[14:48] vogelc_: are you connected via the machine's VNC?
[14:49] I have an OOB console to the machine
[14:50] vogelc_: if you are using serial console, try enabling serial for the machine via a kernel param (e.g.
console=ttyS0,9600n8 or something of that sort)
[14:50] vogelc_: also, regiond.log will tell you if the machine contacted the region
[14:51] vogelc_: 2017-11-13 18:21:38 regiond: [info] 10.90.90.192 POST /MAAS/metadata/status/yfhwqe HTTP/1.1 --> 204 NO_CONTENT (referrer: -; agent: python-requests/2.9.1)
[14:51] it should show something like that
[15:05] Bug #1732927 opened: Erroneous IP Already in use when setting interface to unconfigured
[16:14] Bug #1732942 opened: [2.3, UI] Unable to set interface back to 'Disconnected'
=== frankban is now known as frankban|afk
[17:09] Bug #1732948 opened: [2.3, HWTv2] Machine was marked Failed commissioning due to cloud-init failures, but tests weren't aborted and shown as 'pending
[17:48] roadsoax: https://bugs.launchpad.net/maas/+bug/1717031. Do you want the curtin from UEFI or BIOS with my custom partition so it is successful?
[17:49] roaksoax: ^^
[18:07] jamesbenson: i think what we need is:
[18:07] 1. set UEFI in the BIOS
[18:07] 2. commission the machine
[18:08] 3. What MAAS automatically did for commissioning
[18:08] 4. your custom partitioning
[18:08] so we would need output for both 3 and 4
[18:09] jamesbenson: if you can confirm whether maas deploys after maas commissioning the machine, that'd be good too
[18:38] roaksoax: 3? what details are you looking for explicitly? Not sure what you mean by automatically. I just set a machine to UEFI and commissioning now.
[18:40] jamesbenson: ok, so maas will set a default storage layout
[18:40] deploy that machine with athta storage layout
[18:40] and gather the curtin config
[18:42] with my storage layout?
[18:44] jamesbenson: without
[18:45] jamesbenson: sorry, maybe i'm not being clear. Commission the machine and deploy it with the storage layout MAAS automatically selected.
(gather the curtin config)
[18:45] jamesbenson: and then, apply your custom layout, and gather your curtin config
[18:46] it would be good to get the installation logs for each case as well
[18:48] no just: "with athta storage" was a bit confusing...
[18:50] jamesbenson: sorry :) friday afternoon, no lunch and ready to call it a week
[18:51] so my plan is to commission, don't change anything, deploy and get the curtin, (assuming it fails) release, deploy with my storage options, gather the curtin.
[18:54] ++
[18:54] the curtin is only available after the deployment is done?
[18:54] cool
[18:54] deploying now
[18:55] jamesbenson: while deploying
[18:56] when should I issue that command then?
[19:02] jamesbenson: when the machine is 'deploying' you can issue the command
[19:07] hmm... the command is: maas administrator machines get-curtin-config ksran6
[19:07] log into maas and issue that command?
[19:07] got me an error
[19:11] jamesbenson: maas 'user' machine get-curtin-config
[19:11] s/machine/machines/
[19:11] try that ?
[19:12] yeah that did it
[19:34] roaksoax: still having my problem. Cannot get any additional data from the console on this hang.
[19:35] roaksoax: I can't tell if it's waiting on something local, on something network, or what's going on. suggestions on how to further drill down to what component after initrd is responsible for the hang?
[20:06] Bug #1732980 opened: MAAS incorrectly PXE boot UEFI/legacy boot
[20:06] Bug #1732983 opened: [2.3] Strange behavior when removing a secondary controller
[20:07] roaksoax: what's the partitioning mount points for uefi?
[20:07] '/boot/efi'?
[20:11] found it: https://help.ubuntu.com/community/UEFI
[20:11] https://help.ubuntu.com/community/UEFI#Creating_an_EFI_System_Partition
[20:15] Bug #1732983 changed: [2.3] Strange behavior when removing a secondary controller
[20:21] Bug #1732983 opened: [2.3] Strange behavior when removing a secondary controller
[20:24] roaksoax: also, is there any way to get MAAS to use http instead of TFTP for transferring files? we are going across multiple hops and through virtual interfaces on virtual machines. i think it's causing the choking.
[20:30] Bug #1732983 changed: [2.3] Strange behavior when removing a secondary controller
[20:33] roaksoax: can you give me parted & lsblk info for uefi partitioning?
[20:34] looking for mountpoints, flags, etc.
[20:42] ^^ anyone?
[20:56] jamesbenson: haven't tried to use UEFI yet :(
[21:06] thanks xygnal...
[21:06] uefi doesn't work ....
[21:12] according to roaksoax it does. i see commits in code from earlier versions showing that it does.
[21:13] but there may be specific details, possibly with hardware model/bios settings, that need to be set up to make that work with maas
[21:21] xygnal, yes, we are trying to debug that.
[21:23] jamesbenson: look forward to your success :)
[21:23] lol, me too!
[21:23] wish that was my problem. harder to solve one here. pxe boots are hanging after initrd, and cannot find WHY.
[21:23] not every time. intermittently.
[21:23] as if they are stalling
[21:24] the big issue for us when we started was networking stuff... ours used to stall too, mostly due to how our interfaces were set up. eth0 pxe, eth1 public... make sure IPs were assigned in both.
[21:24] pxe had/has no outside access only eth1
[21:25] we have a workaround for our storage issue, but more manual than needed.....
[21:25] s/needed/'should be'/
[21:25] ah we share pxe and prod right now on single interface
[21:26] oh... we can do that, but never had luck.. try separating.
[21:26] explain?
[21:26] what problems?
[21:26] we set up a dedicated pxe/internal subnet and a dedicated public nic.
[21:27] 2 switches...
[21:27] 2 nics
[21:27] so pxe goes to only internal and 1 switch; internet nic goes to different nic/different switch
[21:27] dedicated traffic
[21:31] too costly for how big we build
[21:31] roger.
[21:32] how big is your rig?
[21:32] we buy refurb from servermonkey.com
[21:36] Dells. We are a big shop.
[21:36] though we've built our deployment system around being able to use whatever hardware we want
[21:37] xygnal: does the problem host have network console access, or alternatively, configure the kernel's netconsole so you can monitor/interact
[21:38] we're watching the KVM console on the DRAC. We're not missing anything.
[21:39] tried CONSOLE=tty0 with verbose, debug, debug=vc, --verbose options.. no difference
[21:39] no data
[21:39] loads kernel
[21:39] loads initrd
[21:39] both "...ok"
[21:39] then stalled
[21:39] dead in the water
[21:39] xygnal: you don't see any kernel messages at all?
[21:39] none. zero messages.
[21:39] yeah, that's what we do... our maas server is a VM on our management servers that tie into all of our racks on the internal switch.
[21:39] with those global kernel options showing up in my KVM console window.
[21:40] know they are being applied
[21:40] which dells? r610/r710?
[21:40] 6/7/830s
[21:40] okay, we've got the ones I mentioned...
[21:41] r410s, r910s too
[21:41] yeah we have a good mix
[21:41] xygnal: this is a BIOS boot?
[21:41] yes, not using UEFI at this time.
[21:41] and at the time when this happens all we see is DHCP and TFTP traffic. i don't think it gets to iscsi yet?
[21:42] xygnal: silly question but... have you tried dropping the initrd and only booting the kernel?
[21:42] i don't know what is next in the order after initrd loads
[21:42] no I have not.
[21:42] I will mention that if we repeatedly power cycle the box, it sometimes gets past this hang
[21:42] and this happens across pretty much everything, intermittently
[21:43] xygnal: in case there's an issue with the load of the initrd... the boot-loader normally puts it in memory immediately after the kernel image, then hands over to the kernel's entry point
[21:43] that would be a real pain to troubleshoot with how hard it is to single this issue out.
[21:43] can take quite a few tries
[21:43] xygnal: if you don't get the kernel to even start up it's a pretty good assumption the network is losing packets and the transfer isn't completing, or is being corrupted
[21:44] TJ-: suspecting it but cannot prove it yet, a bit hard to prove it with UDP in the first place. Also seen a lot of bug reports about syslinux/pxelinux versions handling packet loss differently.
[21:44] xygnal: I'd mirror the port the host is attached to on my switch, capture the TFTP stream, reconstruct the file and check its hash to ensure what you *think* is being sent to the host, actually arrives there :)
[21:45] we've had network engineers go over the equipment and insist no problems
[21:45] no errors
[21:45] i've been asking them for that :)
[21:45] it's new infrastructure so they are not yet organized enough to find it and set up my span
[21:45] how do you suggest the compare?
[21:50] xygnal: if you see nothing from the kernel, make the working assumption it isn't arriving in memory correctly. Therefore, try reducing the size of the boot image - I'd switch to loading something like GRUB or an ipxe image. something as small as possible to prove that *something* can boot
[21:51] that is sound advice
[21:51] xygnal: if you can get ipxe to start, you could use it to chainload the linux kernel - and get some logging out of ipxe about the transfer.
you never know, there might be a subtle bug in the PXE on the hardware
[21:51] give it something less likely to be given a chance to be corrupted
[21:51] yes, exactly
[21:52] even if it's a basic "hello world" static ELF binary you compile yourself!
[21:52] after all, the linux kernel is just an ELF binary executable
[21:55] how do i access the TFTP files used for these commissions and deploys, manually, if i want to test out TFTP client downloads
[21:55] to match checkcims that way
[21:55] checksums
[21:55] it looked like twisted3 is hosting it directly
[21:56] fastest test i could perform is to boot a rescue cd, perform 50 TFTP downloads of the same file, checksum each one
[21:56] if any fail, despite that
[21:56] dispute* ;)
[22:00] you mean boot the rescue CD on the bare hardware then let it test TFTP ?
[22:00] I realize multi-hop traversal of TFTP is bad in the first place, but we are too big to do a rack controller in every single local subnet.
[22:00] yes
[22:00] I do
[22:00] because although that'll test with the kernel, it *won't* test any bad behaviour by the system's PXE BIOS services
[22:01] correct that would just eliminate of we are getting bad checksums on our TFTP transfer
[22:01] if
[22:01] well no, because corruption, if it occurs, could be due to the PXE BIOS itself, not the network. stupid things like latency can sometimes induce weird side-effects
[22:01] if it's being corrupted by the system's PXE BIOS services, we have a MUCH more annoying problem
[22:02] I dos but that would allow me to release the network infrastructure from blame
[22:02] good checksum end to end? well if it's bad, it's not the network.
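The repeated-download test proposed at [21:56] could be scored with something like the sketch below. It assumes the downloads have already been saved to local files by whatever TFTP client is at hand, and that a known-good reference copy of the same file is available; the function names and the file names in the usage note are hypothetical, not part of MAAS.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference, copies):
    """Return the paths in `copies` whose digest differs from `reference`.

    `reference` is a known-good copy of the file (an assumption: for
    example one taken directly on the server side); `copies` are the
    files fetched over TFTP from the host under test.
    """
    expected = sha256_of(reference)
    return [str(copy) for copy in copies if sha256_of(copy) != expected]
```

For example, `find_mismatches("initrd.reference", ["initrd.01", "initrd.02"])` returns an empty list when every download matches the reference. As noted at [22:00], a clean result here only covers the network path and the booted kernel's TFTP client, not the machine's PXE firmware.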
[22:02] if you're able to do a bare-metal rescue disc test, couldn't you also through FOG or similar onto a laptop to make it a PXE server, plug a cable from the host to the laptop, and do a direct PXE boot test :)
[22:03] yes i could but remember - this doesn't happen every single time
[22:03] s/through/throw/
[22:03] it's an intermittent problem. often enough to really piss off customers, not often enough to catch every single try.
[22:03] right... but you could do an additional 50 tests of host<>laptop PXE boot on top of your 50 TFTP file transfers.
[22:04] it sounds like packet-drop. does the network have monitoring in place for that?
[22:05] yes but they claim no problems.
[22:05] that is why i am trying to get the network performance team to give me a span off the nearest switch port
[22:05] :) well there's also the iPXE bootable ISO/DVD you could use to test the PXE side of course
[22:06] the real concern for me is
[22:06] if it IS corruption on the network level
[22:06] how are we going to solve THAT? this is far too fragile of a transfer to lose packets.
[22:08] xygnal, shielded cables? :-p
[22:08] if it's TFTP related, I'd first check if the TFTP server and client are negotiating to transfer more than 512-byte sized blocks ... if the packet size increases too much there could be issues with MTU.
[22:09] I can snoop that out from the rack controller i bet with a tcpdump
[22:09] I'd lock the TFTP server to 512-byte blocks if that were suspected, for testing.
[22:09] how do i lock it?
this is not local in.tftpd this is twisted3 process
[22:10] From the TFTP protocol:
[22:10] If the defined blocksize produces an IP packet size that exceeds the minimum MTU at any point of the network path, IP fragmentation and reassembly will occur not only adding more overhead but also leading to total transfer failure when the minimalist IP stack implementation in a host's BOOTP or PXE ROM does not (or fails to properly) implement IP fragmentation and reassembly
[22:12] good catch
[22:12] I've dealt with it before; many many years ago!
[22:12] these days it's relatively rare
[22:13] soon as the team exposed that they plan to run a rack controller *per datacenter*, meaning a whole lot of forwarding from other subnets, I knew we might run into this stuff
[22:15] thanks much
[23:04] roaksoax: any plans to move to iPXE so we use http instead of TFTP?
[23:07] xygnal: not in the short term, although we have a PR adding support for iPXE. I don't know what's the extent of it or whether it is used for everything
[23:08] https://code.launchpad.net/~wpk/maas/+git/maas/+merge/332552
[23:16] ty sir
[23:17] seems it only adds it for kvm now, but seems trivial to use it across the board for x86 systems
[23:19] passing initrd over HTTP sounds a lot more reliable than using TFTP across multiple hops
[23:30] Bug #1733015 opened: [2.3, HWTv2] Runtime resets back to 0 after 24 hours
[23:30] may be, although we don't typically we recommend putting 1 region per DC, instead of splitting racks :)
[23:32] we are 1 region, with 1 rack (pair) per datacenter
[23:34] though our datacenters are all located in the same US State
[23:35] minimal latency
[23:38] xygnal: so region/rack are routed ?
[23:38] ah I remember
[23:38] maybe due to dhcp relaying ?
[23:38] duh
[23:38] yeah i'm sure our config is floating back there somewhere ;)
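On the block-size point raised at [22:08] and in the TFTP text quoted at [22:10], the arithmetic can be sketched as follows. This is an illustration rather than anything MAAS does: it assumes IPv4 with no IP options (20-byte header), an 8-byte UDP header, and the 4-byte TFTP DATA header (opcode plus block number); the function names are made up for the example.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8     # bytes
TFTP_DATA_HEADER = 4  # 2-byte opcode + 2-byte block number


def ip_packet_size(blksize):
    """Size in bytes of the IPv4 packet carrying one full TFTP DATA block."""
    return blksize + TFTP_DATA_HEADER + UDP_HEADER + IPV4_HEADER


def max_blksize_for_mtu(mtu):
    """Largest TFTP block size that avoids IP fragmentation at this MTU."""
    return mtu - IPV4_HEADER - UDP_HEADER - TFTP_DATA_HEADER


def fits_without_fragmentation(blksize, path_mtu):
    """True if a full DATA block fits in one packet on the narrowest hop."""
    return ip_packet_size(blksize) <= path_mtu


if __name__ == "__main__":
    # Default 512-byte blocks versus a larger negotiated size, checked
    # against a standard Ethernet MTU and a smaller hypothetical tunnel MTU.
    for blksize in (512, 1428, 1468):
        for mtu in (1500, 1400):
            verdict = "ok" if fits_without_fragmentation(blksize, mtu) else "fragments"
            print(f"blksize={blksize} mtu={mtu}: {ip_packet_size(blksize)} bytes, {verdict}")
```

Under these assumptions the default 512-byte block yields a 544-byte packet, and a 1500-byte MTU leaves room for at most 1468 bytes of data per block. Any hop with a smaller MTU, such as a tunnel or virtual interface, lowers that ceiling, which is one reason to compare the negotiated block size seen in a tcpdump against the narrowest link on the path.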