/srv/irclogs.ubuntu.com/2017/11/17/#maas.txt

=== frankban|afk is now known as frankban
michael____ /me waves hello09:21
* michael____ waves hello09:31
michael____hi everybody!09:36
michael____how to check the command sent from MAAS to PXE boot a certain machine?09:36
michael____is it logged anywhere ?09:37
mupBug #1732920 opened: Editing node interfaces require multiple attempts <MAAS:Incomplete> <https://launchpad.net/bugs/1732920>14:35
vogelc_roaksoax: Looking for a quick clarification. We are seeing hosts hang right after initrd and kernel are downloaded via tftp.  Once the download completes is there a last communication to/from the client to the rack or region controller?14:42
=== jac_ is now known as jac_cplane
roaksoaxvogelc_: the machine PXE boots, downloads initrd/kernel, loads into the ephemeral environment and does things14:46
roaksoaxif the IP given as a kernel parameter is unreachable, the machine will never communicate with the MAAS region14:46
roaksoaxvogelc_: that said, check your kernel params and verify that the IP being passed is the correct one14:46
vogelc_ips are correct.  I am wondering if bootp traffic is getting blocked somewhare14:47
vogelc_where14:47
roaksoaxvogelc_: could be, if it is hanging while loading kernel/initrd14:48
roaksoaxvogelc_: are you connected via the machine's VNC ?14:48
vogelc_I have an OOB console to the machine14:49
roaksoaxvogelc_: if you are using serial console, try enabling serial for the machine via a kernel param (e.g. console=ttyS0,9600n8 or something of that sort)14:50
roaksoaxvogelc_: also, regiond.log will tell you if the machine contacted the region14:50
roaksoaxvogelc_: 2017-11-13 18:21:38 regiond: [info] 10.90.90.192 POST /MAAS/metadata/status/yfhwqe HTTP/1.1 --> 204 NO_CONTENT (referrer: -; agent: python-requests/2.9.1)14:51
roaksoaxit should show something like that14:51
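
A minimal sketch of the regiond.log check roaksoax describes above: scan the log for requests coming from the machine's IP. The pattern is modeled only on the single log line quoted in the channel, so treat the layout as an assumption; the function and variable names are hypothetical.

```python
import re

# Pattern modeled on the one regiond.log line quoted in the channel.
# Other MAAS versions may use a different layout (assumption).
REQUEST_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) regiond: \[info\] "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"HTTP/\d\.\d --> (?P<status>\d{3})"
)


def requests_from(lines, machine_ip):
    """Return (timestamp, method, path, status) for each request from machine_ip."""
    hits = []
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and match.group("ip") == machine_ip:
            hits.append((
                match.group("timestamp"),
                match.group("method"),
                match.group("path"),
                int(match.group("status")),
            ))
    return hits


# The sample is the line pasted in the channel.
sample = [
    "2017-11-13 18:21:38 regiond: [info] 10.90.90.192 POST "
    "/MAAS/metadata/status/yfhwqe HTTP/1.1 --> 204 NO_CONTENT "
    "(referrer: -; agent: python-requests/2.9.1)",
]

for hit in requests_from(sample, "10.90.90.192"):
    print(hit)
```

An empty result for the machine's IP would suggest the machine never reached the region after loading the kernel and initrd, which is the case being discussed here.
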
mupBug #1732927 opened: Erroneous IP Already in use when setting interface to unconfigured <MAAS:New> <https://launchpad.net/bugs/1732927>15:05
mupBug #1732942 opened: [2.3, UI] Unable to set interface back to 'Disconnected' <MAAS:Triaged> <https://launchpad.net/bugs/1732942>16:14
=== frankban is now known as frankban|afk
mupBug #1732948 opened: [2.3, HWTv2] Machine was marked Failed commissioning due to cloud-init failures, but tests weren't aborted and shown as 'pending <MAAS:Triaged by ltrager> <https://launchpad.net/bugs/1732948>17:09
jamesbensonroadsoax: https://bugs.launchpad.net/maas/+bug/1717031. Do you want the curtin from UEFI or BIOS with my custom partition so it is successful?17:48
jamesbensonroaksoax:  ^^17:49
roaksoaxjamesbenson: i think what we need is:18:07
roaksoax1. set UEFI in the BIOS18:07
roaksoax2. commission the machine18:07
roaksoax3. What MAAS automatically did for commissioning18:08
roaksoax4. your custom partitioning18:08
roaksoaxso we would need output for both 3 and 418:08
roaksoaxjamesbenson: if you can confirm whether maas deploys after maas commissioning the machine, that'd be good too18:09
jamesbensonroaksoax: 3?  what details are you looking for explicitly?   Not sure what you mean by automatically.  I just set a machine to UEFI and commissioning now.18:38
roaksoaxjamesbenson: ok, so maas will set a default storage layout18:40
roaksoaxdeploy that machine with athta storage layout18:40
roaksoaxand gather the curtin config18:40
jamesbensonwith my storage layout?18:42
roaksoaxjamesbenson: without18:44
roaksoaxjamesbenson: sorry, maybe i'm not being clear. Commission the machine and deploy it with the storage layout MAAS automatically selected. (gather the curtin config)18:45
roaksoaxjamesbenson: and then, apply your custom layout, and gather your curtin config18:45
roaksoaxit would be good to get the installation logs for each case as well18:46
jamesbensonno just:  "with athta storage" was a bit confusing...18:48
roaksoaxjamesbenson: sorry :) friday afternoon, no lunch and ready to call it a week18:50
jamesbensonso my plan is to commission, don't change anything deploy and get the curtin, (assuming it fails) release, deploy with my storage options, gather the curtin.18:51
roaksoax++18:54
jamesbensonthe curtin is only available after the deployment is done?18:54
jamesbensoncool18:54
jamesbensondeploying now18:54
roaksoaxjamesbenson: while deploying18:55
jamesbensonwhen should I issue that command then?18:56
roaksoaxjamesbenson: when the machine is 'deploying' you can issue the command19:02
jamesbensonhmm... the command is: maas administrator machines get-curtin-config ksran619:07
jamesbensonlog into maas and issue that command?19:07
jamesbensongot me an error19:07
roaksoaxjamesbenson: maas 'user' machine get-curtin-config19:11
roaksoaxs/machine/machines/19:11
roaksoaxtry that ?19:11
jamesbensonyeah that did it19:12
xygnalroaksoax: still having my problem. Cannot get any additional data from the console on this hang.19:34
xygnalroaksoax: I can't tell if it's waiting on something local, on something network, or what's going on.  suggestions on how to further drill down to what component after initrd is responsible for the hang?19:35
mupBug #1732980 opened: MAAS incorrectly PXE boot UEFI/legacy boot <hp-proliant-dl380-g9> <maas> <pxe-boot> <uefi> <MAAS:New> <https://launchpad.net/bugs/1732980>20:06
mupBug #1732983 opened: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>20:06
jamesbensonroaksoax: what's the partitioning mount points for uefi?20:07
jamesbenson'/boot/efi'?20:07
jamesbensonfound it:  https://help.ubuntu.com/community/UEFI20:11
jamesbensonhttps://help.ubuntu.com/community/UEFI#Creating_an_EFI_System_Partition20:11
mupBug #1732983 changed: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>20:15
mupBug #1732983 opened: [2.3] Strange behavior when removing a secondary controller <MAAS:Won't Fix> <https://launchpad.net/bugs/1732983>20:21
xygnalroaksoax: also, is there any way to get MAAS to use http instead of TFTP for transferring files? we are going across multiple hops and through virtual interfaces on virtual machines. i think it's causing the choking.20:24
mupBug #1732983 changed: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>20:30
jamesbensonroaksoax: can you give me parted & lsblk info for uefi partitioning?20:33
jamesbensonlooking for mountpoints, flags, etc.20:34
jamesbenson^^ anyone?20:42
xygnaljamesbenson: haven't tried to use UEFI yet :(20:56
jamesbensonthanks xygnal...21:06
jamesbensonuefi doesn't work ....21:06
xygnalaccording to roaksoax it does. i see commits in code from earlier versions showing that it does.21:12
xygnalbut there may be specific details, possibly with hardware model/bios settings, that need to be setup to make that work with maas21:13
jamesbensonxygnal, yes, we are trying to debug that.21:21
xygnaljamesbenson: look forward to your success :)21:23
jamesbensonlol, me too!21:23
xygnalwish that was my problem. harder to solve one here.  pxe boots are hanging after initrd, and cannot find WHY.21:23
xygnalnot every time. intermittently.21:23
xygnalas if they are stalling21:23
jamesbensonthe big issue for us when we started was networking stuff... ours used to stall too, mostly due to how our interfaces were set up.  eth0 pxe, eth1 public... make sure IP's were assigned in both.21:24
jamesbensonpxe had/has no outside access only eth121:24
jamesbensonwe have a work around for our storage issue, but more manual than needed.....21:25
jamesbensons/needed/'should be'/21:25
xygnalah we share pxe and prod right now on single interface21:25
jamesbensonoh... we can do that, but never had luck.. try separating.21:26
xygnalexplain?21:26
xygnalwhat problems?21:26
jamesbensonwe set up a dedicated pxe/internal subnet and a dedicated public nic.21:26
jamesbenson2 switches...21:27
jamesbenson2 nics21:27
jamesbensonso pxe goes to only internal and 1 switch; internet nic goes to different nic/different switch21:27
jamesbensondedicated traffic21:27
xygnaltoo costly for how big we build21:31
jamesbensonroger.21:31
jamesbensonhow big is your rig?21:32
jamesbensonwe buy refurb from servermonkey.com21:32
xygnalDell's.  We are a big shop.21:36
xygnalthough we've built our deployment system around being able to use whatever hardware we want21:36
TJ-xygnal: does the problem host have network console access, or alternatively, configure the kernel's netconsole so you can monitor/interact21:37
xygnalwe're watching the KVM console on the DRAC. We're not missing anything.21:38
xygnaltried CONSOLE=tty0 with verbose, debug, debug=vc, --verbose options.. no difference21:39
xygnalno data21:39
xygnalloads kernel21:39
xygnalloads initrd21:39
xygnalboth "...ok"21:39
xygnalthen stalled21:39
xygnaldead in the water21:39
TJ-xygnal: you don't see any kernel messages at all?21:39
xygnalnone. zero messages.21:39
jamesbensonyeah, that's what we do... our maas server is a VM on our management servers that tie into all of our racks on the internal switch.21:39
xygnalwith those global kernel options showing up in my KVM console window.21:39
xygnalI know they are being applied21:40
jamesbensonwhich dells?  r610/r710?21:40
xygnal6/7/830s21:40
jamesbensonokay, we've got the ones I mentioned...21:40
jamesbensonr410,r910's too21:41
xygnalyeah we have a good mix21:41
TJ-xygnal: this is a BIOS boot?21:41
xygnalyes, not using UEFI at this time.21:41
xygnaland at the time when this happens all we see is DHCP and TFTP traffic. i don't think it gets to iscsi yet?21:41
TJ-xygnal: silly question but... have you tried dropping the initrd and only booting the kernel?21:42
xygnali dont know what is next in the order after initrd loads21:42
xygnalno I have not.21:42
xygnalI will mention that if we repeatedly power cycle the box, it sometimes gets past this hang21:42
xygnaland this happens across pretty much everything, intermittently21:42
TJ-xygnal: in case there's an issue with the load of the initrd... the boot-loader normally puts it in memory immediately after the kernel image, then hands over to the kernel's entry point21:43
xygnalthat would be a real pain to troubleshoot with how hard it is to single this issue out.21:43
xygnalcan take quite a few tries21:43
TJ-xygnal: if you don't get the kernel to even start up it's a pretty good assumption the network is losing packets and the transfer isn't completing, or is being corrupted21:43
xygnalTJ-: suspecting it but cannot prove it yet, a bit hard to prove it with UDP in the first place. Also seen a lot of bug reports about syslinux/pxelinux versions handling packet loss differently.21:44
TJ-xygnal: I'd mirror the port the host is attached to on my switch, capture the TFTP stream, reconstruct the file and check its hash to ensure what you *think* is being sent to the host, actually arrives there :)21:44
xygnalwe've had network engineers go over the equipment and insist no problems21:45
xygnalno errors21:45
xygnali've been asking them for that :)21:45
xygnalits new infrastructure so they are not yet organized enough to find it and setup my span21:45
xygnalhow do you suggest the compare?21:45
TJ-xygnal: if you see nothing from the kernel, make the working assumption it isn't arriving in memory correctly. Therefore, try reducing the size of the boot image - I'd switch to loading something like GRUB or an ipxe image. something as small as possible to prove that *something* can boot21:50
xygnalthat is sound advice21:51
TJ-xygnal: if you can get ipxe to start, you could use it to chainload the linux kernel - and get some logging out of ipxe about the transfer. you never know, there might be a subtle bug in the PXE on the hardware21:51
xygnalgive it something less likely to be given a chance to be corrupted21:51
TJ-yes, exactly21:51
TJ-even if it's a basic "hello world" static ELF binary you compile yourself!21:52
TJ-after all, the linux kernel is just an ELF binary executable21:52
xygnalhow do i access the TFTP files used for these commissions and deploys, manually, if i want to test out TFTP client downloads21:55
xygnalto match checkcims that way21:55
xygnalchecksums21:55
xygnalit looked like twisted3 is hosting it directly21:55
xygnalfastest test i could perform is to boot a rescue cd, perform 50 TFTP downloads of the same file, checksum each one21:56
xygnalif any fail, despite that21:56
xygnaldispute* ;)21:56
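
The checksum comparison xygnal proposes could be sketched roughly as follows. This assumes the repeated TFTP downloads have already been saved into one directory by whatever TFTP client the rescue environment provides, and that a reference hash was computed from the file on the rack controller; the directory layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large initrd does not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(download_dir, reference_hash):
    """Return {filename: hash} for every download whose hash differs from the reference."""
    mismatches = {}
    for path in sorted(Path(download_dir).iterdir()):
        if path.is_file():
            actual = sha256_of(path)
            if actual != reference_hash:
                mismatches[path.name] = actual
    return mismatches
```

Calling `find_mismatches("downloads/", reference_hash)` after the 50 transfers returns an empty dict when every copy arrived intact. As TJ- points out next, a clean result only covers the network path with the rescue kernel's stack, not the firmware's PXE client.
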
TJ-you mean boot the rescue CD on the bare hardware then let it test TFTP ?22:00
xygnalI realize multi-hop traversal of TFTP is bad in the first place, but we are too big to do a rack controller in every single local subnet.22:00
xygnalyes22:00
xygnalI do22:00
TJ-because although that'll test with the kernel, it *won't* test any bad behaviour by the system's PXE BIOS services22:00
xygnalcorrect that would just eliminate of we are getting bad checksums on our TFTP transfer22:01
xygnalif22:01
TJ-well no, because corruption, if it occurs, could be due to the PXE BIOS itself, not the network. stupid things like latency can sometimes induce weird side-effects22:01
xygnalif it's being corrupted by the systems PXE BIOS services, we have a MUCH more annoying problem22:01
xygnalI dos but that would allow me to release the network infrastructure from blame22:02
xygnalgood checksum end to end? well if its bad, its not the network.22:02
TJ-if you're able to do a bare-metal rescue disc test, couldn't you also through FOG or similar onto a laptop to make it a PXE server, plug a cable from the host to the laptop, and do a direct PXE boot test :)22:02
xygnalyes i could but remember - this doesnt happen every single time22:03
TJ-s/through/throw/22:03
xygnalits an intermittent problem. often enough to really piss off customers, not often enough to catch every single try.22:03
TJ-right... but you could do an additional 50 tests of host<>laptop PXE boot on top of your 50 TFTP file transfers.22:03
TJ-it sounds like packet-drop. does the network have monitoring in place for that?22:04
xygnalyes but they claim no problems.22:05
xygnalthat is why i am trying to get the network performance team to give me a span off the nearest switch port22:05
TJ-:) well there's also the iPXE bootable ISO/DVD you could use to test the PXE side of course22:05
xygnalthe real concern for me is22:06
xygnalif it IS corruption on the network level22:06
xygnalhow are we going to solve THAT? this is far too fragile of a transfer to lose packets.22:06
jamesbensonxygnal, shielded cables? :-p22:08
TJ-if it's TFTP related, I'd first check if the TFTP server and client are negotiating to transfer more than 512-byte sized blocks ... if the packet size increases too much there could be issues with MTU.22:08
xygnalI can snoop that out from the rack controller i bet with a tcpdump22:09
TJ-I'd lock the TFTP server to 512-byte blocks if that were suspected, for testing.22:09
xygnalhow do i lock it? this is not local in.tftpd this is twisted3 process22:09
TJ-From the TFTP protocol:22:10
TJ-If the defined blocksize produces an IP packet size that exceeds the minimum MTU at any point of the network path, IP fragmentation and reassembly will occur, not only adding more overhead but also leading to total transfer failure when the minimalist IP stack implementation in a host's BOOTP or PXE ROM does not (or fails to properly) implement IP fragmentation and reassembly22:10
xygnalgood catch22:12
TJ-I've dealt with it before; many many years ago!22:12
TJ-these days it's relatively rare22:12
xygnalsoon as the team exposed that they plan to run a rack controller *per datacenter*, meaning a whole lot of forwarding from other subnets, I knew we might run into this stuff22:13
xygnalthanks much22:15
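
The blocksize and MTU concern TJ- raises comes down to simple arithmetic, sketched here under the assumption of IPv4 with no IP options: a TFTP data packet carries a 4-byte TFTP header, an 8-byte UDP header and a 20-byte IPv4 header on top of the negotiated block. The function names are hypothetical, and the 1400-byte path MTU is only an illustrative value for a tunnelled or virtual hop.

```python
TFTP_DATA_HEADER = 4  # 2-byte opcode + 2-byte block number
UDP_HEADER = 8
IPV4_HEADER = 20      # assumes no IP options


def ip_packet_size(blksize):
    """Size of the IPv4 packet that carries one full TFTP data block."""
    return blksize + TFTP_DATA_HEADER + UDP_HEADER + IPV4_HEADER


def fits_without_fragmentation(blksize, path_mtu):
    """True if a full data block fits in a single packet on this path."""
    return ip_packet_size(blksize) <= path_mtu


def largest_safe_blksize(path_mtu):
    """Largest block size that avoids fragmentation for the given path MTU."""
    return path_mtu - IPV4_HEADER - UDP_HEADER - TFTP_DATA_HEADER


print(ip_packet_size(512))                     # default 512-byte blocks
print(fits_without_fragmentation(1468, 1500))  # standard Ethernet MTU
print(largest_safe_blksize(1400))              # illustrative smaller path MTU
```

With the default 512-byte blocks each packet is 544 bytes, which fits any realistic path. A negotiated block size above 1468 bytes would fragment on a standard 1500-byte MTU, and the limit drops further on any hop with a smaller MTU.
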
xygnalroaksoax: any plans to move to iPXE so we use http instead of TFTP?23:04
roaksoaxxygnal: not in the short term, although we have a PR adding support for iPXE. I don't know what the extent of it is or whether it is used for everything23:07
roaksoaxhttps://code.launchpad.net/~wpk/maas/+git/maas/+merge/33255223:08
xygnalty sir23:16
roaksoaxseems it only adds it for kvm now, but seems trivial to use it across the board for x86 systems23:17
xygnalpassing initrd over HTTP sounds a lot more reliable than using TFTP across multiple hops23:19
mupBug #1733015 opened: [2.3, HWTv2] Runtime resets back to 0 after 24 hours <MAAS:Triaged> <https://launchpad.net/bugs/1733015>23:30
roaksoaxmaybe, although typically we recommend putting 1 region per DC, instead of splitting racks :)23:30
xygnalwe are 1 region, with 1 rack (pair) per datacenter23:32
xygnalthough our datacenters are all located in the same US State23:34
xygnalminimal latency23:35
roaksoaxxygnal: so region/rack are routed ?23:38
roaksoaxah I remember23:38
roaksoaxmaybe due to dhcp relaying ?23:38
roaksoaxduh23:38
xygnalyeah i'm sure our config is floating back there somewhere ;)23:38
