/srv/irclogs.ubuntu.com/2017/11/17/#maas.txt

=== frankban|afk is now known as frankban
michael____	/me waves hello	09:21
* michael____ waves hello		09:31
michael____	hi everybody!	09:36
michael____	how to check the command sent from MAAS to PXE boot a certain machine ?	09:36
michael____	is it logged anywhere ?	09:37
mup	Bug #1732920 opened: Editing node interfaces require multiple attempts <MAAS:Incomplete> <https://launchpad.net/bugs/1732920>	14:35
vogelc_	roaksoax: Looking for a quick clarification. We are seeing hosts hang right after initrd and kernel are downloaded via tftp. Once the download completes is there a last communication to/from the client to the rack or region controller?	14:42
=== jac_ is now known as jac_cplane
roaksoax	vogelc_: the machine PXE boots, downloads initrd/kernel, loads into the ephemeral environment and does things	14:46
roaksoax	if the IP given as a kernel parameter is unreachable, the machine will never communicate with the MAAS region	14:46
roaksoax	vogelc_: that said, check your kernel params and verify that the IP being passed is the correct one	14:46
vogelc_	ips are correct. I am wondering if bootp traffic is getting blocked somewhare	14:47
vogelc_	where	14:47
roaksoax	vogelc_: could be, if it is hanging while loading kernel/initrd	14:48
roaksoax	vogelc_: are you connected via the machine's VNC ?	14:48
vogelc_	I have an OOB console to the machine	14:49
roaksoax	vogelc_: if you are using serial console, try enabling serial for the machine via a kernel param (e.g. console=ttyS0,9600n8 or something of that sort)	14:50
roaksoax	vogelc_: also, regiond.log will tell you if the machine contacted the region	14:50
roaksoax	vogelc_: 2017-11-13 18:21:38 regiond: [info] 10.90.90.192 POST /MAAS/metadata/status/yfhwqe HTTP/1.1 --> 204 NO_CONTENT (referrer: -; agent: python-requests/2.9.1)	14:51
roaksoax	it should show something like that	14:51
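A minimal sketch of that check, assuming regiond.log lines are shaped like the single sample quoted above. The regex and the helper name `requests_from` are illustrative only and are not part of MAAS; other log formats may need a different pattern.

```python
import re

# Pattern modelled on the one sample line quoted in the log above.
# Any other regiond log layout is an assumption and may not match.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) regiond: \[\w+\] "
    r"(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) HTTP/\S+ --> (?P<status>\d{3})"
)


def requests_from(lines, ip):
    """Return (timestamp, method, path, status) for every request made by `ip`."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("ip") == ip:
            hits.append(
                (
                    match.group("ts"),
                    match.group("method"),
                    match.group("path"),
                    int(match.group("status")),
                )
            )
    return hits


sample = [
    "2017-11-13 18:21:38 regiond: [info] 10.90.90.192 POST "
    "/MAAS/metadata/status/yfhwqe HTTP/1.1 --> 204 NO_CONTENT "
    "(referrer: -; agent: python-requests/2.9.1)",
]
print(requests_from(sample, "10.90.90.192"))
```

An empty result for the machine's IP would suggest the ephemeral environment never reached the region, which matches the "hang after initrd" symptom described here.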
mup	Bug #1732927 opened: Erroneous IP Already in use when setting interface to unconfigured <MAAS:New> <https://launchpad.net/bugs/1732927>	15:05
mup	Bug #1732942 opened: [2.3, UI] Unable to set interface back to 'Disconnected' <MAAS:Triaged> <https://launchpad.net/bugs/1732942>	16:14
=== frankban is now known as frankban|afk
mup	Bug #1732948 opened: [2.3, HWTv2] Machine was marked Failed commissioning due to cloud-init failures, but tests weren't aborted and shown as 'pending <MAAS:Triaged by ltrager> <https://launchpad.net/bugs/1732948>	17:09
jamesbenson	roadsoax: https://bugs.launchpad.net/maas/+bug/1717031. Do you want the curtin from UEFI or BIOS with my custom partition so it is successful?	17:48
jamesbenson	roaksoax: ^^	17:49
roaksoax	jamesbenson: i think what we need is:	18:07
roaksoax	1. set UEFI in the BIOS	18:07
roaksoax	2. commission the machine	18:07
roaksoax	3. What MAAS automatically did for commissioning	18:08
roaksoax	4. your custom partitioning	18:08
roaksoax	so we would need output for both 3 and 4	18:08
roaksoax	jamesbenson: if you can confirm whether maas deploys after maas commissioning the machine, that'd be good too	18:09
jamesbenson	roaksoax: 3? what details are you looking for explicitly? Not sure what you mean by automatically. I just set a machine to UEFI and commissioning now.	18:38
roaksoax	jamesbenson: ok, so maas will set a default storage layout	18:40
roaksoax	deploy that machine with athta storage layout	18:40
roaksoax	and gather the curtin config	18:40
jamesbenson	with my storage layout?	18:42
roaksoax	jamesbenson: without	18:44
roaksoax	jamesbenson: sorry, maybe i'm not being clear. Commission the machine and deploy it with the storage layout MAAS automatically selected. (gather the curtin config)	18:45
roaksoax	jamesbenson: and then, apply your custom layout, and gather your curtin config	18:45
roaksoax	it would be good to get the installation logs for each case as well	18:46
jamesbenson	no just: "with athta storage" was a bit confusing...	18:48
roaksoax	jamesbenson: sorry :) friday afternoon, no lunch and ready to call it a week	18:50
jamesbenson	so my plan is to commission, don't change anything deploy and get the curtin, (assuming it fails) release, deploy with my storage options, gather the curtin.	18:51
roaksoax	++	18:54
jamesbenson	the curtin is only available after the deployment is done?	18:54
jamesbenson	cool	18:54
jamesbenson	deploying now	18:54
roaksoax	jamesbenson: while deploying	18:55
jamesbenson	when should I issue that command then?	18:56
roaksoax	jamesbenson: when the machine is 'deploying' you can issue the command	19:02
jamesbenson	hmm... the command is: maas administrator machines get-curtin-config ksran6	19:07
jamesbenson	log into maas and issue that command?	19:07
jamesbenson	got me an error	19:07
roaksoax	jamesbenson: maas 'user' machine get-curtin-config	19:11
roaksoax	s/machine/machines/	19:11
roaksoax	try that ?	19:11
jamesbenson	yeah that did it	19:12
xygnal	roaksoax: still having my problem. Cannot get any additional data from the console on this hang.	19:34
xygnal	roaksoax: I can't tell if it's waiting on something local, on something network, or what's going on. suggestions on how to further drill down to what component after initrd is responsible for the hang?	19:35
mup	Bug #1732980 opened: MAAS incorrectly PXE boot UEFI/legacy boot <hp-proliant-dl380-g9> <maas> <pxe-boot> <uefi> <MAAS:New> <https://launchpad.net/bugs/1732980>	20:06
mup	Bug #1732983 opened: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>	20:06
jamesbenson	roaksoax: what's the partitioning mount points for uefi?	20:07
jamesbenson	'/boot/efi'?	20:07
jamesbenson	found it: https://help.ubuntu.com/community/UEFI	20:11
jamesbenson	https://help.ubuntu.com/community/UEFI#Creating_an_EFI_System_Partition	20:11
mup	Bug #1732983 changed: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>	20:15
mup	Bug #1732983 opened: [2.3] Strange behavior when removing a secondary controller <MAAS:Won't Fix> <https://launchpad.net/bugs/1732983>	20:21
xygnal	roaksoax: also, is there any way to get MAAS to use http instead of TFTP for transferring files? we are going across multiple hops and through virtual interfaces on virtual machines. i think it's causing the choking.	20:24
mup	Bug #1732983 changed: [2.3] Strange behavior when removing a secondary controller <MAAS:Triaged> <https://launchpad.net/bugs/1732983>	20:30
jamesbenson	roaksoax: can you give me parted & lsblk info for uefi partitioning?	20:33
jamesbenson	looking for mountpoints, flags, etc.	20:34
jamesbenson	^^ anyone?	20:42
xygnal	jamesbenson: haven't tried to use UEFI yet :(	20:56
jamesbenson	thanks xygnal...	21:06
jamesbenson	uefi doesn't work ....	21:06
xygnal	according to roaksoax it does. i see commits in code from earlier versions showing that it does.	21:12
xygnal	but there may be specific details, possibly with hardware model/bios settings, that need to be setup to make that work with maas	21:13
jamesbenson	xygnal, yes, we are trying to debug that.	21:21
xygnal	jamesbenson: look forward to your success :)	21:23
jamesbenson	lol, me too!	21:23
xygnal	wish that was my problem. harder to solve one here. pxe boots are hanging after initrd, and cannot find WHY.	21:23
xygnal	not every time. intermittently.	21:23
xygnal	as if they are stalling	21:23
jamesbenson	the big issue for us when we started was networking stuff... ours used to stall too, mostly due to how our interfaces were set up. eth0 pxe, eth1 public... make sure IP's were assigned in both.	21:24
jamesbenson	pxe had/has no outside access only eth1	21:24
jamesbenson	we have a work around for our storage issue, but more manual than needed.....	21:25
jamesbenson	s/needed/'should be'/	21:25
xygnal	ah we share pxe and prod right now on single interface	21:25
jamesbenson	oh... we can do that, but never had luck.. try separating.	21:26
xygnal	explain?	21:26
xygnal	what problems?	21:26
jamesbenson	we set up a dedicated pxe/internal subnet and a dedicated public nic.	21:26
jamesbenson	2 switches...	21:27
jamesbenson	2 nics	21:27
jamesbenson	so pxe goes to only internal and 1 switch; internet nic goes to different nic/different switch	21:27
jamesbenson	dedicated traffic	21:27
xygnal	too costly for how big we build	21:31
jamesbenson	roger.	21:31
jamesbenson	how big is your rig?	21:32
jamesbenson	we buy refurb from servermonkey.com	21:32
xygnal	Dell's. We are a big shop.	21:36
xygnal	though we've built our deployment system around being able to use whatever hardware we want	21:36
TJ-	xygnal: does the problem host have network console access, or alternatively, configure the kernel's netconsole so you can monitor/interact	21:37
xygnal	we're watching the KVM console on the DRAC. We're not missing anything.	21:38
xygnal	tried CONSOLE=tty0 with verbose, debug, debug=vc, --verbose options.. no difference	21:39
xygnal	no data	21:39
xygnal	loads kernel	21:39
xygnal	loads initrd	21:39
xygnal	both "...ok"	21:39
xygnal	then stalled	21:39
xygnal	dead in the water	21:39
TJ-	xygnal: you don't see any kernel messages at all?	21:39
xygnal	none. zero messages.	21:39
jamesbenson	yeah, that's what we do... our maas server is a VM on our management servers that tie into all of our racks on the internal switch.	21:39
xygnal	with those global kernel options showing up in my KVM console window.	21:39
xygnal	know they are being applied	21:40
jamesbenson	which dells? r610/r710?	21:40
xygnal	6/7/830s	21:40
jamesbenson	okay, we've got the ones I mentioned...	21:40
jamesbenson	r410,r910's too	21:41
xygnal	yeah we have a good mix	21:41
TJ-	xygnal: this is a BIOS boot?	21:41
xygnal	yes, not using UEFI at this time.	21:41
xygnal	and at the time when this happens all we see is DHCP and TFTP traffic. i dont think it gets to iscsi yet?	21:41
TJ-	xygnal: silly question but... have you tried dropping the initrd and only booting the kernel?	21:42
xygnal	i dont know what is next in the order after initrd loads	21:42
xygnal	no I have not.	21:42
xygnal	I will mention that if we repeatedly power cycle the box, it sometimes gets past this hang	21:42
xygnal	and this happens across pretty much everything, intermittently	21:42
TJ-	xygnal: in case there's an issue with the load of the initrd... the boot-loader normally puts in memory immediately after the kernel image, then hands over to the kernel's entry point	21:43
xygnal	that would be a real pain to troubleshoot with how hard it is to single this issue out.	21:43
xygnal	can take quite a few tries	21:43
TJ-	xygnal: if you don't get the kernel to even start up it's a pretty good assumption the network is losing packets and the transfer isn't completing, or is being corrupted	21:43
xygnal	TJ-: suspecting it but cannot prove it yet, a bit hard to prove it with UDP in the first place. Also seen a lot of bug reports about syslinux/pxelinux versions handling packet loss differently.	21:44
TJ-	xygnal: I'd mirror the port the host is attached to on my switch, capture the TFTP stream, reconstruct the file and check its hash to ensure what you think is being sent to the host, actually arrives there :)	21:44
xygnal	we've had network engineers go over the equipment and insist no problems	21:45
xygnal	no errors	21:45
xygnal	i've been asking them for that :)	21:45
xygnal	its new infrastructure so they are not yet organized enough to find it and setup my span	21:45
xygnal	how do you suggest the compare?	21:45
TJ-	xygnal: if you see nothing from the kernel, make the working assumption it isn't arriving in memory correctly. Therefore, try reducing the size of the boot image - I'd switch to loading something like GRUB or an ipxe image. something as small as possible to prove that something can boot	21:50
xygnal	that is sound advice	21:51
TJ-	xygnal: if you can get ipxe to start, you could use it to chainload the linux kernel - and get some logging out of ipxe about the transfer. you never know, there might be a subtle bug in the PXE on the hardware	21:51
xygnal	give it something less likely to be given a chance to be corrupted	21:51
TJ-	yes, exactly	21:51
TJ-	even if it's a basic "hello world" static ELF binary you compile yourself!	21:52
TJ-	after all, the linux kernel is just an ELF binary executable	21:52
xygnal	how do i access the TFTP files used for these commissions and deploys, manually, if i want to test out TFTP client downloads	21:55
xygnal	to match checkcims that way	21:55
xygnal	checksums	21:55
xygnal	it looked like twisted3 is hosting it directly	21:55
xygnal	fastest test i could perform is to boot a rescue cd, perform 50 TFTP downloads of the same file, checksum each one	21:56
xygnal	if any fail, despite that	21:56
xygnal	dispute* ;)	21:56
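The checksum half of that test could be sketched as below, assuming the repeated downloads are already saved as separate local files. The TFTP fetch itself is not shown, and the helper names `sha256_of` and `find_mismatches` are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large kernel/initrd images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(paths, reference_digest=None):
    """Return the paths whose SHA-256 differs from the reference digest.

    If no known-good digest is supplied, the most common digest among the
    downloads is used as the reference (an assumption: most copies are good).
    """
    digests = {path: sha256_of(path) for path in paths}
    if not digests:
        return []
    if reference_digest is None:
        reference_digest = Counter(digests.values()).most_common(1)[0][0]
    return [path for path, value in digests.items() if value != reference_digest]
```

Passing a digest computed on the server side as `reference_digest` is the stronger check, since it compares against the source file rather than against the majority of downloads.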
TJ-	you mean boot the rescue CD on the bare hardware then let it test TFTP ?	22:00
xygnal	I realize multi-hop traversal of TFTP is bad in the first place, but we are too big to do a rack controller in every single local subnet.	22:00
xygnal	yes	22:00
xygnal	I do	22:00
TJ-	because although that'll test with the kernel, it won't test any bad behaviour by the system's PXE BIOS services	22:00
xygnal	correct that would just eliminate of we are getting bad checksums on our TFTP transfer	22:01
xygnal	if	22:01
TJ-	well no, because corruption, if it occurs, could be due to the PXE BIOS itself, not the network. stupid things like latency can sometimes induce weird side-effects	22:01
xygnal	if it's being corrupted by the systems PXE BIOS services, we have a MUCH more annoying problem	22:01
xygnal	I dos but that would allow me to release the network infrastructure from blame	22:02
xygnal	good checksum end to end? well if its bad, its not the network.	22:02
TJ-	if you're able to do a bare-metal rescue disc test, couldn't you also through FOG or similar onto a laptop to make it a PXE server, plug a cable from the host to the laptop, and do a direct PXE boot test :)	22:02
xygnal	yes i could but remember - this doesnt happen every single time	22:03
TJ-	s/through/throw/	22:03
xygnal	its an intermittent problem. often enough to really piss off customers, not often enough to catch every single try.	22:03
TJ-	right... but you could do an additional 50 tests of host<>laptop PXe boot on top of your 50 TFTP file transfers.	22:03
TJ-	it sounds like packet-drop. does the network have monitoring in place for that?	22:04
xygnal	yes but they claim no problems.	22:05
xygnal	that is why i am trying to get the network performance team to give me a span off the nearest switch port	22:05
TJ-	:) well there's also the iPXE bootable ISO/DVD you could use to test the PXE side of course	22:05
xygnal	the real concern for me is	22:06
xygnal	if it IS corruption on the network level	22:06
xygnal	how are we going to solve THAT? this is far too fragile of a transfer to lose packets.	22:06
jamesbenson	xygnal, shielded cables? :-p	22:08
TJ-	if it's TFTP related, I'd first check if the TFTP server and client are negotiating to transfer more than 512-byte sized blocks ... if the packet size increases too much there could be issues with MTU.	22:08
xygnal	I can snoop that out from the rack controller i bet with a tcpdump	22:09
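Once a capture exists, the requested block size can be read from the client's read request. The sketch below assumes the raw UDP payload of a TFTP RRQ/WRQ has already been extracted from the capture (that step is not shown); the function name `parse_tftp_request` is hypothetical. The layout follows the TFTP option extension: opcode, then NUL-terminated filename, mode, and option/value pairs.

```python
def parse_tftp_request(payload):
    """Parse a TFTP RRQ/WRQ UDP payload into (opcode, filename, mode, options)."""
    if len(payload) < 4:
        raise ValueError("payload too short for a TFTP request")
    opcode = int.from_bytes(payload[:2], "big")
    if opcode not in (1, 2):  # 1 = RRQ, 2 = WRQ
        raise ValueError("not a read/write request")
    # Everything after the opcode is a run of NUL-terminated ASCII strings.
    fields = payload[2:].split(b"\x00")
    if fields and fields[-1] == b"":
        fields.pop()  # the final terminator leaves one empty element
    if len(fields) < 2:
        raise ValueError("missing filename or mode")
    filename = fields[0].decode("ascii")
    mode = fields[1].decode("ascii").lower()
    rest = fields[2:]
    # Option names are case-insensitive, so normalise them to lower case.
    options = {
        rest[i].decode("ascii").lower(): rest[i + 1].decode("ascii")
        for i in range(0, len(rest) - 1, 2)
    }
    return opcode, filename, mode, options


request = b"\x00\x01pxelinux.0\x00octet\x00blksize\x001408\x00tsize\x000\x00"
opcode, filename, mode, options = parse_tftp_request(request)
print(filename, options.get("blksize", "512 (default, no option sent)"))
```

A request without a `blksize` option means the transfer uses the default 512-byte blocks. The server's option acknowledgement should be checked too, since it may accept a smaller value than the client asked for.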
TJ-	I'd lock the TFTP server to 512-byte blocks if that were suspected, for testing.	22:09
xygnal	how do i lock it? this is not local in.tftpd this is twisted3 process	22:09
TJ-	From the TFTP protocol:	22:10
TJ-	If the defined blocksize produces an IP packet size that exceeds the minimum MTU at any point of the network path, IP fragmentation and reassembly will occur not only adding more overhead but also leading to total transfer failure when the minimalist IP stack implementation in a host's BOOTP or PXE ROM does not (or fails to properly) implement IP fragmentation and reassembly	22:10
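The arithmetic behind that warning, as a small sketch. It assumes IPv4 without IP options (20-byte header) and no tunnel or VLAN overhead on the path; the function names are illustrative.

```python
IPV4_HEADER = 20       # bytes, assuming no IP options
UDP_HEADER = 8         # bytes
TFTP_DATA_HEADER = 4   # 2-byte opcode + 2-byte block number


def tftp_ip_packet_size(blksize):
    """Size of the IPv4 packet carrying one full TFTP DATA block."""
    return blksize + TFTP_DATA_HEADER + UDP_HEADER + IPV4_HEADER


def max_unfragmented_blksize(path_mtu):
    """Largest TFTP block size that fits in one packet for a given path MTU."""
    return path_mtu - IPV4_HEADER - UDP_HEADER - TFTP_DATA_HEADER


def will_fragment(blksize, path_mtu):
    """True if a full DATA block would exceed the path MTU and be fragmented."""
    return tftp_ip_packet_size(blksize) > path_mtu


for blksize in (512, 1408, 1468):
    for mtu in (1500, 1400):
        print(blksize, mtu, tftp_ip_packet_size(blksize), will_fragment(blksize, mtu))
```

With the default 512-byte block the packet is 544 bytes, so fragmentation is only a concern when a larger block size has been negotiated and some hop has a smaller MTU than the endpoints expect.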
xygnal	good catch	22:12
TJ-	I've dealt with it before; many many years ago!	22:12
TJ-	these days it's relatively rare	22:12
xygnal	soon as the team exposed that they plan to run a rack controller per datacenter, meaning a whole lot of forwarding from other subnets, I knew we might run into this stuff	22:13
xygnal	thanks much	22:15
xygnal	roaksoax: any plans to move to iPXE so we use http instead of TFTP?	23:04
roaksoax	xygnal: not in the short term, although we have a PR adding support for iPXE. I dont know what's the extent of it or whether it is used for everything	23:07
roaksoax	https://code.launchpad.net/~wpk/maas/+git/maas/+merge/332552	23:08
xygnal	ty sir	23:16
roaksoax	seems it only adds it for kvm now, but seems trivial to use it across the board for x86 systems	23:17
xygnal	passing initrd over HTTP sounds a lot more reliable than using TFTP across multiple hops	23:19
mup	Bug #1733015 opened: [2.3, HWTv2] Runtime resets back to 0 after 24 hours <MAAS:Triaged> <https://launchpad.net/bugs/1733015>	23:30
roaksoax	maybe, although we typically recommend putting 1 region per DC, instead of splitting racks :)	23:30
xygnal	we are 1 region, with 1 rack (pair) per datacenter	23:32
xygnal	though our datacenters are all located in the same US State	23:34
xygnal	minimal latency	23:35
roaksoax	xygnal: so region/rack are routed ?	23:38
roaksoax	ah I remember	23:38
roaksoax	maybe due to dhcp relaying ?	23:38
roaksoax	duh	23:38
xygnal	yeah i'm sure our config is floating back there somewhere ;)	23:38
