[00:35] back
[00:40] blackboxsw: ok; I've got 55 consecutive reboots with no issues
[00:41] ok rharper I misread your comment, I thought you did see the hang in europe
[00:41] oh, I do
[00:41] this is with my fix applied
[00:41] I wanted to make sure it wasn't a fluke;
[00:41] ahh good deal. so we may not need to write network names
[00:42] * blackboxsw is trying again in us-central1
[00:42] smoser and I were chatting, and that is still probably the right thing to do anyhow
[00:42] but we can decide when to land that;
[00:42] this also needs maas qa before we push it too
[00:43] I'm worried about all of the other places where we don't see this now; and we're not using the systemd-udev-settle.service; it's just not running anywhere except if you've got zfs installed or lvm enabled
[00:43] * rharper probes some vmtest runs on bionic with zfs and lvm to see what their journal says
[00:45] again rharper you saw this just w/ cloud-init clean --logs and reboots right?
[00:45] oh yeah
[00:45] man, us-central1 still not reproducing it for me
[00:45] first boot in europe-west1 with the 420 bionic image works fine
[00:45] then I rebooted
[00:46] saw it
[00:46] rebooted, it came up
[00:46] will try switching to europewest again
[00:46] so not always, since it's racy
[00:46] s/again/for once/
[00:47] of course it could also be that my recent attempts had debug messages checking and printing name_assign_type
[00:52] heh, didn't realize our account instance view was shared
[00:52] I see rharper-b1 now
[00:53] sure enough first boot in europe-west1-b bricked
[00:53] geez man region-related for sure
[00:54] yeah
[00:55] blackboxsw: that's super interesting
[00:55] w.r.t the region, I suspect it's load
[00:56] could be. trying to see if I can get a successful boot so I can add my debug deb on followup reboots
[00:57] ah, yeah, you have to get one good boot to set the root password
[00:58] yeah or boot from my previous image snapshot. I'll try that
[00:59] so... we can reproduce fairly easily?
[01:00] looks like europe-west1-b or europe-west1-d regions
[01:00] smoser: yeah
[01:02] rharper: if nothing was running trigger....
[01:02] er.. if nothing was running the trigger service, then what was doing the cold plug?
[01:02] not the trigger service
[01:02] that's always running
[01:02] *something* has to be doing it. or we wouldn't have .link files respected at all
[01:02] it's the settle service
[01:03] so any possible 'udevadm settle' from anywhere would make it work
[01:03] yeah
[01:03] in xenial, the networking.service unit runs a pre command with udevadm settle in it
[01:04] artful could show it, if things were fast enough; and even in bionic, it has to be in this one region where things run slightly odd
[01:05] ok, I need to step away
[01:08] blackboxsw: logs of launch-softlayer need combining with launch-ec2 too.
[01:15] blackboxsw: this got missed.
[01:15] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/342334
[01:15] not terribly important
[01:16] and then this one needs landing too
[01:16] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344189
[01:32] smoser: landed https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344189
[01:41] blackboxsw: thanks
=== mgerdts_ is now known as mgerdts
[14:02] blackboxsw: smoser: if we wanted to be more targeted with the settle, we could, for example, trigger it within cloud-init-local if we detect non-renamed interfaces with knames (and ifnames=0 not in cmdline); that would then only impact systems which happen to have that early race between cloud-init-local and udev-trigger
[14:06] right. and that would be in some ways safer
[14:06] from the perspective of not changing boot
[14:06] i'd like to have slangasek's or xnox's thoughts
[14:06] yes
[14:06] as i am apt to agree with you, that not having the settle service active in boot is ... well just wrong.
[14:07] there is a swarm of "why is my boot slow / systemd-analyze blame shows udev-settle.service"
[14:07] oh?
[14:07] yes
[14:07] so we did it as an optimization :)
[14:07] but it's because they have things like usb nics or other storage devices that take *time* to come up
[14:07] no
[14:07] I don't think so
[14:07] (it was a joke)
[14:07] it's not clear to me why it's not enabled by default
[14:07] yet
[14:07] i can make a system boot REALLY REALLY FAST
[14:07] and sometimes even do what you want!
[14:07] but, lvm2 has a generator which forces it on, if lvm2 is needed in some situations
[14:07] and zfs of course, Requires it
[14:08] since they need all of their devices up before they can mount or build a raid, etc
[14:08] so, it *really* seems like it should just always be on
[14:08] one ends up "Waiting" for rootfs anyhow
[14:08] i agree. should we request slangasek and xnox review of your MP?
[14:08] we've seen those "waiting for device ... foo to appear"
[14:08] smoser: or possibly add a systemd task and ask in the GCE bug
[14:08] rharper: well, in my fast boot, sometimes / isn't there, but it boots really fast.
[14:08] but I would like foundation review
[14:09] smoser: lol!
[14:09] I get (initramfs) prompt *so* fast those times
[14:09] exactly.
[14:09] and systemd-analyze does not blame udev!
[14:09] I usually take the extra savings and then compile my own kernel, kexec into it to find my root
[14:11] blackboxsw: interesting observation w.r.t zone and image;
[14:11] I wonder if we can further dissect what's special about the 420 image in europe-west1 vs. current stuff
[14:12] nonetheless; it does make sense to do something to detect if we've raced and try to fix that in the case we do
[14:12] I'm going to see if I can target the settle within cloud-init-local on the reproducer
[14:12] i compile my kernels with -O4 and funroll-loops. it's the best.
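The targeted idea from 14:02 (only settle when cloud-init-local still sees kernel-assigned names such as `eth0` and the command line does not ask to keep them) can be sketched as a pure check. This is a minimal illustration, not the code from any merge proposal in this log: the function names are hypothetical, only the `ethN` prefix is considered, and it assumes the "ifnames=0" mentioned above refers to the systemd kernel parameter `net.ifnames=0`.

```python
import os
import re

# Hypothetical heuristic: kernel-enumerated Ethernet names look like eth0, eth1...
# Assumption: only the "eth" prefix is checked; other kernel prefixes are ignored.
KERNEL_NAME_RE = re.compile(r"^eth\d+$")


def predictable_names_disabled(cmdline):
    """True if the kernel command line keeps kernel names (net.ifnames=0)."""
    return "net.ifnames=0" in cmdline.split()


def find_unstable_names(ifaces, cmdline):
    """Return kernel-named interfaces that udev has presumably not renamed yet.

    With net.ifnames=0, "eth0" is the final name, so nothing is unstable.
    """
    if predictable_names_disabled(cmdline):
        return []
    return sorted(name for name in ifaces if KERNEL_NAME_RE.match(name))


def read_system_state():
    """Read current interface names and the kernel cmdline (Linux only)."""
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    return os.listdir("/sys/class/net"), cmdline
```

On a live system one would pass the result of `read_system_state()` into `find_unstable_names()`. Matching on name shape is only a proxy; the `name_assign_type` sysfs attribute mentioned at 00:47 reports how the kernel assigned a name and could be a more direct signal.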
[14:13] "it does make sense to do something to detect"
[14:13] maybe
[14:14] it only makes so much sense to determine when a system is broken... why didn't we just fix the system?
[14:14] that's fair; for now I'm mostly interested in whether we can detect it;
[14:15] whether we target a more narrow fix so as to not "udevadm settle" the world; aka smoser's favorite alias to 'sleep 1'
[14:15] needs more discussion
[14:19] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344189
[14:19] bah. bad link
[14:20] https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344255
[14:20] that one.
[14:20] that is softlayer doc improvement.
[14:20] rharper, blackboxsw, dpb1 ^
[14:20] yeah, saw that
[14:20] 6 ways
[14:21] from sunday
[14:23] hehe
[15:45] 2018-04-25 15:44:49,632 - __init__.py[DEBUG]: WARK: found unstable device names: ['eth0']; calling udevadm settle
[15:45] 2018-04-25 15:44:49,968 - util.py[DEBUG]: WARK: Waiting for udev events to settle took 0.336 seconds
[15:46] smoser: we can detect, and "resolve" it more narrowly
[15:46] if we want
[15:46] I'll put up an alternative patch with this change
[15:50] link?
[15:50] philroche: https://launchpad.net/~smoser/+archive/ubuntu/ibmcloud-test
[15:50] should be populated shortly with a test.
[16:25] smoser: reviewed https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344255
=== r-daneel_ is now known as r-daneel
[17:16] smoser: blackboxsw: this is an alternative, more targeted settle, https://code.launchpad.net/~raharper/cloud-init/+git/cloud-init/+merge/344339
[18:19] blackboxsw: I think I pinged you about my updated merge request, but I don't think I received a response, I might've restarted the chat, anyways here it is again: https://code.launchpad.net/~jocha/cloud-init/+git/cloud-init/+merge/344192 :)
[18:19] ahh thanks jocha I'll give it a looksie today
[18:22] awesome thanks!
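The 15:45 debug output shows the shape of the narrower fix: log the unstable names, run `udevadm settle`, and report how long the wait took. A minimal sketch of that step follows, assuming a list of unstable names like the one produced by a detection check. `settle_if_unstable` is a hypothetical name, and the `runner` parameter is an illustrative choice that lets the logic be exercised without udev; this is not the code in the 344339 merge proposal.

```python
import logging
import subprocess
import time

LOG = logging.getLogger(__name__)


def settle_if_unstable(unstable, runner=subprocess.run):
    """Run 'udevadm settle' only when unstable interface names were found.

    Returns the seconds spent waiting, or None if the settle was skipped.
    `runner` defaults to subprocess.run and is injectable for testing.
    """
    if not unstable:
        # Nothing raced with udev renaming, so avoid settling "the world".
        return None
    LOG.debug("found unstable device names: %s; calling udevadm settle",
              unstable)
    start = time.monotonic()
    runner(["udevadm", "settle"], check=True)
    elapsed = time.monotonic() - start
    LOG.debug("Waiting for udev events to settle took %.3f seconds", elapsed)
    return elapsed
```

Skipping the call when the list is empty is what keeps this narrower than enabling systemd-udev-settle.service for every boot: systems that never hit the race pay no extra wait.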
[19:55] blackboxsw: i responded to your review of https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/344255.
[19:56] really just wanting to know if you think i cleared things up
[19:57] smoser: yes cleared. land at will, or I can
[19:57] ok. i'll land
[19:57] I'm camping in cloud-init hangout trying to get my IBMcloud setup up
[19:57] now that I'm approved
[19:57] but can't seem to find/create my API creds
[22:30] got it. and updating launch-softlayer script