[01:34] <smoser> blackboxsw: https://jenkins.ubuntu.com/server/view/cloud-init,%20curtin,%20streams/job/cloud-init-integration-ec2-a/25/console
[01:34] <smoser> thoughts ?
[01:36] <smoser> powersj: https://jenkins.ubuntu.com/server/view/cloud-init,%20curtin,%20streams/job/cloud-init-integration-ec2-x/26/console ? do you have any idea on that one?
[01:37] <smoser> Waiter InstanceRunning failed: Waiter encountered a terminal failure state
[01:38] <smoser> 17 seconds after it launched the instance it encountered a terminal failure state.
[01:39] <powersj> so it was doing the deb install when it encountered the error
[01:39] <powersj>     self.instance.wait_until_running()
[01:39] <smoser> well, probably not
[01:40] <smoser> not 17 seconds after launch of the instance
[01:40] <powersj> so the instance was booting and we were waiting
[01:40] <smoser> yeah
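For context, the failing call is boto3's instance_running waiter; a waiter reports a "terminal failure state" when the polled instance lands in a failure acceptor state (such as shutting-down or terminated) rather than simply timing out. A minimal sketch of the equivalent call, with a placeholder region and instance id:

    import boto3
    from botocore.exceptions import WaiterError

    # Placeholder region and instance id, not values from the failed run.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    waiter = ec2.get_waiter("instance_running")
    try:
        # Polls DescribeInstances (15s delay, 40 attempts by default); it
        # fails immediately if the instance reaches 'shutting-down' or
        # 'terminated' -- the "terminal failure state" in the log above.
        waiter.wait(InstanceIds=["i-0123456789abcdef0"])
    except WaiterError as err:
        print("Waiter InstanceRunning failed:", err)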
[02:36] <smoser> ok. playing a bit more, http://paste.ubuntu.com/p/sFJbTP5Brn/
[02:36] <smoser> that shows the "gain" of easily decorating the class.
[02:37] <smoser> and explains skip_by_date decorator some.
[02:37] <smoser> hope to have an MP for that tomorrow.
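The paste has since expired; a hypothetical sketch of what a skip_by_date test decorator could look like (names and behavior assumed, not necessarily cloud-init's actual implementation):

    import datetime
    import functools
    import unittest

    def skip_by_date(date_str, bug):
        """Hypothetical sketch: skip a known-failing test until date_str
        (YYYY-MM-DD), after which the test runs again, so the skip cannot
        quietly live forever."""
        deadline = datetime.datetime.strptime(date_str, "%Y-%m-%d").date()

        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if datetime.date.today() <= deadline:
                    raise unittest.SkipTest(
                        "skipped until %s for bug %s" % (deadline, bug))
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # usage (hypothetical): @skip_by_date("2018-08-01", bug="lp:1234567")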
[02:37] <blackboxsw> nice, didn't realize you were still working.
[02:37] <smoser> that was just still in my head
[02:37] <smoser> integration test failures suck
[02:38] <smoser> with that... i do need to go afk.
[02:38] <smoser> have a nice night all.
[02:39] <blackboxsw> I've started adding descriptions to run/failed jenkins jobs that have hit the keyserver errors.
[02:39] <smoser> hm.. the name won't be right...
[02:39] <smoser> :-(
[02:39] <smoser> ok. later.
[02:39] <blackboxsw> later. I'm kicking another integration run and will see what gives on that ec2 wait traceback
[02:41] <blackboxsw> powersj: could that terminal failure state be a result of maybe kicking off two ec2 integration tests simultaneously? like a cleanup job on one wiped all live instances etc?
[02:41] <blackboxsw> part of the teardown based on keys or something
[02:43] <powersj> that's what I thought
[02:43] <powersj> I didn't see that instance listed though
[02:44] <powersj> is it still occurring?
[04:04] <blackboxsw> all green now powersj
[04:04] <blackboxsw> just finished. some intermittent error.
[04:04] <blackboxsw> smoser: for tomorrow, looks like we hit all green on existing integration patchsets.
[04:04] <blackboxsw> I'm outta here
[16:13] <blackboxsw> rharper: on this azure instance with the renamed cirename0, it looks like netplan wasn't waiting on cirename0 per journal
[16:13] <blackboxsw> systemd-networkd-wait-online[743]: ignoring: cirename0
[16:20] <blackboxsw> hrm
[16:20] <blackboxsw> systemd-networkd[722]: eth0: Interface name change detected, eth0 has been renamed to rename3.
[16:20] <blackboxsw> systemd-networkd[722]: rename3: Interface name change detected, rename3 has been renamed to eth0.
[16:20] <blackboxsw> seems like a bit of thrashing in kernel renames of eth0
[16:35] <smoser> the rename3 hops are udev
[16:35] <smoser> persistent-net rules, i think
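For reference, the renameN names typically appear while udev is renaming an interface whose target name is momentarily taken; the classic persistent-net rule format pins a name by MAC. An illustrative rule (placeholder MAC, not from this instance):

    # /etc/udev/rules.d/70-persistent-net.rules (illustrative)
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0d:3a:xx:xx:xx", NAME="eth0"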
[17:02] <cyphermox> blackboxsw: /etc/cloud/cloud.cfg.d
[17:03] <cyphermox> or whatever the name of the directory is -- if you deployed the system with maas, cloud-init renames things at boot too.
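For reference, the documented drop-in in that directory that stops cloud-init from rendering (and thus renaming) network config entirely is a one-line yaml file:

    # /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
    network: {config: disabled}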
[17:04] <blackboxsw> cyphermox: right, there was a race w/ cloud-init trying a rename on this one azure instance but failing because a 2nd nic came up as eth0 in the meantime. I was trying to peek at the other players renaming interfaces at the same time. I just turned on systemd-networkd debug to check out what happens on next boot
[17:05] <blackboxsw> end result was that the azure instance was left with a nic named 'cirename0' (which was cloud-init's doing)
[17:06] <blackboxsw> yeah even after reboot, systemd-networkd-wait-online.service is camping out for 2 minutes
[17:06] <blackboxsw> checking the debug journal now
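One common way to turn on that networkd debug logging, assuming a systemd drop-in (blackboxsw may have used a different mechanism):

    # Raise systemd-networkd's log level via a drop-in, then restart and
    # read the debug output from the journal.
    mkdir -p /etc/systemd/system/systemd-networkd.service.d
    printf '[Service]\nEnvironment=SYSTEMD_LOG_LEVEL=debug\n' \
        > /etc/systemd/system/systemd-networkd.service.d/10-debug.conf
    systemctl daemon-reload
    systemctl restart systemd-networkd
    journalctl -b -u systemd-networkd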
[17:07] <blackboxsw> just looking through this now https://github.com/systemd/systemd/issues/7143
[17:07] <blackboxsw> and this journal http://paste.ubuntu.com/p/W4CT8g4Tyq/
[17:09] <blackboxsw> and looking specifically at wait-online-service http://paste.ubuntu.com/p/RMcBf6YYjN/
[17:13] <blackboxsw> so it certainly isn't waiting on cirename0, just on eth0, which seems to be out to lunch for some reason. I'll poke to see if I can determine what networkd thinks eth0 actually is (like mac address etc).
[17:13] <blackboxsw> cat /run/systemd/network/10-netplan-eth0.*
[17:13] <blackboxsw> [Match]
[17:13] <blackboxsw> MACAddress=00:0d:3a:91:bc:49
[17:13] <blackboxsw> [Link]
[17:13] <blackboxsw> Name=eth0
[17:13] <blackboxsw> WakeOnLan=off
[17:13] <blackboxsw> [Match]
[17:13] <blackboxsw> MACAddress=00:0d:3a:91:bc:49
[17:13] <blackboxsw> Name=eth0
[17:13] <blackboxsw> [Network]
[17:13] <blackboxsw> DHCP=ipv4
[17:13] <blackboxsw> [DHCP]
[17:13] <blackboxsw> UseMTU=true
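A quick way to ask networkd directly what it currently thinks eth0 is (MAC, driver, operational state), rather than inferring it from the rendered files:

    networkctl list
    networkctl status eth0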
[17:16] <cyphermox> blackboxsw: sorry, I can't really keep track here; but two things jumped out in the netplan yaml you pointed out earlier
[17:17] <cyphermox> I guess just one thing
[17:17] <cyphermox> https://launchpad.net/ubuntu/+source/netplan.io/0.38
[17:17] <blackboxsw> no worries, I'm spamming the channel anyway (not really blocked yet, but curious why networkd-wait-online would actually be blocking for so long in this situation, as cirename0 matches our optional: true netplan yaml case)
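An illustrative netplan stanza for that "optional" case (device details assumed); optional: true tells systemd-networkd-wait-online not to block the boot on that interface:

    network:
      version: 2
      ethernets:
        cirename0:
          dhcp4: true
          optional: true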
[17:17] <cyphermox> that 0.38 upload is a definite SRU candidate, but it's currently blocked in cosmic due to haskell.
[17:17] <blackboxsw> checking now
[17:18] <cyphermox> it should fix renames in general
[17:18] <cyphermox> or at least greatly improve the behavior.
[17:19] <blackboxsw> good to know about general netplan ip leases... though it looks like it currently tracebacks on all interfaces (even the ones that should be managed)
[17:20] <blackboxsw> cyphermox: but maybe I'm getting that "netplan ip leases" traceback because wait-online-service timed out and networkd never persisted the lease information for the managed interface eth0
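The command under discussion; it prints the DHCP lease networkd persisted for an interface, so it could plausibly traceback when no lease was ever written:

    sudo netplan ip leases eth0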
[17:21] <blackboxsw> I'll watch that 0.38 release eagerly, thanks
[17:21] <blackboxsw> might try it out on my broken bionic instance now to see what gives
[17:22] <blackboxsw> cyphermox: is  ppa:cyphermox/netplan.io a good ppa to try 0.38?
[17:24] <cyphermox> no, it's not up to date
[17:33]  * blackboxsw can just play w/ cosmic to see behavior differences there
[17:46] <blackboxsw> hah think I got it rharper
[17:46] <blackboxsw> it's only this corner case on azure:
[17:47] <blackboxsw> mac1 of original nic1 is rendered to /etc/netplan/50-cloud-init.yaml by us.
[17:47] <blackboxsw> nic1 detached from instance and nic2 (new-mac) attached to instance
[17:48] <blackboxsw> 90-azure-hotplug.yaml matches with our hotpluggedeth0 rule per https://paste.ubuntu.com/p/gZBWH5GKmg/
[17:49] <blackboxsw> no problem on that boot either, but then orig nic1 gets re-attached (the kernel labels it eth1 now on this system because nic2 is already eth0).
[17:57] <blackboxsw> that nic1 mac1 matches the original 50-cloud-init.yaml stanza, which also performs a set-name to eth0, and I think that collides with the existing nic2 (eth0) name on the instance as it boots
[17:58] <blackboxsw> which results in our leaving one nic renamed as cirename0
[17:58] <blackboxsw> anyway, will try to reproduce the issue on my own instance now
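An illustrative sketch of the colliding 50-cloud-init.yaml stanza (placeholder MAC): when nic1 re-attaches and the kernel brings it up as eth1, this match fires and set-name tries to claim eth0, which nic2 already holds:

    # /etc/netplan/50-cloud-init.yaml (illustrative, placeholder MAC)
    network:
      version: 2
      ethernets:
        eth0:
          match:
            macaddress: "00:0d:3a:xx:xx:xx"
          set-name: eth0
          dhcp4: true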
[18:02] <rharper> blackboxsw: hrm
[18:03] <rharper> blackboxsw: that's all fine but it's not clear to me yet why we block the boot;  unless you're suggesting that networkd thinks it has two different nics (eth0 and eth1) both with the same mac value so it matches ?
[18:03] <rharper> that sounds like a networkd bug w.r.t what it can "manage"
[18:37] <blackboxsw> rharper: from debug logs, it looks like it's IPv6 router solicitation, I think
[18:37] <blackboxsw> Jun 22 18:01:18 bionic-hotplug-test systemd-networkd[744]: NDISC: Sent Router Solicitation, next solicitation in 1min 12s
[18:38] <blackboxsw> keeps retrying over and over
[18:38] <blackboxsw> checking your bug https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1765173 to see if related
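While debugging, one way to keep the solicitation retries from holding the boot hostage is to cap wait-online with a drop-in override; a mitigation sketch, not a fix for the underlying RA behavior:

    mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
    printf '[Service]\nExecStart=\nExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=30\n' \
        > /etc/systemd/system/systemd-networkd-wait-online.service.d/10-timeout.conf
    systemctl daemon-reload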
[18:52] <rharper> blackboxsw: whoa; that's not right
[18:52] <blackboxsw> ok so we are on the proper systemd which should proceed without blocking on RA, systemd 237-3ubuntu10
[18:53] <rharper> oh, is the image down level ?
[18:53] <rharper> this was released before 18.04 GAed
[18:53] <rharper> blackboxsw: what level is systemd in your image ?
[18:54]  * rharper tests bionic daily to see if it's regressed
[18:54] <blackboxsw> systemd	237-3ubuntu10 which matches my recent bionic lxc
[18:55] <rharper>  3.036s systemd-networkd-wait-online.service
[18:55] <rharper> that looks like a regression
[18:55] <rharper> mother
[18:55] <rharper> wow, it's the same systemd =(
[18:55]  * rharper adds debugging 
[18:56] <blackboxsw> yeah it's as if the mentioned fix didn't work or get in.
[18:56] <rharper> it went in
[18:56] <rharper> I verified
[18:56] <rharper> so somethings different
[18:56] <rharper> but I'm not sure what at this point
[18:57] <blackboxsw> I'm referring to this comment https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1765173/comments/10
[18:58] <blackboxsw> I'll need another pair of eyes/brain on this to correct some incorrect assumptions I have (and get a better education on this) rharper
[18:58] <rharper> sure
[18:58] <rharper> lemme poke at my lxd container then we can look at your instance
[18:58] <blackboxsw> I'm in hangout/meet for my education by fire
[18:59] <rharper> [2112600.640416] b2 systemd-networkd[149]: eth0: Gained IPv6LL
[18:59] <rharper> [2112603.521339] b2 systemd-networkd[149]: eth0: DHCPv4 address 10.8.107.145/24 via 10.8.107.1
[18:59] <rharper> there's my 3 seconds
[18:59] <rharper> my dnsmasq is just slow
[19:01] <rharper> [9271481.644265] rharper-b2 systemd[1]: Starting Wait for Network to be Configured...
[19:01] <rharper> [9271484.163932] rharper-b2 systemd-networkd[151]: eth0: DHCPv4 address 10.109.225.14/24 via 10.109.225.1
[19:01] <rharper> 2.5 on diglett
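Both timing figures in this exchange (3.036s above, 2.5s on diglett) read like systemd-analyze blame output; a quick way to pull just that line on any machine:

    systemd-analyze blame | grep networkd-wait-online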