/srv/irclogs.ubuntu.com/2021/01/29/#cloud-init.txt

=== jmcgnh_ is now known as jmcgnh
=== jmcgnh is now known as jmcgnh_
=== jmcgnh_ is now known as jmcgnh
=== hjensas|afk is now known as hjensas
[08:30] <march0> Hello, I have a race issue between Ubuntu 20.04.1 LTS cloud-init & unattended-upgrades. Basically, my cloud-init user data is a MIME multipart of several YAML files with repo_update: true & a package list, and sometimes unattended-upgrades decides to kick in and gets the /var/lib/dpkg/lock-frontend.
[08:30] <march0> and ofc, cloud-init fails
[08:51] <march0> is this a known behavior? or should I file a bug? So far I didn't find any existing bug or related documentation
[09:14] <tribaal> march0: this sounds similar to https://bugs.launchpad.net/cloud-init/+bug/1827204
[09:14] <ubot5> Ubuntu bug 1827204 in cloud-init "Doesn't run unattended-upgrades on first boot by default" [High,Triaged]
[09:27] <march0> it's similar, but in my case, I would expect cloud-init to detect unattended-upgrades (which seems to be started randomly, as the issue is not easily reproducible) and wait for the dpkg lock to be released
[09:28] <march0> because runcmd depends on the packages statement, so the provisioning fails completely
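The "wait for the dpkg lock to be released" behaviour march0 describes can be sketched as a polling loop. This is only an illustration, not what cloud-init does: it assumes the lock is a POSIX fcntl lock on the file, and the names `try_lock` and `wait_for_lock` and the timeout values are hypothetical. It is also racy by design, since the probe releases the lock again and another process may take it before the package manager starts.

```python
import errno
import fcntl
import os
import time


def try_lock(path):
    """Return True if an exclusive fcntl lock on `path` can be taken right now."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        # EACCES/EAGAIN mean another process currently holds the lock.
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise
    else:
        # We only probe; give the lock back immediately.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def wait_for_lock(path, timeout=300.0, interval=1.0, probe=try_lock):
    """Poll until `probe(path)` succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run something like `wait_for_lock("/var/lib/dpkg/lock-frontend")` before installing packages and treat a `False` result as a provisioning failure.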
[14:51] <Odd_Bloke> march0: Hmm, I'm honestly surprised we don't already have a bug filed for this, but I can't find one. A bug report (via https://bugs.launchpad.net/cloud-init/+filebug) would be great!
[16:20] <kryl> hi
[17:00] <powersj> question: does vendor data take priority over written cloud-config in /etc/cloud/cloud.cfg.d?
[17:00] <powersj> i.e. I modify an image to create a user in /etc/cloud/cloud.cfg.d, but then vendor data comes in with its own cloud-config and defines its own users
[17:15] <march0> Odd_Bloke, this is very similar to https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image/474024 & https://github.com/systemd/systemd/issues/5659, but it completely breaks package install on fresh Ubuntu at a 7% rate (estimated)
[17:15] <march0> I usually start around 30 EC2 instances to get 1 or 2 failures
[17:20] <Odd_Bloke> march0: Right, which is why I'm surprised we don't have a bug already. :) Both the apt-daily services have a randomised wait time from boot, so it makes sense that you'd only see a proportion fail (those where $random_wait is less than the time your config takes to apply).
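The "only a proportion fail" reasoning can be put into numbers with a small sketch. It assumes the random wait is uniformly distributed between zero and some maximum, which the log does not state; the function names and every number passed in are placeholders rather than measured values.

```python
def failure_probability(config_seconds, max_random_wait_seconds):
    """Chance that a uniform random wait in [0, max] ends before config finishes."""
    if max_random_wait_seconds <= 0:
        raise ValueError("max_random_wait_seconds must be positive")
    fraction = config_seconds / max_random_wait_seconds
    # Clamp to a valid probability.
    return min(1.0, max(0.0, fraction))


def expected_failures(instances, probability):
    """Expected number of affected instances in a batch."""
    return instances * probability
```

Under this assumption, a batch of 30 instances with a per-instance probability of 0.05 gives `expected_failures(30, 0.05)`, i.e. 1.5 failures on average, which is in the same range as the 1 or 2 failures per 30 instances reported above.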
[17:21] <Odd_Bloke> We do have Before=apt-daily.service, I wonder if we're missing Before=apt-daily-upgrade.service?
[17:21] <Odd_Bloke> Hmm, though apt-daily-upgrade.service does have After=apt-daily.service (on my groovy machine, at least).
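The question of whether Before=apt-daily.service already implies an ordering against apt-daily-upgrade.service can be modelled as reachability in a small graph. This is a simplified model, not systemd's actual job logic: systemd only applies ordering between units that are in the same transaction, so the chain only holds when the middle unit is being started too. The unit name `cloud-init-unit` is a placeholder because the log does not say which cloud-init unit carries the Before= line.

```python
from collections import defaultdict, deque


def build_order(before=(), after=()):
    """Build a 'starts earlier -> starts later' graph from Before=/After= pairs."""
    graph = defaultdict(set)
    for unit, later in before:    # unit has Before=later
        graph[unit].add(later)
    for unit, earlier in after:   # unit has After=earlier
        graph[earlier].add(unit)
    return graph


def ordered_before(graph, first, second):
    """True if `first` is (transitively) ordered before `second`."""
    seen = {first}
    queue = deque([first])
    while queue:
        unit = queue.popleft()
        for later in graph.get(unit, ()):
            if later == second:
                return True
            if later not in seen:
                seen.add(later)
                queue.append(later)
    return False


graph = build_order(
    before=[("cloud-init-unit", "apt-daily.service")],
    after=[("apt-daily-upgrade.service", "apt-daily.service")],
)
```

Here `ordered_before(graph, "cloud-init-unit", "apt-daily-upgrade.service")` is true in the model, but only through apt-daily.service, which is why an explicit Before=apt-daily-upgrade.service could still matter.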
[17:25] <march0> I've been trying some workarounds with bootcmd but no luck so far. It's quite hard to wait for or stop unattended-upgrades, even in cloud-init's early stage
[17:44] <Odd_Bloke> march0: So I wouldn't expect you to need to work around this issue, we've already fixed it once: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1693361. If you file a bug and attach the output of `cloud-init collect-logs` on an affected system, we can dig into what's happening on your system, and figure out the root cause. :)
[17:44] <ubot5> Ubuntu bug 1693361 in cloud-init (Ubuntu Artful) "cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution" [Medium,Fix released]
[17:45] <Odd_Bloke> (I did not search for apt-daily previously; I _knew_ I'd seen a bug for this.)
[20:30] <Odd_Bloke> falcojr: So you're certainly not wrong that SSHing into LXDs is much slower than exec'ing, but a large part of that is due to this one sleep: https://github.com/canonical/pycloudlib/blob/1ac9d4c82fdfd5cb1407f70b8a2b17e02953569d/pycloudlib/lxd/instance.py#L85
[20:31] <Odd_Bloke> Unless LXD is incredibly fast, we'll miss finding an IP first time around, so this basically guarantees a 20s sleep before attempting to SSH into LXDs.
[20:31] <falcojr> sleep 20??? That's...a bit much
[20:33] <Odd_Bloke> Yeah, if I drop that to a 1, then it's more like 6s to run the first command via SSH.
[20:36] <falcojr> also, the original "_wait_for_cloudinit" that was probably still around when they reported long wait times used increasing sleep times
[20:37] <falcojr> there's also an SSH connect sleep that waits 10 seconds before trying again
[20:39] <rharper> Could you exec into the instance to find out if networking is up?
[20:39] <Odd_Bloke> Yeah, I tried that one first and it had no effect: the 20s sleep happens before it, so by the time we're trying to SSH connect, SSH has been up for (20 - 6)s.
[20:41] <Odd_Bloke> rharper: The context here is we're trying to decide what the appropriate default access method for LXD is: our existing understanding is that SSHing was much slower, so we were leaning towards `exec` for pragmatic reasons (we run these tests all the time, so saving 15s per test run will add up very fast). However, I then noticed that we were, suspiciously, taking 21-22s to SSH in every time, so went
[20:41] <Odd_Bloke> digging.
[20:42] <rharper> huh
[20:44] <Odd_Bloke> I suspect some of these timeout values are more sensible in the context of LXD VMs, but they're applied to containers too.
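The effect of the retry interval discussed in this thread can be shown with a generic polling helper. This is not pycloudlib code: `poll_until`, `fixed`, `increasing` and `make_check` are hypothetical names, the address is a placeholder, and the fake check simply reports an IP on its second call to mimic "we'll miss finding an IP first time around".

```python
import time


def fixed(interval, attempts):
    """The same delay between every attempt."""
    return [interval] * attempts


def increasing(start, factor, attempts):
    """Delays that grow geometrically, e.g. 1, 2, 4, 8."""
    return [start * factor ** i for i in range(attempts)]


def poll_until(check, delays, sleep=time.sleep):
    """Call check() before each delay; return (result, seconds slept)."""
    slept = 0.0
    for delay in delays:
        result = check()
        if result:
            return result, slept
        sleep(delay)
        slept += delay
    # One last attempt after the final delay.
    return (check() or None), slept


def make_check(ready_on_call, value="10.0.0.2"):
    """Fake IP lookup that only succeeds from the Nth call onwards."""
    calls = {"n": 0}

    def check():
        calls["n"] += 1
        return value if calls["n"] >= ready_on_call else None

    return check
```

With a check that succeeds on the second call, `poll_until(make_check(2), fixed(20, 3), sleep=lambda d: None)` accounts for 20 seconds of sleeping while `fixed(1, 3)` accounts for 1 second, which matches the observation that one missed first attempt costs the whole interval. An `increasing(1, 2, 4)` schedule keeps early retries cheap while still backing off for slower targets such as VMs.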
