[08:30] <march0> Hello, I have a race issue on Ubuntu 20.04.1 LTS between cloud-init & unattended-upgrades. Basically, my cloud-init user data is a MIME multipart of several YAML parts with repo_update: true & some package lists, and sometimes unattended-upgrades decides to kick in and grabs /var/lib/dpkg/lock-frontend.
[08:30] <march0> and ofc, cloudinit fails
[08:51] <march0> is this a known behavior? or should I file a bug? So far I didn't find any existing bug or related documentation
[09:14] <tribaal> march0: this sounds similar to https://bugs.launchpad.net/cloud-init/+bug/1827204
[09:27] <march0> it's similar, but in my case I would expect cloud-init to detect unattended-upgrades (which seems to be started randomly, as the issue is not easily reproducible) and wait for the dpkg lock to be released
[09:28] <march0> because runcmd depends on the packages statement, so the provisioning fails completely
[14:51] <Odd_Bloke> march0: Hmm, I'm honestly surprised we don't already have a bug filed for this, but I can't find one.  A bug report (via https://bugs.launchpad.net/cloud-init/+filebug) would be great!
[16:20] <kryl> hi
[17:00] <powersj> question: does vendor data take priority over written cloud-config in /etc/cloud/cloud.cfg.d?
[17:00] <powersj> i.e. I modify an image to create a user in /etc/cloud/cloud.cfg.d, but then vendor data comes in with its own cloud-config and defines its own users
[17:15] <march0> Odd_Bloke, this is very similar to https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image/474024 & https://github.com/systemd/systemd/issues/5659, but it completely breaks package install on fresh Ubuntu at a ~7% rate (estimated)
[17:15] <march0> I usually start around 30 EC2 instances to get 1 or 2 failures
[17:20] <Odd_Bloke> march0: Right, which is why I'm surprised we don't have a bug already. :)  Both the apt-daily services have a randomised wait time from boot, so it makes sense that you'd only see a proportion fail (those where $random_wait is less than the time your config takes to apply).
[17:21] <Odd_Bloke> We do have Before=apt-daily.service, I wonder if we're missing Before=apt-daily-upgrade.service?
[17:21] <Odd_Bloke> Hmm, though apt-daily-upgrade.service does have After=apt-daily.service (on my groovy machine, at least).
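If the missing ordering really is the cause, one way to express it would be a systemd drop-in. This is only a sketch of the idea being discussed: the drop-in path and the choice of cloud-final.service as the unit to order are assumptions, not something confirmed in the conversation.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/cloud-final.service.d/apt-ordering.conf
# Orders cloud-init's final stage before BOTH apt-daily units, covering the
# case where only Before=apt-daily.service exists today.
[Unit]
Before=apt-daily.service apt-daily-upgrade.service
```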
[17:25] <march0> I've been trying some workarounds with bootcmd, but no luck so far. It's quite hard to wait for or stop unattended-upgrades, even in cloud-init's early stage
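A bootcmd-style workaround along these lines is sometimes attempted: poll until nothing holds the dpkg lock, with a bounded timeout so boot can't hang forever. This is a sketch of the general technique, not cloud-init's own logic; the lock path and timeout are assumptions, and `fuser` must be available (it ships in psmisc on Ubuntu).

```shell
#!/bin/sh
# Hypothetical helper: wait until no process holds the given dpkg lock.
# Returns 0 once the lock is free, 1 if it is still held after $2 tries.
wait_for_lock() {
    lock="$1"
    tries="${2:-60}"   # default: give up after ~60s rather than hang boot
    while fuser "$lock" >/dev/null 2>&1; do
        tries=$((tries - 1))
        [ "$tries" -le 0 ] && return 1
        sleep 1
    done
    return 0
}
```

Usage would be something like `wait_for_lock /var/lib/dpkg/lock-frontend 120 || echo "still locked"` in a bootcmd entry, though as noted below the real fix belongs in cloud-init's unit ordering, not in user data.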
[17:44] <Odd_Bloke> march0: So I wouldn't expect you to need to work around this issue, we've already fixed it once: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1693361.  If you file a bug and attach the output of `cloud-init collect-logs` on an affected system, we can dig into what's happening on your system, and figure out the root cause. :)
[17:45] <Odd_Bloke> (I did not search for apt-daily previously; I _knew_ I'd seen a bug for this.)
[20:30] <Odd_Bloke> falcojr: So you're certainly not wrong that SSHing into LXDs is much slower than exec'ing, but a large part of that is due to this one sleep: https://github.com/canonical/pycloudlib/blob/1ac9d4c82fdfd5cb1407f70b8a2b17e02953569d/pycloudlib/lxd/instance.py#L85
[20:31] <Odd_Bloke> Unless LXD is incredibly fast, we'll miss finding an IP the first time around, so this basically guarantees a 20s sleep before attempting to SSH into LXDs.
[20:31] <falcojr> sleep 20??? That's...a bit much
[20:33] <Odd_Bloke> Yeah, if I drop that to a 1, then it's more like 6s to run the first command via SSH.
[20:36] <falcojr> also, the original "_wait_for_cloudinit" that was probably still around when they reported long wait times used increasing sleep times
[20:37] <falcojr> there's also a ssh connect sleep that waits 10 seconds before trying again
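The alternative being discussed, polling frequently with a short interval instead of one long fixed sleep, can be sketched like this. `poll_until` is a hypothetical stand-in, not pycloudlib's real API; the command it polls (an IP lookup, an SSH probe) is whatever the caller supplies.

```shell
#!/bin/sh
# Hypothetical polling helper: retry a command with a short interval
# instead of a single fixed 20s sleep.  Returns 0 as soon as the command
# succeeds, 1 after $3 failed attempts.
poll_until() {
    cmd="$1"
    interval="${2:-1}"   # seconds between attempts
    max="${3:-30}"       # maximum number of attempts
    n=0
    until sh -c "$cmd"; do
        n=$((n + 1))
        [ "$n" -ge "$max" ] && return 1
        sleep "$interval"
    done
    return 0
}
```

With a 1s interval, the worst case for a fast container is one extra second rather than twenty, which matches the ~6s first-command time Odd_Bloke reports after dropping the sleep to 1.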
[20:39] <rharper> Could you exec into the instance to find out if networking is up?
[20:39] <Odd_Bloke> Yeah, I tried that one first and it had no effect: the 20s sleep happens before it, so by the time we're trying to SSH connect, SSH has been up for (20 - 6)s.
[20:41] <Odd_Bloke> rharper: The context here is we're trying to decide what the appropriate default access method for LXD is: our existing understanding is that SSHing was much slower, so we were leaning towards `exec` for pragmatic reasons (we run these tests all the time, so saving 15s per test run will add up very fast).  However, I then noticed that we were, suspiciously, taking 21-22s to SSH in every time, so went digging.
[20:42] <rharper> huh
[20:44] <Odd_Bloke> I suspect some of these timeout values are more sensible in the context of LXD VMs, but they're applied to containers too.