=== jmcgnh_ is now known as jmcgnh
=== jmcgnh is now known as jmcgnh_
=== jmcgnh_ is now known as jmcgnh
=== hjensas|afk is now known as hjensas
[08:30] Hello, I have a race issue on Ubuntu 20.04.1 LTS between cloud-init & unattended-upgrades. Basically, my cloud-init user-data is a MIME multipart of several YAML parts with repo_update: true & a package list, and sometimes unattended-upgrades decides to kick in and grabs /var/lib/dpkg/lock-frontend.
[08:30] and of course, cloud-init fails
[08:51] Is this a known behavior, or should I file a bug? So far I haven't found any existing bug or related documentation.
[09:14] march0: this sounds similar to https://bugs.launchpad.net/cloud-init/+bug/1827204
[09:14] Ubuntu bug 1827204 in cloud-init "Doesn't run unattended-upgrades on first boot by default" [High,Triaged]
[09:27] it's similar, but in my case I would expect cloud-init to detect unattended-upgrades (which seems to start randomly, as the issue is not easily reproducible) and wait for the dpkg lock to be released
[09:28] because runcmd depends on the packages statement, the provisioning fails completely
[14:51] march0: Hmm, I'm honestly surprised we don't already have a bug filed for this, but I can't find one. A bug report (via https://bugs.launchpad.net/cloud-init/+filebug) would be great!
[16:20] hi
[17:00] question: does vendor data take priority over written cloud-config in /etc/cloud/cloud.cfg.d?
[17:00] i.e. I modify an image to create a user in /etc/cloud/cloud.cfg.d, but then vendor data comes in with its own cloud-config and defines its own users
[17:15] Odd_Bloke, this is very similar to https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image/474024 & https://github.com/systemd/systemd/issues/5659, but it completely breaks package install on fresh Ubuntu at an estimated 7% rate
[17:15] I usually start around 30 EC2 instances to get 1 or 2 failures
[17:20] march0: Right, which is why I'm surprised we don't have a bug already. :) Both the apt-daily services have a randomised wait time from boot, so it makes sense that you'd only see a proportion fail (those where $random_wait is less than the time your config takes to apply).
[17:21] We do have Before=apt-daily.service; I wonder if we're missing Before=apt-daily-upgrade.service?
[17:21] Hmm, though apt-daily-upgrade.service does have After=apt-daily.service (on my groovy machine, at least).
[17:25] I've been trying some workarounds with bootcmd but no luck so far. It's quite hard to wait for or stop unattended-upgrades, even in cloud-init's early stage.
[17:44] march0: So I wouldn't expect you to need to work around this issue; we've already fixed it once: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1693361. If you file a bug and attach the output of `cloud-init collect-logs` on an affected system, we can dig into what's happening on your system and figure out the root cause. :)
[17:44] Ubuntu bug 1693361 in cloud-init (Ubuntu Artful) "cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution" [Medium,Fix released]
[17:45] (I did not search for apt-daily previously; I _knew_ I'd seen a bug for this.)
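For illustration only (this snippet does not appear in the log): a minimal Python sketch of the kind of setup march0 describes, assembling a MIME multipart user-data from several cloud-config YAML parts, where a later runcmd part depends on packages installed by an earlier part. The filenames and the nginx package/runcmd contents are invented; march0's actual parts use repo_update: true and a different package list.

    # Sketch only, not march0's actual user-data: build a MIME multipart
    # user-data from several cloud-config YAML parts. Filenames and contents
    # below are invented for illustration.
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    parts = [
        # part enabling package handling plus a package list
        ("10-packages.yaml",
         "#cloud-config\npackage_update: true\npackages:\n  - nginx\n"),
        # part whose runcmd depends on the packages installed above
        ("20-runcmd.yaml",
         "#cloud-config\nruncmd:\n  - [systemctl, enable, --now, nginx]\n"),
    ]

    userdata = MIMEMultipart()
    for filename, content in parts:
        part = MIMEText(content, _subtype="cloud-config")
        part.add_header("Content-Disposition",
                        f'attachment; filename="{filename}"')
        userdata.attach(part)

    with open("user-data.mime", "w") as fh:
        fh.write(userdata.as_string())

If the package step fails because unattended-upgrades holds /var/lib/dpkg/lock-frontend, the later runcmd part runs on a machine that is missing its packages, which is the complete provisioning failure described above.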
[20:30] falcojr: So you're certainly not wrong that SSHing into LXDs is much slower than exec'ing, but a large part of that is due to this one sleep: https://github.com/canonical/pycloudlib/blob/1ac9d4c82fdfd5cb1407f70b8a2b17e02953569d/pycloudlib/lxd/instance.py#L85
[20:31] Unless LXD is incredibly fast, we'll miss finding an IP the first time around, so this basically guarantees a 20s sleep before attempting to SSH into LXDs.
[20:31] sleep 20??? That's... a bit much
[20:33] Yeah, if I drop that to 1, then it's more like 6s to run the first command via SSH.
[20:36] also, the original "_wait_for_cloudinit" that was probably still around when they reported long wait times used increasing sleep times
[20:37] there's also an SSH connect sleep that waits 10 seconds before trying again
[20:39] Could you exec into the instance to find out if networking is up?
[20:39] Yeah, I tried that one first and it had no effect: the 20s sleep happens before it, so by the time we're trying to SSH connect, SSH has been up for (20 - 6)s.
[20:41] rharper: The context here is we're trying to decide what the appropriate default access method for LXD is: our existing understanding is that SSHing was much slower, so we were leaning towards `exec` for pragmatic reasons (we run these tests all the time, so saving 15s per test run will add up very fast). However, I then noticed that we were, suspiciously, taking 21-22s to SSH in every time, so went digging.
[20:42] huh
[20:44] I suspect some of these timeout values are more sensible in the context of LXD VMs, but they're applied to containers too.
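The direction the discussion points at is polling for the instance IP on a short interval rather than sleeping a fixed 20 seconds before the first check. A rough sketch of that idea, assuming a hypothetical instance object with a get_ip_address() accessor (this is not pycloudlib's actual code, and the timeout/interval values are made up):

    # Sketch only, not pycloudlib code: poll for the IP with a short interval
    # instead of a single fixed 20-second sleep, so SSH can be attempted as
    # soon as an address is available.
    import time

    def wait_for_ip(instance, timeout=120, interval=1):
        """Return the instance's IP, polling every `interval` seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            ip = instance.get_ip_address()  # hypothetical accessor
            if ip:
                return ip
            time.sleep(interval)  # short poll instead of a blanket 20s wait
        raise TimeoutError("instance never reported an IP address")

With a 1-second interval, the first SSH attempt can happen as soon as the container has an address, which lines up with the ~6s figure mentioned above when the sleep was dropped to 1.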