march0 | Hello, I have a race issue between cloud-init and unattended-upgrades on Ubuntu 20.04.1 LTS. Basically, my cloud-init config is a MIME multipart of several YAML files with repo_update: true and a package list, and sometimes unattended-upgrades decides to kick in and takes /var/lib/dpkg/lock-frontend. | 08:30 |
---|---|---|
march0 | and ofc, cloudinit fails | 08:30 |
march0 | is this a known behavior, or should I file a bug? So far I didn't find any existing bug or related documentation | 08:51 |
tribaal | march0: this sounds similar to https://bugs.launchpad.net/cloud-init/+bug/1827204 | 09:14 |
ubot5 | Ubuntu bug 1827204 in cloud-init "Doesn't run unattended-upgrades on first boot by default" [High,Triaged] | 09:14 |
march0 | it's similar, but in my case I would expect cloud-init to detect unattended-upgrades (which seems to start at random, as the issue is not easily reproducible) and wait for the dpkg lock to be released | 09:27 |
march0 | because runcmd depends on the packages statement, so the whole provisioning fails | 09:28 |
Odd_Bloke | march0: Hmm, I'm honestly surprised we don't already have a bug filed for this, but I can't find one. A bug report (via https://bugs.launchpad.net/cloud-init/+filebug) would be great! | 14:51 |
kryl | hi | 16:20 |
powersj | question: does vendor data take priority over written cloud-config in /etc/cloud/cloud.cfg.d? | 17:00 |
powersj | i.e. I modify an image to create a user in /etc/cloud/cloud.cfg.d, but then vendor data comes in with its own cloud-config and defines its own users | 17:00 |
march0 | Odd_Bloke, this is very similar to https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image/474024 & https://github.com/systemd/systemd/issues/5659, but it completely breaks package installation on fresh Ubuntu roughly 7% of the time (estimated) | 17:15 |
march0 | I usually have to start around 30 EC2 instances to get 1 or 2 failures | 17:15 |
Odd_Bloke | march0: Right, which is why I'm surprised we don't have a bug already. :) Both the apt-daily services have a randomised wait time from boot, so it makes sense that you'd only see a proportion fail (those where $random_wait is less than the time your config takes to apply). | 17:20 |
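The proportion argument above can be sketched numerically. This is only an illustration: it assumes the timer delay is uniformly distributed between zero and some maximum, and both `failure_fraction` and the numbers in the usage comment are made up for the example, not taken from the apt-daily units or from march0's instances.

```python
def failure_fraction(config_seconds: float, max_delay_seconds: float) -> float:
    """Estimate the share of boots where the timer fires before config finishes.

    Assumes (hypothetically) a delay drawn uniformly from [0, max_delay_seconds].
    A boot is counted as failing when the delay is shorter than the time the
    cloud-init config takes to apply.
    """
    if max_delay_seconds <= 0:
        # No randomised delay at all: the timer always wins the race.
        return 1.0
    return min(max(config_seconds, 0.0) / max_delay_seconds, 1.0)


# Illustrative numbers only: a 60 s config against a 600 s maximum delay.
print(failure_fraction(60, 600))
```

Under that assumption, a longer package list (a larger `config_seconds`) would raise the failure share proportionally, which fits seeing only a handful of failures per batch of instances.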
Odd_Bloke | We do have Before=apt-daily.service, I wonder if we're missing Before=apt-daily-upgrade.service? | 17:21 |
Odd_Bloke | Hmm, though apt-daily-upgrade.service does have After=apt-daily.service (on my groovy machine, at least). | 17:21 |
march0 | I've been trying some workarounds with bootcmd but no luck so far. It's quite hard to wait for or stop unattended-upgrades, even in cloud-init's early stage | 17:25 |
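One shape such a workaround could take is polling the lock file until nobody holds it. The sketch below is not something cloud-init does and not march0's actual attempt; `wait_for_lock` is a hypothetical helper, and it assumes the lock holder uses POSIX `fcntl` record locks on the file, which is how dpkg is understood to lock `/var/lib/dpkg/lock-frontend`.

```python
import fcntl
import time


def wait_for_lock(path="/var/lib/dpkg/lock-frontend", timeout=300.0, interval=1.0):
    """Return True once the lock on `path` is free, False if `timeout` expires.

    Hypothetical helper: it takes the lock briefly to prove it is free and then
    releases it, so another process can still grab it before the caller runs apt.
    """
    deadline = time.monotonic() + timeout
    # Append mode opens the file for writing (needed for an exclusive lock)
    # without truncating it; it also creates the file if it does not exist.
    with open(path, "a") as handle:
        while True:
            try:
                fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                # Someone else holds the lock: give up or retry after a pause.
                if time.monotonic() >= deadline:
                    return False
                time.sleep(interval)
            else:
                fcntl.lockf(handle, fcntl.LOCK_UN)
                return True
```

Because the lock is released before returning, this only narrows the race rather than closing it, which is one reason a fix inside cloud-init or the unit ordering is preferable to a user-side wait loop.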
Odd_Bloke | march0: So I wouldn't expect you to need to workaround this issue, we've already fixed it once: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1693361. If you file a bug and attach the output of `cloud-init collect-logs` on an affected system, we can dig into what's happening on your system, and figure out the root cause. :) | 17:44 |
ubot5 | Ubuntu bug 1693361 in cloud-init (Ubuntu Artful) "cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution" [Medium,Fix released] | 17:44 |
Odd_Bloke | (I did not search for apt-daily previously; I _knew_ I'd seen a bug for this.) | 17:45 |
Odd_Bloke | falcojr: So you're certainly not wrong that SSHing into LXDs is much slower than exec'ing, but a large part of that is due to this one sleep: https://github.com/canonical/pycloudlib/blob/1ac9d4c82fdfd5cb1407f70b8a2b17e02953569d/pycloudlib/lxd/instance.py#L85 | 20:30 |
Odd_Bloke | Unless LXD is incredibly fast, we'll miss finding an IP first time around, so this basically guarantees a 20s sleep before attempting to SSH into LXDs. | 20:31 |
falcojr | sleep 20??? That's...a bit much | 20:31 |
Odd_Bloke | Yeah, if I drop that to a 1, then it's more like 6s to run the first command via SSH. | 20:33 |
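The effect of the sleep length can be shown with a generic poll loop. This is a minimal sketch and not the pycloudlib code linked above: `wait_for_ip` and its `get_ip` callable are hypothetical names, and the `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time


def wait_for_ip(get_ip, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll `get_ip()` until it returns a truthy address or `timeout` expires.

    With a short `interval`, a first missed lookup costs about one second
    instead of a fixed 20 s wait before the next attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        ip = get_ip()
        if ip:
            return ip
        if time.monotonic() >= deadline:
            raise TimeoutError("no IP address reported before the timeout")
        sleep(interval)


# Example: the address shows up on the third lookup (values are made up).
answers = iter([None, None, "10.0.0.5"])
naps = []
print(wait_for_ip(lambda: next(answers), sleep=naps.append), naps)
```

In the example the loop sleeps twice for one interval each, so the total wait scales with how late the address appears rather than with a fixed worst-case delay.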
falcojr | also, the original "_wait_for_cloudinit" that was probably still around when they reported long wait times used increasing sleep times | 20:36 |
falcojr | there's also a ssh connect sleep that waits 10 seconds before trying again | 20:37 |
rharper | Could you exec into the instance to find out if networking is up? | 20:39 |
Odd_Bloke | Yeah, I tried that one first and it had no effect: the 20s sleep happens before it, so by the time we're trying to SSH connect, SSH has been up for (20 - 6)s. | 20:39 |
Odd_Bloke | rharper: The context here is we're trying to decide what the appropriate default access method for LXD is: our existing understanding is that SSHing was much slower, so we were leaning towards `exec` for pragmatic reasons (we run these tests all the time, so saving 15s per test run will add up very fast). However, I then noticed that we were, suspiciously, taking 21-22s to SSH in every time, so went digging. | 20:41 |
rharper | huh | 20:42 |
Odd_Bloke | I suspect some of these timeout values are more sensible in the context of LXD VMs, but they're applied to containers too. | 20:44 |