/srv/irclogs.ubuntu.com/2021/07/21/#ubuntu-security.txt

hankhello, I noticed a new timestamp format in the focal OVAL feed, is that on purpose?14:29
hankit's usually `YYYY-MM-DD time`, this one is `YYYY-MM-DD time ??`14:30
hankthe one I ran into is `PT`, which is odd because that's not a specific timezone14:30
sbeattiehank: hrm, can you point to where you see that?15:19
sbeattieoh containerd15:22
mdeslaursbeattie: oh, does it not understand "PT"?15:25
sbeattieit's either PDT or PST, there is no PT. 15:26
mdeslaurd'oh15:26
sbeattieI converted it to UTC15:26
hankUSN-5012-115:26
sbeattiehank: I've fixed it in our tracker, the OVAL feed should pick it up in the next couple of hours. Thanks for pointing it out!15:27
mdeslaursbeattie: thanks15:31
hankthanks!15:31
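
A minimal illustration of the ambiguity, using GNU date (the timestamp value here is invented): date has entries for the concrete abbreviations but none for "PT".

    date -d "2021-07-21 14:29 PDT"   # parses: PDT is a known, unambiguous abbreviation
    date -d "2021-07-21 14:29 PT"    # fails with "date: invalid date": PT could mean PDT or PST
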
TJ-There may be a problem with the recent systemd 20.04 update16:24
TJ-user in #ubuntu is reporting several server installs that have done unattended-upgrades have systemd-resolved spinning at 100% CPU and no name resolution16:25
TJ-user gave this info showing resolved timeout exceeded on the restart after installing the modified config file https://pastebin.com/mJh5N6HN16:27
Guest80Context: I received a report for some stale nodes (nodes that have not checked in to a central service) and upon investigation they stopped resolving hostnames right after some unattended upgrades. dpkg log for those upgrades: https://pastebin.com/mJh5N6HN16:27
Guest80A system reboot seems to clear it up, but some of these servers run data services, so I am also trying to find a way to prevent that. I've seen it happen in GCP and AWS.16:29
TJ-Guest80: from an affected system can you show us "pastebinit <( journalctl -u systemd-resolved --since="2 days ago" )"16:29
Guest80Failed to contact the server: <urlopen error [Errno -3] Temporary failure in name resolution>16:30
Guest80I'll create a pastebin manually16:30
TJ-ahhh16:30
TJ-Guest80: from an affected system can you show us "journalctl -u systemd-resolved --since="2 days ago" > /tmp/resolved.log" :)16:31
TJ-apologies for raw copy/edit/paste!16:31
Guest80No worries, thanks for looking at this16:31
Guest80https://pastebin.com/eF1wD2Lv ... there isn't a lot there16:31
Guest80I truncated it, as the restart loop continued16:32
Guest80I am going to sample some more nodes to see if they had the upgrades but are still able to resolve... 16:33
TJ-Guest80: could it be this? Bug #193422116:36
ubottuBug 1934221 in systemd (Ubuntu Hirsute) "systemd-resolve segfault" [Undecided, New] https://launchpad.net/bugs/193422116:36
TJ-possibly not since that seems to be for 21.04 but... mentions resolved :)16:36
TJ-Guest80: the fact it is repeatedly timing out trying to start the service is clear enough16:39
Guest80Aye16:41
TJ-that patch is applied to the focal git repo but doesn't look like it has been published... and as it is a /fix/ for a fault, it could be at least related to your issue16:42
Guest80Yeah, interestingly starting /lib/systemd/systemd-resolved works when I do it via console, and systems start functioning.16:43
TJ-that is /weird/ !16:43
Guest80but the service is still stuck in the startup loop16:43
TJ-so, what is in the unit? ("systemctl cat systemd-resolved")16:43
mdeslaurwe've had other reports of similar issues, but we've been unable to reproduce so far16:43
mdeslaurGuest80: is that on a cloud provider? are you running k8s?16:44
Guest80We don't run k8s and I'm seeing it in both GCP and AWS16:44
TJ-Guest80: here's how to gather more debug info. Kill any systemd-resolved process that is running, and ensure the unit is stopped (even if marked failed)16:46
Guest80Ok, stopped.16:47
TJ-Guest80: then enable debug-level logging with "sudo kill -SIGRTMIN+22 1" (see "man systemd" and SIGRTMIN+22 - this puts the systemd services into log_level=debug). Then "systemctl start systemd-resolved", then collect the logs with "journalctl -b 0 --since '2 minutes ago'" - this will capture any related log info, not just resolved's, to correlate cause and effect16:47
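
For reference, the full capture sequence described above as one sketch (the output filename is arbitrary):

    sudo systemctl stop systemd-resolved      # ensure the unit is down, even if marked failed
    sudo pkill -f systemd-resolved || true    # kill any stray resolved process
    sudo kill -SIGRTMIN+22 1                  # tell PID 1 to switch to log_level=debug (man systemd)
    sudo systemctl start systemd-resolved     # reproduce the failing start
    journalctl -b 0 --since "2 minutes ago" > /tmp/resolved-debug.log
    sudo kill -SIGRTMIN+23 1                  # restore log_level=info when done
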
Guest80Failed to umount /run/systemd/unit-root/dev: Device or resource busy16:49
Guest80There is a lot going on in the journal for systemd-resolved but it seems like it might be part of systemd itself16:51
TJ-Are you doing anything 'interesting' with namespaces?16:52
TJ-Guest80: yes; that signal enabled debug for all systemd components16:53
Guest80I don't understand the question, so probably not :)16:54
TJ-/run/systemd/unit-root/ is used as a temporary mountpoint in different namespaces for mounts before they're put in their final location. That error suggests the /dev/ (devtmpfs?) is still active because something has a handle on it16:54
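
Two quick checks for what is pinning that temporary mount, in case it helps (standard util-linux/psmisc tools; the path is the one from the error above):

    findmnt -R /run/systemd/unit-root           # list anything still mounted under the temporary root
    sudo fuser -vm /run/systemd/unit-root/dev   # show processes holding the busy mount
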
Guest80there are tons of snap.rootfs directories in /tmp16:54
Guest80ls /tmp/ | grep snap | wc -l16:55
Guest80608716:55
Guest80That seems odd to me16:55
* TJ- blinks16:55
TJ-do they have understandable names, as in /tmp/snap.<someappname>?16:56
Guest80They are all snap.rootfs_83S1KD16:56
Guest80ish16:56
Guest80we don't use any snaps that are not baked in16:56
Guest80I think lxd or something might be there16:56
TJ-hmmm, this /may/ not be related to resolved but it feels suspicious16:57
Guest80well, in the logs for journalctl there was a ton of activity around those before resolved failed, so yeah, could be a red herring16:57
TJ-do you have LXD containers configured as well?16:57
Guest80I think that AWS has their ssm-agent running in a container; let me look16:58
TJ-since LXD is itself a snap nowadays, those may be related to LXD containers16:58
TJ-"lxc list" - but hope you don't see 6087 containers!16:58
Guest80Interesting16:59
Guest80cannot perform operation: mount --rbind /var/log /tmp/snap.rootfs_MlEUCG//var/log: Permission denied16:59
Guest80I didn't see any containers17:00
TJ-my feeling is that this is a resource exhaustion related issue, but without seeing the detailed log it is hard to make a better guess. Are you able to share the log when SIGRTMIN+22 was invoked?17:00
Guest80let me do that... and I agree.17:01
Guest80root cause is probably an error on our part with how we set up /var/log17:01
TJ-but it shouldn't fail mysteriously!17:02
TJ-i've been giving systemd a good kicking recently over a similar lack-of-info issue with networkd/udevd !17:02
mdeslaurI'm curious, are you setting up /var/log in a special way?17:02
Guest80Yeah, we attach it to local storage (ephemeral drives) when we detect those, so I think something might be there.17:03
Guest80incoming logs17:04
Guest8036Oops, got disconnected: https://pastebin.com/eF1wD2Lv17:07
Guest8036sorry, wrong paste, it's barking because the file is too large, 1m17:08
TJ-ha! I thought someone was flying between Starlink and me!17:08
TJ-ouch :)17:09
Guest80Ok, well I had to pare it down a bit17:11
Guest80https://pastebin.com/WuPj5VKf17:11
Guest80but there were thousands of remount messages about rootfs in tmp17:12
Guest80I'm going to try something17:16
TJ-Read most of it now and it is making sense. I think I can do an experimental reproducer here17:21
TJ-oh, and I should have told you, to cancel the debug logging "kill -SIGRTMIN+23 1"17:22
Guest80Oh thanks, I ended up rebooting that node. On another node I am manually unmounting and removing all those rootfs directories to see if I can reduce the noise there and assert if that has an impact here17:24
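
A sketch of that manual cleanup, assuming the leftovers all match the /tmp/snap.rootfs_* pattern seen above (worth trying on a single node first):

    for d in /tmp/snap.rootfs_*; do
        sudo umount -R "$d" 2>/dev/null   # recursively detach anything still mounted there
        sudo rmdir "$d"                   # then remove the empty directory
    done
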
TJ-there's a workaround that may help in the interim - extend the timeout. "systemctl edit systemd-resolved" will create a drop-in override file. That command will open a text editor with an empty file. Add these lines and save: "[Service]" and "TimeoutStartSec=500"17:30
Guest80Active: active (running) since Wed 2021-07-21 17:30:13 UTC; 42s ago17:31
TJ-I /think/ that may give the service enough rope to start... it isn't far off when the service manager kills it for exceeding the default timeout17:31
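
The resulting drop-in, for reference; "systemctl edit" writes it to this path, or the file can be created by hand followed by "systemctl daemon-reload":

    # /etc/systemd/system/systemd-resolved.service.d/override.conf
    [Service]
    TimeoutStartSec=500
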
Guest80Yep, yeah, unmounting and removing all those rootfs directories caused it to start17:31
TJ-but are all the affected nodes seeing the same mass of mounts?17:31
Guest80Yes, they are... so now I am moving onto that problem ;)17:32
Guest80by the way, thank you17:32
Guest80lxc list17:34
Guest80cannot perform operation: mount --rbind /var/log /tmp/snap.rootfs_I0jKoM//var/log: Permission denied17:34
TJ-I can understand why this is happening. My bet is at reboot this isn't an issue. But, due to masses of (failed) container mounts, any restart of systemd-resolved  - and likely other services - will hit this17:34
Guest80Yep, agree. The reboot is not a permanent fix17:34
Guest80so when I run that lxc command, that rootfs directory just stays in the /tmp directory17:34
TJ-is that recursive mount related to the ssm-agent you referred to earlier?17:35
TJ-I've lost some connectivity here for some reason; most annoyingly web search (duckduckgo) and githubassets!17:35
TJ-looks like DDG in Ireland (msn network) is having a lay-down17:37
Guest80I'm not entirely sure if it's responsible, but /var/log is actually a symbolic link to another directory on another device.17:37
Guest80lrwxrwxrwx 1 root syslog 18 May  5 06:44 /var/log -> /mnt/ephemeral/log17:38
TJ-it could be: LXD, being a snap (although privileged), may have some restrictions that prevent it accessing /mnt/17:39
TJ-possibly an apparmor profile for example17:39
Guest80ok, that is good stuff, let me look at that.17:39
mdeslaurWow, this is interesting, nice debugging17:40
Guest80I'm not very familiar with apparmor, but I do see files related to snap-confine. I'll try the easy mode permissions first. 17:41
TJ-Looking at the snap apparmor profile I see a couple of bits of interest17:45
TJ-mount options=(rw rbind) /var/log/ -> /tmp/snap.rootfs_*/var/log/,17:45
Guest80I'm seeing that too17:45
TJ-and, disproving my hypothesis about /mnt/: mount options=(rw rbind) /mnt/ -> /tmp/snap.rootfs_*/mnt/,17:46
TJ-but, this is the snap daemon itself; not clear what is applied for the LXD snap at present17:46
Guest80when I modify that line and restart snapd/apparmor and then run lxc again, it still has the old line17:48
TJ-if I recall correctly, it's stop, edit, start17:51
Guest80would that be for apparmor?17:52
TJ-I think so ... a while since I had to wrestle with it.17:54
Guest80I'll do some research and see if I can get that error to go away17:54
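
For what it's worth, a single edited profile can usually be re-applied with apparmor_parser instead of restarting services - though snapd regenerates its profiles, so manual edits there may not stick (the filename below is a placeholder for whichever profile was edited):

    sudo apparmor_parser -r /var/lib/snapd/apparmor/profiles/<edited-profile>   # replace the loaded profile with the on-disk version
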
TJ-I think the snapd apparmor profile is a red herring - not implicated. What may be is any restrictions on LXD... but cannot get to the LXD source on github right now to check17:57
Guest80https://askubuntu.com/questions/1222627/cannot-open-snap-apps-cannot-perform-operation-mount-rbind-dev-tmp-snap sounds oddly familiar18:00
TJ-found the apparmor lxd profiles in /var/lib/snapd/apparmor/profiles/18:05
Guest80I'll check that out18:08
Guest80So after running this: unlink /var/log && mkdir /var/log && mount --bind /mnt/ephemeral/log /var/log18:08
Guest80lxc commands didn't bark at me18:08
Guest80And I don't see dangling snap.rootfs mounts/folders in /tmp18:09
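
To keep that arrangement across reboots, the bind mount could be persisted in /etc/fstab, e.g. (assuming /mnt/ephemeral is itself mounted by then; systemd orders bind mounts after their source):

    /mnt/ephemeral/log  /var/log  none  bind  0  0
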
Guest80I don't really understand all the layers there, is it the lxc profile, lxd profile, snap profile? All of the above? Is the symlink the problem or is it permissions? If it's permissions it seems like we should be able to drop a file somewhere and /mnt/ephemeral would be available.  18:14
TJ-the various snap.lxd* profiles (all basically the same) - I don't see anything obvious in them /except/ no mention of access to /mnt/, whereas snapd itself does - that omission may be the issue here when following a symlink to /mnt/ - all those /tmp/snap.rootfs_* are, I suspect, a bug in lxc when it cannot finalise mounts for a container and is retrying without cleaning up the previous attempt. May be18:20
TJ-worth checking for issues on the lxd project issue tracker on github 18:20
Guest80I agree. I'm testing a few rules to give it access, but it does seem to be a resource leak bug as well.18:21
Guest80I'll create an issue there18:21
TJ-may be worth talking to  Stéphane Graber or other devs in #lxcontainers18:24
TJ-oh, it's #lxc now!18:24
TJ-going to dinner now - I'll keep a watch on the channels18:25
Guest80TJ- thank you for your help here. Enjoy your evening.18:28
Guest80https://github.com/lxc/lxd/issues/9029 hopefully this isn't noise for them18:29
ubottuIssue 9029 in lxc/lxd "Growing number of snap.rootfs directories" [Open]18:29
TJ-Guest80: glad to see SG confirming it looks like a snap issue19:12
Guest80Yeah, I almost have a patch for us... only several hundred nodes to target ;)19:15
Guest80Thanks again for the help19:15
TJ-oooph!19:15
