/srv/irclogs.ubuntu.com/2021/07/21/#ubuntu-security.txt

hankhello, I noticed a new timestamp format in the focal OVAL feed, is that on purpose?14:29
hankit's usually `YYYY-MM-DD time`, this one is `YYYY-MM-DD time ??`14:30
hankthe one I ran into is `PT`, which is odd because that's not a specific timezone14:30
sbeattiehank: hrm, can you point to where you see that?15:19
sbeattieoh containerd15:22
mdeslaursbeattie: oh, does it not understand "PT"?15:25
sbeattieit's either PDT or PST, there is no PT. 15:26
mdeslaurd'oh15:26
sbeattieI converted it to UTC15:26
hankUSN-5012-115:26
sbeattiehank: I've fixed it in our tracker, the OVAL feed should pick it up in the next couple of hours. Thanks for pointing it out!15:27
mdeslaursbeattie: thanks15:31
hankthanks!15:31
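
A minimal illustration of the ambiguity, using GNU date (the timestamp value here is invented): date has entries for the concrete abbreviations but none for "PT".

    date -d "2021-07-21 14:29 PDT"   # parses: PDT is a known, unambiguous abbreviation
    date -d "2021-07-21 14:29 PT"    # fails with "date: invalid date": PT could mean PDT or PST
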
TJ-There may be a problem with the recent systemd 20.04 update16:24
TJ-user in #ubuntu is reporting several server installs that have done unattended-upgrades have systemd-resolved spinning at 100% CPU and no name resolution16:25
TJ-user gave this info showing resolved timeout exceeded on the restart after installing the modified config file https://pastebin.com/mJh5N6HN16:27
Guest80Context: I received a report for some stale nodes (nodes that have not checked in to a central service) and upon investigation they stopped resolving hostnames right after some unattended upgrades. dpkg log for those upgrades: https://pastebin.com/mJh5N6HN16:27
Guest80A system reboot seems to clear it up, but some of these servers run data services, so I am also trying to find a way to prevent that. I've seen it happen in GCP and AWS.16:29
TJ-Guest80: from an affected system can you show us "pastebinit <( journalctl -u systemd-resolved --since="2 days ago" )"16:29
Guest80Failed to contact the server: <urlopen error [Errno -3] Temporary failure in name resolution>16:30
Guest80I'll create a pastebin manually16:30
TJ-ahhh16:30
TJ-Guest80: from an affected system can you show us "journalctl -u systemd-resolved --since="2 days ago" > /tmp/resolved.log" :)16:31
TJ-apologies for raw copy/edit/paste!16:31
Guest80No worries, thanks for looking at this16:31
Guest80https://pastebin.com/eF1wD2Lv ... there isn't a lot there16:31
Guest80I truncated it, as the restart loop continued16:32
Guest80I am going to sample some more nodes to see if they had the upgrades but are still able to resolve... 16:33
TJ-Guest80: could it be this? Bug #193422116:36
ubottuBug 1934221 in systemd (Ubuntu Hirsute) "systemd-resolve segfault" [Undecided, New] https://launchpad.net/bugs/193422116:36
TJ-possibly not since that seems to be for 21.04 but... mentions resolved :)16:36
TJ-Guest80: the fact it is repeatedly timing out trying to start the service is clear enough16:39
Guest80Aye16:41
TJ-that patch is applied to the focal git repo but doesn't look like it has been published... and as it is a /fix/ for a fault, it could be at least related to your issue16:42
Guest80Yeah, interestingly starting /lib/systemd/systemd-resolved works when I do it via console, and systems start functioning.16:43
TJ-that is /weird/ !16:43
Guest80but the service is still stuck in the startup loop16:43
TJ-so, what is in the unit? ("systemctl cat systemd-resolved")16:43
mdeslaurwe've had other reports of similar issues, but we've been unable to reproduce so far16:43
mdeslaurGuest80: is that on a cloud provider? are you running k8s?16:44
Guest80We don't run k8s and I'm seeing it in both GCP and AWS16:44
TJ-Guest80: here's how to gather more debug info. Kill any systemd-resolved process that is running, and ensure the unit is stopped (even if marked failed)16:46
Guest80Ok, stopped.16:47
TJ-Guest80: then enable debug-level logging with "sudo kill -SIGRTMIN+22 1" (see "man systemd" and SIGRTMIN+22 - this puts the systemd services into log_level=debug). Then "systemctl start systemd-resolved", then collect the logs with "journalctl -b 0 --since '2 minutes ago'" - this will capture any related log info, not just resolved's, to correlate cause and effect16:47
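
For reference, the full capture sequence described above as one sketch (the output filename is arbitrary):

    sudo systemctl stop systemd-resolved      # ensure the unit is down, even if marked failed
    sudo pkill -f systemd-resolved || true    # kill any stray resolved process
    sudo kill -SIGRTMIN+22 1                  # tell PID 1 to switch to log_level=debug (man systemd)
    sudo systemctl start systemd-resolved     # reproduce the failing start
    journalctl -b 0 --since "2 minutes ago" > /tmp/resolved-debug.log
    sudo kill -SIGRTMIN+23 1                  # restore log_level=info when done
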
Guest80Failed to umount /run/systemd/unit-root/dev: Device or resource busy16:49
Guest80There is a lot going on in the journal for systemd-resolved but it seems like it might be part of systemd itself16:51
TJ-Are you doing anything 'interesting' with namespaces?16:52
TJ-Guest80: yes; that signal enabled debug for all systemd components16:53
Guest80I don't understand the question, so probably not :)16:54
TJ-/run/systemd/unit-root/ is used as a temporary mountpoint in different namespaces for mounts before they're put in their final location. That error suggests the /dev/ (devtmpfs?) is still active because something has a handle on it16:54
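
Two quick checks for what is pinning that temporary mount, in case it helps (standard util-linux/psmisc tools; the path is the one from the error above):

    findmnt -R /run/systemd/unit-root           # list anything still mounted under the temporary root
    sudo fuser -vm /run/systemd/unit-root/dev   # show processes holding the busy mount
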
Guest80there are tons of snap.rootfs directories in /tmp16:54
Guest80ls /tmp/ | grep snap | wc -l16:55
Guest80608716:55
Guest80That seems odd to me16:55
* TJ- blinks16:55
TJ-do they have understandable names, as in /tmp/snap.<someappname>?16:56
Guest80They are all snap.rootfs_83S1KD16:56
Guest80ish16:56
Guest80we don't use any snaps that are not baked in16:56
Guest80I think lxd or something might be there16:56
TJ-hmmm, this /may/ not be related to resolved but it feels suspicious16:57
Guest80well, in the logs for journalctl there was a ton of activity around those before resolved failed, so yeah, could be a red herring16:57
TJ-do you have LXD containers configured as well?16:57
Guest80I think that AWS has their ssm-agent running in a container; let me look16:58
TJ-since LXD is itself a snap nowadays, those may be related to LXD containers16:58
TJ-"lxc list" - but hope you don't see 6087 containers!16:58
Guest80Interesting16:59
Guest80cannot perform operation: mount --rbind /var/log /tmp/snap.rootfs_MlEUCG//var/log: Permission denied16:59
Guest80I didn't see any containers17:00
TJ-my feeling is that this is a resource exhaustion related issue, but without seeing the detailed log it is hard to make a better guess. Are you able to share the log when SIGRTMIN+22 was invoked?17:00
Guest80let me do that... and I agree.17:01
Guest80root cause is probably an error on our part with how we set up /var/log17:01
TJ-but it shouldn't fail mysteriously!17:02
TJ-i've been giving systemd a good kicking recently over a similar lack-of-info issue with networkd/udevd !17:02
mdeslaurI'm curious, are you setting up /var/log in a special way?17:02
Guest80Yeah, we attach it to local storage (ephemeral drives) when we detect those, so I think something might be there.17:03
Guest80incoming logs17:04
Guest8036Oops, got disconnected: https://pastebin.com/eF1wD2Lv17:07
Guest8036sorry, wrong paste, it's barking because the file is too large, 1m17:08
TJ-ha! I thought someone was flying between Starlink and me!17:08
TJ-ouch :)17:09
Guest80Ok, well I had to pare it down a bit17:11
Guest80https://pastebin.com/WuPj5VKf17:11
Guest80but there were thousands of remount messages about rootfs in tmp17:12
Guest80I'm going to try something17:16
TJ-Read most of it now and it is making sense. I think I can do an experimental reproducer here17:21
TJ-oh, and I should have told you, to cancel the debug logging "kill -SIGRTMIN+23 1"17:22
Guest80Oh thanks, I ended up rebooting that node. On another node I am manually unmounting and removing all those rootfs directories to see if I can reduce the noise there and assert if that has an impact here17:24
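
A sketch of that manual cleanup, assuming the leftovers all match the /tmp/snap.rootfs_* pattern seen above (worth trying on a single node first):

    for d in /tmp/snap.rootfs_*; do
        sudo umount -R "$d" 2>/dev/null   # recursively detach anything still mounted there
        sudo rmdir "$d"                   # then remove the empty directory
    done
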
TJ-there's a workaround that may help in the interim - extend the timeout. "systemctl edit systemd-resolved" will create a drop-in override file. That command will open a text editor with an empty file. Add these lines and save: "[Service]" and "TimeoutStartSec=500"17:30
Guest80Active: active (running) since Wed 2021-07-21 17:30:13 UTC; 42s ago17:31
TJ-I /think/ that may give the service enough rope to start... it isn't far off when the service manager kills it for exceeding the default timeout17:31
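
The resulting drop-in, for reference; "systemctl edit" writes it to this path, or the file can be created by hand followed by "systemctl daemon-reload":

    # /etc/systemd/system/systemd-resolved.service.d/override.conf
    [Service]
    TimeoutStartSec=500
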
Guest80Yep, yeah, unmounting and removing all those rootfs directories caused it to start17:31
TJ-but are all the affected nodes seeing the same mass of mounts?17:31
Guest80Yes, they are... so now I am moving onto that problem ;)17:32
Guest80by the way, thank you17:32
Guest80lxc list17:34
Guest80cannot perform operation: mount --rbind /var/log /tmp/snap.rootfs_I0jKoM//var/log: Permission denied17:34
TJ-I can understand why this is happening. My bet is at reboot this isn't an issue. But, due to masses of (failed) container mounts, any restart of systemd-resolved  - and likely other services - will hit this17:34
Guest80Yep, agree. The reboot is not a permanent fix17:34
Guest80so when I run that lxc command, that rootfs directory just stays in the /tmp directory17:34
TJ-is that recursive mount related to the ssm-agent you referred to earlier?17:35
TJ-I've lost some connectivity here for some reason; most annoyingly web search (duckduckgo) and githubassets!17:35
TJ-looks like DDG in Ireland (msn network) is having a lay-down17:37
Guest80I'm not entirely sure if it's responsible, but /var/log is actually a symbolic link to another directory on another device.17:37
Guest80lrwxrwxrwx 1 root syslog 18 May  5 06:44 /var/log -> /mnt/ephemeral/log17:38
TJ-it could be: LXD, being a snap (although privileged), may have some restrictions that prevent it accessing /mnt/17:39
TJ-possibly an apparmor profile for example17:39
Guest80ok, that is good stuff, let me look at that.17:39
mdeslaurWow, this is interesting, nice debugging17:40
Guest80I'm not very familiar with apparmor, but I do see files related to snap-confine. I'll try the easy mode permissions first. 17:41
TJ-Looking at the snap apparmor profile I see a couple of bits of interest17:45
TJ-mount options=(rw rbind) /var/log/ -> /tmp/snap.rootfs_*/var/log/,17:45
Guest80I'm seeing that too17:45
TJ-and, disproving my hypothesis about /mnt/: mount options=(rw rbind) /mnt/ -> /tmp/snap.rootfs_*/mnt/,17:46
TJ-but, this is the snap daemon itself; not clear what is applied for the LXD snap at present17:46
Guest80when I modify that line and restart snapd/apparmor and then run lxc again, it still has the old line17:48
TJ-if I recall correctly, it's stop, edit, start17:51
Guest80would that be for apparmor?17:52
TJ-I think so ... a while since I had to wrestle with it.17:54
Guest80I'll do some research and see if I can get that error to go away17:54
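
For what it's worth, a single edited profile can usually be re-applied with apparmor_parser instead of restarting services - though snapd regenerates its profiles, so manual edits there may not stick (the filename below is a placeholder for whichever profile was edited):

    sudo apparmor_parser -r /var/lib/snapd/apparmor/profiles/<edited-profile>   # replace the loaded profile with the on-disk version
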
TJ-I think the snapd apparmor profile is a red herring - not implicated. What may be is any restrictions on LXD... but cannot get to the LXD source on github right now to check17:57
Guest80https://askubuntu.com/questions/1222627/cannot-open-snap-apps-cannot-perform-operation-mount-rbind-dev-tmp-snap sounds oddly familiar18:00
TJ-found the apparmor lxd profiles in /var/lib/snapd/apparmor/profiles/18:05
Guest80I'll check that out18:08
Guest80So after running this: unlink /var/log && mkdir /var/log && mount --bind /mnt/ephemeral/log /var/log18:08
Guest80lxc commands didn't bark at me18:08
Guest80And I don't see dangling snap.rootfs mounts/folders in /tmp18:09
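
To keep that arrangement across reboots, the bind mount could be persisted in /etc/fstab, e.g. (assuming /mnt/ephemeral is itself mounted by then; systemd orders bind mounts after their source):

    /mnt/ephemeral/log  /var/log  none  bind  0  0
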
Guest80I don't really understand all the layers there, is it the lxc profile, lxd profile, snap profile? All of the above? Is the symlink the problem or is it permissions? If it's permissions it seems like we should be able to drop a file somewhere and /mnt/ephemeral would be available.  18:14
TJ-the various snap.lxd* profiles (all basically the same) - I don't see anything obvious in them /except/ no mention of access to /mnt/, whereas snapd itself does - that omission may be the issue here when following a symlink to /mnt/ - all those /tmp/snap.rootfs_* are, I suspect, a bug in lxc when it cannot finalise mounts for a container and is retrying without cleaning up the previous attempt. May be18:20
TJ-worth checking for issues on the lxd project issue tracker on github 18:20
Guest80I agree. I'm testing a few rules to give it access, but it does seem to be a resource leak bug as well.18:21
Guest80I'll create an issue there18:21
TJ-may be worth talking to  Stéphane Graber or other devs in #lxcontainers18:24
TJ-oh, it's #lxc now!18:24
TJ-going to dinner now - I'll keep a watch on the channels18:25
Guest80TJ- thank you for your help here. Enjoy your evening.18:28
Guest80https://github.com/lxc/lxd/issues/9029 hopefully this isn't noise for them18:29
ubottuIssue 9029 in lxc/lxd "Growing number of snap.rootfs directories" [Open]18:29
TJ-Guest80: glad to see SG confirming it looks like a snap issue19:12
Guest80Yeah, I almost have a patch for us... only several hundred nodes to target ;)19:15
Guest80Thanks again for the help19:15
TJ-oooph!19:15
