[04:43] <mborzecki> morning
[05:07] <mardy> mborzecki: hi! Early bird, today :-)
[05:07] <mborzecki> mardy: hi, yeah, up since 5:30
[05:09] <mardy> sleep is overrated anyways ;-)
[05:41] <mborzecki> hm ci was very unhappy yesterday but things seem to have improved now
[05:47] <mborzecki> eh some failures in tests poking the session agent, wtf? https://paste.ubuntu.com/p/P2zpc9Jbgq/
[05:48] <mborzecki> why wasn't this failing yesterday?
[06:09] <mborzecki> can this be related? https://paste.ubuntu.com/p/3vYjVrkz5N/
[06:18] <mborzecki> doesn't seem to be a system update
[06:32] <mborzecki> hm no way of telling if something was updated in the images 😕
[07:00] <mborzecki> mvo: morning
[07:00] <mborzecki> mvo: can you land https://github.com/snapcore/snapd/pull/10386 ?
[07:01] <mvo> good morning mborzecki ! sure
[07:02] <mborzecki> mvo: i see recurring failures in 2 tests that poke user session agent: https://paste.ubuntu.com/p/P2zpc9Jbgq/
[07:02] <mborzecki> mvo: i suspect this may be related to https://paste.ubuntu.com/p/3vYjVrkz5N/ but the tests are passing in isolation (well no surprise there though)
[07:03] <mborzecki> fun, our code using the cgroup v1 freezer is completely oblivious to v2 and since it assumes that ENOENT on the freezer file is fine, nothing fails ;)
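The failure mode described above (a missing freezer file being treated as "nothing to freeze" even on a cgroup v2 host) can be illustrated with a minimal Python sketch. It is not snapd's code, which is written in Go; the names `is_cgroup_v2` and `freeze_v1` are hypothetical, and the sketch assumes that a unified (v2) hierarchy can be recognised by a `cgroup.controllers` file at the cgroup mount root.

```python
import os


def is_cgroup_v2(cgroup_root="/sys/fs/cgroup"):
    """Assumption: a unified (v2) hierarchy exposes cgroup.controllers at its root."""
    return os.path.exists(os.path.join(cgroup_root, "cgroup.controllers"))


def freeze_v1(freezer_file, cgroup_root="/sys/fs/cgroup"):
    """Write FROZEN to a v1 freezer.state file (hypothetical helper).

    Returns "frozen" on success and "nothing-to-freeze" when the file is
    absent on a v1 host. On a v2 host a missing file is reported as an
    error instead of being silently accepted.
    """
    try:
        # O_WRONLY without O_CREAT: never create the file if it is missing.
        fd = os.open(freezer_file, os.O_WRONLY)
    except FileNotFoundError:
        if is_cgroup_v2(cgroup_root):
            raise RuntimeError("cgroup v2 detected: the v1 freezer is not available")
        return "nothing-to-freeze"
    with os.fdopen(fd, "w") as f:
        f.write("FROZEN")
    return "frozen"
```

The point of the sketch is only the ENOENT branch: the absence of the file means two different things depending on the hierarchy in use.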
[07:06] <mborzecki> pstolowski: hey
[07:06] <mvo> good morning pstolowski 
[07:06] <pstolowski> hey mborzecki and mvo
[08:07] <zyga-mbp> good morning 
[08:10] <zyga-mbp> I wrote an ini-file encoder / decoder, similar to encoding/json
[08:10] <zyga-mbp> it's currently inside a project repo but if there is any interest I can split it out to a standalone repository
[08:10] <zyga-mbp> it has no dependencies
[08:10] <zyga-mbp> except for check for testing
[09:43] <mborzecki> trivial PR: https://github.com/snapcore/snapd/pull/10422
[09:47] <zyga-mbp> +!
[09:47] <zyga-mbp> +1
[09:49] <jamesh> pedronis: was this the post you were referring to in the meeting? https://forum.snapcraft.io/t/exceeded-maximum-runtime-installing-snap-store-font-generation-issue/24801
[09:52] <pedronis> jamesh: yes
[10:26] <mborzecki> jamesh: any clue what may be happening here? https://paste.ubuntu.com/p/P2zpc9Jbgq/ i've only started seeing those today, and found this in the logs too: https://paste.ubuntu.com/p/3vYjVrkz5N/
[10:26]  * jamesh looks
[10:28] <jamesh> mborzecki: it means there is a stale UNIX socket file that nothing (neither "systemd --user" or snap-session-agent) is listening on
[10:28] <mborzecki> it seems that most of the tests that poke user session agent are affected
[10:30] <jamesh> mborzecki: that test is mostly concerned with the session agent of the test user, but the checks during install will try to talk to all the session agents for all users logged into the system.
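A stale socket of the kind jamesh describes can be told apart from a live or missing one by attempting a connection. The following Python sketch is an illustration only; `probe_unix_socket` and `stale_session_agent_sockets` are hypothetical names, and the glob pattern is an assumption generalised from the `/run/user/0/snapd-session-agent.socket` path that appears in the error message quoted later in this log.

```python
import glob
import socket


def probe_unix_socket(path, timeout=1.0):
    """Classify a UNIX stream socket path as "live", "stale" or "missing"."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return "live"
    except FileNotFoundError:
        # No socket file at all.
        return "missing"
    except ConnectionRefusedError:
        # The file exists but nothing is listening on it.
        return "stale"
    finally:
        s.close()


def stale_session_agent_sockets(pattern="/run/user/*/snapd-session-agent.socket"):
    """List socket files matching the (assumed) pattern that nobody listens on."""
    return [p for p in sorted(glob.glob(pattern)) if probe_unix_socket(p) == "stale"]
```

Running `stale_session_agent_sockets()` as root on an affected machine would list the users whose socket file outlived its listener, under the assumptions above.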
[10:30] <mborzecki> jchittum: ofc it isn't 100% reproducible all the time, individual tests are passing, so i'm suspecting a stale session
[10:30] <jamesh> Maybe some of the common test setup/teardown code has changed?
[10:32] <mborzecki> hm not quite sure, most of what landed seems unrelated, i looked at system package updates but it hasn't changed since May or so
[10:34] <jamesh> mborzecki: there's code in tests/lib/tools/cleanup-state that tries to make sure the root user systemd instance is in a sane state, but perhaps it is getting confused somehow
[10:34] <jamesh> but that hasn't been updated in ~ 5 months
[10:35] <mborzecki> yeah
[10:35] <mborzecki> nothing really stands out ;)
[11:39] <pedronis> pstolowski: I reviewed https://github.com/snapcore/snapd/pull/10384
[11:41] <pstolowski> pedronis: ty
[11:57] <mborzecki> mvo: first bit: https://github.com/snapcore/snapd/pull/10423
[12:23] <mvo> mborzecki: in a meeting but \o/ 
[13:51] <ijohnson[m]> oh ffs
[13:52] <ijohnson[m]> https://pastebin.ubuntu.com/p/hXvqxMm5FW/
[13:53] <ijohnson[m]> I'm pretty sure systemd on centos 7 just like doesn't understand numbers
[13:59] <ijohnson[m]> systemctl show --property TasksCurrent snap.$(systemd-escape --path group-top1/group-one/group-sub-one).slice
[13:59] <ijohnson[m]> TasksCurrent=18446744073709551615
[13:59] <ijohnson[m]> amazing
[14:00] <ijohnson[m]> our workaround for memory account doesn't work because sometimes systemd can't keep track of the tasks either
[14:00] <ijohnson[m]> *account
[14:00] <ijohnson[m]> *accounting
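The value reported above, 18446744073709551615, is 2^64 - 1, the largest unsigned 64-bit integer, which suggests an "unset" or wrapped-around counter and not a real task count. A minimal Python sketch of a parser that refuses to trust that value follows; `parse_tasks_current` is a hypothetical helper, and the handling of a literal `[not set]` value is an assumption about what some systemd versions print.

```python
UINT64_MAX = 2**64 - 1  # 18446744073709551615


def parse_tasks_current(show_output):
    """Parse a 'TasksCurrent=N' line as printed by `systemctl show`.

    Returns the task count as an int, or None when the value cannot be a
    real count (UINT64_MAX, or the assumed "[not set]" placeholder).
    """
    key, sep, value = show_output.strip().partition("=")
    if key != "TasksCurrent" or not sep:
        raise ValueError("unexpected systemctl output: %r" % show_output)
    if value == "[not set]":
        return None
    n = int(value)
    return None if n == UINT64_MAX else n
```

A caller that gets `None` back would have to fall back to another source of truth, or report that accounting is unavailable on this system.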
[14:04] <pstolowski> woah
[14:07] <mvo> ijohnson[m]: my gut feeling is that we should just not support this feature on this old systemd there if we can
[14:08] <ijohnson[m]> yeah
[14:08] <ijohnson[m]> I'm trying to see if xenial has the same problem so we can put a lower bound on the minimum systemd version
[14:09] <ijohnson[m]> damn xenial too
[14:11] <mvo> that's sad :/
[14:11] <ijohnson[m]> I'm checking bionic too fwiw
[14:12] <ijohnson[m]> I think bionic was fine, I never saw this sort of thing happen on bionic, but this is a new way to reproduce the same thing
[14:13] <ijohnson[m]> also the really unfortunate thing about this is that now we have a quota group (cgroup) which has a real non-zero memory usage with only one task in it, and removing another quota group causes systemd to think that group has infinite tasks and infinite memory usage
[14:13] <ijohnson[m]> if the bug was just affecting empty quota groups it's meh but since this is affecting real groups with real services in them it's pretty sad
[14:14] <pedronis> well I hope bionic is fine, if it is we can just make bionic systemd the minimal requirement, if it's not fine we are in bigger trouble
[14:15] <ijohnson[m]> indeed, I should know in a minute or two
[14:15] <ijohnson[m]> phew bionic is okay
[14:16] <ijohnson[m]> so systemd 229 is broken, 237 is okay though
[14:16] <ijohnson[m]> I'm gonna put this new reproducer into a spread test and try with the minimum set to 230, see if any other systems are affected with systemd versions in between 229 and 237
[14:16] <pedronis> it means it's UC18+ feature, but I think that's ok
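Gating the feature on a minimum systemd version, as discussed above, could look roughly like the Python sketch below. The function names are hypothetical, and the threshold of 230 is only the tentative value ijohnson says he will try; the chat establishes just that 229 is broken and 237 works. The sketch assumes the first line of `systemctl --version` starts with `systemd <number>`.

```python
import re

# Tentative lower bound taken from the discussion; versions between 229
# and 237 had not been tested when it was proposed.
MIN_SYSTEMD_FOR_QUOTAS = 230


def systemd_version(version_output):
    """Extract the major version from `systemctl --version` output."""
    m = re.match(r"systemd (\d+)", version_output)
    if m is None:
        raise ValueError("cannot parse systemd version from %r" % version_output)
    return int(m.group(1))


def quota_groups_supported(version_output, minimum=MIN_SYSTEMD_FOR_QUOTAS):
    """Return True when the reported systemd version meets the minimum."""
    return systemd_version(version_output) >= minimum
```

If the spread test shows that versions between 230 and 236 are also affected, only the constant would need to change.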
[14:17] <ijohnson[m]> sad for our original uc16 customer who originally asked for this feature in 2018 for uc16 but oh well
[14:18] <mardy> mvo, pedronis: I guess this MP needs your superpowers to get merged (but please remember to squash): https://github.com/snapcore/snapd/pull/10363
[14:18] <mvo> mardy: looking
[14:19] <mvo> mardy: you checked the failures and they are unrelated?
[14:22] <mardy> mvo: those on 20.04 are about this error: "ERROR Post http://0/v1/service-control: dial unix /run/user/0/snapd-session-agent.socket: connect: connection refused"
[14:24] <pedronis> that's the issue mborzecki mentioned in the standup 
[14:24] <mvo> I saw this a bunch of times
[14:25] <ijohnson[m]> fwiw I didn't see that at all yesterday in my afternoon
[14:26] <pedronis> it's new, we wonder a bit what changed, it's not systemd apparently
[14:31] <pedronis> mardy: ijohnson[m]: I reviewed https://github.com/snapcore/snapd/pull/10266 , I think there is more code that can be removed but conflict-wise it might be better for ijohnson[m] to do it in one of his open quota PRs
[14:32] <pedronis> once this lands
[14:32] <mardy> pedronis: thanks, looking
[14:32] <ijohnson[m]> pedronis: what are your thoughts about having a flag to not format units as 1.29MB and instead return 1294336 B without the SI units ? it would be convenient for some tests at least I think
[14:32] <ijohnson[m]> pedronis: ack yeah that was my thoughts too that it will be simpler to just land mardy's PR and then do followups to clean up the quota stuff
[14:33] <ijohnson[m]> pedronis: my thoughts on the flag is that it would be a flag mixin like --abs-time like --abs-units or something
[14:33] <pedronis> yes, I was thinking that we have precedent for this in --abs-time, so the question is how to call it
[14:34] <ijohnson[m]> --no-metric-unit-prefixes
[14:34] <ijohnson[m]> ?
[14:34] <ijohnson[m]> maybe just --no-unit-prefixes or --base-units
[14:35] <pedronis> is it about anything other than sizes?
[14:35] <ijohnson[m]> not sure if we have other units that get formatted like that, I suppose right now it is just sizes
[14:36] <ijohnson[m]> though also it's probably safe to say that snapd will probably not start returning any volts or grams :-)
[14:39] <mvo> pedronis: fwiw, there was a systemd update 22h ago into focal-updates, no idea right now if that is related to the new failures
[14:40] <mvo> (changes look unrelated though)
[14:40] <pedronis> ijohnson[m]: so I discovered that we strutil.SizeToStr and quantity.FormatAmount :/
[14:40] <pedronis> *we have
[14:41] <ijohnson[m]> nice 2 implementations is better than 1 haha
[14:43] <pedronis> ijohnson[m]: so I suppose we would need a sizesMixin like we have a timeMixin
[14:43] <ijohnson[m]> yes
[14:44] <pedronis> about the name --base-sizes  --byte-sizes ?
[14:44] <ijohnson[m]> hmm I like --byte-sizes just because it seems smaller and more bite sized to me 🥁 
[14:45] <pedronis> --byte-sizes seems fine to me
[14:45] <ijohnson[m]> alright I'll try to work that in
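The behaviour being proposed for `--byte-sizes` can be sketched in a few lines of Python. This is not snapd's formatter (snapd is Go and has `strutil.SizeToStr` and `quantity.FormatAmount`, whose exact output is not reproduced here); `format_size` and its `byte_sizes` parameter are hypothetical and only mirror the two example outputs from the chat, `1.29MB` and `1294336 B`.

```python
def format_size(n_bytes, byte_sizes=False):
    """Format a byte count either with SI prefixes or as a plain byte count.

    byte_sizes=True plays the role of the proposed --byte-sizes flag.
    """
    if byte_sizes:
        return f"{n_bytes} B"
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == units[-1]:
            break
        value /= 1000
    if unit == "B":
        return f"{n_bytes}B"
    return f"{value:.2f}{unit}"
```

With the flag off, `format_size(1294336)` gives the human-readable form; with `byte_sizes=True` the exact count is preserved, which is what makes it convenient for tests that compare sizes.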
[14:46] <pedronis> ijohnson[m]: afaict info quota and snapshot are relevant, apparently quota is reusing code from snapshot atm
[14:47] <pedronis> s/relevant/are affected/
[14:47] <ijohnson[m]> right, I was also thinking snap info which reports the sizes of snaps too ?
[14:48] <pedronis> yea, also snapshots report snapshot sizes
[14:48] <pedronis> they define fmtSize which is used by quota
[14:48] <pedronis> info is using SizeToStr instead
[14:50] <ijohnson[m]> pedronis: I don't see any other commands that use memory so I think it's just the snapshots family, the quota family and info
[14:50] <pedronis> yea
[14:51] <pedronis> so we have strutil.SizeToStr, strutil/quantity and gadget/quantity, fun
[14:52] <pedronis> not saying you sort that out now, but we should look into that at some point
[14:52] <pedronis> it's a bit much/confusing
[14:54] <ijohnson[m]> double the fun
[15:03] <pedronis> mborzecki: question in https://github.com/snapcore/snapd/pull/10421
[15:05] <ijohnson[m]> alright let's see what other versions of systemd want to lose their quota groups support https://github.com/snapcore/snapd/pull/10425
[15:08] <ijohnson[m]> I'm confused by the quota group spread test failures in https://github.com/snapcore/snapd/pull/10266, it seems that on some systems the service did get restarted :-/
[15:43] <ijohnson[m]> hmm and I can't reproduce them immediately either :-/
[16:12] <pedronis> ijohnson[m]: are these tests and maybe also your test hitting an issue where start comes back but the service is not active yet?
[16:12] <ijohnson[m]> I know the issue now
[16:13] <ijohnson[m]> pedronis: the issue is that the test runs too fast, essentially, it dies 1-2 times because of OOM, and systemd is trying to restart it since it failed (since it will try to restart it at most 5 times with default settings), and then we remove the quota before the 3rd try succeeds, and so by the time the slice is removed the 3rd try to start the service succeeds
[16:14] <ijohnson[m]> we need to wait after creating the service until systemd has given up trying to start the service
[16:14] <pedronis> ah
[16:14] <ijohnson[m]> *after creating the quota group
[16:17] <pedronis> ijohnson[m]: or set restart condition to never? but that service is shared with other tests I suppose
[16:17] <ijohnson[m]> yeah, I'm hoping there's something we can ask systemd in a loop to see if the start limit has been hit
[16:17] <ijohnson[m]> I suppose we could grep the journal for the message that systemd says when it gives up
[16:20] <ijohnson[m]> I guess we want `SubState=failed` to indicate that systemd has stopped trying to restart it, otherwise systemd says it is in `SubState=running` or `SubState=auto-restart`
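The idea of asking systemd in a loop until it has given up restarting the service might be sketched as follows in Python. The helper names, the retry count and the delay are all hypothetical; the only parts taken from the chat are the `SubState` property and its `failed`, `running` and `auto-restart` values. The query function is injectable so the loop can be exercised without a running systemd.

```python
import subprocess
import time


def query_substate(unit):
    """Ask systemd for the unit's SubState, e.g. "running" or "failed"."""
    out = subprocess.run(
        ["systemctl", "show", "--property", "SubState", unit],
        check=True, capture_output=True, text=True,
    ).stdout
    # Output has the form "SubState=<value>".
    return out.strip().partition("=")[2]


def wait_for_substate(unit, wanted="failed", attempts=30, delay=1.0,
                      query=query_substate, sleep=time.sleep):
    """Poll until the unit reaches the wanted SubState; False on timeout."""
    for _ in range(attempts):
        if query(unit) == wanted:
            return True
        sleep(delay)
    return False
```

In the test described above, the quota group would be removed only after `wait_for_substate` returns True for the service, so that no pending restart can succeed once the memory limit is gone.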
[20:32] <ijohnson[m]> cachio: have you seen this sort of failure before? https://pastebin.ubuntu.com/p/qgpKPFtjrN/
[21:23] <cachio> ijohnson[m], hi
[21:23] <cachio> no 
[21:23] <cachio> first time
[21:24] <cachio> is this a PR? master?
[21:24] <ijohnson[m]> cachio: it was one of my PR's
[21:24] <ijohnson[m]> I since restarted the checks on that PR so I don't have any more logs
[21:24] <cachio> ok, I am reviewing logs and adding fixes today
[21:25] <cachio> if I see this in another pr I'll tell you
[21:36] <ijohnson[m]> ack thanks cachio