[05:52] morning [06:00] hey mborzecki [06:00] mborzecki how did that device cgroup case turn out? [06:00] did you get to the bottom of the problem? [06:01] hi mborzecki, zyga-mbp! [06:01] hey mardy :) [06:01] zyga-mbp: no, leave and mardy as well [06:01] what's your local time? [06:01] zyga-mbp: I'm trying now to add a sleep before assigning the PID to the cgroup, to see what changes [06:02] mine is 9:02 [06:02] where are you running tests that are failing? in the cloud or locally? [06:02] I wonder how much is the speed of the host related to this (apparent) race [06:02] zyga-mbp: cloud [06:02] ah, then hosts should be fast [06:02] good luck, let me know if you find something, that's one curious problem for sure [06:03] zyga-mbp: maybe you have a link to some handy resource that explains how systemd handles cgroups? I think I once read something from Lennart about it, but I don't seem to be able to find it [06:04] it's complicated, I really read systemd source to see how it really does that [06:04] especially for --user since those are a bit more tricky for unprivileged systemd to handle [06:04] I assume you know the general concepts of slices and scopes [06:04] I'll start with https://github.com/systemd/systemd/blob/main/docs/CGROUP_DELEGATION.md [06:04] zyga-mbp: mardy:i wonder if it's related to the mystery error in spread test when i see mknod fail with EPERM for no reason in prepare [06:04] delegation is different [06:05] btw, are we using delegation? [06:05] god I hope we are not [06:05] mborzecki mknod in which part? prepare outside of confinement? [06:05] yes [06:05] mardy delegation is another complex topic that depends on the rest [06:06] mardy I'd read the stack of systemd man pages related to units, scopes and slices [06:06] then follow up with tracing what happens at runtime with gdb or by following the code manually [06:06] I'm separating theory from practice here, because theory is a bit easier [06:06] and practice has interesting "aha moments" when you read the implementation and see how it's implemented [06:07] mborzecki hmm hmm [06:07] mborzecki but how could that mknod fail on anything related to snapd? [06:07] mborzecki is it perhaps related to session-tool? [06:07] (is it still called session-tool?) [06:08] zyga-mbp: i wonder if there's a case that some device controller setup is done on the wrong cgroup [06:08] inside systemd? [06:10] mborzecki do look at session tool if that's anywhere in the picture of that mknod [06:10] because at least on xenial, it does a few extra things [06:10] that work around bugs [06:39] uh, and now I cannot reproduce it anymore :-/ I guess it might indeed be tied to the machine stress, since probably the Google cloud is underused now [06:40] PR snapd#10822 opened: spread: display information about current device cgroup in debug dump [06:49] mmm... stress-ng does not seem to help either [06:57] mborzecki, zyga-mbp: isn't what we are doing with the devices controller in violation of what systemd recommends at point 1 of the first section in https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ ? [06:58] yes [06:58] it sounds like we should create a scope for the snaps (or maybe a scope for each snap) [06:58] it's complicated [06:59] services get their own place automatically [06:59] we cannot create a scope ourselves [06:59] refresh app awareness really helps whit making this sane [06:59] *with [06:59] note that as it is implemented with r-a-a enabled, various users will have correct cgroup paths for their snap processes [06:59] so in the users slice [07:00] with the right user scope [07:00] and then with the right snap scope [07:00] but that's all new [07:00] device code is old and pre-dates any of that [07:00] it's a hack we've inherited since ubuntu phone click days [07:00] we've only made the code less scary [07:02] morning [07:02] good morning pstolowski [07:02] hey pstolowski and mvo :) [07:03] pstolowski, mvo: hi! [07:06] hey zyga-mbp and mardy ! [07:12] zyga-mbp: I guess I'm missing something here, but shouldn't we (snapd) only call StartTransientUnit() on systemd to create a scope, and set the right properties so that the devices controller is setup the way we want it? [07:13] it doesn't look like we are doing anything that systemd couldn't be doing for us [07:14] pstolowski: mvo: hey [07:15] meh 2021-09-22 06:08:02 Cannot allocate google:ubuntu-21.10-64: cannot find any Google image matching "ubuntu-os-cloud-devel/ubuntu-2110" on project "ubuntu-os-cloud-devel" [07:15] PR snapd#10819 closed: interfaces/builtin/opengl.go: add libOpenGL.so* too [07:17] mardy it's not as pretty [07:17] mborzecki: good morning ! is 10803 ready ? just looking over it right now, would love to do this impish upload [07:17] mardy we should not make scopes for services [07:17] so that's the first fracture [07:18] the second fracture is that this API is reliable from 20.10 onwards IIRC [07:18] before it has lots of failure modes [07:18] mvo: yes, there's a uc20 job running, but it should be finishing shortly and then we can land it [07:18] cool [07:18] and lastly, at the time it was fine, since systemd did not touch this cgroup at all, apart from, well, scopes which set up all the cgroups to a different base [07:18] mborzecki: looks quite happy overal [07:18] yeah [07:20] PR snapd#10797 closed: usersession/client: refactor doMany() method [07:21] zyga-mbp: hm we're not creating scopes for services, unless i'm misremembering things [07:28] mvo: https://github.com/snapcore/snapd/pull/10803 spread job just finished [07:28] PR #10803: tests, interfaces/builtin: introduce 21.10 cgroupv2 variant, tweak tests for cgroupv2, update builtin interfaces [07:29] zyga-mbp, mborzecki: I think we should evaluate what "this API is reliable from 20.10 onwards" means in practice; but if it's really unreliable, we could still have snapd create a single, transient scope for all snaps, with delegation, and then snap-confine would be free to create subgroups within it, without being disturbed by systemd [07:30] (there's quite a lot of handwaving in here, I acknowledge that, but would it make sense?) [07:37] mborzecki: hm, looks like the cgroup-devices-v2 test fails on debian-sid in 10803 :/ [07:49] hm let me see [08:00] PR snapd#10573 closed: sysconfig/cloud-init: filter MAAS c-i config from ubuntu-seed on grade signed [08:18] sid is weird again: https://paste.ubuntu.com/p/2gPQmWqG9Q/ how come s-c is trying to send over a unix socket? was a file descriptor inherited or something? [08:36] heh, so I added SNAP_CONFINE_DEBUG=1, and there's way more warnings, as if writing to stderr is getting blocked [08:41] oh, ok it's happening for other snaps too now [08:41] https://github.com/snapcore/snapd/pull/10814 is ready for review now (was draft) [08:41] PR #10814: o/ifacestate: don't loose connections if snaps are broken [08:44] mardy no [08:44] mardy scopes are not hard-pinned [08:44] mardy scopes live under slices [08:44] mardy and under services [08:44] this, sadly, doesn't work like that [08:45] brb reboot [08:48] mardy to make something sensible snapd would have to be a delegate [08:48] but that has huge downsides [08:49] and would totally cripple services [09:04] zyga-mbp: it would cripple them if the services were created in a subgroup of the one managed by snapd, but I don't think it necessarily has to be this way [09:05] it would cripple services because by design, systemd would no longer manage the pids [09:09] zyga-mbp: but why? wouldn't snapd be able to create the systemd service units so that they are children of the root cgroup? [09:10] the way I understand delegation is: systemd carves a space and you get to manage anything inside [09:10] stuff like systemd managing a service is gone [09:10] yes, but that only applies to that subtree [09:10] you can still create stuff outside of it [09:10] it's incompatible with the notion of systemd managing a process which is started by the service [09:10] outside where? [09:11] perhaps I'm confused [09:11] or I am :-) So, my understanding is that currently, snapd creates units for the services [09:11] yes [09:12] and systemd manages them. I'm not proposing that we change this [09:12] wtf is wrong with sid, s-c cannot create a bpf map whe running for a service [09:12] just EPERM and that's it [09:12] no log, nothing [09:13] mborzecki hmm [09:13] mborzecki maybe hardening present in sid? [09:13] check config/patches [09:13] I'm only proposing to change (or rather, investigate the possibilify of changing) snap-run and/or snap-confine, so that when it's starting a program (which is not a service) it would create it inside a cgroup managed by snapd [09:14] that's exactly what r-a-a is doing [09:14] but we've ended up doing that for scopes [09:14] as otherwise resource control is broken [09:14] because if I launch a process it would escape my user slice [09:15] tl;d scopes are the way, it just took a while to not have broken implementation on the other side [09:16] zyga-mbp: is this a plan, or is it already implemented? [09:16] that's what r-a-a is doing [09:18] mardy except that it's done in snap-run [09:18] as the user [09:18] so that the right request is sent [09:19] zyga-mbp: it's that CreateTransientScopeForTracking() method, right? [09:19] yes [09:21] zyga-mbp: so, what if I changed this to add the DeviceAllow property to these units, instead of creating the group in snap-confine? Is it only because it's not reliable on older systems? [09:21] it would break how udev changes are made [09:21] remember that udev calls us [09:22] and we modify a non-systemd hacky cgroup in v1 world [09:22] (unless mborzecki changed that already, I don't know) [09:22] it works now because that's a fixed list of snap udev tags [09:22] and we can find the cgroups from that [09:24] pfff so there's a silly locked memory limit being set in sid [09:30] why must sid be always such a PITA [09:31] mborzecki safety against EBPF exploits? [09:32] zyga-mbp: ok, it sounds like I need to investigate this better. And the other option, of having snapd create at startup a single scope under the system slice, with Delegate=devices, and then have snap-confine create its cgroups like it does nowadays, just under this scope's cgroup, do you see problems with it? [09:32] zyga-mbp: pff i don't think so, fun fact, on arch 5.14.6, a map of 9+1bytes * 500 entries, locks only 8k of kernel memory, on sid same thing 5.10-cloud kernel locks .... 45k !! [09:32] yes, because that would move stuff away from users [09:32] that's why bpf map create failed [09:32] I don't think any solution with a single scope makes sense [09:33] but I fully defer to mborzecki as I'm not involved with the decision process anymore [09:33] zyga-mbp: two scopes, one in system and one in the user slice? [09:33] so systemd (or sth else) set a ludicrous the limit, and kernel does silly things [09:33] zyga-mbp: I'm just brainstorming on what I could try to investigate further, not trying to get approval on any decisions [09:34] right, I'm just setting a clear expectation that my words are non binding [09:34] do investigate this [09:34] I could have misunderstood or missed something [09:34] I've spent probably 18 months on this and I have some bruises [09:46] pstolowski I've left a comment on https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1944447 [09:46] Bug #1944447: refresh-app-awareness=true does not seem to concern dependencies [09:46] ijohnson[m] ^ in case you are around [10:07] zyga-mbp: thanks, interesting point. we do something like this for refresh gating (not yet enabled, experimental) [10:09] doh, error: cannot refresh "snapd": unexpectedly empty response from the server (try again later) [10:14] "dont call us, we will call you" [10:18] haha [10:19] I read it as "keep pressing ctrl+R or F5 until the it works" ;) [10:21] PR snapd#10823 opened: sysconfig: set TMPDIR in tests to avoid cluttering the real /tmp [10:31] mvo, miguelpires haha, so the (now failing in Miguel's PR) spread test was introduced wit my https://github.com/snapcore/snapd/pull/8928 exactly to demonstrate the issue with snap remove, see PR description and also the comment from Ian at the bottom ;) [10:31] PR #8928: tests: add spread test for disconnect undo caused by failing disconnect hook [10:38] zyga-mbp yeah thanks I totally agree that we should extend the logic for refresh app awareness to content snaps [10:38] I assume you say comment on your old PR :-) [10:39] *saw my [10:42] ijohnson[m] no, I didn't yet [10:42] a bit backlog on my side [10:48] fwtw we now have affectedByRefresh() helper that does this and certain interfaces may have AffectsPlugOnRefresh flag set on them to drive this logc; it should be reusable with little or no changes [10:58] pstolowski yeah that's great, fwiw after my adventure yesterday, imho I think we can avoid doing it immediately and just release refresh-app-awareness as-is without the feature, but we need to fix this part of refresh-app-awareness before we fix the per-user snap mount namespaces not getting updated bug that zyga-mbp fixed originally in the MS_SHARED PR [10:59] MS_SHARED oh my, that is one huge change [11:26] PR snapd#10824 opened: tests: check that a snap that doesn't have gate-auto-refresh hook can call --proceed [11:35] mvo: i think we should land #10803, as the code is already in master, but failing on debian, I have 2 patches on top that kind of fix it there [11:35] Bug #10803: New message arrival sound does not work [11:35] PR #10803: tests, interfaces/builtin: introduce 21.10 cgroupv2 variant, tweak tests for cgroupv2, update builtin interfaces [12:01] PR snapd#10825 opened: tests: Include the tools from snapd-testing-tools project in "$TESTSTOOLS" [12:31] PR snapd#10826 opened: Bboozzoo/cgroupv2 test and systems with memlock fix [12:32] mvo: this would be the new PR ^^ [12:34] * ijohnson[m] is pleasantly surprised to see he's no the only one who sometimes forgets to set the PR title before hitting open :-D [12:35] hahah [12:35] :-) [12:37] THIS [12:37] :) [12:43] mborzecki: woah, thank you! [12:56] PR snapd#10827 opened: fde: add new device-setup support to fde-setup [13:23] code checkout fail in some github jobs: https://paste.ubuntu.com/p/p3tXyKhT52/ [13:55] any concerns if I merge the secboot update pr (10715)? [13:55] got enough reviews including from chris [14:03] * cachio_ -> app doctor [14:05] mvo: so shall we land https://github.com/snapcore/snapd/pull/10803 ? [14:05] PR #10803: tests, interfaces/builtin: introduce 21.10 cgroupv2 variant, tweak tests for cgroupv2, update builtin interfaces [14:27] mborzecki, zyga-mbp: a couple of small updates: [14:28] hit us [14:28] the issue affects 18.04 as well, whereas so far I haven't been able to reproduce it in 20.04 [14:29] 2. adding a "sleep(1)" in snap-confine right before sc_device_cgroup_attach_pid() in udev-support.c seems to fix the issue. Ship it? ;-) [14:29] ok, that was a joke, but indeed it seems to "fix" it [14:30] could it be that what happens is this: [14:30] 1. we create the new cgroup [14:31] 2. systemd does not notice the group (yet -- timing issue) [14:31] 3. we add the PID to the new cgroup [14:31] 4. systemd sees the PID disappearing from the user.slice cgroup [14:32] 5. systemd sees that the PID is still alive, and moves it back to user.slice [14:35] whereas, if 2 does not happen (that is, systemd sees the new cgroup), the bug won't happen [14:36] mborzecki, zyga-mbp: all the above is just a theory, BTW [14:36] check if systemd has this logic [14:36] that's a nice theory [14:36] it could be it [14:40] zyga-mbp also I looked at your MS_SHARED branch and it looked good to me, we are going to discuss more internally about the way forward but do you mind if I pick it up when the time comes? [14:41] mborzecki: +1 [14:41] I would be honored if anything there is still useful [14:41] mborzecki: in a meeting [14:41] mborzecki: but yeah [14:42] mborzecki: if the failures are unrelated I can merge, just shout [14:44] zyga-mbp: one thing about that branch I was curious about is that in the per-user mount namespace files that are part of the tests, I see now that there are a bunch of mounts that have "shared:XYZ" in the opts flags, but I guess I expected them to now be "master:XYZ", where the per-snap ns has "shared:XYZ" in it, thoughts? [14:45] do you want to talk about it later this week, I would have to refresh my mind; I think that based on what you said that branch may hold more bugs than I would like to admit. [14:45] mvo: yes, please do [14:46] if I can offer any advice, it would be to reduce it to one change at a time and make sure it is convincing us that it's still correct [14:46] zyga-mbp: it doesn't have to be this week, whenever you think you'll have time to look at it [14:47] PR snapd#10803 closed: tests, interfaces/builtin: introduce 21.10 cgroupv2 variant, tweak tests for cgroupv2, update builtin interfaces [14:47] set some time though, otherwise I will just fail at my promise to look [14:48] zyga-mbp: how about next week ? does around nowish in terms of time of day work? [14:49] yeah [14:49] that's good [14:49] zyga-mbp: I could do Monday or Wednesday, does one of those work better for you ? [15:05] huh why is https://github.com/snapcore/snapd/pull/10715 failing all of a sudden on git submodules ... ? [15:05] PR #10715: secboot: move to new version [15:13] is `2021-09-22T15:01:06.1755693Z fatal: No url found for submodule path 'tests/lib/external/snapd-testing-tools' in .gitmodules` known? all builds seems to fail right now [15:18] mvo: I don't know how that is on master now, but maybe it is related to Sergio's unmerged PR: https://github.com/snapcore/snapd/pull/10825 [15:18] PR #10825: tests: Include the tools from snapd-testing-tools project in "$TESTSTOOLS" [15:18] mvo: but I am equally puzzled how this is affecting builds for pr's that do not have code from 10825 [15:20] ijohnson[m]: yeah, I'm very confused [15:24] did something maybe change in github actions config on the repo itself ? [15:27] I don't see anything from the actions page on github [15:29] ijohnson: maybe that's part of the checkout action? [15:29] still, puzzling why it's complaining about tests/lib/external/snapd-testing-tools [15:30] ijohnson: https://github.com/actions/checkout/issues/590 [15:31] waaaattttttt [15:32] I really didn't think it was possible for me to think less of git submodules than I did before, but lo and behold here we are [15:34] cachio_: mvo someone needs to go fix the self-hosted runners then, I closed the PR which triggered this so it doesn't happen again for now, but I'm not sure how to fix the self-hosted runners tbh [15:34] ijohnson[m]: ok, hopefully cachio_ can help [15:34] ijohnson[m]: thanks so much! (also mborzecki ) [15:37] PR snapd#10825 closed: tests: Include the tools from snapd-testing-tools project in "$TESTSTOOLS" [15:39] mvo: do you have a moment for https://github.com/snapcore/snapd/pull/10814 ? [15:39] PR #10814: o/ifacestate: don't lose connections if snaps are broken [15:40] pstolowski: in a meeting unfortunately [15:40] pstolowski: I put it on my list [15:40] mvo: no worries, thanks [15:40] ijohnson[m] both are fine [15:41] ijohnson[m] sorry I was away to take a break and make some coffee [15:42] zyga-mbp: no worries, let's do monday then at 10:00 CDT which I think is 17:00 in central european timezone [15:42] perfect, I'll be around but I'll drop an item to my calendar to rember [15:43] 👍️ [15:54] ijohnson[m], hey [15:54] sorry for the delay [15:55] which is the fix we need? [15:55] I can apply the patch [15:55] cachio_: I don't know the self-hosted runners aren't able to checkout the git repo for snapd [15:56] cachio_: maybe try clearing out any git directories for snapd on the runners ? [15:57] ijohnson[m], ahh, I see the snpad-testing-tools is merged [15:57] cachio_: it's not merged [15:58] ? [15:58] I see [16:01] let me research a bit more how to fix the agents === benfrancis5 is now known as benfrancis [16:52] PR snapd#10828 opened: tests: cleanup the job workspace as first step of the actions workflow [17:31] ijohnson[m], please take a look to #10828 [17:31] PR #10828: tests: cleanup the job workspace as first step of the actions workflow [17:31] Bug #10828: customizing partitions is buggy [17:42] PR snapd#10771 closed: DRAFT: Tests reproduce uc20 boot error <⛔ Blocked> [17:57] ijohnson[m], I am trying subtree instead of submodules [17:57] I see many people complains about submodules [18:19] * ijohnson[m] is one of those people :-) [18:20] cachio_: I'll take a look at the PR [18:21] cachio_: it looks like the tests are passing there again [18:21] yes, I added the submodule using ssh [18:21] and the firewall is blocking this in the test env [18:21] so it failed [18:22] and for some raeson the next job was failing to checkout [18:22] so I added this new step to clean the workspace [18:23] cachio_: yeah it's clearly a bug in the checkout action I think, but good that we can work around it like this [18:23] cachio_: to be clear, maybe git submodules will work and be the best possible solution, but if there are alternatives I'd like to see them evaluated first :-) [18:23] ijohnson[m], in fact I see is pretty common to cleanup the workspace [18:24] the best alternative I see is to use git subtree [18:24] I am going to test it [18:26] ijohnson[m], subtree will work in any environment [18:27] the problem is that we need to keep it updated [18:30] I see [18:31] git submodules should not break anymore if I change the url to use https [18:32] at least it should work well in github actions [18:32] not sure about autopackage tests [18:32] Right it's the other environments I'm concerned about but I'm sure it can be made to work [18:33] I don't know how to test that [18:42] PR snapd#10825 opened: tests: Include the tools from snapd-testing-tools project in "$TESTSTOOLS" [20:58] PR snapd#10829 opened: tests: use our own image for ubuntu impish [21:58] Bug #1944625 opened: snapctl returns EOF