=== JanC is now known as Guest39672 === JanC_ is now known as JanC [05:39] morning [05:54] Hey [05:54] Taking the dog out [05:54] zyga: found anything interesting about the lxd issue? [06:03] zyga: reading through https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1871652 nice find! [06:03] Bug #1871652: snap run hangs on system-key mismatch due to reexec and shutdown [06:14] mvo: hey [06:14] good morning mborzecki [06:15] zyga: hm i was worried that maybe the client gets stuck, but looking at the backtrace the client timeout seems to work [06:16] zyga: i think that the unfortunate part is that the client timeout (overall timeout for all retried requests) is 50s, then *12 retries in snap run, we're looking at 10 minutes after which snap run would fail eventually [06:18] mborzecki: it's debugged [06:18] :) [06:18] mborzecki: if you are talking about the lxd issue [06:18] mvo: good morning :) [06:18] * zyga woke up in good mood and just returned from dog & bike ride [06:18] zyga: good morning! you seem to be in a good mood :) ? [06:18] indeed [06:18] zyga: i know, just looking at the backtrace you posted there, https://github.com/snapcore/snapd/pull/8462 and the client.do() loop [06:18] PR #8462: cmd/snap: don't wait for system key when stopping [06:19] yesterday surely ended on a high note [06:19] I have some more thoughts about how this problem is annoying [06:19] but I think the fix is valid [06:19] mborzecki: yeah, a timeout of 10min seems a bit excessive [06:19] zyga: yeah, I like hte idea to just check for shutdown [06:20] mvo: there's this comment that isn't true anymore https://github.com/snapcore/snapd/pull/8462/files#diff-0ffbc404d8a8e3aaeca8cd9d066c3d71R160 [06:20] PR #8462: cmd/snap: don't wait for system key when stopping [06:20] uhh it's `// connect timeout for client is 5s on each try, so 12*5s = 60s` [06:21] mborzecki: uh, so it looks like we try to accomodate thie situation already? is that check buggy? [06:21] mborzecki: also if the retry timeout should be max 60s but in reality is 10min is there a different bug there too :( ? [06:22] I can debug this further [06:22] I wanted to check the solution in practice [06:22] the socket is there [06:22] but will never activate [06:22] mvo: probably the client retry bits evolved separately [06:22] maybe we just hang on connect? [06:22] ahhh [06:22] fiun [06:22] hhe [06:22] zyga: yeah [06:22] ok, I'll get back to my coffee [06:23] mborzecki: aha, yeah, that makes sense [06:23] but I'm happy :) [06:23] mborzecki: sorry, I see these lines are from an open PR [06:23] if the socket wasn't there it would fail much earlier i believe [06:23] zyga: woah, thanks so much for adding this PR so quickly [06:23] :D [06:23] after breakfast I'll verify this [06:24] and write some tests [06:24] zyga: yeah, having a test there would be great [06:24] zyga: mvo: so actually a funny scenario, the socket we use to talk to snapd is there, but snapd may be inactive, how do you find out that the other end is inactive if poking th socket isn't reliable? [06:24] any issues with spread? [06:26] mborzecki: we should think about how to prevent the bug for real [06:26] mborzecki: I realized it's much harder because the dependency is dynamic [06:26] mborzecki: we depend on the active reexecution target that may be core or snapd and the revisions may change at runtime any number of times [06:27] mborzecki: which is not great [06:31] mvo: so 2.44... 4? [06:31] zyga: 2.44.3 [06:31] perfect [06:31] thank you [06:31] zyga: I hope I can upload that today [06:31] tonight [06:31] something like this :) [07:00] hmm preseed reset is failing in an intresting way [07:01] saw the same rpoblem twice already [07:01] yeah I noticed [07:01] did you debug it more? I didn't look deeper [07:01] the log https://paste.ubuntu.com/p/gTcdq4h8CF/ [07:14] morning [07:15] good morning pawel [07:17] pstolowski: hey [07:17] pstolowski: preseed reset hangs on 20.04 https://paste.ubuntu.com/p/gTcdq4h8CF/ [07:17] pstolowski: but i'm not able to reproduce it manually [07:17] pstolowski: good morning [07:19] mborzecki: interesting, i'll take a look [07:22] doing 'restart all jobs' in gh actions is actually very confusing [07:23] looks like it's first restating the unit tests job, then the canary jobs, and then the stable ones [07:23] and E: Failed to fetch http://pkg.jenkins.io/debian-stable/binary/jenkins_2.222.1_all.deb Could not connect to pkg.jenkins.io:80 (52.202.51.185), connection timed out [07:26] presumably it is respecting the job dependencies: the stable jobs can't restart until the restarted canary jobs have completed, which can't restart until the restarted unit tests have completed [07:27] jamesh: exactly [07:27] mborzecki: yeah, they don't invalidate past results until such jobs actually start [07:27] mborzecki: if it's a one off failure like that just ask mvo to override [07:27] no use in burning money on this [07:31] can I get a 2nd review on green https://github.com/snapcore/snapd/pull/8403 [07:31] PR #8403: sandbox/cgroup: avoid making arrays we don't use [07:31] it's not much and I'd like to get it in and have one less [07:35] jdstrand: thank you for the reviews [07:35] I'll break for breakfast and then get back to work [07:36] zyga: which PR is that? I can override if needed [07:43] pstolowski: I reviewed #8414, thank you [07:43] PR #8414: o/configstate: core config handler for persistent journal [07:43] couple small comments [07:50] pedronis: ty [07:52] mborzecki: yeah, hangs for me too when run on gc. will try to add some debug [07:53] pstolowski: oh, you manged to reproduce it? [07:54] zyga: https://github.com/snapcore/snapd/pull/8462#pullrequestreview-390565291 [07:54] PR #8462: cmd/snap: don't wait for system key when stopping <⚠ Critical> [07:54] mborzecki: it seems so.. it's hanging on < /mnt/cloudimg/var/lib/snapd/desktop/applications, i'm waiting for spread to timeout [07:55] pstolowski: ha, interesting, i ran with -shell and executed the test line by line [07:55] pstolowski: btw. diff -up is easier to read there [07:55] mborzecki: also it's interesting it found a diff [07:58] PR snapd#8450 closed: selinux: export MockIsEnforcing; systemd: use in tests [08:00] mvo: not sure, it was pawel [08:00] mborzecki: ta [08:00] * mborzecki wonders why it's showing `degraded` here [08:01] mborzecki: systemctl --failed [08:02] zyga: yeah, that's the mystery, shadow.service apparenty failed :P [08:02] shadow.service? [08:02] what is that [08:02] I don't have it [08:02] is it related to homed? [08:02] zyga: idk maybe https://paste.ubuntu.com/p/f7hkWx94vQ/ [08:03] which package ships that? [08:04] zyga: surprise surprise.. `shadow` :P [08:05] zyga: btw it runs the following /bin/sh -c '/usr/bin/pwck -r || r=1; /usr/bin/grpck -r && exit $r' [08:09] pwck? [08:09] wow [08:09] I learned something today already [08:09] YoU HaVe BeEn Hax0rEd [08:16] re [08:20] zyga: yeah, from what i managed to find it checks /etc/group against /etc/gshadow [08:20] so is something corrupted on your system? [08:21] zyga: nah, i had a `sudo` group at some point that was listed in gshadow, but got removed from /etc/group [08:23] zyga: probably something went out of sync during one of my arch 'installs', that's actually rsyncing whole sysroot from another arch system, wouldn't be surprised since the actuall install from scratch was years ago [08:43] mborzecki: can you look at https://github.com/snapcore/snapd/pull/8462 again please [08:43] PR #8462: cmd/snap: don't wait for system key when stopping <⚠ Critical> [08:50] zyga: thanks for adding the test to 8462 [09:01] mvo: I'll verify this in the machine where it is easy to reproduce now [09:02] zyga: \o/ [09:14] I believe I can write a spread test for this as well [09:29] zyga: extra browny points if that is possible and not too much work [09:29] mvo: I think so, just a moment to know [09:30] zyga: \o/ [09:30] mvo: gotta justify a push to fix that silly typo :D [09:45] test in progress [09:51] * mvo hugs zyga [10:01] morning folks [10:01] hey zyga I saw this failure for session-tool on one of my PR's that is relatively up to date with master https://pastebin.ubuntu.com/p/tj9qTTN6Wf/ [10:01] looks like it's a gdm issue with 19.10? [10:01] yeah [10:01] I saw it, I asked sergio to remove gdm [10:01] hi ijohnson ! [10:01] ok [10:01] I'll send a patch to stop the gdm session [10:01] o/ pstolowski [10:01] though [10:01] maybe I should remove that part of the test [10:01] it's not like we are leaking our sessions [10:02] it's just a goose chase [10:02] ijohnson: what do you think? [10:02] mmm it's a bit annoying, but in this instance also genuinely useful to have it tell us that something is on the image that is leaking state around [10:02] I dunno [10:03] is it related to the bug sergio raised recently? https://bugs.launchpad.net/snapd/+bug/1868857 [10:03] Bug #1868857: Installing evolution-data-server on test images pulls in GDM and the desktop [10:04] pstolowski: yes [10:04] after checking and checking I managed to convince him we have GDM :) [10:07] brb [10:10] mmm okay another random failure from overnight about being unable to connect to the systemd user session: https://pastebin.ubuntu.com/p/QdyVGHHDrJ/ [10:11] mborzecki: this preseed-reset hang issue is misterious; i added debug that should show up right after the last diff line where it hangs , but it's quiet :/ [10:11] pstolowski: core-persistent-journal is failing on core16 [10:12] pedronis: yeah, i've seen this, didn't happen when run locally, investigating [10:15] pedronis: mvo: i'm looking into the mount point rename [10:27] mborzecki: thank you [10:30] mvo: pedronis: i'll split the /run/mnt/host bind mount till after we have the directory in the core snap [10:31] i mean the /host directory [10:31] +1 [10:32] afk for another moment, sorry :/ === Aavar_ is now known as Aavar [10:42] mvo: could you use your magical powers to merge https://github.com/snapcore/snapd/pull/8451 ? It's been restarted numerous times and all the current failures there have either been reproduced and known by others, or have been reported [10:42] PR #8451: osutil: mock proc/self/mountinfo properly everywhere [10:44] ijohnson: sure [10:44] PR snapd#8451 closed: osutil: mock proc/self/mountinfo properly everywhere [10:44] thank you \/o [10:44] oh whoops I was too excited [10:45] \o/ [10:45] haha [10:51] pstolowski: now, it passed, it seems flakey somehow [10:55] pedronis: yes, maybe there is something flaky. i'm running it in a loop locally now [11:03] re [11:05] PR snapd#8464 opened: cmd/snap-boostrap, boot: use /run/mnt/data instead of ubuntu-data [11:05] mvo: pedronis: ^^ [11:05] i did not add the /run/mnt/data -> /run/mnt/ubuntu-data bind mount too, let's see if the tests pass [11:07] mborzecki: they won't, actually initramfs will need some changes [11:07] hmm ah right, there's some hard coded names there too [11:07] mborzecki: https://paste.ubuntu.com/p/cF4NVBChbG/ [11:08] looks like the bind mount data -> ubuntu-data could make it work tho [11:08] yes, but we do want then to change the initrd [11:09] because otherwise is a bit too many level of mounts [11:09] mborzecki: also my pastebin has type -d (not sure why I did that), so there's a couple more things actually [11:10] pedronis: just run it 10 times without failure. weird. will give it one more spin [11:10] pedronis: mborzecki: suspicious that the initrd has things like this: [11:10] echo 'LABEL=ubuntu-boot /run/mnt/ubuntu-boot auto defaults 0 0' >> /run/image.fstab [11:11] that seems like it would defeat the purpose of our cross-checking no? [11:11] ijohnson: it's optimizing some mounts [11:11] ijohnson: you'll have to discuss what that means [11:12] mmm yes [11:12] pedronis: is the results of the discussion this morning summarized somewhere ? [11:13] ijohnson: seems you got the doc [11:13] yes mborzecki PM'd it to me [11:14] zyga: btw i've re-requested your review of #8414 as it changed substantially [11:14] PR #8414: o/configstate: core config handler for persistent journal [11:14] ack [11:14] I'll look in 10 minutes [11:18] pedronis: I left a comment in the doc, so will we now have /run/mnt/boot instead of (or in addition to) /run/mnt/ubuntu-boot ? [11:18] ijohnson: no [11:18] ok, so the changes are just for ubuntu-data really [11:19] (and all alter egos of ubuntu-data) [11:19] ijohnson: yes, we'll have temporarely both data and ubuntu-data until initramfs is fixed [11:19] sure [11:28] pstolowski: looking [11:33] jdstrand: possibly I'm still broken? https://forum.snapcraft.io/t/snapcraft-and-strict-multipass-call-for-testing/16488/5 [11:33] diddledan: yeah that's not something we're doing [11:33] strangely at least two snaps have started fine, but those are desktop apps and I spent some time cooking toast [11:34] ... after reboot, so I was well past apparmor starting when I logged-in [11:35] Saviq, yeah LXD is also dead [11:38] pstolowski: +1 [11:39] diddledan: have a look at https://github.com/ubuntu/zsys/issues/60#issuecomment-609729305 for what fixed things for me on zfs root - not snapd, but the overall problem may be the same - look in the journal for things refusing to mount due to target not being empty [11:44] it's not that.. I have a correct set of files in /etc/zfs-list.cache and there are no failed mounts in the journal [11:46] mvo: sorry for the lag, i have a test [12:04] I'm running a few more iterations to recheck if fails without the fix and to remove redundant parts [12:04] I'll push the final version before the standup [12:05] most likely in 20 minutes, after the next run [12:06] pedronis: 100 runs and no failure; i wonder if we were seeing a failure from before USR1 commit [12:06] pstolowski: maybe, let's see if it gets green and can land [12:07] pedronis: doh.. it failed on 20.04 [12:08] pedronis: on preseed-reset, which is the other issue i'm investigating [12:08] pstolowski: could you recheck quickly my #8436 , I had to change the spread test because I remember it passing on core20 but actually the systemd there now uses a different property name [12:08] PR #8436: configcore,tests: use daemon-reexec to apply watchdog config [12:09] pstolowski: maybe we need an explicit journalctl --flush ? or do some activity that is none to produce logs? [12:09] pedronis: looking [12:09] s/none/known/ [12:10] pedronis: maybe, but it seems that enabling logging writes a single starting entry, so only question is if flush is needed [12:11] pedronis: but the problem is now preseed-reset test which breaks with master [12:12] diddledan: can you perform: sudo systemd-analyze plot > ./1871148-vm-no-varlib-mount_diddledan.svg' and attach it to https://bugs.launchpad.net/apparmor/+bug/1871148? [12:12] Bug #1871148: services start before apparmor profiles are loaded [12:12] (without that trailing "'" of course [12:12] ) [12:21] jdstranddone :-) [12:21] jdstrand done :-) [12:22] aaha, we started shipping var/lib/snapd/desktop/applications in the pkg, that's the primary reason of preseed-reset test failure === pedronis_ is now known as pedronis [12:38] diddledan: https://bugs.launchpad.net/snapd/+bug/1871148/comments/24 [12:38] Bug #1871148: services start before apparmor profiles are loaded [12:39] mvo, zyga: ^ I added a snapd task. please see my comment. it seems that root on zfs is aggravating the condition that apparmor.service might start after snap services [12:39] looking [12:40] since we don't see it on non-root-on-zfs systems (even though the possibility is there) [12:40] yeah, I think we need to think about how to handle this [12:41] mvo: this is possibly another 2.44 point release. up to you to decide, but with focal making zfs an option in the installer, and that seems to push the system into this bug more than others, ... [12:42] PR snapd#8465 opened: tests: update snap-preseed --reset logic to acommodate for 2.44 change <⚠ Critical> [12:44] pedronis, hey [12:44] zyga: I need to step away, but maybe this is the time to align with non-Ubuntu-but-apparmor-enabled systems? I forget the details, but iirc, there is an additional snap-apparmor unit or similar that can be After apparmor, and then snapd can add After snap-apparmor to the units. I defer to you, mvo, etc on the design and am happy to review a PR [12:44] I see this error on uc20 nested tests [12:44] https://paste.ubuntu.com/p/sBWXC4VZyG/ [12:44] is it something new? [12:44] first time I see this [12:44] jdstrand: that's a brilliant idea [12:44] jdstrand: we can just no-op if apparmor is of [12:44] *off [12:44] yeah [12:44] and we can always put the dependency into the units [12:44] yeah [12:44] mvo: ^ that's a solution that's easy [12:45] I can look at this after the current bug [12:45] diddledan: thank you for your persistence :) [12:45] * jdstrand -> steps away [12:45] zyga, jdstrand works for me [12:46] pstolowski, hey, I see also this error in nested test for uc20 Apr 09 12:08:53 ubuntu snapd[720]: hotplug.go:131: internal error: cannot get global device context: broken assertion storage, looking for model: broken assertion storage, cannot decode assertion: asser [12:46] PR #9: Added the travis config file [12:46] pstolowski, any idea? [12:48] cachio: no, maybe core20 requires something new to be done in that area, i'll need to investigate [12:48] pstolowski, thanks [12:48] #8465 should unbreak master [12:48] PR #8465: tests: update snap-preseed --reset logic to acommodate for 2.44 change <⚠ Critical> [12:48] it sounds like some code is running before seeding is done [12:48] mborzecki: ^ [12:48] I thought that code waiting on seeding [12:49] today is sponsored by tag [12:49] heh ;) [12:49] zyga: yeah, it totally is [12:49] it's the N-days before the release feeling [12:50] pedronis: it doesn't wait [12:50] pstolowski: it waits for the system snap to be there at least [12:50] pedronis: right, that's true [12:51] mborzecki: for clarity, i pinged you about 8465 [12:51] pstolowski: figured ;) [12:52] mvo: pedronis: added the compatibility bind mount, i can succesfuly go through the install mode, but it hangs in initramfs in run mode [12:52] zyga: so you are suggesting all our snap unit have "After=apparmor.service", is that what you said earlier? are you on it? should I? [12:52] mvo: pedronis: pushed a patch to #8464 anyway [12:52] PR #8464: cmd/snap-boostrap, boot: use /run/mnt/data instead of ubuntu-data [12:52] mborzecki: please do, maybe something for dimitri [12:53] pstolowski: do we need 8465 for 2.44 as well as a cherry pick? [12:53] mvo: we have no way to rewrite atm though [12:53] to rewrite units [12:53] pstolowski: do you know what that blocked the test? [12:54] pstolowski: it would actually hit the kill timeout [12:54] pedronis: yeah, but at least all new focal install with zfs will not be affected if we have it now [12:54] pstolowski: that's really weird because 131 is definitely after we are getting events [12:56] something is very broken [12:57] mborzecki: not yet, as i said in the comment and standup notes i'm investigating; maybe qemu-nbd hangs when we leave execute. if i re-arrange the test to first unmount and clean up and then fail on diffing, it fails as expected and doesn't hang [12:58] mvo: probably yes === hggdh is now known as hggdh-msft [13:28] mvo: I've verified that the spread test fails without the fix and passes with the fix [13:29] mvo: I'll jump into the apparmor issue in a moment, after the call [13:30] zyga, I take it from that the fix doesn't work? I'm understanding how CI testing works, right? [13:31] diddledan: ? [13:31] :-p [13:31] diddledan: hopefully not :) [13:31] fail test == fix works; passing test == shruggy shoulders no idea [13:32] maybe it works? [13:32] zyga: \o/ thank you [13:32] #shipit! [13:38] mvo: I pushed the test now [13:44] that was a nice bug [13:45] jdstrand: I'm looking after apparmor now [13:54] mvo: have a look at 8462 again please [13:54] maybe mborzecki as well [13:54] I'll get rid of my plate, grab coffee and jump into apparmor and zfs [13:55] zyga: sure, will do [13:56] pedronis: I merged master into 8424, the only thing missing there are tests for lsblk.go and the reimplementation of lsblk with sysfs + udev, which probably means we need to rename the file or maybe move it to it's own package somewhere [13:56] but all the logic in boot and cmd_initramfs_mounts should be there and that is tested and ready to review [13:59] ijohnson: so I should focus on cmd_initramfs_mounts ? and a bit on boot , if I understand correctly [13:59] pedronis: yes [14:00] ok, having a break and then I will look [14:00] thanks [14:02] zyga: btw, I didn't says this in the standup but I definitely prefer our snap services' units to be after something we control (or well defined from systemd) than a 3rd package service, we can control the dep on that one in one place at least [14:02] +1 [14:02] yeah, I strongly agree [14:02] this is so much cleaner than vague dependency on apparmor in each of the service files we write [14:06] mvo: the script ijohnson linked requires greasemonkey which works on pretty much every browser [14:08] zyga: hah, so when a workflow is successful, there's no way to restart it [14:08] pedronis: should I cherry pick 8459 (omit many snap-ids) for 2.44.3 too? [14:08] pstolowski: nice! [14:10] mborzecki: yeah :) [14:10] mborzecki: I have some ideas on that though [14:10] omg, close/reopen didn't trigger anything [14:10] oh w8, it did [14:11]