=== mup_ is now known as mup === Eickmeyer is now known as Eickmeyer-Quasse === Eickmeyer-Quasse is now known as Eickmeyer[q] [05:50] morning [06:20] o/ [06:22] zyga: hey [06:22] zyga: some trouble with the cla-check job [06:22] uc20-snap-recovery failed [06:22] zyga: where? [06:22] but it ran on 19.10 [06:22] https://github.com/snapcore/snapd/pull/8440/checks?check_run_id=569193826 [06:22] PR #8440: github: move spread to self-hosted workers [06:23] zyga: uh, merge master [06:23] is that even expected? [06:23] known issue? [06:23] zyga: yes, it's fixed already [06:24] k [06:25] how did cla check fail? [06:25] it passed on my branch just now [06:25] 38seconds [06:26] meanwhile, travis is broken [06:26] https://t.co/h3UEAleWVW?amp=1 [06:27] I think I can just go back to bed [06:27] zyga: if you open a PR with a commit right on top of the master so that no merge commit is generated it will fail [06:27] I see [06:57] PR snapd#8439 closed: secboot: import secboot on ubuntu, provide dummy on !ubuntu [07:01] morning [07:01] good morning pstolowski [07:02] zyga: quick question, do we have a 32bit machine in travis actions? [07:03] mvo: travis actions? [07:03] zyga: sorry, gh actions [07:03] mvo: as I said yesterday I didn't add a 32bit xenial machine to github actions [07:03] mvo: though it's a one-liner in the matrix, it slipped through the cracks in the initial PRs [07:04] good morning :) [07:04] last night store went belly up [07:04] and everything running failed one way or another [07:04] so I just called it quits and went to sleep (too late anyway) [07:05] mvo: pstolowski: hey [07:05] PR snapd#8455 opened: tests/lib/cla_check: expect explicit commit range [07:05] zyga: can we skip the spread jobs? [07:05] mborzecki: in principle yes but it's not something we coded, we should try that if: ... expression I pasted before [07:05] one sec [07:05] maybe add that to your PR [07:06] contains(github.event.issue.labels.*.name, 'skip-spread') or somesuch? [07:06] yes [07:06] if: !contains ... [07:06] idk tho, just copied and pasted from the docs :P [07:06] :) [07:06] I tried to get https://github.com/snapcore/snapd/pull/8440 green [07:06] PR #8440: github: move spread to self-hosted workers [07:07] but each time something random failed [07:07] PR snapd#8456 opened: tests: add 32 bit machine to GH actions [07:07] some desktop service, some store bits, some reboot tests [07:07] so tough luck [07:18] mvo: could you please merge https://github.com/snapcore/snapd/pull/8454 [07:18] PR #8454: tests/session-tool: session ordering is non-deterministic [07:22] zyga: hm the docs are kinda meh [07:23] PR snapd#8457 opened: github: skip spread jobs when corresponding label is set [07:32] mborzecki: interesting, except that the status check is required [07:32] mborzecki: perhaps instead wrap that in ${{ }} [07:32] and have the worker essentially do nothing? [07:33] mborzecki: ${{ .. }} is required in run blocks [07:33] zyga: hm which pr? [07:33] your pr [07:33] there's 2 ;) [07:33] 8457 [07:33] and there's a syntax error [07:34] I would drop the first part [07:34] as all events are pull reqeusts [07:34] let me pull the docs [07:35] if: contains(github.event.pull_request.labels.*.name, 'Skip spread') [07:35] then just negate [07:35] if: !contains(github.event.pull_request.labels.*.name, 'Skip spread') [07:35] but as we learned, that should not go into if because then the status check wont report [07:35] so maybe: [07:36] run: | echo ${{ !contains(...) 
}} [07:36] and see what that prints (probably true as that is just js) [07:36] heh [07:36] then wrap that into a shell [07:36] and should be good [07:36] i mean, wtf are the docs about labels? [07:36] they are there [07:36] hold on [07:36] it's somewhat confusing because they are not in the action docs [07:36] but in the bigger github docs [07:36] the whole object model is documented [07:37] https://developer.github.com/v3/issues/labels/ [07:37] by doing ${{ ... }} you're effectively tapping into that [07:38] zyga: the pull request event is this: https://developer.github.com/v3/activity/events/types/#pullrequestevent doesn't list the label there but it's in the example [07:38] and it's an empty array [07:39] however, there's actually an example in the issues event payload [07:39] https://developer.github.com/v3/pulls/ has the labels listed [07:40] zyga: re 8454 sure, I will merge once the spread tests finished, they are still running [07:40] thanks [07:41] one test already failed [07:41] on portal info [07:41] zyga: oh, ok. is james aware of the flakiness here? [07:41] I don't know [07:41] it's in spread-unstable so perhaps nobody noticed? [07:42] jamesh: can you please check if this is expected [07:42] aha, could be [07:42] https://github.com/snapcore/snapd/pull/8454/checks?check_run_id=569207445#step:4:814 [07:42] PR #8454: tests/session-tool: session ordering is non-deterministic [07:43] fedora failed to prepare, network error [07:44] zyga: idk, i think that the labels is not actually included there [07:44] mborzecki: where specifically? [07:44] zyga: is the pull_request object is the same as pull_request in https://developer.github.com/v3/activity/events/types/#pullrequestevent then the label is not htere [07:44] but should be? [07:44] idk [07:44] pull request *event* [07:44] zyga: It isn't expected. If you're seeing this error, then it can't map the process ID to a snap via cgroups [07:44] refers to pull request [07:45] that has labels [07:45] jamesh: fun, I guess it is debug time then [07:56] mvo: https://github.com/snapcore/snapd/pull/8456/files [07:56] is the vendor change expected? [07:56] PR #8456: tests: add 32 bit machine to GH actions [07:56] mvo: https://github.com/snapcore/snapd/pull/8440 is green [07:56] PR #8440: github: move spread to self-hosted workers [07:56] but let's chat about that in the call [07:56] zyga: don't think that check works https://github.com/snapcore/snapd/pull/8457 looks like the spread jobs are still schedule [07:56] PR #8457: github: skip spread jobs when corresponding label is set [07:56] d [07:57] mborzecki: how do you determine that? [07:57] mborzecki: they are required, so they are marked as expected [07:57] mborzecki: note that normally you don't get any jobs until the previous pass is successful [07:58] so I don't believe this is accurate as measurement [07:58] ok, let's wait then [07:58] zyga: yeah [07:59] ah [07:59] I see the 2nd commit now [07:59] cool [07:59] thanks [08:11] PR snapd#8440 closed: github: move spread to self-hosted workers [08:15] mborzecki: one option would be to move the if: clause down to the step level [08:16] zyga: have you seen the 'cancel workflow' request to have any effect? 
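The portal-info failure jamesh explains above comes down to mapping a process ID to a snap via cgroups. A minimal sketch of that kind of lookup, assuming snapd-style `snap.<snap>.<app>…` scope names and the standard /proc/<pid>/cgroup layout; the function name and details are illustrative, not snapd's actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// snapNameFromPid is a rough sketch (not snapd's real code) of mapping a
// process to a snap by scanning /proc/<pid>/cgroup for a path element that
// looks like "snap.<snap>.<app>...", which is how the transient scopes and
// services created for snap applications are named.
func snapNameFromPid(pid int) (string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line is "<id>:<controllers>:<path>".
		fields := strings.SplitN(scanner.Text(), ":", 3)
		if len(fields) != 3 {
			continue
		}
		base := filepath.Base(fields[2])
		if strings.HasPrefix(base, "snap.") {
			// e.g. "snap.firefox.firefox-<uuid>.scope" -> "firefox"
			parts := strings.Split(base, ".")
			if len(parts) >= 3 {
				return parts[1], nil
			}
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("cannot map pid %d to a snap via cgroups", pid)
}

func main() {
	name, err := snapNameFromPid(os.Getpid())
	fmt.Println(name, err)
}
```

If no cgroup path element carries a snap security-tag-like name, the lookup fails, which is the situation behind the "cannot map the process ID to a snap" error discussed above.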
[08:16] jamesh: supposedly job level `if` is supported now https://github.blog/changelog/2019-10-01-github-actions-new-workflow-syntax-features/ [08:16] mborzecki: it's not quite as efficient since a job would still be sent to a runner, but it would mean the job would be considered successful [08:17] unless it isn't :/ idk, maybe i just need to wait [08:17] mborzecki: yes, but if the conditional causes the job not to run, then it isn't considered successful [08:18] if you want to get rid of the "Some checks haven’t completed yet" message, the jobs need to at least do something [08:25] mvo: there's a problem with the -32 bit build [08:25] src/github.com/snapcore/snapd/vendor/github.com/chrisccoulson/go-tpm2/mu.go:267:17: constant 4294967295 overflows int [08:26] chrisccoulson: ^ FYI [08:26] mborzecki: IIRC cancelling works but spread doesn't cancel and the worker is killed [08:33] zyga: I know, I updated the PR that adds 32bit works, it should have a fix [08:34] maybe the hash is wrong? [08:34] zyga: oh, let me double check :( [08:34] zyga: could be that govendor confused me [08:34] when you push again merge master please [08:35] zyga: sorry, I'm an idiot, I updated go-tpm instead go-tpm2 [08:35] * zyga hugs mvo [08:36] https://github.com/snapcore/snapd/pull/8403 needs a 2nd review [08:36] PR #8403: sandbox/cgroup: avoid making arrays we don't use [08:37] it failed on store traffic: - Fetch and check assertions for snap "test-snapd-content-slot-no-content-attr" (1) (error reading assertion headers: read tcp 10.240.1.50:58298->91.189.92.20:443: use of closed network connection (Client.Timeout exceeded while reading body)) [08:41] PR snapd#8458 opened: github: allow cached debian downloads to restore [08:41] jamesh: https://github.com/snapcore/snapd/pull/8458 [08:41] PR #8458: github: allow cached debian downloads to restore [08:41] this should fix the cache [08:42] though I think it looks only in the scope of the PR, there's still more opportunity to cache things than we exploit [08:42] (caches are associated with objects and are not global) [08:44] brb [08:46] I suspect caches are probably scoped to the (repo, user) pair [08:53] * zyga monitors https://github.com/snapcore/snapd/actions?query=is%3Aqueued [09:01] PR snapd#8421 closed: tests: enable unit tests on debian-sid again [09:03] mvo: that seems to have fixed things [09:03] oh, I spoke too soon [09:03] mvo: src/github.com/snapcore/snapd/vendor/github.com/snapcore/secboot/utils.go:73:37: cannot call non-function he.TPMError.Code (type tpm2.ErrorCode) [09:03] I think this commit is not good :/ [09:04] why didn't this get flagged by the unit test run? [09:04] are we not building / testing secboot? [09:04] ahh wait [09:04] that's weird [09:04] ah, snapcore/secboot is a different repository [09:04] oh well [09:05] (we don't seem to test anything there in CI) [09:06] zyga: meh [09:06] but at least the tests were quick now :) [09:08] zyga: haha, yes. 
but that's slightly annoying that this fails [09:10] zyga: one more try [09:10] ok [09:10] still 0 queued [09:11] (which is good) [09:15] mborzecki: thanks for the suggestion in https://github.com/snapcore/snapd/pull/7614 [09:15] updated [09:15] PR #7614: cmd/snap-confine: implement snap-device-helper internally [09:16] still 0 queued [09:16] mvo: I also wonder if actions are more heavily used in US, making afternoon "harder" [09:17] I've always found CI runs faster before you Europeans wake up [09:18] mborzecki: could you look at https://github.com/snapcore/snapd/pull/7825 and tell me if you think it's work splitting [09:18] I think it is more a case of two groups of users using CI at once [09:18] PR #7825: many: use transient scope for tracking apps and hooks [09:18] I could take the go bits that do cgroup scanning out and push separately [09:18] jamesh: haha, yeah [09:20] heh, as jamesh commented, https://github.com/snapcore/snapd/pull/8457 does appear to be stuck [09:20] PR #8457: github: skip spread jobs when corresponding label is set [09:20] the unit tests job should run though, but it hasn't yet [09:22] wierd, i'll wait a little bit longer [09:22] could it have rejected the workflow entirely? [09:27] idk, clearly something is off [09:27] one job queued [09:28] (all 32 spread workers are busy) [09:28] mborzecki: werid [09:28] mborzecki: can you rebase on master and push? [09:29] at 32 spread runs I'm seeing roughly 1MB/s in and 1MB/s out [09:29] that's not too terrible [09:29] it spikes to 10MB/s [09:29] especially when new jobs kick in and there's the initial sync [09:29] zyga: where do you see that? [09:29] spread has an inefficiency where the starting worker pushes the same tarball to each node [09:29] mborzecki: on the machine running spread workers [09:30] we could optimize that traffic down by just sending the tarball once and then fetching it from the cloud [09:36] pedronis: hi. currently FilesystemOnlyApply skips core-only handlers if release is classic; i think this needs to be relaxed for image/setupSeed with a flag passed down to FilesystemOnlyApply; makes sense? [09:39] pstolowski: let me look [09:44] pstolowski: yes, the cleanest thing is probably for the package not use release.OnClassic at all, and get info through some options [09:45] pedronis: k, thanks for confirming [10:05] core 18 revert tests failed: https://github.com/snapcore/snapd/pull/8454/checks?check_run_id=570248002 [10:05] PR #8454: tests/session-tool: session ordering is non-deterministic [10:06] + snap list [10:06] error: cannot list snaps: cannot communicate with server: timeout exceeded while waiting for response [10:10] PR snapd#8454 closed: tests/session-tool: session ordering is non-deterministic [10:15] ogra where should bugs about ubuntu core images be filed? [10:16] actually, probably a bug in the installer, is that subiquity on core? (the first run thing) [10:21] * popey starts a forum thread. [10:26] mborzecki: TBH I really wish there were type annotations [10:26] reading foreign python code is like "where are the types" :( [10:30] mborzecki: did you try adding any annotations? 
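pedronis suggests above that FilesystemOnlyApply should stop consulting release.OnClassic and receive that information through options instead. A minimal sketch of that shape, with an invented package name and field names rather than the eventual snapd API:

```go
package configsketch

// FilesystemOnlyApplyOptions is a sketch of the options struct discussed
// above: instead of the package asking release.OnClassic itself, the caller
// (snapd on a classic system, or image/setupSeed building a core image)
// states explicitly which handlers should run.
type FilesystemOnlyApplyOptions struct {
	// Classic is true when applying on a classic system; image/setupSeed
	// can leave it false even when the build host itself is classic.
	Classic bool
}

// FilesystemOnlyApply applies configuration directly to the filesystem under
// rootDir, running core-only handlers unless opts.Classic is set.
func FilesystemOnlyApply(rootDir string, opts *FilesystemOnlyApplyOptions) error {
	if opts == nil {
		opts = &FilesystemOnlyApplyOptions{}
	}
	if !opts.Classic {
		// core-only handlers (e.g. bootloader-related config) go here
	}
	// handlers shared between classic and core go here
	return nil
}
```

The point of the design is that the same code path can be exercised from an image build on a classic host without lying about release.OnClassic globally.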
[10:34] zyga: not really, i've had enough fun with implementing the chooser ui [10:34] zyga: anyways if you want to play with it, better talk to mwhudson first [10:37] mborzecki: https://github.com/CanonicalLtd/subiquity/pull/692#pullrequestreview-389844549 [10:37] * zyga goes upstaris to make tea [10:37] PR CanonicalLtd/subiquity#692: console_conf: various recover chooser tweaks [10:37] we are running at 23/32 workers now [10:37] we've reached saturation once for about 20 minutes [10:39] mvo: I made some comments in #8325, some are really general hindsight questions [10:39] PR #8325: snap-bootstrap: copy auth data from real ubuntu-data in recovery mode [10:45] PR snapd#8458 closed: github: allow cached debian downloads to restore [10:47] PR snapd#8448 closed: tests/session-tool: add session-tool --dump [10:47] thanks! [10:48] pedronis: thanks, will look in a wee bit, looks like it is closed, I will try to get it to a landable point today :) [10:51] popey, yeah, subiquity is correct [10:52] popey, but the issue is indeed the clock ... [11:02] mvo: I don't know, there are some open questions [11:03] mborzecki: https://github.com/CanonicalLtd/subiquity/pull/692#pullrequestreview-389870317 [11:03] PR CanonicalLtd/subiquity#692: console_conf: various recover chooser tweaks [11:04] zyga: thanks! [11:07] ogra ok [11:22] mvo: can you merge #8449, it's all green but travis never came back or started, afaict ? [11:22] PR #8449: dirs: don't depend on osutil anymore, mv apparmor vars to apparmor pkg [11:25] pedronis: sure [11:26] PR snapd#8449 closed: dirs: don't depend on osutil anymore, mv apparmor vars to apparmor pkg [11:30] core 20 recovery design [11:30] MAGA - make appliance good again [11:30] * zyga hides [11:31] we are at 3/32 workers [11:31] though it will go back to ~20 once canary jobs are done [11:32] MAGA ? so should we deny it exists until it hits us hard ? :) [11:37] ogra: you mean another customs war? [11:39] I started implementing snapctl refresh-available [11:39] should have a simple version today [11:39] but first, *hot* tea [11:39] the office is horribly cold even today [11:40] I need a 2nd review for https://github.com/snapcore/snapd/pull/8403 [11:40] PR #8403: sandbox/cgroup: avoid making arrays we don't use [12:08] PR snapd#8459 opened: asserts: it should be possible to omit many snap-ids if allowed, fix [12:15] pedronis: ^ gofmt [12:17] no, you go fmt! [12:26] PR snapd#8460 opened: tests/session-tool: kill cron session, if any [12:26] pedronis: ta [12:42] I'm seeing failures on core-16-64, that are not obviously bogus [12:44] what kind of failures? [12:44] I saw two kinds today: [12:44] - reboot that went nowhere [12:45] - snap rollback and timeout on "snap list" [12:45] that felt really broken [12:45] mborzecki: https://github.com/snapcore/snapd/pull/8457/checks?check_run_id=570718915 <- cache of debian deps worked! [12:45] PR #8457: github: skip spread jobs when corresponding label is set [12:45] zyga: possibly, yes, reboot that went nowhere, but it seems new and real [12:45] mborzecki: I wonder if we can set cache scope to "global" to make sure everyone benefits [12:45] pedronis: I saw the reboot failure about twice last week as well [12:46] but never when testing with -debug to see :/ [12:46] mborzecki: spread-canary started on your skip label PR [12:46] mborzecki: and it works!!! 
[12:46] mborzecki: cool [12:47] mborzecki: with some extra love you could set a status label that shows it was skipped [12:47] but the feature works :) [12:47] zyga: uhh i don't like it though [12:47] mborzecki: why? [12:47] zyga: we still need to take as many workers as distros [12:47] mborzecki: but not spread Vms [12:47] mborzecki: that's nearly free [12:47] mborzecki: they all passed now [12:47] mborzecki: it adds ~30 seconds [12:48] and it's green - except for "pending travis" [12:48] hahah [12:48] mvo: https://github.com/snapcore/snapd/pull/8457 <- [12:48] PR #8457: github: skip spread jobs when corresponding label is set [12:48] no surprises there [12:48] * zyga hugs maciek [12:48] thank you :) [12:53] now, i still need to figure out that cla check [12:54] looks like there's a difference in what gets merged where between gh and travis [12:56] mvo: src/github.com/snapcore/snapd/vendor/github.com/chrisccoulson/go-tpm2/mu.go:267:17: constant 4294967295 overflows int [12:57] mvo: this now breaks master .deb builds [12:57] mborzecki: you can change how we check out things [12:57] mborzecki: there's also plenty of 3rd party solutions for this but I didn't look deeper [12:57] mborzecki: one was cool though, each CLA signature was a signed file in the repo [12:57] mborzecki: so the check was entirely offline [12:57] zyga: it's because we are getting a pc-kernel update in the middle of the tests [12:58] ohh [12:58] pedronis: did you reproduce it [12:58] zyga: there's an action ready for that, broought to you by SAP (?!) [12:58] zyga: no, but the log is obvious [12:58] pedronis: we should probably hold refreshes for snaps that cause reboots [12:58] once you look at it and that the tests [12:58] mborzecki: yes, SAP [12:58] https://github.com/cla-assistant/github-action [12:58] mborzecki: fun world :) [12:58] zyga: we don't have single snaps holding [12:58] I read that one [12:58] but it's still strange [12:58] pedronis: oh... right [12:58] hmm [12:58] but why doesn't it come back? [12:58] why would we try to refresh kernel anyway [12:58] maybe really buggy? [12:58] something is off [12:58] oh [12:58] standup time [12:59] * zyga needs to check one thing first [12:59] zyga: because we make reboot slow explicitly [12:59] the test don't really support unexplicit [12:59] reboots [12:59] ah, indeed [12:59] these tests are not meant to reboot [13:00] PR snapd#8457 closed: github: skip spread jobs when corresponding label is set [13:00] zyga: anyway I do think we want to add per-snap holding at some point, just not clear when [13:09] I just accidentally published a snap to beta, when nothing was published in beta before. Can I undo that? [13:10] rbasak: yes [13:10] rbasak: snapcraft close your-snap beta [13:10] Ah, I found a "close" option? [13:10] Got it. Thanks! [13:10] πŸ‘ [13:12] While I'm on the topic, is there any way I can unpublish the i386 snaps (on a different snap)? We don't build those now, so the ones that are there are way behind and probably useless now. [13:18] zyga: yeah, trying to fix it in 8456 [13:23] zyga: note, I re-read through PR 8408 yesterday (though after it was merged; it was fine) [13:23] PR #8408: snap/naming: add validator for snap security tag [13:23] jdstrand: thank you! 
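The go-tpm2 failure quoted twice above ("constant 4294967295 overflows int") is a plain Go portability issue: on the i386 build `int` is only 32 bits, so the untyped constant does not fit. A small illustration (not the go-tpm2 code itself), with an invented constant name:

```go
package main

import (
	"fmt"
	"math"
)

// On 64-bit platforms int is 64 bits and 4294967295 fits; on 386/armhf int is
// 32 bits and the compiler reports "constant 4294967295 overflows int", which
// is exactly the error from the vendored go-tpm2 update. Giving the constant
// an explicit fixed-width type keeps it portable across architectures.
const maxHandle uint32 = 4294967295

func main() {
	// var bad int = 4294967295 // fails to compile on 32-bit targets
	fmt.Println(maxHandle == math.MaxUint32) // true everywhere
}
```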
[13:24] zyga: PR 7614 and PR 7825 are very high on the list [13:24] jdstrand: thanks [13:24] PR #7614: cmd/snap-confine: implement snap-device-helper internally [13:24] PR #7825: many: use transient scope for tracking apps and hooks [13:24] zyga: the apparmor upload and some training I gave set me back a bit, but I will be getting to them [13:24] jdstrand: both fail in CI now, one on silly thing and one (f31 or f32) fails on something that seems real, I'll invesetigate soon [13:24] jdstrand: but any feedback would be great [13:25] jdstrand: in the branch about refresh-app-awareness please note if I should split up the cgroup scanning code to a separate PR, it could be reviewed faster and land, aiding further review [13:26] * jdstrand nods [13:49] uh [13:50] my daughter's school friend is at a hospital [13:50] he lives next door :/ [13:50] ohnoes [13:51] FYI we run at capacity now, saturated 32 workers [13:51] but the queue is empty [13:51] and we should see a drop to ~ half of that in a few minutes [13:52] I will look at implementing the ideas we had today, that should reduce the queue load significantly [13:52] well, worker load [13:52] we're still not queueing because we manage to process everything (for now) [13:56] and we are at 24/32 [14:06] zyga: oh [14:06] PR snapcraft#3019 closed: static: consolidate tooling setup to setup.cfg [14:06] PR snapcraft#3020 closed: spread tests: default base for local plugin tests [14:06] PR snapcraft#3022 opened: plugins: introduce v2.PluginV2 and v2.NilPlugin [14:07] pstolowski: yeah :/ [14:07] * zyga is hungry and breaks for lunch [14:07] o/ [14:11] PR snapd#8411 closed: boot: cleanup more things, simplify code [14:11] zyga: back to what I asked you about yesterday, any reason why "snap run --command=stop" would block on snapd.socket? [14:11] zyga: if so, you need to find a way to make snapd keep running until the last snap has been stopped [14:11] yes, it does so when system key is different to the one on disk [14:12] it normally happens when you boot a new kernel [14:12] we ask snapd to generate new profiles and wait until it does so [14:12] zyga: the current situation means LXD is never stopped properly, causes a 10min shutdown delay and data loss [14:12] *never stopped properly in those cases [14:13] I'm not sure it's just about kernel updates, I have a system that seems to reproduce it every time, let me try it again today [14:13] stgraber: please raise a bug to mvo [14:13] * zyga is at lunch [14:13] (well almost) [14:13] mborzecki: https://github.com/snapcore/snapd/pull/8461 [14:13] PR #8461: github: run non-canary if label is present [14:13] * zyga is gone [14:14] PR snapd#8461 opened: github: run non-canary if label is present [14:16] stgraber: FYI I experienced this issue but was unable to debug it at the time [14:16] yeah, got it easily reproducible on an arm64 system somehow, seems to happen at every single reboot [14:17] the new kernel thing would explain why other users only get it somewhat randomly though [14:17] For me it was x86 [14:17] filing a critical bug against snapd claiming data loss [14:17] Perhaps system key is buggy [14:17] Please! 
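What zyga describes for `snap run --command=stop`: the client recomputes the system key, compares it with the copy on disk, and on a mismatch waits for snapd to regenerate the security profiles — which is the wait that stalls LXD's stop during shutdown. A rough sketch of that flow, with invented helper names rather than snapd's real API:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// computeSystemKey would, in reality, summarize kernel/apparmor/seccomp
// features, build-id and similar inputs; here it is only a placeholder.
func computeSystemKey() (string, error) {
	return "", fmt.Errorf("sketch only")
}

// waitForProfileRegeneration would, in reality, poll snapd over snapd.socket
// until the profiles have been refreshed; here it is only a placeholder.
func waitForProfileRegeneration(timeout time.Duration) error {
	return fmt.Errorf("sketch only")
}

// maybeWaitForSystemKey sketches the behaviour described above: if the key
// on disk no longer matches the freshly computed one (e.g. after booting a
// new kernel), block until snapd has regenerated the security profiles.
// During shutdown snapd may already be gone, which is why daemons stopped
// via "snap run --command=stop" can hang here.
func maybeWaitForSystemKey() error {
	onDisk, err := os.ReadFile("/var/lib/snapd/system-key")
	if err != nil {
		return err
	}
	current, err := computeSystemKey()
	if err != nil {
		return err
	}
	if string(onDisk) != current {
		return waitForProfileRegeneration(2 * time.Minute)
	}
	return nil
}

func main() {
	fmt.Println(maybeWaitForSystemKey())
}
```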
[14:22] mvo: https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1871652 [14:22] Bug #1871652: Daemon snaps not properly stopped in some cases [14:23] mvo: as you know, every single server and cloud instances of 20.04 will use the LXD snap and all upgrading users of 18.04 snap will upgrade to the snap too, so we really really need this resolved or we're in for a lot of data loss / corruption issues. [14:25] stgraber: looking [14:25] mvo: thanks! [14:25] I'm creating a test VM on that arm64 system which can be played with as much as needed, should make fixing this easier [14:27] stgraber: I think zyga is on to something here, snap run will wait for snapd to re-generate the profiles, if snapd is already stopped this of course won't work, I need to see why this happens/how to do fix it [14:40] mvo: I've updated the LP bug with my reproducer on arm64 [14:40] mvo: I'm happy to sort out a way for someone from your team to access that system if that helps [14:41] * cachio afk [14:42] * cachio afk [14:42] stgraber: in various meetings right now, need to find someone to look at this while I'm "off" [14:42] re [14:42] stgraber: I have plenty of arm64 boards [14:43] I can look later today [14:43] zyga: VM capable? [14:43] stgraber: hmmmm [14:43] stgraber: good question [14:43] * zyga checks [14:43] zyga: the system I'm testing this on is a 48 core, 128GB RAM, arm64 server :) [14:43] stgraber: I think you win :) [14:43] (which Qualcomm kindly forgot in my basement before firing the entire team who designed it) [14:44] GCE had a hiccup, restarted a job to see if it was temporary [14:53] PR snapd#8459 closed: asserts: it should be possible to omit many snap-ids if allowed, fix <⚠ Critical> [14:55] PR snapd#8460 closed: tests/session-tool: kill cron session, if any [14:56] that's one way to get hardware [14:58] stgraber: if you _ever_ want to throw it out [14:58] just remember [14:59] bring it to europe on a plane and I can relieve you of it ;-) [15:00] :) [15:00] stgraber: are there any arm servers available that don't require a datacenter contract? [15:04] stgraber: looking at the bug now [15:04] stgraber: would it be possible for me to get a shell on a machine where this can be reproduced? [15:05] alternatively, I'd love to see the system key snapd writes [15:05] zyga: happen to have IPv6 on your side? [15:05] stgraber: if is in /var/lib/snapd/system-key [15:05] stgraber: unfortunately no :/ [15:05] maybe don't open access for now, [15:05] system-key is ... well, the key [15:05] zyga: thanks for looking, I'm a bit busy with meetings [15:05] it may be revealing [15:08] {"version":10,"build-id":"799a88b406b245795da51b18f6224003020c6fb9","apparmor-features":["caps","dbus","domain","file","mount","namespaces","network","network_v8","policy","ptrace","query","rlimit","signal"],"apparmor-parser-mtime":1538072454,"apparmor-parser-features":["unsafe"],"nfs-home":false,"overlay-root":"","seccomp-features":["allow","errno","kill_process","kill_thread","log","trace","trap","user_noti [15:08] f"],"seccomp-compiler-version":"66988dd2c3fb0abf9b1fb29be212771d7c38ae85 2.4.1 8c73f36d3de1f71977107bf6687514f16787f639058b4db4c67b28dfdb2fd3af bpf-actlog","cgroup-version":"1"} [15:08] thanks, let me inspect things now [15:11] stgraber: so, why does lxd stop itself using snap run? 
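The system-key JSON stgraber pasted above can be decoded into a small struct; the field set below is taken directly from that paste, while the Go type and field names are only illustrative, not snapd's internal type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// systemKey mirrors the fields visible in the /var/lib/snapd/system-key dump
// quoted above; a change in any of these inputs makes the computed key differ
// from the one on disk.
type systemKey struct {
	Version                int      `json:"version"`
	BuildID                string   `json:"build-id"`
	AppArmorFeatures       []string `json:"apparmor-features"`
	AppArmorParserMtime    int64    `json:"apparmor-parser-mtime"`
	AppArmorParserFeatures []string `json:"apparmor-parser-features"`
	NFSHome                bool     `json:"nfs-home"`
	OverlayRoot            string   `json:"overlay-root"`
	SeccompFeatures        []string `json:"seccomp-features"`
	SeccompCompilerVersion string   `json:"seccomp-compiler-version"`
	CgroupVersion          string   `json:"cgroup-version"`
}

func main() {
	data := []byte(`{"version":10,"cgroup-version":"1"}`)
	var k systemKey
	if err := json.Unmarshal(data, &k); err != nil {
		panic(err)
	}
	fmt.Println(k.Version, k.CgroupVersion)
}
```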
[15:12] this is not a bug on your side, I think, I'm just curious [15:12] that's how the systemd units are generated [15:12] all Commands in there wrap using snap run I think [15:12] ah, I see, [15:13] indeed [15:15] stgraber: is it possible to reproduce this with SNAPD_DEBUG=1 set [15:16] stgraber: if so please attach that [15:16] I need to break now, my 1yo daughter just woke up [15:16] but I have a hunch I know what it is [15:16] having that will confirm [15:18] so, fun fact about persistent journal; restarting systemd-journald triggers snapd restart (?) and since this is happening from config hook, bad things happen :/ [15:21] pstolowski: that's a problem for sure :/ [15:22] pedronis: it's annoying, because core16 seems to need journald restart [15:25] Whaaat? [15:25] Why do we restart? [15:25] Can we reload it instead? [15:26] zyga: i need to try [15:39] pstolowski: core18 and 20 work without? [15:41] pedronis: core18 - yes. i haven't checked 20 [15:43] Failed to reload systemd-journald.service: Job type reload is not applicable for unit systemd-journald.service. [15:43] :} [15:43] there you go [15:43] only systemctl restart does it [15:44] pstolowski: if it works with 18 and 20 without, I would just go without, restarting the journal is kind of weird anyway [15:45] pedronis: ok. i'll double check if i wasn't dreaming [15:46] pstolowski: anyway what you could try is kill USR1 [15:46] pstolowski: see man systemd-journald [15:47] pedronis: aaha, thanks! [15:49] pstolowski: as usual is not super clear what it does [15:49] from the man [15:55] zyga: mvo: I got again a bunch of allocation problems: https://github.com/snapcore/snapd/pull/8436/checks?check_run_id=570978994 [15:55] PR #8436: configcore,tests: use daemon-reexec to apply watchdog config [15:56] pedronis: looking [15:56] pedronis: happened once today [15:56] it looks like some permission issue [15:56] it was mentioned on the internal channel [15:56] please restart the workflow, it's not a capacity problem [15:56] we don't know what caused it [15:57] better yet, merge master for more fixes :) [15:58] zyga: I merged master many times [16:10] pedronis: systemctl kill --signal=SIGUSR1 systemd-journald does the job on core16 [16:10] pstolowski: good [16:10] pstolowski: that seems safe everywhere [16:11] systemd has Kill I think, right? [16:11] I mean systemd our package [16:12] pedronis: yes, i'm just looking at it right now [16:41] zyga: do you have anything to share about the lxd bug ? anything you figured out already that is worth for me to know? [16:41] mvo: I'm still partially AFK but give me some more time [16:42] mvo: I have conditions to reproduce it [16:42] mvo: and I _suspect_ I know what the problem is [16:43] zyga: nice, keep me updated please [16:47] pedronis: did you want me to change to use a string pointer for mockedMountInfo in 8451 ? [16:48] ijohnson: yes, it's it not too annoying [16:48] heh [16:48] if it's not [16:48] sure I mean I'll only have to re-start the workflows 1000 more times anyways so it's not a big deal [16:49] ijohnson: about the selinux tests, yes, that's fine, anyway is a different package, it was really testing two levels [16:49] right [17:37] let's merge https://github.com/snapcore/snapd/pull/8456#pullrequestreview-390189749 [17:37] it needs a 2nd review [17:38] PR #8456: tests: add 32 bit machine to GH actions [17:38] ijohnson: any issues? [17:39] mm? [17:39] oh the PR you just mentioned? 
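pstolowski's working command, `systemctl kill --signal=SIGUSR1 systemd-journald`, uses the signal documented in man systemd-journald for flushing the /run journal to persistent storage under /var. A sketch of doing the same from Go by shelling out to systemctl; inside snapd the wrapper in the systemd package pedronis mentions would be the more natural home:

```go
package main

import (
	"fmt"
	"os/exec"
)

// flushJournalToPersistent asks systemd-journald to flush journal data from
// /run to /var (SIGUSR1, per man systemd-journald), which is what the config
// hook needs after enabling persistent journal, without the full
// "systemctl restart systemd-journald" that was triggering a snapd restart.
func flushJournalToPersistent() error {
	out, err := exec.Command("systemctl", "kill", "--signal=SIGUSR1",
		"systemd-journald.service").CombinedOutput()
	if err != nil {
		return fmt.Errorf("cannot signal systemd-journald: %v (%s)", err, out)
	}
	return nil
}

func main() {
	fmt.Println(flushJournalToPersistent())
}
```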
[17:39] with CI [17:40] I haven't been looking at CI in the past hour or two, just seems annoying that every time I look at one of my PR's exactly one check out of the 17 failed and so I have to restart everything [17:41] ijohnson: I'll prepare the quad workflows for tomorrow [17:41] I reviewed 8456 [17:41] ijohnson: it's late and I'm looking at something else [17:41] yes that would be much appreciated [17:41] also did you see the mount ns bug I assigned to you last night ? [17:41] * zyga needs coffee and checks [17:41] I couldn't reproduce it with robust-mount-namespace-updates=true with a small reproducer snap, but with the full snap I can still reproduce the EBUSY [17:42] anyways I can send you the snap when you have time to look at the issue [17:42] PR snapd#8456 closed: tests: add 32 bit machine to GH actions [17:43] ijohnson: cannot find it, let me check my mail [17:50] did we get a newer systemd recently in 20.04 ? [18:01] ijohnson: I can override failures in merges fwiw [18:02] ijohnson: we are a bit timezone challenged so not ideal but do ping me if you have such a case [18:04] mvo ack maybe I'll send you an email at my EOD if needed [18:04] ijohnson: sure thing! [18:08] re, back to work [18:09] pedronis: 244 was in Feburary [18:09] February [18:09] 245 was in March [18:10] we are now on 245.2 [18:13] ok, just confused because a test that I tried failed now, anyway it indeeds needs tweaking for systemd >=243 [19:17] mvo: I debugged the issue related to lxd and snapd [19:19] cachio: 19.10 images also have GDM [19:19] cachio: it would be good to regenerate them so that we don't have the desktop [19:20] zyga: what was the issue with lxd and snapd ? [19:20] I'm curious [19:25] ijohnson: https://bugs.launchpad.net/snapd/+bug/1871652 [19:25] it's all there [19:25] Bug #1871652: Daemon snaps not properly stopped in some cases [19:25] ijohnson: but tl;dr; is in the last comment [19:25] * ijohnson reads [19:25] ijohnson: it's pretty interesting actually [19:25] stgraber: thank you for the debugging environment [19:25] stgraber: it's late so unless it's very urgent I will fix it first thing tomorrow after discussing with the team [19:26] zyga: it's been happening for a long long time, we can wait another day :) [19:26] I hope one last day [19:26] let me do one more test today [19:26] it's just that now that we understand it, we also understand the danger from it (containers aren't stopped at all, filesystem isn't unmounted, so data loss potential) [19:26] it actually explains why we've seen some odd db corruption in the past which we couldn't really explained based on logs [19:27] yes, I think the bug is well marked as critical [19:28] zyga: ohhhh, what did you find out? [19:28] zyga: ok, how involved is the fix :) ? [19:28] stgraber: as a small note, setting [19:28] SNAPD_DEBUG_SYSTEM_KEY_RETRY=0 [19:28] should work around it [19:29] mvo: it depends [19:29] mvo: please read https://bugs.launchpad.net/snapd/+bug/1871652 [19:29] mvo: it's probably something we can fix tomorrow [19:29] Bug #1871652: Daemon snaps not properly stopped in some cases [19:29] mvo: tl;dr; is https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1871652/comments/7 [19:29] zyga: nice [19:29] zyga: but it does sounds like the fix will not be entirely easy [19:30] mvo: it's actually very easy [19:30] mvo: just the if ( ... 
) part needs discussing [19:30] I like the sound of that [19:30] we cannot wait for system key on shutdown [19:30] and we probably should depend on core/snapd [19:30] and not let them go away / unmounted [19:30] this was never expressed in systemd terms [19:30] but let's discuss that tomorrow [19:30] ok [19:30] it's late and I'd love to get off my chair :) [19:31] zyga: sounds great, thank you so much [19:31] we know exactly how to reproduce this [19:31] and we know what to change to fix it, we need to discuss how to introduce the changes [19:31] mvo: I suspect we _can_ do a minimal fix tomorrow [19:31] without ill effects [19:31] and work on a more proper fix for +1 [19:32] the minimal fix will just detect the shutdown and ignore system key [19:32] the proper fix will introduce dependencies [19:32] so lxd will not stop after core is unmounted [19:32] zyga: sounds good to me [19:32] but that's more iffy for the reasons you probably know about (wrappers and ensure) [19:32] * zyga waves and takes a break [19:33] zyga, I'll run the update7 [19:33] cachio: let me know if you need mount-ns test changes for that [19:33] zyga: thank again and good night [19:33] cachio: it slipped from my radar but I will send the patches tomorrow [19:34] zyga, sure [19:34] stgraber: and I know why I cannot reproduce it, for development I disabled reexec on my main machine [19:34] ah, that'd do it [19:37] zyga, job running [19:38] zyga, https://travis-ci.org/github/snapcore/spread-cron/builds/672666095 [19:38] it would be ready in about 40 minutes [19:49] PR snapcraft#3021 closed: remote-build: remove artifact sanity check [19:50] stgraber: ^ [19:50] https://github.com/snapcore/snapd/pull/8462 [19:50] actually ^ [19:50] PR #8462: cmd/snap: don't wait for system key when stopping [19:50] it should do the job, we need to package it with tests and stuff [19:50] PR snapd#8462 opened: cmd/snap: don't wait for system key when stopping [19:51] that code didn't seem to have unit tests before so it will take me more [19:51] now I'm really gone [19:51] mvo: T [19:51] ^ [19:51] looks simple enough :) [20:50] PR snapcraft#3023 opened: pluginhandler: move attributes to PluginHandler [21:16] PR snapd#8463 opened: secboot: key sealing also depends on secure boot enabled
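The "minimal fix" zyga outlines (detect the shutdown and skip the system-key wait, cf. PR #8462 "cmd/snap: don't wait for system key when stopping") could look roughly like the sketch below, continuing the earlier one; how the actual PR detects the situation may differ, and the helper names are invented:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// systemIsStopping is one possible way (an assumption, not necessarily what
// PR #8462 does) to notice that a shutdown is in progress: systemd reports
// the manager state as "stopping" once the shutdown target has been queued.
func systemIsStopping() bool {
	out, _ := exec.Command("systemctl", "is-system-running").Output()
	return strings.TrimSpace(string(out)) == "stopping"
}

// maybeWaitForSystemKey sketches the adjusted behaviour: when running a stop
// command during shutdown, execute it immediately instead of waiting for
// snapd (which may already be gone) to regenerate security profiles.
func maybeWaitForSystemKey(isStopCommand bool) error {
	if isStopCommand && systemIsStopping() {
		return nil // don't block a daemon's stop on snapd during shutdown
	}
	// ...otherwise compare /var/lib/snapd/system-key and wait as before...
	return nil
}

func main() {
	fmt.Println(maybeWaitForSystemKey(true))
}
```

The longer-term fix discussed above (ordering dependencies so snaps are stopped before core/snapd is unmounted) is orthogonal to this shortcut and would be expressed in systemd unit terms instead.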