[00:09] PR snapd#9495 opened: logger: use KernelCommandLineSplit to parse debug flag [03:05] PR snapd#9495 closed: logger: use KernelCommandLineSplit to parse debug flag [05:35] o/ [05:45] PR snapd#9494 closed: logger: use strutil.KernelCommandLineSplit in debugEnabledOnKernelCmdline [05:51] morning [05:51] mborzecki: o/ [05:51] zyga-x240: hey [05:51] mborzecki: I think we should revert bits of the speedup changes [05:51] it was hanging yesterday [05:51] or investigate and fix [05:51] it may be a python-version-specific bug or just a general bug [05:52] hmmm [05:53] zyga-x240: did the change where we respect workers count land? [05:53] I think so [05:54] afk for some time, lucy just woke up [06:31] PR snapd#9480 closed: snap: support different exit-code in the snap command [06:36] mvo: we need to fix spread-shellcheck [06:38] zyga-x240: tests/unit/go hangs on 20.04 too right? [06:39] zyga-x240, mborzecki yeah, something is wrong, I was just starting a spread google run to see what is going on. do you have an idea already? [06:42] mborzecki: not sure [06:42] mborzecki: I just don't know [06:42] mvo: not immediately, either bug in older python (related to recursive executor submit) or something else [06:43] trying with --max-procs=2 might be useful [06:46] zyga-x240: otoh, the unit tests job as run by gh actions does not seem to fail or hang for that matter [06:46] yeah, that's what makes me think it may be python version-specific behavior [06:48] I'm finishing my breakfast now, I'll start in a moment [06:49] zyga: looking at one of my PRs, it failed on 20.04, failed on 18.04 (through the tests/unit/spread-shellcheck test), hit kill-timeout in both cases, 16.04 passed [06:49] heh [06:49] it's all over the place [06:49] shall we revert and fix this async? [06:50] I'd rewrite the code so that it returns todo units like what jamesh suggested [06:50] then it's one loop and one executor only [06:50] zyga: sounds like more work though ;) [06:50] yes [06:50] zyga: anyways, i think we should revert that last change [06:50] that's why separate actions 1) revert 2) fix [06:51] mvo how does that sound? [06:51] also tests/unit/spread-shellcheck duplicates work right now, run-checks is run in tests/unit/go [06:51] so either we can disable spread-shellcheck in tests/unit/go or drop the other test [06:51] mborzecki: I think we can kill tests/unit/spread-shellcheck now [06:52] mvo: sgtm [06:54] running manually now, those spread nodes have 1 cpu [06:55] zyga: heh, so some deadlock, a number of jobs submitted, nothing happening, cpu usage 0% [06:55] backtrace! [06:55] that's quick [06:57] good morning! [06:58] pstolowski: hello [06:58] zyga-x240: https://paste.ubuntu.com/p/ZqDTCcFvnb/ heh (cc mvo) [06:58] pstolowski: hey [06:58] zyga-x240: only 2 threads and both are waiting [06:59] interesting [06:59] and good idea to use gdb! [07:00] so one is running checkpaths [07:00] going through each location [07:01] while the other is in checkfile [07:01] waiting for the result [07:01] yeah [07:01] mborzecki perhaps just ensuring we have 3 workers minimum :P [07:01] a bit lame but ... [07:01] (as in the minimum N) [07:01] mborzecki what do you think? [07:05] zyga: --max-procs 3 seems to work [07:06] In the end, you really want the code waiting for futures moved outside of the thread pool [07:07] yeah, I think that's the proper solution [07:11] next step: overengineer it with asyncio [07:14] jamesh no no ;) [07:14] haha [07:15] zyga: but just think of all the threads you'd save! [07:16] mborzecki: nice find!
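The failure mode the gdb backtrace points at is the classic recursive-submit trap in Python's concurrent.futures: a job running inside the thread pool submits more jobs to the same pool and then blocks waiting for their results, so once every worker is occupied by a waiting outer job, the inner jobs can never be scheduled. A minimal sketch of that pattern, not the actual spread-shellcheck code (the check_paths/check_file names only mirror the functions seen in the backtrace):

    from concurrent.futures import ThreadPoolExecutor

    def check_file(path):
        # Stand-in for running shellcheck on a single file.
        return path

    def check_paths(pool, paths):
        # Runs *inside* the pool, submits more work to the same pool,
        # then blocks waiting for it -- the two-threads-both-waiting
        # state visible in the backtrace.
        futures = [pool.submit(check_file, p) for p in paths]
        return [f.result() for f in futures]

    pool = ThreadPoolExecutor(max_workers=1)
    # With a single worker the outer job occupies the whole pool, the
    # queued check_file jobs can never be scheduled, and result() blocks
    # forever at 0% CPU.  Uncomment to reproduce the hang:
    # print(pool.submit(check_paths, pool, ["a.sh", "b.sh"]).result())

    # The fix discussed above: keep the code that waits for futures
    # outside the thread pool, so one loop and one executor suffice.
    futures = [pool.submit(check_file, p) for p in ["a.sh", "b.sh"]]
    print([f.result() for f in futures])
    pool.shutdown()

This also explains why --max-procs 3 appears to work: with enough workers one is always free to drain the inner jobs, but the deadlock returns as soon as every worker holds a job that is itself waiting on the pool.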
[07:16] jamesh: save all the threads! [07:21] hmm so maybe we just need +1 worker really [07:22] mborzecki I think that's the easy fix [07:22] and we should rewrite that slightly as jamesh mentioned [07:22] zyga: i'd land an easy fix with a comment, and then maybe work on the larger fix/refactor [07:23] +1 [07:23] +1 [07:24] zyga: fwiw i can reproduce this locally with --max-procs=1 [07:24] yeah, my fault for testing on my beefiest system [07:41] zyga: are you opening a quick pr with the workaround? [07:41] mborzecki nope, I thought you wanted to do that [07:41] I can though [07:41] zyga: no worries, i can push it [07:41] thanks! [07:42] mvo: hi! #8960 got +1 from Samuele and has 3 reviews; would be great to land at the most convenient moment after cutting a new release branch. perhaps it would make sense to squash-merge it in case of anything unexpected (and a need for a revert)? [07:42] PR #8960: o/snapstate,servicestate: use service-control task for service actions (9/9) <β›” Blocked> [07:47] I think I need to solve the blockers on https://github.com/snapcore/snapd/pull/9204 [07:47] PR #9204: sandbox: track applications unconditionally [07:47] as that's really required to enable r-a-a [07:50] mborzecki: IIRC fedora will disable getenforce/setenforce soon [07:50] how will that affect our test suite? [07:50] zyga-x240: hm, got more info? [07:50] mborzecki: one sec [07:51] https://lwn.net/Articles/831748/ [07:51] PR snapd#9496 opened: spread-shellcheck: temporary workaround for deadlock, drop unnecessary test [07:52] zyga-x240: it's ok for us, we only switch between permissive/enforcing [07:57] zyga-mbp: mvo: ^^ 9496 [08:00] mborzecki: looking [08:01] PR snapd#9497 opened: Have session agent connect to the D-Bus session bus [08:03] zyga-x240: ^^ this PR might help out a bit with your notifications work. [08:04] jamesh: interesting [08:04] I was using the other socket for broadcast but this may be useful as well [08:05] other socket? [08:05] jamesh: the snapd-user-agent socket [08:05] not dbus :) [08:06] why do we have to be launched by systemd?
[08:06] ah [08:06] zyga-x240: we're already launched by systemd [08:06] that's a dbus service [08:06] got confused for a sec [08:06] we want to make sure that if we get activated via D-Bus first, we still get our file descriptor [08:06] for REST [08:07] mmm [08:07] I'm not suggesting replacing the snapd <-> agent communication with D-Bus [08:07] this would be for a return path of the agent <-> desktop shell communication [08:07] right [08:10] jamesh: I've sent a quick review just now [08:22] something weird on centos 7 [08:30] hmm [08:30] type=SYSCALL msg=audit(10/13/20 08:11:54.336:587) : arch=x86_64 syscall=kill success=yes exit=0 a0=0xffffffffffffa7e8 a1=SIGKILL a2=0x0 a3=0xf91e50 items=0 ppid=1 pid=22326 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=snapd exe=/usr/libexec/snapd/snapd subj=system_u:system_r:snappy_t:s0 key=(null) [08:30] type=AVC msg=audit(10/13/20 08:11:54.336:587) : avc: denied { sigkill } for pid=22326 comm=snapd scontext=system_u:system_r:snappy_t:s0 tcontext=system_u:system_r:snappy_cli_t:s0 tclass=process permissive=1 [08:31] type=SYSCALL msg=audit(10/13/20 08:01:54.377:396) : arch=x86_64 syscall=connect success=yes exit=0 a0=0x8 a1=0xc000320f10 a2=0x22 a3=0x4 items=0 ppid=22326 pid=22552 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=snap exe=/usr/bin/snap subj=system_u:system_r:snappy_cli_t:s0 key=(null) [08:33] mborzecki: do you know how to recover from: Corrupted checkpoint file. Inode match, but newer complete event (1602577959.287:791) found before loaded checkpoint 1602577772.146:790 [08:33] that's from ausearch [08:34] nvm I got it [08:34] so [08:34] type=AVC msg=audit(10/13/20 08:32:39.287:791) : avc: denied { connectto } for pid=23299 comm=snap path=/run/dbus/system_bus_socket scontext=system_u:system_r:snappy_cli_t:s0 tcontext=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 tclass=unix_stream_socket permissive=1 [08:34] type=AVC msg=audit(10/13/20 08:32:39.287:791) : avc: denied { search } for pid=23299 comm=snap name=dbus dev="tmpfs" ino=13502 scontext=system_u:system_r:snappy_cli_t:s0 tcontext=system_u:object_r:system_dbusd_var_run_t:s0 tclass=dir permissive=1 [08:34] those are the first two denials to fix [08:45] * zyga-x240 tries to adjust the policy [08:57] zyga-x240: did you manage to fix it? [08:57] (the policy i mean) [08:58] mborzecki: I'm slow at iterating at this [08:58] not yet [08:59] mborzecki: should we bump the snapd policy version perhaps? 
:) [08:59] maybe it should match snapd version [09:00] zyga-x240: try with dbus_stream_connect_system_dbusd(snappy_cli_t) and dbus_chat_system_bus(snappy_cli_t) [09:00] zyga-x240: perhaps we should, though we never did [09:01] zyga-x240: setting it to version of snapd sounds ok [09:01] thanks, trying [09:02] zyga-x240: the modules in core policy have a version bump on each change (in theory) [09:02] mm [09:02] I suspect nothing actually cares about this number [09:02] just an observation, it's not important [09:03] zyga-x240: there may be some automation under the hood, in our development we replace the module by hand, so it automatically gets a higher priority than the currently loaded one [09:09] mborzecki: I'll read https://github.com/SELinuxProject/refpolicy/blob/master/policy/modules/services/dbus.te and friends to see if there's something we should use [09:12] haha [09:12] zyga-x240: though you really want to read this: https://github.com/SELinuxProject/refpolicy/blob/master/policy/modules/services/dbus.if [09:13] (I meant all three) [09:13] zyga-x240: the *.te (type enforcement?) file is for the actual system dbus daemon, *.fc (file context?) is the files/sockets/dirs [09:13] zyga-x240: *.if is the interfaces for use from other modules [09:13] mmm [09:22] mborzecki: is all the comment XML in those .if files processed by anything? [09:22] is there a "compiled" version anywhere? [09:23] zyga-x240: yes, there should be documentation in your system, though i don't think it's available anywhere online [09:23] ah, ok [09:23] I'll check on F32 [09:23] zyga-x240: latest tech :P [09:28] mborzecki: more denials: https://pastebin.ubuntu.com/p/5K5TTpKwkT/ [09:28] I think the blocker is type=AVC msg=audit(10/13/20 09:16:35.588:270) : avc: denied { search } for pid=22502 comm=snap name=dbus dev="tmpfs" ino=13389 scontext=system_u:system_r:snappy_cli_t:s0 tcontext=system_u:object_r:system_dbusd_var_run_t:s0 tclass=dir permissive=1 [09:29] but the rest is also interesting, it seems snapd cannot stop the hook [09:29] (after a timeout) [09:29] yeah, sigkill and all [09:30] zyga-x240: dbus_system_bus_client(snappy_cli_t) should do it [09:30] yep [09:31] PR snapd#9293 closed: snap: auto-import will not try to auto-create users on managed devices [09:31] PR snapd#9498 opened: client,daemon,snap: auto-import does not error on managed devices [09:32] zyga-x240: and you can probably drop dbus_connect_system_bus(snappy_cli_t), looks like it duplicates some of the things from *_bus_client [09:32] ok [09:38] #9463 needs 2nd reviews (it's small) [09:38] PR #9463: seed/seedwriter/writer.go: check DevModeConfinement for dangerous features [09:40] ack [09:44] https://github.com/snapcore/snapd/pull/9463#pullrequestreview-507255123 [09:44] PR #9463: seed/seedwriter/writer.go: check DevModeConfinement for dangerous features [09:44] made one remark, so perhaps something to adjust before landing [09:45] mborzecki: nice, passing [09:45] mborzecki: we need a session equivalent though [09:46] I don't see any in the refpolicy gh repo [09:46] mborzecki: rebased and pushed back to https://github.com/snapcore/snapd/pull/9204 [09:46] PR #9204: sandbox: track applications unconditionally [09:55] pedronis: should I push a trivial patch for https://github.com/snapcore/snapd/pull/9463 [09:55] PR #9463: seed/seedwriter/writer.go: check DevModeConfinement for dangerous features [09:55] er [09:55] for https://github.com/snapcore/snapd/pull/9463#discussion_r503813935 [09:55] zyga-x240: if you have time yes [09:56] on it [09:56] mvo: should we
close #8845 and #8929 until we have time to discuss/adjust them and re-propose? [09:56] PR #8845: [RFC] many: add "system.service.snapd-autoimport.disable" setting <β›” Blocked> [09:56] PR #8929: [RFC] many: add new "daemon-startup: inhibit" option [09:57] pedronis: sure [09:58] thx [09:58] closed [10:01] pushed [10:01] https://github.com/snapcore/snapd/pull/9463/commits/26f6b6680027ea9b1262a03b27a28ac6f3d60a9e if anyone wants to cross-check [10:01] PR #9463: seed/seedwriter/writer.go: check DevModeConfinement for dangerous features [10:02] PR snapd#8845 closed: [RFC] many: add "system.service.snapd-autoimport.disable" setting <β›” Blocked> [10:14] mborzecki: any idea on how to provide sigkill and other permissions [10:14] I think we're missing a test [10:14] that we can kill a hook that is running for too long [10:14] (or miss a test that runs selinux checks) [10:14] mborzecki: do you think we could move the "no denials" check to an invariant? [10:19] zyga-x240: hm this should do it `allow snappy_t snappy_cli_t:process { sigkill };` [10:19] zyga-x240: wondering though, why the hook process was still snappy_cli_t [10:19] can you reproduce that and grab ps -Z ? [10:19] oh [10:19] I will try [10:19] sure [10:21] ah I know why [10:21] mborzecki: because that was still the "snap run" phase [10:21] we got a denial on a dbus method call [10:21] and got stuck waiting for a response [10:22] mborzecki: I wonder if we should set up a timeout / guard of some sort [10:22] but that explains the label [10:22] we switch that in snap-exec [10:22] er [10:22] snap-confine [10:25] mborzecki: what is the label we transition to when we run as a snap app? [10:25] I added this [10:25] # snapd tries to kill hooks that run for over 10 minutes. [10:25] allow snappy_t snappy_cli_t:process { sigkill }; [10:25] zyga-x240: if it's under systemd the final label is unconfined_service_t [10:25] but I think we need more than that [10:26] not under systemd [10:26] those are specifically hooks [10:26] systemd just tracks them [10:26] not spawns them [10:26] zyga-x240: right, but the transitions take the same route iirc [10:26] zyga-x240: so the hook ends up as unconfined service too [10:28] oh [10:28] ok, that's good then [10:28] I'll add two lines [10:29] * zyga-x240 tests killing [10:30] with disabled dbus perms [10:38] pedronis: hi, i've spent quite a bit of time investigating undo on remove, found a couple of issues and filed https://bugs.launchpad.net/snapd/+bug/1899614 (and also worked on addressing point #2 there) [10:38] Bug #1899614: multiple problems with undo for 'snap remove' [11:01] pstolowski: rogpeppe is around and could provide extra information about the other bug we've discussed [11:02] zyga-x240: thanks, i don't have anything specific atm, but mvo had requested some info there before, would be good to add it if possible [11:02] rogpeppe ^ [11:03] pstolowski: ok, i've added that command output [11:03] thank you! [11:04] thanks! [11:04] i don't seem to get email notifications from launchpad, so please ping me here if you want any further interaction from me in that issue, thanks! [11:05] ah, will do [11:06] ok, that log is very interesting.. we saw "there was a rollback across reboot" before, didn't we?
[11:07] aha [11:08] this at least explains the refresh and undo on Oct 5 that I also saw in the state timings; but what happened after the next 2 reboots and why did it work till Oct 7th is a mystery [11:08] pstolowski: I don't recall how this layer works, what would happen if the boot partition is corrupted (vfat) and we cannot really set the next boot to anything different [11:08] like it's always stuck at one thing [11:08] would that explain anything? [11:08] i've no idea about that code [11:08] rogpeppe: I think you could try unmounting the boot partition (vfat) and fscking it [11:08] rogpeppe: also if you want to recover the system, we should be able to help [11:08] just not sure what's the best way to do that [11:09] I was also talking to mborzecki about this issue [11:09] and we don't remove snapd.service from disk [11:09] rogpeppe: could you check if snapd.service is in /etc/systemd/system? [11:09] afaiu the system works again, no? logs from oct12 [11:09] rogpeppe: oh, is snapd running now? [11:10] maybe i'm making assumptions.. but afaiu it stopped working on 7th? [11:10] one mo, let me check [11:10] the system isn't working currently [11:11] snapd does seem to be running: [11:11] rogpeppe@localhost:~$ ps alxw | grep snapd [11:11] 4 0 725 1 20 0 928048 18712 - Ssl ? 2:57 /snap/snapd/9169/usr/lib/snapd/snapd [11:11] ! [11:11] rogpeppe: ha, that's weird! [11:11] can you run /snap/snapd/9169/usr/bin/snap? [11:11] rogpeppe: and no /snap/snapd/current symlink right? [11:11] /snap/snapd/current is still not there, right [11:12] i can run /snap/snapd/9169/usr/bin/snap ok [11:12] pstolowski: ^ and ideas on what to explore? [11:12] rogpeppe: maybe snap list --all [11:12] for a first sanity check [11:12] i could give you a login to the system if you want [11:12] then maybe snap install snapd? [11:12] pstolowski: do you want to debug this? [11:14] this is the output of snap list --all: https://paste.ubuntu.com/p/Rwvr2dPD9M/ [11:14] rogpeppe: yes, sure, that would be great [11:15] core 16-2.46.1 9995 latest/stable canonical* core [11:15] core is linked [11:15] do we do anything magical where snapd would start without a current symlink after reboot? (don't think so, but...) [11:15] but the boot base is core18, right? [11:15] rogpeppe: what's the /meta/snap.yaml snap name? [11:15] * rogpeppe tries to remember how to grant ssh access. so rusty! [11:15] pstolowski: I don't think we do [11:15] rogpeppe: you can try ssh-import-id [11:15] not sure if it's preinstalled [11:15] that's the command i was trying to remember! [11:16] cool :) [11:16] mvo: ^ very interesting bug [11:16] lots for us to learn on robustness [11:16] pstolowski: is your launchpad username pstolowski ? [11:17] ok, no ssh-import-id command [11:17] rogpeppe: 1 sec, i will import ssh keys from my current box [11:17] :thumbsup: [11:19] zyga-x240: is there more news? [11:19] mvo: yes, we have access to the device [11:19] snapd is disabled! [11:20] woah, how did that happen :( ? [11:20] the boot partition is still corrupted I suspect [11:20] rogpeppe: my lp user is stolowski [11:20] we can get all the logs [11:20] \o/ [11:20] mvo: missing undo in unlink snap, I bet [11:20] but could be something much more complex [11:20] i'm not sure it's this zyga-x240 [11:20] but cannot exclude it of course [11:21] thank you so much rogpeppe ! [11:21] pstolowski: ack [11:21] mvo, zyga-x240 any suggestions what to collect? anything regarding boot?
* mvo hugs pstolowski and zyga-x240 for their tireless debugging also [11:21] rogpeppe@localhost:~/.ssh$ ed [11:21] -bash: ed: command not found [11:21] dammit! [11:21] pstolowski: maybe to be safe collect all journal logs [11:21] +1 [11:21] rogpeppe: vi is there [11:21] rogpeppe: you can also echo >> [11:21] yeah, i'll use cat [11:21] i don't use an ANSI terminal so vi isn't good for me [11:23] zyga-x240: is tar /var/log.. good enough, or is there a better way? [11:23] thanks rogpeppe, checking [11:23] I think that's good [11:23] pstolowski: you can use journalctl with a standalone directory to examine machine logs without journald itself [11:24] zyga-x240: yeah i did it once.. slightly inconvenient but works [11:25] pstolowski: journalctl -D /path/to/var/log/journal [11:25] and then it works IIRC [11:31] mborzecki: is spread-shellcheck fixed? [11:31] ah [11:31] I see the PR [11:31] thanks! [11:32] approved [11:36] zyga-x240: hey so you mentioned re docker-support/multipass-support being broken with aa3 on groovy that you would prefer a snap-update-ns approach - can you elaborate more on what you are thinking here? I am not sure I understand exactly what you have in mind. [11:36] sure [11:36] I was thinking that the special interfaces they rely on could provide a mount profile that puts the base snap's apparmor config in /etc [11:37] mvo: anything re boot env that can be useful? [11:37] something like mount --bind /snap/core18/current/etc/apparmor.d /etc/apparmor.d [11:37] pstolowski: if you can, try fscking the boot partition [11:37] or [11:37] dd it [11:37] to analyze post-mortem [11:37] you may want to flip it read only for that operation [11:37] or unmount it [11:38] do you remember that vfat bug we ran into before? [11:38] zyga-x240: ok so is this already easily possible with the existing way that interfaces are declared? I am not super familiar with that... [11:38] amurray: I believe it should be; the only thing that would be required in addition to this is the permission for snap-update-ns to do this as well [11:38] amurray: if that's urgent I could look [11:38] amurray: but do look at the mount profile part [11:38] the apparmor part should be easy once that is in the works [11:39] you can test this by making a snap that uses the new interface (or the vanilla original snaps) [11:39] and looking at the generated mount profile in /var/lib/snapd/apparmor/mount/ [11:39] there are some examples [11:39] for instance, the desktop interface uses this mechanism to bind mount fonts around [11:39] zyga-x240: oh can you point me at examples, I am still confused πŸ˜• [11:39] sure [11:39] ah ok will take a look [11:40] amurray: in the snapd tree please look at interfaces/builtin/desktop.go [11:40] let me open it as well [11:40] yep am just looking now [11:40] zyga-x240: i'd rather avoid any potentially destructive steps atm, would leave that to rogpeppe [11:40] pstolowski: ok, a dd of the vfat while it is mounted would be useful as well [11:40] even if you just stash it on the device [11:40] not sure how large it is [11:40] amurray: so if you scroll to line 295 [11:40] zyga-x240: I am guessing that AddMountEntry() would be the thing? [11:41] you can see how it grants apparmor permissions [11:41] dd of /boot partition?
nb, logs will be huge [11:41] there are several profiles involved [11:41] pstolowski: the vfat [11:41] not sure how big it is [11:41] ah yep and the corresponding apparmor bits - thanks :) [11:41] amurray: the key part there is the AddUpdateNSf function [11:41] which adds a piece of text to the per-snap profile for snap-update-ns [11:42] this just needs the permission to bind /snap/{base}/*/etc/apparmor.d -> /etc/apparmor.d [11:42] now jump to 322 [11:42] this does what you mentioned before [11:42] the difference is that we need spec.AddMountEntry (not *User*) [11:42] there's more [11:42] I believe those should be permanent things [11:43] regardless of connection [11:43] so the method signature is different [11:43] you can see that in ... [11:43] if you go to interfaces/mount/spec.go:206 [11:43] AddPermanentPlug [11:43] there's a Slot variant just below [11:44] the difference is in the arguments provided, [11:44] the Permanent methods get an interface and a plug or a slot, not a connected plug / slot [11:44] so it's just one side that you see [11:44] anyway, [11:44] journal logs are 324M, tgz [11:44] I think that's a sensible approach [11:44] pstolowski: oh my [11:44] pstolowski: maybe too much [11:44] pstolowski: not sure, if we can send that over, that's good [11:44] but confirm with rogpeppe for sure [11:44] zyga-x240: ok thanks heaps for your guidance - I'll try to take a look tomorrow morning and see if I can cook something up [11:45] amurray: let me know how this feels [11:45] ok [11:45] amurray: if you get stuck I'll help [11:45] rogpeppe: ok to transfer 324M ^ ? [11:45] amurray: which interfaces were those? docker-support and multipass-something? [11:45] zyga-x240: thanks - so multipass-support iirc [11:46] right [11:46] ideally we'd have a spread test that installs those snaps [11:46] and looks at the mount profile or at the mount namespace [11:46] I think that's the last step though [11:46] I can definitely help [11:47] yeah I was wondering if I should try and add a test with whatever fix I come up with for this but will focus on getting the right fix first and then can look at that if time permits... [11:47] amurray: you can start with a quick failing test [11:48] do you know how to write those?
[11:48] mkdir tests/regression/lp-XXX [11:48] add a summary: with some info [11:48] then execute: | (newline)(tab)false [11:49] run that test with SPREAD_DEBUG_EACH=0 spread -debug -v google:ubuntu-20.10-64:tests/regression/lp-XXX [11:49] in the shell install the snap you need [11:49] use nsenter / cat to explore the files in /etc/apparmor.d [11:49] eventually copy those ideas over to the yaml [11:49] quit the debug shell and re-run to verify [11:49] at some point it will measure the failure [11:49] and then that's a good start [11:50] we have a library of helper programs that assist in writing tests [11:50] but the best thing is you can really experience this from the point of view of a user [11:50] and create a valid test [11:50] that only later needs tweaking so that it fits the rest of the test stack [11:50] this will be my first time writing a test so again I really appreciate the guidance - cheers [11:51] amurray: look at various tests around, though you may stumble on more unusual tests from time to time [11:51] you can also use qemu locally [11:51] you will need a test image, you can get that with adt [11:51] I can find the magic line if you want to use that instead of the google backend [11:51] just let me know [11:51] I think, on 20.10, that is autopkgtest-buildvm-ubuntu-cloud [11:52] you just need a few extra args to get a groovy image [11:52] sure any help with magic incantations is greatly appreciated :) [11:52] drop that into ~/.spread/qemu [11:52] as ubuntu-20.10-64.img [11:52] and you're good [11:52] a bit of advice: qemu tests are heavy on networking [11:53] so you may want a good connection [11:53] over time you can speed things up with apt-cacher-ng [11:53] anyway, let me know if this helps and if you get stuck on anything just ask [11:53] will do - thanks again (my connection is ok, not great so will see how I fare...) [11:54] ok time for me to go eod - thanks again zyga-x240 for your help - have a great day [11:54] likewise! [11:55] see you later [11:58] zyga-x240: sorry, i'm not sure about dd and vfat, can you elaborate? [11:58] pstolowski: how large is the vfat partition on that pi? [11:58] I don't recall [11:59] pstolowski: I wonder what's the impact of the fact that the partition is not cleanly unmounted [11:59] and may not have been unmounted [11:59] zyga-x240: i don't see any mounted vfat partitions [11:59] cleanly that is [11:59] hmmm [11:59] can you paste mount? [12:00] rogpeppe: do you recall if you unmounted the boot partition last time we were looking at this? [12:00] zyga-x240: yeah, i might have [12:00] ah, that explains things [12:00] thank you [12:00] zyga-x240, rogpeppe i've already collected mount output [12:01] rogpeppe: did you try to fsck that partition after unmounting it? [12:01] zyga-x240: i tried, but there's no fsck command available [12:01] rogpeppe: oh [12:01] zyga-x240: (and no way to install one, of course :) ) [12:01] pstolowski: what's the boot base (/meta/snap.yaml will help) [12:01] is that core18 or core? [12:03] I see /sbin/fsck.vfat in both core and core18 [12:03] can you confirm those are on PATH pstolowski? [12:04] rogpeppe: ^ [12:04] rogpeppe: if you can, perhaps fsck.vfat /dev/mmcblk0p{something} [12:05] zyga-x240: it's core18 - https://paste.ubuntu.com/p/Rwvr2dPD9M/ [12:05] pstolowski: right and that is the boot base for sure? [12:05] our list output doesn't show this [12:05] zyga-x240: what's that "{something}" supposed to be a placeholder for?
rogpeppe: the number of the partition with vfat [12:05] lsblk can help finding it [12:05] I think it's just p0 or p1 [12:06] no fsck.vfat on path! [12:06] pstolowski: and in /sbin/fsck.vfat? [12:06] pstolowski: feel free to run fsck... [12:06] nope [12:06] $ ls /sbin/fsck* [12:06] pstolowski: can you check /meta/snap.yaml to ensure that the boot base is core18 for sure, I'm surprised to see three core revisions and one core18 [12:07] woah! [12:07] let me check [12:07] "/sbin/fsck /sbin/fsck.cramfs /sbin/fsck.ext2 /sbin/fsck.ext3 /sbin/fsck.ext4 /sbin/fsck.minix" [12:07] mvo: ^^^ [12:07] that's very likely a serious problem [12:07] pstolowski: how about fsck.fat? [12:07] is that gone too? [12:07] I see it in my core18 snap on x86-64 [12:08] pstolowski: that's worth reporting as a separate bug with a regression test that checks that it's an executable program [12:08] zyga-x240: on what path on your system? [12:08] and that it can run --help [12:08] pstolowski: /snap/core18/current/sbin/fsck.vfat [12:08] that's a symlink to fsck.fat [12:09] but I see that in the core snap as well [12:09] * zyga-x240 looks at revision numbers [12:10] * zyga-x240 checks stable channel [12:10] pstolowski: waiting for your confirmation of the boot base please [12:11] zyga-x240: yeah, core has fsck.vfat here. but core18 doesn't. and it's not symlinked anywhere [12:11] stable core has fsck [12:11] ok [12:11] probably core18 is at fault then [12:11] pstolowski: but core is the boot base [12:11] zyga-x240: boot base is core18 [12:11] so what's going on? [12:11] ahh [12:12] ok [12:12] i'm collecting all this and will soon attach it to the report [12:12] pstolowski: I think core18 didn't refresh [12:12] it's very old [12:12] core18 is revision 1885 here [12:12] but 1885 in your log [12:12] I think that could be related [12:12] rogpeppe: consider running fsck.vfat from /snap/core/current/sbin/fsck.vfat [12:13] then we could try running snap refresh core18 [12:13] and snap install snapd [12:13] that may recover the system [12:13] I think this system is just stuck at old revisions and cannot move forward [12:13] the fsck bug was fixed [12:13] but this device is still affected [12:14] interesting [12:14] not sure what you think but the revision number there is really old [12:14] pstolowski: so is snapd running now? [12:14] the service I mean [12:14] zyga-x240: yes [12:15] pstolowski: can you try refreshing core18 [12:15] though wait [12:15] wait please [12:15] that would cause a reboot [12:15] and IIRC that's a problem [12:15] rogpeppe: ^ [12:15] rogpeppe: is rebooting that device acceptable for you? [12:15] or will it misbehave? [12:15] (I think that given its state we should first fsck the boot partition [12:15] zyga-x240: that's fine, although it probably won't restart [12:15] mount it [12:15] and look at what's there [12:15] zyga-x240: it will probably need manual intervention [12:15] pstolowski: ^ ok, let's not reboot it yet [12:16] pstolowski: please fsck boot [12:16] zyga-x240: it almost always needs to be restarted twice for some reason [12:16] using core's fsck [12:16] rogpeppe: I think because it reboots to try the new core18 snap [12:16] but fails [12:16] because boot data is corrupted [12:16] so it gets stuck [12:16] (no watchdog timer) [12:16] then reboot rolls back [12:16] and we're back here [12:16] yeah, i'd avoid any intervention now.
i'd discuss the solution for rogpeppe and pass it to him to do [12:16] rollback across reboot as pstolowski noted earlier [12:16] let's discuss on the standup [12:16] pstolowski: agreed [12:17] rogpeppe: ^ if you can, fsck is IMO safe [12:17] then mounting the partition back [12:17] how on earth did the boot data get corrupted anyway? i haven't made any changes to it since i installed [12:17] that may recover everything, assuming you try to install snapd again [12:17] rogpeppe: it's vfat [12:17] and it's written to by both uboot and linux [12:17] we really don't know [12:17] zyga-x240: vfat can corrupt even when you're not changing it? [12:18] rogpeppe: we do change it on boot [12:18] we set a flag that says "we're trying to boot that thing now" [12:18] oh [12:18] so that on reboot (if we fail) we don't retry [12:18] i guess that's where the issue comes from [12:18] but boot a safe value [12:18] indeed [12:18] we found some issues with uboot FAT before [12:18] those got fixed [12:18] or was that GRUB fat [12:18] anyway [12:18] I think you can fix the partition so that it's not unclean [12:18] then [12:18] mount it [12:18] so that snapd can write to it [12:18] then [12:19] install snapd using the snap binary from the core snap [12:19] that should make the snapd snap not disabled anymore [12:19] if that works [12:19] I'd try to refresh core18 and see if it works [12:19] I suspect it might just [12:21] rogpeppe: this is your decision [12:21] zyga-x240: go for it [12:21] zyga-x240: i can easily manually reboot if needed [12:21] zyga-x240: this shouldn't corrupt the main data, right? [12:21] rogpeppe: give me 30 minutes to finish the download :) [12:22] rogpeppe: it's a separate partition [12:22] rogpeppe: I think pawel asked you to perform those changes though, I think it's best to let him upload logs for forensics [12:22] and for you two to agree as to who runs the commands [12:22] so that it's not racy :) [12:22] i'm happy for pstolowski to do whatever's needed. i'm quite busy currently. [12:23] ok [12:23] pstolowski: it's up to us [12:23] on the upside the bug may be fixed [12:23] it's just prevented your device from refreshing [12:24] yeah, that all makes sense [12:24] but but [12:24] why no current symlink? [12:24] pstolowski: although understanding exactly how snapd became unlinked would be useful [12:24] undo bug? [12:24] I wonder what happens in that special boot code [12:24] that reconciles what the system was booted with (Base name and rev) [12:24] with what's in snapd state [12:25] maybe that, when a skew is detected, somehow took out the snapd snap? [12:26] brb, afk for a moment [12:26] or maybe until standup [12:52] mvo could you please land https://github.com/snapcore/snapd/pull/9496 [12:52] PR #9496: spread-shellcheck: temporary workaround for deadlock, drop unnecessary test [12:52] the store had some issues with searching, probably usual load or something like that, and this blocks master [12:56] zyga-mbp: done [12:56] thank you! [12:57] PR snapd#9496 closed: spread-shellcheck: temporary workaround for deadlock, drop unnecessary test [13:47] off to pick up my daughter from school [13:54] pstolowski I'm hacking tests for this case but if you want to pair-program on fixing that device remotely, let me know [13:55] zyga: that's actually a good idea. but i'm having a short lunch break now, how about in 30?
[13:56] no rush, I didn't have lunch yet either [13:56] let me know when it's comfortable for you [14:12] re [14:13] that fsck test is actually pretty cool [14:20] zyga: i've a meeting in 10 minutes, how about afterwards? [14:21] pstolowski: I wonder if this will happen, lots of people declined [14:21] pstolowski: how long is the meeting? [14:21] mvo: ah ok [14:21] pstolowski: I'm here for now, I can grab lunch and eat it next to a hangout [14:27] zyga: ok, i'll know in a few moments if the meeting happens (likely not) [14:29] pstolowski: ok, can we do it in 15 minutes [14:29] I'm getting lunch in this instant [14:31] see you at 16:45 [14:31] pstolowski: no one in the call so far [14:34] zyga: ok, works for me, see you soon [14:45] back [14:46] ok [15:06] mvo, pedronis: pstolowski discovered the root cause [15:10] bug report coming [15:12] zyga: ohhh? [15:19] * cachio lunch [15:27] mvo still in a call [15:27] but very unexpected [15:51] okay, back [15:51] mvo so two well defined bugs [15:51] mvo one triggers the other [15:51] mvo one breaks snapd refreshes [15:51] the other causes snapd to deactivate itself on a failed refresh [15:51] the root cause was session services [15:52] and the fact that it silently depends on a specific revision of core/core18 [15:52] we hit an EROFS and the snapd refresh fails to proceed [15:52] pawel is reporting both issues [15:52] they are well defined and should be relatively easy to fix [15:52] at least the breaking root cause [15:52] rogpeppe: we fixed snapd on your system [15:53] rogpeppe but we refrained from doing anything that would reboot the device [15:53] rogpeppe we also fixed the boot partition [15:53] rogpeppe with a bit of luck, you should be able to refresh the core18 snap now [15:53] rogpeppe and it should reboot successfully [15:53] or if it does not, there are more bugs for us to find [15:53] rogpeppe if you choose to refresh core18, do let us know what happened [15:53] rogpeppe we also mounted the boot partition back, so it should be all right now [15:54] mvo let me know if you want to talk about any details [15:54] * mvo is in a meeting [15:56] ack [16:01] mvo, zyga https://bugs.launchpad.net/snapd/+bug/1899664 and https://bugs.launchpad.net/snapd/+bug/1899665 [16:01] Bug #1899664: snapd refresh on old core18 fails due to read-only /etc/dbus-1/session.d [16:01] Bug #1899665: Failed refresh of snapd drops current symlink on failure [16:02] pedronis: ^ [16:03] pstolowski \o/ [16:04] thank you! [16:04] pstolowski: without having looked at those, how hard do you think this is to fix? could we work on this as a blue item relatively soon? [16:04] zyga: thanks! [16:04] and thanks zyga and pstolowski (and rogpeppe of course!) [16:04] zyga: how would i go about refreshing the core18 snap? [16:05] rogpeppe snap refresh core18 [16:06] zyga: ok, i'll try that. i.e. run that command, then `sudo reboot now`, right? [16:06] no need, it will reboot itself [16:06] mvo: yes i will work on them, the r/o filesystem will be easy, not sure about the one re symlink, but probably not too complicated either [16:06] rogpeppe note that the way we fixed your system is ephemeral [16:06] and if core18 fails to refresh [16:06] snapd will break itself again [16:07] that is why I wanted to know whether this works or fails once the reboot completes [16:07] rogpeppe oh, and we re-started the hydroctl service as well [16:07] zyga: yay! [16:08] zyga: it's working!
rogpeppe: feel free to remove my ssh access when you confirm core18 works [16:09] mvo: interestingly, snapd will hit this case in a normal way [16:09] that I did not think about before [16:09] pstolowski we are here because that device has a long refresh window [16:09] it will refresh infrequently [16:09] and when it does [16:09] it refreshes snapd first [16:09] so I think the severity of https://bugs.launchpad.net/snapd/+bug/1899664 should be increased [16:09] Bug #1899664: snapd refresh on old core18 fails due to read-only /etc/dbus-1/session.d [16:10] it's not such a special case after all [16:10] yeah i didn't set sev yet [16:10] +1 [16:13] * mvo hugs pstolowski [16:13] * mvo hugs zyga too [16:13] was a collective effort really [16:13] zyga: after "snap refresh core18": [16:13] error: cannot perform the following tasks: [16:13] - Make current revision for snap "core18" unavailable (cannot set next boot: cannot determine bootloader) [16:13] - Make snap "core18" (1889) available to the system (cannot set next boot: cannot determine bootloader) [16:13] mvo: eod for today, i'll work on fixes tomorrow morning [16:14] pstolowski: sure thing [16:14] ooh woot [16:14] looks like i'll log in there again [16:15] core18 20190723 1076 latest/stable canonicalβœ“ base,disabled [16:15] it did the same for core18 [16:15] it is disabled now [16:18] rogpeppe oohh [16:18] rogpeppe thank you, we will look again tomorrow [16:18] zyga: ok, thanks [16:18] rogpeppe this device, while very unfortunate for you, will really help make snapd more robust [16:19] zyga: i hope so :) [16:22] looks like an error on 'Make snap ... available to the system' results in a wrong undo, this is what i see for core18 and the same happened for snapd [16:22] but yes, i'll continue tomorrow [16:23] cu [16:47] * zyga-x240 works on the test [16:48] so cool things [16:48] this is my favourite test! [18:27] * zyga runs the 2nd fsck test [18:42] ... [18:42] rebooting [18:42] I mean in the test [18:44] * cachio doctor appointment [18:47] o/ [19:16] more iterations [19:18] * zyga goes to do some evening housework while tests run [19:24] woot [19:24] tests pass [19:45] OMG THE RAIN [19:46] my dog decided to have a slow long walk [19:46] I'm so wet [19:49] trying the core fsck test now [19:49] I'll push both tests at once [19:50] https://github.com/snapcore/snapd/pull/9446 needs review [19:50] PR #9446: overlord,usersession: initial notifications of pending refreshes [19:50] https://github.com/snapcore/snapd/pull/9422 needs review as well and is short [19:50] PR #9422: overlord: add link participant for linkage transitions [19:56] * zyga shower [20:09] 16 passes [20:09] 18 and 20 are in progress [20:09] * zyga tea [20:14] amurray: oh is it morning for you?
:) [20:21] 18 passes [20:21] now just 20 [20:21] actually making tea :) [20:44] core20 takes foreeeever to test boot [20:44] but close [20:45] PR snapcraft#3315 opened: build(deps-dev): bump junit from 3.8.1 to 4.13.1 in /tests/spread/plugins/v1/maven/snaps/maven-hello/my-app [20:45] PR snapcraft#3316 opened: build(deps-dev): bump junit from 3.8.1 to 4.13.1 in /tests/spread/plugins/v1/maven/snaps/legacy-maven-hello/my-app [21:03] core20 should be good, running a clean test now [21:31] * zyga-x240 pushed https://github.com/snapcore/snapd/pull/9499 and EODs [21:31] PR #9499: tests: add tests for fsck [21:31] xnox: ^ perhaps you know what is responsible for fsck.vfat in uc20 [21:32] this test shows that ubuntu-seed stays corrupted across reboot, a regression compared to core 16 and core 18 [21:33] * zyga-x240 EODs [21:33] xnox: if you have feedback please comment on the PR, I'll review that tomorrow [21:35] PR snapd#9499 opened: tests: add tests for fsck
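For readers unfamiliar with the shape of such a test, here is a rough sketch of a spread task along the lines zyga described to amurray earlier (summary plus execute in a task.yaml under tests/). It is not the contents of PR #9499: the device, mount point, and the dirtying step are illustrative assumptions, while REBOOT and the $SPREAD_REBOOT counter are standard spread facilities for tests that span a reboot.

    summary: a dirty vfat boot partition is repaired across reboot

    execute: |
        # Hypothetical device and mount point; a real test would discover
        # them, and they differ across pc/pi images and core 16/18/20.
        DEV=/dev/sda1
        MNT=/boot/uefi
        if [ "$SPREAD_REBOOT" = 0 ]; then
            umount "$MNT"
            # ...mark the filesystem dirty here; the method actually used
            # by PR #9499 is not reproduced in this sketch...
            mount "$DEV" "$MNT"
            REBOOT
        fi
        # After the reboot, fsck in no-change mode (-n) must report a
        # clean filesystem, i.e. something on the way up repaired it.
        umount "$MNT"
        fsck.vfat -n "$DEV"
        mount "$DEV" "$MNT"

Per the closing remark above, core 16 and core 18 pass this kind of check while UC20 leaves ubuntu-seed corrupted across the reboot, which is exactly the regression a per-system variant of such a test would pin down.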