[05:15] morning
[05:25] Good morning
[05:26] I'm doing my school run now. See you all later
[06:08] back now
[06:08] mborzecki: it's cold today
[06:08] 13C at most
[06:08] brrr
[06:08] I hope it won't rain later
[06:08] and rainy here
[06:09] mborzecki: still, all the cars are stuck in traffic
[06:09] biking to school is way more robust
[06:09] how are we doing today?
[06:09] tests were awful yesterday
[06:09] failing left and right on random stuff
[06:09] store, portals, you name it
[06:14] good morning mvo
[06:15] a little cold and rainy today :)
[06:15] how was your Monday?
[06:19] quick breakfast
[06:19] hey zyga
[06:21] mvo: hey
[06:21] hey mborzecki
[06:25] PR snapd#7425 closed: channel: introduce Resolve and ResolveLocked
[06:25] zyga: can you take another look at #7412? looks like we could land it easily
[06:25] PR #7412: tests: run dbus-launch inside a systemd unit
[06:25] sure
[06:26] looks like we can merge 7342 too?
[06:26] and 7125 needs a second review (should be simple)
[06:27] +1 on 7412
[06:27] mvo, do you want me to merge or do you want to do it yourself?
[06:28] mvo: +1 on 7342
[06:28] I can do the merge, I just noticed it has no second +1
[06:28] looking at 7125 now
[06:29] PR snapd#7412 closed: tests: run dbus-launch inside a systemd unit
[06:31] PR snapd#7342 closed: fixme: rename ubuntu*architectures to dpkg*architectures
[06:42] mvo: reviewed 7125, +1 but please check my comment there
[06:43] zyga: thanks, looking now
[06:50] thank you
[06:52] mvo, mborzecki: I'd like to ask for a review of https://github.com/snapcore/snapd/pull/7435
[06:53] PR #7435: tests: explicitly restore after using LXD
[06:53] it's a blocker for progress on https://github.com/snapcore/snapd/pull/7168
[06:53] PR #7168: tests: measure testbed for leaking mountinfo entries
[06:54] I have one PR slot open so I'll work on finishing and proposing a mount-ns extension that involves a mimic, so that we can properly evaluate https://github.com/snapcore/snapd/pull/7436 later
[06:54] PR #7436: many: make per-snap mount namespace MS_SHARED
[06:57] mvo: off topic, last night I was playing with a raspberry pi
[06:57] and I think we can slightly improve our watchdog story there
[06:57] specifically around try mode boots
[06:59] zyga: oh? tell me more
[07:00] I read a little about the watchdog on the pi
[07:00] it's a bit weird, it has a fixed 15-second interval
[07:00] we could enable it from the boot loader
[07:00] so any try mode boot can recover
[07:00] I will poke around in my free time over evenings
[07:00] maybe I will arrive at something that works
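(Aside: a minimal sketch of driving the Pi watchdog from userspace rather than from the boot loader, assuming the stock bcm2835 watchdog driver and systemd's hardware-watchdog support; the 14-second value is only an illustration, chosen to stay under the roughly 15-second hardware maximum mentioned above.)

```
# enable the hardware watchdog and have systemd (PID 1) pet it;
# if PID 1 ever hangs, the SoC resets the board after ~14 seconds
echo 'RuntimeWatchdogSec=14' | sudo tee -a /etc/systemd/system.conf
sudo systemctl daemon-reexec   # re-exec PID 1 so it opens /dev/watchdog
```

This would not by itself cover the try-mode boot case zyga describes, which needs the boot loader to arm the watchdog before the kernel is up.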
[07:01] hey, is anybody experiencing problems with the telegram snap? I've opened https://forum.snapcraft.io/t/telegram-snap-fails-to-start/13132
[07:02] abeato: that's new to me
[07:02] abeato: ls -ld /mnt ?
[07:03] zyga, $ ls -ld /mnt
[07:03] drwxr-xr-x 2 root root 4096 jul 19 2016 /mnt
[07:03] ok, so regular directory, not a symlink
=== pstolowski|afk is now known as pstolowski
[07:03] mornings
[07:05] abeato: can you please check how many files matching the "*snap-confine*" glob are present in /etc/apparmor.d/
[07:06] zyga, https://paste.ubuntu.com/p/xt7NyYgPKr/
[07:06] that's that!
[07:06] mvo: ^^^^
[07:06] sep 11?
[07:06] abeato: one of the files is wrong
[07:06] it's a bug in our postinst script I believe
[07:06] mvo: should that be fixed and released?
[07:06] hm, interesting
[07:06] abeato: what does "apt-cache policy snapd" say
[07:06] that will be most useful to mvo
[07:07] zyga, https://paste.ubuntu.com/p/G2JMwHJSJG/
[07:08] abeato: the fix for this bug is in 61cc58dbb0a7a1a785e9e3c391b6f593df892839
[07:08] Date: Wed Aug 14 09:43:41 2019 +0200
[07:08] it may not be released yet, perhaps
[07:08] mvo: ^ can you confirm if 2.40 has this
[07:09] abeato: yeah, the /etc/apparmor.d/usr.lib.snapd.snap-confine should not be there :(
[07:09] zyga: yeah, 2.40 should fix it
[07:09] if you want to fix your system please remove the file mvo mentioned and call sudo apparmor_parser -r /etc/apparmor.d/usr.lib.snapd.snap-confine.real
[07:09] mvo, zyga snapd journal: https://paste.ubuntu.com/p/wHsRR4R2xD/
[07:09] mvo: in that case the fix doesn't work
[07:09] zyga: oh well
[07:09] zyga: let me look at this again
[07:09] thank you!
[07:10] do you need more data?
[07:10] abeato: can you please update the forum thread with the log from this conversation?
[07:10] zyga: can you take a quick look at https://github.com/snapcore/snapd/pull/7109 ? pushed some changes there yday
[07:10] PR #7109: snap-confine: fallback gracefully on a cgroup v2 only system
[07:10] abeato: a bug report (super small, just the data you already pasted)
[07:10] zyga, sure
[07:10] abeato: then I will do a sledgehammer fix
[07:10] abeato: thank you
[07:10] mborzecki: sure, looking now
[07:10] mvo, launchpad?
[07:10] abeato: plus the content of the /etc/apparmor.d/usr.lib.snapd.snap-confine please
[07:10] abeato: yeah
[07:10] ok
[07:10] abeato: or if it's already in the forum that's fine
[07:11] abeato: just need a reference in the PR
[07:11] it is already in the forum, yes
[07:11] I will update the forum post then
[07:11] abeato: then that should be fine
[07:11] abeato: thank you!
[07:11] np
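(Aside: the manual recovery collected from the exchange above into one sequence; the paths are exactly the ones mvo and zyga mention, and the proper fix is the postinst cleanup in PR snapd#7439 below.)

```
ls /etc/apparmor.d/*snap-confine*                   # only usr.lib.snapd.snap-confine.real should remain
sudo rm /etc/apparmor.d/usr.lib.snapd.snap-confine  # drop the obsolete leftover profile
sudo apparmor_parser -r /etc/apparmor.d/usr.lib.snapd.snap-confine.real  # reload the correct profile
```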
[07:12] anyone else seen google:debian-9-64:tests/main/snap-service-watchdog fail recently?
[07:12] PR snapd#7438 opened: devicestate: add support for base->base remodel
[07:13] for some reason the snap app gets SIGABRT https://paste.ubuntu.com/p/Z6k382RCrD/
[07:17] mvo, zyga https://forum.snapcraft.io/t/telegram-snap-fails-to-start/13132 updated
[07:18] this one is interesting https://paste.ubuntu.com/p/kmGzQzZRHJ/ probably something for Chipaca or pedronis
[07:18] store woes?
[07:24] idk, nonce is logged in the POST request, so it is sent ;)
[07:27] abeato: could you please pastebin me /var/lib/dpkg/info/ubuntu-core-launcher.conffiles
[07:27] abeato: and snapd.conffiles too?
[07:28] mvo, there is nothing starting with ubuntu-core* in /var/lib/dpkg/info/
=== Greyztar- is now known as Greyztar
[07:29] abeato: uh, sorry, please see if there is "snap-confine.conffiles"
[07:29] mvo: are conffiles retained after a package is removed?
[07:30] mvo: that is, they remain until purged?
[07:30] zyga: yes
[07:30] zyga: correct
[07:30] mvo, that file is not there either
[07:30] mvo, snapd.conffiles exists
[07:31] mvo, https://paste.ubuntu.com/p/pTN8P7Mtt2/ - but note that I already removed the old file and ran apparmor_parser to fix the problem
[07:31] abeato: any output from grep usr.lib.snapd.snap-confine /var/lib/dpkg/info/*.conffiles
[07:31] ?
[07:31] $ grep usr.lib.snapd.snap-confine /var/lib/dpkg/info/*.conffiles
[07:32] abeato: yeah, that's fine - I'm mostly trying to figure out if it's still leftover in some dpkg files
[07:32] /var/lib/dpkg/info/snapd.conffiles:/etc/apparmor.d/usr.lib.snapd.snap-confine.real
[07:32] abeato: that's the only match?
[07:32] mvo, yes
[07:33] abeato: thanks! I'm slightly puzzled but that's fine, I think I know what to do (even though I'm not sure how this happens, i.e. dpkg should either know about the file or it should be gone :/)
[07:33] right, it's weird...
[07:39] abeato: did you develop snapd on this machine?
[07:39] perhaps it came from some earlier hacking
[07:39] zyga, no
[07:39] mvo: I need to take a break, back pain after last evening's longer session
[07:39] I'll stretch and be back in a few moments
[07:40] zyga: sure thing, get well!
[07:43] PR snapd#7439 opened: packaging: remove obsolete usr.lib.snapd.snap-confine in postinst
[07:44] abeato: -^
[07:45] mvo, great!
[08:16] zyga: https://forum.snapcraft.io/t/significance-of-info-files-in-run-snapd-ns/12938
[08:17] tests are red again :/
[08:19] pedronis: hi, i think you weren't around when i linked it, interesting failure i stumbled upon today https://paste.ubuntu.com/p/kmGzQzZRHJ/
[08:21] mborzecki: weird error given that we see the nonce in the requests log
[08:21] mhm
[08:23] for some reason the nonce the store just gave us is considered invalid
[08:23] if it repeats, it's worth poking the store people
[08:23] but we haven't touched anything in that area in a while
[08:28] we do send the exact thing we get back
[08:28] fwiw
[08:28] yes
[08:28] so it's not missing
[08:28] this isn't the first time we've seen this error
[08:29] i think it's worth chasing down
[08:29] * Chipaca gets on it
[08:36] mborzecki: anything changed that makes the tests red?
[08:37] mvo: no, looks like the usual stuff, desktop portal, occasionally installing snapd deps from the package archive, or store hiccups
[08:37] :(
[08:46] mborzecki: thank you, replied
[08:50] snap/channel/channel_test.go:139:17: undefined: arch.UbuntuArchitecture on master
[08:51] opening a pr in a bit
[08:52] mvo: I did a pass over the 2.42 PRs (except mine), they all need a little bit more work I fear
[08:52] mborzecki: crossing merges?
[08:52] pedronis: yes
[08:53] pedronis: ok, I'll have a look, thank you!
[08:54] mvo: I'll try to tweak the test to check for the systemd version in mine when I get a 2nd review, worst case it will not make 2.42
[08:54] PR snapd#7440 opened: snap/channel: fix unit tests, UbuntuArchitecture -> DpkgArchitecture
[08:55] also something weird with tests/unit/go on debian, gofmt is not installed (?)
[08:57] Chipaca: also in #7411 I have two wonderings that maybe you can help with (I @chipaca-ed you on them)
[08:57] PR #7411: cmd/model: output tweaks, add'l tests
[08:57] pedronis: ack
[08:58] pedronis: i noticed i didn't notice some of the things you pointed out
[08:58] will look in a bit
[09:32] mvo: your PRs for 2.42 also need 2nd reviews
[09:57] zyga: i wonder if this might have some relevance: Sep 09 19:27:54 localhost snapd[692]: handlers.go:459: Reported install problem for "core18" as f732dba4-d35a-11e9-a660-fa163e6cac46 OOPSID
[09:57] rogpeppe: it means that snapd fails to refresh core18
[09:57] aborts the transaction
[09:57] and rolls back
[09:57] that explains your reboot loop
[09:57] this is very useful information
[09:58] zyga: i'm not seeing a reboot loop currently FWIW
[09:58] mvo: ^ can we pull the log from that error tracker entry?
[09:58] rogpeppe: it will be re-attempted
[09:58] rogpeppe: until it refreshes successfully
[09:58] that report has hints as to what went wrong
[09:58] zyga: ah, so that's what explains the fact that it's rebooting periodically, i guess
[09:58] yes
[09:58] it's the transactional nature
[09:58] it's just not immune to problems
[09:59] * zyga found a bug (in what he was doing since morning)
[09:59] zyga: here's another one: Sep 09 07:40:40 localhost snapd[715]: handlers.go:459: Reported install problem for "core18" as fc12d9b2-d2e7-11e9-b568-fa163e102db1 OOPSID
[10:01] zyga: INFO Waiting for restart...
[10:01] ERROR cannot finish core18 installation, there was a rollback across reboot
[10:01] huh
[10:01] interesting
[10:01] nothing magic
[10:02] rogpeppe: what does snap version say?
[10:02] looks like the restart is not happening, I see a lot of "Waiting for restart"
[10:02] snapd version is 2.40
[10:03] zyga: https://pastebin.canonical.com/p/WHVwWjRTVs/
[10:03] this is the snapd + core18 arrangement, right?
[10:03] zyga:
[10:03] rogpeppe@localhost:~$ snap version
[10:03] snap 2.40
[10:03] snapd 2.40
[10:03] series 16
[10:03] kernel 4.15.0-1041-raspi2
[10:05] rogpeppe: can you run snap changes
[10:05] and pastebin that?
[10:06] zyga: http://paste.ubuntu.com/p/GbtJH9N6vv/
[10:06] how about snap tasks 11
[10:06] and snap tasks 13
[10:07] zyga: ?
[10:08] run the command: snap tasks 11
[10:08] and pastebin the output please
[10:08] zyga: http://paste.ubuntu.com/p/CnjtBWwrtr/
[10:09] oh
[10:09] Error yesterday at 19:25 UTC yesterday at 19:27 UTC Automatically connect eligible plugs and slots of snap "core18"
[10:09] it failed on auto-connect?!
[10:09] off to school to pick up the kids, afk for a bit
[10:09] pstolowski: ^ can auto-connect prevent a reboot?
[10:09] zyga: this is task 13: http://paste.ubuntu.com/p/SvvXP7rP5R/
[10:09] same thing happened here
[10:10] I think we're getting somewhere now
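(Aside: the inspection sequence zyga walks rogpeppe through, in one place; the change id 11 is specific to this system, so substitute whatever id `snap changes` reports with status Error.)

```
journalctl -u snapd.service | grep "Reported install problem"  # error-tracker OOPS ids, as pasted above
snap changes      # recent changes; a failed refresh shows up with status Error
snap tasks 11     # per-task breakdown of change 11, including which task failed
```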
[10:16] * Chipaca takes a break
[10:17] zyga: like any other task handler that errors out and triggers undo. it's implemented to retry on conflicts (which it did a couple of times in that log), but then there is a bunch of things it looks up in the state that can error out. slightly weird we don't see what the error was for this task
[10:17] pstolowski: we can presumably ask rogpeppe for the state file
[10:17] do you think it would help
[10:18] yes it might help, maybe we will be able to track down what changed in the state that made the task error out
[10:20] rogpeppe: can you please report a bug on bugs.launchpad.net/snapd
[10:20] rogpeppe: include a rough description of the problem as you see it
[10:21] rogpeppe: so yes, if you can grab and send us state.json that would be great (don't pastebin it as it has your macaroon etc)
[10:21] rogpeppe: and then work with pstolowski to attach the logs there
[10:21] rogpeppe: and the state file
[10:21] pstolowski: where does that file live?
[10:21] rogpeppe: /var/lib/snapd/
[10:21] rogpeppe: please make sure to use a private bug (when reporting it) because, as pawel said, the state file contains some shared secrets
[10:21] let us know if you need any help with reporting the bug
[10:22] we will use it for tracking and eventual regression testing
[10:22] mborzecki: found a small-ish bug just now, /etc/ssl is a special case, as you know
[10:22] mborzecki: as is /etc/alternatives
[10:22] mborzecki: and they don't play nicely with the trespassing detector
[10:23] zyga: are the macaroons the only secret things in there?
[10:23] yes
[10:24] zyga: ok, i've redacted them specifically.
[10:26] rogpeppe: ty
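(Aside: one hedged way to do that redaction before attaching the file to a bug, assuming the macaroons live under data.auth in state.json and that jq is available; verify nothing sensitive remains before uploading.)

```
sudo jq 'del(.data.auth)' /var/lib/snapd/state.json > state-redacted.json
```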
[10:43] zyga, pstolowski: https://bugs.launchpad.net/snapd/+bug/1843417
[10:43] Bug #1843417: ubuntu core installation goes down regularly
[10:43] thank you very much
[10:43] we'll try to get to the bottom of this
[10:43] pstolowski: can we use the state file to simulate a refresh somehow?
[10:44] zyga: thanks. BTW if it can't be fixed fairly soon, i'll need to move to another distribution, because winter is coming and my parents need heat in the house :)
[10:44] rogpeppe: you can try one specific command that may help you
[10:44] rogpeppe: snap refresh core18
[10:44] that will refresh the base snap
[10:44] zyga: ok, trying
[10:44] then you can try to refresh snapd
[10:44] essentially one-by-one
[10:45] rather than all at once
[10:45] if there's some kind of conflict happening
[10:45] it might be averted this way
[10:45] zyga: ok, it's refreshed and is rebooting
[10:45] zyga: well, in a minute
[10:46] zyga: ok, we'll see if it comes back up
[10:46] fingers crossed
[10:46] systems at scale are complex
[10:46] we wish to make this totally unattended
[10:46] but as reality shows, it's not trivial
[10:47] zyga: in theory possible but lots of mocking
[10:48] zyga: i'm somewhat unhappy about the reboot just happening randomly, not under some sort of control, particularly if there's a possibility that the system might not recover from it
[10:48] zyga: in this particular case, if this happened when people were away, it could result in the house not being heated enough and pipes freezing, leading to expensive damage
[10:49] rogpeppe: you can schedule updates
[10:49] there's a way to update predictably at very precise moments
[10:49] zyga: oh? how would i do that?
[10:49] https://snapcraft.io/docs/keeping-snaps-up-to-date
[10:49] if in doubt ask mborzecki, he knows this code very well
[10:50] or ask degville about the documentation for suggestions or improvements
[10:50] i'm looking into the state
[10:50] zyga: is there some documentation for the actual refresh timer syntax that isn't just examples?
[10:51] re
[10:51] I don't believe there is, what would you like, a more formal syntax?
[10:51] zyga: yup
[10:51] I believe it's a set of ranges
[10:51] but mborzecki can correct me on this
[10:51] zyga: and an explanation of the semantics
[10:51] mborzecki: is there a more formal syntax of https://snapcraft.io/docs/keeping-snaps-up-to-date
[10:51] zyga: with those docs, i'm left guessing
[10:51] zyga: what does the "5" in "fri5" mean, for example?
[10:52] rogpeppe: I think you're right - we should add a formal explanation of the syntax.
[10:52] rogpeppe: the semantics is that snapd will only attempt to refresh at a time that fits that schedule
[10:52] zyga: it's surely not the 5th friday in the month
[10:52] rogpeppe: just examples
[10:52] rogpeppe: it actually is
[10:52] rogpeppe: you can plan a monthly refresh this way
[10:52] zyga: most months don't have a 5th friday
[10:52] perhaps the example is not the best but that is the intent
[10:53] mborzecki: what would happen if you pick the 5th Friday, actually?
[10:53] rogpeppe: the examples list: fri5,23:00-01:00 - last Friday of the month, from 23:00 to 1:00 the next day
[10:53] rogpeppe: it's day[], 5 means the last week basically
[10:53] mborzecki: i don't understand why that's the last friday of the month - that's what i mean when i say that it would be good to actually document the semantics
[10:54] mborzecki: if the digit "5" is special, then it should say so
[10:54] mborzecki: for example, would fri9 mean the same thing?
[10:55] rogpeppe: some syntax was initially described here https://forum.snapcraft.io/t/refresh-scheduling-on-specific-days-of-the-month/1239/6 some changes were made along the way though
[10:55] mborzecki: more unknowns: would you be allowed to do "22:00~23:00/2" ?
[10:55] mborzecki: if so, what's the difference between that and "22:00-23:00/2" ?
[10:56] I think those are all good questions, thank you for engaging with us rogpeppe
[10:56] mborzecki: basically, i'd like to see some actual description of the semantics, not just a set of examples where i'm left to try to infer the actual rules
[10:57] rogpeppe: there's this page too: https://forum.snapcraft.io/t/timer-string-format/6562
[10:58] i guess it could use a little update with some details
[10:58] PR snapd#7441 opened: asserts,seed/seedwriter: follow snap type sorting in the model assertion snap listings
[10:58] that design doc from niemeyer is a great start. it would be nice if some more of that made it into the actual docs.
[11:00] it also mentions some stuff that isn't documented at all, such as `0:00~24:00/6:00`
[11:01] but maybe that's not implemented, i guess
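(Aside: worked examples of the timer format being discussed, set through the documented refresh.timer option; per mborzecki, the digit after the weekday selects the n-th occurrence of that day in the month, with 5 effectively meaning the last one. The mon1 value is an illustration, not taken from the docs excerpt above.)

```
sudo snap set system refresh.timer="fri5,23:00-01:00"  # last Friday of the month, 23:00 to 01:00
sudo snap set system refresh.timer="mon1,02:00-04:00"  # first Monday of the month
snap refresh --time   # show the schedule currently in effect and the next refresh window
```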
[11:02] zyga: FYI the pi did not successfully reboot.
[11:02] zyga: i'm gonna have to ask for it to be power cycled again
[11:02] rogpeppe: did it reboot but fail, or fail to reboot?
[11:03] zyga: i've got no way of knowing, i'm afraid
[11:03] huh, I see :/
[11:03] zyga: it said it was going down, my ssh connection was terminated, and it hasn't come back up again
[11:03] it has rebooted then
[11:03] zyga: well, it shut down :)
[11:03] and after someone reboots it again it should roll back
[11:04] zyga: and then i'm back in the same position as before?
[11:04] I was discussing a way to make that automatic in case of failure on the pi specifically
[11:04] yes
[11:04] zyga: ok. :-(
[11:04] rogpeppe: you can set the schedule to avoid refreshes while we try to understand the cause of the failure
[11:05] zyga: can i set the schedule so refreshes are turned off entirely?
[11:05] no, that's explicitly not available
[11:05] zyga: AFAICS the minimum frequency is once per month
[11:05] rogpeppe: you can try one more thing
[11:05] you can refresh snapd itself
[11:05] that should not reboot
[11:05] but give you a new software stack
[11:05] so that the bugs that were fixed since 2.40 can be applied
[11:06] well, the fixes that is, not the bugs
[11:06] rogpeppe: you can try to refresh snapd to the candidate channel to get it
[11:06] rogpeppe: with "snap refresh snapd --candidate"
[11:06] on core18 systems you no longer need to reboot to get a new snapd, fortunately
[11:07] using the same strategy you could even refresh to a hotfix branch that contains a fix for your machine
[11:07] which would allow you to refresh the rest of the system correctly, once we understand the nature of the failure
[11:09] zyga: ok, i'll try that when the system has been restarted, thanks
[11:12] zyga: so, the auto-connect handler errors out because WaitRestart() reports a rollback error, we hit the "// TODO: make sure this revision gets ignored for automatic refreshes" case again, there is a revision mismatch there. this was discussed a few months ago when we hit a similar case. pedronis also did some work around reboots recently but i'm not sure if that's in play here
[11:14] pstolowski: the checking for reboots has been added
[11:14] maybe we didn't remove all the TODOs
[11:14] zyga: tbh, in situations like this, maybe we should have a mechanism to temporarily disable refreshes, some local assertion or whatnot
[11:14] zyga: so auto-connect is just a victim here, the problem is elsewhere
[11:14] mborzecki: yeah
[11:15] I don't remember when it landed though
[11:16] zyga: do you think that the failure to reboot correctly is related to this problem here, or just another problem that happens to be exacerbating the issue?
[11:16] I think that it may be a separate problem
[11:17] perhaps it'd be good to look at the snap boot environment and see what it says
[11:17] mborzecki: you can set refresh.hold, no?
[11:17] aahh
[11:17] right
[11:18] a bit annoying to set, we really need a command for it, but it is there
[11:18] zyga: ok, i'm back into the system
[11:19] pstolowski: we should drop that todo, the check is now done, it is done much earlier, in the daemon itself
[11:19] zyga: it's maybe interesting that it seems the system only reboots successfully after exactly two power cycles.
[11:19] pedronis: refresh.hold? what is that
[11:20] zyga: https://forum.snapcraft.io/t/system-options/87#heading--refresh-hold
[11:20] it's on the same page as the timer
[11:21] ah, it's not on https://snapcraft.io/docs/keeping-snaps-up-to-date
[11:21] zyga: afaiu it's not intended to be used by the user
[11:22] well, it's annoying but it exists
[11:22] I would not recommend using it without a reason
[11:22] but it can be used
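(Aside: what refresh.hold from the system-options page linked above looks like in practice, as a sketch; it takes an RFC 3339 timestamp and can push automatic refreshes out by up to 60 days.)

```
sudo snap set system refresh.hold="2019-11-08T10:00:00+00:00"  # no automatic refreshes until then
snap refresh --time   # the hold should be reflected in the schedule output
```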
[11:24] am i right that there's no way to specify that the system will refresh on the first day of every month?
[11:25] I believe you're correct
[11:25] pedronis: should the restart check be dropped from the auto-connect handler?
[11:25] zyga: thanks
[11:25] pstolowski: in which sense?
[11:26] the answer is likely no
[11:26] pedronis: removing WaitRestart() from auto-connect
[11:26] pstolowski: no
[11:26] it will be hit for a bit until we restart/reboot
[11:26] what I said is that it's checked earlier, when a reboot is triggered
[11:26] whether it happened
[11:27] at some point we might be able to not use WaitRestart but that requires changes to the taskrunner etc
[11:31] pedronis: i see. ok, we need to find out what triggers this check to fail sometimes
[11:32] pstolowski: nowadays, if we reboot 3 times and it fails, it will trigger
[11:32] i've a power outage, need to power off soon before my ups runs out of battery
[11:32] well, try to reboot 3 times
[11:37] 2019 is still the year when I wish launchpad supported markdown
[11:41] * pstolowski lunch
[11:42] mborzecki: I reported two bugs that I found today https://bugs.launchpad.net/snapd/+bug/1843421 and https://bugs.launchpad.net/snapd/+bug/1843423
[11:42] Bug #1843421: snap-update-ns doesn't know about the special property of /etc/ssl and /etc/alternatives
[11:42] Bug #1843423: snap-update-ns fails to construct a layout in /etc/test-snapd/foo
[11:43] ondra: ^ some useful bugs for you
[11:43] ogra: in case you run into something like that in the field
[11:44] I meant ondra twice but I think ogra may run into things like that as well
[11:45] zyga: btw. does s-u-n need to know about nsswitch.conf too?
[11:45] mborzecki: probably so
[11:46] zyga thank you :)
[11:46] I'll do my best to fix them obviously
[11:46] writing tests is useful
[11:56] imma go lunch
[11:57] enjoy :)
[11:57] i'll try
[12:11] is it just me or are things extra slow today
[12:11] setting up the main test suite takes about 10 minutes to complete
[12:13] zyga: and spread tests are failing
[12:13] how?
[12:14] I haven't seen any failures though I'm mostly writing new tests now
[12:14] (but no store related failures during that process either)
=== ricab is now known as ricab|brb
=== grumble is now known as \emph{grumble}
[12:41] * zyga quick lunch
=== ricab is now known as ricab|lunch
[13:25] jdstrand: hello, can you please enqueue https://github.com/snapcore/snapd/pull/7421 for a concept review of the idea rather than a security review
[13:25] PR #7421: cmd/snap-confine: unmount /writable from snap view
[13:38] zyga: sure
[13:42] mborzecki: I don't understand the spread failures, many of them don't even seem to have clear errors, or I'm not looking right (quite possible)
[13:59] am I doing something dumb?
[13:59] $ snap info multipass | grep installed
[13:59] installed: 0.9.0-dev.171+g7a968814 (x4) 194MB classic
[13:59] $ snap refresh multipass --revision 1125 --amend
[13:59] error: local snap "multipass" is unknown to the store, use --amend to proceed anyway
[14:01] PR core18#139 opened: hooks: add missing dosfstools to get fsck.fat
[14:01] rogpeppe: hello
[14:01] rogpeppe: we have some more ideas
[14:02] zyga: cool!
[14:03] zyga: BTW ISTM that the fail-to-reboot issue is the main problem here - i'm not sure that the other issue would be a real problem if the reboot hadn't failed
[14:03] rogpeppe: we looked some more and we suspect the boot partition that uses FAT is corrupted; we found a bug related to an absent fsck on core18 systems
[14:03] rogpeppe: we devised a way forward that you should be able to do remotely
[14:03] rogpeppe: if you have an app using "core" installed you should have access to fsck.vfat from /snap/core/current/usr/sbin/
[14:03] rogpeppe: you can use that to fsck the boot partition
[14:04] rogpeppe: you can unmount it for the duration of the check as well
[14:04] zyga: given that i re-flashed the card very recently, it seems slightly unlikely that it's corrupted already (within a few hours of first installing) but happy to try
[14:04] rogpeppe: mvo looked at some of the error tracker logs and found what I believe was the kernel telling us about fs corruption of the FAT partition
[14:05] rogpeppe: so that's the first step, I think you know how to run that without hand-holding but please ask for help if you need any
[14:05] rogpeppe: try to run it in a mode verbose enough for us to see if there were any errors there
[14:05] rogpeppe@localhost:~$ ls -l /snap/core/current/usr/sbin/*fsck*
[14:05] ls: cannot access '/snap/core/current/usr/sbin/*fsck*': No such file or directory
[14:05] rogpeppe@localhost:~$ ls -l /snap/core/current/usr/sbin/*vfat*
[14:05] ls: cannot access '/snap/core/current/usr/sbin/*vfat*': No such file or directory
[14:06] oh, silly me
[14:06] just /snap/core/current/sbin/
[14:06] not /usr
[14:08] zyga: so i'm planning to run these commands; do they seem right to you?
[14:08] umount /boot/uboot
[14:08] fsck -V /dev/mmcblk0p1
[14:08] yes
[14:08] they look good
[14:09] (assuming PATH is set up to find the fsck.vfat)
[14:10] rogpeppe: as an extra remark, it's sometimes good to stop snapd.service during things like this (hands-on experiments)
=== ricab|lunch is now known as ricab
[14:10] to avoid background activity
[14:10] zyga: http://paste.ubuntu.com/p/Fm3RpdC8d2/
[14:11] interesting!
[14:11] zyga: i'd definitely unmounted the fs
[14:11] perhaps uboot and the kernel disagree on which boot sector to use and then something gets out of sync later
[14:11] for the purpose of the experiment, copy the original to a backup
[14:11] I _believe_
[14:12] that is what the kernel would use
[14:12] but I welcome the advice of mborzecki
[14:12] mborzecki: ^
[14:12] zyga: and remove the dirty bit?
[14:12] yes, but please understand my POV of trying to fix the partition and seeing if that means you can correctly boot out of the problem
[14:12] one more idea
[14:12] perhaps tarball all of the boot partition
[14:13] or even dd the whole partition to ext4 somewhere
[14:13] for forensics
[14:13] dd is better
[14:13] zyga: sure, i could send you a copy
[14:13] as you don't have to mount to do it
[14:13] yes, we can then look at it bit by bit in hexedit
[14:13] fortunately we don't keep that many files there
[14:15] zyga: ok, downloading disk image now; will upload to s3
[14:15] thank you
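(Aside: the whole check-and-capture plan from this exchange, assembled in order; it assumes a core snap is installed so fsck.vfat is available, and takes the dd copy before any repair so the forensic image stays untouched.)

```
sudo systemctl stop snapd.service snapd.socket        # avoid background writes during the experiment
sudo umount /boot/uboot
sudo dd if=/dev/mmcblk0p1 bs=4M | gzip > bootcopy.gz  # raw image of the FAT boot partition, for forensics
sudo /snap/core/current/sbin/fsck.vfat -n -v /dev/mmcblk0p1   # -n: check only, report problems
sudo /snap/core/current/sbin/fsck.vfat -r /dev/mmcblk0p1      # -r: interactively repair what -n found
sudo mount /dev/mmcblk0p1 /boot/uboot
sudo systemctl start snapd.socket snapd.service
```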
[14:17] zyga: i still find it very odd that it only boots up ok once every other time. i can't think of anything that might be causing such predictable boot failures.
[14:17] I can offer one
[14:17] zyga: fwiw, i think it's worth checking whether the same "incorrectly unmounted" warning appears on a cleanly built image
[14:18] mborzecki: good idea
[14:18] rogpeppe: if uboot reads the FAT differently (we saw that at least once in the past) and sees a different file than linux
[14:18] then snapd will configure the boot loader to boot kernel-1, core-1 in "trying" mode
[14:18] zyga: BTW the s/w i'm running ran fine without any issues for months on end previously
[14:18] the boot loader will go but never see those values
[14:18] booting something else
[14:18] perhaps something that is removed now
[14:19] or perhaps something that is there but disagrees with what snapd expected
[14:19] so snapd will change the boot configuration again
[14:19] and plan another reboot
[14:20] rogpeppe: but the point is that the oscillation may be kernel/uboot disagreeing on the contents of a specific file
[14:20] and snapd writing to that file in between boots
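(Aside: a much-simplified paraphrase of the try/trying handshake, to make the oscillation scenario concrete; the real logic lives in the bootloader-script and uboot.env.in linked further down, and fw_printenv/fw_setenv stand in here for what u-boot does natively.)

```
# on each boot, the boot loader inspects snap_mode:
mode="$(fw_printenv -n snap_mode)"
if [ "$mode" = "try" ]; then
    # snapd staged new snaps: mark the attempt and boot the new set once
    fw_setenv snap_mode trying
    kernel="$(fw_printenv -n snap_try_kernel)"
    core="$(fw_printenv -n snap_try_core)"
elif [ "$mode" = "trying" ]; then
    # booting again without snapd having confirmed success: roll back
    fw_setenv snap_mode ""
    kernel="$(fw_printenv -n snap_kernel)"
    core="$(fw_printenv -n snap_core)"
else
    kernel="$(fw_printenv -n snap_kernel)"   # normal boot
    core="$(fw_printenv -n snap_core)"
fi
# if u-boot and the kernel read different bytes of uboot.env (the corruption
# theory above), the two sides never see each other's writes and each refresh
# bounces between these branches
```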
[14:22] zyga: ok, interesting
[14:23] PR snapd#7440 closed: snap/channel: fix unit tests, UbuntuArchitecture -> DpkgArchitecture
[14:25] zyga: try this: https://rogpeppe-scratch.s3.amazon.com/bootcopy.gz
[14:25] grabbing
[14:25] dns issue?
[14:26] zyga: i've probably forgotten how s3 urls work
[14:26] i thought they'd turned off that feature
[14:27] of being able to just url the stuff
[14:27] (but i didn't pay too much attention to that email because i don't use it)
[14:29] zyga: just the usual amazon eventual consistency issues; try again
[14:29] zyga: fsck doesn't complain with a pristine core image
[14:29] mborzecki: thank you for checking!
[14:29] rogpeppe: still nothing
[14:29] zyga: ha, it works for me :-\
[14:29] my dns may have cached it as gone?
[14:30] how big is it?
[14:30] can you email me@zygoon.pl
[14:30] zyga: 48621287 bytes
[14:30] might be faster :)
[14:30] I'll go over the bits in the evening to see what's wrong
[14:30] meanwhile you can attempt to fix the FAT
[14:30] or even remake it and copy the files back
[14:31] zyga: for the time being, i've just slowed down updates so it'll only update on the first monday of the month
[14:31] excellent
[14:31] rogpeppe: as pedronis said, you can also use refresh.hold to delay up to 60 days
[14:32] zyga: sent
[14:33] PR snapd#7442 opened: tests: extend mount-ns test to handle mimics
[14:33] zyga: i think 60 days isn't enough longer than 30 days to justify the unpredictability
[14:33] sure, just saying
[14:33] let's try to fix that FAT online
[14:33] apply all the fixes that fsck would normally do
[14:34] zyga: you think that might fix the issue for good?
[14:34] I think it's likely
[14:34] but we don't have the data trends from the error tracker AFAIK so perhaps there's more that we're not aware of
[14:34] zyga: this is quite concerning BTW - this was an absolutely pristine image created following the instructions on the web to the letter
[14:35] zyga: if i'm seeing this problem, then i'd guess that everyone else using ubuntu core is too
[14:35] indeed, we proposed that the boot partition be read-only outside of the transactions that need to use it
[14:35] so that random power failures don't leave the FAT in a mounted state
[14:35] zyga: that seems like a good idea.
[14:35] rogpeppe: there are some issues that can be specific to your device, e.g. the SD card may just really be failing
[14:35] I have a number of cards that reliably corrupt a fixed offset
[14:35] zyga: it's a near-new SD card too, but i guess that's possible
[14:36] you can write zeros or ones, you keep reading that blob that they somehow store
[14:36] rogpeppe: the one I have is a little-used sandisk pro 32GB card
[14:36] rogpeppe: it failed a few weeks after purchase
[14:36] zyga: that would make it two 32GB SD cards failing in a similar way then
[14:37] zyga: because i saw a similar issue with the Pi 2 and assumed the sd card had gone
[14:37] zyga: i actually have that pi with me in fact
[14:37] I got the boot image now, thank you
[14:38] rogpeppe: try something like 'etcher' to see if you can write a pristine image and read it back correctly if you want to check that
[14:38] zyga: yeah, i'll try that
[14:40] zyga: i wasn't surprised when the original sd card failed BTW - it had been in constant use for about 3 years.
[14:40] zyga: ... if it did fail, of course
[14:40] indeed, analysis will reveal the cause
[14:40] there are some moving parts
[14:40] and some failures on our end
[14:40] sil2100: hey, there's one PR for core18
[14:40] sil2100: it's related to what we are discussing now
[14:41] sil2100: we didn't seed fsck.vfat on core18
[14:41] sil2100: do you think you could review it please?
[14:41] https://github.com/snapcore/core18/pull/139
[14:41] PR core18#139: hooks: add missing dosfstools to get fsck.fat
[14:48] rogpeppe: thank you so much for reporting this - and yes, super concerning to us, and we'll dig into it
[14:48] * mvo hugs zyga for digging into it
[14:48] mvo: thanks for your interaction :)
[14:49] rogpeppe: this is what I see on the partition: https://pastebin.ubuntu.com/p/tVXVjCK99Q/
[14:50] I will now check the contents of config.txt, cmdline.txt and uboot.env
[14:50] for one, I like that mtools exists
[14:50] and wish that there was an ext4 variant :)
[14:50] zyga: i don't know about mtools...
[14:50] rogpeppe: it's a GNU tool for interacting with FAT offline
[14:51] hello folks, is there a fix upcoming for "- Download snap "crystal" (71) from channel "latest/stable" (stream error: stream ID 1; PROTOCOL_ERROR)" ?
[14:52] this is really slowing us down
[15:00] sergiusens: we don't have a fix, only a workaround mvo worked on
[15:01] zyga: does the workaround require work (a setting) on our side?
[15:05] zyga I found one cloud instance with 4 broken snaps I was not able to remove; with the latest snapd, I was able to remove them all :) Great work!
[15:05] sergiusens: no, it's automatic
[15:05] ondra: thank you so much :)
[15:06] zyga thank you for fixing it :)
[15:06] ondra: fixing bugs is sometimes very draining, I'm very glad I was able to help you and others in FE
[15:06] rogpeppe: this is the hexyl dump of uboot.env, I'll check out what it says next https://paste.ubuntu.com/p/GgScB76RyT/ -- specifically to see if it is in agreement with snapd's state
[15:07] though I must say that the colorized output from hexyl is easier to read as it shows NUL bytes and other such stuff in a clear, distinct color
[15:08] mvo: one other idea, just looking at this, is to have two uboot environment files: one for the fixed program and the other one for just the handful of actual variables we need
[15:08] oh, mvo is not online anymore
[15:12] rogpeppe: so looking here I see we have the following things: snap_core=core18_1076.snap, snap_kernel=pi-kernel_44.snap, snap_mode= (empty string), snap_try_core=core18_1100.snap, snap_try_kernel=pi-kernel_51.snap
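(Aside: the offline inspection reconstructed from tools named in this log; mcopy is part of mtools and reads the FAT image directly, no mounting needed, and strings then surfaces the snap_* variables quoted above.)

```
zcat bootcopy.gz > bootcopy.img        # the dd image of the boot partition
mcopy -i bootcopy.img ::uboot.env .    # copy the u-boot environment file out of the FAT image
strings uboot.env | grep '^snap_'      # snap_core, snap_kernel, snap_mode, snap_try_*
```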
[15:12] I need to reference the boot logic for a second to understand snap_mode=""
[15:12] mborzecki: ^ unless you remember
[15:12] rogpeppe: do you have the snaps and revisions listed there in /var/lib/snapd/snaps/
[15:13] they should all be mounted as well
[15:13] and visible in snap list --all
[15:13] rogpeppe: (the output of snap list --all would help as well)
[15:16] zyga: this? https://github.com/snapcore/core-build/blob/master/initramfs/scripts/bootloader-script#L89
[15:17] zyga: or this one https://github.com/snapcore/pi3-gadget/blob/16/uboot.env.in#L48
[15:17] huh
[15:17] https://github.com/snapcore/core-build/blob/master/initramfs/scripts/bootloader-script#L104
[15:18] zyga: http://paste.ubuntu.com/p/ktw4XYgrvG/
[15:18] I don't understand how that works
[15:18] ah, wait
[15:18] didn't notice the nesting
[15:18] * zyga re-reads
[15:19] rogpeppe: do you have core18 at revision 1100?
[15:19] I mean
[15:19] I don't see it
[15:19] so I assume that's why it fails
[15:19] it seems we are trying to get to a core18 revision that's simply not here
[15:20] that would explain the immediate failure
[15:20] zyga: i see core18 at 1076
[15:20] though I didn't check what the uboot script does if it cannot find that snap
[15:20] zyga: what's special about 1100?
[15:20] rogpeppe: nothing, it's just referenced from your uboot.env but not present on the system
[15:20] zyga: oh, i see
[15:21] zyga: i wonder how that happened
[15:21] indeed
[15:21] though snapd may have undone the 1100 transaction
[15:21] removing the file from disk
[15:22] if you feel lucky, fix the boot partition with fsck
[15:22] snap refresh core18
[15:22] and check if it manages
[15:22] one other lesson from this
[15:22] is for snapd to fix any boot variables that are inconsistent with reality
[15:22] we have one boot mode
[15:23] as in, one variable called snap_mode
[15:23] that impacts two variables, "trying"
[15:23] i'm not feeling very lucky currently :)
[15:23] and it's clear that in this case there's a chance one of them will fail
[15:23] rogpeppe: I'll collect this in a retrospective
[15:23] there's a lot for us to learn from this
[15:23] rogpeppe: I would suggest fixing the FAT partition
[15:24] that might be enough to fix the other issues
[15:24] zyga: ok, i'll try that
[15:24] I'll collect all of this for a retrospective and share it with you
[15:25] I was thinking about breaking for dinner now
[15:25] zyga: ok, dirty bit removed. it didn't give me an option to do anything else
[15:25] zyga: i suspect i may have done the wrong thing there :-'
[15:25] :-\
[15:25] oh?
[15:25] note, you can always dd it back!
[15:26] zyga: good point!
[15:26] what did you do?
[15:26] try fixing it, mounting it
[15:26] and looking at the files
[15:26] zyga: http://paste.ubuntu.com/p/MtVGCFGGwk/
[15:26] at least the boot.env
[15:27] zyga: i suspect i should've said "no" to removing the dirty bit, i think
[15:27] ah, no
[15:27] I think that's fine
[15:27] it's just a bit
[15:27] fat has no journal
[15:27] so apart from a top-down scan there's little to do
[15:27] zyga: i thought i'd get the option to address the other issue too
[15:28] zyga: so the dirty bit is the only thing that was wrong?
[15:28] ah right
[15:28] I honestly don't know
[15:28] can you fsck again?
[15:28] maybe with some --force option
[15:29] zyga: i ran it again - it does nothing
[15:30] did you use -V?
[15:38] rogpeppe: so
[15:38] rogpeppe: writing the report, I'm not sure we understand what really failed on boot
[15:38] rogpeppe: we know that the FAT was slightly corrupt
[15:38] that it was not unmounted cleanly
[15:38] rogpeppe: we do know that core18 revision 1100 was missing from your disk
[15:39] rogpeppe: though perhaps it was removed by snapd in its undo path
[15:39] rogpeppe: I would like to know if you'd be willing to attempt another reboot, at your convenience, coupled with another reboot done by "snap refresh core18"
[15:40] zyga: so one reboot without "snap refresh core18", then run "snap refresh core18", then let that reboot by itself?
[15:40] yes
[15:40] but only if you have confidence you can recover manually
[15:40] and it's not super inconvenient for you
[15:41] zyga: ok, i'll try that. what should i do when it fails to restart after the first reboot?
[15:41] power cycle
[15:42] zyga: ok
[15:42] if that fails we may be SOL but I don't think it will come to that
[15:42] you may want to mount the partition back
[15:42] mvo: hey, welcome back
[15:42] zyga: does that make a difference?
[15:42] rogpeppe: in case snapd wishes to write to it
[15:42] otherwise no
[15:42] zyga: ok, i'll mount it again now
[15:43] hey zyga - what's new?
[15:43] mvo: we found some things, one sec, I'll share my notes
[15:47] hey folks, could I get another review on https://github.com/snapcore/snapd/pull/7429 ? mvo maybe, if you're not EOD yet and have a couple minutes :-)
[15:47] PR #7429: wrappers/services: add CurrentSnapServiceStates + unit tests
[15:55] ijohnson: heh, let me have a look
[15:57] thanks :-)
[15:58] rogpeppe: let us know what you find please
[16:01] * Chipaca goes for a run
[16:02] ijohnson: you have feedback
[16:02] ijohnson: should be super simple
[16:02] yay, thanks, looking now
[16:04] zyga: will do. it might be tomorrow.
[16:04] ack, thank you for the note
[16:13] mvo: thanks, I fixed the out-of-date comment / for loop, but I think it's still nice to have the more verbose/complete systemctl script
[16:14] ijohnson: that's fine
[16:14] ijohnson: keep it if you prefer it :)
[16:14] okay, cool, so with your and mborzecki's reviews am I good to merge?
[16:15] * ijohnson waits on the merge button
[16:15] ijohnson: yes
[16:15] oh well, I guess the tests have to pass too
[16:15] ijohnson: *cough*
[16:15] :-O
[16:15] ijohnson: :(
[16:15] they're not failing, they just haven't started running yet
[16:16] ijohnson: ok
[16:17] thanks mvo, I'll merge sometime in my afternoon then
[16:18] ijohnson: good luck!
[16:18] :-)
=== pstolowski is now known as pstolowski|afk
[17:25] anyone got any idea why `xdg-open` has stopped opening pdf files correctly for me (it opens them in an ebook viewer)? I think it might have something to do with the fact that `xdg-mime query filetype anyfile.pdf` doesn't print anything, but I'm not sure how that works.
[17:25] oops, wrong channel, sorry!
[17:44] PR snapcraft#2644 closed: Release changelog for 2.44
[20:14] PR snapcraft#2709 opened: incorporate content provider snaps in dependency resolution