[07:01] morning
[07:44] hello
[07:44] mvo: around?
[07:44] pstolowski: hi
[07:45] zyga-x240: sure
[07:45] mvo: please look at https://bugs.launchpad.net/snapd/+bug/1898934
[07:45] it looks worrying
[07:45] probably something very valuable for us to learn there
[07:45] we looked at state.json via debug command - no changes (empty collection)
[07:46] it seems like whatever happened made snapd inactive and the system dead/broken
[07:46] zyga-x240: can you subscribe me? this seems private
[07:46] sure
[07:46] I made it private as there's state.json attached
[07:47] zyga-x240: thanks, makes sense. the interesting part is why the snapd.failover.service did not DTRT :(
[07:47] we have access to that device, roger said he would help us look
[07:47] zyga-x240: ohhh, maybe we have an overlooked failure mode here
[07:47] oh? :)
[07:48] do you already see what the problem is?
[07:48] we discussed this close to midnight so I was rather tired
[07:48] zyga-x240: hold on, need to look at a core system
[07:48] sure
[07:48] * zyga-x240 hugs mvo
[07:48] zyga-x240: a big **maybe** for now :)
[07:50] hmm, my MBP cannot connect to IRC (on fallback ISP connection)
[07:54] mvo: I spoke with ken, there may be a way to ask apps to quit nicely, I'm looking at that now, no promises
[07:55] zyga-x240: hm, no, it seems we are doing the right thing, we should not need the current symlink for failover
[07:56] mvo: aha
[07:56] mvo: at the time roger checked, snapd was not running, and IIRC there was no snapd.service
[07:56] my hunch is that whatever happened made snapd stay in inactive state
[07:56] as in the snap was disabled
[07:56] mid refresh or something
[07:56] zyga-x240: snap debug state /tmp/state.json
[07:56] ID Status Spawn Ready Label Summary
[07:56] roger also said this was a power cut
[07:56] yeah that is so weird!
[07:56] zyga-x240: i.e. *nothing* in the state :(
[07:56] well
[07:56] not nothing
[07:56] did you look at snapstate?
[07:56] what's the state of snapd?
[07:57] mvo: so maybe this is a weird case of disk corruption
[07:57] mvo: one more bit of information: the boot partition needs fsck and I'm sure we're just not doing that ever in practice
[07:58] mvo: perhaps snapd got broken because boot partition is inconsistent with snapstate
[07:58] and we "amend" stuff in a broken way
[07:58] i.e. what would happen if the boot config was broken in some way
[07:58] and could not be saved
[07:58] no smoking gun, just possibilities
[07:58] zyga-x240: sorry, I misspoke. not nothing but no tasks when there should be tasks like refreshes that happened in the last days
[07:58] mvo: yeah
[07:59] mvo: I really really wish we kept like state.1.json
[07:59] from last day/week/month
[07:59] for forensics
[08:00] how soon would we remove failed changes from state?
[08:00] zyga-x240: yeah, something is missing here. I don't remember, need to look but it should be a bit longer, it seems the refresh happened 3 days ago
[08:01] 3 days ago and power cut was last night (based on journalctl data)
[08:01] fortunately this pi has persistency
[08:01] zyga-x240: but "data" did not need a fsck?
[08:01] mvo: I don't know, but I didn't see such message in the log
[08:01] mvo: not sure if we have a fsck test
[08:01] we should really make this right for core20
[08:01] (as in, make sure there's a test that shows we can fsck)
[08:01] anyway
[08:02] unless this is really just data corruption
[08:02] we have some things to improve in failover perhaps
[08:02] perhaps the service that starts snapd
[08:02] should do one more thing:
[08:02] if there's no state.json -> seed
[08:02] if there's state.json but snapd is not good (no symlink, no service, etc) -> do something
[08:02] IIRC right now that's not handled
[08:02] if you have no service (roger did not) for snapd then we just stay dead
[08:06] mvo: maybe helpful: https://paste.ubuntu.com/p/7cTd6b8Yq4/
[08:07] that's so weird
[08:07] snapd is running there
[08:07] but /usr/bin/snap is a broken symlink!
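The startup check proposed at [08:02] can be written down as a small decision function. This is only a sketch of the idea as stated in the chat, not how snapd or its failover service actually behaves. The names `Action`, `choose_action`, and the three boolean inputs are hypothetical, and `REPAIR` stands in for the unspecified "do something" step.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of the startup check (names are assumptions)."""
    SEED = "seed"      # no state.json: treat the system as unseeded
    REPAIR = "repair"  # placeholder for the chat's unspecified "do something"
    START = "start"    # state and snapd both look usable


def choose_action(has_state: bool, has_service: bool, has_symlink: bool) -> Action:
    """Pick an action from three observations about the system.

    has_state   -- state.json exists
    has_service -- a snapd service unit is present on disk
    has_symlink -- the snapd symlink exists and is not broken
    """
    if not has_state:
        return Action.SEED
    # State exists but snapd itself is missing pieces: the case the
    # chat says is currently not handled.
    if not (has_service and has_symlink):
        return Action.REPAIR
    return Action.START
```

In the situation described for roger's device (state.json present, no snapd.service), this sketch would return `Action.REPAIR` rather than leaving the system with nothing to start.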
[08:07] rogpeppe: hi
[08:07] zyga-x240: hm, I think we need to look if we *always* have a service file for snapd on disk during the refresh
[08:07] rogpeppe: are you around?
[08:07] mvo: yeah
[08:07] I suspect we do not
[08:07] like with regular refresh
[08:07] unlink snap
[08:07] will remove snapd.service for sure
[08:08] unless there's a special case that I didn't see while poking at this recently
[08:08] zyga-x240: but yeah, in this log snapd is alive and kicking
[08:08] right?
[08:09] but this is the log before the power cut IIRC
[08:09] * zyga-x240 looks
[08:09] after the power cut that's gone
[08:09] no snapd.service
[08:09] no anything
[08:09] but it may hint at an earlier problem
[08:09] or maybe that's just fallback snapd?
[08:09] * zyga-x240 doesn't know
[08:11] mvo: if you look at the attached --list-boots output
[08:11] mvo: then at the dates
[08:11] mvo: it seems to indicate that the journal log was from the -1 boot (not the current but the previous one)
[08:11] mvo: so it may also suggest that the real problem happened earlier
[08:12] and each subsequent boot corrupted the system some more
[08:12] I'd love to see logs from -2 and 0
[08:12] rogpeppe: ^^
[08:12] rogpeppe: could you attach output of:
[08:12] rogpeppe: journalctl -b -2
[08:12] rogpeppe: journalctl -b 0
[08:12] (separately)
[08:12] assuming you have not rebooted since, as then the indices shift
[08:12] thanks for digging into this zyga-x240
[08:23] LMAO
[08:23] gnome shell has a DBus API for evaluating javascript
[08:23] so...
[08:23] the shell _is_ a calculator
[08:23] ... why why why
[08:30] zyga-x240: it wouldn't be evaluating JavaScript in a bare namespace: presumably it is in the same context as the rest of the UI is running
[08:30] I'm very tempted to eval an infinite loop but perhaps another day
[08:30] jamesh btw, I had a look at gnome-session-quit
[08:30] hoping there's an API to ask one specific app to quit
[08:30] but after going through gnome-session I don't think there is
[08:30] it's all-or-nothing
[08:31] do you know if there's some way to ask a specific app to quit that I've missed?
[08:31] jamesh (I think that having eval as an API is a bit irresponsible)
[08:31] but perhaps this is some debug leftover
[08:31] zyga: in X11, I'd probably try to map the process ID to a window and try to close the window
[08:32] jamesh any pointers on how to do that?
[08:32] I'm happy to write C
[08:32] (as you know :)
[08:35] zyga: there is a _NET_WM_PID that most applications set. It is not something you could rely on though
[08:35] i.e. it is controlled by the client (so could be forged), and may be absent
[08:35] given that I'm not an X developer, how would I do the whole thing
[08:35] 1) connect to X
[08:36] 2) ...
[08:36] browse through all the windows
[08:36] find this property
[08:36] etc
[08:36] doesn't X enforce anything to prevent forgery?
[08:36] I don't remember the exact set of calls
[08:36] how do I ask a window to close?
[08:36] an ICCCM message, I think (it's ages since I've looked at this)
[08:37] and Wayland is going to be different
[08:37] what does it look like in wayland?
[08:37] can it be done by non-shell?
[08:38] I don't think there is a standardised way to do it with Wayland: in general you don't want one application managing the windows of another app
[08:38] with that said, the Wayland compositor should be able to securely identify and close clients
[08:38] yeah, I just wish there was a way for us to ask an app to close, politely
[08:38] not to kill it
[08:39] not to unmap it
[08:39] ask it to close
[08:39] On the X11 side, you'd probably send the same message the window manager does when the user clicks the window's close button
[08:39] (one of the consequences of having out of process window frames)
[08:41] I think that our goal of having a nice interaction button is mostly futile, at least for now
[08:41] but I will try some more
[08:43] If you're okay with having snaps opt in to a nice behaviour, we could have a flag in snap.yaml saying that the app will do a clean shutdown on e.g. SIGHUP
[08:43] that's clearly not going to work everywhere though
[08:48] jamesh I really want something that works for regular apps, the chrome and firefox and the random desktop app equally
[08:48] if it's not nice, it's not worth having
[08:52] zyga: you probably want something that does a combination of xlsclients and xkill
[08:52] xkill is not useful, I tried that, it just unmaps the window
[08:52] chrome keeps running
[08:52] running chrome again doesn't do anything as existing chrome tries to open a new window and fails
[08:52] (or maybe opens a tab in the unmapped window)
[08:53] usability wise it's not usefil
[08:53] *useful
[08:53] zyga: xkill calls XKillClient()
[08:53] so maybe that's not what you want either
[08:54] it isn't unmapping the window
[08:54] code is at https://gitlab.freedesktop.org/xorg/app/xlsclients and https://gitlab.freedesktop.org/xorg/app/xkill
[08:54] thank you, I will look through those and play some more
[08:55] maybe I made a mistake but I tried xkill with a browser and just got headless chrome running
[08:58] Most apps will exit if their X connection is closed. Maybe Chrome is different
[08:58] but yeah: XKillClient is not going to allow a graceful exit
[09:00] https://www.x.org/releases/X11R7.6/doc/xorg-docs/specs/ICCCM/icccm.html is the spec for how traditional window management works. The _NET_* properties are from the EWMH spec: https://specifications.freedesktop.org/wm-spec/wm-spec-1.3.html
[09:22] zyga: if you're confining yourself to EWMH compliant window managers, things are relatively simple for enumerating clients: https://paste.ubuntu.com/p/CsfbGTh6Jz/
[09:23] zyga: it also has a _NET_CLOSE_WINDOW message you can send to ask the window manager to close a window on your behalf: it was designed to support panel/pager apps as separate processes to the WM
[09:30] jamesh: thank you, I will try that
[09:31] I think that's enough for what we need
[09:31] it's not perfect but I think it's close
[09:37] hi all
[09:37] hi
[09:47] sigh, ./get-deps hangs...
[09:47] zyga-x240: nice, thanks for pushing on this !
[09:47] zyga-x240: not that ./get-deps hangs of course .)
[09:47] haha
[09:48] oh well
[09:49] fatal: unable to access 'https://github.com/kardianos/govendor/': Could not resolve host: github.com
[09:50] wat?
[09:53] zyga-x240: haha
[10:02] Just wait until we switch to modules. Then you'll only need to worry about proxy.golang.org failing to resolve
[10:46] * zyga goes to prep coffee for the calls
[11:39] xnox: could we change https://meet.google.com/linkredirect?authuser=0&dest=https%3A%2F%2Fgithub.com%2Fsnapcore%2Fpi-gadget%2Fblob%2F20-arm64%2Fgadget.yaml%23L17 to ext4 (instead of vfat). ondra just raised this
[12:39] mvo: I don't think we can change ubuntu-boot to ext4 because the uboot bootloader needs to read boot.sel from that partition, and last we talked about it, uboot _can_ read ext4 but it takes __AGES__ like minutes
[12:40] so if you're okay with multi-minute bootloader time then yes we could probably change it but I don't think we want to do that
[12:41] ijohnson: ohhh, maybe we should add a comment then, that's a bit unfortunate
[12:41] would it be better if ubuntu-boot was ext4 but without some features
[12:41] mvo: yeah it is unfortunate
[12:41] ijohnson: but yeah, let's not do that :)
[12:41] perhaps the modern set of feature flags makes it slow
[12:41] and a more plain ext3 like set would be better
[12:42] zyga: I think the issue is more uboot
[12:42] are you saying that uboot ext* implementation is just unusable?
[12:42] yes that's what I remember
[12:42] hmmm
[12:43] maybe I'm wrong, would need to confirm with Dave to be sure
[12:45] I see some references that https://github.com/u-boot/u-boot/commit/d5aee659f217746395ff58adf3a863627ff02ec1 makes this fast
[12:45] but I have no idea what I'm talking about
[13:07] pstolowski I subscribed you to https://bugs.launchpad.net/snapd/+bug/1898934 - have a look if the missing undo handler for unlink could be related
[13:07] there's state.json there
[13:07] though there's no changes there _at all_
[13:08] but there's snap state
[13:08] zyga: mhm, ok thanks
[13:08] thank you!
[13:08] zyga: ah, it's roger
[13:09] rogpeppe: if you are around and could reply to pawel's and mvo's questions in the bug, that would help
[13:21] pstolowski the finding about timings is brilliant
[13:22] mvo we recycle timing data at a different rate so when we have lost changes because they failed and got collected, we can see their shadow in timing data
[13:26] i'ev commented
[13:26] *i've
[13:27] jeez, typos there, i wish LP allowed to edit comments
[13:29] zyga: ta
[13:29] mvo: no we cannot.
[13:29] mvo: pi gpu bootloader cannot read ext4, nor read dtbs off there.
[13:29] xnox: that's sad, thanks for letting me know
[13:30] mvo: we can discuss things. but probably better with waveform. We might be able to do a different gadget/model where boot is ext4.
[13:31] xnox: it's not super urgent, if there are good reasons that's okay for me, it was an action item from a meeting with field/ondra to discuss this, they had concerns
[13:31] mvo: i think there is scope to have "ubootish gadget" which could have boot as ext4; and "pibootish gadget" which will not have uboot, and must have ubuntu-boot as vfat.
[13:31] mvo: but they want it for uboot, or on pi?
[13:31] xnox: the concern was robustness
[13:32] (i think there is field things on both now, hence the two are no longer the same)
[13:32] mvo: when the underlying mmc is crap, it's not going to be robust either way! =)
[13:32] mvo: also i think uboot's ext implementation is not robust either. so....
[13:33] xnox: right, a comment maybe in the gadget why we made this decision for the next person that wonders. it's fine for me, you guys own it
[13:36] zyga: we just need to take timings with a grain of salt; we may miss them for tasks that errored out before Save()
[13:44] mvo: xnox: the ubuntu-boot partition does not need to be read by the pi bootloader though?
[13:44] because we first load u-boot from ubuntu-seed, then uboot will read the boot.sel file from ubuntu-boot, then load the kernel from ubuntu-boot and then boot into the kernel iirc
[13:45] xnox: FYI: https://github.com/snapcore/snapd/pull/9482 needs changes
[13:46] zyga: ack
[13:47] amurray, jdstrand: low hanging fruit that needs a security review: https://github.com/snapcore/snapd/pull/9449
[14:09] zyga-x240: ack
[14:14] zyga-x240, mvo : so timings from rogpeppe 's state show some interesting stuff, there is snapd refresh on 2020-09-07, and there is undo for that change ("change-id": "37")
[14:14] pstolowski: do you think this is related to missing undo handler for unlink?
[14:15] zyga-x240, mvo i wonder if we don't have an issue somewhere where an unplanned reboot in a wrong moment when this is happening (before undo restores previous snapd) leaves us with no active snapd
[14:16] mmm
[14:16] it was a power loss
[14:16] perhaps very unlucky one
[14:16] this is very important for reliability at scale, apart from this issue, the failover logic did not help
[14:16] we should identify both issues
[14:16] zyga: no, i don't think it's missing undo for unlink; as i said this other problem only affects snap remove and snap disable
[14:16] I see
[14:17] but i think perf timings are useful to understand what happened, combined with timestamps of reboot
[14:18] I'll grab lunch
[14:26] zyga, hey, I pushed a change to #9414
[14:26] I renamed most of the names for the nested tool
[14:27] could you please take a look to see if new names make sense?
[14:44] cachio sure, not immediately though
[14:52] back from lunch and back to backend bits
[14:56] * zyga cancelled PT today, unsafe to go out
[14:57] will focus on finishing stuff
[15:29] * cachio lunch
[15:57] zyga, mvo i've added some observations to Roger's bug
[15:58] pstolowski: \o/ thanks for this
[15:58] mvo: does it make sense?
[15:58] pstolowski: in a meeting so I don't know (yet) :/
[15:58] ah, sure, no worries
[16:19] mvo: I was wondering if we don't refresh things on separate lanes
[16:19] perhaps we don't, I don't recall
[16:35] hmm
[16:35] something beeped
[16:36] * zyga-x240 looks for a window with some app
[16:38] mattermost :)
[16:51] ./snapmgr.go:412:13: too many errors
[16:51] well
[16:51] little by little
[17:13] * zyga-x240 is stuck
[17:24] pedronis_: if you have a moment, I'd like to ask about one problem tomorrow
[17:25] pedronis_: requiring backend for soft refresh check is difficult, as it's called from doInstall which normally doesn't have access to the backend
=== pedronis_ is now known as pedronis
[17:56] zyga: mmh, let's chat tomorrow morning if possible
[21:32] * ijohnson EODs