[07:01] <pstolowski> morning
[07:44] <zyga-x240> hello
[07:44] <zyga-x240> mvo: around?
[07:44] <zyga-x240> pstolowski: hi
[07:45] <mvo> zyga-x240: sure
[07:45] <zyga-x240> mvo: please look at https://bugs.launchpad.net/snapd/+bug/1898934
[07:45] <zyga-x240> it looks worrying
[07:45] <zyga-x240> probably something very valuable for us to learn there
[07:45] <zyga-x240> we looked at state.json via debug command - no changes (empty collection)
[07:46] <zyga-x240> it seems like whatever happened made snapd inactive and the system dead/broken
[07:46] <mvo> zyga-x240: can you subscribe me? this seems private
[07:46] <zyga-x240> sure
[07:46] <zyga-x240> I made it private as there's state.json attached
[07:47] <mvo> zyga-x240: thanks, makes sense. the interesting part is why the snapd.failover.service did not DTRT :(
[07:47] <zyga-x240> we have access to that device, roger said he would help us look
[07:47] <mvo> zyga-x240: ohhh, maybe we have an overlooked failure mode here
[07:47] <zyga-x240> oh? :)
[07:48] <zyga-x240> do you already see what the problem is?
[07:48] <zyga-x240> we discussed this close to midnight so I was rather tired
[07:48] <mvo> zyga-x240: hold on, need to look at a core system
[07:48] <zyga-x240> sure
[07:48]  * zyga-x240 hugs mvo
[07:48] <mvo> zyga-x240: a big **maybe** for now :)
[07:50] <zyga-x240> hmm, my MBP cannot connect to IRC (on fallback ISP connection)
[07:54] <zyga-x240> mvo: I spoke with ken, there may be a way to ask apps to quit nicely, I'm looking at that now, no promises
[07:55] <mvo> zyga-x240: hm, no, it seems we are doing the right thing, we should not need the current symlink for failover
[07:56] <zyga-x240> mvo: aha
[07:56] <zyga-x240> mvo: at the time roger checked, snapd was not running, and IIRC there was no snapd.service
[07:56] <zyga-x240> my hunch is that whatever happened made snapd stay in inactive state
[07:56] <zyga-x240> as in the snap was disabled
[07:56] <zyga-x240> mid refresh or something
[07:56] <mvo> zyga-x240: snap debug state /tmp/state.json
[07:56] <mvo> ID   Status  Spawn  Ready  Label  Summary
[07:56] <zyga-x240> roger also said this was a power cut
[07:56] <zyga-x240> yeah, that is so weird!
[07:56] <mvo> zyga-x240: i.e. *nothing* in the state :(
[07:56] <zyga-x240> well
[07:56] <zyga-x240> not nothing
[07:56] <zyga-x240> did you look at snapstate?
[07:56] <zyga-x240> what's the state of snapd?
[07:57] <zyga-x240> mvo: so maybe this is a weird case of disk corruption
[07:57] <zyga-x240> mvo: one more bit of information: the boot partition needs fsck and I'm sure we're just not doing that ever in practice
[07:58] <zyga-x240> mvo: perhaps snapd got broken because boot partition is inconsistent with snapstate
[07:58] <zyga-x240> and we "amend" stuff in a broken way
[07:58] <zyga-x240> i.e. what would happen if the boot config was broken in some way
[07:58] <zyga-x240> and could not be saved
[07:58] <zyga-x240> no smoking gun, just possibilities
[07:58] <mvo> zyga-x240: sorry, I misspoke. not nothing but no tasks when there should be tasks like refreshes that happened in the last days
[07:58] <zyga-x240> mvo: yeah
[07:59] <zyga-x240> mvo: I really really wish we kept like state.1.json
[07:59] <zyga-x240> from last day/week/month
[07:59] <zyga-x240> for forensics
[08:00] <zyga-x240> how soon would we remove failed changes from state?
[08:00] <mvo> zyga-x240: yeah, something is missing here. I don't remember, need to look but it should be a bit longer, it seems the refresh happened 3 days ago
[08:01] <zyga-x240> 3 days ago and power cut was last night (based on journalctl data)
[08:01] <zyga-x240> fortunately this pi has persistency
[08:01] <mvo> zyga-x240: but "data" did not need a fsck?
[08:01] <zyga-x240> mvo: I don't know, but I didn't see such message in the log
[08:01] <zyga-x240> mvo: not sure if we have a fsck test
[08:01] <zyga-x240> we should really make this right for core20
[08:01] <zyga-x240> (as in, make sure there's a test that shows we can fsck)
[08:01] <zyga-x240> anyway
[08:02] <zyga-x240> unless this is really just data corruption
[08:02] <zyga-x240> we have some things to improve in failover perhaps
[08:02] <zyga-x240> perhaps the service that starts snapd
[08:02] <zyga-x240> should do one more thing:
[08:02] <zyga-x240> if there's no state.json -> seed
[08:02] <zyga-x240> if there's state.json but snapd is not good (no symlink, no service, etc) -> do something
[08:02] <zyga-x240> IIRC right now that's not handled
[08:02] <zyga-x240> if you have no service (roger did not) for snapd then we just stay dead
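The checks zyga lists above can be sketched as a small decision function. This is an illustration only, written in Python for brevity (snapd itself is Go). The names `SnapdHealth` and `decide_recovery` and the action strings are hypothetical, and nothing here describes how snapd's failover behaves today.

```python
from dataclasses import dataclass


@dataclass
class SnapdHealth:
    """Hypothetical snapshot of what a boot-time check could observe."""
    has_state_json: bool       # state.json exists on disk
    has_current_symlink: bool  # the snapd snap's "current" symlink resolves
    has_service_unit: bool     # a snapd.service unit is present


def decide_recovery(h: SnapdHealth) -> str:
    """Return a hypothetical action name for the proposed check."""
    if not h.has_state_json:
        # No state at all: treat the system as unseeded.
        return "seed"
    if not (h.has_current_symlink and h.has_service_unit):
        # State exists but snapd is unusable: the case that zyga
        # recalls is not handled right now.
        return "repair-snapd"
    return "start-snapd"
```

What "repair-snapd" would actually do is the open question in the discussion; the sketch only shows where that branch would sit.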
[08:06] <zyga-x240> mvo: maybe helpful: https://paste.ubuntu.com/p/7cTd6b8Yq4/
[08:07] <zyga-x240> that's so weird
[08:07] <zyga-x240> snapd is running there
[08:07] <zyga-x240> but /usr/bin/snap is a broken symlink!
[08:07] <zyga-x240> rogpeppe: hi
[08:07] <mvo> zyga-x240: hm, I think we need to look if we *always* have a service file for snapd on disk during the refresh
[08:07] <zyga-x240> rogpeppe: are you around?
[08:07] <zyga-x240> mvo: yeah
[08:07] <zyga-x240> I suspect we do not
[08:07] <zyga-x240> like with regular refresh
[08:07] <zyga-x240> unlink snap
[08:07] <zyga-x240> will remove snapd.service for sure
[08:08] <zyga-x240> unless there's a special case that I didn't see while poking at this recently
[08:08] <mvo> zyga-x240: but yeah, in this log snapd is alive and kicking
[08:08] <zyga-x240> right?
[08:09] <zyga-x240> but this is the log before the power cut IIRC
[08:09]  * zyga-x240 looks
[08:09] <zyga-x240> after the power cut that's gone
[08:09] <zyga-x240> no snapd.service
[08:09] <zyga-x240> no anything
[08:09] <zyga-x240> but it may hint at an earlier problem
[08:09] <zyga-x240> or maybe that's just fallback snapd?
[08:09]  * zyga-x240 doesn't know
[08:11] <zyga-x240> mvo: if you look at the attached --list-boots output
[08:11] <zyga-x240> mvo: then at the dates
[08:11] <zyga-x240> mvo: it seems to indicate that the journal log was from the -1 boot (not the current but the previous one)
[08:11] <zyga-x240> mvo: so it may also suggest that the real problem happened earlier
[08:12] <zyga-x240> and each subsequent boot corrupted the system some more
[08:12] <zyga-x240> I'd love to see logs from -2 and 0
[08:12] <zyga-x240> rogpeppe: ^^
[08:12] <zyga-x240> rogpeppe: could you attach output of:
[08:12] <zyga-x240> rogpeppe: journalctl -b -2
[08:12] <zyga-x240> rogpeppe: journalctl -b 0
[08:12] <zyga-x240> (separately)
[08:12] <zyga-x240> assuming you have not rebooted since, as then the indices shift
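A toy model of why the relative indices shift after a reboot. The function `resolve_boot` and the boot IDs are made up for illustration, and only offsets of 0 and below are modelled.

```python
def resolve_boot(boot_ids, offset):
    """Map a journalctl-style relative offset to a boot ID.

    boot_ids is ordered oldest first; offset 0 is the current boot,
    -1 the previous one, and so on.
    """
    if offset > 0:
        raise ValueError("this sketch only handles 0 and negative offsets")
    index = len(boot_ids) - 1 + offset
    if index < 0:
        raise IndexError("no such boot")
    return boot_ids[index]


boots = ["boot-a", "boot-b", "boot-c"]  # placeholder IDs, oldest first
before = resolve_boot(boots, -1)        # the boot before the current one
boots.append("boot-d")                  # the device reboots once
after = resolve_boot(boots, -1)         # same offset, different boot
```

`journalctl -b` also accepts a boot ID as printed by `--list-boots`, which stays stable across reboots.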
[08:12] <mvo> thanks for digging into this zyga-x240
[08:23] <zyga-x240> LMAO
[08:23] <zyga-x240> gnome shell has a DBus API for evaluating javascript
[08:23] <zyga-x240> so...
[08:23] <zyga-x240> the shell _is_ a calculator
[08:23] <zyga-x240> ... why why why
[08:30] <jamesh> zyga-x240: it wouldn't be evaluating JavaScript in a bare namespace: presumably it is in the same context as the rest of the UI is running
[08:30] <zyga> I'm very tempted to eval an infinite loop but perhaps another day
[08:30] <zyga> jamesh btw, I had a look at gnome-session-quit
[08:30] <zyga> hoping there's an API to ask one specific app to quit
[08:30] <zyga> but after going through gnome-session I don't think there is
[08:30] <zyga> it's all-or-nothing
[08:31] <zyga> do you know if there's some way to ask a specific app to quit that I've missed?
[08:31] <zyga> jamesh (I think that having eval as an API is a bit irresponsible)
[08:31] <zyga> but perhaps this is some debug leftover
[08:31] <jamesh> zyga: in X11, I'd probably try to map the process ID to a window and try to close the window
[08:32] <zyga> jamesh any pointers on how to do that?
[08:32] <zyga> I'm happy to write C
[08:32] <zyga> (as you know :)
[08:35] <jamesh> zyga: there is a _NET_WM_PID that most applications set.  It is not something you could rely on though
[08:35] <jamesh> i.e. it is controlled by the client (so could be forged), and may be absent
[08:35] <zyga> given that I'm not an X developer, how would I do the whole thing
[08:35] <zyga> 1) connect to X
[08:36] <zyga> 2) ...
[08:36] <zyga> browse through all the windows
[08:36] <zyga> find this property
[08:36] <zyga> etc
[08:36] <zyga> doesn't X enforce anything to prevent forgery?
[08:36] <jamesh> I don't remember the exact set of calls
[08:36] <zyga> how do I ask a window to close?
[08:36] <jamesh> an ICCCM message, I think (it's ages since I've looked at this)
[08:37] <jamesh> and Wayland is going to be different
[08:37] <zyga> what does it look like in wayland?
[08:37] <zyga> can it be done by non-shell?
[08:38] <jamesh> I don't think there is a standardised way to do it with Wayland: in general you don't want one application managing the windows of another app
[08:38] <jamesh> with that said, the Wayland compositor should be able to securely identify and close clients
[08:38] <zyga> yeah, I just wish there was a way for us to ask an app to close, politely
[08:38] <zyga> not to kill it
[08:39] <zyga> not to unmap it
[08:39] <zyga> ask it to close
[08:39] <jamesh> On the X11 side, you'd probably send the same message the window manager does when the user clicks the window's close button
[08:39] <jamesh> (one of the consequences of having out of process window frames)
[08:41] <zyga> I think that our goal of having a nice interaction button is mostly futile, at least for now
[08:41] <zyga> but I will try some more
[08:43] <jamesh> If you're okay with having snaps opt in to a nice behaviour, we could have a flag in snap.yaml saying that the app will do a clean shutdown on e.g. SIGHUP
[08:43] <jamesh> that's clearly not going to work everywhere though
[08:48] <zyga> jamesh I really want something that works for regular apps, the chrome and firefox and the random desktop app equally
[08:48] <zyga> if it's not nice, it's not worth having
[08:52] <jamesh> zyga: you probably want something that does a combination of xlsclients and xkill
[08:52] <zyga> xkill is not useful, I tried that, it just unmaps the window
[08:52] <zyga> chrome keeps running
[08:52] <zyga> running chrome again doesn't do anything as existing chrome tries to open a new window and fails
[08:52] <zyga> (or maybe opens a tab in the unmapped window)
[08:53] <zyga> usability wise it's not useful
[08:53] <jamesh> zyga: xkill calls XKillClient()
[08:53] <jamesh> so maybe that's not what you want either
[08:54] <jamesh> it isn't unmapping the window
[08:54] <jamesh> code is at https://gitlab.freedesktop.org/xorg/app/xlsclients and https://gitlab.freedesktop.org/xorg/app/xkill
[08:54] <zyga> thank you, I will look through those and play some more
[08:55] <zyga> maybe I made a mistake but I tried xkill with a browser and just got headless chrome running
[08:58] <jamesh> Most apps will exit if their X connection is closed.  Maybe Chrome is different
[08:58] <jamesh> but yeah: XKillClient is not going to allow a graceful exit
[09:00] <jamesh> https://www.x.org/releases/X11R7.6/doc/xorg-docs/specs/ICCCM/icccm.html is the spec for how traditional window management works.  The _NET_* properties are from the EWMH spec: https://specifications.freedesktop.org/wm-spec/wm-spec-1.3.html
[09:22] <jamesh> zyga: if you're confining yourself to EWMH compliant window managers, things are relatively simple for enumerating clients: https://paste.ubuntu.com/p/CsfbGTh6Jz/
[09:23] <jamesh> zyga: it also has a _NET_CLOSE_WINDOW message you can send to ask the window manager to close a window on your behalf: it was designed to support panel/pager apps as separate processes to the WM
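A rough sketch of the _NET_CLOSE_WINDOW message jamesh mentions, packed by hand as a 32-byte X11 ClientMessage event. It is not the code from the paste above. The function name `net_close_window_event` is made up, the atom value has to be interned on a live X connection and is therefore a parameter, and little-endian byte order is assumed.

```python
import struct

CLIENT_MESSAGE = 33      # core X11 event code for ClientMessage
SOURCE_APPLICATION = 1   # EWMH source indication: normal application
SOURCE_PAGER = 2         # EWMH source indication: pager / direct user action


def net_close_window_event(window, close_atom, timestamp=0,
                           source=SOURCE_PAGER):
    """Pack a ClientMessage asking the window manager to close `window`.

    close_atom is the server-assigned atom for "_NET_CLOSE_WINDOW".
    """
    return struct.pack(
        "<BBHII5I",
        CLIENT_MESSAGE,
        32,          # format: data is five 32-bit values
        0,           # sequence number, set by the server on delivery
        window,
        close_atom,
        timestamp,   # data.l[0]; 0 means CurrentTime
        source,      # data.l[1]
        0, 0, 0,     # remaining data.l[] elements are zero
    )
```

Per EWMH, the event is sent to the root window with SubstructureNotifyMask | SubstructureRedirectMask. Whether the application then exits cleanly is up to the window manager and the application.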
[09:30] <zyga-x240> jamesh: thank you, I will try that
[09:31] <zyga-x240> I think that's enough for what we need
[09:31] <zyga-x240> it's not perfect but I think it's close
[09:37] <dot-tobias> hi all
[09:37] <zyga-x240> hi
[09:47] <zyga-x240> sigh, ./get-deps hangs...
[09:47] <mvo> zyga-x240: nice, thanks for pushing on this !
[09:47] <mvo> zyga-x240: not that ./get-deps hangs of course .)
[09:47] <zyga-x240> haha
[09:48] <zyga-x240> oh well
[09:49] <zyga-x240> fatal: unable to access 'https://github.com/kardianos/govendor/': Could not resolve host: github.com
[09:50] <zyga-x240> wat?
[09:53] <mvo> zyga-x240: haha
[10:02] <jamesh> Just wait until we switch to modules.  Then you'll only need to worry about proxy.golang.org failing to resolve
[10:46]  * zyga goes to prep coffee for the calls
[11:39] <mvo> xnox: could we change https://meet.google.com/linkredirect?authuser=0&dest=https%3A%2F%2Fgithub.com%2Fsnapcore%2Fpi-gadget%2Fblob%2F20-arm64%2Fgadget.yaml%23L17 to ext4 (instead of vfat). ondra just raised this
[12:39] <ijohnson> mvo: I don't think we can change ubuntu-boot to ext4 because the uboot bootloader needs to read boot.sel from that partition, and last we talked about it, uboot _can_ read ext4 but it takes __AGES__ like minutes
[12:40] <ijohnson> so if you're okay with multi-minute bootloader time then yes we could probably change it but I don't think we want to do that
[12:41] <mvo> ijohnson: ohhh, maybe we should add a comment then, that's a bit unfortunate
[12:41] <zyga> would it be better if ubuntu-boot was ext4 but without some features
[12:41] <ijohnson> mvo: yeah it is unfortunate
[12:41] <mvo> ijohnson: but yeah, let's not do that :)
[12:41] <zyga> perhaps the modern set of feature flags makes it slow
[12:41] <zyga> and a more plain ext3 like set would be better
[12:42] <ijohnson> zyga: I think the issue is more uboot
[12:42] <zyga> are you saying that uboot ext* implementation is just unusable?
[12:42] <ijohnson> yes that's what I remember
[12:42] <zyga> hmmm
[12:43] <ijohnson> maybe I'm wrong, would need to confirm with Dave to be sure
[12:45] <mvo> I see some references that https://github.com/u-boot/u-boot/commit/d5aee659f217746395ff58adf3a863627ff02ec1 makes this fast
[12:45] <mvo> but I have no idea what I'm talking about
[13:07] <zyga> pstolowski I subscribed you to https://bugs.launchpad.net/snapd/+bug/1898934 - have a look if the missing undo handler for unlink could be related
[13:07] <zyga> there's state.json there
[13:07] <zyga> though there's no changes there _at all_
[13:08] <zyga> but there's snap state
[13:08] <pstolowski> zyga: mhm, ok thanks
[13:08] <zyga> thank you!
[13:08] <pstolowski> zyga: ah, it's roger
[13:09] <zyga> rogpeppe: if you are around and could reply to pawel's and mvo's questions in the bug, that would help
[13:21] <zyga> pstolowski the finding about timings is brilliant
[13:22] <zyga> mvo we recycle timing data  at a different rate so when we have lost changes because they failed and got collected, we can see their shadow in timing data
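The "shadow" idea can be sketched as a set difference: change IDs that still appear in timing data but no longer exist as changes. The layout assumed below (`state["changes"]` keyed by change ID, `state["data"]["timings"]` as a list of entries with a `tags` map) is a guess and has not been checked against snapd's source. The function name and the sample data are made up.

```python
def orphaned_change_ids(state):
    """Return change IDs seen in timing data but absent from "changes".

    Assumed layout: state["changes"] maps change ID to change, and
    state["data"]["timings"] is a list of entries whose "tags" may
    carry a "change-id".
    """
    known = set(state.get("changes") or {})
    seen = set()
    for entry in (state.get("data") or {}).get("timings") or []:
        change_id = (entry.get("tags") or {}).get("change-id")
        if change_id is not None:
            seen.add(change_id)
    return sorted(seen - known)


# Made-up sample: no changes left, but timings still mention change "12".
sample = {
    "changes": {},
    "data": {"timings": [
        {"tags": {"change-id": "12", "task-id": "340"}},
        {"tags": {"change-id": "12"}},
        {"tags": {}},
    ]},
}
```

On a real file this would be called on the result of `json.load` over state.json. An empty result does not prove that nothing was lost, since timing entries get recycled too.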
[13:26] <pstolowski> i've commented
[13:27] <pstolowski> jeez, typos there, i wish LP allowed to edit comments
[13:29] <mvo> zyga: ta
[13:29] <xnox> mvo:  no we cannot.
[13:29] <xnox> mvo: pi gpu bootloader cannot read ext4, nor read dtbs off there.
[13:29] <mvo> xnox: that's sad, thanks for letting me know
[13:30] <xnox> mvo: we can discuss things. but probably better with waveform. We might be able to do a different gadget/model where boot is ext4.
[13:31] <mvo> xnox: it's not super urgent, if there are good reasons that's okay for me, it was an action item from a meeting with field/ondra to discuss this, they had concerns
[13:31] <xnox> mvo: i think there is scope to have "ubootish gadget" which could have boot as ext4; and "pibootish gadget" which will not have uboot, and must have ubuntu-boot as vfat.
[13:31] <xnox> mvo:  but they want it for uboot, or on pi?
[13:31] <mvo> xnox: the concern was robustness
[13:32] <xnox> (i think there is field things on both now, hence the two are no longer the same)
[13:32] <xnox> mvo:  when the underlying mmc is crap, it's not going to be robust either way! =)
[13:32] <xnox> mvo:  also i think uboot's ext implementation is not robust either. so....
[13:33] <mvo> xnox: right, a comment maybe in the gadget why we made this decision for the next person that wonders. it's fine for me, you guys own it
[13:36] <pstolowski> zyga: we just need to take timings with a grain of salt; we may miss them for tasks that errored out before Save()
[13:44] <ijohnson> mvo: xnox: the ubuntu-boot partition does not need to be read by the pi bootloader though ?
[13:44] <ijohnson> because we first load u-boot from ubuntu-seed, then uboot will read the boot.sel file from ubuntu-boot, then load the kernel from ubuntu-boot and then boot into the kernel iirc
[13:45] <zyga> xnox: FYI: https://github.com/snapcore/snapd/pull/9482 needs changes
[13:46] <xnox> zyga:  ack
[13:47] <zyga> amurray, jdstrand: low hanging fruit that needs a security review: https://github.com/snapcore/snapd/pull/9449
[14:09] <jdstrand> zyga-x240: ack
[14:14] <pstolowski> zyga-x240, mvo : so timings from rogpeppe 's state show some interesting stuff, there is snapd refresh on 2020-09-07, and there is undo for that change ("change-id": "37")
[14:14] <zyga> pstolowski: do you think this is related to missing undo handler for unlink?
[14:15] <pstolowski> zyga-x240, mvo i wonder if we have an issue somewhere where an unplanned reboot at the wrong moment while this is happening (before undo restores the previous snapd) leaves us with no active snapd
[14:16] <zyga> mmm
[14:16] <zyga> it was a power loss
[14:16] <zyga> perhaps very unlucky one
[14:16] <zyga> this is very important for reliability at scale, apart from this issue, the failover logic did not help
[14:16] <zyga> we should identify both issues
[14:16] <pstolowski> zyga: no, i don't think it's missing undo for unlink; as i said this other problem only affects snap remove and snap disable
[14:16] <zyga> I see
[14:17] <pstolowski> but i think perf timings are useful to understand what happened, combined with timestamps of reboot
[14:18] <zyga> I'll grab lunch
[14:26] <cachio> zyga, hey, I pushed a change to #9414
[14:26] <cachio> I renamed most of the names for the nested tool
[14:27] <cachio> could you please take a look to see if new names make sense?
[14:44] <zyga> cachio sure, not immediately though
[14:52] <zyga-x240> back from lunch and back to backend bits
[14:56]  * zyga cancelled PT today, unsafe to go out
[14:57] <zyga> will focus on finishing stuff
[15:29]  * cachio lunch
[15:57] <pstolowski> zyga, mvo i've added some observations to Roger's bug
[15:58] <mvo> pstolowski: \o/ thanks for this
[15:58] <pstolowski> mvo: does it make sense?
[15:58] <mvo> pstolowski: in a meeting so I don't know (yet) :/
[15:58] <pstolowski> ah, sure, no worries
[16:19] <zyga-x240> mvo: I was wondering if we don't refresh things on separate lanes
[16:19] <zyga-x240> perhaps we don't, I don't recall
[16:35] <zyga-x240> hmm
[16:35] <zyga-x240> something beeped
[16:36]  * zyga-x240 looks for a window with some app
[16:38] <zyga-x240> mattermost :)
[16:51] <zyga-x240> ./snapmgr.go:412:13: too many errors
[16:51] <zyga-x240> well
[16:51] <zyga-x240> little by little
[17:13]  * zyga-x240 is stuck
[17:24] <zyga-x240> pedronis_: if you have a moment, I'd like to ask about one problem tomorrow
[17:25] <zyga-x240> pedronis_: requiring backend for soft refresh check is difficult, as it's called from doInstall which normally doesn't have access to the backend
[17:56] <pedronis> zyga: mmh, let's chat tomorrow morning if possible
[21:32]  * ijohnson EODs