pstolowski | morning | 07:01 |
---|---|---|
zyga-x240 | hello | 07:44 |
zyga-x240 | mvo: around? | 07:44 |
zyga-x240 | pstolowski: hi | 07:44 |
mvo | zyga-x240: sure | 07:45 |
zyga-x240 | mvo: please look at https://bugs.launchpad.net/snapd/+bug/1898934 | 07:45 |
zyga-x240 | it looks worrying | 07:45 |
zyga-x240 | probably something very valuable for us to learn there | 07:45 |
zyga-x240 | we looked at state.json via debug command - no changes (empty collection) | 07:45 |
zyga-x240 | it seems like whatever happened made snapd inactive and the system dead/broken | 07:46 |
mvo | zyga-x240: can you subscribe me? this seems private | 07:46 |
zyga-x240 | sure | 07:46 |
zyga-x240 | I made it private as there's state.json attached | 07:46 |
mvo | zyga-x240: thanks, makes sense. the interesting part is why the snapd.failover.service did not DTRT :( | 07:47 |
zyga-x240 | we have access to that device, roger said he would help us look | 07:47 |
mvo | zyga-x240: ohhh, maybe we have an overlooked failure mode here | 07:47 |
zyga-x240 | oh? :) | 07:47 |
zyga-x240 | do you already see what the problem is? | 07:48 |
zyga-x240 | we discussed this close to midnight so I was rather tired | 07:48 |
mvo | zyga-x240: hold on, need to look at a core system | 07:48 |
zyga-x240 | sure | 07:48 |
* zyga-x240 hugs mvo | 07:48 | |
mvo | zyga-x240: a big **maybe** for now :) | 07:48 |
zyga-x240 | hmm, my MBP cannot connect to IRC (on fallback ISP connection) | 07:50 |
zyga-x240 | mvo: I spoke with ken, there may be a way to ask apps to quit nicely, I'm looking at that now, no promises | 07:54 |
mvo | zyga-x240: hm, no, it seems we are doing the right thing, we should not need the current symlink for failover | 07:55 |
zyga-x240 | mvo: aha | 07:56 |
zyga-x240 | mvo: at the time roger checked, snapd was not running, and IIRC there was no snapd.service | 07:56 |
zyga-x240 | my hunch is that whatever happened made snapd stay in inactive state | 07:56 |
zyga-x240 | as in the snap was disabled | 07:56 |
zyga-x240 | mid refresh or something | 07:56 |
mvo | zyga-x240: snap debug state /tmp/state.json | 07:56 |
mvo | ID Status Spawn Ready Label Summary | 07:56 |
zyga-x240 | roger also said this was a power cut | 07:56 |
zyga-x240 | yeah that is so weird! | 07:56 |
mvo | zyga-x240: i.e. *nothing* in the state :( | 07:56 |
zyga-x240 | well | 07:56 |
zyga-x240 | not nothing | 07:56 |
zyga-x240 | did you look at snapstate? | 07:56 |
zyga-x240 | what's the state of snapd? | 07:56 |
zyga-x240 | mvo: so maybe this is a weird case of disk corruption | 07:57 |
zyga-x240 | mvo: one more bit of information: the boot partition needs fsck and I'm sure we're just not doing that ever in practice | 07:57 |
zyga-x240 | mvo: perhaps snapd got broken because boot partition is inconsistent with snapstate | 07:58 |
zyga-x240 | and we "amend" stuff in a broken way | 07:58 |
zyga-x240 | i.e. what would happen if the boot config was broken in some way | 07:58 |
zyga-x240 | and could not be saved | 07:58 |
zyga-x240 | no smoking gun, just possibilities | 07:58 |
mvo | zyga-x240: sorry, I misspoke. not nothing but no tasks when there should be tasks like refreshes that happened in the last days | 07:58 |
zyga-x240 | mvo: yeah | 07:58 |
zyga-x240 | mvo: I really really wish we kept like state.1.json | 07:59 |
zyga-x240 | from last day/week/month | 07:59 |
zyga-x240 | for forensics | 07:59 |
zyga-x240 | how soon would we remove failed changes from state? | 08:00 |
mvo | zyga-x240: yeah, something is missing here. I don't remember, need to look but it should be a bit longer, it seems the refresh happened 3 days ago | 08:00 |
zyga-x240 | 3 days ago and power cut was last night (based on journalctl data) | 08:01 |
zyga-x240 | fortunately this pi has persistency | 08:01 |
mvo | zyga-x240: but "data" did not need a fsck? | 08:01 |
zyga-x240 | mvo: I don't know, but I didn't see such message in the log | 08:01 |
zyga-x240 | mvo: not sure if we have a fsck test | 08:01 |
zyga-x240 | we should really make this right for core20 | 08:01 |
zyga-x240 | (as in, make sure there's a test that shows we can fsck) | 08:01 |
zyga-x240 | anyway | 08:01 |
zyga-x240 | unless this is really just data corruption | 08:02 |
zyga-x240 | we have some things to improve in failover perhaps | 08:02 |
zyga-x240 | perhaps the service that starts snapd | 08:02 |
zyga-x240 | should do one more thing: | 08:02 |
zyga-x240 | if there's no state.json -> seed | 08:02 |
zyga-x240 | if there's state json but snapd is not good (no symlink, no service, etc) -> do something | 08:02 |
zyga-x240 | IIRC right now that's not handled | 08:02 |
zyga-x240 | if you have no service (roger did not) for snapd then we just stay dead | 08:02 |
zyga-x240 | mvo: maybe helpful: https://paste.ubuntu.com/p/7cTd6b8Yq4/ | 08:06 |
zyga-x240 | that's so weird | 08:07 |
zyga-x240 | snapd is running there | 08:07 |
zyga-x240 | but /usr/bin/snap is a broken symlink! | 08:07 |
zyga-x240 | rogpeppe: hi | 08:07 |
mvo | zyga-x240: hm, I think we need to look if we *always* have a service file for snapd on disk during the refresh | 08:07 |
zyga-x240 | rogpeppe: are you around? | 08:07 |
zyga-x240 | mvo: yeah | 08:07 |
zyga-x240 | I suspect we do not | 08:07 |
zyga-x240 | like with regular refresh | 08:07 |
zyga-x240 | unlink snap | 08:07 |
zyga-x240 | will remove snapd.service for sure | 08:07 |
zyga-x240 | unless there's a special case that I didn't see while poking at this recently | 08:08 |
mvo | zyga-x240: but yeah, in this log snapd is alive and kicking | 08:08 |
zyga-x240 | right? | 08:08 |
zyga-x240 | but this is the log before the power cut IIRC | 08:09 |
* zyga-x240 looks | 08:09 | |
zyga-x240 | after the power cut that's gone | 08:09 |
zyga-x240 | no snapd.service | 08:09 |
zyga-x240 | no anything | 08:09 |
zyga-x240 | but it may hint at an earlier problem | 08:09 |
zyga-x240 | or maybe that's just fallback snapd? | 08:09 |
* zyga-x240 doesn't know | 08:09 | |
zyga-x240 | mvo: if you look at the attached --list-boots output | 08:11 |
zyga-x240 | mvo: then at the dates | 08:11 |
zyga-x240 | mvo: it seems to indicate that the journal log was from the -1 boot (not the current but the previous one) | 08:11 |
zyga-x240 | mvo: so it may also suggest that the real problem happened earlier | 08:11 |
zyga-x240 | and each subsequent boot corrupted the system some more | 08:12 |
zyga-x240 | I'd love to see logs from -2 and 0 | 08:12 |
zyga-x240 | rogpeppe: ^^ | 08:12 |
zyga-x240 | rogpeppe: could you attach output of: | 08:12 |
zyga-x240 | rogpeppe: journalctl -b -2 | 08:12 |
zyga-x240 | rogpeppe: journalctl -b 0 | 08:12 |
zyga-x240 | (separately) | 08:12 |
zyga-x240 | assuming you have not rebooted since, as then the indices shift | 08:12 |
mvo | thanks for digging into this zyga-x240 | 08:12 |
zyga-x240 | LMAO | 08:23 |
zyga-x240 | gnome shell has a DBus API for evaluating javascript | 08:23 |
zyga-x240 | so... | 08:23 |
zyga-x240 | the shell _is_ a calculator | 08:23 |
zyga-x240 | ... why why why | 08:23 |
jamesh | zyga-x240: it wouldn't be evaluating JavaScript in a bare namespace: presumably it is in the same context as the rest of the UI is running | 08:30 |
zyga | I'm very tempted to eval an infinite loop but perhaps another day | 08:30 |
zyga | jamesh btw, I had a look at gnome-session-quit | 08:30 |
zyga | hoping there's an API to ask one specific app to quit | 08:30 |
zyga | but after going through gnome-session I don't think there is | 08:30 |
zyga | it's all-or-nothing | 08:30 |
zyga | do you know if there's some way to ask a specific app to quit that I've missed? | 08:31 |
zyga | jamesh (I think that having eval as an API is a bit irresponsible) | 08:31 |
zyga | but perhaps this is some debug leftover | 08:31 |
jamesh | zyga: in X11, I'd probably try to map the process ID to a window and try to close the window | 08:31 |
zyga | jamesh any pointers on how to do that? | 08:32 |
zyga | I'm happy to write C | 08:32 |
zyga | (as you know :) | 08:32 |
jamesh | zyga: there is a _NET_WM_PID that most applications set. It is not something you could rely on though | 08:35 |
jamesh | i.e. it is controlled by the client (so could be forged), and may be absent | 08:35 |
zyga | given that I'm not an X developer, how would I do the whole thing | 08:35 |
zyga | 1) connect to X | 08:35 |
zyga | 2) ... | 08:36 |
zyga | browse through all the windows | 08:36 |
zyga | find this property | 08:36 |
zyga | etc | 08:36 |
zyga | doesn't X enforce anything to prevent forgery? | 08:36 |
jamesh | I don't remember the exact set of calls | 08:36 |
zyga | how do I ask a window to close? | 08:36 |
jamesh | an ICCCM message, I think (it's ages since I've looked at this) | 08:36 |
jamesh | and Wayland is going to be different | 08:37 |
zyga | what does it look like in wayland? | 08:37 |
zyga | can it be done by non-shell? | 08:37 |
jamesh | I don't think there is a standardised way to do it with Wayland: in general you don't want one application managing the windows of another app | 08:38 |
jamesh | with that said, the Wayland compositor should be able to securely identify and close clients | 08:38 |
zyga | yeah, I just wish there was a way for us to ask an app to close, politely | 08:38 |
zyga | not to kill it | 08:38 |
zyga | not to unmap it | 08:39 |
zyga | ask it to close | 08:39 |
jamesh | On the X11 side, you'd probably send the same message the window manager does when the user clicks the window's close button | 08:39 |
jamesh | (one of the consequences of having out of process window frames) | 08:39 |
zyga | I think that our goal of having a nice interaction button is mostly futile, at least for now | 08:41 |
zyga | but I will try some more | 08:41 |
jamesh | If you're okay with having snaps opt in to a nice behaviour, we could have a flag in snap.yaml saying that the app will do a clean shutdown on e.g. SIGHUP | 08:43 |
jamesh | that's clearly not going to work everywhere though | 08:43 |
zyga | jamesh I really want something that works for regular apps, the chrome and firefox and the random desktop app equally | 08:48 |
zyga | if it's not nice, it's not worth having | 08:48 |
jamesh | zyga: you probably want something that does a combination of xlsclients and xkill | 08:52 |
zyga | xkill is not useful, I tried that, it just unmaps the window | 08:52 |
zyga | chrome keeps running | 08:52 |
zyga | running chrome again doesn't do anything as existing chrome tries to open a new window and fails | 08:52 |
zyga | (or maybe opens a tab in the unmapped window) | 08:52 |
zyga | usability wise it's not useful | 08:53 |
jamesh | zyga: xkill calls XKillClient() | 08:53 |
jamesh | so maybe that's not what you want either | 08:53 |
jamesh | it isn't unmapping the window | 08:54 |
jamesh | code is at https://gitlab.freedesktop.org/xorg/app/xlsclients and https://gitlab.freedesktop.org/xorg/app/xkill | 08:54 |
zyga | thank you, I will look through those and play some more | 08:54 |
zyga | maybe I made a mistake but I tried xkill with a browser and just got headless chrome running | 08:55 |
jamesh | Most apps will exit if their X connection is closed. Maybe Chrome is different | 08:58 |
jamesh | but yeah: XKillClient is not going to allow a graceful exit | 08:58 |
jamesh | https://www.x.org/releases/X11R7.6/doc/xorg-docs/specs/ICCCM/icccm.html is the spec for how traditional window management works. The _NET_* properties are from the EWMH spec: https://specifications.freedesktop.org/wm-spec/wm-spec-1.3.html | 09:00 |
jamesh | zyga: if you're confining yourself to EWMH compliant window managers, things are relatively simple for enumerating clients: https://paste.ubuntu.com/p/CsfbGTh6Jz/ | 09:22 |
jamesh | zyga: it also has a _NET_CLOSE_WINDOW message you can send to ask the window manager to close a window on your behalf: it was designed to support panel/pager apps as separate processes to the WM | 09:23 |
zyga-x240 | jamesh: thank you, I will try that | 09:30 |
zyga-x240 | I think that's enough for what we need | 09:31 |
zyga-x240 | it's not perfect but I think it's close | 09:31 |
dot-tobias | hi all | 09:37 |
zyga-x240 | hi | 09:37 |
zyga-x240 | sigh, ./get-deps hangs... | 09:47 |
mvo | zyga-x240: nice, thanks for pushing on this ! | 09:47 |
mvo | zyga-x240: not that ./get-deps hangs of course .) | 09:47 |
zyga-x240 | haha | 09:47 |
zyga-x240 | oh well | 09:48 |
zyga-x240 | fatal: unable to access 'https://github.com/kardianos/govendor/': Could not resolve host: github.com | 09:49 |
zyga-x240 | wat? | 09:50 |
mvo | zyga-x240: haha | 09:53 |
jamesh | Just wait until we switch to modules. Then you'll only need to worry about proxy.golang.org failing to resolve | 10:02 |
* zyga goes to prep coffee for the calls | 10:46 | |
mvo | xnox: could we change https://meet.google.com/linkredirect?authuser=0&dest=https%3A%2F%2Fgithub.com%2Fsnapcore%2Fpi-gadget%2Fblob%2F20-arm64%2Fgadget.yaml%23L17 to ext4 (instead of vfat). ondra just raised this | 11:39 |
ijohnson | mvo: I don't think we can change ubuntu-boot to ext4 because the uboot bootloader needs to read boot.sel from that partition, and last we talked about it, uboot _can_ read ext4 but it takes __AGES__ like minutes | 12:39 |
ijohnson | so if you're okay with multi-minute bootloader time then yes we could probably change it but I don't think we want to do that | 12:40 |
mvo | ijohnson: ohhh, maybe we should add a comment then, that's a bit unfortunate | 12:41 |
zyga | would it be better if ubuntu-boot was ext4 but without some features | 12:41 |
ijohnson | mvo: yeah it is unfortunate | 12:41 |
mvo | ijohnson: but yeah, let's not do that :) | 12:41 |
zyga | perhaps the modern set of feature flags makes it slow | 12:41 |
zyga | and a more plain ext3 like set would be better | 12:41 |
ijohnson | zyga: I think the issue is more uboot | 12:42 |
zyga | are you saying that uboot ext* implementation is just unusable? | 12:42 |
ijohnson | yes that's what I remember | 12:42 |
zyga | hmmm | 12:42 |
ijohnson | maybe I'm wrong, would need to confirm with Dave to be sure | 12:43 |
mvo | I see some references that https://github.com/u-boot/u-boot/commit/d5aee659f217746395ff58adf3a863627ff02ec1 makes this fast | 12:45 |
mvo | but I have no idea what I'm talking about | 12:45 |
zyga | pstolowski I subscribed you to https://bugs.launchpad.net/snapd/+bug/1898934 - have a look if the missing undo handler for unlink could be related | 13:07 |
zyga | there's state.json there | 13:07 |
zyga | though there's no changes there _at all_ | 13:07 |
zyga | but there's snap state | 13:08 |
pstolowski | zyga: mhm, ok thanks | 13:08 |
zyga | thank you! | 13:08 |
pstolowski | zyga: ah, it's roger | 13:08 |
zyga | rogpeppe: if you are around and could reply to pawel's and mvo's questions in the bug, that would help | 13:09 |
zyga | pstolowski the finding about timings is brilliant | 13:21 |
zyga | mvo we recycle timing data at a different rate so when we have lost changes because they failed and got collected, we can see their shadow in timing data | 13:22 |
pstolowski | i've commented | 13:26 |
pstolowski | jeez, typos there, i wish LP allowed to edit comments | 13:27 |
mvo | zyga: ta | 13:29 |
xnox | mvo: no we cannot. | 13:29 |
xnox | mvo: pi gpu bootloader cannot read ext4, nor read dtbs off there. | 13:29 |
mvo | xnox: that's sad, thanks for letting me know | 13:29 |
xnox | mvo: we can discuss things. but probably better with waveform. We might be able to do a different gadget/model where boot is ext4. | 13:30 |
mvo | xnox: it's not super urgent, if there are good reasons that's okay for me, it was an action item from a meeting with field/ondra to discuss this, they had concerns | 13:31 |
xnox | mvo: i think there is scope to have "ubootish gadget" which could have boot as ext4; and "pibootish gadget" which will not have uboot, and must have ubuntu-boot as vfat. | 13:31 |
xnox | mvo: but they want it for uboot, or on pi? | 13:31 |
mvo | xnox: the concern was robustness | 13:31 |
xnox | (i think there is field things on both now, hence the two are no longer the same) | 13:32 |
xnox | mvo: when the underlying mmc is crap, it's not going to be robust either way! =) | 13:32 |
xnox | mvo: also i think uboot's ext implementation is not robust either. so.... | 13:32 |
mvo | xnox: right, a comment maybe in the gadget why we made this decision for the next person that wonders. it's fine for me, you guys own it | 13:33 |
pstolowski | zyga: we just need to take timings with a grain of salt; we may miss them for tasks that errored out before Save() | 13:36 |
ijohnson | mvo: xnox: the ubuntu-boot partition does not need to be read by the pi bootloader though ? | 13:44 |
ijohnson | because we first load u-boot from ubuntu-seed, then uboot will read the boot.sel file from ubuntu-boot, then load the kernel from ubuntu-boot and then boot into the kernel iirc | 13:44 |
zyga | xnox: FYI: https://github.com/snapcore/snapd/pull/9482 needs changes | 13:45 |
xnox | zyga: ack | 13:46 |
zyga | amurray, jdstrand: low hanging fruit that needs a security review: https://github.com/snapcore/snapd/pull/9449 | 13:47 |
jdstrand | zyga-x240: ack | 14:09 |
pstolowski | zyga-x240, mvo : so timings from rogpeppe 's state show some interesting stuff, there is snapd refresh on 2020-09-07, and there is undo for that change ("change-id": "37") | 14:14 |
zyga | pstolowski: do you think this is related to missing undo handler for unlink? | 14:14 |
pstolowski | zyga-x240, mvo i wonder if we don't have an issue somewhere where an unplanned reboot in a wrong moment when this is happening (before undo restores previous snapd) leaves us with no active snapd | 14:15 |
zyga | mmm | 14:16 |
zyga | it was a power loss | 14:16 |
zyga | perhaps very unlucky one | 14:16 |
zyga | this is very important for reliability at scale, apart from this issue, the failover logic did not help | 14:16 |
zyga | we should identify both issues | 14:16 |
pstolowski | zyga: no, i don't think it's missing undo for unlink; as i said this other problem only affects snap remove and snap disable | 14:16 |
zyga | I see | 14:16 |
pstolowski | but i think perf timings are useful to understand what happened, combined with timestamps of reboot | 14:17 |
zyga | I'll grab lunch | 14:18 |
cachio | zyga, hey, I pushed a change to #9414 | 14:26 |
cachio | I renamed most of the names for the nested tool | 14:26 |
cachio | could you please take a look to see if new names make sense? | 14:27 |
zyga | cachio sure, not immediately though | 14:44 |
zyga-x240 | back from lunch and back to backend bits | 14:52 |
* zyga cancelled PT today, unsafe to go out | 14:56 | |
zyga | will focus on finishing stuff | 14:57 |
* cachio lunch | 15:29 | |
pstolowski | zyga, mvo i've added some observations to Roger's bug | 15:57 |
mvo | pstolowski: \o/ thanks for this | 15:58 |
pstolowski | mvo: does it make sense? | 15:58 |
mvo | pstolowski: in a meeting so I don't know (yet) :/ | 15:58 |
pstolowski | ah, sure, no worries | 15:58 |
zyga-x240 | mvo: I was wondering if we don't refresh things on separate lanes | 16:19 |
zyga-x240 | perhaps we don't, I don't recall | 16:19 |
zyga-x240 | hmm | 16:35 |
zyga-x240 | something beeped | 16:35 |
* zyga-x240 looks for a window with some app | 16:36 | |
zyga-x240 | mattermost :) | 16:38 |
zyga-x240 | ./snapmgr.go:412:13: too many errors | 16:51 |
zyga-x240 | well | 16:51 |
zyga-x240 | little by little | 16:51 |
* zyga-x240 is stuck | 17:13 | |
zyga-x240 | pedronis_: if you have a moment, I'd like to ask about one problem tomorrow | 17:24 |
zyga-x240 | pedronis_: requiring backend for soft refresh check is difficult, as it's called from doInstall which normally doesn't have access to the backend | 17:25 |
=== pedronis_ is now known as pedronis | ||
pedronis | zyga: mmh, let's chat tomorrow morning if possible | 17:56 |
* ijohnson EODs | 21:32 |
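
The startup checks zyga-x240 lists around 08:02 (no state.json means seed; state.json present but snapd not in good shape means "do something") can be sketched as a small decision function. This is only an illustration, not snapd's code: the `systemFacts` type, the `decide` function and the action names are hypothetical, and the "repair" action stands in for the unspecified "do something" from the discussion.

```go
package main

import "fmt"

// systemFacts is a hypothetical summary of what a boot-time check could observe.
type systemFacts struct {
	StateFileExists   bool // state.json is present
	ServiceFileExists bool // a snapd.service unit file is on disk
	CurrentSymlinkOK  bool // the snapd "current" symlink resolves
}

// action is the hypothetical outcome of the check.
type action int

const (
	actionStartNormally action = iota // everything looks fine
	actionSeed                        // no state at all: seed the system
	actionRepair                      // state exists but snapd is not usable
)

func (a action) String() string {
	switch a {
	case actionSeed:
		return "seed"
	case actionRepair:
		return "repair"
	default:
		return "start-normally"
	}
}

// decide maps the observed facts to one of the actions from the discussion.
func decide(f systemFacts) action {
	if !f.StateFileExists {
		return actionSeed
	}
	if !f.ServiceFileExists || !f.CurrentSymlinkOK {
		return actionRepair
	}
	return actionStartNormally
}

func main() {
	// The situation described in the log: state.json exists, but there is
	// no snapd.service after the power cut.
	broken := systemFacts{StateFileExists: true, ServiceFileExists: false, CurrentSymlinkOK: false}
	fmt.Println(decide(broken))
}
```

Keeping the decision separate from the actual repair work makes the case from the bug report (state present, service missing) easy to cover in a unit test.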
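
The `_NET_CLOSE_WINDOW` request jamesh mentions at 09:23 is, per the EWMH spec, a format-32 ClientMessage sent to the root window, with `data.l[0]` holding a timestamp, `data.l[1]` the source indication, and the remaining elements set to zero. The sketch below only models that payload in Go; it does not talk to an X server. The `clientMessage` type, the `newCloseWindowMessage` helper and the window ID are made up for illustration, and a real client would intern the atom name and send the event through an X11 library.

```go
package main

import "fmt"

// Source indication values defined by EWMH.
const (
	sourceUnspecified uint32 = 0 // older clients that do not set it
	sourceApplication uint32 = 1 // request comes from a normal application
	sourcePager       uint32 = 2 // pagers and tools acting on direct user action
)

// clientMessage is a hypothetical model of an X11 ClientMessage event
// with format 32 (five 32-bit data elements).
type clientMessage struct {
	Window      uint32    // the client window that should be closed
	MessageType string    // atom name; a real client would use the interned atom ID
	Format      uint8     // always 32 for this message
	Data        [5]uint32 // data.l[0..4]
}

// newCloseWindowMessage fills in the fields as described by EWMH for
// _NET_CLOSE_WINDOW. A timestamp of 0 corresponds to CurrentTime.
func newCloseWindowMessage(window, timestamp, source uint32) clientMessage {
	return clientMessage{
		Window:      window,
		MessageType: "_NET_CLOSE_WINDOW",
		Format:      32,
		Data:        [5]uint32{timestamp, source, 0, 0, 0},
	}
}

func main() {
	// 0x3a00007 is an arbitrary example window ID, not taken from the log.
	msg := newCloseWindowMessage(0x3a00007, 0, sourcePager)
	fmt.Printf("%+v\n", msg)
}
```

Because the request goes to the window manager rather than to the application, how the close is carried out is up to the window manager, which matches the "ask it to close, politely" goal only as far as the window manager and the application cooperate.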