/srv/irclogs.ubuntu.com/2020/10/08/#snappy.txt

pstolowskimorning07:01
zyga-x240hello07:44
zyga-x240mvo: around?07:44
zyga-x240pstolowski: hi07:44
mvozyga-x240: sure07:45
zyga-x240mvo: please look at https://bugs.launchpad.net/snapd/+bug/189893407:45
zyga-x240it looks worrying07:45
zyga-x240probably something very valuable for us to learn there07:45
zyga-x240we looked at state.json via debug command - no changes (empty collection)07:45
zyga-x240it seems like whatever happened made snapd inactive and the system dead/broken07:46
mvozyga-x240: can you subscribe me? this seems private07:46
zyga-x240sure07:46
zyga-x240I made it private as there's state.json attached07:46
mvozyga-x240: thanks, makes sense. the interesting part is why the snapd.failover.service did not DTRT :(07:47
zyga-x240we have access to that device, roger said he would help us look07:47
mvozyga-x240: ohhh, maybe we have an overlooked failure mode here07:47
zyga-x240oh? :)07:47
zyga-x240do you already see what the problem is?07:48
zyga-x240we discussed this close to midnight so I was rather tired07:48
mvozyga-x240: hold on, need to look at a core system07:48
zyga-x240sure07:48
* zyga-x240 hugs mvo07:48
mvozyga-x240: a big **maybe** for now :)07:48
zyga-x240hmm, my MBP cannot connect to IRC (on fallback ISP connection)07:50
zyga-x240mvo: I spoke with ken, there may be a way to ask apps to quit nicely, I'm looking at that now, no promises07:54
mvozyga-x240: hm, no, it seems we are doing the right thing, we should not need the current symlink for failover07:55
zyga-x240mvo: aha07:56
zyga-x240mvo: at the time roger checked, snapd was not running, and IIRC there was no snapd.service07:56
zyga-x240my hunch is that whatever happened made snapd stay in inactive state07:56
zyga-x240as in the snap was disabled07:56
zyga-x240mid refresh or something07:56
mvozyga-x240: snap debug state /tmp/state.json07:56
mvoID   Status  Spawn  Ready  Label  Summary07:56
zyga-x240roger also said this was a power cut07:56
zyga-x240yeah that is so weird!07:56
mvozyga-x240: i.e. *nothing* in the state :(07:56
zyga-x240well07:56
zyga-x240not nothing07:56
zyga-x240did you look at snapstate?07:56
zyga-x240what's the state of snapd?07:56
zyga-x240mvo: so maybe this is a weird case of disk corruption07:57
zyga-x240mvo: one more bit of information: the boot partition needs fsck and I'm sure we're just not doing that ever in practice07:57
zyga-x240mvo: perhaps snapd got broken because boot partition is inconsistent with snapstate07:58
zyga-x240and we "amend" stuff in a broken way07:58
zyga-x240i.e. what would happen if the boot config was broken in some way07:58
zyga-x240and could not be saved07:58
zyga-x240no smoking gun, just possibilities07:58
mvozyga-x240: sorry, I misspoke. not nothing but no tasks when there should be tasks like refreshes that happened in the last days07:58
zyga-x240mvo: yeah07:58
zyga-x240mvo: I really really wish we kept like state.1.json07:59
zyga-x240from last day/week/month07:59
zyga-x240for forensics07:59
zyga-x240how soon would we remove failed changes from state?08:00
mvozyga-x240: yeah, something is missing here. I don't remember, need to look but it should be a bit longer, it seems the refresh happened 3 days ago08:00
zyga-x2403 days ago and power cut was last night (based on journalctl data)08:01
zyga-x240fortunately this pi has persistency08:01
mvozyga-x240: but "data" did not need a fsck?08:01
zyga-x240mvo: I don't know, but I didn't see such message in the log08:01
zyga-x240mvo: not sure if we have a fsck test08:01
zyga-x240we should really make this right for core2008:01
zyga-x240(as in, make sure there's a test that shows we can fsck)08:01
zyga-x240anyway08:01
zyga-x240unless this is really just data corruption08:02
zyga-x240we have some things to improve in failover perhaps08:02
zyga-x240perhaps the service that starts snapd08:02
zyga-x240should do one more thing:08:02
zyga-x240if there's no state.json -> seed08:02
zyga-x240if there's state json but snapd is not good (no symlink, no service, etc) -> do something08:02
zyga-x240IIRC right now that's not handled08:02
zyga-x240if you have no service (roger did not) for snapd then we just stay dead08:02
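The recovery idea in the messages above can be written down as a small decision function. This is only a sketch under assumptions: the function name, the three boolean inputs and the action strings are hypothetical and are not part of snapd. The chat only says "do something" for the broken case, so "repair" is a placeholder.

```python
def failover_action(has_state: bool, has_current_symlink: bool,
                    has_service_unit: bool) -> str:
    """Pick a recovery action for a hypothetical snapd start-up helper."""
    if not has_state:
        # No state.json at all: treat the system as unseeded.
        return "seed"
    if not (has_current_symlink and has_service_unit):
        # state.json exists but snapd itself is unusable (no symlink,
        # no service unit). The chat leaves the action open.
        return "repair"
    # Everything looks healthy: start snapd as usual.
    return "start"
```

The case Roger hit (state present, no snapd.service) would map to `failover_action(True, True, False)`, which lands in the placeholder "repair" branch instead of staying dead.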
zyga-x240mvo: maybe helpful: https://paste.ubuntu.com/p/7cTd6b8Yq4/08:06
zyga-x240that's so weird08:07
zyga-x240snapd is running there08:07
zyga-x240but /usr/bin/snap is a broken symlink!08:07
zyga-x240rogpeppe: hi08:07
mvozyga-x240: hm, I think we need to look if we *always* have a service file for snapd on disk during the refresh08:07
zyga-x240rogpeppe: are you around?08:07
zyga-x240mvo: yeah08:07
zyga-x240I suspect we do not08:07
zyga-x240like with regular refresh08:07
zyga-x240unlink snap08:07
zyga-x240will remove snapd.service for sure08:07
zyga-x240unless there's a special case that I didn't see while poking at this recently08:08
mvozyga-x240: but yeah, in this log snapd is alive and kicking08:08
zyga-x240right?08:08
zyga-x240but this is the log before the power cut IIRC08:09
* zyga-x240 looks08:09
zyga-x240after the power cut that's gone08:09
zyga-x240no snapd.service08:09
zyga-x240no anything08:09
zyga-x240but it may hint at an earlier problem08:09
zyga-x240or maybe that's just fallback snapd?08:09
* zyga-x240 doesn't know08:09
zyga-x240mvo: if you look at the attached --list-boots output08:11
zyga-x240mvo: then at the dates08:11
zyga-x240mvo: it seems to indicate that the journal log was from the -1 boot (not the current but the previous one)08:11
zyga-x240mvo: so it may also suggest that the real problem happened earlier08:11
zyga-x240and each subsequent boot corrupted the system some more08:12
zyga-x240I'd love to see logs from -2 and 008:12
zyga-x240rogpeppe: ^^08:12
zyga-x240rogpeppe: could you attach output of:08:12
zyga-x240rogpeppe: journalctl -b -208:12
zyga-x240rogpeppe: journalctl -b 008:12
zyga-x240(separately)08:12
zyga-x240assuming you have not rebooted since, as then the indices shift08:12
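The caveat about shifting indices can be illustrated with a short sketch. It assumes a plain list of boot IDs ordered oldest first (the IDs below are made up) and handles only 0 and negative offsets, where 0 is the current boot and -1 the previous one.

```python
def resolve_boot(boot_ids, offset):
    """Return the boot ID for a relative offset (0 = current, -1 = previous).

    boot_ids is ordered oldest first; only offsets <= 0 are supported here.
    """
    if offset > 0 or -offset >= len(boot_ids):
        raise IndexError("no such boot")
    return boot_ids[len(boot_ids) - 1 + offset]


boots = ["boot-a", "boot-b", "boot-c"]   # hypothetical boot IDs
before = resolve_boot(boots, -1)         # "boot-b"
boots.append("boot-d")                   # the device reboots once more
after = resolve_boot(boots, -1)          # now "boot-c"
shifted = resolve_boot(boots, -2)        # "boot-b" moved to -2
```

After one extra reboot, the boot that used to be -1 has to be requested as -2, which is why the logs should be collected before rebooting.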
mvothanks for digging into this zyga-x24008:12
zyga-x240LMAO08:23
zyga-x240gnome shell has a DBus API for evaluating javascript08:23
zyga-x240so...08:23
zyga-x240the shell _is_ a calculator08:23
zyga-x240... why why why08:23
jameshzyga-x240: it wouldn't be evaluating JavaScript in a bare namespace: presumably it is in the same context as the rest of the UI is running08:30
zygaI'm very tempted to eval an infinite loop but perhaps another day08:30
zygajamesh btw, I had a look at gnome-session-quit08:30
zygahoping there's an API to ask one specific app to quit08:30
zygabut after going through gnome-session I don't think there is08:30
zygait's all-or-nothing08:30
zygado you know if there's some way to ask a specific app to quit that I've missed?08:31
zygajamesh (I think that having eval as an API is a bit irresponsible)08:31
zygabut perhaps this is some debug leftover08:31
jameshzyga: in X11, I'd probably try to map the process ID to a window and try to close the window08:31
zygajamesh any pointers on how to do that?08:32
zygaI'm happy to write C08:32
zyga(as you know :)08:32
jameshzyga: there is a _NET_WM_PID that most applications set.  It is not something you could rely on though08:35
jameshi.e. it is controlled by the client (so could be forged), and may be absent08:35
zygagiven that I'm not an X developer, how would I do the whole thing08:35
zyga1) connect to X08:35
zyga2) ...08:36
zygabrowse through all the windows08:36
zygafind this property08:36
zygaetc08:36
zygadoesn't X enforce anything to prevent forgery?08:36
jameshI don't remember the exact set of calls08:36
zygahow do I ask a window to close?08:36
jameshan ICCCM message, I think (it's ages since I've looked at this)08:36
jameshand Wayland is going to be different08:37
zygahow does it look like in wayland?08:37
zygacan it be done by non-shell?08:37
jameshI don't think there is a standardised way to do it with Wayland: in general you don't want one application managing the windows of another app08:38
jameshwith that said, the Wayland compositor should be able to securely identify and close clients08:38
zygayeah, I just wish there was a way for us to ask an app to close, politely08:38
zyganot to kill it08:38
zyganot to unmap it08:39
zygaask it to close08:39
jameshOn the X11 side, you'd probably send the same message the window manager does when the user clicks the window's close button08:39
jamesh(one of the consequences of having out of process window frames)08:39
zygaI think that our goal of having a nice interaction button is mostly futile, at least for now08:41
zygabut I will try some more08:41
jameshIf you're okay with having snaps opt in to a nice behaviour, we could have a flag in snap.yaml saying that the app will do a clean shutdown on e.g. SIGHUP08:43
jameshthat's clearly not going to work everywhere though08:43
zygajamesh I really want something that works for regular apps, the chrome and firefox and the random desktop app equally08:48
zygaif it's not nice, it's not worth having08:48
jameshzyga: you probably want something that does a combination of xlsclients and xkill08:52
zygaxkill is not useful, I tried that, it just unmaps the window08:52
zygachrome keeps running08:52
zygarunning chrome again doesn't do anything as existing chrome tries to open a new window and fails08:52
zyga(or maybe opens a tab in the unmapped window)08:52
zygausability wise it's not usefil08:53
zyga*useful08:53
jameshzyga: xkill calls XKillClient()08:53
jameshso maybe that's not what you want either08:53
jameshit isn't unmapping the window08:54
jameshcode is at https://gitlab.freedesktop.org/xorg/app/xlsclients and https://gitlab.freedesktop.org/xorg/app/xkill08:54
zygathank you, I will look through those and play some more08:54
zygamaybe I made a mistake but I tried xkill with a browser and just got headless chrome running08:55
jameshMost apps will exit if their X connection is closed.  Maybe Chrome is different08:58
jameshbut yeah: XKillClient is not going to allow a graceful exit08:58
jameshhttps://www.x.org/releases/X11R7.6/doc/xorg-docs/specs/ICCCM/icccm.html is the spec for how traditional window management works.  The _NET_* properties are from the EWMH spec: https://specifications.freedesktop.org/wm-spec/wm-spec-1.3.html09:00
jameshzyga: if you're confining yourself to EWMH compliant window managers, things are relatively simple for enumerating clients: https://paste.ubuntu.com/p/CsfbGTh6Jz/09:22
jameshzyga: it also has a _NET_CLOSE_WINDOW message you can send to ask the window manager to close a window on your behalf: it was designed to support panel/pager apps as separate processes to the WM09:23
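For reference, the fields of a _NET_CLOSE_WINDOW request as laid out in the EWMH spec can be sketched like this. The sketch does not talk to an X server: the `ClientMessage` class and `net_close_window` helper are hypothetical names used for illustration. A real client would intern the atom and send the event to the root window with the SubstructureNotify and SubstructureRedirect masks.

```python
from dataclasses import dataclass

# EWMH source indication values for data.l[1].
SOURCE_APPLICATION = 1   # request comes from a normal application
SOURCE_PAGER = 2         # request comes from a pager or similar tool


@dataclass(frozen=True)
class ClientMessage:
    """Plain description of an X11 ClientMessage (illustrative only)."""
    window: int          # the client window that should be closed
    message_type: str    # atom name; a real client interns it first
    format: int          # EWMH client messages use 32-bit data items
    data: tuple          # the five data.l[] values


def net_close_window(window, timestamp=0, source=SOURCE_PAGER):
    """Describe a _NET_CLOSE_WINDOW request for the given window ID."""
    # data.l[0] = timestamp (0 means CurrentTime), data.l[1] = source
    # indication, remaining elements are zero.
    return ClientMessage(window, "_NET_CLOSE_WINDOW", 32,
                         (timestamp, source, 0, 0, 0))


msg = net_close_window(0x3A00007)  # made-up window ID
```

The window manager decides how to honour the request, typically by following the same path it uses when the user clicks the close button, so the application gets a chance to shut down on its own terms.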
zyga-x240jamesh: thank you, I will try that09:30
zyga-x240I think that's enough for what we need09:31
zyga-x240it's not perfect but I think it's close09:31
dot-tobiashi all09:37
zyga-x240hi09:37
zyga-x240sigh, ./get-deps hangs...09:47
mvozyga-x240: nice, thanks for pushing on this !09:47
mvozyga-x240: not that ./get-deps hangs of course .)09:47
zyga-x240haha09:47
zyga-x240oh well09:48
zyga-x240fatal: unable to access 'https://github.com/kardianos/govendor/': Could not resolve host: github.com09:49
zyga-x240wat?09:50
mvozyga-x240: haha09:53
jameshJust wait until we switch to modules.  Then you'll only need to worry about proxy.golang.org failing to resolve10:02
* zyga goes to prep coffee for the calls10:46
mvoxnox: could we change https://meet.google.com/linkredirect?authuser=0&dest=https%3A%2F%2Fgithub.com%2Fsnapcore%2Fpi-gadget%2Fblob%2F20-arm64%2Fgadget.yaml%23L17 to ext4 (instead of vfat). ondra just raised this11:39
ijohnsonmvo: I don't think we can change ubuntu-boot to ext4 because the uboot bootloader needs to read boot.sel from that partition, and last we talked about it, uboot _can_ read ext4 but it takes __AGES__ like minutes12:39
ijohnsonso if you're okay with multi-minute bootloader time then yes we could probably change it but I don't think we want to do that12:40
mvoijohnson: ohhh, maybe we should add a comment then, that's a bit unfortunate12:41
zygawould it be better if ubuntu-boot was ext4 but without some features12:41
ijohnsonmvo: yeah it is unfortunate12:41
mvoijohnson: but yeah, let's not do that :)12:41
zygaperhaps the modern set of feature flags makes it slow12:41
zygaand a more plain ext3 like set would be better12:41
ijohnsonzyga: I think the issue is more uboot12:42
zygaare you saying that uboot ext* implementation is just unusable?12:42
ijohnsonyes that's what I remember12:42
zygahmmm12:42
ijohnsonmaybe I'm wrong, would need to confirm with Dave to be sure12:43
mvoI see some references that https://github.com/u-boot/u-boot/commit/d5aee659f217746395ff58adf3a863627ff02ec1 makes this fast12:45
mvobut I have no idea what I'm talking about12:45
zygapstolowski I subscribed you to https://bugs.launchpad.net/snapd/+bug/1898934 - have a look if the missing undo handler for unlink could be related13:07
zygathere's state.json there13:07
zygathough there's no changes there _at all_13:07
zygabut there's snap state13:08
pstolowskizyga: mhm, ok thanks13:08
zygathank you!13:08
pstolowskizyga: ah, it's roger13:08
zygarogpeppe: if you are around and could reply to pawel's and mvo's questions in the bug, that would help13:09
zygapstolowski the finding about timings is brilliant13:21
zygamvo we recycle timing data  at a different rate so when we have lost changes because they failed and got collected, we can see their shadow in timing data13:22
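The "shadow" trick can be sketched as a set difference between the change IDs referenced by timing records and the changes still present in the state. The layout assumed here (a top-level "changes" map, and timing records under data/timings carrying a "change-id" tag) is an assumption made for illustration and may not match the real state.json exactly.

```python
def shadow_change_ids(state):
    """Return change IDs that appear in timing data but not in changes."""
    known = set(state.get("changes", {}))
    seen = set()
    # Assumed layout: state["data"]["timings"] is a list of records,
    # each with a "tags" map that may carry a "change-id".
    for timing in state.get("data", {}).get("timings", []):
        change_id = timing.get("tags", {}).get("change-id")
        if change_id is not None:
            seen.add(change_id)
    return sorted(seen - known)


# Made-up sample: change 37 was pruned, but its timings survived.
sample = {
    "changes": {"40": {"kind": "refresh-snap"}},
    "data": {"timings": [
        {"tags": {"change-id": "37"}},
        {"tags": {"change-id": "40"}},
        {"tags": {}},
    ]},
}
missing = shadow_change_ids(sample)
```

Any ID returned this way points at a change that ran and was later removed from the state, which is useful for forensics when no older copy of state.json exists.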
pstolowskii'ev commented13:26
pstolowski*i've13:26
pstolowskijeez, typos there, i wish LP allowed to edit comments13:27
mvozyga: ta13:29
xnoxmvo:  no we cannot.13:29
xnoxmvo: pi gpu bootloader cannot read ext4, nor read dtbs off there.13:29
mvoxnox: that's sad, thanks for letting me know13:29
xnoxmvo: we can discuss things. but probably better with waveform. We might be able to do a different gadget/model where boot is ext4.13:30
mvoxnox: it's not super urgent, if there are good reasons that's okay for me, it was an action item from a meeting with field/ondra to discuss this, they had concerns13:31
xnoxmvo: i think there is scope to have "ubootish gadget" which could have boot as ext4; and "pibootish gadget" which will not have uboot, and must have ubuntu-boot as vfat.13:31
xnoxmvo:  but they want it for uboot, or on pi?13:31
mvoxnox: the concern was robustness13:31
xnox(i think there is field things on both now, hence the two are no longer the same)13:32
xnoxmvo:  when the underlying mmc is crap, it's not going to be robust either way! =)13:32
xnoxmvo:  also i think uboot's ext implementation is not robust either. so....13:32
mvoxnox: right, a comment maybe in the gadget why we made this decision for the next person that wonders. it's fine for me, you guys own it13:33
pstolowskizyga: we just need take timings with a grain of salt; we may miss them for tasks that errored out before Save()13:36
ijohnsonmvo: xnox: the ubuntu-boot partition does not need to be read by the pi bootloader though ?13:44
ijohnsonbecause we first load u-boot from ubuntu-seed, then uboot will read the boot.sel file from ubuntu-boot, then load the kernel from ubuntu-boot and then boot into the kernel iirc13:44
zygaxnox: FYI: https://github.com/snapcore/snapd/pull/9482 needs changes13:45
xnoxzyga:  ack13:46
zygaamurray, jdstrand: low hanging fruit that needs a security review: https://github.com/snapcore/snapd/pull/944913:47
jdstrandzyga-x240: ack14:09
pstolowskizyga-x240, mvo : so timings from rogpeppe 's state show some interesting stuff, there is snapd refresh on 2020-09-07, and there is undo for that change ("change-id": "37")14:14
zygapstolowski: do you think this is related to missing undo handler for unlink?14:14
pstolowskizyga-x240, mvo i wonder if we don't have an issue somewhere where an unplanned reboot in a wrong moment when this is happening (before undo restores previous snapd) leaves us with no active snapd14:15
zygammm14:16
zygait was a power loss14:16
zygaperhaps very unlucky one14:16
zygathis is very important for reliability at scale, apart from this issue, the failover logic did not help14:16
zygawe should identify both issues14:16
pstolowskizyga: no, i don't think it's missing undo for unlink; as i said this other problem only affects snap remove and snap disable14:16
zygaI see14:16
pstolowskibut i think perf timings are useful to understand what happened, combined with timestamps of reboot14:17
zygaI'll grab lunch14:18
cachiozyga, hey, I pushed a change to #941414:26
cachioI renamed most of the names for the nested tool14:26
cachiocould you please take a look to see if new names make sense?14:27
zygacachio sure, not immediately though14:44
zyga-x240back from lunch and back to backend bits14:52
* zyga cancelled PT today, unsafe to go out14:56
zygawill focus on finishing stuff14:57
* cachio lunch15:29
pstolowskizyga, mvo i've added some observations to Roger's bug15:57
mvopstolowski: \o/ thanks for this15:58
pstolowskimvo: does it make sense?15:58
mvopstolowski: in a meeting so I don't know (yet) :/15:58
pstolowskiah, sure, no worries15:58
zyga-x240mvo: I was wondering if we don't refresh things on separate lanes16:19
zyga-x240perhaps we don't, I don't recall16:19
zyga-x240hmm16:35
zyga-x240something beeped16:35
* zyga-x240 looks for a window with some app16:36
zyga-x240mattermost :)16:38
zyga-x240./snapmgr.go:412:13: too many errors16:51
zyga-x240well16:51
zyga-x240little by little16:51
* zyga-x240 is stuck17:13
zyga-x240pedronis_: if you have a moment, I'd like to ask about one problem tomorrow17:24
zyga-x240pedronis_: requiring backend for soft refresh check is difficult, as it's called from doInstall which normally doesn't have access to the backend17:25
=== pedronis_ is now known as pedronis
pedroniszyga: mmh, let's chat tomorrow morning if possible17:56
* ijohnson EODs21:32
