[00:25] PR snapcraft#3286 opened: [legacy backport] v1 plugins: lock godep's dependencies [01:26] PR snapcraft#3286 closed: [legacy backport] v1 plugins: lock godep's dependencies [04:08] PR snapd#9348 opened: tests: print all the serial logs for the nested test [06:09] morning [06:11] mvo: hey, saw your comment under #9347, i can start looking into this [06:11] PR #9347: tests/lib/nested.sh: use more focused cloud-init config for uc20 [06:11] mborzecki: thanks [06:11] mborzecki: might be as simple as adding something to nested_start_core_vm_unit that already waits for ssh [06:12] mhm [06:12] anyway, looks like we're becoming experts in cloud init too [06:14] mborzecki: yeah :( [06:14] mborzecki: fun, looks like your better logging PR passed except for the smoke test where the change to add logging broke the test because now snap-confine prints stuff to stderr [06:14] hahah [06:16] mvo: oh, and idk if you seen the comment about the nested suite sending 150MB of project data to the vm [06:17] maybe that's the kernel/core/snapd snaps repacked [06:18] mborzecki: yeah, I think [06:19] mborzecki: looks like we need to either put them elsewhere or clean them after the repacking [06:19] mborzecki: but yeah, looks like there is some junk around [06:21] mborzecki: given that all the other tests have passed and that is super rare - what functional change could have triggered this? or do you think it's pure coincidence? [06:21] mborzecki: i.e. could the cloud-init chnage be responsible? [06:21] feels like a stroke of luck [06:22] mborzecki: ok [06:34] mvo: as for SNAP_DEBUG=1 in environment, we could drop an override for snapd.service, i had that at some point, but then dropped it and added to /etc/environemnt (because no logs were appearing which i didn't know at the time to be caused by the MaxLevel* in journald) [06:36] mborzecki: yeah, let me quickly push an idea [06:36] ok, runing the nested suite now [06:37] mborzecki: 9349 use your idea but puts it only for the snapd unit [06:37] mvo: this is what i have for the snap command; https://paste.ubuntu.com/p/s4zr5NsKX4/ [06:38] mborzecki: that looks reasonable, a bit less central than I had hoped, we can't put it into nested_start_for_core_vm ? this waits for ssh already? [06:39] PR snapd#9349 opened: tests: change snapd.spread-tests-run-mode-tweaks.sh to add logging [06:39] mvo: hm, need to double check it wouldn't conflict the nested/manual tests [06:43] mborzecki: yeah, probably fine [06:43] mborzecki: mostly wondering [06:44] mvo: just checked and it should be fine, i've updated the patch [06:44] mborzecki: it's also true that muddeling the concept of start and wait-for-ready is a bit strange so keeping seprarate may well be better [06:44] mborzecki: cool [06:44] mborzecki: again, mostly thinking that this is okay since we wait for ssh already [06:45] maybe i should just push all of it into #9343 [06:45] PR #9343: tests: more logging for UC20 kernel test [06:45] i mean ian's path + the tweak for waiting [06:46] s/path/patch/ [06:48] mborzecki: yes [06:48] mborzecki: I think that's good [06:52] mvo: mborzecki: should we sync? [06:53] pedronis: at 9? i'll grab some tea [06:53] yes [06:54] mvo: I'm quite confused by #9349 given Ian's merge from yesterday [06:54] PR #9349: tests: change snapd.spread-tests-run-mode-tweaks.sh to add logging [06:58] NOMATCH is something that was added to spread recently? [06:58] pedronis: maybe I'm confused, let me double check what happend :( [06:59] pedronis: oh, I see :( [06:59] mvo: afaik on master we don't use the function you are changing anymore [07:00] mvo: I asked you to comment on that yesterday evening when Ian asked to merge and you said yes :) [07:00] pedronis: it is fine [07:00] I'm going to the SU [07:00] pedronis: it's just that it's all a bit of a maze and apparently I did not had enough tea yet [07:00] pedronis: joining [07:01] morning [07:04] PR snapd#9349 closed: tests: change snapd.spread-tests-run-mode-tweaks.sh to add logging [07:07] hello [07:07] good morning pstolowski and zyga-kaveri [07:07] how is core20? [07:07] I'm baby-sitting still [07:12] pstolowski: hi [07:12] pstolowski: could you please look at https://pastebin.ubuntu.com/p/QMJ3x8HK4h/ [07:13] is anything there out of place for preseeding? [07:13] ah [07:13] it's not the full list [07:13] I think I know what is going on [07:16] * zyga-kaveri fixes [07:22] zyga-kaveri: seems 3 tasks are missing (your new tasks I presume) [07:23] pstolowski: yeah exactly :D [07:25] man that run was pretty green overall [07:25] one test adjustment, some random failures [07:25] but otherwise it's pretty good [07:25] I was only running core and main tests for ubuntu 16 [08:15] mvo: pedronis: i've updated #9347 [08:15] PR #9347: tests/lib/nested.sh: use more focused cloud-init config for uc20 [08:25] mborzecki: \o/ [08:32] your video froze mvo [08:49] mborzecki: mvo: so it takes around 50-60 to a resealing in the non-accel vms, almost 80 with two run kernels I suppose [08:49] wow [08:49] mborzecki: we are hasing 2 or 3 kernels each time plus other ops plus signing [08:51] mvo: mborzecki: actually not, we are not hashing kernels yet, so it will become even slower [08:51] raspberry pi 1 speed! [08:52] mborzecki: mvo: anyway, we should be able to cut down on the number of reseals, we get so many because current unasserted kernel logic [08:53] speeding them up is a different matter, it would need help from secboot, also there are trade offs [08:59] PR snapd#9350 opened: [RFC] store: handle v2 error when fetching assertions [09:00] pedronis: hi, let me know if this is what you had in mind ^ ; i'll add tests if the idea is ok [09:03] * pstolowski back to snapshots [09:19] PR snapd#9351 opened: tests: simplify repack_snapd_snap_with_deb_content_and_run_mode_first_boot_tweaks [09:32] 2020-09-15 11:22:19 Executing google:ubuntu-20.04-64:tests/main/preseed (sep150911-454471) (1/1)... [09:32] 2020-09-15 11:22:55 Successful tasks: 1 [09:32] :D [09:41] pstolowski: we should find a quiet time [09:41] pstolowski: and demo vscode to the team [09:41] pstolowski: it's such a gamechanger [09:43] okay, onto snap-confine changes [09:43] NOW IS THE TIME :D [09:54] zyga-kaveri: is it vim vs emacs vs vscode now? ;) [09:55] pstolowski: for some people it's always vim vs emacs [09:55] pstolowski: the rest just uses code [10:03] For me, it's emacs AND code. [10:04] I'm code and vim [10:04] but vim much less often [10:04] mainly inertia [10:04] but code is just genuinely better [10:06] code's remote is nice. [10:06] Especially with low memory/disk space VM... [10:06] yeah [10:07] I only wish it had riscv and mips binaries [10:08] Too bad for Perl, it needs at least 5.18 and my platform uses 5.8.9... [10:17] mborzecki: 9347 is at 1:28h already, I wonder if that is good or bad [10:17] mvo: looks like it's slow, i was able to see the logs and minimal smoke hit a 40m timeout [10:17] * mvo nods [10:18] maybe we should have a longer kill timeout for the nested suite [10:18] yeah, might be needed [10:18] let's see if it completes, I really hope it does (successfully) [10:22] mborzecki: and it just failed in 2020-09-15T10:22:06.2176601Z - google-nested:ubuntu-20.04-64:tests/nested/manual/minimal-smoke :( [10:22] mvo: so that's it, hit a timeout on minimal/smoke, but otherwise the rest ran succesfuly [10:23] mborzecki: aha, timeout? ok, that sounds very promising [10:23] mvo: yup, a spread kill timeout [10:23] mborzecki: yeah, just saw it. sounds like we should raise it and re-run, can you do that? [10:23] mborzecki: but that is *very* encouraging, well done mborzecki and ijohnson :) [10:24] mborzecki: I really hope/think this could give us the reliable tests we need [10:27] mvo: hm i'd be leaning towards landing the PR and bumping the timeout in another one, looking at the logs, the test stated at 9:10, and at 9:49 it was executing tests/smoke/remove in the nested vm, so just running slow/late [10:28] wdyt? [10:28] mborzecki: works for me [10:29] mborzecki: done [10:29] thanks! [10:29] mborzecki: ok, let's include the longer timeout as one of the needed next steps (is more logging next?) [10:29] PR snapd#9347 closed: tests/lib/nested.sh: use more focused cloud-init config for uc20 [10:41] mvo: pedronis: #9343 is updated with a bump in kill timeouts and a little tweak for cleaning the serial log before each test (otherwise it would accumulate) [10:41] PR #9343: tests: more logging for UC20 kernel test [10:43] mvo: what's #9348 and how it relates to #9343 ? (cc mborzecki ) [10:43] PR #9348: tests: print all the serial logs for the nested test [10:43] PR #9343: tests: more logging for UC20 kernel test [10:45] pedronis: looks like a followup/tweak we could do once 9343 lands? [10:46] mborzecki: the changes to the cloud init config don't make sense anymore in it? don't we land some other code now? [10:46] heh, didn't we land [10:46] ah, damn missed those [10:47] it needs a master merge? [10:47] ah, it just had one? [10:48] mborzecki: mvo: fwiw I'm working on resealing a bit less even with unasserted kernels [10:49] mborzecki: does it make sense to land 9343 as it is or instead pull the logging changes out and land that separately and then we merge master into the kernel reseal tests? [10:49] (cc pedronis -^) [10:50] we can land the logging first maybe? [10:50] i can cherry pick the relevant commits and open a PR [10:56] mborzecki: would think it's better, reading 9343 is a bit hard [10:56] ok [10:57] hmm landing #9311 would make it smaller still [10:57] PR #9311: nested: add support to telnet to serial port in nested VM [11:01] mborzecki: that looks ok and has a lot of +1 (it's a bit repetitive but so nested.sh) [11:02] mborzecki: I merged master into 9311, once that is green we should merge to master [11:03] mborzecki: also 9311 is not really needed for gce, we can always merge later [11:03] mhm [11:03] but in a meeting so only have 10% of my brain [11:15] PR snapd#9352 opened: test: improve logging in nested tests [11:15] pedronis: mvo: ^^ just the logging [11:23] mborzecki: \o/ [11:23] * mvo is still in a meeting [11:26] I am putting together a custom-image as documented, works pretty good. I would now like to have some configurations changed like "sudo snap set somesnap foo=bar" for my custom-image, what would be the right way to do this? create a snap which is exclusively running those inside the configure hook? like a configuration-snap ? or is there some other prefered way? [11:29] dariball, the typical way is to use a custom gadget and set them in the "defaults:" in snapycraft.yaml [11:30] err. [11:30] s/snapcraft.yaml/gadget.yaml/ [11:31] dariball, https://snapcraft.io/docs/gadget-snap [11:31] therefore I would fork the default gadget snap, right, because I can only have one gadget snap? [11:31] yep [11:32] and I would add arbitrary commands to the prepare-device hook? [11:32] no [11:32] yu would use the gadget.yaml options for connections and for setting defaults [11:32] look at the linked page above [11:33] and running arbitrary commands shall be avoided? [11:34] because let's say I have some cli tool from a snap I would like to run ? [11:35] well, thats not esaily doable without forking the snap and make some daemonized script [11:36] (snaps can not easily call commands from other snaps, else you'd be able to just create a snap full of scripts) [11:37] okok, thx this already helps alot, hope I get along with connections + defaults, looks good... [11:37] if you have a brands stroe you *can* use the snapd REST API from a config or tooling snap though, that allows things beyond the limited stuff [11:37] *brand store [11:38] but use of the interface (snapd-control) that gives you access to the API is brand store bound, snaps in the global store do not get permission to use it [11:40] mvo: mborzecki: I haven't proposed but here are the changes to reseal less with unasserted kernels: https://github.com/pedronis/snappy/commit/39746b163590c1f6be0103362b8fcfeee0f32338 [11:57] ok, modified snap-confine and some small bits, running in complain snap-confine apparmor profile to see what we get [12:05] PR snapd#9185 closed: secboot: use the snapcore/secboot native recovery key type [12:05] funny when spread hangs on discarding a node [12:07] brb [12:10] pedronis: the patch looks reasonable, are you going to propose a branch? [12:12] mborzecki: probably not before we get other things in [12:12] PR snapcraft#3285 closed: v1 plugins: lock godep's dependencies [12:12] PR snapcraft#3287 opened: elf: reduce noise in the developer debug logs [12:12] pedronis: ack [12:24] heh, finally the nested suite is in progress in 9352 [12:29] exported tools work [12:29] I need to adjust apparmor profiles [12:29] but it's functional [12:29] \o/ [12:43] heh, nested tests i ran: !!!! X64 Exception Type - 0E(#PF - Page-Fault) CPU Apic ID - 00000000 !!!! [12:44] that's when rebooting from install -> run [12:44] mborzecki, I saww that yesterday as erll [12:44] well [12:44] ouch [12:45] and random, didn't happen in other tests [12:45] it is very sporadic [12:46] maybe -smp 1 could help mitigate this? [12:47] cmatsuoka, I saw that just running with kvm enabled [12:47] mborzecki, are you running with kvm enalbed right? [12:48] cachio: what version of ovmf are we using in our tests? [12:48] cmatsuoka, need to check, I tried with the latest and the once from proposed [12:51] cachio: no, this was on gcp [12:52] cmatsuoka: and the whichever version of qemu/ovmf we have in focal [12:52] mborzecki, how did you run the test on gcp? [12:52] cachio: spread -debug -v google-nested:ubuntu-20.04-64:tests/nested/... [12:53] cmatsuoka, this is the version we use of ovmf 0~20191122.bd85bf54-2ubuntu3 [12:54] mborzecki, well the tests in core20 suite will run without kvm but tests on manual suite will run with kvm enabled [12:54] mborzecki, don't know which test showed that panic [12:55] cachio: tests/nested/manual/core20-early-config so maybe it's related to kvm [12:55] mborzecki, ahh, I think so [12:58] the groovy ovmf is a bit newer, but many changes in upstream since then [13:00] PR snapd#9311 closed: nested: add support to telnet to serial port in nested VM [13:24] mborzecki: 9352 just passesd [13:24] mvo: just landed it [13:24] mborzecki: \o/ third in a row, fingers crossed [13:24] mvo: i'll cherry pick the spread test from your PR and add it to kernel reseal one [13:24] mborzecki: sounds great [13:25] mborzecki: feel free to close the test PR from me once you chrry-picked to the right place [13:25] PR snapd#9352 closed: test: improve logging in nested tests [13:42] I'm going to break for lunch now [13:42] ttyl [13:44] pedronis: mvo: i've updated https://github.com/snapcore/snapd/pull/9331 [13:44] PR #9331: boot: reseal when changing kernel [13:45] mvo: i've dropped the patch where test waits for snap to become available, we have that covered in nested setup now [13:45] PR snapd#9338 closed: tests: add nested core20 kernel reseal test [13:45] PR snapd#9351 closed: tests: simplify repack_snapd_snap_with_deb_content_and_run_mode_first_boot_tweaks [13:49] mborzecki: reviewed, it needs a 2nd review [13:49] pedronis: thanks! [13:49] mvo: ijohnson: could you take a look at #9331? [13:49] PR #9331: boot: reseal when changing kernel [13:49] sure [13:51] mborzecki: of course [13:52] my own comment on it are addressed in #9340 (which is adding some tests, a doing some refacor/renames) [13:52] PR #9340: boot: streamline bootstate20.go reseal and tests changes === tomwardill_ is now known as tomwardill === bluesabre_ is now known as bluesabre === marosg_ is now known as marosg === beisner_ is now known as beisner === coreycb_ is now known as coreycb === philroche_ is now known as philroche === urluck_ is now known as urluck === ogra_ is now known as Guest47593 [14:10] PR snapd#9321 closed: cmd/snap/model: specify grade in the model command output [14:17] mborzecki: pedronis I had a question on the unit tests in 9331 [14:17] perhaps I have misunderstood something [14:18] there is at least one test that my PR changes to do something quite different [14:18] more in line with the name [14:19] ah, something else, yes I think some of those tests are cheating a bit [14:20] because the mock bootloader has limitations [14:20] ok, is it alright that we aren't testing the same way things are being used in real life? [14:20] well, we just feed all that to a mocked secboot reseal [14:21] that just check that the processing looked alright [14:21] do we have anywhere that unit tests end to end with the actual expected boot chain? or is that just unnecessary since we will have the spread tests [14:21] so I think it's ok but might confuse us later, but it's not a blocker atm [14:21] ok, thanks [14:21] ijohnson: yes, we do have tests that are more realistic [14:21] seal_test.go own tess are more realistic [14:21] ok sounds good [14:22] the comment about checking the env seems legit though [14:22] it might be easier to pick it up in my PR tough [14:24] that's fine with me [14:24] I'm finishing up a reivew of 9340 now btw === xnox1 is now known as xnox [14:51] 16.04 failed with google:ubuntu-16.04-64:tests/main/lxd:snapd_cgroup_neither [14:53] * mvo is off for a couple of minutes to taxi kids around, tg for emergencies [14:55] mborzecki, again for me https://paste.ubuntu.com/p/6F4jypF8Wd/ [14:55] PR snapd#9353 opened: [RFC] configcore: do not error in console-conf.disable for install mode [14:59] hm looks like we still need to do something about the client timeout [15:00] mborzecki: which client timeout? [15:00] doTimeout in client.go [15:00] was doing a couple of kernel reseal runs on gcp and one failed with cannot communicate with the server, client timeout exceeded and so on [15:01] it's set to 50s atm, maybe we should have an env variable to make it longer [15:12] * cachio lunch [15:18] ah [15:18] yeah we should probably bump that timeout, there are reports of it not being long enough on the forum too on like rpi or somethin I think [15:20] mhm [15:22] PR snapcraft#3238 closed: db: introduce generalized datastore [15:29] zyga-kaveri: hey, please have snap-confine PRs still go through us. feel free to ping me and I'll assign someone [15:30] not that you weren't going to do that, just wanted to make sure people know :) [15:30] cc pedronis ^ [15:30] * jdstrand would cc mvo but he's not here atm :) [15:43] hm mvo isn't around anymore? [15:43] yeah he said he had to go taxi his kids around [15:43] ijohnson: ah, thanks for the info [15:44] mborzecki: any pr's I should review right now? [15:44] I reviewed both 9331 and 9340 this morning [15:48] ijohnson: you can try #9341 [15:48] PR #9341: tests: add nested core20 gadget reseal test [15:48] mborzecki: ack [15:51] pedronis: #9331 spread run finished, there's failure in nested tests, tests/nested/core20/basic which expected test-snapd-sh to be there but looking at the log it got 400 over the API, and minimal-smoke which looks like it was just killed (?) but went throught 3 out of 4 smoke tests by that time [15:51] PR #9331: boot: reseal when changing kernel [15:57] mborzecki: minimal smoke failed also on 18 ? [15:57] pedronis: mborzecki I have seen minimal smoke fail on 18 a few times and have an idea why it failed [15:57] I reproduced it yesterday but ran out of time to fully debug [15:58] pedronis: yeah, but there it seems like an ssh thing [15:58] essentially when we create the users and specify them as sudoers by writing the files, somehow those files get lost/ reset to be empty when the vm is shut down [15:58] so I was going to try by doing a sync inside the vm before shutting down, then a sync outside the vm before starting it up again [15:58] funny, becuse the setup was successful, `cannot connect: cannot connect to external:ubuntu-core-18-64: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain` [15:59] ijohnson: wooo, but we call shutdown in the vm, don't we? [15:59] yes [15:59] it's bugs all the way up and down the stack in nested vms [15:59] haha [15:59] apaprently ;) [15:59] but I would not be worried about minimal smoke on 18 if that was the only thing that failed [16:00] anyways, i don't think the failures in nested core20 were related to the kernel reseal test [16:01] and we really should do the repacking somewhere outside of $SPREAD_PATH, `2020-09-15T15:30:24.5255269Z 2020-09-15 15:13:31 Project content is packed for delivery (350.90MB).` [16:01] fun [16:01] oof [16:02] mborzecki: ijohnson: so you are saying I should force merge #9331 and the tweak my follow-up? [16:02] PR #9331: boot: reseal when changing kernel [16:02] pedronis: yeah, i'll add a note about failures to the PR [16:03] pedronis: let me have a look again at the test to make sure I'm not errantly identifying the fialure without looking at the logs [16:03] ha, ok i see why nested/basic failed, it's `error: cannot communicate with server: timeout exceeded while waiting for response` [16:03] so the timeout tweak [16:05] and the test does something weird, there's `nested_exec "sudo snap install test-snapd-sh" || true` with a note that ssh sometimes dies at this point> [16:05] mmm very suspicious [16:05] it should do --no-wait maybe? [16:05] but pedronis mborzecki I looked at the non-uc20 logs for 9331 and I wouldn't be worried about any of those other failures [16:06] mmh, it doesn't have two +1 atm [16:07] sorry my comment wasn't a +1 because I was unsure about the test situation [16:07] pedronis: I +1d it formally [16:07] thx [16:08] mborzecki: I'll thos kernel_status checks to mine [16:08] *I'll add those [16:08] pedronis: cool, thank you [16:10] mborzecki: this one is also adding the spread test, right? [16:11] pedronis: yes, the kernel-reseal test was successful [16:13] mborzecki: ijohnson: merged [16:13] pedronis: thanks! [16:13] \o/ [16:13] mvo will be happy [16:13] now I will to deal with conflicts in mine [16:14] pedronis: yeah just saw it \o/ [16:14] ijohnson: I am! [16:14] mborzecki: \o/ as well [16:14] pedronis: I assume you merge master into 9340 now? [16:14] :-) [16:15] mvo: shall i merge master to #9341 and make it non-draft? [16:15] PR #9341: tests: add nested core20 gadget reseal test [16:15] mborzecki: yeah [16:16] mborzecki: it has this really ugly wart that it makes assumptions about the edition [16:16] PR snapd#9331 closed: boot: reseal when changing kernel [16:16] PR snapd#9343 closed: tests: more logging for UC20 kernel test [16:16] mborzecki: let me quickly look if I could improve that with some python-json [16:17] mborzecki: mvo: in that pr, what does the XXX2 about? how were you trying to update ubuntu-boot? [16:17] ijohnson: just bumping the editon there [16:17] does it still fail like that? the nested test is green on that pr [16:17] maybe that was an old failure? [16:17] it will do nothing if the content is the same though [16:18] pedronis: the content is changed [16:18] whitespace is added to the recovery grub [16:18] ah, ok [16:18] shall i propose this? https://paste.ubuntu.com/p/fKWpKjkGNz/ [16:19] yeah, so I think it will still be a problem, I can do a separate PR that illustrates the problem, I think it might be a bug in our [16:19] in our gadget update code [16:19] mborzecki: lgtm ship it [16:19] mborzecki: seems reasonable, I think we got a similar request recently by a customer [16:20] i can set spread to run the nested suite when opening the PR and we'll see about that basic test [16:20] * mvo nods [16:21] PR snapd#9354 opened: client: bump the default request timeout to 120s [16:21] ijohnson: but to be clear about the XXX2, I think it can wait for later this week :) [16:21] ack [16:22] I +1d it, lgtm with some small suggestions that can be a followup with no-spread tag [16:22] \o/ [16:28] I will work on a cleanup with better python3 json instead of the fugly shell|pipes [16:30] you could use an imcprehensible set of jq expressions too :-) [16:32] pedronis: instead of CloudInitRestrictOptions.DisableCloudInit as an opt name, what do you think about AlwaysDisableAfterLocalDatasourcesRun ? [16:33] that conveys that it onnly disables after it runs, and lets us make clear that it applies to both NoCloud and now None too [16:33] (and potentially others as we see fit later) [16:33] seems ok, do we need the "Always" bit in it? [16:34] mmm maybe not [16:34] DisableAfterLocalDatasourcesRun seems to make sense enough [16:39] ijohnson: added a comment explaining what's happening https://github.com/snapcore/snapd/pull/9341/files#r488806612 [16:39] PR #9341: tests: add nested core20 gadget reseal test [16:39] mborzecki: thanks looking now [16:40] mborzecki: ah thanks that makes sense I forgot that the edition for the different structures can be different [16:40] ijohnson: (cc mvo) that error is pure accident by trying simple s/edition: 1/edition: 2/ [16:40] mborzecki: haha - so the sed is too stupid? [16:40] ijohnson: mborzecki: I merged master and updated #9340 [16:40] PR #9340: boot: streamline bootstate20.go reseal and tests changes [16:40] can I mark 9341 ready for review ? [16:41] pedronis: thanks will try to look before my meeting in 5 minutes [16:41] ijohnson: +1 [16:41] ijohnson: done that now [16:41] mvo: the regular gadget update spread test mangles gadget.yaml with some python code, but we can improve the nested test later [16:42] mborzecki: yeah, that sounds reaonable. as long as its a silly issue on the test side by me I'm happy [16:42] mborzecki: I put this on my todo [16:42] mvo: take a look at tests/core/gadget-update-pc/generate.py [16:44] pedronis: ah it occurs to me now that all of the tests using MockTrustedAssetsBootloader are not using ExtractedRunKernelImageBootloader logic, they are all using EnvRefExtractedKernelBootloader's instead [16:44] not sure if that's an issue or not [16:44] mborzecki: cool, thanks [16:44] because you have snap_kernel in the bl vars, which won't be the case for a real trusted assets bl because it will be an extracted one [16:45] * ijohnson -> meeting [16:45] ijohnson: that is something that comes from the previous PR [16:45] that we just landed [16:45] pedronis: yes I didn't realize it before [16:45] sorry [16:45] I don't think it's an issue [16:45] well [16:45] I don't think it's a large issue [16:45] ijohnson: yeah, mocking the bootloader was a bit pita [16:45] well, bootloader is a zoo that we need to fix [16:45] but it would be nice if we had at least one unit tests that uses both the extracted run kernel image impl and the trusted assets bl [16:46] because we won't have any devices in the wild for a while yet that are env ref kernel and trusted assets [16:46] * ijohnson -> really meeting now [16:46] * mvo needs to taxi around again [16:47] mborzecki: anyway this a problem we get from bootloadertest, no? [16:47] WithTrusted assets has the wrong mix of things? [16:48] all the test do is tab := bootloadertest.Mock("trusted", "").WithTrustedAssets() [16:48] pedronis: yes [16:49] pedronis: we'd have to have different variants of trusted assets bootloader, one for extracted run kernel, and another for env ref [16:49] well, we really need one for now [16:49] but we have kind of the wrong one :) [16:50] anyway this is not something I can solve in my PR [16:50] I can add a TODO though [16:50] we won't have any trusted assets bootloaders in practice that are env ref (like uboot or lk) until we add support for more bootloaders to uc20 [16:50] the only trusted bootloader today will be grub which is always extracted run kernel on uc20 [16:51] yes I think TODO is fine [16:51] yes, what I meant with we need one [16:51] right [16:51] iirc tried to use extracted run kernel one at some point, but it was a bit larger than any of the prs i had at the time [16:52] i can look into that tomorrow morning [16:52] I don't doubt that it was incredibly complicated [16:52] yes, it has a more involved interface [16:52] mock bootloaders is a mess of a zoo on top of the mess of a zoo that is already bootloaders [16:52] and usually we have helpers around it [16:52] mborzecki: it might not be worth fixing it, before we fix the zoo [16:52] that may be true [16:56] mborzecki: ijohnson: added this: https://github.com/snapcore/snapd/pull/9340/commits/e22c82485de00a54aeb8a31bec7092af89778991 [16:57] PR #9340: boot: streamline bootstate20.go reseal and tests changes [16:57] pedronis: thanks [16:58] +1d the pr [17:04] take a look at the changelog: https://bodhi.fedoraproject.org/updates/FEDORA-2020-0d5e544db7 [17:04] /var/lib/snapd/snap/bin is in secure_path in sudo config on fedora now [17:04] zyga: Eighth_Doctor: ^^ [17:13] mborzecki In a call [17:13] but will look soon [17:15] ijohnson, hey, I updated #9326 [17:15] PR #9326: tests: fix for basic20 test running on external backend and rpi [17:15] bodhi seems very slow somehow [17:16] if you could take quick look would be nice [17:16] thanks cachio will look in a bit [17:18] mborzecki: yay! [17:28] mborzecki: ijohnson: I proposed #9355 [17:28] PR #9355: boot: with unasserted kernels reseal if there's a hint modeenv changed [17:31] PR snapd#9355 opened: boot: with unasserted kernels reseal if there's a hint modeenv changed [18:24] mvo: pedronis: i've pinged you in #9354 [18:24] PR #9354: client: bump the default request timeout to 120s [18:24] mmm minimal-smoke on uc18 strikes again in that pr [18:24] I'll file a pr with that sync idea I had, I don't think it can hurt anything [18:25] ijohnson: ok, if not i'll try to investigate that tomorrow [18:25] note that I only reproduced it by running spread in a loop, don't use the -repeat option to spread, because that will just reuse the same vm [18:26] i.e. try something like `for i in $(seq 1 10); do spread --debug google-nested:ubuntu-18.04-64:tests/nested/manual/minimal-smoke; done` [18:26] uhh #9341 has spread jobs still queued [18:26] PR #9341: tests: add nested core20 gadget reseal test [18:27] mborzecki: hmm try canceling the jobs and restarting them maybe? [18:27] i'll let it sit for a while [18:28] actually need to wrap it up, chase the kids to their beds otherwise they are going to have hard time getting up for school tomorrow [18:28] oh man don't ever run `go test -count=100 ./...` unless you want to cry and be scared about our tests [18:28] hahah [18:28] so many tests get broken somehow [18:29] perhaps we should have a spread test which just does this [18:33] why? [18:33] I see that the checks for 9340 got canceled, why? [18:33] I assume lots of tests aren't actually properly cleaning themselves up [18:34] mvo: I just canceled them to restart them [18:34] they were stuck [18:34] thanks! [18:34] restarted now [18:34] ijohnson: I'm not even sure go check supports that tbh [18:35] pedronis: oh really ? huh [18:35] it's a go test option after all [18:35] we have one Test in the go level sense per package [18:35] so not sure if it makes sense, what it does in practice for us [18:36] mmm I guess gocheck doesn't even have a count function [18:36] err s/function/option/ [18:37] it doesn't [18:37] still feels like this should work though [18:37] and to be fair for low numbers of count it does seem to work [18:38] heh [18:38] anyway I fear is a matter where feelings don't count. if we have enough seen flakiness without poking for more [18:38] mvo: I canceled them because I had more to push [18:39] ok [18:39] yes I'm ok if we just want to shove this under the rug until it seems relevant [18:39] mvo: I'm not sure how you want to proceed landing wise [18:39] mvo: do we need to sync? [18:39] pedronis about the order? [18:39] yes, and whether you want to have some control over what we run tests for vs not [18:40] mvo: we have 6 PRs for 2.47: https://github.com/snapcore/snapd/milestone/32 [18:41] plus the one Maciej asked to consider [18:41] pedronis: which one is the one from maciej? [18:41] pedronis: yeah, maybe a sync would be best [18:41] mvo: https://github.com/snapcore/snapd/pull/9354 [18:41] PR #9354: client: bump the default request timeout to 120s [18:42] mvo: we can sync if you want [18:42] pedronis: yeah, let me grab my headset [18:43] pedronis: ready when you are in the standup channel [18:51] PR snapd#9340 closed: boot: streamline bootstate20.go reseal and tests changes [18:54] ijohnson: do you think you could have a look at 9355 ? [18:54] mvo: sure it is in my queue for today, should I look now? [18:55] ijohnson: I had hoped to land it in my day but I'm not sure that is feasible :/ [18:55] mvo: ack I will look at it now then [19:02] PR snapd#9353 closed: configcore: do not error in console-conf.disable for install mode [19:06] ijohnson: thank you! [19:07] PR snapd#9354 closed: client: bump the default request timeout to 120s [19:12] PR snapd#9356 opened: sysconfig,o/devicestate: mv DisableNoCloud to DisableAfterLocalDatasourcesRun [19:12] mvo: pedronis: approved 9355 conditional on my understanding being correct about a question on gadget update codeflow [19:14] * zyga will sort out family and will return to make some more changes to export manager [19:14] ijohnson: I answered, let me know if the answer makes sense [19:14] * ijohnson looks [19:15] pedronis: yep that's basically what I thought so good to go from me [19:47] PR snapd#9357 opened: interfaces/builtin/process-control: add scheduler_setattr to seccomp [19:48] thanks [19:51] mvo: the gadget update failed [19:52] cmatsuoka: oh no, in what way? [19:53] mvo: it tries to compare the volume name and ends with "cannot find device matching structure #0 ("mbr"): unexpected number of matches (0) for /sys/block/*/ubuntu-data-20833fbc-9153-4ccd-b00b-d687b76dfd0a" [19:53] jdstrand lol [19:53] jdstrand funny [19:53] my review comment for that? ^ :) [19:53] yes [19:53] cmatsuoka: woah, that is strange [19:53] mvo: cmatsuoka: that's a more fundamental problem, not really related to reseal [19:54] jdstrand I asked about RR scheduling [19:54] it feels unsafe [19:54] pedronis: yes, exactly [19:54] cmatsuoka: thanks! and yes, not related to reseal :( [19:54] unless process control is very privileged already [19:55] mvo: the full error message is in my SU notes, now checking the gadget update code [19:55] cmatsuoka: thanks, I think worst case is we need to skip this PR for 2.47~rc1 [19:55] zyga: it is quite privileged already [19:56] you can kill or renice an process [19:56] jdstrand the risk here, as you surely know, is that rr scheduling can lock out all processes and hang the system for most practical purposes [19:56] fixed my typo :-) [19:56] ijohnson do you know if the desire is to use RR scheduling or anything else? [19:56] mvo: did we test gadget updates recently? it could be there since we added the uuid to ubuntu-data [19:56] zyga: sure. so can creative use of kill :) [19:56] zyga: all I know is what's on the forum post [19:57] cmatsuoka: we should have gadget asset update spread tests that run on every pr [19:57] jdstrand ish, we can mediate kill, there's no real way around runaway RR process [19:57] ijohnson let me read it [19:57] cmatsuoka: see tests/core/gadget-update-pc [19:57] ijohnson: ah ok, that's good, thanks [19:58] maybe the test is insufficient for what we changed in uc20 though [19:58] zyga: plus, we can't really mediate to the level that we would like since we can't control the pid or the sched_attr struct [19:58] oh, right :( [19:58] zyga: I'm saying, we can mediate kill, and we do, but we grant it in process-control [19:59] I wish we had that CPU pinning interface so that we can ensure at least one core remains for system apps [19:59] that's fine, I just wanted to raise this [19:59] if we get some finer controls, we can move towards breaking this up [19:59] cmatsuoka: I think that mborzecki had a comment about gadget asset updates this morning and there was something for him to work on related to this area [19:59] (specifically around the device mapper nodes,etc.) [20:00] done [20:00] ijohnson: the existing test is unencrypted only, and it fails with the encrypted device name with the uuid I think [20:00] cmatsuoka: though looking at the generate.py, it only handles ubuntu-seed edition updates afaict [20:01] cmatsuoka: right [20:01] yeah so I think this is the bug that mborzecki was referring to this morning [20:02] https://github.com/snapcore/snapd/pull/9341/files has the bug in the comment [20:02] PR #9341: tests: add nested core20 gadget reseal test [20:02] in the test [20:03] cachio: sorry one more comment on 9326 [20:04] mvo: I'm not sure I understand the test in #9341, isn't it changing a file that we ignore [20:04] PR #9341: tests: add nested core20 gadget reseal test [20:04] ok, so maciek is aware [20:05] pedronis: isn't the recovery grub in the boot chain for run mode though? [20:05] pedronis: ahhhh but that's the recovery grub executable not the config [20:05] good catch [20:05] the config should come from snapd [20:06] ijohnson, done [20:06] pedronis: but is the .cfg file included in the resealing ? [20:06] no [20:06] it's not a trusted asset either way [20:07] so I'm not quite sure what that test does [20:07] right so a "realer" test would involve changing grubx64.efi right? [20:07] thanks cachio, +1 [20:07] ijohnson, thanks [20:07] ijohnson: yes [20:07] hmm I wonder if we can just pad 0's on the end or something [20:07] I am going to pay some taxes now [20:07] I'll be back in 30 minutes [20:08] * cachio afk [20:08] ijohnson: mvo: anyway in theory we should drop those .conf files from the gadget [20:08] they should be ignored anyway [20:08] pedronis: uh, good catch indeed. but the test seems to indicate we do reseal? [20:08] mvo: well, we reseal at each reboot [20:08] so not sure [20:08] at least it did indicate that at some point when I was running it [20:08] pedronis: fun :( [20:09] pedronis: ok, then the test is no good [20:09] I mean, it will mean something after my branch lands [20:09] pedronis: let me mark it as blocked and it will need a better test [20:09] mvo: pedronis: I can try modifying the test to just pad 0's on the end of grubx64.efi, not sure if that will boot or not [20:10] ijohnson: \o/ [20:11] ijohnson: it's a pe binary I think, there might be some tools to change something in it and then sign it again [20:11] mmm yeah but padding 0's on the end is easier if it works [20:12] it has section with sizes, so not sure what adding 0 add it end does [20:12] I would suspect not to work but maybe I'm wrong [20:12] maybe it works but it's a noop, not sure [20:13] easy enough to try :) [20:13] alright building image with 0 appended to the grubx64.efi [20:13] let's see what happens [20:15] nope didn't work [20:16] https://www.irccloud.com/pastebin/jeEA8icw/ [20:16] alright let's look at these pe tools then [20:17] ijohnson: you stripped the signature added the zero and signed it again with snakeoil? [20:18] (just making sure= [20:18] ... no? I just added 0's [20:18] it's signed [20:18] I guess I didn't realize I needed to resign grub [20:18] ah [20:18] does shim verify the signature of grub too? [20:18] * cmatsuoka afk for a few minutes to replace a faulty network card [20:18] do I need to modify shim at all? I already resigned shim with snakeoil [20:19] ijohnson: yes [20:19] but if you modify grub, you need to sign with snakeoil too [20:19] I mean strip signature and sign with snakeoil [20:19] the same thing we do with shim [20:19] I guess the question is, do I need to modify shim beyond just strip signature from shim and resign shim with snakeoil [20:19] ack ok I will try that then [20:20] ijohnson: no, that's all you need to do, is the same princicple by which we can then sign kernel with snakeoil [20:20] so far we didn't need to sign grub again [20:20] ok [20:20] but if we change it we have to [20:21] ijohnson: for all I know changing what signs it might enough of a change [20:21] but not sure [20:22] does the tpm resealing use the full hash of the file? or maybe it strips the signature and hashes the actual binary content minus the signature [20:22] ijohnson: the latter, but it happens only on cmatsuoka PR [20:22] before it only matters what are the ca [20:22] certificates [20:23] right [20:23] but as I said it might be that signing grub changes that, we would have to look at tcg log [20:25] alright I'm at least able to boot if I resign grub with snakeoil, I'm just gonna wait to see if install finishes with grub resigned with snakeoil too just to be extra sure [20:25] then I'll try my resigned with snakeoil grub that is padded with a 0 [20:26] added a comment and marked as blocked [20:26] ijohnson: feel free to remove blocked once your changes are there :) [20:26] sure [20:27] ta [20:30] mmm this seems not good [20:31] stateengine.go:150: state ensure error: devicemgr: cannot mark boot successful: cannot compose run mode boot chains: cannot find expected boot asset bootx64.efi in modeenv [20:33] that is weird [20:33] this is with an edge snapd [20:34] but with rebuilt kernel and gadget snaps so maybe I broke something [20:34] let me retry this boot [20:35] ijohnson: mvo: the kernel test has the same problem btw, we are reinstalling the same, we should at least unpack it and add a file to the initramfs or something. anyway doing that is much more typical , we do it already [20:35] ok [20:36] mmm good point [20:36] yeah this is what I expected to be needed eventually, I can also work on preparing that kind of change too [20:36] it's probably more important than the gadget [20:36] I already do something similar in the kernel-failover test to make the initramfs panic [20:36] right now [20:37] yea, that covers our case a bit but given it fails to boot, we don't have e2e reseal story with the failover one [20:37] we just reseal back to the one kernel from before [20:37] mmm I can still reproduce this issue with cannot find expected boot asset in modeenv [20:38] that is weird, is install failing to write the right bit in modeenv [20:38] pedronis: if it's okay I would like to keep trying to understand this failure, because it just started today [20:38] yes [20:39] we are installing and booting in spread though [20:39] right [20:39] so seems something misaligned [20:39] or we have a silly bug that spread doesn't exercise [20:40] now I'm trying just a fresh boot with all edge snaps without rebuilding anything, so all ms key signatures [20:40] pedronis: we should probably have a spread test too which just tries to boot uc20 with all edge snaps, but with snapd from spread [20:41] btw the gadget-reseal test failed on that PR [20:41] weird, stat: cannot stat '/run/mnt/ubuntu-seed/device/fde/ubuntu-data.sealed-key' [20:42] alright I reproduced this install failure with all edge snaps [20:42] so we have a real problem somewhere [20:44] snapd from edge as well? [20:44] pedronis: yes [20:44] pedronis: here's a full log and all the snaps I used to build the image: https://pastebin.ubuntu.com/p/wgMYSfKZjv/ [20:44] * ijohnson needs to step out for a little bit, will return to this in 10ish minutes [20:45] ijohnson: I suspect kernel doesn't have yet the right bits? [20:45] I mean snap-boostrap dropping values when it deals with modeenv ? [20:46] pedronis: ah I bet you are right [20:46] kernel gets the initrd bits from the beta channel so that would be 2.46 until we release 2.47~rc1 [20:46] pedronis: I will verify that theory when I get back [20:46] thanks, I need to drop, will read standup doc in the morning or mattermost/tg [20:46] * mvo waves [21:13] ok back now [21:15] ijohnson how is sealing going? [21:15] We seem to have regressions with the set of snaps on edge [21:16] Presumably because something with resealing is broken with the old snap-bootstrap in the initramfs [21:16] so no 2.47? [21:17] not tonight at least [21:17] unclear if this is a blocker or not [21:17] I mean obviously it's a blocker that edge images can't boot [21:17] err can't seed/isntall [21:18] right [21:18] oh well [21:18] I think everyone working on this is really tired [21:18] so no surprise something doesn't work sometimes [21:19] yeah I think we just missed it due to a lack of testing for this scenario specifically, with edge kernel snap instead of rebuilt kernel snap [21:19] ah, we always rebuild the kernel for compatibility? [21:20] yes in spread tests, we rebuild the initramfs to have the new snap-bootstrap bits [21:20] drat I need to go to the office [21:20] (I'm coding from bed) [21:20] anyway, time to sleep [21:20] pedronis: yes so rebuilding with snap-bootstrap from the edge snapd works to boot an edge uc20 image [21:20] I'll finish this in the morning [21:20] bye zyga! [21:20] good luck guys [21:20] have a good evening [21:20] you too :) [21:21] :-) [21:21] pedronis: I'm looking into what bits from snap-bootstrap might be effected here [21:22] ijohnson: it the modeenv manipulations I suspect, they don't preserve fields they don't know about [21:22] yes that's my suspicions as well [21:22] currently looking at InitramfsRunModeSelectSnapsToMount [21:23] but what I'm confused about is that nothing should have changed to require a rewrite of the modeenv on first boot of run mode [21:30] ijohnson: do we have base set as try for some reason? [21:30] pedronis: I don't think so [21:30] but I'm trying to get a peek at the modeenv as written by install mode before run mode snapd runs [21:31] because I think part of what happens is that we can't run and so snapd uninstalls things [21:39] pedronis: no, the modeenv is not rewritten from the initramfs during first boot [21:40] pedronis: after snap-bootstrap exits, I still see the same modeenv with current_trusted_boot_assets, etc. in it [21:40] so it must be something else [21:40] this very weird [21:40] oh wait that was with the new snapd [21:40] sorry let me re-run that test [21:40] heh === mwhudson_ is now known as mwhudso === mwhudso is now known as mwhudson [21:43] pedronis: ok so after the snap-bootstrap in the kernel snap runs, the modeenv does _not_ have the keys in it [21:43] but try_base is not set [21:43] so it's a bit unclear to me why we would rewrite the modeenv [21:43] but rewriting the modeenv seems to be the issue here I think [21:44] ijohnson: I agree, seems a bug? [21:44] yes probably a bug [21:44] anyway we need to teach modeenv to behave correctly [21:44] that too [21:44] but would still be good to understand the bug [21:44] you might have to look at the code in 2.46 though [21:44] because things have changed since [21:44] ah yes very good point [21:44] actually wait [21:45] no we need to look at 2.45.2 (?) code [21:45] I don't think that ubuntu-core-initramfs was ever rev'd [21:45] oh [21:45] actually I don't even know what version of snapd snap-bootstrap is in the kernel snap right now [21:45] * ijohnson goes to go digging [21:46] so ubuntu-core-initramfs is at version 36, uploaded on may 15th [21:47] so snapd 2.45+20.04 is what's in the initramfs right now [21:50] ijohnson: it was writing to modeenv incondtionally [21:50] pedronis: yeah so we always write the modeenv from the initramfs in 2.45 [21:50] ah yes [21:50] we found the same code :-) [21:50] hmm, I'm not sure what to do [21:51] maybe we need to re-release snapd snap to the previous one, and then get xnox to re-build the initramfs asap and release a new kernel snap that doesn't unconditionally write the modeenv [21:51] pedronis: thoughts ? [21:53] ijohnson: how is the code in 2.46 ? [21:53] pedronis: let me look quickly [21:54] pedronis: 2.46.1 does not always write modeenv [21:54] pedronis: let me quickly check that 2.46.1 snap-bootstrap boots properly [22:00] pedronis: 2.46.1 snap-bootstrap works [22:01] ok, notice that once we switch to v1 for sealed keys we'll have a bigger problem [22:01] pedronis: how so ? [22:01] oh [22:01] I see [22:01] yeah that will be a problem [22:02] the secboot inside old kernel won't know what to do with it [22:02] I think the right thing to do is to get new initramfs re-spun as soon as possible [22:02] regardless of what we do with snapd on edge right now [22:04] anyway we need a fix in master to teach modeenv to keep and write back any fields it doesn't use/know [22:04] pedronis: yes I'm working on that right now [22:20] pedronis: https://github.com/snapcore/snapd/pull/9358 [22:20] PR #9358: boot/modeenv: track unknown keys in Read and put back into modeenv during Write <âš  Critical> [22:21] mmm maybe I should add a DeepEquals test for that too [22:21] * ijohnson goes to write one [22:22] PR snapd#9358 opened: boot/modeenv: track unknown keys in Read and put back into modeenv during Write <âš  Critical> [22:26] ok, done [22:27] pedronis: I'm gonna eod now, but I will be semi-around for reviews or to land things if needed, otherwise I will be working early tomorrow === ahayzen_ is now known as ahayzen [23:06] cmatsuoka: are you still around? [23:06] pedronis: yes [23:06] I just landed my PR, could you merge master into #9277 ? [23:06] PR #9277: secboot: add boot manager profile to pcr protection profile <â›” Blocked> [23:07] pedronis: sure, doing it now [23:07] thx [23:08] PR snapd#9355 closed: boot: with unasserted kernels reseal if there's a hint modeenv changed [23:13] PR snapcraft#3288 opened: cmake v2 plugin: add help for cmake generators