[05:27] morning [05:31] driving the kids to school, bbl [06:04] re [06:05] Hello, I'm trying to install snap on a CentOS, but get "No package snapd available." even after "sudo yum install epel-release"... [06:06] (Package matching epel-release-7-11.noarch already installed.) [06:10] gko: centos 7/8? [06:11] 7 [06:12] If I search "snapd", I get: "snapd-debuginfo.x86_64 : Debug information for package snapd" and "snapd-glib-debuginfo.x86_64 : Debug information for package snapd-glib" [06:12] CentOS Linux release 7.6.1810 (Core) [06:13] gko: seems working here: https://paste.ubuntu.com/p/z8JmK6qJZX/ [06:15] good morning [06:16] gko: is epel actually enabled? maybe only epel-debuginfo is enabled, can you paste the output of `yum repolist`? [06:22] * zyga eats breakfast, will start soon [06:23] PR snapd#9337 closed: boot,many: reseal only when meaningful and necessary [06:26] mvo: morning [06:26] mborzecki: good morning [06:26] hey mvo [06:27] mvo: left you a comment yesterday https://github.com/snapcore/snapd/pull/9341#issuecomment-691692486 but looks like you missed it [06:27] PR #9341: tests: add nested core20 gadget reseal test [06:28] mvo: anyways, i've merged 9337, so maybe you can just merge master and push again [06:28] mvo: fwiw, the nested test passed when i pushed that commit i mentioned to your pr [06:28] good morning zyga [06:29] mborzecki: cool, looking [06:29] mvo: today is different, wife's Mondays got swapped to the afternoon shift [06:31] mborzecki: repo id repo name status [06:31] epel EPEL 9 [06:33] mborzecki: I merged master into 9341 now [06:33] mvo: cool [06:33] mborzecki: thank you! and you said it passed earlier? [06:34] mvo: yes, i think so, it was one of those github action job status emails, let me check if i still have it in the trash [06:35] mborzecki: nice [06:35] mvo: https://paste.ubuntu.com/p/kX8kqQSQ9y/ that's the branch right? [06:35] mborzecki: I think it passed locally but it was a bit of a pain, some strangess like I see in the "snap change 8" that installs the pc gagdet an info about "restarting snapd" but it does restart the system [06:35] ah w8, 4 annotations, pfff missed that [06:36] mvo: saw a column of 0s and was too happy about that ;) [06:36] mborzecki: haha [06:36] mborzecki: no worries [06:36] mborzecki: I think the test itself is good [06:37] mborzecki: but it highlighted some small issues [06:37] gko: it's showing `epel/x86_64 Extra Packages for Enterprise Linux 7 - x86_64 13,446` ? [06:37] mborzecki: anyway, I'm quite happy that things seems to be working :) [06:38] gko: something seems off in your system, it's listing 13k packages here, but only 9 in your setup [06:39] mborzecki: right... no wonder if can't find anything. [06:40] gko: maybe try to `yum reinstall epel-release`? [06:43] PR snapd#9339 closed: boot: make MockUC20Device use a model and MockDevice more realistic [06:53] mborzecki: hm, 9331 has conflicts now :/ [06:53] mvo: yup, resolving them right now [06:53] mborzecki: \o/ [06:56] mborzecki: OK, my fault... there was another repo file also using epel... Thanks! [06:59] mvo: updated [07:05] morning [07:07] good morning pstolowski [07:12] * tobias_ waves general good morning [07:32] pstolowski: hey [07:33] re [07:35] * zyga canceled PT and jumps into code [07:43] mvo: mborzecki: hi, do we need to sync? [07:43] pedronis: hi, yes, in 5? (cc mvo?) 
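For reference, the CentOS 7 troubleshooting above boils down to a few checks; a rough sketch, assuming the stock `epel` repo id (exact package counts and repo file names will differ per system, as in gko's case where a second repo file redefined epel):

    # the enabled epel repo should list ~13k packages, not just a handful
    yum repolist enabled | grep -i epel
    # look for stray .repo files that redefine [epel]
    grep -rl '\[epel\]' /etc/yum.repos.d/
    # refresh the repo definition if it looks broken, then install snapd
    sudo yum reinstall -y epel-release
    sudo yum install -y snapd
    sudo systemctl enable snapd.socket
    sudo systemctl start snapd.socket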
[07:43] pedronis, mborzecki sounds good [07:44] ok [08:01] zyga: in a call right now [08:31] mvo: tweaked that diff a bit more https://paste.ubuntu.com/p/K9Dg4tQgkg/ [08:32] mborzecki: thanks! [08:33] mborzecki: running locally now [08:39] mvo: i'm setting up an image for running nested locally too, are the nested tests using qemu from the repos or some special-flavor one? [08:39] pstolowski: hi, can you setup a slot after your lunch to chat about next topics? [08:40] pedronis: hey, sure [08:40] thx [08:41] mborzecki: I'm just using focal [08:42] mborzecki: I merged 9311 to have more options [08:42] pedronis: sent [08:42] mborzecki: also make sure you have master, 9310 and 9305 are important [08:43] mborzecki: and then I just run spread -debug -v qemu-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal [08:44] mborzecki: for me (with amd) nested kvm also works so things are somewhat fast [08:44] mborzecki: on my intel laptop that appears to be not working but I did not debug this further because the laptop is slower anyway [08:44] mvo: mhm, let me try that [08:45] mborzecki: for intel it's also: $ cat /sys/module/kvm_*/parameters/nested [08:45] Y [08:45] mborzecki: (instead of 1 on amd) [08:45] mborzecki: so the existing detection for support will not trigger on intel (which is a bit of a feature because it did not work for me :) [08:46] mborzecki: anyway, hope that I'm not spaming you too much [08:46] mborzecki: usually I then just telnet into the spread serial port and login and monitor stuff [08:46] mborzecki: like tail /tmp/work-dir/logs/serial.log or journalctl -u nested-vm [08:46] mborzecki: etc [08:47] mhm, the /sys/ bit is set to 1, so that's good [08:47] mborzecki: but I wish we could get a more holisitic view from spread too [08:47] mborzecki: yay [08:47] mborzecki: is that on intel or amd for you? [08:47] waiting for the base image to get updated, thena reboot and i'll try to run it [08:47] mvo: amd [08:49] mborzecki: cool [09:01] mborzecki: something seems to be not quite working with your diff, no sanpd output on the serial port right now :/ and huge delays [09:01] mvo: this is what i see in the logs in the vm the prepare fails: [09:01] Sep 14 08:51:38 ubuntu snapd[721]: devicemgr.go:725: System initialized, cloud-init reported to be done, set datasource_list to [ None ] [09:01] Sep 14 08:51:44 ubuntu snapd[721]: taskrunner.go:271: [change 2 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc" [09:01] Sep 14 08:56:45 ubuntu snapd[721]: taskrunner.go:271: [change 3 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc" [09:02] mborzecki: I think the request serial are red-herrings [09:02] mborzecki: would be great to get more debug output from cloud-init I guess :/ [09:03] mborzecki: i.e. if it actually created the user for us [09:10] mborzecki: ha, we don't set "enable_ssh" in from nested.sh so your tweaks need a slightly different place [09:10] mborzecki: or we just enable ssh, wonder why we don't [09:18] mborzecki: I tweaked "nested.sh" now to run "repack_snapd_snap_with_..." 
to set "enable_ssh" to true, this should also make debugging a lot simpler, lets see how it goes and if that breaks anything [09:19] mvo: hmm ssh seems to be working fine, i can ssh into the vm after prepare fails :/ [09:23] mvo: heh, and the test is ofc executing now [09:23] mborzecki: mborzecki: yes the deliver serial request are red-herrings, we just need to use serial-authority [09:23] we probably should because they just pollute the logs for nothing [09:24] PR snapd#9221 closed: tests: disk space awareness spread test [09:25] mvo: with None I would suspect not [09:25] about creating the suer [09:26] it should say NoCloud if it created the user ? [09:26] mborzecki: oh, can you paste all the info you have why preapre failed? [09:26] anyway it seems the cloud-init info is not seen/passed right if we get None [09:28] pedronis: oh, interessting [09:28] or maybe is another strange cloud-init corner case [09:30] in principle we should teach the code to turn None into disabled [09:31] mvo: https://paste.ubuntu.com/p/MF6sjVhwyr/ timeout checking whether snapd seeded [09:32] maybe we should have that timeout configurable via env too [09:33] mborzecki: and if you ssh into the nested system, what do you see there for "journalctl -u snapd" ? [09:33] mborzecki: i.e. does it actually fail to seed? [09:33] mborzecki: i.e. "nested_exec sudo journalctl -u snapd" ? [09:33] mvo: no, just the snap command hit a timeout, the system seeded ok [09:33] mborzecki: ohhh [09:33] mborzecki: ok, let me push something [09:34] I don't even understand where we build the cloud-init data atm with a bit of grepping [09:35] mborzecki: I pushed a small change that waits for the snap command to become available as suggested by ian [09:35] mborzecki: this should make this part more robust [09:38] mvo: from what i can see, the error comes directly from our client code [09:39] mborzecki: if you log into the system, do you also get a client timeout then or was this a one-off thing? [09:39] mvo: maybe it needs to multiple steps, i.e. the command -v snap loop, then make the timeout somehow configurable via env (SNAP_CLIENT_TIMEOUT?) [09:39] mvo: it's our setup that is weird: datasource_list: [ "None"] [09:39] mborzecki: oh, interessting, do you think it's actually that slow? [09:39] pedronis: oh, nice catch [09:39] in nested.sh [09:42] mvo: in theory it's ok to use [09:43] ok [09:43] hmm i use [NoCloud, None] usually [09:44] yes, more typical, but None should be ok afaict [09:45] what I mean, it should not cause problems [09:45] you should still get a user [09:46] mborzecki: hrm,hrm,something in the nested vm after "Satrting create static device nodes in /dev" is really very slow :( [09:47] do we know when snap-bootstrap run? [09:52] hm, it takes 500s to read the reboot during initial install [10:06] read the reboot? [10:07] zyga: i've requested your re-review of #9270 because of a few more commits after your previous review (a few more cases where --root=.. 
was passed to systemctl) [10:07] PR #9270: wrappers, systemd: allow empty root dir and conditionally do not pass --root to systemctl [10:08] pstolowski: ack [10:10] * mvo is away for a few min to pickup kids [10:23] * zyga grabs some food [10:36] re [10:36] yum [11:30] pstolowski: you didn't add a meet [11:30] pedronis: yes, let's use standup HO [11:41] mvo: have you looked at the snapd snap produced by repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks [12:01] mborzecki: firstboot tweaks will fail because we do things differently, need to hack this a bit further, no real results yet :( [12:01] mvo: i'm trying with repacked core20, injecting the bits directly there [12:05] brb, tea [12:09] the nested suite is dog slow :/ [12:16] mborzecki: yeah, it's all a bit frustrating [12:17] mborzecki: I modified the image caching to use gzip -1 instead of xz locally but it does not make a huge diffrence [12:18] mborzecki: I think part of it is really trying to figure out what part exactly is so slow and if we do something silly somewhere [12:18] mvo: looking at dmesg timestamps in serial logs, install takes ~500s [12:21] * zyga cleans up unit tests [12:21] mborzecki: is that just install? without first seed? [12:22] pedronis: yeah, from first boot, to a reboot [12:23] we need install mode logs [12:23] pedronis: +100 [12:25] hmm, something's off, i've added a drop in override for journald to the core snap, sice it's a base that runs durin install, the logs should be visible on the console [12:25] fwiw, first reboot is at: [ 549.003080] reboot: Restarting system [12:39] PR snapcraft#3284 opened: build providers: rename clean() -> clean_parts() to clarify scope [12:42] mborzecki: I merge my install mode pr and see if that gives me any clues [12:43] mvo: which one is that? [12:44] PR snapd#9342 opened: tests: add more checks to disk space awareness spread test [12:50] mborzecki: 9317 [12:51] mborzecki: running it now so after the standup we hopefully have results :) [12:51] ack [12:55] heh `Sep 14 12:42:14 ubuntu systemd[1]: Startup finished in 36.166s (kernel) + 1min 50.783s (userspace) = 2min 26.949s.` [12:55] this is run mode starting up [13:35] mvo: this is the diff i'm trying right now: https://paste.ubuntu.com/p/hHY6hjWxTV/ [13:35] mborzecki: thanks [13:36] mborzecki: nice, does it work? [13:36] mvo: not quite, idk why i'm not seeing the run system logs [13:36] mborzecki: http://paste.ubuntu.com/p/CTQwsyvZCC/ is my heavily hacked stuff [13:38] mvo: want to try with systemd.journald.forward_to_console=1 in the command line? [13:39] mborzecki: oh, excellent idea [13:39] mborzecki: yeah, I mean, this is obviously just a quick hack to see if I can any extra data :/ [13:40] mvo: higher chance of succeeding in getting more logs then i have have here with repacking [13:40] cmatsuoka: we should try to see what happens combining #9340 and #9277 (for this we probably need to bump secboot version) [13:40] PR #9340: boot: streamline bootstate20.go reseal and tests changes [13:40] PR #9277: secboot: add boot manager profile to pcr protection profile <⛔ Blocked> [13:41] pedronis: ack [13:42] mborzecki: I added it now but will let my current run continue [13:43] mvo: do you cache the kernel/core/snapd snaps somehow locally so that the vms do not have to download them all over again? 
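A sketch of the journald override mborzecki mentions dropping into the repacked core snap above, using the standard journald.conf.d drop-in mechanism; $CORE_UNPACKED stands for the unsquashed core snap tree and is purely illustrative:

    mkdir -p "$CORE_UNPACKED/etc/systemd/journald.conf.d"
    cat > "$CORE_UNPACKED/etc/systemd/journald.conf.d/console.conf" <<'EOF'
    [Journal]
    ForwardToConsole=yes
    TTYPath=/dev/ttyS0
    # journald only forwards up to "info" to the console by default,
    # which matches the missing-output symptom discussed further down
    MaxLevelConsole=debug
    EOF

The systemd.journald.forward_to_console=1 kernel argument mentioned above toggles the same ForwardToConsole switch from the command line.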
[13:45] mborzecki: I don't :( I think having a squid-deb-proxy or debcacher-ng would be helpful [13:58] re [13:59] lucy is fast asleep - back to work [13:59] mvo: heh, and now swtpm socket isn't ready by the time the vm starts [13:59] mborzecki: I have a snap proxy at home [13:59] mborzecki: meh, it's getting worse and worse [13:59] * zyga hugs mvo and mborzecki [14:00] zyga: hm spread runs vms with -net user, so afaiu the vm will not have access to lan [14:00] mborzecki: it does [14:01] -net user drops ping and stuff [14:01] but it works normally [14:01] lan or otherwise [14:01] as long as the private IP is in another subnet from LAN [14:01] (private qemu-given IP) [14:02] cachio: hey, i just hit 2020-09-14 13:24:53 Cannot allocate google-nested:ubuntu-18.04-64: cannot find any Google image matching "ubuntu-1804-64-virt-enabled" on project "computeengine" or "ubuntu-os-cloud" [14:03] pstolowski, 1 sec [14:03] it is the gce issue [14:04] cachio: should i just restart the tests or is it more permanent? [14:05] pstolowski, try now please [14:10] mborzecki: this is all very frustrating, even when adding systemd.journald.forward_to_console=yes it does not even work [14:10] mborzecki: I mean, I still don't see anything in the serial log [14:11] mvo: i think that console conf hijacks the console [14:11] mborzecki: and systemd.debug also does not work [14:11] (or the serial as such) [14:11] mborzecki: I can try this [14:11] the last line i see is a prompt from console conf [14:16] mvo: fwiw, might be an accident, but with haveged install time is now ~360s rather than ~500 [14:18] equally well might be some nework stuff causing that [14:28] Issue core20#80 closed: networking does not persist in a reboot loop on arm64 pi4 [14:46] zyga: do you have a sec for https://github.com/snapcore/snapd/pull/9342 ? [14:46] PR #9342: tests: add more checks to disk space awareness spread test [14:46] pstolowski: sure [14:48] done [14:48] zyga: ty! [14:50] PR snapd#9342 closed: tests: add more checks to disk space awareness spread test [15:03] mvo: hm maybe this https://paste.ubuntu.com/p/29Xz6JcJrg/ [15:04] hopefully ttyS1 is not hijacked and we can still get logs out of it [15:06] mborzecki: mux=on? should I also add this? [15:07] mborzecki: nice, let's hope this gives output [15:09] mvo: if that doesn't work, we can always have a service taht does `journalctl -f > /dev/someserial`, unfortunately none of the systemd-journal-gateway* things are in the core snap [15:12] mborzecki: mvo: do we need console-conf in these tests? can't we turn it off? or turn it off in most? or turn it off while debugging? [15:14] pedronis: yeah, I think we can turn them off [15:14] pedronis: well, so … having console-conf means there is a way to login [15:14] pedronis: so it's not entirely without merits but if we provide an alternative login then it's not needed [15:25] pedronis: while trying to debug I can say that sealing to commandline works, I tried to change it and got a recovery prompt [15:33] if we fail to unlock in the initramfs as part of a kernel snap update, will we reboot automatically and trigger rollback to the previous one? 
I don't think so, but perhaps it would be smart to teach the initramfs to do this for at least specifically the kernel snap, we could detect we are trying a kernel snap update before unlocking the encrypted partition just by looking at kernel_status and bootloader vars [15:33] mvo: not much of an improvement unfortunately, idk why journald just stops logging to serial console at some point [15:33] mborzecki: yeah, I'm also a bit stuck here, trying out more things but it's very frustrating [15:34] ijohnson: it sounds reasonable but probably to be done after we have landed the current bits [15:34] pedronis: ack I will make a small todo for myself to look into that [15:34] (for later on) [15:35] mvo: i need to taxi the kids around in 20 minutes, i'll open a branch with the patches i have [15:39] mborzecki: cool, I keep exploring this [15:40] pff, something new `Connection timed out during banner exchange` when trying to ssh into a nested vm [15:40] PR snapd#9343 opened: tests: more logging for UC20 kernel test [15:45] mvo: need to go out now, check this commit: https://github.com/snapcore/snapd/pull/9343/commits/648801163d3d09f3db18dd71bec3d79690eda3b1 [15:45] PR #9343: tests: more logging for UC20 kernel test [15:45] mborzecki: nice [15:45] mborzecki: does it work :) ? [15:46] mborzecki: I keep poking at this [15:46] mvo: idk yet, just added, check back in a bit [15:46] mborzecki: \o/ cool [15:46] mborzecki: thanks in any case [15:46] mvo: but that may be it, the dfault is 'debug' for journal/syslog, but only info for console [15:47] ok, got to go, bbl [15:58] * cachio lunch [16:25] cachio: I've seen the minimal-smoke test fail multiple times now due to issues with running spread inside the external system like this [16:25] https://pastebin.ubuntu.com/p/RPv9qTxMcY/ [16:25] and [16:26] err wait sorry need to find the other pastebin [16:28] here it is: https://pastebin.ubuntu.com/p/bmHw78Qfp4/ [16:29] any ideas on what might be wrong? it doesn't seem to be the case that there is no user on the VM we are using as an external system, just that we can't use sudo or somehow can't login the way that spread is trying to do [16:30] the first failure was from my pr https://github.com/snapcore/snapd/pull/9332, while the second failure was from https://github.com/snapcore/snapd/pull/9311 [16:30] PR #9332: spread.yaml, tests/nested: misc changes [16:30] PR #9311: nested: add support to telnet to serial port in nested VM [16:40] PR snapd#9344 opened: tests/lib/nested.sh: wait for the tpm socket to exist [16:41] simple nested test robustness PR ^ [16:44] mvo: back for a bit, the logs are visible now! [16:48] ijohnson, I'll take a look [16:49] thanks cachio [16:49] I'm running a spread run now of my pr to see if I can reproduce that issue [17:05] mborzecki: yay [17:08] mborzecki: running your PR locally now while waiting for spread to catchup [17:10] brb [17:12] mvo: this is what i got in the last local run: https://paste.ubuntu.com/p/pfZsCCyBsS/ [17:15] mborzecki: nice! good debug logs! is it still running or did creating the user fail? 
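The tpm-socket race that snapd#9344 addresses (going by its title) amounts to waiting for the swtpm unix socket before launching qemu; a sketch only, where the socket path and retry budget are guesses rather than the actual test code:

    SWTPM_SOCK=/var/snap/swtpm-mvo/current/swtpm-sock
    retries=60
    until [ -S "$SWTPM_SOCK" ] || [ "$retries" -le 0 ]; do
        sleep 1
        retries=$((retries - 1))
    done
    [ -S "$SWTPM_SOCK" ] || { echo "swtpm socket never appeared" >&2; exit 1; }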
[17:15] mvo: it's `Failed password for user1 from 10.0.2.2 port 56336 ssh2` until the very end, either the user was not created yet (cloud-init does that right?), or something else went wrong [17:15] mborzecki: ok [17:16] mborzecki: so it failed, but at least we have more logs now :) [17:16] mborzecki: hm, at least this "[ 161.918339] useradd[760]: new group: name=user1, GID=1000" is visible [17:16] mvo: yeah, at least we see it's resealing [17:17] mvo: maybe something wrong with ssh config then? [17:19] mborzecki: yeah, I wonder if the "preauth" part of the failure gives a clue already [17:22] mvo: do you remember where PasswordAuthentication in sshd_config gets enabled? [17:24] ijohnson: could you have a look at #9185 again when you have time? [17:24] PR #9185: secboot: use the snapcore/secboot native recovery key type [17:30] ijohnson, worked for me [17:31] I'll retry it in 3 machines [17:33] mvo: duh, i have no clue why ssh may be failing, trying one more time [17:34] re [17:34] ijohnson: could you run the tests/nested/core20/kernel-reseal test from #9343 locally? [17:34] PR #9343: tests: more logging for UC20 kernel test [17:34] only one function left to test [17:34] I made a lot of tea [17:38] mborzeck1: sure [17:39] ijohnson: in my runs, the test fails to login over ssh, i suspect PasswordAuthentication may have not been enabled [17:39] cachio: yeah worked for me too, I will try in a loop to see if we can reproduce [17:39] mborzeck1: hmm interesting [17:39] cmatsuoka: yes I will try to do a review tonight [17:40] thanks [17:40] ijohnson: otoh, i'm not sure how it gets enabled in those tests, we don't seem to be calling repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks in a way that would enable that, so maybe cloud-init does it implicitly when it adds a user? [17:41] mborzeck1: right it should be using whatever is the default [17:41] mborzeck1: if cloud-init ran it should have created the user with the pw [17:42] mborzeck1: is that your full log of a failed run @ https://paste.ubuntu.com/p/pfZsCCyBsS/ ? [17:42] ijohnson: yup, that's the full log i got [17:42] k [17:47] mborzeck1: it should be turned on by: ssh_pwauth: True in the cloud-init config [17:48] woot [17:48] done [17:50] pedronis: #9340 + #9277 + #9185 + some conflict fixing seem to work [17:50] PR #9340: boot: streamline bootstate20.go reseal and tests changes [17:50] PR #9277: secboot: add boot manager profile to pcr protection profile <⛔ Blocked> [17:50] PR #9185: secboot: use the snapcore/secboot native recovery key type [17:51] cmatsuoka: thx [17:51] pedronis: and snapcore/secboot update + canonical/go-tpm2 update too [18:04] mborzeck1: mvo: could we add printing of ssh config and extrausers content somewhere that runs/at end of cloud-init?
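One way the extra logging mborzeck1 asks for here could be expressed is in the cloud-init user-data itself; a sketch only (the real nested config is generated in tests/lib/nested.sh and may look quite different):

    cat > user-data <<'EOF'
    #cloud-config
    ssh_pwauth: true
    runcmd:
      # dump the effective sshd setting and the extrausers db into the cloud-init log
      - [ sh, -c, '/usr/sbin/sshd -T | grep -i passwordauthentication' ]
      - [ sh, -c, 'cat /var/lib/extrausers/passwd || true' ]
    EOF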
[18:07] pedronis: good idea [18:07] * zyga spawns more spread tests, pushes the branch and EODs [18:09] pedronis: hm, the annoying part is that it appears that the exiting "echo test" is not output in the log that maciej posted :( [18:14] pedronis: ijohnson: so it looks like it's actually taking that long to reach the point where cloud init makes the system accessible https://paste.ubuntu.com/p/6sPDmDhFzX/ [18:15] mvo: I suspect cloud-init sends it output somewhere else [18:15] mborzeck1: ah you know what I remember this problem [18:15] funny there's this: [18:15] [ 154.076623] passwd[766]: password for 'user1' changed by 'root' [18:16] but it seems the change is picked up by pam way later: [18:16] [ 396.314719] chpasswd[1626]: pam_extrausers(chpasswd:chauthtok): password changed for user1 [18:16] cloud-init creates the user from config but for whatever reason the user isn't accessible until after everything is done running, I've seen this happen in other nested suites where it fails to prepare, then drops me to a shell and 20 minutes later when I see the failure I can login just fine [18:16] mborzeck1: yes that makes sense [18:17] I think we just need to bump the timeout for now and try to optimize things later on [18:17] hah, interesting [18:17] [ 404.228972] cloud-init[1623]: Cloud-init v. 20.2-45-g5f7825e2-0ubuntu1~20.04.1 running 'modules:config' at Mon, 14 Sep 2020 18:06:32 +0000. Up 390.55 seconds. [18:17] [ 405.991338] systemd[1]: Finished Apply the settings specified in cloud-config. [18:17] it would be interesting to know if gce's tpm implementation works to the point where we could use it now for many of these tests and just use nested tests for things that really actually need nested vm's [18:18] wonder whether `Up 390.55 seconds` means that it was running for that long [18:19] I'm in the vm now [18:19] systemd take 44% cpu [18:19] and console-conf was also pretty cpu heavy [18:19] and now I got kicked out :( [18:19] ijohnson: sounds like something needs reloading to notice the new user? [18:20] pedronis: I dunno, it might be a "feature" of cloud-init that your user isn't "login-able" until cloud-init thinks the system is "ready" [18:20] mborzeck1: do we run a script inside the vms? where is it defined? [18:20] pedronis: what do you mean by a script ? [18:21] for cloud-init ? [18:21] pedronis: https://github.com/snapcore/snapd/pull/9343 repacks the core, so we can inject pretty much anything now [18:21] PR #9343: tests: more logging for UC20 kernel test [18:21] mborzeck1: no, I'm thinking a bit of code we run everywhere defined in prepare stuff? [18:23] pedronis: you mean in the nested vm? [18:24] mborzeck1: no, I mean repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks [18:26] pedronis: that extra shell bit in the helper is not used when nested runs (i presume we expect cloud-init to set up ssh) [18:26] and accounts [18:26] well, just one account really [18:26] mborzeck1: my question is whether we run it or not, my fear is that things interfere with each other [18:26] that's why I ask [18:28] pedronis: no afaict, the bit is not added, ENABLE_SSH is false when the helper is called from nested prepare [18:28] I see, ok [18:30] PR snapcraft#3285 opened: v1 plugins: lock godep's dependencies [18:30] * zyga opened export manager PR and EODs [18:30] ttyl! 
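When chasing where those ~390 seconds go, cloud-init's own CLI is handy; both subcommands below exist in the 20.2 release shown in the logs and are run inside the nested VM:

    cloud-init status --long     # which stage is running / whether it is done
    cloud-init analyze blame     # per-module timings, e.g. to confirm the password module runs late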
[18:31] PR snapd#9345 opened: overlord: introduce the export manager, export snapd tools [18:39] tweaked the cloud config we use a bit [18:39] mborzeck1: \o/ [18:46] pedronis: fwiw I have an open pr that changes repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks to instead call repack_snapd_deb_to_snap from snaps.sh to reduce confusion about that [18:47] pedronis: actually that pr is now green with 2 +1's and I'd like to merge it to master, do you have objections ? [18:47] #9332 [18:47] PR #9332: spread.yaml, tests/nested: misc changes [19:00] ijohnson: I don't have objections, but mvo and mborzeck1 might [19:01] ijohnson: also why was it like this? we have two functions of which the second behaves like the first if a param is false? [19:01] pedronis: ack, well mvo already +1d the PR, so unless mborzeck1 has an opinion I will merge it [19:01] ijohnson: works for me [19:01] wfm too [19:01] pedronis: cachio added the param to the longer function before I added/created the simpler one [19:01] ah [19:01] pedronis: I created the simpler one for uc18 nested cloud-init tests specifically [19:02] pedronis: but actually the same function works for uc20 nested tests too [19:02] mborzeck1: cool I'll merge that one now [19:03] ijohnson: tried using [NoCloud,None] in datasources list for cloud init, suspecting it may be getting confused or somesuch, but i don't see any change [19:03] mborzeck1: we need to find a way to get the ssh config and the extrausers file into the logs [19:04] ijohnson: but now that parameter could be dropped from the 2nd, no? [19:04] pedronis: yes it could [19:04] pedronis: I can do that [19:05] it would make sense to me, but maybe cachio has reasons, but even then I would then write a wrapper that picks one or the other in this new world [19:06] mborzeck1: with 648801163d3d09f3db18dd71bec3d79690eda3b1 from your pr, I got a successful run on google-nested [19:06] mborzeck1: I ran `spread --debug google-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal` [19:06] PR snapd#9332 closed: spread.yaml, tests/nested: misc changes [19:07] pedronis: I really don't think it's necessary anymore and it reads quite confusing to see "false" as an argument to an already really long bash function :-/ [19:07] but yes I will get cachio to approve the pr when ready [19:07] mborzeck1: so was the issue that you can't run it locally with qemu-nested ? [19:11] ijohnson: thanks for trying, it did work for me a couple for runs, but it's not really consistent [19:11] mborzeck1: ok, I will run more tries to see if I can reproduce the failure, to be clear though, you're running in gce with google-nested or locally with qemu-nested ? [19:12] mvo: ijohnson: pedronis: pushed a bit more changes to #9343, bumped the timeout and somesuch [19:12] PR #9343: tests: more logging for UC20 kernel test [19:12] ijohnson: google-nested [19:12] k [19:12] mborzeck1: looking [19:13] need to wrap it up, tuck the kids to bed [19:13] mborzeck1: thanks! 
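The wrapper pedronis suggests could look roughly like this; the two inner helpers are the real ones discussed, but the wrapper name, the env toggle, and the argument handling are purely hypothetical:

    repack_snapd_for_nested() {
        if [ "${NESTED_FIRSTBOOT_TWEAKS:-false}" = "true" ]; then
            repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks "$@"
        else
            repack_snapd_deb_to_snap "$@"
        fi
    }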
[19:14] mborzeck1: good night [19:14] o/ [19:16] PR snapd#9346 opened: interfaces: builtin: add iotedge interface to support Azure iotedge [19:41] PR snapd#9344 closed: tests/lib/nested.sh: wait for the tpm socket to exist [20:30] ijohnson: I think I'm understanding something of the nested issues [20:31] understanding is good [20:31] I have now run 3 times the kernel reseal test on gce and it hasn't failed for me fwiw [20:31] ijohnson: basically because we use chpasswd (and not other keys), that happens in a later cloud-init phase [20:32] if you look at the logs of a successful run [20:32] ah [20:32] [ 125.319739] passwd[783]: password for 'user1' changed by 'root' [20:32] yeah that matches basically what I expected [20:32] but things works only after I see [20:33] I use this cloud-init config locally which doesn't use the chpasswd module: [20:33] [ 346.008722] chpasswd[1668]: pam_extrausers(chpasswd:chauthtok): password changed for user1 [20:33] https://pastebin.ubuntu.com/p/ZNGxtgHGTr/ [20:34] that's at 5 minutes in [20:34] and counting [20:34] the issue afaict is that part of cloud-init runs only after we are seeded [20:34] and seeding is slow [20:34] pedronis: I wonder if cloud-init creates the user and the passwd message we see is just because the user was created and the password was set to "" [20:34] yes, something like [20:34] that [20:35] but we should try something like the minimal cloud-init I just pasted instead, perhaps that would run faster as that's just using the users module/key thing [20:35] * ijohnson afk for 5ish minutes [21:28] mmm now I reproduced a failure on the kernel-reseal nested test [21:33] with the increased timeout? [21:34] yes [21:34] it looks like the vm is hard-locked up [21:34] qemu is still running [21:34] but many messages like this in the serial log [21:34] [ 649.475297] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1] [21:34] [ 649.587280] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sshd:2191] [21:34] fun, not [21:34] [ 684.935379] rcu: INFO: rcu_sched self-detected stall on CPU [21:34] [ 684.935379] rcu: 1-...!: (14946 ticks this GP) idle=ec6/1/0x4000000000000002 softirq=73107/73107 fqs=295 [21:34] [ 684.935379] rcu: rcu_sched kthread starved for 14407 jiffies! g142565 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 [21:34] [ 684.935379] rcu: RCU grace-period kthread stack dump: [21:35] ijohnson: hummm... we've seen this before I think [21:35] yeah I dunno what's the situation, but I don't see that chpasswd message so it seems that it took too long to get to that point [21:36] and it seems that kvm was shut off properly [21:36] this is the qemu cmdline [21:36] https://www.irccloud.com/pastebin/Jyoyw8RK/ [21:37] PR snapd#9346 closed: interfaces: builtin: add iotedge interface to support Azure iotedge <⛔ Blocked> [21:37] ijohnson: I think it was when cachio's VMs were rebooting randomly and we considered it could be the watchdog rebooting the system [21:37] yeah but this is also on GCE, which had that problem [21:37] but I thought that turning off kvm fixed that problem [21:37] well it fixed the problem of randomly rebooting [21:37] now maybe the problem is just that we get hung and don't get rebooted whereas before we would at least get rebooted [21:38] ijohnson: is this happening randomly? [21:38] or did you find a way to make it reproducible? [21:39] ijohnson, hey [21:39] 2 cpus? 
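Going back to the minimal cloud-init config ijohnson mentions above: the general idea (not the contents of his pastebin) is to create the login user via the `users:` key, which runs in cloud-init's early init stage, instead of relying on the chpasswd/set-passwords module that only runs in the later config stage. A generic sketch, with a placeholder hash rather than a real credential:

    cat > user-data <<'EOF'
    #cloud-config
    ssh_pwauth: true
    users:
      - name: user1
        lock_passwd: false
        # sha512-crypt hash placeholder, generate one with e.g.: openssl passwd -6
        passwd: "$6$illustrative$..."
        sudo: "ALL=(ALL) NOPASSWD:ALL"
        shell: /bin/bash
    EOF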
[21:39] hmm, -smp 2 [21:39] cmatsuoka: it is random, I ran the same branch successfully twice before it failed [21:39] I'll try to reproduce it [21:39] cachio: should we not be using 2 cpus ? [21:40] this is for nested/core20 suite [21:40] it shouldn't be a problem [21:41] just when you have kvm enabled it could be a problem [21:41] but it is not the case [21:41] cachio: kvm is disabled in this case [21:42] cachio: this is from running mborzecki's branch on PR https://github.com/snapcore/snapd/pull/9343 [21:42] PR #9343: tests: more logging for UC20 kernel test [21:43] ijohnson, ah, nice, I'll use it [21:43] thanks [21:44] cachio: also I reproduced the issue I mentioned earlier around lunch time [21:44] I don't know a root cause yet, but somehow something deleted our sudoers entries [21:44] I ran many times and all of them passed [21:45] I saw that many times until I added sync before stopping the vm [21:45] https://www.irccloud.com/pastebin/sPEO3dQ2/ [21:46] yeah maybe sync just makes it less likely? [21:46] what's weird is that I can see in the spread prepare output it created the users and they executed sudo commands successfully [21:46] yes [21:46] I wonder if we need to run sync inside the VM as well as outside the VM too [21:47] ijohnson, perhaps [21:47] I can try that [21:47] I am preparing a new pr with some improvements [21:47] cachio: thanks I'm gonna move on and debug some other things [21:50] maybe the slow system under stress is confusing rcu? [21:52] https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt [21:54] ah we really have cpu soft lockups [21:54] ugh [21:56] looking at the logs it seems resealing itself is very slow (but we would need to add dedicated logging to be sure) [22:06] cmatsuoka: this is the full log I have from that VM before I gave up and killed it https://pastebin.ubuntu.com/p/2w3XvXBh4B/ [22:06] * cmatsuoka verifies... [22:18] ijohnson: is this the kernel-reseal test? any special command line to run it? [22:20] * cmatsuoka running the test... [22:22] cmatsuoka: yes that's the test [22:22] no, just whatever's in that branch right now is what I ran [22:23] gce seems especially slow today [22:23] cachio: could you please look at pr 9347 quickly tonight? it is small and I would like to try and merge that tonight if possible so it's ready for folks tomorrow morning [22:23] PR #9347: tests/lib/nested.sh: use more focused cloud-init config for uc20 [22:26] cachio: any known problem with arch tests and cgroups recently? [22:27] PR snapd#9347 opened: tests/lib/nested.sh: use more focused cloud-init config for uc20 [22:37] dinner, then will check test results [22:47] ijohnson, sure [22:47] thanks [22:47] cmatsuoka, didn't see errors, do you have a log [22:47] ? [22:58] ijohnson, +1, now let's wait for test results [22:59] cachio: thanks but very confusing that the nested runs are all done because it says they were cached? [22:59] This branch has never been run before it was opened less than an hour ago [22:59] I have to EOD now but maybe someone should look into that [23:02] ijohnson, I added the run nested tag [23:02] could you please re-push on that branch [23:02] so we force the nested tests to run at least once for that PR? [23:02] Oh I forgot to add the tag [23:03] Sorry let me close and reopen [23:03] ijohnson, tx [23:06] cachio: it worked now, it was just a random failure I guess [23:06] cachio: thanks [23:10] cmatsuoka, ok [23:10] cmatsuoka, please send me a log in case you see it again
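The double-sync idea discussed above is small enough to sketch inline; nested_exec is the existing helper seen earlier in the log, and its exact invocation here is illustrative:

    nested_exec sync    # flush the guest's filesystems first
    sync                # then flush on the host before stopping the VM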