[05:27] morning [05:31] driving the kids to school, bbl [06:04] re [06:05] Hello, I'm trying to install snap on a CentOS, but get "No package snapd available." even after "sudo yum install epel-release"... [06:06] (Package matching epel-release-7-11.noarch already installed.) [06:10] gko: centos 7/8? [06:11] 7 [06:12] If I search "snapd", I get: "snapd-debuginfo.x86_64 : Debug information for package snapd" and "snapd-glib-debuginfo.x86_64 : Debug information for package snapd-glib" [06:12] CentOS Linux release 7.6.1810 (Core) [06:13] gko: seems working here: https://paste.ubuntu.com/p/z8JmK6qJZX/ [06:15] good morning [06:16] gko: is epel actually enabled? maybe only epel-debuginfo is enabled, can you paste the output of `yum repolist`? [06:22] * zyga eats breakfast, will start soon [06:23] PR snapd#9337 closed: boot,many: reseal only when meaningful and necessary [06:26] mvo: morning [06:26] mborzecki: good morning [06:26] hey mvo [06:27] mvo: left you a comment yesterday https://github.com/snapcore/snapd/pull/9341#issuecomment-691692486 but looks like you missed it [06:27] PR #9341: tests: add nested core20 gadget reseal test [06:28] mvo: anyways, i've merged 9337, so maybe you can just merge master and push again [06:28] mvo: fwiw, the nested test passed when i pushed that commit i mentioned to your pr [06:28] good morning zyga [06:29] mborzecki: cool, looking [06:29] mvo: today is different, wife's Mondays got swapped to the afternoon shift [06:31] mborzecki: repo id repo name status [06:31] epel EPEL 9 [06:33] mborzecki: I merged master into 9341 now [06:33] mvo: cool [06:33] mborzecki: thank you! and you said it passed earlier? [06:34] mvo: yes, i think so, it was one of those github action job status emails, let me check if i still have it in the trash [06:35] mborzecki: nice [06:35] mvo: https://paste.ubuntu.com/p/kX8kqQSQ9y/ that's the branch right? [06:35] mborzecki: I think it passed locally but it was a bit of a pain, some strangess like I see in the "snap change 8" that installs the pc gagdet an info about "restarting snapd" but it does restart the system [06:35] ah w8, 4 annotations, pfff missed that [06:36] mvo: saw a column of 0s and was too happy about that ;) [06:36] mborzecki: haha [06:36] mborzecki: no worries [06:36] mborzecki: I think the test itself is good [06:37] mborzecki: but it highlighted some small issues [06:37] gko: it's showing `epel/x86_64 Extra Packages for Enterprise Linux 7 - x86_64 13,446` ? [06:37] mborzecki: anyway, I'm quite happy that things seems to be working :) [06:38] gko: something seems off in your system, it's listing 13k packages here, but only 9 in your setup [06:39] mborzecki: right... no wonder if can't find anything. [06:40] gko: maybe try to `yum reinstall epel-release`? [06:43] PR snapd#9339 closed: boot: make MockUC20Device use a model and MockDevice more realistic [06:53] mborzecki: hm, 9331 has conflicts now :/ [06:53] mvo: yup, resolving them right now [06:53] mborzecki: \o/ [06:56] mborzecki: OK, my fault... there was another repo file also using epel... Thanks! [06:59] mvo: updated [07:05] morning [07:07] good morning pstolowski [07:12] * tobias_ waves general good morning [07:32] pstolowski: hey [07:33] re [07:35] * zyga canceled PT and jumps into code [07:43] mvo: mborzecki: hi, do we need to sync? [07:43] pedronis: hi, yes, in 5? (cc mvo?) 
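For reference, the CentOS 7 troubleshooting above boils down to a few checks; a rough sketch, assuming the stock `epel` repo id (exact package counts and repo file names will differ per system, as in gko's case where a second repo file redefined epel):

    # the enabled epel repo should list ~13k packages, not just a handful
    yum repolist enabled | grep -i epel
    # look for stray .repo files that redefine [epel]
    grep -rl '\[epel\]' /etc/yum.repos.d/
    # refresh the repo definition if it looks broken, then install snapd
    sudo yum reinstall -y epel-release
    sudo yum install -y snapd
    sudo systemctl enable snapd.socket
    sudo systemctl start snapd.socket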
[07:43] pedronis, mborzecki sounds good [07:44] ok [08:01] zyga: in a call right now [08:31] mvo: tweaked that diff a bit more https://paste.ubuntu.com/p/K9Dg4tQgkg/ [08:32] mborzecki: thanks! [08:33] mborzecki: running locally now [08:39] mvo: i'm setting up an image for running nested locally too, are the nested tests using qemu from the repos or some special-flavor one? [08:39] pstolowski: hi, can you setup a slot after your lunch to chat about next topics? [08:40] pedronis: hey, sure [08:40] thx [08:41] mborzecki: I'm just using focal [08:42] mborzecki: I merged 9311 to have more options [08:42] pedronis: sent [08:42] mborzecki: also make sure you have master, 9310 and 9305 are important [08:43] mborzecki: and then I just run spread -debug -v qemu-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal [08:44] mborzecki: for me (with amd) nested kvm also works so things are somewhat fast [08:44] mborzecki: on my intel laptop that appears to be not working but I did not debug this further because the laptop is slower anyway [08:44] mvo: mhm, let me try that [08:45] mborzecki: for intel it's also: $ cat /sys/module/kvm_*/parameters/nested [08:45] Y [08:45] mborzecki: (instead of 1 on amd) [08:45] mborzecki: so the existing detection for support will not trigger on intel (which is a bit of a feature because it did not work for me :) [08:46] mborzecki: anyway, hope that I'm not spaming you too much [08:46] mborzecki: usually I then just telnet into the spread serial port and login and monitor stuff [08:46] mborzecki: like tail /tmp/work-dir/logs/serial.log or journalctl -u nested-vm [08:46] mborzecki: etc [08:47] mhm, the /sys/ bit is set to 1, so that's good [08:47] mborzecki: but I wish we could get a more holisitic view from spread too [08:47] mborzecki: yay [08:47] mborzecki: is that on intel or amd for you? [08:47] waiting for the base image to get updated, thena reboot and i'll try to run it [08:47] mvo: amd [08:49] mborzecki: cool [09:01] mborzecki: something seems to be not quite working with your diff, no sanpd output on the serial port right now :/ and huge delays [09:01] mvo: this is what i see in the logs in the vm the prepare fails: [09:01] Sep 14 08:51:38 ubuntu snapd[721]: devicemgr.go:725: System initialized, cloud-init reported to be done, set datasource_list to [ None ] [09:01] Sep 14 08:51:44 ubuntu snapd[721]: taskrunner.go:271: [change 2 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc" [09:01] Sep 14 08:56:45 ubuntu snapd[721]: taskrunner.go:271: [change 3 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc" [09:02] mborzecki: I think the request serial are red-herrings [09:02] mborzecki: would be great to get more debug output from cloud-init I guess :/ [09:03] mborzecki: i.e. if it actually created the user for us [09:10] mborzecki: ha, we don't set "enable_ssh" in from nested.sh so your tweaks need a slightly different place [09:10] mborzecki: or we just enable ssh, wonder why we don't [09:18] mborzecki: I tweaked "nested.sh" now to run "repack_snapd_snap_with_..." 
to set "enable_ssh" to true, this should also make debugging a lot simpler, lets see how it goes and if that breaks anything [09:19] mvo: hmm ssh seems to be working fine, i can ssh into the vm after prepare fails :/ [09:23] mvo: heh, and the test is ofc executing now [09:23] mborzecki: mborzecki: yes the deliver serial request are red-herrings, we just need to use serial-authority [09:23] we probably should because they just pollute the logs for nothing [09:24] PR snapd#9221 closed: tests: disk space awareness spread test [09:25] mvo: with None I would suspect not [09:25] about creating the suer [09:26] it should say NoCloud if it created the user ? [09:26] mborzecki: oh, can you paste all the info you have why preapre failed? [09:26] anyway it seems the cloud-init info is not seen/passed right if we get None [09:28] pedronis: oh, interessting [09:28] or maybe is another strange cloud-init corner case [09:30] in principle we should teach the code to turn None into disabled [09:31] mvo: https://paste.ubuntu.com/p/MF6sjVhwyr/ timeout checking whether snapd seeded [09:32] maybe we should have that timeout configurable via env too [09:33] mborzecki: and if you ssh into the nested system, what do you see there for "journalctl -u snapd" ? [09:33] mborzecki: i.e. does it actually fail to seed? [09:33] mborzecki: i.e. "nested_exec sudo journalctl -u snapd" ? [09:33] mvo: no, just the snap command hit a timeout, the system seeded ok [09:33] mborzecki: ohhh [09:33] mborzecki: ok, let me push something [09:34] I don't even understand where we build the cloud-init data atm with a bit of grepping [09:35] mborzecki: I pushed a small change that waits for the snap command to become available as suggested by ian [09:35] mborzecki: this should make this part more robust [09:38] mvo: from what i can see, the error comes directly from our client code [09:39] mborzecki: if you log into the system, do you also get a client timeout then or was this a one-off thing? [09:39] mvo: maybe it needs to multiple steps, i.e. the command -v snap loop, then make the timeout somehow configurable via env (SNAP_CLIENT_TIMEOUT?) [09:39] mvo: it's our setup that is weird: datasource_list: [ "None"] [09:39] mborzecki: oh, interessting, do you think it's actually that slow? [09:39] pedronis: oh, nice catch [09:39] in nested.sh [09:42] mvo: in theory it's ok to use [09:43] ok [09:43] hmm i use [NoCloud, None] usually [09:44] yes, more typical, but None should be ok afaict [09:45] what I mean, it should not cause problems [09:45] you should still get a user [09:46] mborzecki: hrm,hrm,something in the nested vm after "Satrting create static device nodes in /dev" is really very slow :( [09:47] do we know when snap-bootstrap run? [09:52] hm, it takes 500s to read the reboot during initial install [10:06] read the reboot? [10:07] zyga: i've requested your re-review of #9270 because of a few more commits after your previous review (a few more cases where --root=.. 
was passed to systemctl) [10:07] PR #9270: wrappers, systemd: allow empty root dir and conditionally do not pass --root to systemctl [10:08] pstolowski: ack [10:10] * mvo is away for a few min to pickup kids [10:23] * zyga grabs some food [10:36] re [10:36] yum [11:30] pstolowski: you didn't add a meet [11:30] pedronis: yes, let's use standup HO [11:41] mvo: have you looked at the snapd snap produced by repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks [12:01] mborzecki: firstboot tweaks will fail because we do things differently, need to hack this a bit further, no real results yet :( [12:01] mvo: i'm trying with repacked core20, injecting the bits directly there [12:05] brb, tea [12:09] the nested suite is dog slow :/ [12:16] mborzecki: yeah, it's all a bit frustrating [12:17] mborzecki: I modified the image caching to use gzip -1 instead of xz locally but it does not make a huge diffrence [12:18] mborzecki: I think part of it is really trying to figure out what part exactly is so slow and if we do something silly somewhere [12:18] mvo: looking at dmesg timestamps in serial logs, install takes ~500s [12:21] * zyga cleans up unit tests [12:21] mborzecki: is that just install? without first seed? [12:22] pedronis: yeah, from first boot, to a reboot [12:23] we need install mode logs [12:23] pedronis: +100 [12:25] hmm, something's off, i've added a drop in override for journald to the core snap, sice it's a base that runs durin install, the logs should be visible on the console [12:25] fwiw, first reboot is at: [ 549.003080] reboot: Restarting system [12:39] PR snapcraft#3284 opened: build providers: rename clean() -> clean_parts() to clarify scope [12:42] mborzecki: I merge my install mode pr and see if that gives me any clues [12:43] mvo: which one is that? [12:44] PR snapd#9342 opened: tests: add more checks to disk space awareness spread test [12:50] mborzecki: 9317 [12:51] mborzecki: running it now so after the standup we hopefully have results :) [12:51] ack [12:55] heh `Sep 14 12:42:14 ubuntu systemd[1]: Startup finished in 36.166s (kernel) + 1min 50.783s (userspace) = 2min 26.949s.` [12:55] this is run mode starting up [13:35] mvo: this is the diff i'm trying right now: https://paste.ubuntu.com/p/hHY6hjWxTV/ [13:35] mborzecki: thanks [13:36] mborzecki: nice, does it work? [13:36] mvo: not quite, idk why i'm not seeing the run system logs [13:36] mborzecki: http://paste.ubuntu.com/p/CTQwsyvZCC/ is my heavily hacked stuff [13:38] mvo: want to try with systemd.journald.forward_to_console=1 in the command line? [13:39] mborzecki: oh, excellent idea [13:39] mborzecki: yeah, I mean, this is obviously just a quick hack to see if I can any extra data :/ [13:40] mvo: higher chance of succeeding in getting more logs then i have have here with repacking [13:40] cmatsuoka: we should try to see what happens combining #9340 and #9277 (for this we probably need to bump secboot version) [13:40] PR #9340: boot: streamline bootstate20.go reseal and tests changes [13:40] PR #9277: secboot: add boot manager profile to pcr protection profile <⛔ Blocked> [13:41] pedronis: ack [13:42] mborzecki: I added it now but will let my current run continue [13:43] mvo: do you cache the kernel/core/snapd snaps somehow locally so that the vms do not have to download them all over again? 
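A sketch of the journald override mborzecki mentions dropping into the repacked core snap above, using the standard journald.conf.d drop-in mechanism; $CORE_UNPACKED stands for the unsquashed core snap tree and is purely illustrative:

    mkdir -p "$CORE_UNPACKED/etc/systemd/journald.conf.d"
    cat > "$CORE_UNPACKED/etc/systemd/journald.conf.d/console.conf" <<'EOF'
    [Journal]
    ForwardToConsole=yes
    TTYPath=/dev/ttyS0
    # journald only forwards up to "info" to the console by default,
    # which matches the missing-output symptom discussed further down
    MaxLevelConsole=debug
    EOF

The systemd.journald.forward_to_console=1 kernel argument mentioned above toggles the same ForwardToConsole switch from the command line.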
[13:45] mborzecki: I don't :( I think having a squid-deb-proxy or debcacher-ng would be helpful [13:58] re [13:59] lucy is fast asleep - back to work [13:59] mvo: heh, and now swtpm socket isn't ready by the time the vm starts [13:59] mborzecki: I have a snap proxy at home [13:59] mborzecki: meh, it's getting worse and worse [13:59] * zyga hugs mvo and mborzecki [14:00] zyga: hm spread runs vms with -net user, so afaiu the vm will not have access to lan [14:00] mborzecki: it does [14:01] -net user drops ping and stuff [14:01] but it works normally [14:01] lan or otherwise [14:01] as long as the private IP is in another subnet from LAN [14:01] (private qemu-given IP) [14:02] cachio: hey, i just hit 2020-09-14 13:24:53 Cannot allocate google-nested:ubuntu-18.04-64: cannot find any Google image matching "ubuntu-1804-64-virt-enabled" on project "computeengine" or "ubuntu-os-cloud" [14:03] pstolowski, 1 sec [14:03] it is the gce issue [14:04] cachio: should i just restart the tests or is it more permanent? [14:05] pstolowski, try now please [14:10] mborzecki: this is all very frustrating, even when adding systemd.journald.forward_to_console=yes it does not even work [14:10] mborzecki: I mean, I still don't see anything in the serial log [14:11] mvo: i think that console conf hijacks the console [14:11] mborzecki: and systemd.debug also does not work [14:11] (or the serial as such) [14:11] mborzecki: I can try this [14:11] the last line i see is a prompt from console conf [14:16] mvo: fwiw, might be an accident, but with haveged install time is now ~360s rather than ~500 [14:18] equally well might be some nework stuff causing that [14:28] Issue core20#80 closed: networking does not persist in a reboot loop on arm64 pi4 [14:46] zyga: do you have a sec for https://github.com/snapcore/snapd/pull/9342 ? [14:46] PR #9342: tests: add more checks to disk space awareness spread test [14:46] pstolowski: sure [14:48] done [14:48] zyga: ty! [14:50] PR snapd#9342 closed: tests: add more checks to disk space awareness spread test [15:03] mvo: hm maybe this https://paste.ubuntu.com/p/29Xz6JcJrg/ [15:04] hopefully ttyS1 is not hijacked and we can still get logs out of it [15:06] mborzecki: mux=on? should I also add this? [15:07] mborzecki: nice, let's hope this gives output [15:09] mvo: if that doesn't work, we can always have a service taht does `journalctl -f > /dev/someserial`, unfortunately none of the systemd-journal-gateway* things are in the core snap [15:12] mborzecki: mvo: do we need console-conf in these tests? can't we turn it off? or turn it off in most? or turn it off while debugging? [15:14] pedronis: yeah, I think we can turn them off [15:14] pedronis: well, so … having console-conf means there is a way to login [15:14] pedronis: so it's not entirely without merits but if we provide an alternative login then it's not needed [15:25] pedronis: while trying to debug I can say that sealing to commandline works, I tried to change it and got a recovery prompt [15:33] if we fail to unlock in the initramfs as part of a kernel snap update, will we reboot automatically and trigger rollback to the previous one? 
I don't think so, but perhaps it would be smart to teach the initramfs to do this for at least specifically the kernel snap, we could detect we are trying a kernel snap update before unlocking the encrypted partition just by looking at kernel_status and bootloader vars [15:33] mvo: not much of an improvement unfortunately, idk why journald just stops logging to serial console at some point [15:33] mborzecki: yeah, I'm also a bit stuck here, trying out more things but it's very frustrating [15:34] ijohnson: it sounds reasonable but probably to be done after we have landed the current bits [15:34] pedronis: ack I will make a small todo for myself to look into that [15:34] (for later on) [15:35] mvo: i need to taxi the kids around in 20 minutes, i'll open a branch with the patches i have [15:39] mborzecki: cool, I keep exploring this [15:40] pff, something new `Connection timed out during banner exchange` when trying to ssh into a nested vm [15:40] PR snapd#9343 opened: tests: more logging for UC20 kernel test [15:45] mvo: need to go out now, check this commit: https://github.com/snapcore/snapd/pull/9343/commits/648801163d3d09f3db18dd71bec3d79690eda3b1 [15:45] PR #9343: tests: more logging for UC20 kernel test [15:45] mborzecki: nice [15:45] mborzecki: does it work :) ? [15:46] mborzecki: I keep poking at this [15:46] mvo: idk yet, just added, check back in a bit [15:46] mborzecki: \o/ cool [15:46] mborzecki: thanks in any case [15:46] mvo: but that may be it, the dfault is 'debug' for journal/syslog, but only info for console [15:47] ok, got to go, bbl [15:58] * cachio lunch [16:25] cachio: I've seen the minimal-smoke test fail multiple times now due to issues with running spread inside the external system like this [16:25] https://pastebin.ubuntu.com/p/RPv9qTxMcY/ [16:25] and [16:26] err wait sorry need to find the other pastebin [16:28] here it is: https://pastebin.ubuntu.com/p/bmHw78Qfp4/ [16:29] any ideas on what might be wrong? it doesn't seem to be the case that there is no user on the VM we are using as an external system, just that we can't use sudo or somehow can't login the way that spread is trying to do [16:30] the first failure was from my pr https://github.com/snapcore/snapd/pull/9332, while the second failure was from https://github.com/snapcore/snapd/pull/9311 [16:30] PR #9332: spread.yaml, tests/nested: misc changes [16:30] PR #9311: nested: add support to telnet to serial port in nested VM [16:40] PR snapd#9344 opened: tests/lib/nested.sh: wait for the tpm socket to exist [16:41] simple nested test robustness PR ^ [16:44] mvo: back for a bit, the logs are visible now! [16:48] ijohnson, I'll take a look [16:49] thanks cachio [16:49] I'm running a spread run now of my pr to see if I can reproduce that issue [17:05] mborzecki: yay [17:08] mborzecki: running your PR locally now while waiting for spread to catchup [17:10] brb [17:12] mvo: this is what i got in the last local run: https://paste.ubuntu.com/p/pfZsCCyBsS/ [17:15] mborzecki: nice! good debug logs! is it still running or did creating the user fail? 
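The tpm-socket race that snapd#9344 addresses (going by its title) amounts to waiting for the swtpm unix socket before launching qemu; a sketch only, where the socket path and retry budget are guesses rather than the actual test code:

    SWTPM_SOCK=/var/snap/swtpm-mvo/current/swtpm-sock
    retries=60
    until [ -S "$SWTPM_SOCK" ] || [ "$retries" -le 0 ]; do
        sleep 1
        retries=$((retries - 1))
    done
    [ -S "$SWTPM_SOCK" ] || { echo "swtpm socket never appeared" >&2; exit 1; }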
[17:15] mvo: it's `Failed password for user1 from 10.0.2.2 port 56336 ssh2` until the very end, either the user was not created yet (cloud-init does that right?), or something else went wrong [17:15] mborzecki: ok [17:16] mborzecki: so it failed, but at least we have more logs now :) [17:16] mborzecki: hm, at least this "[ 161.918339] useradd[760]: new group: name=user1, GID=1000" is visible [17:16] mvo: yeah, at least we see it's resealing [17:17] mvo: maybe something wrong with ssh config then? [17:19] mborzecki: yeah, I wonder if the "preauth" part of the failure gives a clue already [17:22] mvo: do you remember where PasswordAuthentication in sshd_config gets enabled? [17:24] ijohnson: could you have a look at #9185 again when you have time? [17:24] PR #9185: secboot: use the snapcore/secboot native recovery key type [17:30] ijohnson, worked for me [17:31] I'll retry it in 3 machines [17:33] mvo: duh, i have no clue why ssh may be failing, trying one more time [17:34] re [17:34] ijohnson: could you run the tests/nested/core20/kernel-reseal test from #9343 locally? [17:34] PR #9343: tests: more logging for UC20 kernel test [17:34] only one function left to test [17:34] I made a lot of tea [17:38] mborzeck1: sure [17:39] ijohnson: in my runs, the test fails to login over ssh, i suspect PasswordAuthentication may have not been enabled [17:39] cachio: yeah worked for me too, I will try in a loop to see if we can reproduce [17:39] mborzeck1: hmm interesting [17:39] cmatsuoka: yes I will try to do a review tonight [17:40] thanks [17:40] ijohnson: otoh, i'm not sure how it gets enabled in those tests, we don't seem to be calling repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks in a way that would enable that, so maybe cloud-init does it implicitly when it adds a user? [17:41] mborzeck1: right it should be using whatever is the default [17:41] mborzeck1: if cloud-init ran it should have created the user with the pw [17:42] mborzeck1: is that your full log of a failed run @ https://paste.ubuntu.com/p/pfZsCCyBsS/ ? [17:42] ijohnson: yup, that's the full log i got [17:42] k [17:47] mborzeck1: it should be turned on by: ssh_pwauth: True in the cloud-init config [17:48] woot [17:48] done [17:50] pedronis: #9340 + #9277 + #9185 + some conflict fixing seem to work [17:50] PR #9340: boot: streamline bootstate20.go reseal and tests changes [17:50] PR #9277: secboot: add boot manager profile to pcr protection profile <⛔ Blocked> [17:50] PR #9185: secboot: use the snapcore/secboot native recovery key type [17:51] cmatsuoka: thx [17:51] pedronis: and snapcore/secboot update + canonical/go-tpm2 update too [18:04] mborzeck1: mvo: could we add printing of ssh config and extrausers content somewhere that runs/at end of cloud-init?
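One way the extra logging mborzeck1 asks for here could be expressed is in the cloud-init user-data itself; a sketch only (the real nested config is generated in tests/lib/nested.sh and may look quite different):

    cat > user-data <<'EOF'
    #cloud-config
    ssh_pwauth: true
    runcmd:
      # dump the effective sshd setting and the extrausers db into the cloud-init log
      - [ sh, -c, '/usr/sbin/sshd -T | grep -i passwordauthentication' ]
      - [ sh, -c, 'cat /var/lib/extrausers/passwd || true' ]
    EOF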
[18:07] pedronis: good idea [18:07] * zyga spawns more spread tests, pushes the branch and EODs [18:09] pedronis: hm, the annoying part is that it appears that the exiting "echo test" is not output in the log that maciej posted :( [18:14] pedronis: ijohnson: so it looks like it's actually taking that long to reach the point where cloud init makes the system accessible https://paste.ubuntu.com/p/6sPDmDhFzX/ [18:15] mvo: I suspect cloud-init sends it output somewhere else [18:15] mborzeck1: ah you know what I remember this problem [18:15] funny there's this: [18:15] [ 154.076623] passwd[766]: password for 'user1' changed by 'root' [18:16] but it seems the change is picked up by pam way later: [18:16] [ 396.314719] chpasswd[1626]: pam_extrausers(chpasswd:chauthtok): password changed for user1 [18:16] cloud-init creates the user from config but for whatever reason the user isn't accessible until after everything is done running, I've seen this happen in other nested suites where it fails to prepare, then drops me to a shell and 20 minutes later when I see the failure I can login just fine [18:16] mborzeck1: yes that makes sense [18:17] I think we just need to bump the timeout for now and try to optimize things later on [18:17] hah, interesting [18:17] [ 404.228972] cloud-init[1623]: Cloud-init v. 20.2-45-g5f7825e2-0ubuntu1~20.04.1 running 'modules:config' at Mon, 14 Sep 2020 18:06:32 +0000. Up 390.55 seconds. [18:17] [ 405.991338] systemd[1]: Finished Apply the settings specified in cloud-config. [18:17] it would be interesting to know if gce's tpm implementation works to the point where we could use it now for many of these tests and just use nested tests for things that really actually need nested vm's [18:18] wonder whether `Up 390.55 seconds` means that it was running for that long [18:19] I'm in the vm now [18:19] systemd take 44% cpu [18:19] and console-conf was also pretty cpu heavy [18:19] and now I got kicked out :( [18:19] ijohnson: sounds like something needs reloading to notice the new user? [18:20] pedronis: I dunno, it might be a "feature" of cloud-init that your user isn't "login-able" until cloud-init thinks the system is "ready" [18:20] mborzeck1: do we run a script inside the vms? where is it defined? [18:20] pedronis: what do you mean by a script ? [18:21] for cloud-init ? [18:21] pedronis: https://github.com/snapcore/snapd/pull/9343 repacks the core, so we can inject pretty much anything now [18:21] PR #9343: tests: more logging for UC20 kernel test [18:21] mborzeck1: no, I'm thinking a bit of code we run everywhere defined in prepare stuff? [18:23] pedronis: you mean in the nested vm? [18:24] mborzeck1: no, I mean repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks [18:26] pedronis: that extra shell bit in the helper is not used when nested runs (i presume we expect cloud-init to set up ssh) [18:26] and accounts [18:26] well, just one account really [18:26] mborzeck1: my question is whether we run it or not, my fear is that things interfere with each other [18:26] that's why I ask [18:28] pedronis: no afaict, the bit is not added, ENABLE_SSH is false when the helper is called from nested prepare [18:28] I see, ok [18:30] PR snapcraft#3285 opened: v1 plugins: lock godep's dependencies [18:30] * zyga opened export manager PR and EODs [18:30] ttyl! 
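When chasing where those ~390 seconds go, cloud-init's own CLI is handy; both subcommands below exist in the 20.2 release shown in the logs and are run inside the nested VM:

    cloud-init status --long     # which stage is running / whether it is done
    cloud-init analyze blame     # per-module timings, e.g. to confirm the password module runs late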
[18:31] PR snapd#9345 opened: overlord: introduce the export manager, export snapd tools [18:39] tweaked the cloud config we use a bit [18:39] mborzeck1: \o/ [18:46] pedronis: fwiw I have an open pr that changes repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks to instead call repack_snapd_deb_to_snap from snaps.sh to reduce confusion about that [18:47] pedronis: actually that pr is now green with 2 +1's and I'd like to merge it to master, do you have objections ? [18:47] #9332 [18:47] PR #9332: spread.yaml, tests/nested: misc changes [19:00] ijohnson: I don't have objections, but mvo and mborzeck1 might [19:01] ijohnson: also why was it like this? we have two functions of which the second behaves like the first if a param is false? [19:01] pedronis: ack, well mvo already +1d the PR, so unless mborzeck1 has an opinion I will merge it [19:01] ijohnson: works for me [19:01] wfm too [19:01] pedronis: cachio added the param to the longer function before I added/created the simpler one [19:01] ah [19:01] pedronis: I created the simpler one for uc18 nested cloud-init tests specifically [19:02] pedronis: but actually the same function works for uc20 nested tests too [19:02] mborzeck1: cool I'll merge that one now [19:03] ijohnson: tried using [NoCloud,None] in datasources list for cloud init, suspecting it may be getting confused or somesuch, but i don't see any change [19:03] mborzeck1: we need to find a way to get the ssh config and the extrausers file into the logs [19:04] ijohnson: but now that parameter could be dropped from the 2nd, no? [19:04] pedronis: yes it could [19:04] pedronis: I can do that [19:05] it would make sense to me, but maybe cachio has reasons, but even then I would then write a wrapper that picks one or the other in this new world [19:06] mborzeck1: with 648801163d3d09f3db18dd71bec3d79690eda3b1 from your pr, I got a successful run on google-nested [19:06] mborzeck1: I ran `spread --debug google-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal` [19:06] PR snapd#9332 closed: spread.yaml, tests/nested: misc changes [19:07] pedronis: I really don't think it's necessary anymore and it reads quite confusing to see "false" as an argument to an already really long bash function :-/ [19:07] but yes I will get cachio to approve the pr when ready [19:07] mborzeck1: so was the issue that you can't run it locally with qemu-nested ? [19:11] ijohnson: thanks for trying, it did work for me a couple for runs, but it's not really consistent [19:11] mborzeck1: ok, I will run more tries to see if I can reproduce the failure, to be clear though, you're running in gce with google-nested or locally with qemu-nested ? [19:12] mvo: ijohnson: pedronis: pushed a bit more changes to #9343, bumped the timeout and somesuch [19:12] PR #9343: tests: more logging for UC20 kernel test [19:12] ijohnson: google-nested [19:12] k [19:12] mborzeck1: looking [19:13] need to wrap it up, tuck the kids to bed [19:13] mborzeck1: thanks! 
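The wrapper pedronis suggests could look roughly like this; the two inner helpers are the real ones discussed, but the wrapper name, the env toggle, and the argument handling are purely hypothetical:

    repack_snapd_for_nested() {
        if [ "${NESTED_FIRSTBOOT_TWEAKS:-false}" = "true" ]; then
            repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks "$@"
        else
            repack_snapd_deb_to_snap "$@"
        fi
    }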
[19:14] mborzeck1: good night [19:14] o/ [19:16] PR snapd#9346 opened: interfaces: builtin: add iotedge interface to support Azure iotedge [19:41] PR snapd#9344 closed: tests/lib/nested.sh: wait for the tpm socket to exist [20:30] ijohnson: I think I'm understanding something of the nested issues [20:31] understanding is good [20:31] I have now run 3 times the kernel reseal test on gce and it hasn't failed for me fwiw [20:31] ijohnson: basically because we use chpasswd (and not other keys), that happens in a later cloud-init phase [20:32] if you look at the logs of a successful run [20:32] ah [20:32] [ 125.319739] passwd[783]: password for 'user1' changed by 'root' [20:32] yeah that matches basically what I expected [20:32] but things works only after I see [20:33] I use this cloud-init config locally which doesn't use the chpasswd module: [20:33] [ 346.008722] chpasswd[1668]: pam_extrausers(chpasswd:chauthtok): password changed for user1 [20:33] https://pastebin.ubuntu.com/p/ZNGxtgHGTr/ [20:34] that's at 5 minutes in [20:34] and counting [20:34] the issue afaict is that part of cloud-init runs only after we are seeded [20:34] and seeding is slow [20:34] pedronis: I wonder if cloud-init creates the user and the passwd message we see is just because the user was created and the password was set to "" [20:34] yes, something like [20:34] that [20:35] but we should try something like the minimal cloud-init I just pasted instead, perhaps that would run faster as that's just using the users module/key thing [20:35] * ijohnson afk for 5ish minutes [21:28] mmm now I reproduced a failure on the kernel-reseal nested test [21:33] with the increased timeout? [21:34] yes [21:34] it looks like the vm is hard-locked up [21:34] qemu is still running [21:34] but many messages like this in the serial log [21:34] [ 649.475297] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1] [21:34] [ 649.587280] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sshd:2191] [21:34] fun, not [21:34] [ 684.935379] rcu: INFO: rcu_sched self-detected stall on CPU [21:34] [ 684.935379] rcu: 1-...!: (14946 ticks this GP) idle=ec6/1/0x4000000000000002 softirq=73107/73107 fqs=295 [21:34] [ 684.935379] rcu: rcu_sched kthread starved for 14407 jiffies! g142565 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 [21:34] [ 684.935379] rcu: RCU grace-period kthread stack dump: [21:35] ijohnson: hummm... we've seen this before I think [21:35] yeah I dunno what's the situation, but I don't see that chpasswd message so it seems that it took too long to get to that point [21:36] and it seems that kvm was shut off properly [21:36] this is the qemu cmdline [21:36] https://www.irccloud.com/pastebin/Jyoyw8RK/ [21:37] PR snapd#9346 closed: interfaces: builtin: add iotedge interface to support Azure iotedge <⛔ Blocked> [21:37] ijohnson: I think it was when cachio's VMs were rebooting randomly and we considered it could be the watchdog rebooting the system [21:37] yeah but this is also on GCE, which had that problem [21:37] but I thought that turning off kvm fixed that problem [21:37] well it fixed the problem of randomly rebooting [21:37] now maybe the problem is just that we get hung and don't get rebooted whereas before we would at least get rebooted [21:38] ijohnson: is this happening randomly? [21:38] or did you find a way to make it reproducible? [21:39] ijohnson, hey [21:39] 2 cpus? 
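Going back to the minimal cloud-init config ijohnson mentions above: the general idea (not the contents of his pastebin) is to create the login user via the `users:` key, which runs in cloud-init's early init stage, instead of relying on the chpasswd/set-passwords module that only runs in the later config stage. A generic sketch, with a placeholder hash rather than a real credential:

    cat > user-data <<'EOF'
    #cloud-config
    ssh_pwauth: true
    users:
      - name: user1
        lock_passwd: false
        # sha512-crypt hash placeholder, generate one with e.g.: openssl passwd -6
        passwd: "$6$illustrative$..."
        sudo: "ALL=(ALL) NOPASSWD:ALL"
        shell: /bin/bash
    EOF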
[21:39] hmm, -smp 2 [21:39] cmatsuoka: it is random, I ran the same branch successfully twice before it failed [21:39] I'll try to reproduce it [21:39] cachio: should we not be using 2 cpus ? [21:40] this is for nested/core20 suite [21:40] it shouldn't be a problem [21:41] just when you have kvm enabled it could be a problem [21:41] but it is not the case [21:41] cachio: kvm is disabled in this case [21:42] cachio: this is from running mborzecki's branch on PR https://github.com/snapcore/snapd/pull/9343 [21:42] PR #9343: tests: more logging for UC20 kernel test [21:43] ijohnson, ah, nice, I'll use it [21:43] thanks [21:44] cachio: also I reproduced the issue I mentioned earlier around lunch time [21:44] I don't know a root cause yet, but somehow something deleted our sudoers entries [21:44] I ran many times and all of them passed [21:45] I saw that many times until I added sync before stopping the vm [21:45] https://www.irccloud.com/pastebin/sPEO3dQ2/ [21:46] yeah maybe sync just makes it less likely? [21:46] what's weird is that I can see in the spread prepare output it created the users and they executed sudo commands successfully [21:46] yes [21:46] I wonder if we need to run sync inside the VM as well as outside the VM too [21:47] ijohnson, perhaps [21:47] I can try that [21:47] I am preparing a new pr with some improvements [21:47] cachio: thanks I'm gonna move on and debug some other things [21:50] maybe the slow system under stress is confusing rcu? [21:52] https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt [21:54] ah we really have cpu soft lockups [21:54] ugh [21:56] looking at the logs it seems resealing itself is very slow (but we would need to add dedicated logging to be sure) [22:06] cmatsuoka: this is the full log I have from that VM before I gave up and killed it https://pastebin.ubuntu.com/p/2w3XvXBh4B/ [22:06] * cmatsuoka verifies... [22:18] ijohnson: is this the kernel-reseal test? any special command line to run it? [22:20] * cmatsuoka running the test... [22:22] cmatsuoka: yes that's the test [22:22] no, just whatever's in that branch right now is what I ran [22:23] gce seems especially slow today [22:23] cachio: could you please look at pr 9347 quickly tonight? it is small and I would like to try and merge that tonight if possible so it's ready for folks tomorrow morning [22:23] PR #9347: tests/lib/nested.sh: use more focused cloud-init config for uc20 [22:26] cachio: any known problem with arch tests and cgroups recently? [22:27] PR snapd#9347 opened: tests/lib/nested.sh: use more focused cloud-init config for uc20 [22:37] dinner, then will check test results [22:47] ijohnson, sure [22:47] thanks [22:47] cmatsuoka, didn't see errors, do you have a log [22:47] ? [22:58] ijohnson, +1, now let's wait for test results [22:59] cachio: thanks but very confusing that the nested runs are all done because it says they were cached? [22:59] This branch has never been run before it was opened less than an hour ago [22:59] I have to EOD now but maybe someone should look into that [23:02] ijohnson, I added the run nested tag [23:02] could you please re-push on that branch [23:02] so we force the nested tests to run at least once for that PR? [23:02] Oh I forgot to add the tag [23:03] Sorry let me close and reopen [23:03] ijohnson, tx [23:06] cachio: it worked now, it was just a random failure I guess [23:06] cachio: thanks [23:10] cmatsuoka, ok [23:10] cmatsuoka, please send me a log in case you see it again
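The double-sync idea discussed above is small enough to sketch inline; nested_exec is the existing helper seen earlier in the log, and its exact invocation here is illustrative:

    nested_exec sync    # flush the guest's filesystems first
    sync                # then flush on the host before stopping the VM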