/srv/irclogs.ubuntu.com/2020/09/14/#snappy.txt

mborzecki	morning	05:27
mborzecki	driving the kids to school, bbl	05:31
mborzecki	re	06:04
gko	Hello, I'm trying to install snap on a CentOS, but get "No package snapd available." even after "sudo yum install epel-release"...	06:05
gko	(Package matching epel-release-7-11.noarch already installed.)	06:06
mborzecki	gko: centos 7/8?	06:10
gko	7	06:11
gko	If I search "snapd", I get: "snapd-debuginfo.x86_64 : Debug information for package snapd" and "snapd-glib-debuginfo.x86_64 : Debug information for package snapd-glib"	06:12
gko	CentOS Linux release 7.6.1810 (Core)	06:12
mborzecki	gko: seems working here: https://paste.ubuntu.com/p/z8JmK6qJZX/	06:13
zyga	good morning	06:15
mborzecki	gko: is epel actually enabled? maybe only epel-debuginfo is enabled, can you paste the output of `yum repolist`?	06:16
* zyga eats breakfast, will start soon		06:22
mup	PR snapd#9337 closed: boot,many: reseal only when meaningful and necessary <Run nested> <UC20> <Created by pedronis> <Merged by bboozzoo> <https://github.com/snapcore/snapd/pull/9337>	06:23
mborzecki	mvo: morning	06:26
mvo	mborzecki: good morning	06:26
zyga	hey mvo	06:26
mborzecki	mvo: left you a comment yesterday https://github.com/snapcore/snapd/pull/9341#issuecomment-691692486 but looks like you missed it	06:27
mup	PR #9341: tests: add nested core20 gadget reseal test <Run nested> <UC20> <Created by mvo5> <https://github.com/snapcore/snapd/pull/9341>	06:27
mborzecki	mvo: anyways, i've merged 9337, so maybe you can just merge master and push again	06:28
mborzecki	mvo: fwiw, the nested test passed when i pushed that commit i mentioned to your pr	06:28
mvo	good morning zyga	06:28
mvo	mborzecki: cool, looking	06:29
zyga	mvo: today is different, wife's Mondays got swapped to the afternoon shift	06:29
gko	mborzecki: repo id repo name status	06:31
gko	epel EPEL 9	06:31
mvo	mborzecki: I merged master into 9341 now	06:33
mborzecki	mvo: cool	06:33
mvo	mborzecki: thank you! and you said it passed earlier?	06:33
mborzecki	mvo: yes, i think so, it was one of those github action job status emails, let me check if i still have it in the trash	06:34
mvo	mborzecki: nice	06:35
mborzecki	mvo: https://paste.ubuntu.com/p/kX8kqQSQ9y/ that's the branch right?	06:35
mvo	mborzecki: I think it passed locally but it was a bit of a pain, some strangess like I see in the "snap change 8" that installs the pc gagdet an info about "restarting snapd" but it does restart the system	06:35
mborzecki	ah w8, 4 annotations, pfff missed that	06:35
mborzecki	mvo: saw a column of 0s and was too happy about that ;)	06:36
mvo	mborzecki: haha	06:36
mvo	mborzecki: no worries	06:36
mvo	mborzecki: I think the test itself is good	06:36
mvo	mborzecki: but it highlighted some small issues	06:37
mborzecki	gko: it's showing `epel/x86_64 Extra Packages for Enterprise Linux 7 - x86_64 13,446` ?	06:37
mvo	mborzecki: anyway, I'm quite happy that things seems to be working :)	06:37
mborzecki	gko: something seems off in your system, it's listing 13k packages here, but only 9 in your setup	06:38
gko	mborzecki: right... no wonder if can't find anything.	06:39
mborzecki	gko: maybe try to `yum reinstall epel-release`?	06:40
mup	PR snapd#9339 closed: boot: make MockUC20Device use a model and MockDevice more realistic <UC20> <Created by pedronis> <Merged by mvo5> <https://github.com/snapcore/snapd/pull/9339>	06:43
mvo	mborzecki: hm, 9331 has conflicts now :/	06:53
mborzecki	mvo: yup, resolving them right now	06:53
mvo	mborzecki: \o/	06:53
gko	mborzecki: OK, my fault... there was another repo file also using epel... Thanks!	06:56
mborzecki	mvo: updated	06:59
pstolowski	morning	07:05
mvo	good morning pstolowski	07:07
* tobias_ waves general good morning		07:12
mborzecki	pstolowski: hey	07:32
zyga	re	07:33
* zyga canceled PT and jumps into code		07:35
pedronis	mvo: mborzecki: hi, do we need to sync?	07:43
mborzecki	pedronis: hi, yes, in 5? (cc mvo?)	07:43
mvo	pedronis, mborzecki sounds good	07:43
pedronis	ok	07:44
mvo	zyga: in a call right now	08:01
mborzecki	mvo: tweaked that diff a bit more https://paste.ubuntu.com/p/K9Dg4tQgkg/	08:31
mvo	mborzecki: thanks!	08:32
mvo	mborzecki: running locally now	08:33
mborzecki	mvo: i'm setting up an image for running nested locally too, are the nested tests using qemu from the repos or some special-flavor one?	08:39
pedronis	pstolowski: hi, can you setup a slot after your lunch to chat about next topics?	08:39
pstolowski	pedronis: hey, sure	08:40
pedronis	thx	08:40
mvo	mborzecki: I'm just using focal	08:41
mvo	mborzecki: I merged 9311 to have more options	08:42
pstolowski	pedronis: sent	08:42
mvo	mborzecki: also make sure you have master, 9310 and 9305 are important	08:42
mvo	mborzecki: and then I just run spread -debug -v qemu-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal	08:43
mvo	mborzecki: for me (with amd) nested kvm also works so things are somewhat fast	08:44
mvo	mborzecki: on my intel laptop that appears to be not working but I did not debug this further because the laptop is slower anyway	08:44
mborzecki	mvo: mhm, let me try that	08:44
mvo	mborzecki: for intel it's also: $ cat /sys/module/kvm_*/parameters/nested	08:45
mvo	Y	08:45
mvo	mborzecki: (instead of 1 on amd)	08:45
mvo	mborzecki: so the existing detection for support will not trigger on intel (which is a bit of a feature because it did not work for me :)	08:45
mvo	mborzecki: anyway, hope that I'm not spaming you too much	08:46
mvo	mborzecki: usually I then just telnet into the spread serial port and login and monitor stuff	08:46
mvo	mborzecki: like tail /tmp/work-dir/logs/serial.log or journalctl -u nested-vm	08:46
mvo	mborzecki: etc	08:46
mborzecki	mhm, the /sys/ bit is set to 1, so that's good	08:47
mvo	mborzecki: but I wish we could get a more holisitic view from spread too	08:47
mvo	mborzecki: yay	08:47
mvo	mborzecki: is that on intel or amd for you?	08:47
mborzecki	waiting for the base image to get updated, thena reboot and i'll try to run it	08:47
mborzecki	mvo: amd	08:47
mvo	mborzecki: cool	08:49
mvo	mborzecki: something seems to be not quite working with your diff, no sanpd output on the serial port right now :/ and huge delays	09:01
mborzecki	mvo: this is what i see in the logs in the vm the prepare fails:	09:01
mborzecki	Sep 14 08:51:38 ubuntu snapd[721]: devicemgr.go:725: System initialized, cloud-init reported to be done, set datasource_list to [ None ]	09:01
mborzecki	Sep 14 08:51:44 ubuntu snapd[721]: taskrunner.go:271: [change 2 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc"	09:01
mborzecki	Sep 14 08:56:45 ubuntu snapd[721]: taskrunner.go:271: [change 3 "Request device serial" task] failed: cannot deliver device serial request: Cannot process serial request for device with brand "BhgbYoDtThegqVkEU7oiZP8GQwCoUIxz" and model "pc"	09:01
mvo	mborzecki: I think the request serial are red-herrings	09:02
mvo	mborzecki: would be great to get more debug output from cloud-init I guess :/	09:02
mvo	mborzecki: i.e. if it actually created the user for us	09:03
mvo	mborzecki: ha, we don't set "enable_ssh" in from nested.sh so your tweaks need a slightly different place	09:10
mvo	mborzecki: or we just enable ssh, wonder why we don't	09:10
mvo	mborzecki: I tweaked "nested.sh" now to run "repack_snapd_snap_with_..." to set "enable_ssh" to true, this should also make debugging a lot simpler, lets see how it goes and if that breaks anything	09:18
mborzecki	mvo: hmm ssh seems to be working fine, i can ssh into the vm after prepare fails :/	09:19
mborzecki	mvo: heh, and the test is ofc executing now	09:23
pedronis	mborzecki: mborzecki: yes the deliver serial request are red-herrings, we just need to use serial-authority	09:23
pedronis	we probably should because they just pollute the logs for nothing	09:23
mup	PR snapd#9221 closed: tests: disk space awareness spread test <Disk space awareness> <Created by stolowski> <Merged by stolowski> <https://github.com/snapcore/snapd/pull/9221>	09:24
pedronis	mvo: with None I would suspect not	09:25
pedronis	about creating the suer	09:25
pedronis	it should say NoCloud if it created the user ?	09:26
mvo	mborzecki: oh, can you paste all the info you have why preapre failed?	09:26
pedronis	anyway it seems the cloud-init info is not seen/passed right if we get None	09:26
mvo	pedronis: oh, interessting	09:28
pedronis	or maybe is another strange cloud-init corner case	09:28
pedronis	in principle we should teach the code to turn None into disabled	09:30
mborzecki	mvo: https://paste.ubuntu.com/p/MF6sjVhwyr/ timeout checking whether snapd seeded	09:31
mborzecki	maybe we should have that timeout configurable via env too	09:32
mvo	mborzecki: and if you ssh into the nested system, what do you see there for "journalctl -u snapd" ?	09:33
mvo	mborzecki: i.e. does it actually fail to seed?	09:33
mvo	mborzecki: i.e. "nested_exec sudo journalctl -u snapd" ?	09:33
mborzecki	mvo: no, just the snap command hit a timeout, the system seeded ok	09:33
mvo	mborzecki: ohhh	09:33
mvo	mborzecki: ok, let me push something	09:33
pedronis	I don't even understand where we build the cloud-init data atm with a bit of grepping	09:34
mvo	mborzecki: I pushed a small change that waits for the snap command to become available as suggested by ian	09:35
mvo	mborzecki: this should make this part more robust	09:35
mborzecki	mvo: from what i can see, the error comes directly from our client code	09:38
mvo	mborzecki: if you log into the system, do you also get a client timeout then or was this a one-off thing?	09:39
mborzecki	mvo: maybe it needs to multiple steps, i.e. the command -v snap loop, then make the timeout somehow configurable via env (SNAP_CLIENT_TIMEOUT?)	09:39
pedronis	mvo: it's our setup that is weird: datasource_list: [ "None"]	09:39
mvo	mborzecki: oh, interessting, do you think it's actually that slow?	09:39
mvo	pedronis: oh, nice catch	09:39
pedronis	in nested.sh	09:39
pedronis	mvo: in theory it's ok to use	09:42
mvo	ok	09:43
mborzecki	hmm i use [NoCloud, None] usually	09:43
pedronis	yes, more typical, but None should be ok afaict	09:44
pedronis	what I mean, it should not cause problems	09:45
pedronis	you should still get a user	09:45
mvo	mborzecki: hrm,hrm,something in the nested vm after "Satrting create static device nodes in /dev" is really very slow :(	09:46
pedronis	do we know when snap-bootstrap run?	09:47
mborzecki	hm, it takes 500s to read the reboot during initial install	09:52
zyga	read the reboot?	10:06
pstolowski	zyga: i've requested your re-review of #9270 because of a few more commits after your previous review (a few more cases where --root=.. was passed to systemctl)	10:07
mup	PR #9270: wrappers, systemd: allow empty root dir and conditionally do not pass --root to systemctl <Run nested> <Services ⚙️> <Created by stolowski> <https://github.com/snapcore/snapd/pull/9270>	10:07
zyga	pstolowski: ack	10:08
* mvo is away for a few min to pickup kids		10:10
* zyga grabs some food		10:23
zyga	re	10:36
zyga	yum	10:36
pedronis	pstolowski: you didn't add a meet	11:30
pstolowski	pedronis: yes, let's use standup HO	11:30
mborzecki	mvo: have you looked at the snapd snap produced by repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks	11:41
mvo	mborzecki: firstboot tweaks will fail because we do things differently, need to hack this a bit further, no real results yet :(	12:01
mborzecki	mvo: i'm trying with repacked core20, injecting the bits directly there	12:01
zyga	brb, tea	12:05
mborzecki	the nested suite is dog slow :/	12:09
mvo	mborzecki: yeah, it's all a bit frustrating	12:16
mvo	mborzecki: I modified the image caching to use gzip -1 instead of xz locally but it does not make a huge diffrence	12:17
mvo	mborzecki: I think part of it is really trying to figure out what part exactly is so slow and if we do something silly somewhere	12:18
mborzecki	mvo: looking at dmesg timestamps in serial logs, install takes ~500s	12:18
* zyga cleans up unit tests		12:21
pedronis	mborzecki: is that just install? without first seed?	12:21
mborzecki	pedronis: yeah, from first boot, to a reboot	12:22
pedronis	we need install mode logs	12:23
mvo	pedronis: +100	12:23
mborzecki	hmm, something's off, i've added a drop in override for journald to the core snap, sice it's a base that runs durin install, the logs should be visible on the console	12:25
mborzecki	fwiw, first reboot is at: [ 549.003080] reboot: Restarting system	12:25
mup	PR snapcraft#3284 opened: build providers: rename clean() -> clean_parts() to clarify scope <Created by cjp256> <https://github.com/snapcore/snapcraft/pull/3284>	12:39
mvo	mborzecki: I merge my install mode pr and see if that gives me any clues	12:42
mborzecki	mvo: which one is that?	12:43
mup	PR snapd#9342 opened: tests: add more checks to disk space awareness spread test <Disk space awareness> <Created by stolowski> <https://github.com/snapcore/snapd/pull/9342>	12:44
mvo	mborzecki: 9317	12:50
mvo	mborzecki: running it now so after the standup we hopefully have results :)	12:51
mborzecki	ack	12:51
mborzecki	heh `Sep 14 12:42:14 ubuntu systemd[1]: Startup finished in 36.166s (kernel) + 1min 50.783s (userspace) = 2min 26.949s.`	12:55
mborzecki	this is run mode starting up	12:55
mborzecki	mvo: this is the diff i'm trying right now: https://paste.ubuntu.com/p/hHY6hjWxTV/	13:35
mvo	mborzecki: thanks	13:35
mvo	mborzecki: nice, does it work?	13:36
mborzecki	mvo: not quite, idk why i'm not seeing the run system logs	13:36
mvo	mborzecki: http://paste.ubuntu.com/p/CTQwsyvZCC/ is my heavily hacked stuff	13:36
mborzecki	mvo: want to try with systemd.journald.forward_to_console=1 in the command line?	13:38
mvo	mborzecki: oh, excellent idea	13:39
mvo	mborzecki: yeah, I mean, this is obviously just a quick hack to see if I can any extra data :/	13:39
mborzecki	mvo: higher chance of succeeding in getting more logs then i have have here with repacking	13:40
pedronis	cmatsuoka: we should try to see what happens combining #9340 and #9277 (for this we probably need to bump secboot version)	13:40
mup	PR #9340: boot: streamline bootstate20.go reseal and tests changes <Run nested> <UC20> <Created by pedronis> <https://github.com/snapcore/snapd/pull/9340>	13:40
mup	PR #9277: secboot: add boot manager profile to pcr protection profile <Run nested> <UC20> <⛔ Blocked> <Created by cmatsuoka> <https://github.com/snapcore/snapd/pull/9277>	13:40
cmatsuoka	pedronis: ack	13:41
mvo	mborzecki: I added it now but will let my current run continue	13:42
mborzecki	mvo: do you cache the kernel/core/snapd snaps somehow locally so that the vms do not have to download them all over again?	13:43
mvo	mborzecki: I don't :( I think having a squid-deb-proxy or debcacher-ng would be helpful	13:45
zyga	re	13:58
zyga	lucy is fast asleep - back to work	13:59
mborzecki	mvo: heh, and now swtpm socket isn't ready by the time the vm starts	13:59
zyga	mborzecki: I have a snap proxy at home	13:59
mvo	mborzecki: meh, it's getting worse and worse	13:59
* zyga hugs mvo and mborzecki		13:59
mborzecki	zyga: hm spread runs vms with -net user, so afaiu the vm will not have access to lan	14:00
zyga	mborzecki: it does	14:00
zyga	-net user drops ping and stuff	14:01
zyga	but it works normally	14:01
zyga	lan or otherwise	14:01
zyga	as long as the private IP is in another subnet from LAN	14:01
zyga	(private qemu-given IP)	14:01
pstolowski	cachio: hey, i just hit 2020-09-14 13:24:53 Cannot allocate google-nested:ubuntu-18.04-64: cannot find any Google image matching "ubuntu-1804-64-virt-enabled" on project "computeengine" or "ubuntu-os-cloud"	14:02
cachio	pstolowski, 1 sec	14:03
cachio	it is the gce issue	14:03
pstolowski	cachio: should i just restart the tests or is it more permanent?	14:04
cachio	pstolowski, try now please	14:05
mvo	mborzecki: this is all very frustrating, even when adding systemd.journald.forward_to_console=yes it does not even work	14:10
mvo	mborzecki: I mean, I still don't see anything in the serial log	14:10
mborzecki	mvo: i think that console conf hijacks the console	14:11
mvo	mborzecki: and systemd.debug also does not work	14:11
mborzecki	(or the serial as such)	14:11
mvo	mborzecki: I can try this	14:11
mborzecki	the last line i see is a prompt from console conf	14:11
mborzecki	mvo: fwiw, might be an accident, but with haveged install time is now ~360s rather than ~500	14:16
mborzecki	equally well might be some nework stuff causing that	14:18
mup	Issue core20#80 closed: networking does not persist in a reboot loop on arm64 pi4 <Created by anonymouse64> <Closed by xnox> <https://github.com/snapcore/core20/issues/80>	14:28
pstolowski	zyga: do you have a sec for https://github.com/snapcore/snapd/pull/9342 ?	14:46
mup	PR #9342: tests: add more checks to disk space awareness spread test <Disk space awareness> <Created by stolowski> <https://github.com/snapcore/snapd/pull/9342>	14:46
zyga	pstolowski: sure	14:46
zyga	done	14:48
pstolowski	zyga: ty!	14:48
mup	PR snapd#9342 closed: tests: add more checks to disk space awareness spread test <Disk space awareness> <Created by stolowski> <Merged by stolowski> <https://github.com/snapcore/snapd/pull/9342>	14:50
mborzecki	mvo: hm maybe this https://paste.ubuntu.com/p/29Xz6JcJrg/	15:03
mborzecki	hopefully ttyS1 is not hijacked and we can still get logs out of it	15:04
mvo	mborzecki: mux=on? should I also add this?	15:06
mvo	mborzecki: nice, let's hope this gives output	15:07
mborzecki	mvo: if that doesn't work, we can always have a service taht does `journalctl -f > /dev/someserial`, unfortunately none of the systemd-journal-gateway* things are in the core snap	15:09
pedronis	mborzecki: mvo: do we need console-conf in these tests? can't we turn it off? or turn it off in most? or turn it off while debugging?	15:12
mvo	pedronis: yeah, I think we can turn them off	15:14
mvo	pedronis: well, so … having console-conf means there is a way to login	15:14
mvo	pedronis: so it's not entirely without merits but if we provide an alternative login then it's not needed	15:14
mvo	pedronis: while trying to debug I can say that sealing to commandline works, I tried to change it and got a recovery prompt	15:25
ijohnson	if we fail to unlock in the initramfs as part of a kernel snap update, will we reboot automatically and trigger rollback to the previous one? I don't think so, but perhaps it would be smart to teach the initramfs to do this for at least specifically the kernel snap, we could detect we are trying a kernel snap update before unlocking the encrypted partition just by looking at kernel_status and bootloader vars	15:33
mborzecki	mvo: not much of an improvement unfortunately, idk why journald just stops logging to serial console at some point	15:33
mvo	mborzecki: yeah, I'm also a bit stuck here, trying out more things but it's very frustrating	15:33
pedronis	ijohnson: it sounds reasonable but probably to be done after we have landed the current bits	15:34
ijohnson	pedronis: ack I will make a small todo for myself to look into that	15:34
ijohnson	(for later on)	15:34
mborzecki	mvo: i need to taxi the kids around in 20 minutes, i'll open a branch with the patches i have	15:35
mvo	mborzecki: cool, I keep exploring this	15:39
mborzecki	pff, something new `Connection timed out during banner exchange` when trying to ssh into a nested vm	15:40
mup	PR snapd#9343 opened: tests: more logging for UC20 kernel test <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	15:40
mborzecki	mvo: need to go out now, check this commit: https://github.com/snapcore/snapd/pull/9343/commits/648801163d3d09f3db18dd71bec3d79690eda3b1	15:45
mup	PR #9343: tests: more logging for UC20 kernel test <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	15:45
mvo	mborzecki: nice	15:45
mvo	mborzecki: does it work :) ?	15:45
mvo	mborzecki: I keep poking at this	15:46
mborzecki	mvo: idk yet, just added, check back in a bit	15:46
mvo	mborzecki: \o/ cool	15:46
mvo	mborzecki: thanks in any case	15:46
mborzecki	mvo: but that may be it, the dfault is 'debug' for journal/syslog, but only info for console	15:46
mborzecki	ok, got to go, bbl	15:47
* cachio lunch		15:58
ijohnson	cachio: I've seen the minimal-smoke test fail multiple times now due to issues with running spread inside the external system like this	16:25
ijohnson	https://pastebin.ubuntu.com/p/RPv9qTxMcY/	16:25
ijohnson	and	16:25
ijohnson	err wait sorry need to find the other pastebin	16:26
ijohnson	here it is: https://pastebin.ubuntu.com/p/bmHw78Qfp4/	16:28
ijohnson	any ideas on what might be wrong? it doesn't seem to be the case that there is no user on the VM we are using as an external system, just that we can't use sudo or somehow can't login the way that spread is trying to do	16:29
ijohnson	the first failure was from my pr https://github.com/snapcore/snapd/pull/9332, while the second failure was from https://github.com/snapcore/snapd/pull/9311	16:30
mup	PR #9332: spread.yaml, tests/nested: misc changes <Run nested> <Simple 😃> <Created by anonymouse64> <https://github.com/snapcore/snapd/pull/9332>	16:30
mup	PR #9311: nested: add support to telnet to serial port in nested VM <Run nested> <Created by mvo5> <https://github.com/snapcore/snapd/pull/9311>	16:30
mup	PR snapd#9344 opened: tests/lib/nested.sh: wait for the tpm socket to exist <Run nested> <Simple 😃> <Test Robustness> <Created by anonymouse64> <https://github.com/snapcore/snapd/pull/9344>	16:40
ijohnson	simple nested test robustness PR ^	16:41
mborzecki	mvo: back for a bit, the logs are visible now!	16:44
cachio	ijohnson, I'll take a look	16:48
ijohnson	thanks cachio	16:49
ijohnson	I'm running a spread run now of my pr to see if I can reproduce that issue	16:49
mvo	mborzecki: yay	17:05
mvo	mborzecki: running your PR locally now while waiting for spread to catchup	17:08
zyga	brb	17:10
mborzecki	mvo: this is what i got in the last local run: https://paste.ubuntu.com/p/pfZsCCyBsS/	17:12
mvo	mborzecki: nice! good debug logs! is it still running or did creating the user fail?	17:15
mborzecki	mvo: it's `Failed password for user1 from 10.0.2.2 port 56336 ssh2` until the very end, either the user was not created yet (cloud-init does that right?), or something else went wrong	17:15
mvo	mborzecki: ok	17:15
mvo	mborzecki: so it failed, but least we have more logs now :)	17:16
mvo	mborzecki: hm, at least this "[ 161.918339] useradd[760]: new group: name=user1, GID=1000" is visible	17:16
mborzecki	mvo: yeah, at least we see it's resaling	17:16
mborzecki	mvo: maybe something wrong with ssh config then?	17:17
mvo	mborzecki: yeah, I wonder if the "preauth" part of the failure gives a clue already	17:19
mborzecki	mvo: do you remember where PasswordAuthentication in sshd_config gets enabled?	17:22
cmatsuoka	ijohnson: could you have a look at #9185 again when you have time?	17:24
mup	PR #9185: secboot: use the snapcore/secboot native recovery key type <UC20> <Created by cmatsuoka> <https://github.com/snapcore/snapd/pull/9185>	17:24
cachio	ijohnson, worker for me	17:30
cachio	I'll retry it in 3 machines	17:31
mborzeck1	mvo: duh, i have no clue why ssh may be failing, trying one more time	17:33
zyga	re	17:34
mborzeck1	ijohnson: could you run the tests/nested/core20/kernel-reseal test from #9343 locally?	17:34
mup	PR #9343: tests: more logging for UC20 kernel test <Run nested> <Test Robustness> <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	17:34
zyga	only one function left to test	17:34
zyga	I made a lot of tea	17:34
ijohnson	mborzeck1: sure	17:38
mborzeck1	ijohnson: in my runs, the test fails to login over ssh, i suspect PasswordAuthentication may have not been enabled	17:39
ijohnson	cachio: yeah worked for me too, I will try in a loop to see if we can reproduce	17:39
ijohnson	mborzeck1: hmm interesting	17:39
ijohnson	cmatsuoka: yes I will try to do a review tonight	17:39
cmatsuoka	thanks	17:40
mborzeck1	ijohnson: otoh, i'm not sure how it gets enabled in those tests, we dont' seem to calling repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks in a way that would enable that, so maybe cloud-init does it implicitly when it adds a user?	17:40
ijohnson	mborzeck1: right it should be using whatever is the default	17:41
ijohnson	mborzeck1: if cloud-init ran it should have created the user with the pw	17:41
ijohnson	mborzeck1: is that your full log of a failed run @ https://paste.ubuntu.com/p/pfZsCCyBsS/ ?	17:42
mborzeck1	ijohnson: yup, that's the full log i got	17:42
ijohnson	k	17:42
pedronis	mborzeck1: it should be turned on by: ssh_pwauth: True in the cloud-init config	17:47
zyga	woot	17:48
zyga	done	17:48
cmatsuoka	pedronis: #9340 + #9277 + #9185 + some conflict fixing seem to work	17:50
mup	PR #9340: boot: streamline bootstate20.go reseal and tests changes <Run nested> <UC20> <Created by pedronis> <https://github.com/snapcore/snapd/pull/9340>	17:50
mup	PR #9277: secboot: add boot manager profile to pcr protection profile <Run nested> <UC20> <⛔ Blocked> <Created by cmatsuoka> <https://github.com/snapcore/snapd/pull/9277>	17:50
mup	PR #9185: secboot: use the snapcore/secboot native recovery key type <UC20> <Created by cmatsuoka> <https://github.com/snapcore/snapd/pull/9185>	17:50
pedronis	cmatsuoka: thx	17:51
cmatsuoka	pedronis: and snapcore/secboot update + canonical/go-tpm2 update too	17:51
pedronis	mborzeck1: mvo: could we add printing of ssh config and extrausers content somewhere that runs/at end of cloud-init?	18:04
mvo	pedronis: good idea	18:07
* zyga spawns more spread tests, pushes the branch and EODs		18:07
mvo	pedronis: hm, the annoying part is that it appears that the exiting "echo test" is not output in the log that maciej posted :(	18:09
mborzeck1	pedronis: ijohnson: so it looks like it's actually taking that long to reach the point where cloud init makes the system accessible https://paste.ubuntu.com/p/6sPDmDhFzX/	18:14
pedronis	mvo: I suspect cloud-init sends it output somewhere else	18:15
ijohnson	mborzeck1: ah you know what I remember this problem	18:15
mborzeck1	funny there's this:	18:15
mborzeck1	[ 154.076623] passwd[766]: password for 'user1' changed by 'root'	18:15
mborzeck1	but it seems the change is picked up by pam way later:	18:16
mborzeck1	[ 396.314719] chpasswd[1626]: pam_extrausers(chpasswd:chauthtok): password changed for user1	18:16
ijohnson	cloud-init creates the user from config but for whatever reason the user isn't accessible until after everything is done running, I've seen this happen in other nested suites where it fails to prepare, then drops me to a shell and 20 minutes later when I see the failure I can login just fine	18:16
ijohnson	mborzeck1: yes that makes sense	18:16
ijohnson	I think we just need to bump the timeout for now and try to optimize things later on	18:17
mborzeck1	hah, interesting	18:17
mborzeck1	[ 404.228972] cloud-init[1623]: Cloud-init v. 20.2-45-g5f7825e2-0ubuntu1~20.04.1 running 'modules:config' at Mon, 14 Sep 2020 18:06:32 +0000. Up 390.55 seconds.	18:17
mborzeck1	[ 405.991338] systemd[1]: Finished Apply the settings specified in cloud-config.	18:17
ijohnson	it would be interesting to know if gce's tpm implementation works to the point where we could use it now for many of these tests and just use nested tests for things that really actually need nested vm's	18:17
mborzeck1	wonder whether `Up 390.55 seconds` means that it was running for that long	18:18
mvo	I'm in the vm now	18:19
mvo	systemd take 44% cpu	18:19
mvo	and console-conf was also pretty cpu heavy	18:19
mvo	and now I got kicked out :(	18:19
pedronis	ijohnson: sounds like something needs reloading to notice the new user?	18:19
ijohnson	pedronis: I dunno, it might be a "feature" of cloud-init that your user isn't "login-able" until cloud-init thinks the system is "ready"	18:20
pedronis	mborzeck1: do we run a script inside the vms? where is it defined?	18:20
ijohnson	pedronis: what do you mean by a script ?	18:20
ijohnson	for cloud-init ?	18:21
mborzeck1	pedronis: https://github.com/snapcore/snapd/pull/9343 repacks the core, so we can inject pretty much anything now	18:21
mup	PR #9343: tests: more logging for UC20 kernel test <Run nested> <Test Robustness> <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	18:21
pedronis	mborzeck1: no, I'm thinking a bit of code we run everywhere defined in prepare stuff?	18:21
mborzeck1	pedronis: you mean in the nested vm?	18:23
pedronis	mborzeck1: no, I mean repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks	18:24
mborzeck1	pedronis: that extra shell bit in the helper is not used when nested runs (i presume we expect cloud-init to set up ssh)	18:26
mborzeck1	and accounts	18:26
mborzeck1	well, just one account really	18:26
pedronis	mborzeck1: my question is whether we run it or not, my fear is that things interfere with each other	18:26
pedronis	that's why I ask	18:26
mborzeck1	pedronis: no afaict, the bit is not added, ENABLE_SSH is false when the helper is called from nested prepare	18:28
pedronis	I see, ok	18:28
mup	PR snapcraft#3285 opened: v1 plugins: lock godep's dependencies <Created by cjp256> <https://github.com/snapcore/snapcraft/pull/3285>	18:30
* zyga opened export manager PR and EODs		18:30
zyga	ttyl!	18:30
mup	PR snapd#9345 opened: overlord: introduce the export manager, export snapd tools <Created by zyga> <https://github.com/snapcore/snapd/pull/9345>	18:31
mborzeck1	tweaked the cloud config we use a bit	18:39
mvo	mborzeck1: \o/	18:39
ijohnson	pedronis: fwiw I have an open pr that changes repack_snapd_snap_with_deb_content_and_run_mode_firstboot_tweaks to instead call repack_snapd_deb_to_snap from snaps.sh to reduce confusion about that	18:46
ijohnson	pedronis: actually that pr is now green with 2 +1's and I'd like to merge it to master, do you have objections ?	18:47
ijohnson	#9332	18:47
mup	PR #9332: spread.yaml, tests/nested: misc changes <Run nested> <Simple 😃> <Created by anonymouse64> <https://github.com/snapcore/snapd/pull/9332>	18:47
pedronis	ijohnson: I don't have objections, but mvo and mborzeck1 might	19:00
pedronis	ijohnson: also why was it like this? we have two functions of which the second behaves like the first if a param is false?	19:01
ijohnson	pedronis: ack, well mvo already +1d the PR, so unless mborzeck1 has an opinion I will merge it	19:01
mvo	ijohnson: works for me	19:01
mborzeck1	wfm too	19:01
ijohnson	pedronis: cachio added the param to the longer function before I added/created the simpler one	19:01
pedronis	ah	19:01
ijohnson	pedronis: I created the simpler one for uc18 nested cloud-init tests specifically	19:01
ijohnson	pedronis: but actually the same function works for uc20 nested tests too	19:02
ijohnson	mborzeck1: cool I'll merge that one now	19:02
mborzeck1	ijohnson: tried using [NoCloud,None] in datasources list for cloud init, suspecting it may be getting confused or somesuch, but i don't see any change	19:03
pedronis	mborzeck1: we need to find a way to get the ssh config and the extrausers file into the logs	19:03
pedronis	ijohnson: but now that parameter could be dropped from the 2nd, no?	19:04
ijohnson	pedronis: yes it could	19:04
ijohnson	pedronis: I can do that	19:04
pedronis	it would make sense to me, but maybe cachio has reasons, but even then I would then write a wrapper that picks one or the other in this new world	19:05
ijohnson	mborzeck1: with 648801163d3d09f3db18dd71bec3d79690eda3b1 from your pr, I got a successful run on google-nested	19:06
ijohnson	mborzeck1: I ran `spread --debug google-nested:ubuntu-20.04-64:tests/nested/core20/kernel-reseal`	19:06
mup	PR snapd#9332 closed: spread.yaml, tests/nested: misc changes <Run nested> <Simple 😃> <Created by anonymouse64> <Merged by anonymouse64> <https://github.com/snapcore/snapd/pull/9332>	19:06
ijohnson	pedronis: I really don't think it's necessary anymore and it reads quite confusing to see "false" as an argument to an already really long bash function :-/	19:07
ijohnson	but yes I will get cachio to approve the pr when ready	19:07
ijohnson	mborzeck1: so was the issue that you can't run it locally with qemu-nested ?	19:07
mborzeck1	ijohnson: thanks for trying, it did work for me a couple for runs, but it's not really consistent	19:11
ijohnson	mborzeck1: ok, I will run more tries to see if I can reproduce the failure, to be clear though, you're running in gce with google-nested or locally with qemu-nested ?	19:11
mborzeck1	mvo: ijohnson: pedronis: pushed a bit more changes to #9343, bumped the timeout and somesuch	19:12
mup	PR #9343: tests: more logging for UC20 kernel test <Run nested> <Test Robustness> <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	19:12
mborzeck1	ijohnson: google-nested	19:12
ijohnson	k	19:12
mvo	mborzeck1: looking	19:12
mborzeck1	need to wrap it up, tuck the kids to bed	19:13
mvo	mborzeck1: thanks!	19:13
mvo	mborzeck1: good night	19:14
mborzeck1	o/	19:14
mup	PR snapd#9346 opened: interfaces: builtin: add iotedge interface to support Azure iotedge <Created by kubiko> <https://github.com/snapcore/snapd/pull/9346>	19:16
mup	PR snapd#9344 closed: tests/lib/nested.sh: wait for the tpm socket to exist <Run nested> <Simple 😃> <Test Robustness> <Created by anonymouse64> <Merged by anonymouse64> <https://github.com/snapcore/snapd/pull/9344>	19:41
pedronis	ijohnson: I think I'm understanding something of the nested issues	20:30
ijohnson	understanding is good	20:31
ijohnson	I have now run 3 times the kernel reseal test on gce and it hasn't failed for me fwiw	20:31
pedronis	ijohnson: basically because we use chpasswd (and not other keys), that happens in a later cloud-init phase	20:31
pedronis	if you look at the logs of a successful run	20:32
ijohnson	ah	20:32
pedronis	[ 125.319739] passwd[783]: password for 'user1' changed by 'root'	20:32
ijohnson	yeah that matches basically what I expected	20:32
pedronis	but things works only after I see	20:32
ijohnson	I use this cloud-init config locally which doesn't use the chpasswd module:	20:33
pedronis	[ 346.008722] chpasswd[1668]: pam_extrausers(chpasswd:chauthtok): password changed for user1	20:33
ijohnson	https://pastebin.ubuntu.com/p/ZNGxtgHGTr/	20:33
pedronis	that's at 5 minutes in	20:34
pedronis	and counting	20:34
pedronis	the issue afaict is that part of cloud-init runs only after we are seeded	20:34
pedronis	and seeding is slow	20:34
ijohnson	pedronis: I wonder if cloud-init creates the user and the passwd message we see is just because the user was created and the password was set to ""	20:34
pedronis	yes, something like	20:34
pedronis	that	20:34
ijohnson	but we should try something like the minimal cloud-init I just pasted instead, perhaps that would run faster as that's just using the users module/key thing	20:35
* ijohnson afk for 5ish minutes		20:35
ijohnson	mmm now I reproduced a failure on the kernel-reseal nested test	21:28
pedronis	with the increased timeout?	21:33
ijohnson	yes	21:34
ijohnson	it looks like the vm is hard-locked up	21:34
ijohnson	qemu is still running	21:34
ijohnson	but many messages like this in the serial log	21:34
ijohnson	[ 649.475297] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]	21:34
ijohnson	[ 649.587280] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sshd:2191]	21:34
pedronis	fun, not	21:34
ijohnson	[ 684.935379] rcu: INFO: rcu_sched self-detected stall on CPU	21:34
ijohnson	[ 684.935379] rcu: 1-...!: (14946 ticks this GP) idle=ec6/1/0x4000000000000002 softirq=73107/73107 fqs=295	21:34
ijohnson	[ 684.935379] rcu: rcu_sched kthread starved for 14407 jiffies! g142565 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0	21:34
ijohnson	[ 684.935379] rcu: RCU grace-period kthread stack dump:	21:34
cmatsuoka	ijohnson: hummm... we've seen this before I think	21:35
ijohnson	yeah I dunno what's the situation, but I don't see that chpasswd message so it seems that it took too long to get to that point	21:35
ijohnson	and it seems that kvm was shut off properly	21:36
ijohnson	this is the qemu cmdline	21:36
ijohnson	https://www.irccloud.com/pastebin/Jyoyw8RK/	21:36
mup	PR snapd#9346 closed: interfaces: builtin: add iotedge interface to support Azure iotedge <Needs security review> <⛔ Blocked> <Created by kubiko> <Closed by kubiko> <https://github.com/snapcore/snapd/pull/9346>	21:37
cmatsuoka	ijohnson: I think it was when cachio's VMs were rebooting randomly and we considered it could be the watchdog rebooting the system	21:37
ijohnson	yeah but this is also on GCE, which had that problem	21:37
ijohnson	but I thought that turning off kvm fixed that problem	21:37
ijohnson	well it fixed the problem of randomly rebooting	21:37
ijohnson	now maybe the problem is just that we get hung and don't get rebooted whereas before we would at least get rebooted	21:37
cmatsuoka	ijohnson: is this happening randomly?	21:38
cmatsuoka	or did you find a way to make it reproducible?	21:38
cachio	ijohnson, hey	21:39
cachio	2 cpus?	21:39
cmatsuoka	hmm, -smp 2	21:39
ijohnson	cmatsuoka: it is random, I ran the same branch successfuly twice before it failed	21:39
cachio	I'll try to reproduce it	21:39
ijohnson	cachio: should we not be using 2 cpus ?	21:39
ijohnson	this is for nested/core20 suite	21:40
cachio	it should,t be a problem	21:40
cachio	just when you have kvm enalbed could be a prooblem	21:41
cachio	but it is not the case	21:41
ijohnson	cachio: kvm is disabled in this case	21:41
ijohnson	cachio: this is from running mborzecki's branch on PR https://github.com/snapcore/snapd/pull/9343	21:42
mup	PR #9343: tests: more logging for UC20 kernel test <Run nested> <Test Robustness> <UC20> <Created by bboozzoo> <https://github.com/snapcore/snapd/pull/9343>	21:42
cachio	ijohnson, ah, nice, I'll use it	21:43
ijohnson	thanks	21:43
ijohnson	cachio: also I reproduced the issue I mentioned earlier around lunch time	21:44
ijohnson	I don't know a root cause yet, but somehow something deleted our sudoers entries	21:44
cachio	I ran many times and all of them passed	21:44
cachio	I saw many times that until I added sync before stop the vm	21:45
ijohnson	https://www.irccloud.com/pastebin/sPEO3dQ2/	21:45
ijohnson	yeah maybe sync just makes it less likely?	21:46
ijohnson	what's weird is that I can see in the spread prepare output it created the users and they executed sudo commands successfully	21:46
cachio	yes	21:46
ijohnson	I wonder if we need to run sync inside the VM as well as outside the VM too	21:46
cachio	ijohnson, perhaps	21:47
cachio	I can try that	21:47
cachio	I am preparing a new pr with some improvements	21:47
ijohnson	cachio: thanks I'm gonna move on and debug some other things	21:47
cmatsuoka	maybe the slow system under stress is confusing rcu?	21:50
cmatsuoka	https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt	21:52
cmatsuoka	ah we really have cpu soft lockups	21:54
cmatsuoka	ugh	21:54
pedronis	looking at the logs it seems reaseling itself is very slow (but we would need to add dedicated logging to be sure)	21:56
ijohnson	cmatsuoka: this is the full log I have from that VM before I gave up and killed it https://pastebin.ubuntu.com/p/2w3XvXBh4B/	22:06
* cmatsuoka verifies...		22:06
cmatsuoka	ijohnson: is this the kernel-reseal test? any special command line to run it?	22:18
* cmatsuoka running the test...		22:20
ijohnson	cmatsuoka: yes that's the test	22:22
ijohnson	no, just whatever's in that branch right now is what I ran	22:22
cmatsuoka	gce seems especially slow today	22:23
ijohnson	cachio: could you please look at pr 9347 quickly tonight? it is small and I would like to try and merge that tonight if possible so it's ready for folks tomorrow morning	22:23
mup	PR #9347: tests/lib/nested.sh: use more focused cloud-init config for uc20 <Created by anonymouse64> <https://github.com/snapcore/snapd/pull/9347>	22:23
cmatsuoka	cachio: any known problem with arch tests and cgroups recently?	22:26
mup	PR snapd#9347 opened: tests/lib/nested.sh: use more focused cloud-init config for uc20 <Created by anonymouse64> <https://github.com/snapcore/snapd/pull/9347>	22:27
cmatsuoka	dinner, then will check test results	22:37
cachio	ijohnson, sure	22:47
ijohnson	thanks	22:47
cachio	cmatsuoka, didn't see errors, do you have al og	22:47
cachio	?	22:47
cachio	ijohnson, +1, now lets wait for test results	22:58
ijohnson	cachio thanks but very confusing that the nested runs are all done because it says they were cached?	22:59
ijohnson	This branch has never been run before it was opened less than an hour ago	22:59
ijohnson	I have to EOD now but maybe someone should look into that	22:59
cachio	ijohnson, I added the run nested tag	23:02
cachio	could you please re-push on that branch	23:02
cachio	so we force the nested tests are executed at least once for that PR?	23:02
ijohnson	Oh I forgot to add the tag	23:02
ijohnson	Sorry let me close and reopen	23:03
cachio	ijohnson, tx	23:03
cmatsuoka	cachio: it worked now, it was just a random failure I guess	23:06
cmatsuoka	cachio: thanks	23:06
cachio	cmatsuoka, ok	23:10
cachio	cmatsuoka, please send me a log in case you see it again	23:10

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!