[05:56] good morning
[06:06] zyga: morning
[06:06] hey :)
[06:33] mvo: hey
[06:34] hey mborzecki
[06:45] hey mvo
[06:45] quick comment, do you remember the remodel kernel work? there's a test that reliably fails in osc build, there's a trello card for it
[06:45] I wanted to look now but if you know more I'd love to get help
[06:46] zyga: can you point me to the log?
[06:46] zyga: the remodel kernel test takes a while iirc, we had to increase the settle timeout on arm
[06:46] zyga: is it maybe that?
[06:46] that's the same test
[06:46] but it failed on my xeon ;D
[06:48] also with a timeout?
[06:48] it's hard to say, there's very little useful stuff when settle fails
[06:48] I complained that it should not be time-based at all but this is not a simple problem to solve
[06:49] mvo: https://trello.com/c/foU3iOrs/321-investigate-testremodelswitchtodifferentkernel-failure
[06:52] mvo: the whole output is
[06:52] https://www.irccloud.com/pastebin/8NhFWypY/
[06:52] zyga: this happens on your opensuse as well?
[06:52] no, it only happens in 'osc build'
[06:52] though my master is "after" 2.42
[06:52] zyga: is it easy to submit test builds? I could add something that prints more debug
[06:52] mvo: it's not remote
[06:52] it's local
[06:53] mvo: it's like dpkg-buildpackage
[06:53] mvo: anything extra can be added, just make a patch against 2.42
[06:53] mvo: if you push a branch I can extract the patch and run it easily
[06:53] mvo: I would like to make some general changes to the tests later on
[06:53] though I doubt this will be fast
[06:54] 1) log all handlers that executed during failed test
[06:54] 2) change settle in tests to just do more stuff as long as handlers want to run, remove the timeout entirely
[06:54] 3) patch tests that measure failure to introduce callback to break further loops
[06:55] current time-based code is simply not reliable and not engineered correctly as tests IMO
=== pstolowski|afk is now known as pstolowski
[07:03] mornings
[07:22] pstolowski: hey :)
[07:22] man, I'm so sleepy today
[07:23] the rain, the low pressure
[07:23] coffee, sorry
[07:24] hey pstolowski - good morning!
[07:24] mvo: sorry if I came across cranky, I just think we need to acknowledge shortcomings and not just paper over them with "this is how it's been done"
[07:25] zyga: not cranky at all
[07:26] zyga: let me just finish this one thing here and then I can do something, probably something like for _, t := range chg.Tasks() { fmt.Println(t.Kind(), t.Status()) }
[07:26] yeah
[07:29] flashing the card again, last night I had issues and somehow I could not get the card to copy from macos
[07:29] (new sandbox for all apps)
[07:29] mvo: macos catalina switched to ... read only image as root filesystem
[07:29] mvo: yeah :)
[07:29] mvo: ubuntu personal
[07:30] mvo: /Users (/home for macos) and /Applications are writable
[07:30] rest is not
[07:30] and they live on a different apfs dataset (like zfs thing)
[07:38] zyga: updated #7571
[07:38] PR #7571: sandbox/cgroup: refactor process cgroup helper to support v2 and named hierarchies
[07:38] checking
[07:39] zyga: is there a place like /var/log on osx where system logs land? wonder how it's combined with the read only filesystem image
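The debug loop mvo sketches at 07:26 is the core of it: when settle times out, dump the change's tasks so the failing test shows what was still queued. A minimal sketch of such a helper, assuming it is dropped into an overlord/managers test where chg is a *state.Change and the state lock is already held (the helper name is made up):

```go
import (
	"fmt"

	"github.com/snapcore/snapd/overlord/state"
)

// dumpChangeTasks is a hypothetical debugging helper, not part of snapd:
// it prints each task of a change together with its status and log, so a
// settle timeout at least shows what was still in flight.
func dumpChangeTasks(chg *state.Change) {
	for _, t := range chg.Tasks() {
		fmt.Println(t.Kind(), t.Status())
		for _, line := range t.Log() {
			fmt.Println("   ", line)
		}
	}
}
```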
[07:39] mborzecki: actually macos has /home /var /usr and all that
[07:39] they are just hidden
[07:39] this is the current mount table
[07:39] https://www.irccloud.com/pastebin/r4LiHBk7/
[07:40] mborzecki: and yeah, there is /var/log
[07:41] mborzecki: this is also interesting
[07:42] https://www.irccloud.com/pastebin/fRUAdcOV/ls%20-la%20%2F%20
[07:42] note that /home is /System/Volumes/Data/home
[07:42] where Data is the new writable part of apfs
[07:44] mborzecki: the new PR looks nice
[07:46] mborzecki: approved
[07:49] zyga: thx
[07:50] mvo: hi, I made this comment: https://github.com/snapcore/snapd/pull/7538#issuecomment-539882778 let me know if you prefer me to go over it line by line
[07:50] PR #7538: tests: use `snap model` instead of `snap known model` in tests
[07:55] PR snapd#7575 closed: cmd/model: add authority-id to verbose fields
[08:00] pedronis: thanks, looking now
[08:02] pedronis: no need to go over it line by line, thanks, I will make the changes
[08:08] zyga: I created github.com/mvo5/snappy:debug-remodel-kernel-zyga with some extra debug (quite small right now). please run it when you get a chance and let me know what it outputs
[08:08] mvo: thank you
[08:10] mvo: I've rebased on 2.42, trying now
[08:13] mvo: new log https://www.irccloud.com/pastebin/EUXdHkdl/
[08:15] zyga: aha, looks like something in the mock restart is not working
[08:21] mvo: I'm logged into the image from rogpeppe
[08:21] refreshing snaps now
[08:21] we'll know in a second if it crashes on reboot
[08:22] zyga: sometimes it doesn't :)
[08:22] rogpeppe: fingers crossed
[08:22] I have serial output
[08:22] so we'll know what is attempted at least
[08:22] zyga: the other thing is that it surely shouldn't be rebooting very often
[08:22] and I should send the raspi-tool to snapd into some contrib/ directory
[08:23] zyga: how often would you expect? is it usual for it to try to reboot once or twice a day?
[08:23] zyga: will look into this a little bit
[08:23] rogpeppe: not sure, one sec
[08:23] Chipaca: from that device
[08:24] rogpeppe: the only valid reason to reboot that often is if it tracks edge for core/kernel; everything else is most likely a bug
[08:24] Chipaca: system-shutdown messages https://www.irccloud.com/pastebin/sSWsFr69/
[08:24] rogpeppe: thank you so much for helping us track this down!
[08:24] zyga: thank you, super curious about this
[08:24] mvo: i hope you do! :)
[08:24] rogpeppe: it updated to 2.42 correctly
[08:24] zyga: those messages are fine
[08:24] ok
[08:24] oh
[08:24] it reboots instantly?
[08:24] what?
[08:25] I snap refreshed
[08:25] it rebooted
[08:25] then it reboots again
[08:25] straight away
[08:26] mvo: suggestion for improvement
[08:26] mvo: uboot script should log basics
[08:26] like booting that kernel and this base
[08:26] and in this mode
[08:26] would be great to see
[08:27] mvo: snapd change after update
[08:27] it seems that 2nd reboot was the only one
[08:27] snap change on successful 2nd reboot https://www.irccloud.com/pastebin/4X2z8O6R/
[08:27] I'll reboot manually to see if it boots ok
[08:29] rogpeppe: sadly, it works :/
[08:32] mborzecki: question in #7443
[08:32] PR #7443: timeutil: fix schedules with ambiguous nth weekday spans
[08:32] rogpeppe: I'll leave it running
[08:32] zyga: one thing you could do is just leave it running
[08:32] zyga: lol
[08:32] rogpeppe: to see how it ages for a few days
[08:32] :D
[08:33] zyga: yup. i guess it might be that we've got two pi's both with buggy h/w
[08:33] rogpeppe: could it be the SD card?
[08:33] could it be that the additional hardware somehow destabilises the Pis?
[08:34] zyga: i used a different SD card too
[08:34] interesting
[08:34] how is the hardware connected?
[08:34] zyga: and it's a decent quality SD card, not much used.
[08:34] zyga: connected to the network?
[08:34] no, any custom stuff added to your pi?
[08:34] not network
[08:35] zyga: nothing custom added
[08:35] there goes that hypothesis then :)
[08:36] zyga: there was an RTC and a pi-glow on the old pi, but not on this one
[08:36] I can send you my pi if you want to swap :)
[08:37] zyga: that's an interesting thought. the other thing i might try (sorry) is to use a different OS.
[08:37] that would be an interesting data point as well
[08:52] Chipaca: hey, health-check task is not executed on install if flags&skipConfigure != 0 (because we return early and don't schedule the config hook - see the bottom of doInstall); intended or a bug?
[08:54] pedronis: iirc it's not strictly needed, though it kind of highlights that the thing happens every day
[08:55] mborzecki: highlights or confuses people like me :)
[08:55] pedronis: hah ok, i can drop it
[08:55] mborzecki: one possible reading is that it triggers every day of the month
[08:55] anyway with /1
[08:55] the "spec" is not great
[08:56] duh, it is not :/
[09:01] mborzecki: if it's not strictly needed I would drop it, less mental burden to read it. The spec says "Values may be suffixed with "/" and a repetition value, which indicates that the value itself and the value plus all multiples of the repetition value are matched. Two values separated by ".." may be used to indicate a range of values; ranges may also be followed with "/" and a repetition value." of which I'm not sure how to
[09:01] interpret in practice
[09:02] i'm looking at the spec, and we may have a bug there, at least on older systemd versions
[09:02] there's no .. range on xenial :/
[09:03] but our snap with timers does not fail, so the bug is only for numbered ranges
[09:04] another conclusion is that since we had no bug reports or complaints about snap install failing, people do not use it, at least in snap.yaml
[09:06] mborzecki: we'll need to update this, right? https://forum.snapcraft.io/t/timer-string-format/6562
[09:06] yes
[09:07] also before you could do mon1-fri2, right? now you need something like mon1-sun,mon2-fri ?
[09:11] yes, mon1-fri or mon-fri2 (or mon2-fri)
[09:12] so at most the range can describe 8 days, eg mon1-mon
[09:12] ok, not sure I'm worried, also the other one does not quite do the same thing
[09:12] just making sure I understand
[09:14] zyga: I updated the kernel-remodel debug PR, will probably produce a lot of output now, could you please run it?
[09:14] sure
[09:14] one sec
[09:15] mvo: try to rebase on 2.42 next time
[09:16] zyga: mvo: are you looking into TestRemodelSwitchToDifferentKernel ?
[09:17] mvo: building now
[09:17] pedronis: yeah, it seems to fail for zyga in obs
[09:17] pedronis: mvo is, really; I'm just the test runner
[09:17] mvo: in osc, not obs (I didn't push yet)
[09:17] pedronis: it's a bit strange, I cannot reproduce it and I'm a bit lost
[09:17] there's a card, I will move it and put you on it
[09:17] osc -> dpkg-buildpackage, obs -> that thing in the cloud
[09:17] mvo: new log coming up
[09:17] zyga: thanks, is there an ocd as well?
[09:18] ocd? :D
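As an aside on the mon1/mon2 notation in the timer discussion above: the "nth weekday" part is just integer division on the day of the month. A standalone illustration in plain Go (this is not snapd's timeutil code, which does the real parsing and range handling):

```go
package main

import (
	"fmt"
	"time"
)

// nth reports which occurrence of its weekday a date is within its month:
// 1 for the first Monday, 2 for the second Monday, and so on.
func nth(t time.Time) int {
	return (t.Day()-1)/7 + 1
}

func main() {
	d := time.Date(2019, time.October, 9, 0, 0, 0, 0, time.UTC)
	// prints: 2019-10-09 is Wednesday #2, i.e. "wed2" in the timer notation
	fmt.Printf("%s is %s #%d\n", d.Format("2006-01-02"), d.Weekday(), nth(d))
}
```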
[09:18] zyga: just kidding
[09:18] hahaha
[09:18] I was suspecting some :D
[09:18] zyga: we also have obi here
[09:18] zyga: SCNR
[09:18] same here
[09:18] very useful
[09:18] SCNR?
[09:18] zyga: sorry-could-not-resist
[09:19] zyga: these short acronyms are just too hard for me to remember :)
[09:19] zyga: anyway, hope it gives me clues but I have *no* idea right now
[09:20] I can look as well at some point if it remains a mystery, not quite right now
[09:21] though
[09:21] mvo: the log is large and got truncated from my screen, give me a sec to grab it from disk
[09:21] zyga: only the bits before the test failure are of interest
[09:22] zyga: you could also modify the build to just run the remodel test during the build
[09:22] the build takes a second, it's just the timeout that we are waiting for
[09:22] pstolowski: in practice we use SkipConfigure only in firstboot code, doesn't really answer either way
[09:22] ok, collecting now
[09:23] mborzecki: 7572 has an arch specific failure in the spread logs that I have not seen before, could you please have a look before I restart the build?
[09:23] mborzecki: (not urgent :)
[09:24] pstolowski: bug
[09:25] Chipaca: pstolowski: notice that in practice we don't want that behavior; it is right now immaterial, we don't want to run health-checks before at least the essential snaps are installed, so what we do for configure there would apply also for health-checks
[09:26] so in practice it is a feature, just obscurely expressed
[09:26] pedronis: true
[09:26] mvo, zyga, ondra showed me a bootchart yesterday and we noticed that there are a gazillion unsquashfs calls during seeding ... why is that (i.e. if we extract metadata or some such, why dont we mount/umount the squashfs and cp the file, which is a ton faster and less CPU hungry)
[09:26] ogra: we unsquash to look at info AFAIR
[09:27] right, that was my guess
[09:27] ogra: but yeah, I agree
[09:27] pstolowski: basically renaming the flag to SkipConfigureAndHealthCheck would also work
[09:29] ogra: at one point we determined it to be cheaper to unsquash it to extract just one file, than to mount it
[09:30] we also do some checks before we consider a snap safe to mount but I don't remember the specifics there for firstboot
[09:30] we could save on one op entirely by mounting to /tmp/$RANDOM and then moving the mount over to the final location
[09:30] mvo: IMO our checks are not in the "safe to mount" category
[09:30] mvo: it's valid vs invalid but not safe
[09:31] either way, we will do nothing without measuring
[09:31] also it's a delicate area under churn
[09:31] so also nothing very soon
[09:32] ogra: I recommend creating a forum topic with some of those logs, and we'll see
[09:32] Chipaca, well, on first boot of a low end system it significantly eats CPU
[09:32] mvo: it's still running, it's odd - I had a look at the log file it's running under now and it keeps printing open /dev/null and open /bin/false
[09:32] ogra: yes. I sent an email about this three years ago.
[09:32] ogra: I profiled it
[09:34] mvo: the log is crashing pastebin :D
[09:34] one sec
[09:34] zyga: heh, I just need the last few lines
[09:35] zyga: well, I need to see why it's not converging, i.e. what it tries to run over and over again
[09:35] [ 618s] trying to run: runMgrForever Doing 1... []
[09:35] [ 618s] trying to run: runMgrForever Doing 1... []
[09:35] [ 618s] trying to run: runMgr1 Do 1... []
[09:35] [ 618s] trying to run: unknown Do ... []
[09:36] runMgrForever is in overlord_test.go ?
[09:36] no, managers_test.go
[09:36] there's a long sequence of those
[09:36] but earlier there's also this
[09:36] do we have a test that doesn't clean up?
[09:36] https://www.irccloud.com/pastebin/8PTTypwd/
[09:36] anyway it seems unrelated
[09:37] then there's a long wall of
[09:37] zyga: how big is the log? maybe you can put it into google drive?
[09:37] sure, one sec
[09:38] zyga: something along the lines of "] Done link-snap Make snap "brand-kernel" (2) available to the system [2019-10-09T08:11:58Z INFO Requested system restart.]" is what I'm hoping for
[09:38] gedit hangs for a while while pasting
[09:38] 2019
[09:38] ogra: mail subject 'booting the pi', 26 Jan 2017
[09:38] ogra: fwiw
[09:39] mvo: 32M
[09:39] mvo: uploading now
[09:40] zyga: woah
[09:41] Chipaca, hmm, that talks about sha3 and keccak
[09:41] ogra: yes
[09:41] ogra: having profiled, those were the slow things
[09:42] ogra: that's what we mean by 'we will do nothing without measuring': measure first, determine what is slow, see how to improve it
[09:42] ogra: optimising away the calls to unsquashfs will do nothing if those calls are not the thing slowing down first boot
[09:43] of course, it's entirely possible that those calls are now slow
[09:43] but after having received exactly 0 responses to my original mail I'm not going to waste time on this again
[09:43] Chipaca, well, they eat CPU via userspace while mount/umount via the kernel dont
[09:44] the outcome of your mail was in the end that ijohnson developed assembly to improve sha3 (even though that has never been merged or is in use anywhere still)
[09:45] PR snapd#7538 closed: tests: use `snap model` instead of `snap known model` in tests
[09:48] PR snapd#7566 closed: snap-repair: add missing check in TestRepairBasicRun
[09:49] * mvo hugs pedronis
[09:52] mvo: #7568 needs a tweak (or I am confused)
[09:52] PR #7568: snap: when running `snap repair` without arguments, show hint
[09:52] there is some mark down in the error message which I don't think we do
[09:53] heh, Markdown
[09:56] pedronis: sure, let me fix this
[09:57] grr old systemd
[10:07] Chipaca I'm collecting some bootcharts I can share with you
[10:07] Chipaca those are the ones ogra mentioned
[10:08] * zyga is somewhat hungry
[10:08] making progress on cgroup branches, I'll have something to share soon
[10:08] trying to eliminate the extra binary now
[10:09] well, it works, just needs some cleanup still
[10:09] mborzecki: ^ FYI
[10:12] ondra: I'm not sure what you expect to do with them though
[10:13] heh, I just noticed that we have cmd/snapd-generator that claims to be snapd-workaround-generator
[10:15] ondra: ogra: if I'm reading the bootcharts right, the unsquashfs calls are not a big contributor to the overall boot? most don't even register as taking any time at all, only three (of 23) take more than a second
[10:17] Chipaca, they eat CPU .. all of them have pretty bold blue bars .... on a multicore system that wont be significant, on single cores it will bite
[10:17] ogra: sure. So if we avoid them we save... .01% of the overall boot time? or how much?
[10:18] not on single core where your IO is rather serialized
[10:18] that boot time is, still, over 5 minutes. The total sum of all time spent on unsquashfs is less than 10 seconds.
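On the "measure first" point above, a throwaway sketch of timing a single-file unsquashfs extraction; the snap path and scratch directory are examples only, and this is not how snapd itself extracts metadata:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()
	// Pull just meta/snap.yaml out of a snap into a scratch directory
	// (assumes /tmp/snap-meta does not exist yet).
	cmd := exec.Command("unsquashfs", "-d", "/tmp/snap-meta",
		"/var/lib/snapd/snaps/core18_1202.snap", "meta/snap.yaml")
	if out, err := cmd.CombinedOutput(); err != nil {
		fmt.Printf("unsquashfs failed: %v\n%s", err, out)
		return
	}
	fmt.Printf("extracting meta/snap.yaml took %v\n", time.Since(start))
}
```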
[10:18] point is it wastes resources that we could avoid
[10:19] sure, there is something else going on with the system ondra took the bootchart on
[10:19] i'm not saying this needs urgent fixing or anything but it is an obvious place for improvement
[10:19] Chipaca my main focus is what is happening between ~40 and ~140 seconds
[10:20] right, it *feels* like there is a Sleep(100) somewhere in snapd's code :D
[10:20] Chipaca it takes almost 100s to start snapd from the snapd snap, and I don't seem to get timing for this anywhere, or logs
[10:20] ondra: have you enabled debug logs?
[10:20] but thats unrelated to the unsquashfs ... i'm just pointing out that unsquashfs is wasteful and it would be nice to improve it
[10:21] i'm not asking for an urgent fix but for considering improving it if there is time
[10:22] ogra: please write something on the forum with data, these IRC discussions will go nowhere
[10:22] will do
[10:22] ogra: what pedronis said. Thanks!
[10:22] Chipaca I'd need to use a custom snapd snap I guess
[10:22] Chipaca or how to do it at first boot
[10:23] ondra: no, you just need to drop a file in etc/systemd/system/snapd.service.d
[10:23] Chipaca also it seems like it's before we start snapd, so logs in snapd will not help
[10:23] ondra: you can do that if you pause the bootloader for ex
[10:23] it would be cool if we could enable debug options for snapd on the kernel cmdline BTW
[10:23] Chipaca can you give me an example of that service, I will drop it into the image
[10:24] adjusting the cmdline on firstboot is very easy .. while injecting a file into the rootfs is hard
[10:24] ondra: $ cat /etc/systemd/system/snapd.service.d/debug.conf
[10:24] [Service]
[10:24] Environment=SNAPD_DEBUG=1 SNAPD_DEBUG_HTTP=7
[10:24] ondra: ^ I also sent an email asking anybody developing things on snapd to have this, ages ago (and it's all over the forum)
[10:25] ondra: or, it might be easier, to just drop those two env vars in /etc/environment
[10:25] * pstolowski walk, bbiab
[10:26] ondra: in core18, given it's a funkier service, you might need to adjust the service name
[10:27] Chipaca Ok thanks, I will try that
[10:27] ondra: i.e. /etc/systemd/system/core18.start-snapd.service.d/debug.conf
[10:27] mvo: I notice this service doesn't source /etc/environment, that's probably not good
[10:28] ondra: or maybe it's snapd.seeding.service?
[10:28] ondra: at this point I don't know which service actually starts snapd :-|
[10:28] in core18
[10:29] Chipaca: oh, good point
[10:29] Chipaca: different ones at different points
[10:29] mvo: do you know how that dance happens?
[10:29] i can't find a snapd.seeding.service but it's in the bootchart
[10:29] there's a bootstrapping one, then it does a bit of seeding, then snapd restarts
[10:30] Chipaca: yes, snapd itself writes it
[10:30] as the real service I think
[10:30] mvo: from where?
[10:30] wrappers/core18.go
[10:30] Chipaca: i.e. when snapd starts and there is nothing on the system yet it calls wrappers/core18.go
[10:30] pedronis: but there isn't a seeding there either
[10:31] are you sure it's seeding?
[10:31] there is no such thing afaik
[10:31] there's a seeded service
[10:31] that one doesn't start snapd
[10:31] pedronis: https://private-fileshare.canonical.com/~okubik/bootchart-20191008-1710.svg
[10:32] pedronis: there's a snapd-seeding.service
[10:32] Chipaca I think it's this one: core18.start-snapd.service
[10:34] Chipaca: it's not a service, it's an ephemeral unit
[10:35] Chipaca: from here (as ogra said): https://github.com/snapcore/core18/blob/master/static/usr/lib/core18/run-snapd-from-snap
[10:35] * pedronis lunch
[10:36] ondra: wrt not sure if it's snapd or not, snapd is running from t=39s at least
[10:36] ondra: so debug logs should shed some light on what's going on there
[10:37] Chipaca, pedronis https://forum.snapcraft.io/t/unsquashfs-vs-mount-during-firstboot-seeding/13614/1 ... (really just a suggestion for times where you guys dont have other things to do :) )
[10:38] Chipaca https://private-fileshare.canonical.com/~okubik/boot-plot.svg
[10:39] Chipaca if you look here, we only mount snap-snapd-5038.mount around 130 seconds
[10:39] Chipaca and snapd is started at 140 seconds
[10:39] yep
[10:40] Chipaca and my interest is before 140
[10:41] Chipaca also from the boot-chart you can see that during that time the system is relatively idle
[10:41] once it starts seeding it pegs one core; well, there we would need to consider parallel jobs to speed it up, but that first 100s seems to be easier to address now
[10:43] ondra: what is it waiting for?
[10:44] Chipaca you tell me :)
[10:44] Chipaca why do you think I'm digging into it, I want to know as well :)
[10:46] ondra: can you stop it inside the initramfs just before it pivots in?
[10:46] i.e. before systemd kicks in
[10:46] Chipaca, pedronis, ondra https://forum.snapcraft.io/t/allowing-snapd-debug-options-to-be-set-on-kernel-cmdline/13615
[10:46] but after everything is mounted
[10:46] :)
[10:46] ogra: tks
[10:47] we need wishlist tags in the forum :D
[10:47] Chipaca possibly, what do you want to do there?
[10:48] ondra: add 'set -x' to run-snapd-from-snap :-)
[10:48] ondra, break=bottom ... then you can modify /writable
[10:48] (indeed only stuff thats not inside snaps :P )
[10:49] Chipaca but that one is the core18 snap
[10:49] ondra: copy it somewhere, edit it, bind mount it back
[10:49] heh
[10:50] Chipaca you are bringing big guns right away :)
[10:50] PR snapd#7576 opened: spread.yaml,.gitignore: sync ignored files
[10:52] Chipaca I guess I can also create my own core18 snap, can't I?
[10:52] ondra: probably, yes :)
[10:53] pstolowski: btw, are you using core18 for your spike ?
[10:54] ogra is it right to have in the core18 snap things like dev/null dev/random dev/urandom dev/zero ?
[10:55] it doesnt do harm
[10:55] ogra seems to me useless
[10:55] (but in general these things should be handled by devtmpfs which we mount already in the initrd)
[10:55] yeah, they likely are ... unless you have a kernel without devtmpfs
[10:56] (these are rare but i guess they still exist)
[10:56] Chipaca anything else you want me to modify in the core18 snap? set -eux -> set -x?
[10:56] Chipaca I will sprinkle that file with a few echos once I'm on it
[10:58] set -x should be enough ... you dont need eu (said boris johnson ...)
[10:59] ondra: get some timestamps in there as well
[10:59] ondra: dunno where the log will end up :)
[10:59] hopefully the journal?
[10:59] if so no timestamps needed
[10:59] dunno
[10:59] Chipaca I will see
[11:00] Chipaca let me run a test boot
[11:18] pstolowski: pedronis: pushed a fix to #7443, we had an actual bug on systems with systemd older than 233
[11:18] PR #7443: timeutil: fix schedules with ambiguous nth weekday spans
[11:21] mborzecki: thanks, I'm a bit confused about the places where the comments mention 28 and the ranges have 29
[11:24] pedronis: heh, right, so the 28th day is part of the range generated by actual calculation, while 29-31 is the bit that varies from month to month (or year), but technically a month can have anywhere from 28 to 31 days
[11:25] perhaps i should leave a note that the 28th is included in the range added further down the code
[11:27] * Chipaca takes a break
[11:31] re
[11:32] pedronis: no, not yet, i'm using a cloud image (core + lxd)
[11:33] pstolowski: ok, that will have its complications
[11:33] maybe
[11:36] * zyga -> tea
[11:38] PR snapd#7577 opened: overlord: set fake sertial in TestRemodelSwitchToDifferentKernel
[11:50] pedronis: force pushed a comment about the 29-31 day range
[11:56] Chipaca https://paste.ubuntu.com/p/fQQXtCBSnQ/
[11:57] Chipaca with different formatting https://paste.ubuntu.com/p/q5MFkKy5cw/
[11:57] Chipaca attempts to overlay that file failed though
[11:59] ondra: and is the time you need to know about between 11:36:25 (131.827907) and 11:38:20 (247.375225), there?
[12:00] ondra: because if so, that _is_ snapd running, and you should be able to set env vars to get it to print more debug info
[12:01] ondra: add --setenv="SNAPD_DEBUG=1 SNAPD_DEBUG_HTTP=7" to the systemd-run call, there
[12:02] Chipaca but 'Started Start the snapd services from the snapd snap.' is only printed at the end
[12:02] Chipaca and don't you connect eligible plugs and slots before you start snapd?
[12:02] ondra: yes, but "started" does not mean "just did the exec"
[12:02] ondra: snapd is the thing doing the connecting
[12:02] Chipaca true
[12:02] those log lines about connecting come from snapd
[12:03] oooh! kernel parameter!
[12:03] systemd.setenv=SNAPD_DEBUG=1
[12:03] ^ that should work!
[12:03] says the manpage
[12:03] Chipaca OK let me test
[12:04] ondra: you can pass it more than once (if you also need _HTTP) but just SNAPD_DEBUG=1 should be enough
[12:04] Chipaca OK
[12:05] Chipaca I might have got the bind mount working
[12:06] Chipaca so i will get logs from this boot first
[12:06] Chipaca do you want logs from snapd.service then?
[12:06] ondra: just the ones from that first run
[12:07] Chipaca but from which service?
[12:07] ondra: snapd-seeding.service
[12:13] Chipaca do you mean snapd.seeded.service ?
[12:13] ondra: no, I mean snapd-seeding.service
[12:13] ondra: as in what systemd-run starts in that script
[12:14] that's the mystery snapd
[12:14] that's the one you want to look at
[12:14] Chipaca I cannot see such a service
[12:14] ondra: where are you looking?
[12:14] systemctl status snapd-seeding.service
[12:14] Unit snapd-seeding.service could not be found.
[12:15] ondra: it doesn't have a unit file
[12:15] ondra: journalctl will know of it though
[12:15] ondra: journalctl -u snapd-seeding.service should have it
[12:15] Chipaca OK I need to re-run the boot as the logs rotated now
[12:16] ondra: wait a mo'
[12:16] Chipaca how can I increase the journal buffer?
[12:16] ondra: there was a way to make it non-ephemeral, maybe ogra knows offhand
[12:16] ondra: systemd.journald.forward_to_syslog
[12:17] ondra: in the commandline
[12:20] Chipaca but on core18 we have no syslog
[12:20] i didnt know there is a cmdline option ... i only know you can create a dir (was it /var/log/journal ???) that will then automagically be used for journald logs
[12:20] ondra: I didn't know that :)
[12:20] yeah, no more syslog
[12:20] Chipaca will it create it?
[12:20] no
[12:20] Chipaca then no syslog
[12:20] bah, i dunno
[12:20] i guess it will just try to write to logger
[12:21] but
[12:21] ondra: but
[12:21] (and not find it ...)
[12:21] Chipaca so I do actually have a log from the seeding service
[12:21] ondra: edit that script, and redirect the output from systemd-run to somewhere
[12:21] that should work
[12:21] well, creating that dir should also work for persistent logs
[12:21] ondra: actually it might _not_ be in the journal, it might all be spewed out from the script
[12:21] Chipaca but it's useless https://paste.ubuntu.com/p/sw5sNBp6WK/
[12:21] they are binary files though
[12:21] I could check but I'm not sure it's the same in 16 and 18
[12:22] ondra: that's not debug output :-|
[12:22] Chipaca booting now with the debug env from the cmd line
[12:24] and I can confirm the output from the systemd-run does reach the journal (but I think your recent pastebin does, too)
[12:26] Chipaca actually reading logs over ssh seems to have messed it up, check this https://paste.ubuntu.com/p/ZFB78ycbMz/
[12:26] ondra: if the rotation is still getting in the way, we can either tweak journald.conf, or have it forward to kmsg and bump the kmsg buffer (but that might be expensive on a lowmem board)
[12:27] Chipaca it's fine, if I know what logs to get there is time to do it once I can access the seeded device
[12:27] ondra: is that with -u snapd-seeding.service ?
[12:27] ondra: it looks like the logs of the start-snapd-wotsit script
[12:27] which would not be in the snapd-seeding.service log
[12:28] (and yes there is noise in that that could be avoided... :-| )
[12:30] Chipaca sudo journalctl --all -u snapd-seeding.service is always empty
[12:30] ondra: what was https://paste.ubuntu.com/p/sw5sNBp6WK/ ?
[12:31] Chipaca core18.start-snapd.service
[12:31] gr
[12:31] welcome to our world :P
[12:32] ondra: and if you forward_to_kmsg ?
[12:32] (debugging the indebuggable)
[12:33] ondra: I've got another option
[12:34] Chipaca OK now with extra verbosity I do not get the start of the log :(
[12:34] ondra: forwarding to kmsg?
[12:34] ondra: or without?
[12:34] I don't know which of all the options you've gone for :)
[12:34] Chipaca without
[12:35] Chipaca BTW it's clear from journalctl of the whole system, there is only kernel chatter from 40 to 140 seconds
[12:35] Chipaca no changes to where the journal goes for now
[12:37] ondra: ok, I'm downloading a core18 for amd64, and will play around with it until I can get logs for that first boot snapd
[12:37] ondra: I'll let you know
[12:37] Chipaca looks like we have no journald.conf at all, so it should be easy to add one to etc
[12:37] ondra: feel free to do so :)
[12:38] seems silly that it doesn't have a kernel option for setting storage as it has for everything else
[12:44] hmm
[12:44] ondra: so, on amd64, all I need to do is set systemd.setenv=SNAPD_DEBUG=1 in the kernel commandline, and when I log in after creating my account and etc I have logs for snapd-seeding.service
[12:45] ondra: https://paste.ubuntu.com/p/kSmxKc6zKw/
[12:46] ogra@pi4:~$ ls /etc/systemd/journald.conf
[12:46] /etc/systemd/journald.conf
[12:46] my core18 image has journald.conf here
[12:46] ogra not mine
[12:47] weird
[12:47] ogra I can add it, but how do I increase the max size
[12:47] given we are most likely using the exact same core18 snap
[12:47] it's set to 10M
[12:47] ondra, ogra: where did you get yours from?
[12:48] mine is my pi4 custom image ... but that uses core18 from stable
[12:48] ogra@pi4:~$ snap info core18|grep refresh
[12:48] refresh-date: 6 days ago, at 22:09 UTC
[12:49] journald.conf is there, and writable, in the amd64 core18
[12:49] same here
[12:49] ogra@pi4:~$ sudo touch /etc/systemd/journald.conf
[12:49] ogra@pi4:~$
[12:49] ogra ah RuntimeMaxUse=
[12:50] ondra: what core18 are you running, that it doesn't have this file?
[12:51] Chipaca edge
[12:51] Chipaca rev 1202
[12:52] ondra: what arch?
[12:52] armhf
[12:53] same here ... but stable
[12:54] ondra: core18 1202 has etc/systemd/journald.conf
[12:55] Chipaca armhf?
[12:55] 1202 is armhf
[12:55] 1202
[12:55] :)
[12:55] would be surprising if it vanished all of a sudden
[12:55] ondra: UBUNTU_STORE_ARCH=armhf snap download core18 --edge
[12:55] ondra: unsquashfs core18_1202.snap etc/systemd/journald.conf
[12:56] ondra: less squashfs-root/etc/systemd/journald.conf
[12:56] ¯\_(ツ)_/¯
[12:56] ogra: yeah it'd be a bug
[12:56] Chipaca OK my bad, I must have had a typo when searching
[12:56] sorry I have it too :)
[12:57] sergiusens, hi, any chance you could take a look at https://github.com/sergiusens/snapcraft-preload/pull/34 and https://github.com/sergiusens/snapcraft-preload/pull/36 ?
[12:57] PR sergiusens/snapcraft-preload#34: preserve LD_PRELOAD from the calling environment (fixes #33)
[12:57] PR sergiusens/snapcraft-preload#36: use SNAP_INSTANCE_NAME instead of SNAP_NAME for prefixing /dev/shm fi…
[12:57] ondra: so now I'm suspicious of your "it's empty" :-|
[12:57] anyway, i've got an up to stand
[12:58] Chipaca no it is not :)
[12:59] Chipaca ogra anyway it only exists in the core snap, and not in the overlay /etc
[12:59] which is probably good, as I can add mine
[12:59] sorry auth
[12:59] I'm increasing the buffer 4 times so I should now have more time
[12:59] core18 has an overlay /etc ??
[13:00] Chipaca overlay of /etc/systemd/
[13:00] Chipaca how would we add new units :)
[13:02] ogra@pi4:~$ grep etc/systemd /etc/system-image/writable-paths
[13:02] /etc/systemd auto persistent transition none
[13:02] ogra@pi4:~$ cat /etc/issue
[13:02] Ubuntu Core 18 on \4 (\l)
[13:03] if you dont have the journald.conf file then something with your initrd is broken
[13:04] (transition means it copies the content from core to /writable on first mount)
[13:04] ondra: sorry, overlay is a bad word (overlayfs ...)
[13:05] well, the point is that it *should* be in the overlay /etc
[13:05] if it isnt, there is a bigger prob
[13:05] ogra@pi4:~$ ls /writable/system-data/etc/systemd/journald.conf
[13:05] /writable/system-data/etc/systemd/journald.conf
[13:05] ogra did we find more bugs ? :)
[13:05] smells like it ... if you are sure the file isn't there at least
[13:06] ogra if I have a file there, will it override it?
[13:06] if you have the file there it is fully writable
[13:06] so you can just edit it
[13:07] ogra no, if I already have a file there, it will not nuke it, right?
[13:07] ogra ignore me, it should not
[13:08] yeah, it wont
[13:09] transition only copies once on first mount ... in subsequent mounts it just uses whats there
[13:09] (so you could even create a new file there)
[13:10] Chipaca https://paste.ubuntu.com/p/bq6b39PBJ4/
=== ricab is now known as ricab|lunch
[13:11] ondra: so there you go
[13:11] ondra: and, snap debug timings 1, for more info
[13:13] Chipaca ?
[13:13] Chipaca how to get more logs?
[13:13] "error trying to compare the snap system key: system-key missing on disk"
[13:13] is that relevant ?
[13:13] ogra: nah, that's expected on first boot
[13:13] or does the system-key only come in with initialize-device ?
[13:13] right
[13:14] so the gap is before "Load: validated crc32" ... i guess the checksumming is slow then ?
[13:14] (once again)
[13:15] crc32 wow what a throwback
[13:15] ogra no, that is a check of the lk boot env
[13:16] ogra that takes 1ms
[13:17] hmm, well, there is a long gap before it runs
[13:17] ondra: did you run 'snap debug timings 1'?
[13:17] ogra exactly, and I want to know why
[13:18] ondra: or even just 'snap changes'
[13:18] ch-ch-ch-changesss
[13:18] * ogra hums david bowie
[13:18] Chipaca https://paste.ubuntu.com/p/FsdG3V4Fjg/
[13:19] PR snapd#7578 opened: sandbox/cgroup, overlord/snapstate: move helper for listing pids in group to the cgroup package
[13:19] zyga: ^^
[13:19] thanks
[13:19] ondra, that doesnt seem to list the bootloader stuff ... perhaps check change 2 instead ?
[13:20] ondra: also, --verbose
[13:20] ogra that only shows the last 2 changes :)
[13:21] Chipaca no, it always starts at number 3
[13:21] those are task numbers, not change numbers
[13:21] but
[13:21] ondra: snap debug timings --verbose 1
[13:22] ondra: and, as ogra said, also look at change 2
[13:23] change 2 will have the keys generation
[13:24] mborzecki: look again
[13:26] Chipaca it does not give me change 1 or 2 https://paste.ubuntu.com/p/Yj5Q2HYpp9/
[13:27] ondra: the "1" in "snap debug timings --verbose 1" is change 1
[13:27] ondra: please also run the same command but for change 2, i.e. "snap debug timings --verbose 2"
[13:27] Chipaca and 'snap debug timings --verbose 2' will only print timing around change 206
[13:27] ondra: the numbers under the ID column in that output are about tasks
[13:28] Chipaca ah sorry
[13:28] soooo complicated !
[13:28] ondra: what else do you have in "snap changes"?
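As a side note on PR snapd#7578 mentioned above: on a unified (v2) hierarchy, "listing the pids in a group" essentially means reading the group's cgroup.procs file. A rough sketch of that idea, not snapd's actual helper:

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pidsInGroup is an illustrative stand-in for the helper moved in #7578:
// it parses cgroup.procs under the given cgroup directory.
func pidsInGroup(groupDir string) ([]int, error) {
	data, err := os.ReadFile(filepath.Join(groupDir, "cgroup.procs"))
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, field := range strings.Fields(string(data)) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			return nil, err
		}
		pids = append(pids, pid)
	}
	return pids, nil
}
```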
[13:28] Chipaca so change 2 is two tasks, and both take 1 combined
[13:29] Chipaca just starting services and initialize device
[13:29] ondra: can I see the output of 'snap changes' please
[13:30] what do you pay ?
[13:30] Chipaca https://pastebin.canonical.com/p/pzJTRfPtdq/
[13:30] :P
[13:30] ondra: ok, the changes you want to look at are 1 and 6
[13:30] yeah, 6
[13:31] Chipaca https://paste.ubuntu.com/p/dtgwn5C9SK/
[13:31] is that waiting 56s on key generation?
[13:32] ondra: yes
[13:32] ondra: so
[13:32] ondra: your random is probably empty
[13:32] zyga: replied, i had a really trivial check in mind there
[13:32] two times actually, no ?
[13:32] but is that blocking boot?
[13:32] ondra: get more randomses
[13:32] ogra: no, one is the sub of the other
[13:32] ah, k
[13:32] ondra: yes, you can't boot with an empty random pool
[13:33] right, i see the indentation in the summary when looking closer :)
[13:33] ondra, did you already add the cmdline changes for rng_core ?
[13:33] ondra: your device probably has a hardware rng you can plug in … somehow?
[13:33] Chipaca still, even if blocking, and that's a bit odd to block at that stage, it only explains 56s there
[13:34] ogra yep rng_core is there
[13:34] ogra rng_core.default_quality=700
[13:35] yeah, that should have improved it a little bit ... (i'd expect a few seconds less ... so 80-90 vs the 100 before)
[13:36] ogra I do not think it improved
[13:36] well, perhaps a missing kernel config ? rng_core is tied to the kernel SW rng
[13:36] (and will be quietly ignored if the kernel doesnt have it)
[13:37] ogra what config would that be?
[13:37] Chipaca when do we need that random number the first time?
[13:38] ogra CONFIG_HW_RANDOM ?
[13:39] ogra as that is enabled
[13:39] CONFIG_HW_RANDOM and sub-options of that (see menuconfig)
=== chesty_ is now known as chesty
[13:41] ogra is there actually a device node which I should see?
[13:41] ondra, tyhicks was the one that went over the whole rng stuff back then ... he might know what other options are needed (he also suggested the cmdline option ... which back then helped a lot on a beaglebone where we saw the slowness first)
[13:42] not with the in-kernel rng
[13:42] if you want to use a device node there is likely some other kernel config option thats specific to that HW
[13:42] and gives you HW rng ... (if the SoC actually has it at all)
[13:43] you probably want to take a look at the original android config
[13:43] ogra like CONFIG_HW_RANDOM_MTK :)
[13:43] ogra and enabled
[13:43] fooor example :)
[13:43] that might need userspace bits ... not sure
[13:45] ondra: you probably have things in dmesg about not having enough entropy?
[13:45] when we need it → to generate keys
[13:46] 🔑
[13:46] we should just generate them at build time ... makes everything easier ... and everyone has the same key ... :)
[13:47] ondra, btw, do you also see the slowness on second boot ?
[13:47] (i assume not)
[13:47] ogra nope, sub 30 seconds
[13:47] perhaps we should simply tell the customer to boot the device once in-factory
[13:48] ogra from the kernel code, there is a driver using TZ to generate random numbers, so this should be fast
[13:49] make it part of the flashing procedure
[13:49] "power on ... leave it sit for 5min"
[13:49] (unconnected)
[13:55] a timer wont be enough to get entropy
[13:55] you need interrupts
[13:55] attach a mouse and wiggle it :)
[13:55] Chipaca stupid question, but the task ID is assigned when the task is started, is that correct?
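A quick way to check the "empty random pool" theory above is to read the kernel's entropy estimate during that first boot; purely illustrative, not snapd code:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// The kernel exposes its current entropy estimate (in bits) here.
	data, err := os.ReadFile("/proc/sys/kernel/random/entropy_avail")
	if err != nil {
		fmt.Println("cannot read entropy estimate:", err)
		return
	}
	fmt.Println("available entropy bits:", strings.TrimSpace(string(data)))
}
```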
[13:55] Chipaca and tasks from change id 1 will run before tasks from change 2
[13:55] Chipaca because random number generation is run as task 200+
[13:55] Chipaca this is way after snaps were seeded
=== ^arcade_droid is now known as zarcade_droid
=== cjp256_ is now known as cjp256
=== arnatious_ is now known as arnatious
[13:56] wow lots of dead ppl here
[13:56] * Chipaca got better
[13:56] Chipaca did you see my question about task IDs?
[13:57] Chipaca is that ID assigned when the task was started?
[13:57] Chipaca so I can assume they go in sequence?
[13:57] ondra: I did not see your question
[13:57] Chipaca OK, the last two questions I did ask just now
[13:57] ondra: they go in sequence at the time of task creation, which might not line up exactly with task execution
[13:58] Chipaca OK, so then this random key generation might take some time, but it's run way way later
[13:58] Chipaca so it's not that 100s we are looking into
[13:58] Chipaca it's task id 200+ and change 6
[13:59] Chipaca we are looking into the 100s before or at the beginning of change 1
[13:59] Chipaca like task id 1 and 2
[13:59] ondra: I … what? why?
[13:59] ondra: if it's before snapd starts, then it's somebody else you should be talking to
[14:00] Chipaca that 100s is before we start seeding, the rsa key is generated after seeding
[14:01] Chipaca nope, it's this log https://paste.ubuntu.com/p/5jx9kcfZXV/
[14:02] Chipaca or in that seeding log, there is that pause
[14:02] Chipaca the rsa-key comes after seeding
[14:03] ondra: ok
[14:05] Chipaca so can I get more logs there?
[14:07] ondra: did you ever get me a 'snap debug timings --verbose 1'? i might have closed it already
[14:07] Chipaca yes I did, https://paste.ubuntu.com/p/Yj5Q2HYpp9/
[14:08] Chipaca task 4 there is too late, this is after that quiet period
[14:09] ondra: 101ms is too late?
[14:10] ondra: can you put actual timestamps on the journal output, or is it too late for that?
[14:10] ondra: the way I read the one you pasted is probably wrong
[14:10] Chipaca probably too late now
[14:11] Chipaca I have timing from the start of the boot
[14:11] ondra: ok, maybe you can help me then
[14:11] ondra: https://paste.ubuntu.com/p/bq6b39PBJ4/
[14:11] Chipaca which is then 0
[14:11] in that one
[14:11] ondra: the numbers in the left column are seconds in offset since the 'logs start', yes?
[14:11] Chipaca time in that log is relative from 0, which is device power on
[14:11] ondra: ah
[14:11] Chipaca since device power on
[14:12] ondra: and how do I compare it with other things?
[14:12] Chipaca and yes, seconds from device power on
[14:12] Chipaca all the logs and the boot chart are in seconds from device power on
[14:12] Chipaca snap changes are not though
[14:13] nor is '-- Logs begin at Wed 2019-10-09 13:00:46 UTC, end at Wed 2019-10-09 13:10:02 UTC. --'
[14:13] Chipaca so it depends, what do you want to compare it to?
[14:13] Chipaca OK there you go, you have also the log start
[14:14] ondra: that doesn't help, does it?
[14:14] ondra: what wall clock time does the gap happen at?
[14:15] bootchart looks at the kernel counter
[14:15] which starts at 0 when the kernel fires up
[14:15] Chipaca but what are you interested in? Absolute time when the service started, or actual time processing?
[14:15] ogra and I did the same for the journal logs
[14:15] (same thing you get with dmesg if you dont use -T )
[14:16] ogra so you can link them to the bootchart
[14:16] ogra and the same you get with journalctl -o short-monotonic
[14:16] ah, i didnt know that one (i rarely want the kernel timer anyway)
[14:17] Chipaca but if I tell you wall clock time what will you compare it to?
[14:17] ondra: what's the output of 'snap debug timings --verbose --ensure=seed' ?
[14:18] Chipaca https://pastebin.canonical.com/p/3NRRGZYJ9S/
[14:18] Chipaca '75739ms - state-from-seed populate state from seed'?
[14:18] ondra: how many more seconds do you need to find?
[14:18] Chipaca what is that?
[14:19] Chipaca what do you mean 'how many more seconds'?
[14:19] ondra: last time I found you 50s and your first reaction was "i need more"
[14:19] so here i am finding you more
[14:19] that's a chunk of time
[14:19] Chipaca no, that 50 seconds you found is when services run
[14:20] Chipaca device communicated
[14:20] Chipaca so they are not helping me at all
=== ricab|lunch is now known as ricab
[14:20] yes, that was your second reaction
[14:21] Chipaca that's 50 seconds to connect to the store to check for updates, but at that point the device will already be communicating with the user, so that 50s is irrelevant for me
[14:21] ondra: I regret trying to help you, now
[14:22] Chipaca well then just tell me "yep, during first boot we stop for 100s before we start seeding, this is a feature"
[14:23] the first thing I told you before trying to help you was that years ago I pointed out that first boot on armhf takes 5+ minutes because of things
[14:23] and now, after several hours of trying to walk you through it all, I point you at the logs that show you that same bit of information
[14:23] I have no idea what you're so upset about at this point
[14:23] Chipaca this is a smart speaker, it takes 5 minutes to boot before it can connect to the phone. that 5 minutes does not include and is not affected by that 50 seconds of key generation, as you do not have internet anyway.
[14:23] but you have the data, right there
[14:24] Chipaca but you can connect over BT to provision WiFi while the device is generating the rsa-key
[14:24] I am not talking about those 50 seconds for key generation, I am talking about _these_ 300 seconds of seeding
[14:24] Chipaca so it's not having any effect on user experience
[14:25] Chipaca I accept seeding takes time, related to the number of snaps
[14:25] Chipaca but the snapd snap seeds for 150 seconds, that seems odd
[14:27] Chipaca and nobody complains about that, I'm just wondering why we have 100s of nothing. Now you explain to me this is a feature. And you should just take it
[14:27] Chipaca thanks for help
[14:31] ondra: at what point did I say this was a feature and you should just take it
[14:32] Chipaca at the point where you said I was not happy with the 50s discovery, which is post seeding
[14:32] ...?
[14:32] Chipaca I told you those make no difference
[14:34] ondra: I still don't see how I told you that. I see you saying I did, though.
[14:35] or maybe saying I _should_
[14:35] but I won't, because it isn't
[14:39] ondra: could you please snap install http, and pastebin 'http snapd:///v2/changes select==all'
[14:45] Chipaca https://pastebin.canonical.com/p/k8Q6ryzwhn/
[14:47] ondra: where does 2019-10-09T13:06:29.005541125Z fall WRT the missing time?
[14:47] is that before or after the time we're looking for?
[14:48] Chipaca that's after
[14:50] Chipaca I think we are looking for things before 2019-10-09 13:03:00
[14:51] Chipaca more precisely between 2019-10-09 13:01:26 and 2019-10-09 13:03:00
[14:51] ondra: where do those timestamps fall in https://paste.ubuntu.com/p/bq6b39PBJ4/ ?
[14:55] Chipaca between 40 and 140
[14:56] Chipaca from what I understand, the logs start is when the kernel started
[15:03] * zyga calls it a day
[15:10] ondra: just in case, is there anything in any of the other --ensure= options?
[15:10] anything in the times we're looking at
[15:13] Chipaca I did get the whole journal if you want
[15:13] Chipaca I could not see anything there though
[15:14] Chipaca let me paste it for you
[15:14] ondra: there's a little more detail in the timings output
[15:15] ondra: can you patch the snapd for first boot, or is that too much?
[15:15] Chipaca no, that is actually fairly easy
[15:15] Chipaca unlike core, as I learnt
[15:16] Chipaca https://pastebin.canonical.com/p/Bzv44GKwmD/
[15:17] ondra: out of curiosity, what are these about?
[15:17] [ 42.310781] localhost kernel: CPU3 killed.
[15:19] Chipaca just some MTK kernel verbosity :)
[15:19] hmm
[15:33] Chipaca, at least it is a friendly killer ... i have seen other boards where daemons kill their children and such ... way more blood there
[15:33] ondra: so if I give you a patch for snapd you can apply it to your snapd?
[15:34] Chipaca yep
[15:34] Chipaca that won't be a problem
[15:35] ondra: http://paste.ubuntu.com/p/DWYXndYs5C/
[15:35] ondra: won't be too informative but hopefully it'll give us somewhere to start looking
[15:36] hmm
[15:36] ondra: it'll need the debug trick still
[15:38] Chipaca sure, I will let you know once I have it
=== pstolowski is now known as pstolowski|afk
[15:47] * cachio lunch
[15:52] Chipaca it will take some time, as my LP git sync has now been waiting to run the sync 'as soon as possible' for the past 15 minutes
[15:52] Chipaca time is a relative thing
[16:03] blame einstein
[16:04] only important if you're going fast, so it doesn't apply to this board
[16:04] * Chipaca hides
[16:07] If you're routinely waiting for an LP git import, it's probably a sign that you should be hosting the repository directly on LP instead. Imports are best-effort
[16:12] but thats less funny because you cant complain then ...
[16:20] cjwatson it's a bit of a pain to push to two locations as snapd is in github, but I guess it's the best option
[16:20] cjwatson as now I'm waiting over half an hour and still nada
[16:23] folks, so: mir-kiosk has a plugs: - x11, so it offers x11 for other snaps to connect to it, right?
[16:23] ondra: Hm, one of our importds doesn't seem to be doing much though, let me see
[16:27] cjwatson thanks
[16:27] cjwatson I did create a copy of the git repo to kick a build now, so it should be OK
[16:32] ondra: I'm off but I'll check back later tonight
[16:32] Chipaca no worries, I have some other tasks anyway, let's check tomorrow
[17:16] ondra: Repaired the stuck worker, so it should be doing a better job of keeping up with imports now anyway
[17:18] cjwatson thank you :)
[17:32] Chipaca https://paste.ubuntu.com/p/WMKkCPSBjg/
[17:47] * cachio afk
[18:00] roadmr, i think that plug is only for running mir-kiosk on desktop systems .... to actually run an x11 app on top of mir-kiosk (which only offers wayland to apps) your app needs to ship Xwayland and set that up through a wrapper .... the app also needs a loopback slot/plug x11 setup
[18:01] roadmr, i.e. lines 50, 56 and 81 in https://github.com/ogra1/opencv-demo-snap/blob/master/snap/snapcraft.yaml
[19:24] thanks ogra, that'll be useful as a reference
[20:25] PR snapd#7579 opened: interfaces/network-setup-observe: add Info D-Bus method accesses