[02:33] PR snapcraft#2066 closed: errors: feature flag error reports
[02:36] PR snapcraft#2069 opened: Reports
[04:43] jjohansen: artful is also affected. I will give you more results today
[04:44] zyga: ah good, I was going through the diff and going htf is artful not affected and bionic is :)
[04:48] Mainline is not affected or very little
[04:48] Loading one profile over and over leaks very very quickly
[04:48] Maybe some new table is not released?
[04:49] I wrote some tests but went to sleep. I will keep looking today
[05:03] mainline certainly has a leak
[05:03] which kernel version for mainline are you not seeing it on?
[05:03] zyga: ^
[05:05] jjohansen: I built mainline from yesterday, I was at e241e3f2bf97
[05:05] okay, thanks
[05:06] jjohansen: when I say there was no leak I mean that loading a profile over and over (unchanged) ran for over 30 minutes with minimal memory bump (probably noise from other programs)
[05:06] jjohansen: at most 300MB
[05:06] jjohansen: on a affected kernel a few minutes of this would consume all my ram
[05:07] jjohansen: I will give you more data soon, sorry, yesterday I just collapsed
[05:08] jjohansen: this is the base64-encoded binary profile I was loading in a loop http://paste.ubuntu.com/p/Jfs3RRKcPw/
[05:11] jjohansen: on 4.13.16 inserting that 10K times leaks 440MB
[05:11] (on amd64)
[05:12] jjohansen: perhaps other profiles we tested inside spread+snapd leaked more memory but I wanted to keep using one profile for experiments
[05:12] morning
[05:13] hey mborzecki, good morning
[05:13] any fires today?
[05:14] mborzecki: no, I think all is the same for now
[05:31] jjohansen: on xenial kernel the jump is from 946 all the way up to ~2GB
[05:31] (this time using distribution kernel, not my build of the corresponding tag)
[05:31] jjohansen: xenial kernel is misleading, this was 4.13.0-37
[05:35] jjohansen: 4.4.0-119 on xenial is also affected but very slightly so, same profile, same count, 626MB->660MB
[05:35] jjohansen: I'll test intermediate kernels now
[05:40] jjohansen: 4.8.0-58 goes from 699M -> 773M
[05:42] jjohansen: 4.10.0-42 goes from 698M -> 723M
[05:42] jjohansen: so that feels like noise so far
[05:42] the real jump is in 4.13
[05:42] where we drop significant amount of memory
[05:48] jjohansen: 4.15.0-13 goes from 640M to 1.61G
[05:50] jjohansen: so for all practical purpose the diff between 4.10 and 4.13 has introduced the major part of the leak
[05:51] jjohansen: but note that even on 4.4 there's some memory going somewhere, maybe that's just slab growing
[05:51] jjohansen: I'll introduce a variant that does 10M insertions to see if slab stabilizes
[06:12] zyga: do you know if issue #3 'Memory use on minimal/constrained systems' had any further developments?
[06:12] jjohansen: on the xenial 4.4 kernel 10M insertions doesn't seem to actually leak memory, after some initial growth (of non-free memory) slab stabilises at 929M and just stays there
[06:12] mborzecki: no, I didn't focus on it
[06:13] i know we said it's won'tfix for 18.04, but didn't see any messages that would indicate it was further discussed later yesterday
[06:16] mborzecki: sorry, I don't know more than that
[06:16] mvo: did you end up having the meeting with cloud guys?
[06:54] I'll terminate testing 4.4., it's pretty much rock solid
[06:55] mborzecki: hi, some of the things you worked on recently (autostart, timers? ) need to be added here https://forum.snapcraft.io/t/the-snap-format/698 ?
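Note on the measurements above: the per-kernel figures (for example 626MB->660MB on 4.4.0-119, 640M to 1.61G on 4.15.0-13) come from loading the same profile many times and comparing memory use before and after. The following is a minimal sketch of such a harness, not the script actually used in the channel. The names `parse_meminfo`, `used_kb` and `measure_leak` are hypothetical, and taking `MemTotal - MemAvailable` from /proc/meminfo as the "used" figure is an assumption; the log does not say which figure was read.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "MemTotal:       16384256 kB" or "Active(anon):  123 kB".
        m = re.match(r"^(\w+(?:\(\w+\))?):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def used_kb(fields):
    """Assumed definition of 'used' memory: total minus available."""
    return fields["MemTotal"] - fields["MemAvailable"]


def measure_leak(load_once, iterations, read_meminfo):
    """Call load_once() `iterations` times and return the growth in used kB.

    load_once    -- callable that loads the profile once (hypothetical; on a
                    real system it would need root and the AppArmor interface)
    read_meminfo -- callable returning the current /proc/meminfo text
    """
    before = used_kb(parse_meminfo(read_meminfo()))
    for _ in range(iterations):
        load_once()
    after = used_kb(parse_meminfo(read_meminfo()))
    return after - before
```

On a real machine `read_meminfo` could simply be `lambda: open("/proc/meminfo").read()`. A single before/after pair is noisy, which is why the discussion above treats differences of a few tens of MB as noise and only the 4.13+ jumps as a leak.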
[06:56] pedronis: thanks, will do
[06:56] mborzecki: as usual we need put then version (2.xx+) where it starts working
[06:58] zyga, mborzecki yeah, we had a meeting yesterday. we will not do anything right now, its too risky, but we want to prepare so that we can provide a fix post-release asap
[06:58] jjohansen: 4.4 is rock solid, doesn't leak memory over extreme number of insertions, I'm looking at 4.10 now and it also looks good, memory use stops at ~1.03GB after 10s of thousands of insertions
[06:59] jjohansen: I'll keep it running for some more time and then try 4.13 where I suspect we really leak memory constantly
[06:59] mvo: sounds very godo
[06:59] good
[07:03] pedronis: mvo: one thing about exiting when idle, we don't have snap.refresh.timer anymore to wake us up, but we could schedule a command to run as a on-demand timer using systemd-run
[07:04] moin moin
[07:07] zyga: nice findings on the kernel mem leak front
[07:07] mborzecki: indeed
[07:07] mvo: i hope we can find the leak soon enough :-)
=== pstolowski|afk is now known as pstolowski
[07:13] good morning
[07:14] mborzecki: it depends what's the goal
[07:14] mvo: is the plan to make exit on idle, generalized behavior?
[07:15] mvo: did you discuss just timings or also a bit the goals?
[07:17] mborzecki: is setting configuration with "set system" landed?
[07:18] pedronis: yes
[07:18] it's on edge but not 2.32 , right?
[07:18] but iirc it's in master only
[07:18] ok
[07:19] bit unfortunate, but oh well
[07:20] hm timer services are in egde too
[07:20] but not in 2.32.*
[07:20] ah
[07:20] * pedronis admits to have lost track of things a bit (2.32 being so long lived)
[07:21] mborzecki: anyway that's bit less of an issue
[07:21] the issue with set core vs set system is that it must work before one installs core
[07:21] pedronis: gustavo preferes the wake up, do stuff, exit approach over not doing anything at all. but its still a bit undecided so worthwhile to have another meeting to discuss options.
I personally still favour the "do as little as possible via units" approach
[07:21] pedronis: yeah, 2.32 is the new 2.33 :/ its a bit annoying
[07:22] so basically we need to document set core and support it
[07:22] for the life of bionic
[07:22] (more or less)
[07:23] mvo: discussion for monday I suppose?
[07:23] pedronis: yeah, *maybe* today but I think gustavo is pretty busy today
[07:23] pedronis: it's 2 small patches, should be easy to cherry-pick in case we want fixes in 2.32
[07:23] it's too late I think
[07:25] mborzecki: you can prepare a PR and target it so that *if* we need to rebuild we have it. but I'm with pedronis probably too late
[07:25] mvo: sure
[07:26] mvo: so I imagine we concluded that it's called 2.32.4 , not 2.33 ?
[07:27] pedronis: yes, I had a call with Adam about it, the amount of work to make it 2.33 is just too high at this point
[07:28] I don't think we have promised/enforced minor releases to be small or have no features
[07:28] we try to
[07:29] in theory we have assumes , but seems they stay unused
[07:29] (anyway they are not relevant for the API, it's transparent)
[07:33] PR snapd#5044 opened: 'system' nickname for 'core' in snap get/set (2.32)
[07:37] mvo: I'm going to merge my PRs about doMountSnap, should I prepare cherry-picks? or will you later?
[07:40] PR snapd#5036 closed: overlord/snapstate: allow to get an error from readInfo instead of a broken stub, use it in doMountSnap
[07:45] * pedronis is clearly not rested enough
[07:56] zyga: I posted on opensuse-packaging, but it's a very low traffic list, will let you know if I get any replies
[07:59] 5038 failing on travis, works if i run it from host, cleaned the git tree but still
[08:00] and it's awkward, afaict all services are getting enabled by dh snippets in postinst, but the snapd.wakeup.service is disabled when the test starts
[08:01] https://media1.giphy.com/media/12NUbkX6p4xOO4/giphy.gif probably
[08:04] Caelum: perfect, thanks!
[08:17] i really have packaging at times
[08:23] if only someone invented a simple way to package stuff!
[08:23] slackware?
[08:24] zyga: could you take another look at https://github.com/snapcore/snapd/pull/4989 later on?
[08:24] lol
[08:24] PR #4989: tests: add arch to CI
[08:24] mborzecki: sure
[08:32] moin moin
[08:33] hey Chipaca
[08:33] mvo: do you remember why switch didn't have --devmode and etc?
[08:33] Chipaca: I think there is no reason, its a nice idea
[08:33] also what we agreed about --no-devmode
[08:33] ditto --classic vs --no-classic, etc
[08:33] Chipaca: iirc nobody asked that switch would have this capability but I think its a nice idea
[08:33] mvo: people have asked, we've just not been paying attention =)
[08:33] Chipaca: I had this problem too (wanting to swtich from strict to devmode and vice-versa)
[08:34] Chipaca: *cough*
[08:34] Chipaca: details ;)
[08:34] Chipaca: don't destroy my narative
[08:34] mvo: https://forum.snapcraft.io/t/refresh-into-devmode/4130 and https://forum.snapcraft.io/t/refreshing-snaps-in-devmode/4942
[08:35] mvo: but also I remember niemeyer had a reason for not having --no-devmode etc, but I don't remember it
[08:35] and I'm not sure it wasn't due to him confusing go-flags and flags, wrt --=
[08:41] mborzecki: https://github.com/snapcore/snapd/pull/4989/files#r181321369
[08:41] PR #4989: tests: add arch to CI
[08:43] Chipaca: sorry, I don't remember why we would not want --no-devmode etc
[08:43] mborzecki: after that +1
[08:47] mvo: how do you get out of devmode?
[08:48] Chipaca: I think you need to refresh to a new revision for this right now, no?
[08:52] mvo: I think so yes
[08:56] the memory leak was introduced in one of 144 patches
[08:57] I will review them quickly and see if I can automate a test loop
[09:02] interestingly this patch is in that list a7c3e901a46ff54c016d040847eda598a9e3e653
[09:02] zyga: nice
[09:02] zyga: bisect ftw
[09:03] it's not the bug yet though
[09:08] PR snapd#5039 closed: overlord/snapstate: use the readInfo in doMountSnap as a check only, undo if it errors
[09:08] PR snapd#5040 closed: overlord/snapstate: poll for up to 10s if a snap is unexpectedly not mounted in doMountSnap
[09:09] jdstrand: I didn't know we already supported loading arbitrary extended data into profiles!
[09:11] mborzecki: thanks for your review for 5043
[09:11] mborzecki: I will dig a bit more in a bit but right now I think there is no way to disentangle --kill-who=main and KllMode=process
[09:11] mvo: so! what can i help with today?
[09:12] Chipaca: smart ideas about 5043 are in short supply right now :)
[09:13] PR snapd#5045 opened: overlord/snapstate: poll for up to 10s if a snap is unexpectedly not mounted in doMountSnap (2.32)
[09:14] mvo: ^ prepared the backport
[09:14] pedronis: \o/ thank you
[09:14] mvo: I hear good things about runit
[09:14] Chipaca: :-D
[09:14] mvo: :-)
[09:15] so things that appear described as unrelated are interacting in obscure ways and are not orthogonal :/
[09:16] mvo: so, if I understand the issue correctly, it's that if a daemon has refresh-mode=potato but does not hangle sigpotato and instead dies, then the whole service is killed?
[09:18] Chipaca: yeah, all processes in the cgroup will be killed, that is my understanding
[09:18] mvo: right
[09:19] mvo: but isn't a daemon not handling the signal it asks to be delivered on refresh a bug?
[09:19] Chipaca: from reading the source (my understaning is still a bit incomplete) I think what happens is that the main pid dies and that triggeres sigchld in systemd which notices that the main pid of the given unit died
[09:19] Chipaca: this makes the unit enter "stop state" and systemd will do what it needs to do when this state appears. which includes the cleanup of the cgroup (AIUI)
[09:20] Chipaca: in the sigterm case what we want is that the daemon restarts, I guess one could argue it should re-exec and use the same pid
[09:20] Chipaca: this would solve the problem but I think this is not how most apps deal with it :/
[09:22] mvo: what are we trying to accomplish? with refresh-mode=, when the snap refreshes, we do what? and the daemon does what? and systemd does what?
[09:22] we=snapd there
[09:24] zyga: have you seen jdstrand's comment to 5041?
[09:25] Chipaca: on snap refresh with --refresh-mode=sigterm what we want is that the main process of the unit in question gets a sigterm. but that the other processes in the cgroup are left alone and survive
[09:25] Chipaca: the use case is e.g. libvirt when it has a bunch of vms running that should not stop
[09:26] Chipaca: we tell systemd to do "systemctl kill --kill-who=main -s TERM snap.name.app"
[09:26] Chipaca: instead of the usual "systemctl stop snap.name.app"
[09:26] Chipaca: making sense so far?
[09:26] yes
[09:26] now the problem seems to be that the option --kill-who=main is not orthogonal to KillMode= in a service file
[09:27] Chipaca: or it is but in a different way, there is some entanglement
[09:27] (the enganglement I described above, main pid dies, systemd wants to cleanup)
[09:27] yes
[09:28] mvo: what do we _want_ systemd to do?
[09:29] Chipaca: on "systemctl kill --kill-who=main" we want it to kill the main pid and leave the rest alone
[09:29] Chipaca: on unit stop we want it to stop everything
[09:30] mvo: do we want it to restart the thing?
[09:30] or _just_ kill it?
[09:30] Chipaca: do whatever is defined in Restart=
[09:30] Chipaca: it seems that this is a decision for the snap
[09:30] Chipaca: but normally it would be Restart=on-failure (which is our default)
[09:31] Chipaca: for a lot of things (SIGHUP) its a non-problem because the process will handle it and not die but sigterm is the problematic one
[09:31] Chipaca: still making sense :) ?
[09:31] restart=on-failure does not restart the thing when killed with TERM
[09:32] might this be the reason it's entering stop mode at all?
[09:32] Chipaca: I think I tested with "Restart=always"
[09:32] Chipaca: and it had no effect but I can do so again
[09:33] pstolowski: looking
[09:34] mvo: sigterm is like sighup, in that the process can catch it etc (sigkill is the uncatchable one)
[09:34] but, ok
[09:34] Chipaca: I know
[09:34] Chipaca: but it seems the services we care about do not catch it
[09:34] mvo: a'ight
[09:35] Chipaca: we could argue they should and the problem would go away
[09:35] mvo: and AIUI the problem with using KillMode is that 'stop' will no longer work as expected?
[09:35] jdstrand: I replied to your comments on 5041
[09:36] Chipaca: correct
[09:36] Chipaca: it will mean there are processes hanging around (potentially)
[09:37] mvo: and can ExecStop itself call systemctl?
[09:40] Chipaca: a good question, I think so, what do you have in mind?
[09:40] mvo: wondering whether we can manually use ExecStop to get the 'stop' behaviour we want
[09:42] mvo: as all the rest seems to be ok with the particular choice of restart/killmode
[09:44] Chipaca: hm, won't systemd just call ExitStop= in both cases? when kill was used and when stop was used?
[09:50] mvo: will it?
[09:53] Chipaca: it seems to be, I added "ExecStop=/bin/sh -c "echo foo >/tmp/foo"" and ran a kill (with Restart=always) and /tmp/foo with the content got created
[09:55] mvo: and is ExecStopPost _also_ run with kill?
[09:59] Chipaca: let me check
[10:00] Chipaca: yes, I also get a debug file with it
[10:01] mvo: it sounds to me like the lesser weevil is to document that if you use refresh-mode, systemctl stop will be weird
[10:02] Chipaca: yeah, and on remove do a extra kill --kill-who=all
[10:02] Chipaca: I can't figure another way but I can poke at it a bit more after lunch
[10:02] mvo: and do that on 'stop' itself also
[10:02] Chipaca: on snap stop service?
[10:02] ie 'snap stop' should work as expected even when systemctl stop doesn't
[10:03] Chipaca: thats a nice idea
[10:03] yeh
[10:44] re
[10:49] jjohansen, jdstrand: I took the profile with a leak and started removing features from it; I want to see if any of the newly-added features may be responsible
[10:50] jjohansen, jdstrand: I also stress-tested all of the sysfs files in apparmorfs for extensive reading and can say that they are not a factor
[10:51] (though I found one curious behaviour of the "revision" file, is that documented anywhere?
[10:55] the revision file's behavior is known and going to change
[10:58] jjohansen: that it "sleeps"
[10:58] jjohansen: loading an empty profile leaks as well
[11:03] jjohansen: https://github.com/zyga/apparmor-bug-leak
[11:04] loading this is sufficient https://github.com/zyga/apparmor-bug-leak/blob/master/neutered-sample.aa
[11:04] so it's not like a new optional feature is there and causes the leak
[11:04] maybe something is not unref'd?
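Note on the refresh-mode discussion (09:16 to 10:03): the proposal is that a refresh signals only the main process, while 'snap stop' and remove additionally kill whatever is left in the unit. The following is a rough sketch of how those command lines could be assembled, not snapd's actual code. The helper names `refresh_command` and `stop_commands` are hypothetical; the systemctl invocations themselves are the ones quoted in the log, and the extra kill relies on `systemctl kill` sending SIGTERM when no signal is given.

```python
def refresh_command(unit, refresh_mode=None):
    """Command to run against `unit` when the snap is refreshed."""
    if refresh_mode == "sigterm":
        # Signal only the main pid; other processes in the cgroup should survive.
        return ["systemctl", "kill", "--kill-who=main", "-s", "TERM", unit]
    # No refresh-mode set: the usual full stop.
    return ["systemctl", "stop", unit]


def stop_commands(unit, refresh_mode=None):
    """Commands for 'snap stop' or remove, where everything must go away."""
    commands = [["systemctl", "stop", unit]]
    if refresh_mode is not None:
        # With a refresh-mode set, 'systemctl stop' may leave processes behind,
        # so follow up by signalling every process in the unit.
        commands.append(["systemctl", "kill", "--kill-who=all", unit])
    return commands
```

For example, `refresh_command("snap.name.app", "sigterm")` yields the kill invocation quoted at 09:26, while `stop_commands("snap.name.app", "sigterm")` yields a stop followed by the extra `--kill-who=all` suggested at 10:02.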
[11:18] mvo: there's a failure in https://travis-ci.org/snapcore/snapd/builds/365982227?utm_source=github_status&utm_medium=notification in linode:debian-9-64:tests/main/snap-service-refresh-mode https://paste.ubuntu.com/p/cxTNbkBbtb/
[11:21] mborzecki: thanks, looking
[11:22] mvo: just the paste, i've restarted the build
[11:22] mborzecki: thanks, I see it in the paste I will chase that
[11:23] mborzecki: looking into this area anyway currently
[11:23] mvo: yup, that's what i thought :)
[11:27] jjohansen: I varied the test to insert a profile with ever-changing contents, I will check how that behaves across kernel versions
[11:29] jjohansen: I noticed that one of the things that differs between the broken and working kernels is the presence/absence of symlinks in apparmorfs/policy/profiles/$PROFILE/
[11:40] jjohansen, jdstrand: loading _different_ profiles forever on an affecter kernels doesn't leak memory
[11:41] mvo: as a workaround I can generate random garbage rule like /tmp/.snapd.bug.1234.$RANDOM r,
[11:41] mvo: and inject that into all profiles
[11:41] mvo: we will lose all caching but we will not leak
[11:45] pstolowski: would you like to review https://github.com/snapcore/snapd/pull/5034 ? :)
[11:45] PR #5034: userd: set up journal logging streams for autostarted apps
[11:46] pstolowski: it's based on 5024, so it's just the last 2 patches that are different 5034 specific
[11:47] mvo: more ideas on v
[11:47] https://forum.snapcraft.io/t/oom-for-interfaces-many-on-bionic-i386/4101/18?u=zyga
[11:49] jjohansen, jdstrand: I'm now looking at error recovery in aa_replace_profiles
[11:49] zyga: yay
[11:49] zyga: let me know if you have anything to test
[11:50] mvo: look at the options I gave
[11:50] mvo: 1 is simple but wasteful
[11:50] mvo: 2 should have no downsides but is complex
[11:50] mvo: I can prepare a test with .1 quickly
[11:50] zyga: if (1) is simple, we we just do it as an expeiment?
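Note on the workaround proposed at 11:41: since reloading an identical profile leaks on affected kernels but loading different profiles does not, the idea is to make every generated profile unique by injecting a throwaway rule. The following is a minimal sketch of that idea, not the patch that was eventually proposed. The helper name `inject_unique_rule`, the use of a UUID in place of `$RANDOM`, and inserting before the last closing brace are all assumptions.

```python
import uuid


def inject_unique_rule(profile_text, token=None):
    """Return profile_text with a unique, harmless read rule added.

    The rule mirrors the '/tmp/.snapd.bug.1234.$RANDOM r,' example from the
    discussion; `token` stands in for $RANDOM and defaults to a random UUID.
    """
    if token is None:
        token = uuid.uuid4().hex
    rule = "  /tmp/.snapd.bug.1234.%s r,\n" % token
    # Assumption: the last '}' in the text closes the profile body.
    idx = profile_text.rfind("}")
    if idx == -1:
        raise ValueError("no closing brace found in profile text")
    return profile_text[:idx] + rule + profile_text[idx:]
```

As noted in the log, the cost of this approach is that profiles change on every generation, so any caching of compiled profiles is lost.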
[11:50] surte
[11:50] *sure
[11:50] zyga: yeah, lets do it if it does not take much time on your side
[11:57] mborzecki: sure
[11:58] mvo: after the break, now need to do stuff at home
[12:01] Son_Goku: hey! am i missing something, or copr.fedorainfracloud.org doesn't actually offer an option for uploading spec+srpm as advertised on the wiki?
[12:01] pstolowski: yes you are
[12:01] you can upload a srpm via the copr CLI tool
[12:01] or you can upload it somewhere and tell the web ui to fetch it
[12:02] pstolowski: you should be able to from http://copr.fedorainfracloud.org/coprs/pstolowski/go-udev/packages/
[12:02] ah, and you can upload the srpm from the web ui too
[12:03] pstolowski: https://copr.fedorainfracloud.org/coprs/pstolowski/go-udev/add_build_upload/
[12:04] Son_Goku: ah, there! I was staring at the Packages -> Add New Package where you need to have git/svn
[12:04] thanks
[12:24] reminds me to drop my copr packages, they're quite old anyway
[12:26] PR snapd#5046 opened: snap, wrappers, tests: rename refresh-mode -> stop-mode, endure -> skip-refresh
[12:26] mvo: ^^
[12:36] * kalikiana lunch
[12:49] re
[12:49] mvo: looking at that workaround now, it will be a moment, I'm almost done
[12:52] mborzecki: \o/
[12:52] zyga: yay
[12:55] Issue snapcraft#2028 closed: When asking to release to a branch that's too long, a traceback is printed that gives no hints as to the source of the error
[12:55] PR snapcraft#2059 closed: storeapi: handle 500 error response when releasing snap
[12:55] mborzecki: haven't we had a release with refresh-mode already?
[12:59] * Chipaca hunts for his headphones
[12:59] Chipaca: yes, that's why i have some doubts about backwards compat
[13:02] Chipaca: standup
[13:06] PR snapd#5047 opened: tests: removing linode-sru backend
[13:20] PR snapd#5048 opened: tests: updating bionic version for spread tests on google
[13:32] Is it "known" that opengl apps on 18.04 on binary nvidia are broken?
[13:33] popey: no
[13:33] shotcut "could not initialize opengl" but works on intel
[13:33] popey: on stable?
[13:33] i am on beta
[13:33] but i can go back to stable to test
[13:33] popey: can you give us some more data, which hardware, which snaps, etc
[13:33] mborzecki: ^
[13:33] sure
[13:34] popey: after standup i'll reboot and check on bionic
[13:34] filing a bug to capture it
[13:34] popey: how does it fail? is there any log?
[13:35] https://bugs.launchpad.net/snapd/+bug/1763717
[13:35] Bug #1763717: some opengl applications don't work on nvidia binary driver
[13:35] popey: try to /usr/lib/snapd/snap-discard-ns , and then capture the log with SNAPD_DEBUG=1 SNAP_CONFINE_DEBUG=1
[13:35] ok
[13:37] added to the bug
[13:37] popey: it's a classic snap
[13:37] re
[13:38] Is that bad? :)
[13:38] popey: if it's a classic snap then it does not get any of the nvidia namespace setup treatment
[13:38] (I mean, we'd like it to not be classic)
[13:38] popey: classic snaps don't have any support for that
[13:39] so it should "just work" right?
[13:39] it depends on how it is made
[13:39] but it's something for kalikana perhaps
[13:39] we don't influence i
[13:39] we don't influence it
[13:39] it probably doesn't work
[13:39] popey: shotcut snap right?
[13:39] yes
[13:39] because 18.04 and 16.04 differ
[13:39] its in the store
[13:40] host's handling of nvidia changed
[13:40] there's nothing we can do IMO
[13:41] i'll probably fail on arch too, let me try
[13:42] popey: aborts on arch too https://paste.ubuntu.com/p/qPtZStGWth/
[13:42] :(
[13:45] you can surely do something about it in a wrapper script in the snap itself
[13:45] (hacks)
[13:52] popey: i've installed all dependecies and can run shotcut directly outside of snap
[13:54] I'm at a loss what we suggest the developer does for a smooth experience installing snaps on multiple distros at different releases and have it work.
[13:54] do not use classic :)
[13:54] mvo: I'll take the dog out now but then I will be back to propose the workaround
[13:54] zyga: thank you
[13:54] popey: you _can_ use nvidia but the snap needs some code for that
[13:54] popey: talk to kalikana and sergiusens
[13:54] popey: it's doable just nobody done it
[13:54] ok
[13:55] popey: snapd is not preventing that
[13:55] popey: it's just not enabling that (because it cannot)
[13:55] * zyga -> afk
[13:55] yeah ... as i said, wrapper hackery
[13:55] mvo: mborzecki: so...
[13:56] mvo: mborzecki: removing catalogrefresh from snapd drops it (with an ~empty state) from 25MB rss to 15MB rss on start
[13:56] pedronis: also ^
[13:56] catalogrefresh is all kind of evil :)
[13:56] Chipaca: so basically .text + .data + .bss size
[13:56] * Chipaca covers catalogrefresh's ears
[13:56] Chipaca: woah
[13:57] all kinds of .bs
[13:57] this is also because of bolt db code
[13:57] Chipaca: heh ;)
[13:57] I suppose
[13:57] pedronis: that's my assumption, yes
[13:57] Chipaca: with GOGC=1 RSS was 19MB
[13:57] Chipaca: nice, lets move it out into a separate helper
[13:57] Chipaca: thanks, thats a huge win
[13:57] mvo: now, or post-1804
[13:58] there are some issues around auth to move it out
[13:58] pedronis: why? i thought it didn't auth at all
[13:58] Chipaca: depends on your schedule for today, if you have free cycles I would say asap but does not have to be *now*
[13:58] Chipaca: we need to talk with nessita, apparently /names can be per store
[13:58] I don't know if it is yedt
[13:58] oh
[13:58] ok
[13:59] mvo: I'll check with nessita and work on it today
[14:00] mborzecki: I will merge your reresh-mode-endur-rename PR into my snap-rereshmode-fixes PR and work from that, ok?
[14:00] Chipaca: sounds good, thank you
[14:00] mvo: if we do a .5 we should consiser #5044
[14:00] PR #5044: 'system' nickname for 'core' in snap get/set (2.32)