[00:30] doko: Are you going to upload the rebuilds for extensions to drop 3.3 support? === freeflying__ is now known as freeflying === BenC- is now known as BenC === _salem is now known as salem_ === salem_ is now known as _salem === Ursinha is now known as Ursinha-afk === Ursinha-afk is now known as Ursinha [03:00] hello everyone :) I've enjoying 14.04-desktop_amd64 since almost a month now and so far it's shaping as an amazing release - rock solid, fast... really beautiful. However I've found this strange issue I'm unable to debug, may be you can shed some light here guys: whenever I return from susped, there will be a sustained load average of +1. On normal conditions like a fresh boot, the usual load when idle [03:00] could be as low as 0.01 (kudos!!!) but after waking from suspend you won't be seeing the load average drop 1.10~1.20 [03:00] Any idea? How could I debug this issue? [03:02] s/i've/i'v been/g :) [03:41] msx: Have you looked at what is eating your resources in system monitor, or using top in a terminal? [03:41] msx: No actual idea what it could be, but knowing what process is using resources is somewhere to start looking at least. [03:43] TheMuso: hey! [03:45] TheMuso: well, actually, top doesn't shows anything unusual; in fact the consume is as about as normal as always BUT the loadaverage is always from 1.10 upwards [03:45] TheMuso: if the load average would start with a 0.x everthing would be perfectly normal [03:46] i'm very intrigued about what could be happening ^_^ === vibhav is now known as Guest48466 [06:11] Good morning [06:20] slangasek: wooooow, https://code.launchpad.net/~xnox/phablet-tools/py2-3/+merge/205608 landed at last \o/ [06:34] bdmurray, ev: errors now running on prodstack> niice! === achernya_ is now known as achernya === tvoss|afk is now known as tvoss [07:27] tvoss: guten Morgen === jamesh_ is now known as jamesh [07:29] good morning [07:29] * pitti hugs dholbach [07:29] * dholbach hugs pitti back === qengho is now known as CardinalFang === CardinalFang is now known as qengho [08:50] good morning [09:01] ev, zyga, fginther: I've been looking for an ack timeout in rabbitmq -- i. e. that a message gets re-queued if the accepting worker doesn't ack it within a given time; I only found some references that this isn't possible, but they are from 2009 [09:02] zyga: sorry, I meant vila [09:02] ev, vila, fginther: do you know if that's possible? [09:03] pitti: so far, we've managed to always ack (outcome reported whether it succeeded or failed) [09:03] pitti: as fginther hinted, we want to dig that far more [09:03] vila: so this wouldn't cover the case if the test process gets stuck, but stays alive [09:04] vila: if the worker node dies properly, the AMQP connection will be terminated and the message requeued, so that's fine [09:04] but if a test runs indefinitely long, it won't help [09:04] I'm certain it's possible (otherwise why have ack'ed messages at all?). I'll dig in a little bit. [09:04] pitti: but adt-run will timeout in that case no ? [09:04] Oh I misread. [09:05] vila: usually yes; just playing the "what could possibly go wrong" game :) [09:05] pitti: the testbed and the worker are distinct instances [09:05] pitti: yeah, right ;) [09:05] pitti: planned to be digged to the bottom ;) Feel free to vanguard, I won't be far ;) [09:06] vila: I've seen an LXC host kernel-panic and stop doing anything for heavily loaded container tests [09:06] and I'm not sure whether that already was sufficiently dead to drop network connections [09:07] there's a per-queue and per-message TTL, but that will just kill messages after no worker grabs it for that time, so that's the opposite of what we want [09:08] pitti: LXC host meaning the host handling the container running adt-run right ? [09:08] right [09:09] vila: I guess we could implement this manually; the controller would traverse the list of pending (unacked) requests regularly (assuming that this is accessible), and re-queue the ones which haven't been acked after 6 hours [09:09] pitti: so, this design has indeed a fatal failure mode, we avoid that by using (today) a nova instance for the testbed, (future) a MAAS instance for bare metal [09:09] pitti: but in any case, the worker should not be subject to such a failure mode in production (it's ok for local use as it's faster) [09:10] vila: well, "should" -- it happened; kernel bugs are everywhere :) [09:10] (or hw) [09:10] pitti: yes, that's why the worker (controlling the testbed) should not be on the same host [09:11] vila: hm, I think I disagree [09:11] pitti: so it can monitor the testbed in all scenarios [09:11] it introduces more things that can go wrong, and doesn't really solve the problem [09:11] as now you need the same kind of sync between controller and worker, AND worker and testbed [09:12] twice the number of hosts which can fail, and twice the amount of synchronization/acks to do [09:12] pitti: err, it removes one thing that can go wrong: a user test crashing the infra [09:12] vila: we'll run tests in QEMU, that's not the problem [09:13] vila: the problem was that the host kernel, or the host's hard disk freaked out on high loads (from LXC, but I don't think that matters that much) [09:13] pitti: right, QEMU is not lxc, that's what a nova instance provides [09:14] pitti: so we separate root causes between worker and testbed and can handle them differently [09:14] vila: I'll see whether it's possible for the controller to get/check/kill unacked requests [09:14] pitti: so the worker can fail, the message is not acked, another worker takes the ticket [09:14] vila: you mean it does get acked, but with a failure result? [09:15] pitti: yes [09:15] vila: because "not acked → another worker takes ticket" is exactly what I'm looking for :) [09:16] pitti: I haven't written tests for that yet, but that's a strong assumption in our design, if it's not guaranteed, we'll have to re-design ;) [09:16] vila: which assumption do you mean? relying on workers to always eventually acking requests? [09:16] pitti: I've been assured it was a safe assumption though ;) [09:16] (or dying properly with closing net connections) [09:17] vila: well, maybe it is [09:17] pitti: either the worker ack the message or the message becomes available in the queue after some time [09:17] vila: I'm not sure how much active "pinging" rabbitmq does to the accepting worker to determine whether its connection is still alive [09:17] pitti: yup, that's exactly the point that needs to be tested [09:18] vila: ok; it seems you didn't run into this kind of trouble so far, so perhaps this can be put on the backburner for now? [09:18] * pitti tries what happens when SIGSTOPping the worker [09:18] pitti: that's what we've done, I'm not super comfortable with that but it's on my radar [09:19] pitti: but yes, we didn't run into that case (yet) [09:19] vila: ok; thanks for your help! (and sorry for keeping bothering you) [09:19] pitti: not bothering at all, happy to share the knowledge there as we acquire it ;) [09:19] it seems I have a habit that the very first problems I'm thinking of in these new systems are the ones which are impossible or underdocumented :) [09:20] pitti: same here, which led to a reputation of being over concerned ;) [09:20] until things break ;) [09:29] vila: so if I sigstop a worker (so that it definitively can't serve its network connection), the rabbit server still thinks it's active [09:29] pitti: and to make things clear: the actual design is under construction, so more eyeballs help [09:30] pitti: right, so there should be some way[s] to configure that, hopefully on a per-queue basis [09:30] only if I control-C it it puts the message back into the queue [09:31] pitti: meeting starting right now [09:31] vila: ack [09:31] pitti: oh.. interesting, does that mean SIGSTOP must be handled or we miss some safety net already provided in other cases ? [09:32] vila: no, I was just simulating what happens if the worker suddenly stops responding to net requeusts without closing its network connection properly [09:32] vila: i. e. the equivalent of what could happen on a kernel panic and processes end up in deep kernel sleep, etc. [09:35] vila, ev: ah, http://www.rabbitmq.com/tutorials/tutorial-two-python.html officially documents the absence of timeouts [09:36] seb128: thanks - I need to test it but I've stuck a patch derived from that one in my working tree so I don't forget about it [09:36] so if that turns out to be an actual problem, we can regularly iterate over these queues and kill the workers; but that can (and has to) be bolted on top of the system anyway, so IMHO no need to do this in phase 1 [09:36] cjwatson, thanks === oSoMoN_ is now known as oSoMoN [09:40] vila, ev: but hard-switching off (lxc-stop -k) the worker DTRT, so I think that covers the most common problems [09:42] doko_, figured out that eclipse-* failures - gnumail osgi metadata was not quite right - need a drop of javax.activiation [09:42] fix uploaded [09:43] pitti: from your url above: 'RabbitMQ will redeliver the message only when the worker connection dies.' so it should be a matter of waiting for the connection to die right ? That can delay nacks, but they will occur no ? [09:44] pitti: at worst, a worker can lose the connection and fail to ack and the same message will be handled twice, but I think handling duplicates will be easier than losing messages === funkyHat_ is now known as funkyHat [10:10] hey, quick question, what is the syntax for debian/control, that says package depends on the same version of another binary package from the same source package [10:10] like libfoo and foo that need to be updated together [10:10] I recall something like ${Source:vesrsion} [10:10] but I cannot find any docs about that [10:11] ${binary:Version} [10:11] ? [10:11] zyga: package-name (= ${binary:Version}) [10:11] zyga: deb-substvars(5) === funkyHat_ is now known as funkyHat [10:11] cjwatson: thanks [10:12] There are subtleties in the event that one of the binaries is arch-dependent and the other is arch-independent [10:22] vila: yes, duplicates isn't a problem in the adt case [10:25] pitti: thanks, I had troubles here, I may have missed some msgs [10:25] vila: no you didn't miss anything; I was AFK for a bit [10:26] pitti: did you reply to: 'from your url above: 'RabbitMQ will redeliver the message only when the worker connection dies.' so it should be a matter of waiting for the connection to die right ?' ? [10:26] vila: right [10:26] pitti: i.e. you mentioned a test you did, could it be that the connection wasn't yet dead ? [10:26] vila: it logically wasn't [10:27] vila: i. e. it never got a RST, but it wasn't really alive either as the worker was stopped [10:27] vila: which might approximate what happens if a system freezes due to a hardware failure or kernel panic or similar [10:27] pitti: but a TCP connection should ultimately die (what's the timeout there, 20 mins or something ?), right ? [10:28] vila: but as I said, we can handle that using nagios and regular ping queues, and just killing hanging workers [10:28] vila: I don't know, I'm afraid [10:28] vila: I just waited a few minutes, certainly not 20 [10:28] pitti: ack [10:29] pitti: right, one other assumption we did early on, was that rabbit itself is considered reliable so if a msg is received, it won't be lost [10:29] vila: so, perhaps this is a non-issue as the connection will eventually die; it seems rabbit is being used by quite a number of people, if that was a real problem I'd think that there was a common solution by now [10:29] * vila nods [10:30] vila: so either way, I think we have the tools to deal with this problem, I was just curious whether I coudl do something like [...], ack_timeout='4 hours' [10:30] pitti: a consequence of the above is that we focus on making sure msgs are properly queued/dequeued and don't have to handle inter components failure modes otherwise [10:30] pitti: yeah :-/ [10:31] pitti: but you mentioned TTL at one point ? [10:31] vila: *nod*, I assume that part is okay; I was playing around quite heavily with randomly killing multiple workers etc., and I never lost one [10:31] vila: yes, that's the other way around: if a message isn't being picked up by TTL, it gets deleted [10:31] ha, hmm, the opposite of what I'd like ;) [10:32] vila: so we certainly want to know when that happens (-> again, housekeeping cron job), but not kill messages, but fix the AWOL workers instead [10:32] so I don't think we want to use that feature [10:32] pitti: agreed [10:32] pitti: oh, you mean it's opt-in ? [10:32] vila: yes: http://www.rabbitmq.com/ttl.html [10:33] excellent [10:33] vila: which is certainly handy for some use cases (just not our's here) [10:33] pitti: yup [10:33] * vila go reads those pages and stop annoying pitti with silly questions [10:34] vila: no no, that's fine :) this is all new to me, too [10:35] pitti: ha, right, one more idea comes back: we can have a different queue where workers send heartbeats so they can be killed if they fail to do so, not sure it's worth it at that point but that's one way to monitor them [10:36] vila: right, I proposed that yesterday (to fginther, I believe) [10:36] ah no, to ev [10:36] pitti: hehe GMTA ;) [10:36] vila: anyway, we can do that, or regularly check pending and unack'ed queue entries in rabbit, or similar [10:36] or rather, good ideas spread, spontaneously ;) [10:37] vila: we can have a fanout queue, and every worker has to ack in 30 seconds; the set of expected workers comes from the results in swift [10:37] vila: that was roughly my proposal [10:37] pitti: right, I tend to prefer separate means for separate goals, but... real life decides [11:13] pitti, vila: we must not be the first people to face this problem [11:13] I wonder if there's decent prior work [11:14] though the current approach seems sound [11:19] pitti, vila: perhaps we already covered this, but could the workers not just open('/srv/tr_worker/var/heartbeat.stamp', 'w').close() ? Then the watchdog could just check timestamps to know if it needs to kill the daemon? [11:20] whatever we do, we should definitely have a test for os.kill (worker_pid, signal.SIGSTOP) [11:20] ev: the watchdog sshs in? If it can ssh, then we shouldnt' have the problem of "hung kernel" in the first place, but yes [11:21] ev: if that's any easier than checking unack'ed messages in the queue [11:21] ev: what pitti said, we need to cope with unreachable workers [11:21] pitti: no need. Just put the watchdog on every system [11:21] ev: the watchdog will be affected by crashing hw/kernel just as the adt worker, though [11:22] ev: that's the same as sending a heartbeat so someone still have to monitor the heartbeats from a *different* instance or we're back to square one [11:22] we can have a separate check for "do you respond to ssh? No. Okay, `nova delete` to you." [11:22] but of course, if the watchdog/ssh don't respond, that's already a "died" condition [11:22] *nod* [11:22] then juju automatically respawns the worker. In the case of the test runner. Not sure how to do this with lxc containers [11:22] but I'm assuming there's something comparable [11:22] or you could use juju to manage them [11:22] yes [11:22] since it does that now [11:23] juju-local can do that [11:23] whoop [11:23] or just lxc-stop / lxc-start again [11:23] can I just say that I'm so excited for this? :) [11:23] yeah, start/stop [11:23] excited> +1 === oSoMoN_ is now known as oSoMoN [11:40] ev, vila: FYI, I'm working on a spec here: https://wiki.debian.org/MartinPitt/DistributedDebCI [11:40] ev, vila: once it's ready, I'll send mail for review/commenting [11:40] pitti: ack === marcoceppi is now known as marco_traveling === _salem is now known as salem_ === MacSlow is now known as MacSlow|lunch [12:45] cjwatson: any opinion on openssh 6.6? It's "primarily a bugfix release" but it seems quite late now. I just triaged bug 1298280. [12:45] bug 1298280 in openssh (Ubuntu) "Update OpenSSH to 6.6" [Wishlist,Triaged] https://launchpad.net/bugs/1298280 [12:45] rbasak: I already have it staged and plan to land it [12:45] Ah - thanks! [12:46] was talking with Marc about it a few days back === roadmr_afk is now known as roadmr === MacSlow|lunch is now known as MacSlow [13:48] slangasek, https://bugs.launchpad.net/ubuntu/+source/plymouth/+bug/1160079 ? that going to get in ? [13:48] Launchpad bug 1160079 in plymouth (Ubuntu) "plymouth aborts in cloud images" [Medium,Confirmed] === roadmr_afk is now known as roadmr === roadmr is now known as roadmr_afk === roadmr_afk is now known as roadmr === Ursinha is now known as Ursinha-afk === Ursinha-afk is now known as Ursinha === doko_ is now known as doko === mbiebl_ is now known as mbiebl [15:59] mterry: please could you review https://bugs.launchpad.net/ubuntu/+source/juju-quickstart/+bug/1273865/comments/19, which answers your question on this MIR bug? [15:59] Launchpad bug 1273865 in websocket-client (Ubuntu Trusty) "[MIR] juju-quickstart, python-jujuclient, urwid, websocket-client" [High,New] [15:59] mterry: we have a FFe bug waiting too - if we can get this acked, is this sufficient for main? [15:59] rbasak: openssh> btw you can always feel free to test stuff in git://anonscm.debian.org/pkg-ssh/openssh.git :-) [16:00] cjwatson: don't look at me. I just triaged the bug! :) [16:00] if you feel the urge [16:00] ok :) [16:00] it's always worth a try [16:00] :) [16:07] rbasak, yeah both MIR and FFe would be sufficient for main. I'm looking at that bug [16:16] does William Hua hang out here? [16:17] mterry: fyi, there was some concurrent chat about this in #juju just now. jamespage will update the bug. [16:20] mterry, rbasak: commented on bug [16:21] jcastro: hi there. were you aware the askubuntu login thing (via our sso) isn't working? [16:22] no [16:22] knocte: he is attente on this network, but email might be better to reach him [16:22] jcastro: (as an aside, can't you change it to use sso and not lp?) [16:22] Laney: I'm trying here because email didn't work.. [16:23] Chipaca, we've asked them to but never put a priority on it [16:23] pitti: could we get the dbg syms updated for lightdm? [16:23] well, you have the nick [16:23] jcastro: fair enough; maybe if it's broken they'll fix it and change it at the same time :) [16:23] try a PM [16:24] knocte, what do you want from him? maybe others can help you? [16:24] pitti: for version 1.9.13-0ubuntu1 [16:24] seb128: a hint on where to start looking to try to fix https://bugs.launchpad.net/bugs/1286605 [16:24] Launchpad bug 1286605 in unity-gtk-module (Ubuntu) "Unity global menu causes handlers of the "activate" signal of Gtk.Action to be emptied" [Medium,Triaged] [16:24] Chipaca, http://meta.askubuntu.com/questions/2837/moving-to-ubuntu-single-sign-on [16:24] can you bump that and then I can ping someone === shadeslayer is now known as shadeslayer_ === shadeslayer_ is now known as shadeslayer [16:25] jcastro: maybe what's happening is that it's under way right now? (and instead of removing the button they disabled it?) [16:26] no clue [16:26] jcastro: I can't bump that, no [16:26] jcastro: because I can't log in [16:26] knocte, k, yeah you want to talk to attente ... he's in utc+10 tz atm though, so he's probably sleeping atm [16:26] oh, ok, I can try to hunt someone down [16:26] ok thanks seb [16:27] Chipaca, lp login works for me [16:27] jcastro: when i try to add a comment there is no lp button, so it does look like it's ongoing [16:27] I just tried it [16:27] pitti: How do you feel about the state of psql SRUs? Are they ready to go today, despite being a tiny bit early? [16:27] jcastro: i'm trying chrome and firefox, no luck in either [16:27] Chipaca, when you click the LP button it takes you to the text field, where you have to type in your name and then click log in [16:27] jcastro: the button does nothing [16:27] jcastro: there is no text field [16:38] stgraber: hi. I'm going through your fixes for ifupdown and seeing if they apply to precise. Do you think backporting bug 1072518's fix to precise seems reasonable? [16:38] bug 1072518 in ifupdown (Ubuntu) "Restart networking crashes dbus and the desktop manager" [Critical,Fix released] https://launchpad.net/bugs/1072518 [16:39] just the preventing ability to stop/restart networking iteractively [16:41] arges: hmm, as much as I'd like to see this gone, it'd be a behavior change that some people may (as wrong as it is), rely on [16:41] stgraber: that was my concern as well... [16:42] stgraber: i'll leave that one out then. and just ensure that documentation gets updated [16:42] It's pretty shocking the number of people who think "restart networking" is the way to reconfigure interfaces. I wish I knew where this came from originally. [16:43] infinity: at least that will simply fail in 14.04, I'm sure we'll get some more bug reports because of that, but at least it won't trash their system anymore... [16:46] https://help.ubuntu.com/12.04/serverguide/network-configuration.html suggests 'sudo /etc/init.d/networking restart' hmm [16:47] that's fixed in the 14.04 guide [16:47] stgraber: I already have one bug report because of it, yes. === bfiller is now known as bfiller_afk [16:48] stgraber: Not sure if the error lacks appropriate verbosity, or if the user in question was just silly (I'm assuming the latter). [16:48] arges: Argh. [16:48] infinity: we can't print something on screen when they hit it, so upstart will report a failed to stop and the actual reason is in upstart's log file at /var/log/upstart/networking.log [16:49] stgraber@castiana:~/Desktop/rcu$ sudo restart networking [16:49] restart: Job failed to restart [16:49] stgraber@castiana:~/Desktop/rcu$ sudo tail /var/log/upstart/networking.log [16:49] Stopping or restarting the networking job is not supported. [16:49] Use ifdown & ifup to reconfigure desired interface. [16:49] stgraber: Yeah, that's pretty opaque. [16:50] stgraber: Is there really not a sane way to echo to the controlling tty? Having *everything* in logs isn't always the best default. [16:50] * arges files a bug against serverguide... [16:51] well, a) don't call restart directly, use 'sudo service networking restart' [16:51] slangasek: That latter is still wrong... [16:52] (In the cases when it works, it's by accident, not design) [16:52] infinity: right, which is b) someone could patch service to tail the log [16:54] cjwatson: Pretty sure I've praised it at least once before in the last few months, but live-installer on a fast storage device, whee. 2-second base installs. [16:56] cjwatson: Don't know if it's ever been pondered, but why not swap base-installer for live-installer in netboot as well, and just publish base.squashfs? [16:56] (Oh, I guess that would mean publishing it to the package mirror, not the CD mirror, that would be a problem) [16:56] So, nevermind that. [16:57] infinity: how did the node stuff go on top of libv8? [16:57] alow: Fine, I just needed to add one patch to your set. [16:58] alow: http://paste.ubuntu.com/7163741/ <-- without that, my ppc64el systems tried to configure as ppc (ie: 32-bit), which didn't work out so well. [16:59] bdmurray, hey, do you have access to extra datas for e.u.c reports than the ones in the UI? [16:59] bdmurray, e.g https://errors.ubuntu.com/problem/85c403c4a05cd32a48a73b226340850faa45e785 [16:59] infinity: could you have a look at the patch in bug 1296386? [16:59] bug 1296386 in casper (Ubuntu) "[PATCH] Remove 23etc_modules script" [Undecided,New] https://launchpad.net/bugs/1296386 [16:59] bdmurray, is there any way to know why retracing fails, or see if those users are running ppas or get access to a dump of a report? [16:59] * dholbach hugs hggdh [16:59] infinity: yeah, mirroring would be the problem [17:00] * hggdh hugs dholbach :-) [17:00] dholbach: hggdh++ [17:00] seb128: I'll have a look at the retracer logs for any of those examples. [17:00] bdmurray: Yes, but not on beta day. Want to keep it on your list of things to nag about? :P [17:00] bdmurray, thanks [17:00] infinity: Thanks - I'll add that to our code [17:01] infinity: what makes you think I have such a list? [17:01] bdmurray: A hunch. [17:01] bdmurray: I think he said you should assign it to him [17:01] slangasek: and keep it on the nag list. ;-) [17:04] seb128: outdated debug symbol package for liblttng-ust0: package version 2.4.0~rc4-1ubuntu2 dbgsym version 2.1.1-2 [17:04] bdmurray, that's not likely the problem though [17:04] alow: It looks like Debian's moved on from .25 to .26.. Do you know if that brings any dangers with it (ABI transition, API breaks, etc)? [17:04] seb128: well that's why it failed to retrace [17:05] alow: Considering it for us, but only if it's really low risk, since we're close to release. [17:05] bdmurray, can you see from the log what qtubuntu-android version is installed? [17:06] seb128: its not listed in Dependencies.txt so probably not [17:06] bdmurray, I though that apport had smartness to resolve the packages containing the files listed in the stacktrace? [17:07] or in the procmap rather [17:07] infinity: I don't see a big risk moving to 0.10.26 - let me go look at the deltas [17:07] seb128: yes it does, but it just picks the latest package version in the pocket (-updates, etc) when trying to retrace [17:07] bdmurray, don't we have also some magic to see when a ppa was in use to warn about it? [17:08] seb128: yeah, that'd show up in Tags as 'third-party-packages' [17:08] seb128: well but maybe only for dependencies and not stuff in ProcMaps [17:09] bdmurray, ok, I guess we are just without a solution for that one then [17:09] bdmurray, thanks for checking/helping! [17:10] Does someone upstarty want to tell me why minutes after I've booted, sysvinit jobs seem to run again? [17:10] slangasek: ^? [17:11] slangasek: http://paste.ubuntu.com/7163794/ <-- Note the last few lines. [17:11] slangasek: I see this on my buildds too, and it does genuinely seem to be starting again, not for the first time, as on the buildds, I get apache erroring because it's already running, etc. [17:12] infinity: the ondemand init script is special [17:12] infinity: and doesn't point to anything running again [17:12] slangasek: The ondemand thing should be a red herring here, it execs itself in the background. Unless that's what's causing the rest of this. [17:12] infinity: it shouldn't be causing any of the rest of it; but you said "note the last few lines", maybe you want to be more specific? :) [17:12] you mean the apparmor profiles stuff? [17:13] slangasek: When I said "last few lines", I mean "hey look, I'm booted and logged in and, oh neat, a couple of minutes later, I'm starting apparmor and restoring resolver state". [17:13] slangasek: And, like I said, on my buildds, I see this with apache and launchpad-buildd too. [17:13] For a long time, I thought it was an oddity with booting without an initrd, so didn't care, but this example is a standard install with an initrd. [17:17] infinity: well here's a question, why the heck is something like "dns-clean" running in runlevel 2 instead of in rcS? [17:17] slangasek: Dunno, but apparmor *is* in rcS, so that's still weird. [17:17] infinity: RE: 0.10.25 -> 0.10.26 http://blog.nodejs.org/2014/02/18/node-v0-10-26-stable/ Mostly it's a bunch of bug fixes. Possibly more changes than you'd like to see, but no API level changes that I'm aware of. [17:17] infinity: but I don't know what to make of those messages, anyway; I'm not aware of ever having seen this, and it's even weirder that you're getting messages from /etc/rcS.d/S??apparmor + /etc/rc2.d/S??dns-clean [17:18] jamespage, regarding juju-quickstart -- there would be value in having it in main as an installer for the universe packages, eh? I'm not sure what the MIR policy on such installer packages is -- we let ubuntuone do that. But I'm asking the other MIR team members [17:18] infinity: IMHO - being 'current' has quite a bit of value. [17:18] slangasek: I imagine it's (re-)running all of rcS and rc2, those are just the only two with output. [17:18] slangasek: But haven't ever had the time to try to dissect it. [17:19] infinity: I could imagine blaming it on a late network interface start, but dns-clean appears to only be wired up for ppp interfaces [17:19] slangasek: And, again, wouldn't explain the apparmor start. [17:19] infinity: the output from those init scripts is using the standard lsb interfaces; it makes no sense at all that they would be the only two with output [17:20] slangasek: Meh. Well, I'll have to look into it again when it's not today. [17:21] infinity: fwiw the fact that you also have an 'ntpdate' process in your ps output also points me this way [17:21] slangasek: But there's something weird going on. Like I said, on my buildds, where it seems clear it's double-starting some things (because the late sysvinit messages on the console also whine about failing due to ports in use), something's up... [17:22] infinity: so if you want further debugging, please try to reproduce when booted with --verbose, then capture the upstart-induced dmesgry [17:23] slangasek: Right. Will do $later. [17:23] but I'm 70% sure that what you're seeing is deferred activation triggered by the network interface coming up [17:23] I just don't see the path it's taking [17:23] slangasek: Does that explanation explain the late apparmor? [17:23] slangasek: Do we defer all of sysvinity until post-network? [17:23] sysvinit, too. [17:24] infinity: in the sense that there are per-network-interface apparmor hooks (/etc/init/network-interface-security.conf), maybe? But I don't see where that triggers the init script [17:24] infinity: and yes, rc2 waits for static network configuration [17:25] so that explains late running of dns-clean (and apache), but not double-running [17:31] slangasek: I'll dig deeper later and see if I can come up with something more coherent than months of confused memories. [17:34] infinity: curious-- is apparmor starting twice or just very late? jjohansen has seen it start late and, for example, not load profiles (eg firefox) until much later [17:34] infinity: which of course means that his firefox is running unconfined [17:36] we have a medium term solution for pregenerating apparmor cache in kernel postinst, and know how to do it, which would then allow us to not only improve boot performance after install/first boot with new kernel, but allows us to do an upstart job very early since it won't slow boot [17:36] jdstrand: Without more debugging, hard to say for sure if it's twice or late. [17:37] (or systemd unit, doesn't matter) [17:37] jdstrand: But if pregeneration and an upstart job is something you think you can safely land at this point in the cycle, that sounds like a better method anyway. [17:37] infinity: is this a server or desktop? [17:37] jdstrand: These are all servers. [17:37] infinity: we cannot [17:37] this is 14.10 at this point unfortunately === DrKranz is now known as DktrKranz [17:37] Ahh, well. A man can dream. [17:38] infinity: so, you could quickly login and do a tcpdump that continues to run [17:38] infinity: please delay the LTS until 14.10. thanks. [17:38] infinity: then login to another console and do 'sudo aa-status' [17:38] jdstrand: Right, I might ask for debugging tips some day that isn't today. :) [17:38] and look for 'X processes are unconfined but have a profile defined' [17:38] infinity: ack [17:38] we've always known there was a race [17:39] Certainly, if the first start is that delayed, the idea that a bunch of stuff might be unconfined (like, half my system?) seems a bit less than ideal. [17:39] but in practice, people never really hit it [17:39] we still don't have widespread bug reports [17:39] well, any actually [17:39] it has just been observed on occasion [17:40] infinity: on a server it is less of a concern cause the profile load happens in the upstart job [17:40] before the exec [17:41] the whole generate profile cache in postinst is possibly SRUable btw [17:41] jdstrand: So, if it's happening in an upstart job, does that mean the console spew from the sysv job is basically redundant and meaningless? [17:42] I think we'll have most of the kernel bits-- beyond that it is mostly kernel packaging and an apparmor SRU [17:43] infinity: no-- the policy load happens in multiple stages. the network-security job makes sure dhclient is handled. upstart jobs are modified to load policy before the exec, like with mysql. then there is everything else that doesn't have an upstart job, like tcpdump [17:43] it is the last stage that is the most problematic [17:44] cause that happens via sysvinit which is racy [17:44] we could move that to an upstart job now, but people would probably get fired [17:44] :) [17:45] "you increased boot time by how long!?!" [17:45] the good news is that we have a plan, we know what to do, and it is scheduled work for later [17:46] that is quite a bit better than even 3 months ago [17:46] (we discussed it at our team sprint recently) [17:47] jdstrand: I like the precompiling notion. People notice postinsts sucking a bit less than they notice boot time. [17:47] infinity, jdstrand: it really looks like from infinity's paste that it's starting late, not running twice; note the profile loads im dmesg are from /etc/init/network-interface-security.conf, not from the sysvinit apparmor script [17:48] jdstrand: I suspect we all remember the good ol' days when we used to always run "depmod" on boot too... [17:48] infinity: (also, the fact that an upstart job loads the mysql policy only to have the sysv script do it later is redundant-- but in that case the cache is present for at least on of them, so it is as fast as the file can be read from the disk) [17:48] infinity: re sucking> absolutely [17:49] sbeattie: oh, you are right :\ [17:49] (Of course, from my weird POV, I notice postinst sucking more than boot time, because I upgrade constantly and never reboot, but I'm preeeeetty sure that's not normal) [17:50] something in trusty is delaying the apparmor sysv run... [17:50] (sometimes) [17:50] like I said, we might need to do an SRU [17:50] jdstrand: Could it be that slangasek's assertion about network interfaces only delaying rc2 is wrong, and it in fact delays all of sysvinit? That would pretty much explain that paste. [17:51] (Since I'm hung up on ntpdate) [17:51] (that said, if someone could figure outwhy it is late and fix it, that would be excellent. /me can't reproduce) [17:51] no, it could not [17:51] I unfortunately am not an expert with upstart job interactions [17:51] slangasek: Kay. Then back to the weird conclusion and the "not debugging today" thing. [17:52] jdstrand: Anyhow, I see this on tons of machines and can pretty much reproduce at will, so I'll get together with you guys to see if we can unwind it a bit later. [17:52] oh, wait [17:52] * slangasek scratches his head [17:52] infinity: that is good to know [17:52] (That was a fresh beta2 install from only moments earlier, so nothing weird there at all) [17:53] ok, so /etc/init.d/rcS is run from the bottom of /etc/init/rc-sysinit.conf, which does key on static network up [17:53] I thought we had better interleaving of /etc/rcS.d with the rest of the system [17:53] jdstrand: so yeah, any late apparmoring that you're seeing is actually related to us having made /etc/rc2.d non-racy, without noticing that this also holds up /etc/rcS.d :/ [17:54] proper fix would be to split out /etc/rcS.d handling, and make it wait for the local filesystem but not the network [17:54] slangasek: So, curious followup. Why do I have a console? [17:54] slangasek: oh! it is great to know the cause [17:55] slangasek: My hvc0 job is "start on stopped rc RUNLEVEL=[2345]", is that basically a complete lie, given how we're running rcS/rc2? === shadeslayer_ is now known as shadeslayer [17:56] slangasek: since I use network-manager, the fact that I don't see it locally is consistent with your understanding of the issue, correct? [17:56] jdstrand: correct [17:57] I never saw it in a vm cause I almost always do desktop installs in them [17:57] infinity: ummmm ok, yeah, I don't know why you would have a console with output from rc2.d showing up then [17:58] infinity: because 'stopped rc RUNLEVEL=[2345]' should mean the console starts only after all of rc2.d is done running [17:58] slangasek: That's what I would think, yes. [17:58] slangasek: so, is there something that can be done for 14.04 for this? (like I said, we can't to the cache pregeneration/upstart job change before 14.10/SRU) [17:59] slangasek: And really, while delayed boots are annoying (and they are), the larger complaint is actually that these late starts scribble all over the getty, making it confusingly non-obvious that a login prompt has happened. :P === marco_traveling is now known as marcoceppi [17:59] jdstrand: file a bug on upstart, cut'n'paste my comment above about the fix, and we'll see if we can pull it off sanely before 14.04 [17:59] slangasek: I saw the same with the ppc64el cloud images too, but assumed cloud-init was just doing something silly. [17:59] (Which it might be, but..) [18:00] slangasek: ack, thanks! shall I tag it special somehow or assign it to someone or is the bug enough? [18:02] Is there I way I can rebuild a package without installing build-deps on my system? means I trigger something like dpkg-buildpackage, and it looks for the build-deps, installs them to a temporary directory, compiles everything, moves the final deb package to a destination folder and removes the temporary directory (if specified)? [18:03]