[07:23] noon
[15:34] bryceh, so I'm working on fixing signal-unsafe logging
[15:35] the issue is that I fear it will make quite a bit of logging broken
[15:35] and I haven't seen any logging break the X server so far, but then again it might only show up as memory corruption...
[15:39] i thought you were only allowed X_NONE in signal context, which didn't add the timeout
[15:40] maybe i misremember
[15:41] apparently i do...
[15:41] s/timeout/timestamp/
[15:44] heya
[15:45] jcristau, what do you mean?
[15:45] I think there's been some misconceptions about what is actually allowed in signal context
[15:52] cnd: yeah i thought people had been careful about that, and the log was ~ just an fwrite(). guess not.
[15:52] jcristau, even fwrite isn't signal safe
[15:52] you have to use write
[15:52] http://linux.die.net/man/7/signal
[15:52] it has a list of signal safe functions
[15:54] i guess that can take locks...
[17:10] cnd, great, let me know how I can help
[17:11] bryceh, the question is: do we want to take the patches
[17:11] that could ultimately make our logging less useful
[17:11] but for which we can be sure there will be no memory or other corruption
[17:11] cnd, fewer crashes trumps better logging I should think
[17:11] I think so too, I just wanted to get a second opinion
[17:13] depends on the patch and actual effects of course. I suspect much of the logging we don't really care about that much, and if we lose something that we do, there's probably more than one way to do it
[17:13] cnd, the one thing is that I'm wondering why it'd be crashing only now; we haven't seen these types of crashes in natty/oneiric afaict
[17:14] bryceh, yeah, I don't think the logging is the real cause of the crashing
[17:14] but we can't be sure
[17:14] however, the logging does prevent us from running X under valgrind
[17:14] under certain circumstances
[17:15] * bryceh nods
[17:16] bryceh, my current task for today is:
[17:16] 1. fix logging
[17:16] 2. run valgrind again to try to resolve bug 974017
[17:16] Launchpad bug 974017 in unity-2d (Ubuntu) "Crash when touching trackpad with 10 fingers" [Undecided,New] https://launchpad.net/bugs/974017
[17:16] which may be the root cause of some of the memory corruption issues?
[17:18] hope something turns up!
[17:23] bryceh, is it normal for the archive to be frozen now?
[17:23] I just saw skaet's email on ubuntu-devel
[17:23] cnd: wow, just reproduced it and thats my same bug
[17:23] [ 11090.523] 3: /usr/bin/X (valuator_mask_set_double+0x0) [0x7fa1c508cb00]
[17:23] Sarvatt, yep
[17:23] that happens when the lid is closed here
[17:23] cnd, they froze it after beta last release too
[17:24] bryceh, ok
[17:24] cnd, historically no, it's usually been unfrozen. But they think this will help ensure a higher quality level at release
[17:24] it seems like if they are going to freeze the archive, they should be putting it in the release schedule methinks
[17:24] cnd, you can still get stuff in, it just takes an extra layer of review and chance getting kicked out
[17:24] yeah
[17:24] cnd, showing evidence of thorough testing appears to help minimize those chances
[18:38] bryceh, can you review the patch for bug 975356?
[18:38] Launchpad bug 975356 in xorg-server (Ubuntu) "Logging from signal context is unsafe" [Medium,In progress] https://launchpad.net/bugs/975356
[18:38] the patch is attached
[18:39] cnd, on it
[18:41] cnd, did you check that 100_rethrow_signals.patch is not adding unsafe calls?
[18:41] no, I didn't
[18:42] ok, we'll need to remember to do that. it might be ok, I don't remember
[18:44] bryceh, it may be easier to review the patches sent to xorg-devel
[18:44] since they are split up into easier to review commits
[18:45] ok
[18:45] the only difference is the ubuntu patch includes an extra patch from upstream 1.12 that splits out the logging type string to a separate function
[18:45] it backported without issue
[18:48] hmm... well, valgrind now works properly
[18:48] as in, it won't die
[18:48] but I don't get any hits
[18:49] no leads as to what the real bug is when putting lots of touches down
[18:53] bryceh, so now I can't get it to crash anymore
[18:53] which may mean that the logging in signal context was the real culprit all along!
[18:53] cnd, ok finished reviewing the patches
[18:54] cnd, well that would be pretty sweet if it's true
[18:54] bryceh, the results of your review is?
[18:55] yet still I wonder why we didn't see this behavior previously?
[18:55] bryceh, the only two code paths that I know of that log in signal context are touch-specific
[18:55] so they didn't exist before
[18:55] cnd, +1, I didn't spot anything erroneous, so sent my reviewed-by to the list. my knowledge of signal code is sketchy though so dunno how useful that is...
[18:55] oh, I didn't check the list
[18:56] bryceh, if you could comment in the bug too, that would be helpful
[18:56] okie
[18:56] I am going to go take this to #ubuntu-release
[18:56] to get their sign off
[18:58] cnd, should we get a bit more testing before we push it into the distro?
[18:58] bryceh, how do you propose we do it?
[18:58] * bryceh ponders
[19:00] well, we've got tons of bug reports. Might be one or two who can reproduce the bug (or a similar bug) pretty easily and have them test it?
[19:00] yeah
[19:00] bryceh, I'll throw it up in a ppa
[19:00] I suppose the thing we're more concerned about is regressions. I could just slap it on a couple machines here and just make sure they boot and basically work
[19:00] but yeah if you can ppa it, I'll scare up some testing
[19:00] k
[19:04] bryceh, I've uploaded to ppa:chasedouglas/jupiter
[19:04] ok
[19:20] wow, that built surprisingly fast
[19:20] wait dah, looking at the wrong one
[19:28] cnd, hmm bunch of test failures on the build
[19:28] hmm
[19:29] I must admit I had tests turned off
[19:29] I'll enable them and check
[19:29] PASS: xfree86
[19:29] /bin/bash: line 5: 14823 Segmentation fault (core dumped) MALLOC_PERTURB_=15 ${dir}$tst
[19:29] FAIL: touch
[19:29] ========================================================================
[19:29] 1 of 8 tests failed
[19:29] here's where it started failing:
[19:29] Testing bytes_to_int32()
[19:29] Testing pad_to_int32
[19:29] Unlinking from front.
[19:29] [mi] Increasing EQ size to 512 to prevent dropped events.
[19:32] it helps having a quad core hyperthreaded behemoth when compiling the x server :)
[19:34] full output http://paste.ubuntu.com/917885/
[19:35] hmm, output is out of order there
[19:39] config_tests = --disable-unit-tests ?
[19:39] can that be set via DEB_BUILD_OPTIONS?
[19:40] * bryceh tries 'nocheck'
[19:40] bryceh, it's an issue with the test basically
[19:40] it creates a test device, but doesn't give the device a name
[19:41] so the logging code segfaults when it tries to print the device name
[19:41] ah
[19:46] I've got a fix and am test building now
[20:09] ricotz: man that took way too long, here's a refreshed pointer barriers patch for newer xserver http://kernel.ubuntu.com/~sarvatt/patches/500_pointer_barrier_thresholds.diff
[20:09] gonna test it out now
[20:10] * Sarvatt refreshed an older version of the patch first like an idiot and had to redo it
[20:12] heh go figure, i refreshed it against master not 1.12 branch, have to fix up one hunk
[20:12] bryceh, if this logging is the cause of the corruption, any corruption bugs where we have the X log from the crash should also contain messages from error context, i.e. messages about not finding touches or having to resize the touch array
[20:18] ricotz: thank you so freaking much for refreshing our entire patch stack against the coding style changes :P
[20:20] Sarvatt, except 190 ;)
[20:21] got it building now to see if i screwed up the barriers patch somewhere
[20:22] good, let me know if it works
[20:22] i like to push it to the ppa :P
[20:24] cnd, ok, so they'd have to involve touch in some fashion
[20:24] cnd, not sure we have any matchers there but I'll take a deeper look
[20:24] cnd, the good news is I stuck it on 6 machines and they all still boot at least
[20:25] unfortunately I did get a SIGABRT on one of them (the serial touchscreen one). Dunno if it's related though. Didn't notice the system crash myself, and it hasn't done it again.
[20:25] well in the past few releases only input related crashes were caught by 100_rethrow_signals so it makes sense
[20:26] s/few/5/
[20:27] ricotz, \o/ !!
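On the unnamed test device mentioned at 19:40: the log does not show what the actual fix was. One defensive option, sketched below, is to substitute a placeholder before a possibly-NULL name reaches a `%s` conversion. The `struct device` type and both function names are hypothetical and are not taken from the X server source. Because it calls `snprintf`, this sketch is for ordinary (non-signal) context only.

```c
#include <stdio.h>

/* Hypothetical stand-in for a device record; NOT the X server's struct. */
struct device {
    int id;
    const char *name;   /* may be NULL if the creator never set it */
};

/* Return a printable name, substituting a placeholder for a missing one. */
static const char *device_name_or_placeholder(const struct device *dev)
{
    if (dev == NULL || dev->name == NULL)
        return "(unnamed)";
    return dev->name;
}

/* Format one log line into buf; returns whatever snprintf returns. */
static int format_device_line(char *buf, size_t cap, const struct device *dev)
{
    return snprintf(buf, cap, "device %d: %s",
                    dev != NULL ? dev->id : -1,
                    device_name_or_placeholder(dev));
}
```

Passing a NULL pointer to `%s` is undefined behaviour in C, so guarding at the formatting boundary protects every caller, including test fixtures that skip optional fields.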
[20:27] ricotz: bah /usr/bin/install: failed to extend `/home/sarvatt/source/bzr/xorg-pkg-tools/xorg-server/debian/tmp/main/usr/bin/Xorg': No space left on device
[20:28] 20 minutes into it :P
[20:28] Sarvatt, do you know why it was only catching the input crashes?
[20:28] Sarvatt, doh. SSD? ;-)
[20:28] bryceh: we had to disable it a bit because it wasn't working in the karmic timeframe, then pitti did some magic to get it working again and after that only input related crashes triggered it, i never could figure out why
[20:29] input and proprietary drivers, let me rephrase that :)
[20:30] yeah
[20:31] yeah SSD, i run on 2gb free space 99% of the time :)
[20:31] I certainly remember going through it with pitti.
[20:33] problem is it's kinda hard to test
[20:34] we would just send signals to the server. probably would be better if we deliberately introduced various kinds of faults, and checked that apport caught them
[20:34] but we've never been short on bugs, and people are willing to run gdb (which gives better backtraces anyway), so hasn't been that high on my todo list
[20:35] bryceh, ;)
[20:35] Sarvatt, will testbuild and push it then
[20:35] plus signal handling code hurts my little brain
[20:35] bryceh: Heheh, no longer the case here somehow :-)
[20:36] * ricotz uses pbuilder on tmpfs ;P < Sarvatt
[20:37] but i like to use a web browser, 4GB isnt even enough for chromium
[20:37] yeah i am struggling with 8gb here which isnt enough in many cases :\
[20:38] ricotz: Yeah I'm using 16gb atm :x
[20:39] ricotz: it builds, ship it!
[20:47] Sarvatt, done
[21:58] cnd: Warning: attempting to log data in a signal unsafe manner while in signal context. Please update to check inSignalContext and/or use LogMessageVerbSigSafe(). The offending log format message is:
[21:58] %d: %s (%s+0x%lx) [%p]
[21:58] Sarvatt, right, but backtrace printer is running in signal context...
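For context on the warning above and the earlier point that even fwrite is not signal safe: a minimal sketch of what logging with only async-signal-safe calls can look like. This is not the X server's LogMessageVerbSigSafe() implementation. The names `format_ulong`, `write_all` and `log_sigsafe` are hypothetical, and the sketch assumes a log line is just a text prefix followed by one unsigned number.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: render value as decimal digits without printf-family
 * calls, which are not async-signal-safe. Returns the number of characters
 * written (no NUL terminator), or 0 if the digits do not fit in cap. */
static size_t format_ulong(unsigned long value, char *buf, size_t cap)
{
    char tmp[3 * sizeof(unsigned long)];   /* more than enough digits */
    size_t n = 0;

    do {
        tmp[n++] = (char)('0' + value % 10);   /* digits come out reversed */
        value /= 10;
    } while (value != 0 && n < sizeof(tmp));

    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    return n;
}

/* Hypothetical helper: push the whole buffer out with write(2), retrying on
 * short writes and EINTR. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, buf, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += w;
        len -= (size_t)w;
    }
    return 0;
}

/* Hypothetical: emit "<prefix><value>\n" using a stack buffer and write(2)
 * only. errno is saved and restored so the interrupted code does not see a
 * value changed by the handler. */
static int log_sigsafe(int fd, const char *prefix, unsigned long value)
{
    char line[128];
    size_t len = 0;
    int saved_errno = errno;

    while (*prefix != '\0' && len < sizeof(line) - 1)
        line[len++] = *prefix++;
    len += format_ulong(value, line + len, sizeof(line) - 1 - len);
    line[len++] = '\n';

    int rc = write_all(fd, line, len);
    errno = saved_errno;
    return rc;
}
```

The trade-off matches the one discussed at 17:11: hand-rolled formatting supports far fewer conversions than printf, so messages get plainer, but nothing here allocates memory or takes a stdio lock.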
[21:59] s/but/the/
[21:59] hmmm
[21:59] its awesome its not crashing the server anymore, can finally close my lid :)
[21:59] Sarvatt, is it fixed with the patch I added?
[21:59] [mi] EQ overflow continuing. %lu events have been dropped.
[22:00] yeah i'm using your newest one http://paste.ubuntu.com/918083/
[22:00] ok
[22:01] Sarvatt, so based on your testing, the printing in a signal-safe manner fixes things?
[22:01] just want to be sure
[22:01] it fixes things but the printing is screwed up and its screwing up the mieqEnqueue printing too
[22:01] yeah
[22:03] thats me 10 finger + puppy paw pressing on the touchpad repeatedly in the log
[22:04] (canonical-tech email humor)
[22:05] heh
[22:05] going to dupe all the bugs related to the crash i have on lid close on macs to the 10 finger one, its the same darn thing
[22:06] yay
[22:07] bryceh: if you see [291943.052] 3: /usr/bin/X (valuator_mask_set_double+0x0) [0x7fea24624ab0] in any crash logs its a dupe of 974017
[22:07] macs are wigging out sending input events constantly from the lid when its closed
[22:08] there were a few duped to doko's bug
[22:09] https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/933504 going through those now
[22:09] Launchpad bug 933504 in xorg-server (Ubuntu Precise) "Xorg crashed with SIGABRT in __libc_message "double free or corruption (out)" from DeleteInputDeviceRequest" [High,Triaged]
[22:11] cnd: is it known xinput testxi2 stopped working on macs in the past 2 weeks or so?
[22:11] BadAtom xerrors
[22:11] Sarvatt, I know of it :)
[22:12] haven't had a chance to look at it
[22:12] tried git xinput but no go, this was working a few weeks ago when i used it to see that input crap was happening from the lid
[22:12] yeah
[22:12] I don't have any idea what might have changed
[22:13] something in x-x-i-synaptics no doubt
[22:14] might be
[22:14] i'll bisect that
[22:17] cnd: surprised you didn't think it would be good to enable the right button area even if it didn't apply to macs :P
[22:17] it was just mac people complaining
[22:17] Sarvatt, I think it would be good
[22:17] but it was a feature that came really late in the cycle
[22:17] if we had another RC or beta release I would have pushed for it
[22:18] and I certainly won't stand in the way if people who are interested want to raise it with the release team
[22:18] I just didn't feel I had the justification to enable it myself
[22:18] i would have totally done it no questions asked, it only affects non apple clickpads which all do need it
[22:19] but yeah too late now with the freeze :(
[22:19] Sarvatt, ok thanks, valuator_mask_set_double sounds familiar
[22:19] Sarvatt, hmm... olli ries has managed to crash X still by drumming on his magic trackpad
[22:20] with your new one?
[22:20] unfortunately, the backtrace in signal context doesn't help :)
[22:20] yeah
[22:20] hmm wth is the touchpad sending e for
[22:20] drumming my finger on the pad its going eeeee in irc
[22:21] Sarvatt, do you have ginn running, by any chance?
[22:21] * Sarvatt has 10 finger drumming going on and still hasnt crashed
[22:21] nope
[22:22] dunno then
[22:22] i'm using utouch daily too
[22:22] the e's stopped
[22:23] is he clicking when hes drumming?
[22:23] 4 minues of drumming hasnt crashed anything here on bcm5974
[22:24] cnd: nothing new is getting sent to my Xorg.0.log which is weird
[22:24] Sarvatt, what are you expecting?
[22:25] it stopped at [ 34.404] [mi] Increasing EQ size to 512 to prevent dropped events.
[22:25] [ 34.404] [mi] EQ processing has resumed after 249 dropped events.
[22:25] [ 34.404] [mi] This may be caused my a misbehaving driver monopolizing the server's resources.
[22:25] stopped logging totally
[22:25] that log i pastebinned is still the same, nothing new is getting written to the log
[22:27] 5 minutes of drumming on the touchpad should have sent tons of spam :)
[22:27] yeah i plugged in an external monitor and it didnt log any of the edid probes
[22:30] cnd: ignore me, it did just now, thought i found a bug but nope
[22:31] ok
[22:32] heh then X crashed
[22:33] %-}
[22:33] it stopped logging input stuff for a good 30 minutes, then plugging in an external and the edid probe writing to the log got it back where 10 fingers on the touchpad would crash it again
[22:34] 10 finger press spams the log like hell with your patches, so i know it stopped logging input related things after that EQ processing message
[22:37] huh
[22:37] I should add an fsync to the logging function
[22:37] maybe that'll fix things
[22:46] Sarvatt, no dupes for 974017 in xorg or xorg-server with valuator_mask_set_*
[22:46] bryceh: your dupe checker isnt checking for bugs that are already duped to another bug :)
[22:46] every time i filed it it tried to get duped to dokos bug
[22:46] Sarvatt, hum true
[22:47] has anyone asked doko if he still hits the bug?
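On the fsync idea from 22:37: a rough sketch of a write followed by fsync(2), assuming the log is written through a plain file descriptor. The function name `log_write_durable` is hypothetical, and whether this would cure the missing log lines seen above is not established anywhere in the log.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical: append a record with write(2), then ask the kernel to flush
 * it to storage with fsync(2). Returns 0 on success, -1 on error. */
static int log_write_durable(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, buf, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing; retry */
            return -1;
        }
        buf += w;
        len -= (size_t)w;
    }
    return fsync(fd);              /* blocks until the data reaches the device */
}
```

Both write and fsync are on the POSIX list of async-signal-safe functions. Note that fsync mainly affects whether data survives a crash or power loss; bytes already handed to write(2) are normally visible to other readers of the file without it, and calling fsync on every message can be slow.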
[22:47] its eating up other valid bugs
[22:49] i just asked on the bug
[22:49] dont think apport will dupe to closed bugs
[22:50] when apport dupes it removes the useful info so will be good to start fresh if hes not hitting it, his specific bug he was hitting when glibc busted nvidia proprietary drivers
[22:51] which was fixed forever ago
[22:53] Sarvatt, sounds good
[22:53] Sarvatt, or we can add a tag to make apport recollect stuff
[22:53] proprietary drivers taking down X, signal handler called, input stack trying to print messages still to the log while its printing the backtrace from the real crash, every bug where input is writing to the log unsafely is getting duped to it
[22:54] bryceh: theres a tag for that?
[22:56] yep
[22:57] needs-retrace I think, lemme check
[22:57] i filed a bug, duped to doko's bug and all good logs removed, unduped it, and it automatically duped it yet again so i gave up and filed a new one and changed the info/deleted the core before it got retraced so it could be permenant
[22:57] pain in the butt
[22:58] nope, it's apport-request-retrace
[22:58] http://www.piware.de/2011/11/apport-1-90-client-side-duplicate-checking/
[22:58] $ ./search-attachments xorg-server ThreadStacktrace.txt valuator_mask_set_
[22:58] 948792 Confirmed Xorg crashed with SIGSEGV in valuator_mask_set_double
[22:58] 948697 Confirmed [bcm5974] Xorg crashed with SIGSEGV in valuator_mask_set_double()
[22:58] so, including dupes I only found those two
[22:59] both already properly duped
[22:59] you knew pinged me about someone who could reproduce it easily and had 2-3 bugs already filed last month
[23:00] think it was a canonical employee so it showed up on the bug radar management pushes
[23:02] google to the rescue, just duped 3 more
[23:05] of course the master bug is filed against unity when its not a unity problem