[08:42]  * apw yawns
[08:42] <lucent> hi...
[08:42] <lucent> I'm hoping for an angel, someone who can take 10 minutes and walk me through reproducing and reporting a kernel bug
[08:42] <lucent> totally confused here what layer of the kernel is the bug
[08:43] <apw> lucent, well we mostly don't categorise bugs differently based on that; if it's a kernel bug, it's filed against linux
[08:43] <lucent> IEEE1394B adapter (express card) which used to be reliable, is now not reliable with the 10.10 kernel
[08:43] <lucent> ah okay
[08:44] <apw> thats firewaire right?
[08:44] <apw> wire even
[08:44] <lucent> where to start though?  my symptom was last week, I plugged in a firewire drive, it automounted, decided to fsck, and fsck killed the filesystem 'cause of data errors somewhere... in the darkness
[08:44] <lucent> yep
[08:45] <lucent> at the moment there's no data on the drive to even really test it, I don't have the same circumstances
[08:45] <apw> well we switched firewire stacks in 10.10, so you ought to be able to switch back for testing
[08:46] <lucent> ah.   okay that's one thing then, good to hear
[08:46] <apw> i would put something on you don't care about and can test easily, like a copy of /bin or something
[08:46] <apw> but do file a bug with ubuntu-bug linux regardless
[08:46] <lucent> is there a favourite storage device test you know of?
[08:47] <lucent> copying files and such, or some tool more specifically?
[08:48] <apw> there is an fsstress test, now what is it called
[08:48] <apw> bonnie i think is the one
[08:48] <apw> cking, what does one recommend for disk hammering
[08:48] <jk-> dbench ?
[08:48] <cking> apw, bonnie++ is my tool of choice
[08:49] <apw> yep that would be a good one too
[08:49] <apw> yeah i think bonnie++ is the one i would start with
[08:50] <apw> lucent, we need to get this reproducible so we can confirm/exclude the stack switch
[08:51] <cking> bonnie++ is thorough, you may need to fiddle with the settings as the defaults may not be 100% applicable to your use case
[08:52] <lucent> catching up here, got distracted a minute
[08:52] <lucent> I'm motivated to do what I can to reproduce and report, with a little hand holding along the way (thank you!)
[08:53] <bryceh> apw, do we have much in the way of kernel freeze debugging toolage for radeon like we do for intel?
[08:53] <bryceh> (hi btw)
[08:57] <apw> bryceh, hi ya ... hrm not that i am aware of, i have to admit to being rather dependent on RAOF in these matters
[08:58]  * apw suspects its pretty late where bryceh is
[08:58] <bryceh> feh, only 1am
[09:00] <lucent> cking: I'm willing to learn a new command, though is bonnie++ capable of destructive (write/read) data testing?  I am concerned that since the drive is mostly zeroes now, it will impact the efficacy of a test
[09:00] <bryceh> apw, yeah I know from the X side we don't have tools for radeon.  was hoping maybe there was something general purpose on the kernel side, but perhaps not
[09:06] <cking> lucent, bonnie++ is capable of write/read testing, not sure how that impacts on your target H/W - it depends on the filesystem, how full it is, etc
[09:07] <lucent> cking: brilliant, will go forward to learn about it
[09:09] <apw> bryceh, nothing i know of no, but i can't claim i've looked very close for radeon
[09:33] <diwic> smb, ping
[09:34] <smb> diwic, Hey David
[09:34] <diwic> smb, I'm a little unsure of what to do with those verification-failed sru:s you asked me to look at
[09:35] <apw> smb, morning
[09:35] <smb> Well those all looked to have a similar problem: a quirk which seemed to have no effect
[09:35] <diwic> smb, so I'm quite sure of what the problem is
[09:36] <smb> diwic, Whenever you can prove (with a test kernel) that a fix works now, then those could be resubmitted to stable
[09:36] <smb> apw, Morning
[09:36] <diwic> smb, I'm just not sure on which machines I should enable a fix
[09:37] <smb> diwic, I would suggest to look exactly which hw is not working in the bug reports. Then provide them test kernels that should fix their machines and ask for feedback
[09:37] <diwic> smb, well, we have a test kernel (dkms) that disables via_dmapos_patch for all devices which works
[09:38] <smb> The problem, at least with the last patches seemed to have been that nobody tested the actual quirk
[09:38] <diwic> smb, the question is if this fix should apply to a specific controller, a controller revision or just those particular machines
[09:39] <diwic> smb, /probably/ it is one or a few controller revisions we should check for, but if I'm wrong, we might screw up now working machines
[09:39] <smb> I probably don't know the area of hw so well, but I thought that even the same chip/controller could work or not work depending on the wiring or implementation of the machine, is this right?
[09:40] <diwic> smb, that's usually the case when it comes to the hda codec, but this is likely a bug in the controller (southbridge)
[09:41] <smb> Hm, so the feeling would be to quirk the controller and hope for the best. Hm, if the quirk would be wrong, what would be the (worst) impact. Lower performance/distorted or no sound?
[09:41] <diwic> smb, worst case would be no sound at all.
[09:43] <smb> Gah, not really something one wants to put into stable. On the other hand I can understand one would not want to quirk all the broken machines individually...
[09:44] <smb> I guess it is too much optimism to hope there would be a way to do a runtime check on brokenness
[09:46] <diwic> heh, the reason why we need these quirks in the first place is because the runtime brokenness check isn't working well enough...
[09:47] <diwic> when is the 2.6.37 window going to open?
[09:48] <smb> Doh! Ok, so for stable it feels like rather quirking the machines and for upstream the run-time check should get fixed (if that is possible)
[09:49] <bryceh> smb, up late?
[09:49] <smb> diwic, Depends when .36 gets out... Haven't looked today. apw, what was your estimate?
[09:49] <smb> bryceh, No reasonable time. My clock says 10:50am
[09:50] <smb> bryceh, Maybe you meant diwic 
[09:50]  * diwic is in the same time zone as smb 
[09:50] <bryceh> possibly nick collision
[09:50] <diwic> Should those fixes be sent to greg as well even if there is no "upstream counterpart"?
[09:51] <lucent> diwic: it appears bonnie++ depends on the filesystem layer;  know of a tool that allows me to test a block storage device without putting a filesystem on it?
[09:51] <smb> diwic, I would do so with explaining why. 
[09:52] <diwic> lucent, are you sure you asked the right person?
[09:52] <lucent> destructive tests are fine, preferred even so this can be repeated without special requirement
[09:52] <RAOF> bryceh: There's nothing like intel's debuggage, although it wouldn't be impossible to add - radeon already has a bunch of hangcheck features.
[09:52] <lucent> oh
[09:52] <smb> diwic, Or do you think upstream might go with quirking the single machines until a better solution is found
[09:52] <lucent> diwic: my mistake name-tab failure here
[09:52] <lucent> cking: bonnie++ requires a filesystem, I think?   any tool you know that can work without the requirement for a filesystem?
[09:53] <diwic> smb, so quirk individual machines for stable and try to get entire controllers into 2.6.37?
[09:56] <smb> diwic, I would probably discuss this with upstream. If quirking the entire controller is felt as not dangerous it might be acceptable for stable, too. Or they want to fix the detection and that might be more complicated. Then quirking the individual devices would be better for stable
[09:57] <bryceh> RAOF, bug #649141 is what I've been chasing.  I'm looking for what the next level of debugging would be for this.
[09:57] <ubot2> Launchpad bug 649141 in linux (Ubuntu Maverick) (and 1 other project) "BUG: unable to handle kernel paging request - EIP: [<f959ae41>] snd_ctl_poll (Followed by system lockup) (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/649141
[10:05] <cking> lucent, I did make a tool called iobenchmark which I put in my ppa (it's under the karmic series). that does testing on a raw device.
[10:06] <lucent> cking: looking for that now, thank you
[10:09] <cking> run it with -R (read) or -W (write) and specify the raw device with -f, e.g. -f /dev/sdX - it will destroy any existing  data on the device 
[10:10] <lucent> is it aware of the capacity of the underlying device?
[10:10] <cking> it's fairly hacky code, so if it breaks, you keep the pieces.
[10:11] <diwic> bryceh, were you listening to music through HDMI or the HDA-realtek? 
[10:11] <lucent> i.e. any reason not to run it on a 1TB drive (knowing that it's a destructive test, yes)
[10:11] <cking> lucent, I've never tested it on anything so large. It may take a few hours as it repeats the read/writes many times
[10:12] <lucent> good to know, thanks
[10:12] <cking> lucent, what do you want to test?
[10:13] <bryceh> diwic, yes through snd_hda_codec_realtek, not hdmi
[10:13] <lucent> cking: I need to reproduce a bug somewhere between the filesystem layer -> express PCI card -> firewire 1394b OHCI -> random gamma rays
[10:13] <lucent> I suspect it's firewire related but I want a process to reproduce and not just this willy nilly data loss I've been running into on one of two firewire adapters since an upgrade to Ubuntu 10.10 and the new kernel
[10:15] <lucent> need to trigger this and would rather not rely on a working filesystem because it's going to get hosed anyways if I do trigger the bug
[10:15] <cking> lucent, so you may or may not need a mix of data bursts, random I/O patterns to reproduce. Best start with something simple like dd'ing to the device and working up more complex tests if that fails to reproduce
[10:15] <lucent> okay...  dd'ing zeroes?   /dev/urandom ?
[10:16] <diwic> bryceh, so I noticed that the first one said snd_ctl_poll+ something. Are the freezes always related to snd_-something, or is it just random stuff?
[10:16] <lucent> never had to test a block device beyond seeing if just zero'es can be written to it
[10:16] <cking> dd'ing zeros is faster than reading from /dev/urandom, so try that
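A minimal sketch of the write-then-verify idea from the exchange above, in Python rather than dd. It writes a deterministic, non-zero pattern (so a drive that is already mostly zeroes cannot hide corruption) and then reads it back block by block. The function names, the seed, and the block size are illustrative assumptions; none of this comes from iobenchmark or bonnie++. Pointing it at a raw device such as /dev/sdX destroys whatever is stored there.

```python
import hashlib
import os


def pattern_block(index, size, seed=b"fw-test"):
    """Return `size` deterministic bytes derived from the seed and block index."""
    unit = hashlib.sha256(seed + index.to_bytes(8, "big")).digest()
    reps = -(-size // len(unit))  # ceiling division
    return (unit * reps)[:size]


def write_pattern(path, total_bytes, block_size=1 << 20, seed=b"fw-test"):
    """Write the pattern to `path` (a file or a block device) and sync it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    with os.fdopen(fd, "wb") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            chunk = pattern_block(index, min(block_size, total_bytes - offset), seed)
            f.write(chunk)
            offset += len(chunk)
            index += 1
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, total_bytes, block_size=1 << 20, seed=b"fw-test"):
    """Read the data back; return the starting offsets of blocks that differ."""
    bad_offsets = []
    with open(path, "rb") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            want = pattern_block(index, min(block_size, total_bytes - offset), seed)
            got = f.read(len(want))
            if got != want:
                bad_offsets.append(offset)
            offset += len(want)
            index += 1
    return bad_offsets
```

A read issued straight after the write may be answered from the page cache rather than the device, so it is probably worth unplugging and replugging the drive (or otherwise dropping caches) between write_pattern and verify_pattern so that the data really travels over the FireWire link.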
[10:16] <lucent> thanks for the advice, I'm relieved to hear suggestions
[10:16] <lucent> this is making me a little crazy
[10:16] <bryceh> diwic, I've only seen that one that was snd_* related; the other freezes I've not gotten any info other than what I've specified in the comments
[10:18] <bryceh> diwic, aside from the snd_* one, the other freezes have been associated with playing a game that uses mesa, so that's why I'm suspecting a mesa/3d/dri causation there
[10:19] <diwic> bryceh, okay. Let me know if things start to point more towards the snd area
[10:29] <apw> smb, well we are at -rc6 so a couple of weeks likely
[10:30] <smb> diwic, ^
[10:31] <apw> he has been averaging -rc6 to -rc8 the last 6 or so releases, so it could be now and it could be 2 weeks; hard to be more accurate
[10:31] <diwic> okay
[10:32]  * smb has about 10mins till check-in
[10:32] <smb> Actually no, now!
[10:32] <apw> good luck
[10:32] <apw> let me know when you land
[11:03] <Laney> Are you aware of a regression in M where wireless is incredibly slow when running on battery, but fine on power?
[11:03] <Laney> (MBP 7.1 here, fwiw)
[11:03] <apw> Laney, not heard that reported no ... though being a macbook nothing would surprise me
[11:04] <Laney> some kind of over-zealous power management perhaps? Renders wireless performance pretty unusable.
[11:04] <Laney> apw: Indeed, anyway it coincided with my maverick upgrade
[11:04]  * apw uses M based systems generally in that mode a lot, so its not systemic
[11:04] <Laney> indeed, I suspect it is hardware specific
[11:05] <Laney> I filed 651008 about it
[11:05] <apw> yep, macbooks are about the most difficult to get documentation for out there, apple is not keen that other OSs run on their h/w
[11:05] <apw> bug 651008
[11:05] <ubot2> Launchpad bug 651008 in linux (Ubuntu) "Regression in wireless performance under Maverick (broadcom) (affects: 2) (heat: 12)" [Undecided,New] https://launchpad.net/bugs/651008
[11:09] <popey> ahh Laney I didn't realise it only affected you on battery. I have only tested when on mains
[11:09] <Laney> popey: yeah, *just* figured that out
[11:09] <popey> oh :)
[11:10] <Laney> I realised it was intermittent, but didn't link it to this here wire supplying juice
[11:10]  * popey subscrib0rs
[11:12] <apw> Laney, ok so that's BCM4322, all the kit i have which uses the binary-junk driver is 4312 ... does not show anything like this
[11:12] <apw> Laney, are you using 11n ?
[11:12] <Laney> apw: no, b+g
[11:13] <apw> popey, do you have one of these mac nightmares ?
[11:13] <Laney> I tried to boot the lucid kernel but it never made it to X unfortunately
[11:13] <apw> Laney, i am a little surprised at that
[11:14] <Laney> yeah I just got a vt and couldn't start gdm manually
[11:14] <Laney> admittedly I didn't poke too hard
[11:16] <popey> apw: i do have one of those lovely apple laptops, yes
[11:16] <apw> popey, i feel for you
[11:16] <popey> meh
[11:17] <popey> i get that a lot. 
[11:17] <Laney> it actually work(ed in lucid)s quite well
[11:17] <apw> yeah thats cause lucid has a year old kernel in it ... takes about that long for the kernel to catch up with every random change apple makes
[11:17] <apw> i am surprised that the 7's are working yet, i still see trouble with the 5's
[11:18] <apw> not that i have a clue what the difference between a 5 and a 7 is of course
[11:18] <popey> you may recall the fun we had with the sata bus on this device which means 10.04 can't see the hard disk
[11:20] <Laney> oh yeah, 10.04.1 then ;)
[11:21] <apw> Laney, popey, odder is that the wl driver you are using is bit for bit identical between the two releases
[11:22] <Laney> apw: Yeah, I'm wondering if it's some power management the kernel is doing
[11:22] <apw> Laney, so what graphics does this have?  as the logical test is running a lucid kernel
[11:22] <Laney> trying to blat it into low power mode unsuitably
[11:22] <apw> Laney, that kind of thing is normally in the driver
[11:22] <Laney> oh ok
[11:22] <apw> and as it's a binary driver we certainly aren't telling it to do anything specific
[11:23] <Laney> I'm using -nvidia too
[11:24] <apw> heh talk about putting all the worst bits in the one box ...
[11:24] <apw> shame its such a pretty box you cannot resist paying for them
[11:24] <popey> It's a very pretty box :)
[11:24] <popey> damnit!
[11:24] <cking> that's the beauty of proprietary kit
[11:24] <apw> yep ... isn't it just
[11:25] <apw> still, another year and maybe we'll have decent bcm drivers
[11:25] <Laney> well work purchased it for me, didn't have a choice
[11:25] <cking> shiny on the outside... closed on the inside
[11:25] <popey> I purchased mine on advice of a Canonical employee who told me 'everything works'
[11:25] <popey> I suspect he uses OSX though.
[11:26] <Laney> speaking of work, must head in now
[11:26] <Laney> back soon chaps
[11:26] <apw> popey, well and as apple randomly sub in components, you can only say that about the one you have in your hand not the next one you buy
[11:36]  * penguin42 has noticed a little quad of 4 kernel oopses that look the same that I think I might half see what's happening - someone want to have a look? Bug 640154 bug 646215 and bug 653591; I've got a bit of a description as the last comment on bug 632430
[11:36] <ubot2> Launchpad bug 640154 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference in ips_adjust in intel_ips on Sony VPC-B11KGX (affects: 2) (heat: 200)" [Undecided,New] https://launchpad.net/bugs/640154
[11:36] <ubot2> Launchpad bug 646215 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference at (null) - ips_adjust in intel_ips (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/646215
[11:36] <ubot2> Launchpad bug 653591 in linux (Ubuntu) "[18446744058.496026] BUG: unable to handle kernel NULL pointer dereference at (null) ips_adjust in intel_ips (affects: 1) (heat: 8)" [Undecided,New] https://launchpad.net/bugs/653591
[11:36] <ubot2> Launchpad bug 632430 in linux (Ubuntu) "ips-adjust - BUG: unable to handle kernel NULL pointer dereference at (null) (affects: 3) (heat: 22)" [Undecided,New] https://launchpad.net/bugs/632430
[11:40] <apw> penguin42, or is ips null
[11:41] <penguin42> apw: Don't think so, it's used a few lines earlier and generally used all over in that file
[11:51] <apw> penguin42, i am guessing actually that gpu_lower is null
[11:51] <apw> that seems to match your mental image too
[11:52] <penguin42> apw: Yeh, I think what's happening is that it's found to be null, the thing that looks for it sets gpu_turbo_enabled to false so that line isn't called, and then something later - e.g. update_turbo_limits or ips_irq_handler turns it back on
[11:52]  * penguin42 doesn't have the hardware to find out; I just noticed the 4 similar oopses
[11:52] <apw> penguin42, do the people on the bugs find it happens always/readily ?
[11:55] <penguin42> apw: It's not obvious, I think some of them are just where the system told them to report it; one of them commented it happened coming out of hibernation, another got a 'CPU power or thermal limit exceeded' just before it
[11:56] <apw> we'd expect to see the CPU power thing in some cases, as that would indicate this code is triggered, and possibly a resume also makes sense if it was on before the resume
[11:57] <apw> and the suspend/hibernate made things cooler
[11:57] <penguin42> apw: The update_turbo_limits says it's 'Used at init time and for runtime BIOS support, which requires polling the regs for updates (as a result of AC->DC transition for example).' so I wouldn't be surprised if it got kicked during a hibernate
[11:58] <apw> right
[11:59] <penguin42> I find it curious it seems to be 3/4 are VAIOs
[11:59] <apw> did you say one got a message about being disabled on boot
[11:59] <apw> penguin42, probably common h/w would trigger this, so not so surprising
[12:00] <penguin42> apw: One of the boot logs had a ' failed to get i915 symbols, graphics turbo disabled' which is the message it prints when it's looking for gpu_lower and finds it's NULL
[12:01] <apw> yeah
[12:01] <apw> thats the one, thanks
[12:05] <penguin42> on a different question, if I've built the kernel from the ubuntu git using AUTOBUILD=1 NOEXTRAS=1 fakeroot debian/rules binary-generic   is there a way just to do a make to rebuild a module or two rather than the whole package if I'm just adding some debugging?
[12:05] <apw> penguin42, looks like we could hit this if we have polling turned on
[12:05] <apw> penguin42, yep remove the build stamp in debian/stamps
[12:06] <apw> and then rebuild it as normal (obviously not cleaning it)
[12:06] <penguin42> apw: Using the debian/rules binary-generic or with a make ?
[12:06] <apw> yeah d/r b-g
[12:06] <apw> ok i think i can see how this might trigger and how only some machines would be affected
[12:06] <apw> will think on how we might avoid it
[12:07] <penguin42> apw: Could check whether gpu_lower is NULL after all the places that set gpu_turbo_enabled
[12:09] <apw> well its where we try and enable it, we should check if we managed to get the symbols
[12:09] <apw> and abort the enable
[12:10] <penguin42> apw: Yeh, it does that during the initial enable
[12:10] <apw> yeah but not during a polled enable.
[12:10]  * apw will try and make it do something sensible ... and we can ask them to test
[12:11] <penguin42> apw: Yeh, might be right to bounce it off Jesse Barnes ?
[12:11] <apw> yep will send it to him as well
[12:13] <penguin42> apw: Maybe just change ips->gpu_turbo_enabled = (ips->gpu_lower!=NULL)  && !(hts & HTS_GTD_DIS);  ?
[12:16] <apw> penguin42, that's the kind of thing we want as a minimum for sure
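The guard penguin42 suggests can be modelled in a few lines. This is a Python stand-in for the C logic, not the intel_ips driver code and not the fix apw eventually submitted; the class, the function names, and the bit position chosen for HTS_GTD_DIS are placeholders for illustration. The point it shows is that every path that re-evaluates the flag should also require gpu_lower to have been resolved.

```python
HTS_GTD_DIS = 1 << 0  # placeholder bit position, not the real register layout


class IpsState:
    """Stand-in for the driver state: gpu_lower is None when the i915 symbols were not found."""

    def __init__(self, gpu_lower=None):
        self.gpu_lower = gpu_lower
        self.gpu_turbo_enabled = False


def refresh_gpu_turbo(ips, hts):
    """Re-evaluate the flag as on a polled update, but never enable it without gpu_lower."""
    ips.gpu_turbo_enabled = (ips.gpu_lower is not None) and not (hts & HTS_GTD_DIS)
    return ips.gpu_turbo_enabled


def adjust(ips):
    """Call gpu_lower only when turbo is enabled, mirroring the path that oopsed."""
    if ips.gpu_turbo_enabled:
        return ips.gpu_lower()
    return None
```

Without the `is not None` term, a polled update with the disable bit clear would set the flag on a machine where gpu_lower was never found, and the later call through it is the analogue of the NULL pointer dereference in the bug reports.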
[12:16] <penguin42> hmm breakfast I think
[14:53] <pmatulis> has anyone else heard of a problem with the last kernel update on lucid that prevents booting with lvm setups (seeing a simple '/ over lvm' case here)?
[15:24] <apw> pmatulis, not heard anything like that no
[16:40] <Haegin> hi, who is the best person to talk to to request a patch being added to the kernel or does that have to happen upstream? It's for adding driver support for a usb remote.
[16:51] <apw> Haegin, normally you would suggest it on kernel-team list
[16:51] <apw> https://wiki.ubuntu.com/Kernel/FAQ#Can I get a patch included in the Ubuntu Kernel? / How can I submit a patch to the Ubuntu Kernel?
[16:52] <Haegin> apw: ok, thanks
[16:54] <JFo> A lalalala OOh
[16:54] <JFo> gah! 
[16:54] <JFo> where did that come from?
[19:43] <penguin42> apw: Looks like that ips stuff has done the job for people
[20:13] <apw> penguin42, thanks
[20:14] <penguin42> apw: What made you realise to use the _busy flag?
[20:15] <apw> penguin42, actually thats not a flag, its the primary routine the outer loop uses
[20:15] <penguin42> ah ok
[20:17] <apw> penguin42, but great thanks for following up, i'll get that submitted where it needs to be tomorrow
[20:18] <penguin42> apw: No problem; I just started realising I'd seen a few with similar oops; it would be kind of nice if launchpad could group oopses based on the backtrace
[20:18] <apw> penguin42, thanks for that, its great when people help out this way
[20:18] <apw> penguin42, but yes its on our 'launchpad can you help with this' list
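A rough sketch of what grouping oopses by backtrace could look like. It is not a Launchpad feature or API; the regular expression assumes call-trace lines of the form "[<address>] function+0xoff/0xlen", and the function names and the depth parameter are illustrative. Addresses and offsets are dropped so that the same crash on different builds still produces the same signature.

```python
import re
from collections import defaultdict

# Matches one call-trace frame, e.g. " [<f959ae41>] ? snd_ctl_poll+0x41/0x90 [snd]"
FRAME_RE = re.compile(
    r"\[<[0-9a-fA-F]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)


def backtrace_signature(oops_text, depth=5):
    """Return the first `depth` function names in the trace, without addresses or offsets."""
    return tuple(FRAME_RE.findall(oops_text)[:depth])


def group_by_backtrace(reports, depth=5):
    """Map each signature to the list of report ids that share it."""
    groups = defaultdict(list)
    for report_id, text in reports.items():
        groups[backtrace_signature(text, depth)].append(report_id)
    return dict(groups)
```

Keeping only the top few frames is a judgment call: too few frames merges unrelated crashes that happen to end in a common helper, too many splits one bug across reports whose callers differ.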
[20:39] <MorkBork> i think there may be a bug involving ahci or ata in the latest karmic kernel
[20:40] <MorkBork> (2.6.32-25.44)
[20:40] <MorkBork> so far i havent been able to reproduce it with 2.6.32-24.43
[20:52] <penguin42> apw: You know that lshw bug from the other week? There are still a bunch of open oopses from it; what's the right thing to do, now it's fixed is it right to merge them?
[20:59] <MorkBork> dunno why i said karmic
[20:59] <MorkBork> meant lucid
[20:59] <MorkBork> ><
[21:05] <bjf> MorkBork, what is the issue that you are seeing?
[21:15] <MorkBork> seeing a lot of errors in dmesg when i do heavy disk io (checking a raid array for example)
[21:15] <penguin42> MorkBork: Want to pastebin them?
[21:15] <MorkBork> typically "failed command: READ FPDMA QUEUED"
[21:16] <MorkBork> yea
[21:16] <MorkBork> i was grepping syslog
[21:16] <MorkBork> i booted  2.6.32-24.43 and havent been able to reproduce it yet
[21:16] <MorkBork> i reproduced it three times after an hour or so of checking arrays with 2.6.32-25.44
[21:19] <MorkBork> oh and "failed to read log page 10h (errno=-5)"
[21:19] <MorkBork> which i see mentioned in one of the patches but i didnt see how it could cause this
[21:20] <penguin42> MorkBork: I'd look for the first weird errors, once something goes wrong then what comes afterwards can be less meaningful
[21:20] <MorkBork> the 'failed to read log page' is typically the first
[21:21] <MorkBork> but im making a paste
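For comparing the two kernels, a small script can do the syslog grepping and also record which error came first, as penguin42 suggests. This is only a sketch: the regular expression is based on the two messages quoted above, the "ataN.NN:" prefix is an assumption about how the lines are tagged, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Covers the two messages quoted in the discussion; the port prefix format is assumed.
ATA_ERROR_RE = re.compile(
    r"(ata\d+(?:\.\d+)?): (failed command: [A-Z0-9 /]+|failed to read log page \w+)"
)


def summarize_ata_errors(lines):
    """Count each (port, message) pair and remember the first line number it appeared on."""
    counts = Counter()
    first_seen = {}
    for lineno, line in enumerate(lines, 1):
        match = ATA_ERROR_RE.search(line)
        if not match:
            continue
        key = (match.group(1), match.group(2).strip())
        counts[key] += 1
        first_seen.setdefault(key, lineno)
    return counts, first_seen
```

Run over the log from a boot of each kernel (for example by passing an open file object for the syslog), the counts give a like-for-like comparison between 2.6.32-24.43 and 2.6.32-25.44, and first_seen shows whether the log-page failure really precedes the failed commands.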
[21:30] <MorkBork> heres the second time it happened
[21:30] <MorkBork> http://pastebin.com/Pd89dHeh
[21:30] <MorkBork> the first time it happened it was a lot messier
[21:32] <MorkBork> id say that was about 60% into the array check, and it ended up being completed successfully, the disk didnt get booted, etc
[21:35] <MorkBork> here was the third time it happened
[21:35] <MorkBork> http://pastebin.com/7mRCVv4c
[21:35] <MorkBork> probably 20% into a array check
[21:36] <penguin42> MorkBork: Not happy is it
[21:36] <MorkBork> i dont see the "failed to read log page" message that time
[21:36] <MorkBork> well now that i booted into the 24.43 kernel i havent been able to reproduce it
[21:41] <MorkBork> here was the first time it happened
[21:41] <MorkBork> (triggered by the cron job that checks arrays sunday morning)
[21:41] <MorkBork> http://pastebin.com/N5mgVUxa
[21:41] <MorkBork> it was messy
[21:42] <MorkBork> even ended up logging a bunch of ATA errors on the drive itself (with SMART)
[21:44] <MorkBork> i googled a bunch and it seems like these errors are similar to the ones people got when there was a kernel bug in an nvidia sata driver
[21:45] <MorkBork> this is an amd controller in ahci mode
[21:47] <MorkBork> im gonna keep trying to reproduce it in 24.43 but its running strong
[21:55] <MorkBork> each one of those pastes is after a restart too