/srv/irclogs.ubuntu.com/2010/10/04/#ubuntu-kernel.txt

=== yfoel is now known as yofel
* apw yawns08:42
lucenthi...08:42
lucentI'm hoping for an angel, someone who can take 10 minutes and walk me through reproducing and reporting a kernel bug08:42
lucenttotally confused here what layer of the kernel is the bug08:42
apwlucent, well we don't categorise bugs differently mostly based on that, if its a kernel bug its against linux08:43
lucentIEEE1394B adapter (express card) which used to be reliable, is now not reliable with the 10.10 kernel08:43
lucentah okay08:43
apwthats firewaire right?08:44
apwwire even08:44
lucentwhere to start though?  my symptom was last week, I plugged in a firewire drive, it automounted, decided to fsck, and fsck killed the filesystem 'cause of data errors somewhere... in the darkness08:44
lucentyep08:44
lucentat the moment there's no data on the drive to even really test it, I don't have the same circumstances08:45
apwwell we switched firewire stacks in 10.10, so you ought to be able to switch back for testing08:45
lucentah.   okay that's one thing then, good to hear08:46
apwi would put something on you don't care about and can test easily, like a copy of /bin or something08:46
apwbut do file a bug with ubuntu-bug linux regardless08:46
lucentis there a favourite storage device test you know of?08:46
lucentcopying files and such, or some tool more specifically?08:47
apwthere is an fsstress test, now what is it called08:48
apwbonnie i think is the one08:48
apwcking, what does one recommend for disk hammering08:48
jk-dbench ?08:48
ckingapw, bonnie++ is my tool of choice08:48
apwyep that would be a good one too08:49
apwyeah i think bonnie++ is the one i would start with08:49
apwlucent, we need to get this reproducible so we can confirm/exclude the stack switch08:50
ckingbonnie++ is thorough, you may need to fiddle with the settings as the defaults may not be 100% applicable to your use case08:51
lucentcatching up here, got distracted a minute08:52
lucentI'm motivated to do what I can to reproduce and report, with a little hand holding along the way (thank you!)08:52
brycehapw, do we have much in the way of kernel freeze debugging toolage for radeon like we do for intel?08:53
bryceh(hi btw)08:53
apwbryceh, hi ya ... hrm not that i am aware of, i have to admit to being rather dependant on RAOF in these matters08:57
* apw suspects its pretty late where bryceh is08:58
brycehfeh, only 1am08:58
lucentcking: I'm willing to learn a new command, though is bonnie++ capable of destructive (write/read) data testing?  I am concerned that since the drive is mostly zeroes now, it will impact the efficacy of a test09:00
brycehapw, yeah I know from the X side we don't have tools for radeon.  was hoping maybe there was something general purpose on the kernel side, but perhaps not09:00
ckinglucent, bonnie++ is capable of write/read testing, not sure how that impacts on your target H/W - it depends on the filesystem, how full it is, etc09:06
lucentcking: brilliant, will go forward to learn about it09:07
apwbryceh, nothing i know of no, but i can't claim i've looked very close for radeon09:09
diwicsmb, ping09:33
smbdiwic, Hey David09:34
diwicsmb, I'm a little unsure of what to do with those verification-failed sru:s you asked me to look at09:34
apwsmb, morning09:35
smbWell those all looked to be similar in the problem. A quirk which seemed to have no effect09:35
diwicsmb, so I'm quite sure of what the problem is09:35
smbdiwic, Whenever you can prove (with a test kernel) that a fix works now, then those could be resubmitted to stable09:36
smbapw, Morning09:36
diwicsmb, I'm just not sure on which machines I should enable a fix09:36
smbdiwic, I would suggest to look exactly which hw is not working in the bug reports. Then provide them test kernels that should fix their machines and ask for feedback09:37
diwicsmb, well, we have a test kernel (dkms) that disables via_dmapos_patch for all devices which works09:37
smbThe problem, at least with the last patches seemed to have been that nobody tested the actual quirk09:38
diwicsmb, the question is if this fix should apply to a specific controller, a controller revision or just those particular machines09:38
diwicsmb, /probably/ it is one or a few controller revisions we should check for, but if I'm wrong, we might screw up now working machines09:39
smbI probably don't know the area of hw so well, but I thought that even the same chip/controller could work or not work depending on the wiring or implementation of the machine, is this right?09:39
diwicsmb, that's usually the case when it comes to the hda codec, but this is likely a bug in the controller (southbridge)09:40
smbHm, so the feeling would be to quirk the controller and hope for the best. Hm, if the quirk would be wrong, what would be the (worst) impact. Lower performance/distorted or no sound?09:41
diwicsmb, worst case would be no sound at all.09:41
smbGah, not really something one wants to put into stable. On the other hand I can understand one would not want to quirk all the broken machines individually...09:43
smbI guess it is too much optimism to hope there would be a way to do a runtime check on brokenness09:44
diwicheh, the reason why we need these quirks in the first place is because the runtime brokenness check isn't working well enough...09:46
diwicwhen is the 2.6.37 window going to open?09:47
smbDoh! Ok, so for stable it feels like rather quirking the machines and for upstream the run-time check should get fixed (if that is possible)09:48
brycehsmb, up late?09:49
smbdiwic, Depends when .36 gets out... Haven't looked today. apw what was your estimate09:49
smbbryceh, No reasonable time. My clock says 10:50am09:49
smbbryceh, Maybe you meant diwic 09:50
* diwic is in the same time zone as smb 09:50
brycehpossibly nick collision09:50
diwicShould those fixes be sent to greg as well even if there is no "upstream counterpart"?09:50
lucentdiwic: it appears bonnie++ depends on the filesystem layer;  know of a tool that allows me to test a block storage device without putting a filesystem on it?09:51
smbdiwic, I would do so with explaining why. 09:51
diwiclucent, are you sure you asked the right person?09:52
lucentdestructive tests are fine, preferred even so this can be repeated without special requirement09:52
RAOFbryceh: There's nothing like intel's debuggage, although it wouldn't be impossible to add - radeon already has a bunch of hangcheck features.09:52
lucentoh09:52
smbdiwic, Or do you think upstream might go with quirking the single machines until a better solution is found09:52
lucentdiwic: my mistake name-tab failure here09:52
lucentcking: bonnie++ requires a filesystem, I think?   any tool you know that can work without the requirement for a filesystem?09:52
diwicsmb, so quirk individual machines for stable and try to get entire controllers into 2.6.37?09:53
smbdiwic, I would probably discuss this with upstream. If quirking the entire controller is felt as not dangerous it might be acceptable for stable, too. Or they want to fix the detection and that might be more complicated. Then quirking the individual devices would be better for stable09:56
brycehRAOF, bug #649141 is what I've been chasing.  I'm looking for what the next level of debugging would be for this.09:57
ubot2Launchpad bug 649141 in linux (Ubuntu Maverick) (and 1 other project) "BUG: unable to handle kernel paging request - EIP: [<f959ae41>] snd_ctl_poll (Followed by system lockup) (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/64914109:57
ckinglucent, did make a tool called iobenchmark which I put in my ppa (it's under the karmic series). that does testing on a raw device.10:05
lucentcking: looking for that now, thank you10:06
ckingrun it with -R (read) or -W (write) and specify the raw device with -f, e.g. -f /dev/sdX - it will destroy any existing  data on the device 10:09
lucentis it aware of the capacity of the underlying device?10:10
ckingit's fairly hacky code, so if it breaks, you keep the pieces.10:10
diwicbryceh, were you listening to music through HDMI or the HDA-realtek? 10:11
lucenti.e. any reason not to run it on a 1TB drive (knowing that it's a destructive test, yes)10:11
ckinglucent, I've never tested it on anything so large. It may take a few hours as it repeats the read/writes many times10:11
lucentgood to know, thanks10:12
ckinglucent, what do you want to test?10:12
brycehdiwic, yes through snd_hda_codec_realtek, not hdmi10:13
lucentcking: I need to reproduce a bug somewhere between the filesystem layer -> express PCI card -> firewire 1394b OHCI -> random gamma rays10:13
lucentI suspect it's firewire related but I want a process to reproduce and not just this willy nilly data loss I've been running into on one of two firewire adapters since an upgrade to Ubuntu 10.10 and the new kernel10:13
lucentneed to trigger this and would rather not rely on a working filesystem because it's going to get hosed anyways if I do trigger the bug10:15
ckinglucent, so you may or may not need a mix of data bursts, random I/O patterns to reproduce. Best start with something simple like dd'ing to the device and working up more complex tests if that fails to reproduce10:15
lucentokay...  dd'ing zeroes?   /dev/urandom ?10:15
diwicbryceh, so I noticed that the first one said snd_ctl_poll+ something. Are the freezes always related to snd_-something, or is it just random stuff?10:16
lucentnever had to test a block device beyond seeing if just zero'es can be written to it10:16
ckingdd'ing zeros is faster than reading from /dev/urandom, so try that10:16
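A minimal sketch of the "start simple" write-then-read-back test discussed above, written in Python rather than dd so that the data is a reproducible non-zero pattern (lucent's concern about a drive full of zeroes). The function names, the 1 MiB block size and the SHA-256-derived pattern are assumptions made for illustration and are not part of any tool mentioned in the channel. Pointing it at a raw device such as /dev/sdX destroys the data on it, and a read-back performed straight after the write may be served from the page cache, so re-plugging the drive before verifying is probably wise.

```python
import hashlib
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block; an arbitrary choice for this sketch


def pattern_block(index: int, size: int = BLOCK_SIZE) -> bytes:
    """Return a reproducible, non-zero block derived from its index."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    repeats = size // len(seed) + 1
    return (seed * repeats)[:size]


def write_pattern(path: str, blocks: int) -> None:
    """DESTRUCTIVE: overwrite the start of `path` with `blocks` pattern blocks."""
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as dev:
        for i in range(blocks):
            dev.write(pattern_block(i))
        dev.flush()
        os.fsync(dev.fileno())  # push the data out of the page cache to the device


def verify_pattern(path: str, blocks: int) -> list:
    """Read the blocks back and return the indices that do not match."""
    bad = []
    with open(path, "rb") as dev:
        for i in range(blocks):
            if dev.read(BLOCK_SIZE) != pattern_block(i):
                bad.append(i)
    return bad
```

Running write_pattern() and then verify_pattern() with the same block count on the old and the new firewire stack would give a repeatable pass/fail result without needing a filesystem on the drive.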
lucentthanks for the advice, I'm relieved to hear suggestions10:16
lucentthis is making me a little crazy10:16
brycehdiwic, I've only seen that one that was snd_* related; the other freezes I've not gotten any info other than what I've specified in the comments10:16
brycehdiwic, aside from the snd_* one, the other freezes have been associated with playing a game that uses mesa, so that's why I'm suspecting a mesa/3d/dri causation there10:18
diwicbryceh, okay. Let me know if things start to point more towards the snd area10:19
=== amitk is now known as amitk-afk
apwsmb, well we are at -rc6 so a couple of weeks likely10:29
smbdiwic, ^10:30
apwhe has been averaging -rc6 to -rc8 the last 6 or so releases, so it could be now and it could be 2 weeks hard to be more accurate10:31
diwicokay10:31
* smb has about 10mins till check-in10:32
smbActually no, now!10:32
apwgood luck10:32
apwlet me know when you land10:32
LaneyAre you aware of a regression in M where wireless is incredibly slow when running on battery, but fine on power?11:03
Laney(MBP 7.1 here, fwiw)11:03
apwLaney, not heard that reported no ... though being a macbook nothing would surprise me11:03
Laneysome kind of over-zealous power management perhaps? Renders wireless performance pretty unusable.11:04
Laneyapw: Indeed, anyway it coincided with my maverick upgrade11:04
* apw uses M based systems generally in that mode a lot, so its not systemic11:04
Laneyindeed, I suspect it is hardware specific11:04
LaneyI filed 651008 about it11:05
apwyep, macbooks are about the most difficult to get documentation for out there, apple is not keen that other OSs run on their h/w11:05
apwbug 65100811:05
ubot2Launchpad bug 651008 in linux (Ubuntu) "Regression in wireless performance under Maverick (broadcom) (affects: 2) (heat: 12)" [Undecided,New] https://launchpad.net/bugs/65100811:05
popeyahh Laney I didn't realise it only affected you on battery. I have only tested when on mains11:09
Laneypopey: yeah, *just* figured that out11:09
popeyoh :)11:09
LaneyI realised it was intermittent, but didn't link it to this here wire supplying juice11:10
* popey subscrib0rs11:10
=== amitk-afk is now known as amitk
apwLaney, ok so thats BCM4322, all the kit i have which uses the binary-junk driver is 4312 ... does not show anything like this11:12
apwLaney, are you using 11n ?11:12
Laneyapw: no, b+g11:12
apwpopey, do you have one of these mac nightmares ?11:13
LaneyI tried to boot the lucid kernel but it never made it to X unfortunately11:13
apwLaney, i am a little surprised at that11:13
Laneyyeah I just got a vt and couldn't start gdm manually11:14
Laneyadmittedly I didn't poke too hard11:14
popeyapw: i do have one of those lovely apple laptops, yes11:16
apwpopey, i feel for you11:16
popeymeh11:16
popeyi get that a lot. 11:17
Laneyit actually work(ed in lucid)s quite well11:17
apwyeah thats cause lucid has a year old kernel in it ... takes about that long for the kernel to catch up with every random change apple makes11:17
apwi am surprised that the 7's are working yet, i still see trouble with the 5's11:17
apwnot that i have a clue what the difference between a 5 and a 7 are of course11:18
popeyyou may recall the fun we had with the sata bus on this device which means 10.04 can't see the hard disk11:18
=== diwic is now known as diwic_afk
Laneyoh yeah, 10.04.1 then ;)11:20
apwLaney, popey, odder is that the wl driver you are using is bit for bit identical between the two releases11:21
Laneyapw: Yeah, I'm wondering if it's some power management the kernel is doing11:22
apwLaney, so what graphics does this have?  as the logical test is running a lucid kernel11:22
Laneytrying to blat it into low power mode unsuitably11:22
apwLaney, that kind of thing is normally in the driver11:22
Laneyoh ok11:22
apwand as its a binary driver we certainly aren't telling it to do anything specific11:22
LaneyI'm using -nvidia too11:23
apwheh talk about putting all the worst bits in the one box ...11:24
apwshame its such a pretty box you cannot resist paying for them11:24
popeyIt's a very pretty box :)11:24
popeydamnit!11:24
ckingthat's the beauty of proprietary kit11:24
apwyep ... isn't it just11:24
apwstill another year and maybe we'll have decent bcm drivers 11:25
Laneywell work purchased it for me, didn't have a choice11:25
ckingshiny on the outside... closed on the inside11:25
popeyI purchased mine on advice of a Canonical employee who told me 'everything works'11:25
popeyI suspect he uses OSX though.11:25
Laneyspeaking of work, must head in now11:26
Laneyback soon chaps11:26
apwpopey, well and as apple randomly sub in components, you can only say that about the one you have in your hand not the next one you buy11:26
* penguin42 has noticed a little quad of 4 kernel oopses that look the same that I think I might half see what's happening - someone want to have a look? Bug 640154 bug 646215 and bug 653591; I've got a bit of a description as the last comment on bug 63243011:36
ubot2Launchpad bug 640154 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference in ips_adjust in intel_ips on Sony VPC-B11KGX (affects: 2) (heat: 200)" [Undecided,New] https://launchpad.net/bugs/64015411:36
ubot2Launchpad bug 646215 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference at (null) - ips_adjust in intel_ips (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/64621511:36
ubot2Launchpad bug 653591 in linux (Ubuntu) "[18446744058.496026] BUG: unable to handle kernel NULL pointer dereference at (null) ips_adjust in intel_ips (affects: 1) (heat: 8)" [Undecided,New] https://launchpad.net/bugs/65359111:36
ubot2Launchpad bug 632430 in linux (Ubuntu) "ips-adjust - BUG: unable to handle kernel NULL pointer dereference at (null) (affects: 3) (heat: 22)" [Undecided,New] https://launchpad.net/bugs/63243011:36
apwpenguin42, or is ips null11:40
penguin42apw: Don't think so, it's used a few lines earlier and generally used all over in that file11:41
apwpenguin42, i am guessing actually that gpu_lower is null11:51
apwthat seems to match your mental image too11:51
penguin42apw: Yeh, I think what's happening is that it's found to be null, the thing that looks for it sets gpu_turbo_enabled to false so that line isn't called, and then something later - e.g. update_turbo_limits or ips_irq_handler turns it back on11:52
* penguin42 doesn't have the hardware to find out; I just noticed the 4 similar oopses11:52
apwpenguin42, do the people on the bugs find it happens always/readily ?11:52
penguin42apw: It's not obvious, I think some of them are just where the system told them to report it; one of them commented it happened coming out of hibernation, another got a 'CPU power or thermal limit exceeded' just before it11:55
apwwe'd expect to see the CPU power thing in some cases, as that would indicate this code is triggered, and possibly a resume also makes sense if it was on before the resume11:56
apwand the suspend/hibernate made things cooler11:57
penguin42apw: The update_turbo_limits says it's 'Used at init time and for runtime BIOS support, which requires polling the regs for updates (as a result of AC->DC transition for example).' so I wouldn't be surprised if it got kicked during a hibernate11:57
apwright11:58
penguin42I find it curious it seems to be 3/4 are VAIOs11:59
apwdid you say one got a message about being disabled on boot11:59
apwpenguin42, probably common h/w would trigger this, so not so surprising11:59
penguin42apw: One of the boot logs had a ' failed to get i915 symbols, graphics turbo disabled' which is the message it prints when it's looking for gpu_lower and finds it's NULL12:00
apwyeah12:01
apwthats the one, thanks12:01
=== diwic_afk is now known as diwic
penguin42on a different question, if I've built the kernel from the ubuntu git using AUTOBUILD=1 NOEXTRAS=1 fakeroot debian/rules binary-generic   is there a way just to do a make to rebuild a module or two rather than the whole package if I'm just adding some debugging?12:05
apwpenguin42, looks like we could hit this if we have polling turned on12:05
apwpenguin42, yep remove the build stamp in debian/stamps12:05
apwand then rebuild it as normal (obviously not cleaning it)12:06
penguin42apw: Using the debian/rules binary-generic or with a make ?12:06
apwyeah d/r b-g12:06
apwok i think i can see how this might trigger and how only some machines would be affected12:06
apwwill think on how we might avoid it12:06
penguin42apw: Could check whether gpu_lower is NULL after all the places that set gpu_turbo_enabled12:07
apwwell its where we try and enable it, we should check if we managed to get the symbols12:09
apwand abort the enable12:09
penguin42apw: Yeh, it does that during the initial enable12:10
apwyeah but not during a polled enable.12:10
* apw will try and make it do something sensible ... and we can ask them to test12:10
penguin42apw: Yeh, might be right to bounce it off Jesse Barnes ?12:11
apwyep will send it to him as well12:11
penguin42apw: Maybe just change ips->gpu_turbo_enabled = (ips->gpu_lower!=NULL)  && !(hts & HTS_GTD_DIS);  ?12:13
apwpenguin42, thats the kind of thing we want as a minimum for sure12:16
penguin42hmm breakfast I think12:16
=== xfaf is now known as zul
=== fddfoo is now known as fdd
pmatulishas anyone else heard of a problem with the last kernel update on lucid that prevents booting with lvm setups (seeing a simple '/ over lvm' case here)?14:53
=== ivoks-afk is now known as ivoks
apwpmatulis, not heard anything like that no15:24
=== bjf[afk] is now known as bjf
=== jjohansen is now known as jj-afk
Haeginhi, who is the best person to talk to to request a patch being added to the kernel or does that have to happen upstream? It's for adding driver support for a usb remote.16:40
apwHaegin, normally you would suggest it on kernel-team list16:51
apwhttps://wiki.ubuntu.com/Kernel/FAQ#Can I get a patch included in the Ubuntu Kernel? / How can I submit a patch to the Ubuntu Kernel?16:51
Haeginapw: ok, thanks16:52
JFoA lalalala OOh16:54
JFogah! 16:54
JFowhere did that come from?16:54
=== ivoks is now known as ivoks-afk
=== ivoks-afk is now known as ivoks
penguin42apw: Looks like that ips stuff has done the job for people19:43
=== sconklin1 is now known as sconklin
apwpenguin42, thanks20:13
penguin42apw: What made you realise to use the _busy flag?20:14
apwpenguin42, actually thats not a flag, its the primary routine the outer loop uses20:15
penguin42ah ok20:15
apwpenguin42, but great thanks for following up, i'll get that submitted where it needs to be tomorrow20:17
penguin42apw: No problem; I just started realising I'd seen a few with similar oops; it would be kind of nice if launchpad could group oopses based on the backtrace20:18
apwpenguin42, thanks for that, its great when people help out this way20:18
apwpenguin42, but yes its on our 'launchpad can you help with this' list20:18
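The grouping idea penguin42 describes (spotting reports that share a backtrace) could be sketched roughly as below. This is an illustration only and has nothing to do with how Launchpad works: the function names, the regular expression and the choice of comparing the first three frames are all assumptions. It expects backtrace lines in the usual kernel form, e.g. " [<f959ae41>] ? snd_ctl_poll+0x31/0x80 [snd]", and ignores the addresses and offsets so that reports from different machines compare equal.

```python
import re
from collections import defaultdict

# Captures the symbol name from a frame such as
# " [<f959ae41>] ? snd_ctl_poll+0x31/0x80 [snd]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def signature(oops_text: str, depth: int = 3) -> tuple:
    """Return the first `depth` function names of a backtrace, without addresses."""
    return tuple(FRAME_RE.findall(oops_text)[:depth])


def group_reports(reports: dict) -> dict:
    """Map each backtrace signature to the sorted list of report ids sharing it."""
    groups = defaultdict(list)
    for report_id, text in reports.items():
        groups[signature(text)].append(report_id)
    return {sig: sorted(ids) for sig, ids in groups.items()}
```

Feeding it a dict of report id to oops text returns one entry per distinct signature, so several reports that all fault in ips_adjust would land in the same group and could be reviewed together.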
MorkBorki think there may be a bug involving ahci or ata in the latest karmic kernel20:39
MorkBork(2.6.32-25.44)20:40
MorkBorkso far i havent been able to reproduce it with 2.6.32-24.4320:40
penguin42apw: You know that lshw bug from the other week? There are still a bunch of open oopses from it; what's the right thing to do, now it's fixed is it right to merge them?20:52
MorkBorkdunno why i said karmic20:59
MorkBorkmeant lucid20:59
MorkBork><20:59
bjfMorkBork, what is the issue that you are seeing?21:05
MorkBorkseeing a lot of errors in dmesg when i do heavy disk io (checking a raid array for example)21:15
penguin42MorkBork: Want to pastebin them?21:15
MorkBorktypically "failed command: READ FPDMA QUEUED"21:15
MorkBorkyea21:16
MorkBorki was grepping syslog21:16
MorkBorki booted  2.6.32-24.43 and havent been able to reproduce it yet21:16
MorkBorki reproduced it three times after an hour or so of checking arrays with 2.6.32-25.4421:16
MorkBorkoh and "failed to read log page 10h (errno=-5)"21:19
MorkBorkwhich i see mentioned in one of the patches but i didnt see how it could cause this21:19
penguin42MorkBork: I'd look for the first weird errors, once something goes wrong then what comes afterwards can be less meaningful21:20
MorkBorkthe 'failed to read log page' is typically the first21:20
MorkBorkbut im making a paste21:21
MorkBorkheres the second time it happened21:30
MorkBorkhttp://pastebin.com/Pd89dHeh21:30
MorkBorkthe first time it happened it was a lot messier21:30
MorkBorkid say that was about 60% into the array check, and it ended up being completed successfully, the disk didnt get booted, etc21:32
MorkBorkhere was the third time it happened21:35
MorkBorkhttp://pastebin.com/7mRCVv4c21:35
MorkBorkprobably 20% into a array check21:35
penguin42MorkBork: Not happy is it21:36
MorkBorki dont see the "failed to read log page" message that time21:36
MorkBorkwell now that i booted into the 24.43 kernel i havent been able to reproduce it21:36
MorkBorkhere was the first time it happened21:41
MorkBork(triggered by the cron job that checks arrays sunday morning)21:41
MorkBorkhttp://pastebin.com/N5mgVUxa21:41
MorkBorkit was messy21:41
MorkBorkeven ended up logging a bunch of ATA errors on the drive itself (with SMART)21:42
MorkBorki googled a bunch and it seems like these errors are similar to the ones people got when there was a kernel bug in an nvidia sata driver21:44
MorkBorkthis is an amd controller in ahci mode21:45
MorkBorkim gonna keep trying to reproduce it in 24.43 but its running strong21:47
MorkBorkeach one of those pastes is after a restart too21:55
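When comparing the two kernels it may help to count the messages rather than eyeball the grep output, and to find the first one, as penguin42 suggests. The following is a rough sketch under the assumption that the two fragments quoted in the channel are the ones of interest; the helper names are made up for this example, and the lines would come from something like open("/var/log/syslog").

```python
from collections import Counter

# Message fragments quoted in the discussion above.
PATTERNS = (
    "failed command: READ FPDMA QUEUED",
    "failed to read log page 10h",
)


def count_ata_errors(lines) -> Counter:
    """Count how many log lines contain each known error fragment."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


def first_error(lines):
    """Return (line_number, line) for the first matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if any(pattern in line for pattern in PATTERNS):
            return number, line.rstrip("\n")
    return None
```

Running both helpers over the log from a boot of each kernel gives a like-for-like comparison to attach to a bug report.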
=== bjf is now known as bjf[afk]
=== ivoks is now known as ivoks-afk
=== yofel_ is now known as yofel
