/srv/irclogs.ubuntu.com/2010/10/04/#ubuntu-kernel.txt

=== yfoel is now known as yofel
* apw yawns		08:42
lucent	hi...	08:42
lucent	I'm hoping for an angel, someone who can take 10 minutes and walk me through reproducing and reporting a kernel bug	08:42
lucent	totally confused here what layer of the kernel is the bug	08:42
apw	lucent, well we don't categorise bugs differently mostly based on that, if its a kernel bug its against linux	08:43
lucent	IEEE1394B adapter (express card) which used to be reliable, is now not reliable with the 10.10 kernel	08:43
lucent	ah okay	08:43
apw	thats firewaire right?	08:44
apw	wire even	08:44
lucent	where to start though? my symptom was last week, I plugged in a firewire drive, it automounted, decided to fsck, and fsck killed the filesystem 'cause of data errors somewhere... in the darkness	08:44
lucent	yep	08:44
lucent	at the moment there's no data on the drive to even really test it, I don't have the same circumstances	08:45
apw	well we switched firewire stacks in 10.10, so you ought to be able to switch back for testing	08:45
lucent	ah. okay that's one thing then, good to hear	08:46
apw	i would put something on you don't care about and can test easily, like a copy of /bin or something	08:46
apw	but do file a bug with ubuntu-bug linux regardless	08:46
lucent	is there a favourite storage device test you know of?	08:46
lucent	copying files and such, or some tool more specifically?	08:47
apw	there is an fsstress test, now what is it called	08:48
apw	bonnie i think is the one	08:48
apw	cking, what does one recommend for disk hammering	08:48
jk-	dbench ?	08:48
cking	apw, bonnie++ is my tool of choice	08:48
apw	yep that would be a good one too	08:49
apw	yeah i think bonnie++ is the one i would start with	08:49
apw	lucent, we need to get this reproducible so we can confirm/exclude the stack switch	08:50
cking	bonnie++ is thorough, you may need to fiddle with the settings as the defaults may not be 100% applicable to your use case	08:51
lucent	catching up here, got distracted a minute	08:52
lucent	I'm motivated to do what I can to reproduce and report, with a little hand holding along the way (thank you!)	08:52
bryceh	apw, do we have much in the way of kernel freeze debugging toolage for radeon like we do for intel?	08:53
bryceh	(hi btw)	08:53
apw	bryceh, hi ya ... hrm not that i am aware of, i have to admit to being rather dependent on RAOF in these matters	08:57
* apw suspects its pretty late where bryceh is		08:58
bryceh	feh, only 1am	08:58
lucent	cking: I'm willing to learn a new command, though is bonnie++ capable of destructive (write/read) data testing? I am concerned that since the drive is mostly zeroes now, it will impact the efficacy of a test	09:00
bryceh	apw, yeah I know from the X side we don't have tools for radeon. was hoping maybe there was something general purpose on the kernel side, but perhaps not	09:00
cking	lucent, bonnie++ is capable of write/read testing, not sure how that impacts on your target H/W - it depends on the filesystem, how full it is, etc	09:06
lucent	cking: brilliant, will go forward to learn about it	09:07
apw	bryceh, nothing i know of no, but i can't claim i've looked very close for radeon	09:09
diwic	smb, ping	09:33
smb	diwic, Hey David	09:34
diwic	smb, I'm a little unsure of what to do with those verification-failed sru:s you asked me to look at	09:34
apw	smb, morning	09:35
smb	Well those all looked to be similar in the problem. A quirk which seemed to have no effect	09:35
diwic	smb, so I'm quite sure of what the problem is	09:35
smb	diwic, Whenever you can prove (with a test kernel) that a fix works now, then those could be resubmitted to stable	09:36
smb	apw, Morning	09:36
diwic	smb, I'm just not sure on which machines I should enable a fix	09:36
smb	diwic, I would suggest to look exactly which hw is not working in the bug reports. Then provide them test kernels that should fix their machines and ask for feedback	09:37
diwic	smb, well, we have a test kernel (dkms) that disables via_dmapos_patch for all devices which works	09:37
smb	The problem, at least with the last patches seemed to have been that nobody tested the actual quirk	09:38
diwic	smb, the question is if this fix should apply to a specific controller, a controller revision or just those particular machines	09:38
diwic	smb, /probably/ it is one or a few controller revisions we should check for, but if I'm wrong, we might screw up now working machines	09:39
smb	I probably don't know the area of hw so well, but I thought that even the same chip/controller could work or not work depending on the wiring or implementation of the machine, is this right?	09:39
diwic	smb, that's usually the case when it comes to the hda codec, but this is likely a bug in the controller (southbridge)	09:40
smb	Hm, so the feeling would be to quirk the controller and hope for the best. Hm, if the quirk would be wrong, what would be the (worst) impact. Lower performance/distorted or no sound?	09:41
diwic	smb, worst case would be no sound at all.	09:41
smb	Gah, not really something one wants to put into stable. On the other hand I can understand one would not want to quirk all the broken machines individually...	09:43
smb	I guess it is too much optimism to hope there would be a way to do a runtime check on brokenness	09:44
diwic	heh, the reason why we need these quirks in the first place is because the runtime brokenness check isn't working well enough...	09:46
diwic	when is the 2.6.37 window going to open?	09:47
smb	Doh! Ok, so for stable it feels like rather quirking the machines and for upstream the run-time check should get fixed (if that is possible)	09:48
bryceh	smb, up late?	09:49
smb	diwic, Depends when .36 gets out... Haven't looked today. apw what was your estimate	09:49
smb	bryceh, No reasonable time. My clock says 10:50am	09:49
smb	bryceh, Maybe you meant diwic	09:50
* diwic is in the same time zone as smb		09:50
bryceh	possibly nick collision	09:50
diwic	Should those fixes be sent to greg as well even if there is no "upstream counterpart"?	09:50
lucent	diwic: it appears bonnie++ depends on the filesystem layer; know of a tool that allows me to test a block storage device without putting a filesystem on it?	09:51
smb	diwic, I would do so with explaining why.	09:51
diwic	lucent, are you sure you asked the right person?	09:52
lucent	destructive tests are fine, preferred even so this can be repeated without special requirement	09:52
RAOF	bryceh: There's nothing like intel's debuggage, although it wouldn't be impossible to add - radeon already has a bunch of hangcheck features.	09:52
lucent	oh	09:52
smb	diwic, Or do you think upstream might go with quirking the single machines until a better solution is found	09:52
lucent	diwic: my mistake name-tab failure here	09:52
lucent	cking: bonnie++ requires a filesystem, I think? any tool you know that can work without the requirement for a filesystem?	09:52
diwic	smb, so quirk individual machines for stable and try to get entire controllers into 2.6.37?	09:53
smb	diwic, I would probably discuss this with upstream. If quirking the entire controller is felt as not dangerous it might be acceptable for stable, too. Or they want to fix the detection and that might be more complicated. Then quirking the individual devices would be better for stable	09:56
bryceh	RAOF, bug #649141 is what I've been chasing. I'm looking for what the next level of debugging would be for this.	09:57
ubot2	Launchpad bug 649141 in linux (Ubuntu Maverick) (and 1 other project) "BUG: unable to handle kernel paging request - EIP: [<f959ae41>] snd_ctl_poll (Followed by system lockup) (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/649141	09:57
cking	lucent, did make a tool called iobenchmark which I put in my ppa (it's under the karmic series). that does testing on a raw device.	10:05
lucent	cking: looking for that now, thank you	10:06
cking	run it with -R (read) or -W (write) and specify the raw device with -f, e.g. -f /dev/sdX - it will destroy any existing data on the device	10:09
lucent	is it aware of the capacity of the underlying device?	10:10
cking	it's fairly hacky code, so if it breaks, you keep the pieces.	10:10
diwic	bryceh, were you listening to music through HDMI or the HDA-realtek?	10:11
lucent	i.e. any reason not to run it on a 1TB drive (knowing that it's a destructive test, yes)	10:11
cking	lucent, I've never tested it on anything so large. It may take a few hours as it repeats the read/writes many times	10:11
lucent	good to know, thanks	10:12
cking	lucent, what do you want to test?	10:12
bryceh	diwic, yes through snd_hda_codec_realtek, not hdmi	10:13
lucent	cking: I need to reproduce a bug somewhere between the filesystem layer -> express PCI card -> firewire 1394b OHCI -> random gamma rays	10:13
lucent	I suspect it's firewire related but I want a process to reproduce and not just this willy nilly data loss I've been running into on one of two firewire adapters since an upgrade to Ubuntu 10.10 and the new kernel	10:13
lucent	need to trigger this and would rather not rely on a working filesystem because it's going to get hosed anyways if I do trigger the bug	10:15
cking	lucent, so you may or may not need a mix of data bursts, random I/O patterns to reproduce. Best start with something simple like dd'ing to the device and working up more complex tests if that fails to reproduce	10:15
lucent	okay... dd'ing zeroes? /dev/urandom ?	10:15
diwic	bryceh, so I noticed that the first one said snd_ctl_poll+ something. Are the freezes always related to snd_-something, or it is just random stuff?	10:16
lucent	never had to test a block device beyond seeing if just zero'es can be written to it	10:16
cking	dd'ing zeros is faster than reading from /dev/urandom, so try that	10:16
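A plain dd of zeros exercises the write path but cannot show whether the data read back is what was written. A minimal sketch of a write-then-verify pass follows, assuming Python is available and that the target path (a scratch file or a raw device node such as /dev/sdX) may be overwritten. The function names and the 1 MiB chunk size are illustrative choices, not part of any tool mentioned in the channel.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB per write; an arbitrary illustrative size


def pattern_chunk(index, size=CHUNK):
    """Build a deterministic, non-zero block that depends on its chunk index."""
    seed = hashlib.sha256(str(index).encode()).digest()
    reps = size // len(seed) + 1
    return (seed * reps)[:size]


def write_pattern(path, total_bytes):
    """DESTRUCTIVE: overwrite the first total_bytes of an existing path."""
    with open(path, "r+b") as f:
        written = 0
        index = 0
        while written < total_bytes:
            size = min(CHUNK, total_bytes - written)
            f.write(pattern_chunk(index, size))
            written += size
            index += 1
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache to the device


def verify_pattern(path, total_bytes):
    """Return the byte offsets of chunks that do not read back as written."""
    bad = []
    with open(path, "rb") as f:
        offset = 0
        index = 0
        while offset < total_bytes:
            size = min(CHUNK, total_bytes - offset)
            if f.read(size) != pattern_chunk(index, size):
                bad.append(offset)
            offset += size
            index += 1
    return bad
```

Reads issued straight after the writes may be served from the page cache rather than the device, so a verify pass is more meaningful after the device has been unplugged and replugged or the caches dropped.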
lucent	thanks for the advice, I'm relieved to hear suggestions	10:16
lucent	this is making me a little crazy	10:16
bryceh	diwic, I've only seen that one that was snd_* related; the other freezes I've not gotten any info other than what I've specified in the comments	10:16
bryceh	diwic, aside from the snd_* one, the other freezes have been associated with playing a game that uses mesa, so that's why I'm suspecting a mesa/3d/dri causation there	10:18
diwic	bryceh, okay. Let me know if things starting to point more towards the snd area	10:19
=== amitk is now known as amitk-afk
apw	smb, well we are at -rc6 so a couple of weeks likely	10:29
smb	diwic, ^	10:30
apw	he has been averaging -rc6 to -rc8 the last 6 or so releases, so it could be now and it could be 2 weeks hard to be more accurate	10:31
diwic	okay	10:31
* smb has about 10mins till check-in		10:32
smb	Actually no, now!	10:32
apw	good luck	10:32
apw	let me know when you land	10:32
Laney	Are you aware of a regression in M where wireless is incredibly slow when running on battery, but fine on power?	11:03
Laney	(MBP 7.1 here, fwiw)	11:03
apw	Laney, not heard that reported no ... though being a macbook nothing would surprise me	11:03
Laney	some kind of over-zealous power management perhaps? Renders wireless performance pretty unusable.	11:04
Laney	apw: Indeed, anyway it coincided with my maverick upgrade	11:04
* apw uses M based systems generally in that mode a lot, so its not systemic		11:04
Laney	indeed, I suspect it is hardware specific	11:04
Laney	I filed 651008 about it	11:05
apw	yep, macbooks are about the most difficult to get documentation for out there, apple is not keen that other OSs run on their h/w	11:05
apw	bug 651008	11:05
ubot2	Launchpad bug 651008 in linux (Ubuntu) "Regression in wireless performance under Maverick (broadcom) (affects: 2) (heat: 12)" [Undecided,New] https://launchpad.net/bugs/651008	11:05
popey	ahh Laney I didn't realise it only affected you on battery. I have only tested when on mains	11:09
Laney	popey: yeah, just figured that out	11:09
popey	oh :)	11:09
Laney	I realised it was intermittent, but didn't link it to this here wire supplying juice	11:10
* popey subscrib0rs		11:10
=== amitk-afk is now known as amitk
apw	Laney, ok so thats BCM4322, all the kit i have which uses the binary-junk driver is 4312 ... does not show anything like this	11:12
apw	Laney, are you using 11n ?	11:12
Laney	apw: no, b+g	11:12
apw	popey, do you have one of these mac nightmares ?	11:13
Laney	I tried to boot the lucid kernel but it never made it to X unfortunately	11:13
apw	Laney, i am a little surprised at that	11:13
Laney	yeah I just got a vt and couldn't start gdm manually	11:14
Laney	admittedly I didn't poke too hard	11:14
popey	apw: i do have one of those lovely apple laptops, yes	11:16
apw	popey, i feel for you	11:16
popey	meh	11:16
popey	i get that a lot.	11:17
Laney	it actually work(ed in lucid)s quite well	11:17
apw	yeah thats cause lucid has a year old kernel in it ... takes about that long for the kernel to catch up with every random change apple makes	11:17
apw	i am surprised that the 7's are working yet, i still see trouble with the 5's	11:17
apw	not that i have a clue what the difference between a 5 and a 7 is of course	11:18
popey	you may recall the fun we had with the sata bus on this device which means 10.04 can't see the hard disk	11:18
=== diwic is now known as diwic_afk
Laney	oh yeah, 10.04.1 then ;)	11:20
apw	Laney, popey, odder is that the wl driver you are using is bit for bit identical between the two releases	11:21
Laney	apw: Yeah, I'm wondering if it's some power management the kernel is doing	11:22
apw	Laney, so what graphics does this have? as the logical test is running a lucid kernel	11:22
Laney	trying to blat it into low power mode unsuitably	11:22
apw	Laney, that kind of thing is normally in the driver	11:22
Laney	oh ok	11:22
apw	and as its a binary driver we certainly aren't telling it to do anything specific	11:22
Laney	I'm using -nvidia too	11:23
apw	heh talk about putting all the worst bits in the one box ...	11:24
apw	shame its such a pretty box you cannot resist paying for them	11:24
popey	It's a very pretty box :)	11:24
popey	damnit!	11:24
cking	that's the beauty of proprietary kit	11:24
apw	yep ... isn't it just	11:24
apw	still another year and maybe we'll have decent bcm drivers	11:25
Laney	well work purchased it for me, didn't have a choice	11:25
cking	shiny on the outside... closed on the inside	11:25
popey	I purchased mine on advice of a Canonical employee who told me 'everything works'	11:25
popey	I suspect he uses OSX though.	11:25
Laney	speaking of work, must head in now	11:26
Laney	back soon chaps	11:26
apw	popey, well and as apple randomly sub in components, you can only say that about the one you have in your hand not the next one you buy	11:26
* penguin42 has noticed a little quad of 4 kernel oopses that look the same that I think I might half see what's happening - someone want to have a look? Bug 640154 bug 646215 and bug 653591; I've got a bit of a description as the last comment on bug 632430		11:36
ubot2	Launchpad bug 640154 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference in ips_adjust in intel_ips on Sony VPC-B11KGX (affects: 2) (heat: 200)" [Undecided,New] https://launchpad.net/bugs/640154	11:36
ubot2	Launchpad bug 646215 in linux (Ubuntu) "BUG: unable to handle kernel NULL pointer dereference at (null) - ips_adjust in intel_ips (affects: 1) (heat: 6)" [Undecided,New] https://launchpad.net/bugs/646215	11:36
ubot2	Launchpad bug 653591 in linux (Ubuntu) "[18446744058.496026] BUG: unable to handle kernel NULL pointer dereference at (null) ips_adjust in intel_ips (affects: 1) (heat: 8)" [Undecided,New] https://launchpad.net/bugs/653591	11:36
ubot2	Launchpad bug 632430 in linux (Ubuntu) "ips-adjust - BUG: unable to handle kernel NULL pointer dereference at (null) (affects: 3) (heat: 22)" [Undecided,New] https://launchpad.net/bugs/632430	11:36
apw	penguin42, or is ips null	11:40
penguin42	apw: Don't think so, it's used a few lines earlier and generally used all over in that file	11:41
apw	penguin42, i am guessing actually that gpu_lower is null	11:51
apw	that seems to match your mental image too	11:51
penguin42	apw: Yeh, I think what's happening is that it's found to be null, the thing that looks for it sets gpu_turbo_enabled to false so that line isn't called, and then something later - e.g. update_turbo_limits or ips_irq_handler turns it back on	11:52
* penguin42 doesn't have the hardware to find out; I just noticed the 4 similar oopses		11:52
apw	penguin42, do the people on the bugs find it happens always/readily ?	11:52
penguin42	apw: It's not obvious, I think some of them are just where the system told them to report it; one of them commented it happened coming out of hibernation, another got a 'CPU power or thermal limit exceeded' just before it	11:55
apw	we'd expect to see the CPU power thing in some cases, as that would indicate this code is triggered, and possibly a resume also makes sense if it was on before the resume	11:56
apw	and the suspend/hibernate made things cooler	11:57
penguin42	apw: The update_turbo_limits says it's 'Used at init time and for runtime BIOS support, which requires polling the regs for updates (as a result of AC->DC transition for example).' so I wouldn't be surprised if it got kicked during a hibernate	11:57
apw	right	11:58
penguin42	I find it curious it seems to be 3/4 are VAIOs	11:59
apw	did you say one got a message about being disabled on boot	11:59
apw	penguin42, probably common h/w would trigger this, so not so surprising	11:59
penguin42	apw: One of the boot logs had a ' failed to get i915 symbols, graphics turbo disabled' which is the message it prints when it's looking for gpu_lower and finds it's NULL	12:00
apw	yeah	12:01
apw	thats the one, thanks	12:01
=== diwic_afk is now known as diwic
penguin42	on a different question, if I've built the kernel from the ubuntu git using AUTOBUILD=1 NOEXTRAS=1 fakeroot debian/rules binary-generic is there a way just to do a make to rebuild a module or two rather than the whole package if I'm just adding some debugging?	12:05
apw	penguin42, looks like we could hit this if we have polling turned on	12:05
apw	penguin42, yep remove the build stamp in debian/stamps	12:05
apw	and then rebuild it as normal (obviously not cleaning it)	12:06
penguin42	apw: Using the debian/rules binary-generic or with a make ?	12:06
apw	yeah d/r b-g	12:06
apw	ok i think i can see how this might trigger and how only some machines would be affected	12:06
apw	will think on how we might avoid it	12:06
penguin42	apw: Could check whether gpu_lower is NULL after all the places that set gpu_turbo_enabled	12:07
apw	well its where we try and enable it, we should check if we managed to get the symbols	12:09
apw	and abort the enable	12:09
penguin42	apw: Yeh, it does that during the initial enable	12:10
apw	yeah but not during a polled enable.	12:10
* apw will try and make it do something sensible ... and we can ask them to test		12:10
penguin42	apw: Yeh, might be right to bounce it off Jesse Barnes ?	12:11
apw	yep will send it to him as well	12:11
penguin42	apw: Maybe just change ips->gpu_turbo_enabled = (ips->gpu_lower!=NULL) && !(hts & HTS_GTD_DIS); ?	12:13
apw	penguin42, thats the kind of thing we want as a minimum for sure	12:16
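The failure mode discussed above can be modelled in a few lines: the initial enable notices that gpu_lower is missing and turns GPU turbo off, a later polled update turns the flag back on from the register value alone, and the adjust step then calls through the missing pointer. The sketch below is a simplified Python model of that sequence and of the guard penguin42 suggests; it is not the driver code. The class, the function names and the bit position used for HTS_GTD_DIS are all stand-ins assumed for illustration.

```python
HTS_GTD_DIS = 1 << 8  # assumed bit position, for illustration only


class Ips:
    """Toy stand-in for the driver state discussed above."""

    def __init__(self, gpu_lower=None):
        # gpu_lower models the i915 symbol; None models the NULL pointer case
        self.gpu_lower = gpu_lower
        self.gpu_turbo_enabled = False


def update_turbo_limits_unguarded(ips, hts):
    # Re-enables GPU turbo purely from the register bit, as in the polled path
    ips.gpu_turbo_enabled = not (hts & HTS_GTD_DIS)


def update_turbo_limits_guarded(ips, hts):
    # Suggested minimum guard: never enable when the symbol is missing
    ips.gpu_turbo_enabled = ips.gpu_lower is not None and not (hts & HTS_GTD_DIS)


def adjust(ips):
    """Call gpu_lower when turbo is enabled; report whether it was called."""
    if ips.gpu_turbo_enabled:
        ips.gpu_lower()  # raises TypeError when None, the analogue of the oops
        return True
    return False
```

With gpu_lower left as None, the unguarded update followed by adjust raises TypeError, while the guarded update leaves turbo disabled and adjust returns False.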
penguin42	hmm breakfast I think	12:16
=== xfaf is now known as zul
=== fddfoo is now known as fdd
pmatulis	has anyone else heard of a problem with the last kernel update on lucid that prevents booting with lvm setups (seeing a simple '/ over lvm' case here)?	14:53
=== ivoks-afk is now known as ivoks
apw	pmatulis, not heard anything like that no	15:24
=== bjf[afk] is now known as bjf
=== jjohansen is now known as jj-afk
Haegin	hi, who is the best person to talk to to request a patch being added to the kernel or does that have to happen upstream? It's for adding driver support for a usb remote.	16:40
apw	Haegin, normally you would suggest it on kernel-team list	16:51
apw	https://wiki.ubuntu.com/Kernel/FAQ#Can I get a patch included in the Ubuntu Kernel? / How can I submit a patch to the Ubuntu Kernel?	16:51
Haegin	apw: ok, thanks	16:52
JFo	A lalalala OOh	16:54
JFo	gah!	16:54
JFo	where did that come from?	16:54
=== ivoks is now known as ivoks-afk
=== ivoks-afk is now known as ivoks
penguin42	apw: Looks like that ips stuff has done the job for people	19:43
=== sconklin1 is now known as sconklin
apw	penguin42, thanks	20:13
penguin42	apw: What made you realise to use the _busy flag?	20:14
apw	penguin42, actually thats not a flag, its the primary routine the outer loop uses	20:15
penguin42	ah ok	20:15
apw	penguin42, but great thanks for following up, i'll get that submitted where it needs to be tomorrow	20:17
penguin42	apw: No problem; I just started realising I'd seen a few with similar oops; it would be kind of nice if launchpad could group oopses based on the backtrace	20:18
apw	penguin42, thanks for that, its great when people help out this way	20:18
apw	penguin42, but yes its on our 'launchpad can you help with this' list	20:18
MorkBork	i think there may be a bug involving ahci or ata in the latest karmic kernel	20:39
MorkBork	(2.6.32-25.44)	20:40
MorkBork	so far i havent been able to reproduce it with 2.6.32-24.43	20:40
penguin42	apw: You know that lshw bug from the other week? There are still a bunch of open oopses from it; what's the right thing to do, now it's fixed is it right to merge them?	20:52
MorkBork	dunno why i said karmic	20:59
MorkBork	meant lucid	20:59
MorkBork	><	20:59
bjf	MorkBork, what is the issue that you are seeing?	21:05
MorkBork	seeing a lot of errors in dmesg when i do heavy disk io (checking a raid array for example)	21:15
penguin42	MorkBork: Want to pastebin them?	21:15
MorkBork	typically "failed command: READ FPDMA QUEUED"	21:15
MorkBork	yea	21:16
MorkBork	i was grepping syslog	21:16
MorkBork	i booted 2.6.32-24.43 and havent been able to reproduce it yet	21:16
MorkBork	i reproduced it three times after an hour or so of checking arrays with 2.6.32-25.44	21:16
MorkBork	oh and "failed to read log page 10h (errno=-5)"	21:19
MorkBork	which i see mentioned in one of the patches but i didnt see how it could cause this	21:19
penguin42	MorkBork: I'd look for the first weird errors, once something goes wrong then what comes afterwards can be less meaningful	21:20
MorkBork	the 'failed to read log page' is typically the first	21:20
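Finding which message appears first, as penguin42 suggests, can be done with a short scan instead of an unordered grep. The sketch below is a minimal example, assuming the two message strings quoted above are the ones of interest; the function name and the marker tuple are illustrative, and the log path in the usage note is an assumption about where the messages end up.

```python
MARKERS = (
    "failed command: READ FPDMA QUEUED",
    "failed to read log page 10h",
)


def first_hits(lines, markers=MARKERS):
    """Return (line_number, text) for the first match of each marker, in log order."""
    hits = {}
    for lineno, line in enumerate(lines, start=1):
        for marker in markers:
            if marker in line and marker not in hits:
                hits[marker] = (lineno, line.rstrip("\n"))
    return sorted(hits.values())
```

It could be run as first_hits(open("/var/log/syslog", errors="replace")) and the earliest entry compared across boots of the two kernel versions.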
MorkBork	but im making a paste	21:21
MorkBork	heres the second time it happened	21:30
MorkBork	http://pastebin.com/Pd89dHeh	21:30
MorkBork	the first time it happened it was a lot messier	21:30
MorkBork	id say that was about 60% into the array check, and it ended up being completed successfully, the disk didnt get booted, etc	21:32
MorkBork	here was the third time it happened	21:35
MorkBork	http://pastebin.com/7mRCVv4c	21:35
MorkBork	probably 20% into a array check	21:35
penguin42	MorkBork: Not happy is it	21:36
MorkBork	i dont see the "failed to read log page" message that time	21:36
MorkBork	well now that i booted into the 24.43 kernel i havent been able to reproduce it	21:36
MorkBork	here was the first time it happened	21:41
MorkBork	(triggered by the cron job that checks arrays sunday morning)	21:41
MorkBork	http://pastebin.com/N5mgVUxa	21:41
MorkBork	it was messy	21:41
MorkBork	even ended up logging a bunch of ATA errors on the drive itself (with SMART)	21:42
MorkBork	i googled a bunch and it seems like these errors are similar to the ones people got when there was a kernel bug in an nvidia sata driver	21:44
MorkBork	this is an amd controller in ahci mode	21:45
MorkBork	im gonna keep trying to reproduce it in 24.43 but its running strong	21:47
MorkBork	each one of those pastes is after a restart too	21:55
=== bjf is now known as bjf[afk]
=== ivoks is now known as ivoks-afk
=== yofel_ is now known as yofel
