/srv/irclogs.ubuntu.com/2014/07/30/#ubuntu-kernel.txt

kamal	dannf, remind me about this tomorrow and I'll help poke it if you like	00:22
=== dupondje_ is now known as dupondje
=== ming is now known as Guest28776
=== clopez_ is now known as clopez
=== psivaa is now known as psivaa-lunch
=== psivaa-lunch is now known as psivaa
Joe_CoT	smb, were you able to replicate #1349883 ?	14:12
apw	bug #1349883	14:13
ubot5	bug 1349883 in linux (Ubuntu) "dmesg time wildly incorrect on paravirtual EC2 instances." [Medium,Confirmed] https://launchpad.net/bugs/1349883	14:13
smb	Joe_CoT, No, not locally and not on the few ec2 m1.small I started up. One of those ended up on an AMD, the other on some Intel but even with a few reboots the times in dmesg were ok	14:13
Joe_CoT	ok. would it be helpful if I reproduced it on a clean instance and gave you ssh access to it?	14:15
smb	Hm, maybe... give me a sec (or a few minutes) Thenn I try to see whether I can do what I am thinking of on a local instance	14:18
Joe_CoT	ok	14:21
smb	hrm, could be some minutes more... only getting around 300K/s for the ddebs...	14:30
smb	Joe_CoT, So I am not really sure it would help. Though I can have peeks at the vpcu time info, I suspect it would only show a high system time (which we already know) and the general progress of time. From what I get from the data we have, this is a one time badness only. Meaning from that point at boot when it jumps, it increments normally	15:13
Joe_CoT	yes	15:14
Joe_CoT	but once it boots once badly, it consistently reboots badly	15:14
Joe_CoT	again, I think it's the specific hardware it runs on.	15:14
Joe_CoT	if I give you a bad one, when you reboot it it'll still be bad	15:14
smb	That could be one additional part of circumstances, yes	15:15
Joe_CoT	I'm busy deploying a new code release, but after that I'll see if I can give you a clean instance where it's happening	15:15
smb	Ok, yeah. I need a bit more time to get my head around the code anyway.	15:16
slangasek	jsalisbury: so I'm having a terrible time with my KVM host box all of a sudden becoming very unstable, and kernel panicking left and right. I'm concerned that it might be hardware flakiness, because the panics don't seem to all be in the same place. Do you have any advice for me before I start hand-transcribing kernel oopses into LP?	16:57
jsalisbury	slangasek, I guess the first thing to check, did anything change? Did you apply any updates or upgrade the kernel?	17:02
jsalisbury	slangasek, It might be good to try an older kernel, that was known to work and see if the panic persists or goes away, that might indicate if a software change caused this.	17:02
slangasek	jsalisbury: I previously had 3.11.0-26 installed, which is when I first saw a crash; then I upgraded the host to trusty in response to this, and am now running 3.13.0-32 which is definitely crashy	17:03
slangasek	jsalisbury: prior to that I was running on the saucy kernel for a long long time	17:03
jsalisbury	Can you test a Saucy kernel to see if the panic stops? I can post a link to one, if you don't have one installed already.	17:04
jsalisbury	slangasek, If the panic persists, then it might indicate hardware. If it goes away, we should did deeper.	17:05
jsalisbury	s/did/dig/	17:05
kirkland	howdy! the 3.13.0-32-generic update is rendering my Intel NUCs unbootable	17:06
jsalisbury	kirkland, uh oh. Do you get a panic, or blank screen, or something else?	17:07
kirkland	jsalisbury: hung on the ubuntu splash screen; never makes it into the desktop	17:08
kirkland	jsalisbury: note that this only happens on a cold boot (warm reboots seem to work fine)	17:08
kirkland	jsalisbury: and downgrading back to the -24-generic kernel also seems to work fine	17:08
slangasek	jsalisbury: I still have the saucy kernel to hand, I'll reboot and see	17:08
jsalisbury	kirkland, can you turn on the screen to get further debugging? I think you just need to remove quiet and splash from GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub	17:09
jsalisbury	kirkland, and see if anything is printed.	17:10
kirkland	jsalisbury: done; powering down then booting again	17:10
jsalisbury	kirkland, just don't forget to run update-grub too	17:11
kirkland	jsalisbury: okay, I have call traces now	17:14
kirkland	jsalisbury: pictures from my phone	17:14
kirkland	jsalisbury: looks like its http://support.sundtek.com/index.php/topic,1600.0.html	17:14
jsalisbury	kirkland, cool, that should help.	17:14
jsalisbury	kirkland, looks like the bug is upstream as well. I'm reading through the forum posts and upstream bug report.	17:17
jsalisbury	kirkland, We probably pulled in the offending commit as part or upstream updates.	17:18
kirkland	jsalisbury: okay -- is there a fix upstream yet for the offending commit?	17:18
jsalisbury	kirkland, I see a test patch in the upstream bug report. I guess it might be worth while to test the latest mainline kernel, in case it's being worked from someone else too. It can be downloaded from:	17:20
jsalisbury	http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc7-utopic/	17:20
jsalisbury	kirkland, The -24-generic kernel was based off of upstream 3.13.9, so we can also bisect from that version on, if mainline still exhibits the bug.	17:25
kirkland	jsalisbury: seems I do not trigger the bug, if I unplug my usb keyboard when booting	17:28
jsalisbury	kirkland, It appears to be xhci related and there were a few changes I can look at closer.	17:28
jsalisbury	kirkland, I'll build a test kernel with some of the xhci changes reverted to see if we can narrow down the cause and not have to bisect. I'll post a link in a bit.	17:33
jsalisbury	kirkland, just to confirm, the mainline kernel still has the bug, if you leave the keyboard connected while booting?	17:34
slangasek	jsalisbury: so, the saucy kernel just crashed for me again. Interestingly, while on trusty the crashes seem to be more varied, this saucy crash is the same or similar to the last saucy crash I saw, which is ksm-related	17:36
jsalisbury	slangasek, Hmm, and you used a kernel that never paniced before? It might be worth while to run a memory test real quick on the machine. Any type of HW related errors in the syslog?	17:41
slangasek	nothing hw-y in the syslog, no; and it never panicked before earlier this month, around the time the weather turned hot ;P	17:41
jsalisbury	slangasek, heh, blame the weather	17:43
slangasek	jsalisbury: the problem has persisted when the weather got better :)	17:44
jsalisbury	slangasek, speaking of weather, does the syslog have any mention of temp readings?	17:44
jsalisbury	slangasek, Just shut the window, lol	17:44
slangasek	wat	17:44
slangasek	shutting the window is what makes the room hot! :)	17:44
jsalisbury	slangasek, whoops, figured keeping the rain off might help :-)	17:45
slangasek	jsalisbury: what would I search for to find temp readings? 'temp' doesn't find anything useful	17:45
jsalisbury	slangasek, you would find something like this: " Core temperature above threshold"	17:46
jsalisbury	or "Core temperature/speed normal"	17:46
slangasek	jsalisbury: nada	17:46
jsalisbury	slangasek, maybe the logs rolled and the temps been ok. Maybe grep the /sys/log directory	17:46
slangasek	jsalisbury: did that already	17:47
TJ-	slangasek: I've found lm-sensors and the 'sensors' util good for monitoring, or simply "cat /sys/class/thermal/thermal_zone*/temp"	17:49
jsalisbury	slangasek, Hard to know if it is HW releated if there are no errors. Can you post a picture of the crashes	17:49
slangasek	jsalisbury: the fact that the backtraces are varied makes that more difficult. If I get the same ksm crash on saucy again, perhaps I'll take a picture - but saucy is EOL so that hardly helps for getting it fixed	17:51
slangasek	and in trusty the backtraces are all over the map	17:51
kirkland	jsalisbury: do we have a Launchpad bug for this yet?	17:57
kirkland	jsalisbury: do you want me to file one?	17:57
jsalisbury	kirkland, That would be great	18:00
slangasek	jsalisbury: memory test> I don't think we have a memtest that works on UEFI-only systems, do we	18:00
jsalisbury	slangasek, hmm, I don't know off hand. Can you not selecet it from the GRUB menu?	18:01
slangasek	nope	18:01
slangasek	memtest86+ has grub code that specifically says it doesn't work on UEFI	18:01
jsalisbury	slangasek, Ahh, ok.	18:02
slangasek	/etc/grub.d/20_memtest86+:	18:02
slangasek	# We need 16-bit boot, which isn't available on EFI.	18:02
slangasek	if [ -d /sys/firmware/efi ]; then	18:02
slangasek	exit 0	18:02
slangasek	fi	18:02
jsalisbury	slangasek, It makes it difficult that the backtraces are not consistent. Do you have to put some load on the machine, or does just booting and waiting trigger it?	18:03
mjg59	memtest is difficult with UEFI	18:03
mjg59	Because the firmware is using so much more of the memory	18:04
slangasek	jsalisbury: the "load" appears to be nothing more than having VMs configured to start at boot	18:04
mjg59	You'd probably need something that did ExitBootServices() and then ignored the runtime regions	18:04
mjg59	And hoped that there was no periodic SMM code using them	18:04
* slangasek nods to mjg59		18:04
slangasek	and writing that is out of scope for my current problem ;)	18:05
jsalisbury	slangasek, to see if it is HW releated, can you try to boot a LiveCD and see if you can get a crash again. That would rule out anything configuration wise.	18:06
slangasek	jsalisbury: not conveniently, but I'll put that on my list of things to try	18:07
jsalisbury	slangasek, ok	18:08
jsalisbury	kirkland, Can you try the test kernel here: http://kernel.ubuntu.com/~jsalisbury/kirkland/	18:09
kirkland	jsalisbury: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350480	18:09
ubot5	Launchpad bug 1350480 in linux (Ubuntu) "[REGRESSION] Kernel update renders Intel NUC (i5-3427) unbootable with USB devices plugged in" [Undecided,New]	18:09
jsalisbury	kirkland, That has four out of the five xhci commits added after 3.13.9 reverted.	18:10
jsalisbury	kirkland, The fifth didn't revert without a backport, so I'll revert that one if the bug still exists with v1 of this test kernel.	18:12
kirkland	jsalisbury: installing now	18:14
jsalisbury	kirkland, this one needs the linux-image-extra package as well as linux-image	18:15
kirkland	jsalisbury: k	18:15
kirkland	jsalisbury: I'm just trying to apport-collect before I reboot	18:15
jsalisbury	kirkland, ack	18:16
kirkland	jsalisbury: rebooting	18:16
kirkland	jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~DustinTestKernelv1 SMP Wed Jul 30 17:40:39 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux	18:17
kirkland	jsalisbury: booted just fine ;-)	18:17
jsalisbury	kirkland, Hmm, cool. So that means with was one of these four commits:	18:19
jsalisbury	7ba40e8 xhci: delete endpoints from bandwidth list before freeing whole device	18:19
jsalisbury	8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb	18:19
jsalisbury	00bd7b9 xhci: Switch Intel Lynx Point ports to EHCI on shutdown.	18:19
jsalisbury	c349a2e xhci: Prevent runtime pm from autosuspending during initialization	18:19
jsalisbury	kirkland, I'll try to revert only two of them and see if we can narrow it down.	18:20
Joe_CoT	smb, I was able to reproduce it right away. Can you toss me your public ssh key, and preferably your IP, and I'll give you access?	18:22
Joe_CoT	For reference, it was m1.small, in us-east-1c, off the latest Ubuntu 14.04 PV SSD AMI, straight off	18:23
smb	Joe_CoT, You should be able to use "ssh-import-id smb" for that	18:23
Joe_CoT	nifty	18:25
smb	So in theory I should be able to get there if you pm me the external host name	18:26
Joe_CoT	yeah, give me a minute and I'll have you set up	18:26
Joe_CoT	smb, all set. PMed. You have full reign of that server. Just lmk when you're done with it	18:32
smb	ack	18:33
Joe_CoT	smb, that server has ssh allowed in, http and https allowed out. LMK if you need more than that. You can reboot it as many times as you want, but please don't shut it down, as the problem may go away if they switch hardware on me.	18:37
smb	Joe_CoT, Ok, sure.	18:38
jsalisbury	kirkland, the second test kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/	18:51
jsalisbury	kirkland, if my guess is right, this kernel should still have the bug.	18:51
jsalisbury	kirkland, I think the bug is going to be caused by:	18:52
jsalisbury	8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb	18:52
jsalisbury	kirkland, I'll start building a third test kernel with just commit 8172925 reverted.	18:52
kirkland	jsalisbury: testing that 2nd one now	19:06
jsalisbury	kirkland, ack	19:09
kirkland	jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480v2 SMP Wed Jul 30 18:25:09 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux	19:09
kirkland	jsalisbury: hang on	19:09
kirkland	jsalisbury: let me do a hard power down, and up	19:10
jsalisbury	kirkland, ok. I expect it may have the bug	19:10
kirkland	jsalisbury: yeah, sorry, that previous warm boot worked fine; it's cold boots that suffer	19:10
jsalisbury	kirkland, good :-)	19:11
jsalisbury	kirkland, A 3rd kernel is building now with only commit 8172925 reverted. I expect it will be the fix	19:11
kirkland	jsalisbury: boom	19:11
kirkland	jsalisbury: explosions in the sky	19:11
jsalisbury	kirkland, I'm already crafting an email to the patch author. In the mean time, we can request a revert upstream and in Ubuntu, once we verify it's that commit.	19:12
jsalisbury	kirkland, build should be done in a few minutes	19:13
kirkland	jsalisbury: okay, I'm ready and waiting :-)	19:14
jsalisbury	kirkland, The kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/	19:22
jsalisbury	kirkland, If it fixes the bug, I'll fire off some email to have it reverted	19:23
kirkland	jsalisbury: k	19:23
kirkland	jsalisbury: v3 boots fine (cold + warm boot)	19:29
kirkland	jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480OnlyCommit8172925Reverted SMP Wed Jul 30 18:56:31 U x86_64 x86_64 x86_64 GNU/Linux	19:29
jsalisbury	kirkland, great. I'll send email upstram and cc you. I'll also request that this gets reverted in the Ubuntu kernel	19:30
kirkland	jsalisbury: fantastic, thanks for this :-)	19:34
jsalisbury	kirkland, np	19:34
smb	Joe_CoT, So I would be off the machine for today. I extracted some data. Interesting that most time info actually is correct. Like uptime and even the formatted time in /var/log/syslog. Just the seconds in the timestamp are off. Also interesting that while the timestamp in dmesg on my local system is correct, the vcpu_info time actually is the uptime of the host. Apparently that gets corrected by the delta (of tsc	19:36
smb	timestamp in the same struct and result of readtsc)	19:36
smb	While on that host it seems to be the same	19:36
smb	So looks a bit like the correction delta goes wrong and you end up having the hosts uptime in dmesg	19:36
Joe_CoT	the formatted time in syslog is correct, when using rsyslog. When using syslog-ng, it ends up with logs in october	19:37
Joe_CoT	because it's using the time from kmsg	19:37
Joe_CoT	but when I reported I brought it down to the core problem, which is the dmesg timestamp	19:37
Joe_CoT	If you're done for the day I'll turn it off then. If you need it again lmk. I'll try to be in the channel when I'm in the office, if I disappear give me a poke on the bug or by email.	19:39
smb	Probably syslog-ng does its own calculation in some way. Just some additional fact. I mean the main issue is as you pointed out, the dmesg timestamp being wrong	19:39
Joe_CoT	I think that syslog-ng does: current time - uptime + dmesg time. Which is why it's getting dates in the future	19:40
smb	Sounds reasonable. Basically offset by the hosts uptime	19:40
smb	and not the guests	19:41
Joe_CoT	And if you do dmesg -T you'll see the same then	19:41
Joe_CoT	yeah	19:41
Joe_CoT	/proc/uptime ends up with the guest uptime, but /dev/kmsg is using timestamps based on the host uptime	19:42
smb	Since the field containing the system time is the uptime of the host in both cases, there must be somethign wrong with how the delta is used. But I won't have enough brain for that to figure out right now.	19:43
smb	Oh when I say host I meant the host running the guest	19:43
smb	And we probably are on the same page and I am just too dumb by now. :)	19:44
Joe_CoT	lol, ok	19:44
Joe_CoT	well if you need back on the server later, lmk. I'm just hoping to have this fixed before I've got PCI auditors rolling through. Don't really want to explain how our servers went back to the future :D	19:45
smb	Heh ok, will do. You may always claim they run ahead of time to spot problems sooner. :-P	19:49
Joe_CoT	For now I just changed syslog-ng to ignore the time from the kmsg logs. It means that on server startup the entire startup shows up in the logs as the same second. That's better than October.	19:51
smb	True... unless it is October, though then it would be likely next year in the logs...	19:53
smb	Btw, I write up some summary and will post it to the bug report	19:53
bjf	slangasek, bug 1350035 is preventing any automated testing of Utopic by either CI or the kernel team	21:59
ubot5	bug 1350035 in debian-installer (Ubuntu) "Debootstrap warning; Warning: Failure while configuring required packages." [Undecided,Confirmed] https://launchpad.net/bugs/1350035	21:59
xnox	slangasek: bjf: i believe dropping essential:yes from init metapackage would fix it. Which i have done 10h ago, but it's stuck in utopic-proposed because of the alpha2 freeze block. And debootstrap doesn't know how to use overlay repositories, so we need to wait till tomorrow for block to drop, init to migrate and try again.	23:06
xnox	updated bug report.	23:07
bjf	xnox, many thanks for that update	23:14

Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!