/srv/irclogs.ubuntu.com/2014/07/30/#ubuntu-kernel.txt

kamaldannf, remind me about this tomorrow and I'll help poke it if you like00:22
=== dupondje_ is now known as dupondje
=== ming is now known as Guest28776
=== clopez_ is now known as clopez
=== psivaa is now known as psivaa-lunch
=== psivaa-lunch is now known as psivaa
Joe_CoTsmb, were you able to replicate  #1349883 ?14:12
apwbug #134988314:13
ubot5bug 1349883 in linux (Ubuntu) "dmesg time wildly incorrect on paravirtual EC2 instances." [Medium,Confirmed] https://launchpad.net/bugs/134988314:13
smbJoe_CoT, No, not locally and not on the few ec2 m1.small I started up. One of those ended up on an AMD, the other on some Intel but even with a few reboots the times in dmesg were ok14:13
Joe_CoTok. would it be helpful if I reproduced it on a clean instance and gave you ssh access to it?14:15
smbHm, maybe... give me a sec (or a few minutes). Then I'll try to see whether I can do what I am thinking of on a local instance14:18
Joe_CoTok14:21
smbhrm, could be some minutes more... only getting around 300K/s for the ddebs... 14:30
smbJoe_CoT, So I am not really sure it would help. Though I can have peeks at the vcpu time info, I suspect it would only show a high system time (which we already know) and the general progress of time. From what I get from the data we have, this is a one time badness only. Meaning from that point at boot when it jumps, it increments normally15:13
Joe_CoTyes15:14
Joe_CoTbut once it boots once badly, it consistently reboots badly15:14
Joe_CoTagain, I think it's the specific hardware it runs on.15:14
Joe_CoTif I give you a bad one, when you reboot it it'll still be bad15:14
smbThat could be one additional part of circumstances, yes15:15
Joe_CoTI'm busy deploying a new code release, but after that I'll see if I can give you a clean instance where it's happening15:15
smbOk, yeah. I need a bit more time to get my head around the code anyway. 15:16
slangasekjsalisbury: so I'm having a terrible time with my KVM host box all of a sudden becoming very unstable, and kernel panicking left and right.  I'm concerned that it might be hardware flakiness, because the panics don't seem to all be in the same place.  Do you have any advice for me before I start hand-transcribing kernel oopses into LP?16:57
jsalisburyslangasek, I guess the first thing to check, did anything change?  Did you apply any updates or upgrade the kernel?17:02
jsalisburyslangasek, It might be good to try an older kernel, that was known to work and see if the panic persists or goes away, that might indicate if a software change caused this.17:02
slangasekjsalisbury: I previously had 3.11.0-26 installed, which is when I first saw a crash; then I upgraded the host to trusty in response to this, and am now running 3.13.0-32 which is definitely crashy17:03
slangasekjsalisbury: prior to that I was running on the saucy kernel for a long long time17:03
jsalisburyCan you test a Saucy kernel to see if the panic stops?  I can post a link to one, if you don't have one installed already.17:04
jsalisburyslangasek, If the panic persists, then it might indicate hardware.  If it goes away, we should did deeper.17:05
jsalisburys/did/dig/17:05
kirklandhowdy!  the 3.13.0-32-generic update is rendering my Intel NUCs unbootable17:06
jsalisburykirkland, uh oh.  Do you get a panic, or blank screen, or something else?17:07
kirklandjsalisbury: hung on the ubuntu splash screen;  never makes it into the desktop17:08
kirklandjsalisbury: note that this only happens on a cold boot (warm reboots seem to work fine)17:08
kirklandjsalisbury: and downgrading back to the -24-generic kernel also seems to work fine17:08
slangasekjsalisbury: I still have the saucy kernel to hand, I'll reboot and see17:08
jsalisburykirkland, can you turn on the screen to get further debugging?  I think you just need to remove quiet and splash from GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub17:09
jsalisburykirkland, and see if anything is printed.17:10
kirklandjsalisbury: done;  powering down then booting again17:10
jsalisburykirkland, just don't forget to run update-grub too17:11
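The GRUB change jsalisbury describes above can be sketched roughly as follows. This is only an illustration, assuming the usual `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"` form of the line; the helper name `strip_quiet_splash` is hypothetical, and the sketch only transforms text. Writing the result back to /etc/default/grub and running update-grub still has to be done as root.

```python
import re

def strip_quiet_splash(grub_default_text: str) -> str:
    """Return the text with 'quiet' and 'splash' dropped from GRUB_CMDLINE_LINUX_DEFAULT."""
    def _clean(match):
        # Keep every kernel option except the two that hide boot messages.
        kept = [opt for opt in match.group(2).split() if opt not in ("quiet", "splash")]
        return match.group(1) + '"' + " ".join(kept) + '"'

    # Only the GRUB_CMDLINE_LINUX_DEFAULT line is touched; other lines pass through.
    return re.sub(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
        _clean,
        grub_default_text,
        flags=re.MULTILINE,
    )
```

With the options removed, the kernel prints its messages to the console instead of showing the splash screen, which is what made the call traces visible here.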
kirklandjsalisbury: okay, I have call traces now17:14
kirklandjsalisbury: pictures from my phone17:14
kirklandjsalisbury: looks like it's http://support.sundtek.com/index.php/topic,1600.0.html17:14
jsalisburykirkland, cool, that should help.17:14
jsalisburykirkland, looks like the bug is upstream as well.   I'm reading through the forum posts and upstream bug report.  17:17
jsalisburykirkland, We probably pulled in the offending commit as part of upstream updates.17:18
kirklandjsalisbury: okay -- is there a fix upstream yet for the offending commit?17:18
jsalisburykirkland, I see a test patch in the upstream bug report.  I guess it might be worthwhile to test the latest mainline kernel, in case it's being worked on by someone else too.  It can be downloaded from: 17:20
jsalisburyhttp://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc7-utopic/17:20
jsalisburykirkland, The -24-generic kernel was based off of upstream 3.13.9, so we can also bisect from that version on, if mainline still exhibits the bug.17:25
kirklandjsalisbury: seems I do not trigger the bug, if I unplug my usb keyboard when booting17:28
jsalisburykirkland, It appears to be xhci related and there were a few changes I can look at closer.  17:28
jsalisburykirkland, I'll build a test kernel with some of the xhci changes reverted to see if we can narrow down the cause and not have to bisect.  I'll post a link in a bit.17:33
jsalisburykirkland, just to confirm, the mainline kernel still has the bug, if you leave the keyboard connected while booting?17:34
slangasekjsalisbury: so, the saucy kernel just crashed for me again.  Interestingly, while on trusty the crashes seem to be more varied, this saucy crash is the same or similar to the last saucy crash I saw, which is ksm-related17:36
jsalisburyslangasek, Hmm, and you used a kernel that never panicked before?  It might be worthwhile to run a memory test real quick on the machine.  Any type of HW related errors in the syslog?17:41
slangaseknothing hw-y in the syslog, no; and it never panicked before earlier this month, around the time the weather turned hot ;P17:41
jsalisburyslangasek, heh, blame the weather17:43
slangasekjsalisbury: the problem has persisted when the weather got better :)17:44
jsalisburyslangasek, speaking of weather, does the syslog have any mention of temp readings?17:44
jsalisburyslangasek, Just shut the window, lol17:44
slangasekwat17:44
slangasekshutting the window is what makes the room hot! :)17:44
jsalisburyslangasek, whoops, figured keeping the rain off might help :-)17:45
slangasekjsalisbury: what would I search for to find temp readings? 'temp' doesn't find anything useful17:45
jsalisburyslangasek, you would find something like this: " Core temperature above threshold"17:46
jsalisburyor "Core temperature/speed normal"17:46
slangasekjsalisbury: nada17:46
jsalisburyslangasek, maybe the logs rolled and the temps have been ok.  Maybe grep the /var/log directory17:46
slangasekjsalisbury: did that already17:47
TJ-slangasek: I've found lm-sensors and the 'sensors' util good for monitoring, or simply "cat /sys/class/thermal/thermal_zone*/temp"17:49
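TJ-'s sysfs suggestion can be wrapped in a few lines. A minimal sketch, assuming the kernel exposes `thermal_zone*/temp` files holding integer millidegrees Celsius; the function name `read_thermal_zones` and its `base` parameter are hypothetical, added so the path can be overridden.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or report no number
        readings[temp_file.parent.name] = millidegrees / 1000.0
    return readings

if __name__ == "__main__":
    for zone, celsius in read_thermal_zones().items():
        print(f"{zone}: {celsius:.1f} C")
```

Running it periodically while the VMs start would show whether the crashes line up with rising temperatures, even when nothing reaches the syslog.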
jsalisburyslangasek, Hard to know if it is HW related if there are no errors.  Can you post a picture of the crashes?17:49
slangasekjsalisbury: the fact that the backtraces are varied makes that more difficult.  If I get the same ksm crash on saucy again, perhaps I'll take a picture - but saucy is EOL so that hardly helps for getting it fixed17:51
slangasekand in trusty the backtraces are all over the map17:51
kirklandjsalisbury: do we have a Launchpad bug for this yet?17:57
kirklandjsalisbury: do you want me to file one?17:57
jsalisburykirkland, That would be great18:00
slangasekjsalisbury: memory test> I don't think we have a memtest that works on UEFI-only systems, do we18:00
jsalisburyslangasek, hmm, I don't know off hand.  Can you not select it from the GRUB menu?18:01
slangaseknope18:01
slangasekmemtest86+ has grub code that specifically says it doesn't work on UEFI18:01
jsalisburyslangasek, Ahh, ok.18:02
slangasek/etc/grub.d/20_memtest86+: 18:02
slangasek# We need 16-bit boot, which isn't available on EFI.18:02
slangasekif [ -d /sys/firmware/efi ]; then18:02
slangasek  exit 018:02
slangasekfi18:02
jsalisburyslangasek, It makes it difficult that the backtraces are not consistent.  Do you have to put some load on the machine, or does just booting and waiting trigger it?18:03
mjg59memtest is difficult with UEFI18:03
mjg59Because the firmware is using so much more of the memory18:04
slangasekjsalisbury: the "load" appears to be nothing more than having VMs configured to start at boot18:04
mjg59You'd probably need something that did ExitBootServices() and then ignored the runtime regions18:04
mjg59And hoped that there was no periodic SMM code using them18:04
* slangasek nods to mjg5918:04
slangasekand writing that is out of scope for my current problem ;)18:05
jsalisburyslangasek, to see if it is HW related, can you try to boot a LiveCD and see if you can get a crash again?  That would rule out anything configuration wise.18:06
slangasekjsalisbury: not conveniently, but I'll put that on my list of things to try18:07
jsalisburyslangasek, ok18:08
jsalisburykirkland, Can you try the test kernel here: http://kernel.ubuntu.com/~jsalisbury/kirkland/18:09
kirklandjsalisbury: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/135048018:09
ubot5Launchpad bug 1350480 in linux (Ubuntu) "[REGRESSION] Kernel update renders Intel NUC (i5-3427) unbootable with USB devices plugged in" [Undecided,New]18:09
jsalisburykirkland, That has four out of the five xhci commits added after 3.13.9 reverted.18:10
jsalisburykirkland, The fifth didn't revert without a backport, so I'll revert that one if the bug still exists with v1 of this test kernel.18:12
kirklandjsalisbury: installing now18:14
jsalisburykirkland, this one needs the linux-image-extra package as well as linux-image18:15
kirklandjsalisbury: k18:15
kirklandjsalisbury: I'm just trying to apport-collect before I reboot18:15
jsalisburykirkland, ack18:16
kirklandjsalisbury: rebooting18:16
kirklandjsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~DustinTestKernelv1 SMP Wed Jul 30 17:40:39 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux18:17
kirklandjsalisbury: booted just fine ;-)18:17
jsalisburykirkland, Hmm, cool.  So that means it was one of these four commits:18:19
jsalisbury7ba40e8 xhci: delete endpoints from bandwidth list before freeing whole device18:19
jsalisbury8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb18:19
jsalisbury00bd7b9 xhci: Switch Intel Lynx Point ports to EHCI on shutdown.18:19
jsalisburyc349a2e xhci: Prevent runtime pm from autosuspending during initialization18:19
jsalisburykirkland, I'll try to revert only two of them and see if we can narrow it down.18:20
Joe_CoTsmb, I was able to reproduce it right away. Can you toss me your public ssh key, and preferably your IP, and I'll give you access?18:22
Joe_CoTFor reference, it was m1.small, in us-east-1c, off the latest Ubuntu 14.04 PV SSD AMI, straight off18:23
smbJoe_CoT, You should be able to use "ssh-import-id smb" for that18:23
Joe_CoTnifty18:25
smbSo in theory I should be able to get there if you pm me the external host name18:26
Joe_CoTyeah, give me a minute and I'll have you set up18:26
Joe_CoTsmb, all set. PMed. You have free rein of that server. Just lmk when you're done with it18:32
smback18:33
Joe_CoTsmb, that server has ssh allowed in, http and https allowed out. LMK if you need more than that. You can reboot it as many times as you want, but please don't shut it down, as the problem may go away if they switch hardware on me.18:37
smbJoe_CoT, Ok, sure. 18:38
jsalisburykirkland, the second test kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/18:51
jsalisburykirkland, if my guess is right, this kernel should still have the bug.18:51
jsalisburykirkland, I think the bug is going to be caused by: 18:52
jsalisbury8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb18:52
jsalisburykirkland, I'll start building a third test kernel with just commit  8172925 reverted.18:52
kirklandjsalisbury: testing that 2nd one now19:06
jsalisburykirkland, ack19:09
kirklandjsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480v2 SMP Wed Jul 30 18:25:09 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux19:09
kirklandjsalisbury: hang on19:09
kirklandjsalisbury: let me do a hard power down, and up19:10
jsalisburykirkland, ok.  I expect it may have the bug19:10
kirklandjsalisbury: yeah, sorry, that previous warm boot worked fine;  it's cold boots that suffer19:10
jsalisburykirkland, good :-)19:11
jsalisburykirkland, A 3rd kernel is building now with only commit 8172925 reverted.  I expect it will be the fix19:11
kirklandjsalisbury: boom19:11
kirklandjsalisbury: explosions in the sky19:11
jsalisburykirkland, I'm already crafting an email to the patch author.  In the mean time, we can request a revert upstream and in Ubuntu, once we verify it's that commit.19:12
jsalisburykirkland, build should be done in a few minutes19:13
kirklandjsalisbury: okay, I'm ready and waiting :-)19:14
jsalisburykirkland, The kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/19:22
jsalisburykirkland, If it fixes the bug, I'll fire off some email to have it reverted19:23
kirklandjsalisbury: k19:23
kirklandjsalisbury: v3 boots fine (cold + warm boot)19:29
kirklandjsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480OnlyCommit8172925Reverted SMP Wed Jul 30 18:56:31 U x86_64 x86_64 x86_64 GNU/Linux19:29
jsalisburykirkland, great.  I'll send email upstream and cc you.  I'll also request that this gets reverted in the Ubuntu kernel19:30
kirklandjsalisbury: fantastic, thanks for this :-)19:34
jsalisburykirkland, np19:34
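The revert-and-test narrowing used in this exchange can be sketched as set arithmetic. This is an illustration only: it assumes a single culprit commit, the function name `suspects` is hypothetical, and the pair of commits reverted in the second test kernel is not stated in the log, so the pair used below is an assumption.

```python
def suspects(all_commits, builds):
    """Narrow down a single culprit from (reverted_commits, bug_present) results."""
    candidates = set(all_commits)
    for reverted, bug_present in builds:
        if bug_present:
            # Bug survives these reverts, so none of them is the culprit.
            candidates -= set(reverted)
        else:
            # Bug is gone, so the culprit must be among the reverted commits.
            candidates &= set(reverted)
    return candidates

xhci_commits = ["7ba40e8", "8172925", "00bd7b9", "c349a2e"]
results = [
    (xhci_commits, False),           # v1: all four reverted, boots fine
    (["7ba40e8", "00bd7b9"], True),  # v2: ASSUMED pair, cold boot still hangs
    (["8172925"], False),            # v3: only 8172925 reverted, boots fine
]
print(suspects(xhci_commits, results))
```

Under those assumptions the candidate set shrinks to commit 8172925 alone, matching the conclusion reached in the channel.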
smbJoe_CoT, So I would be off the machine for today.  I extracted some data. Interesting that most time info actually is correct. Like uptime and even the formatted time in /var/log/syslog. Just the seconds in the timestamp are off. Also interesting that while the timestamp in dmesg on my local system is correct, the vcpu_info time actually is the uptime of the host. Apparently that gets corrected by the delta (of tsc 19:36
smbtimestamp in the same struct and result of readtsc)19:36
smbWhile on that host it seems to be the same19:36
smbSo looks a bit like the correction delta goes wrong and you end up having the hosts uptime in dmesg19:36
Joe_CoTthe formatted time in syslog is correct, when using rsyslog. When using syslog-ng, it ends up with logs in october19:37
Joe_CoTbecause it's using the time from kmsg19:37
Joe_CoTbut when I reported I brought it down to the core problem, which is the dmesg timestamp19:37
Joe_CoTIf you're done for the day I'll turn it off then. If you need it again lmk. I'll try to be in the channel when I'm in the office, if I disappear give me a poke on the bug or by email.19:39
smbProbably syslog-ng does its own calculation in some way. Just some additional fact. I mean the main issue is as you pointed out, the dmesg timestamp being wrong19:39
Joe_CoTI think that syslog-ng does: current time - uptime + dmesg time. Which is why it's getting dates in the future19:40
smbSounds reasonable. Basically offset by the hosts uptime19:40
smband not the guests19:41
Joe_CoTAnd if you do dmesg -T you'll see the same then19:41
Joe_CoTyeah19:41
Joe_CoT/proc/uptime ends up with the guest uptime, but /dev/kmsg is using timestamps based on the host uptime19:42
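The arithmetic Joe_CoT suspects syslog-ng of doing (current time - uptime + dmesg time) can be worked through with made-up numbers. This is a sketch of that guess, not syslog-ng's actual code; the function name `kmsg_wall_time`, the 600 second guest uptime, and the 75 day host uptime are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

def kmsg_wall_time(now, uptime_seconds, kmsg_seconds):
    """Convert a kernel message timestamp to wall-clock time: now - uptime + kmsg time."""
    boot_time = now - timedelta(seconds=uptime_seconds)
    return boot_time + timedelta(seconds=kmsg_seconds)

now = datetime(2014, 7, 30, 19, 40, tzinfo=timezone.utc)
guest_uptime = 600                   # hypothetical: guest booted ten minutes ago
host_uptime_offset = 75 * 24 * 3600  # hypothetical: host has been up 75 days

# Healthy case: the kmsg timestamp counts from guest boot.
healthy = kmsg_wall_time(now, guest_uptime, 12.5)

# Buggy case: the kmsg timestamp is offset by the host's uptime instead.
skewed = kmsg_wall_time(now, guest_uptime, host_uptime_offset + 12.5)

print(healthy.isoformat())
print(skewed.isoformat())
```

Because /proc/uptime reports the guest's uptime while the kmsg timestamps carry the host's, the computed date lands in the future by roughly the host's uptime, which would explain log entries dated months ahead.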
smbSince the field containing the system time is the uptime of the host in both cases, there must be something wrong with how the delta is used. But I don't have enough brain left to figure that out right now. 19:43
smbOh when I say host I meant the host running the guest19:43
smbAnd we probably are on the same page and I am just too dumb by now. :)19:44
Joe_CoTlol, ok19:44
Joe_CoTwell if you need back on the server later, lmk. I'm just hoping to have this fixed before I've got PCI auditors rolling through. Don't really want to explain how our servers went back to the future :D19:45
smbHeh ok, will do. You may always claim they run ahead of time to spot problems sooner. :-P19:49
Joe_CoTFor now I just changed syslog-ng to ignore the time from the kmsg logs. It means that on server startup the entire startup shows up in the logs as the same second. That's better than October.19:51
smbTrue... unless it is October, though then it would be likely next year in the logs...19:53
smbBtw, I'll write up a summary and post it to the bug report19:53
bjfslangasek, bug 1350035 is preventing any automated testing of Utopic by either CI or the kernel team21:59
ubot5bug 1350035 in debian-installer (Ubuntu) "Debootstrap warning; Warning: Failure while configuring required packages." [Undecided,Confirmed] https://launchpad.net/bugs/135003521:59
xnoxslangasek: bjf: i believe dropping essential:yes from init metapackage would fix it. Which i have done 10h ago, but it's stuck in utopic-proposed because of the alpha2 freeze block. And debootstrap doesn't know how to use overlay repositories, so we need to wait till tomorrow for block to drop, init to migrate and try again.23:06
xnoxupdated bug report.23:07
bjfxnox, many thanks for that update23:14
