[00:22] dannf, remind me about this tomorrow and I'll help poke it if you like
=== dupondje_ is now known as dupondje
=== ming is now known as Guest28776
=== clopez_ is now known as clopez
=== psivaa is now known as psivaa-lunch
=== psivaa-lunch is now known as psivaa
[14:12] smb, were you able to replicate #1349883 ?
[14:13] bug #1349883
[14:13] bug 1349883 in linux (Ubuntu) "dmesg time wildly incorrect on paravirtual EC2 instances." [Medium,Confirmed] https://launchpad.net/bugs/1349883
[14:13] Joe_CoT, No, not locally and not on the few ec2 m1.small I started up. One of those ended up on an AMD, the other on some Intel but even with a few reboots the times in dmesg were ok
[14:15] ok. would it be helpful if I reproduced it on a clean instance and gave you ssh access to it?
[14:18] Hm, maybe... give me a sec (or a few minutes) Then I try to see whether I can do what I am thinking of on a local instance
[14:21] ok
[14:30] hrm, could be some minutes more... only getting around 300K/s for the ddebs...
[15:13] Joe_CoT, So I am not really sure it would help. Though I can have peeks at the vcpu time info, I suspect it would only show a high system time (which we already know) and the general progress of time. From what I get from the data we have, this is a one time badness only. Meaning from that point at boot when it jumps, it increments normally
[15:14] yes
[15:14] but once it boots once badly, it consistently reboots badly
[15:14] again, I think it's the specific hardware it runs on.
[15:14] if I give you a bad one, when you reboot it it'll still be bad
[15:15] That could be one additional part of circumstances, yes
[15:15] I'm busy deploying a new code release, but after that I'll see if I can give you a clean instance where it's happening
[15:16] Ok, yeah. I need a bit more time to get my head around the code anyway.
[16:57] jsalisbury: so I'm having a terrible time with my KVM host box all of a sudden becoming very unstable, and kernel panicking left and right.
I'm concerned that it might be hardware flakiness, because the panics don't seem to all be in the same place. Do you have any advice for me before I start hand-transcribing kernel oopses into LP?
[17:02] slangasek, I guess the first thing to check, did anything change? Did you apply any updates or upgrade the kernel?
[17:02] slangasek, It might be good to try an older kernel, that was known to work and see if the panic persists or goes away, that might indicate if a software change caused this.
[17:03] jsalisbury: I previously had 3.11.0-26 installed, which is when I first saw a crash; then I upgraded the host to trusty in response to this, and am now running 3.13.0-32 which is definitely crashy
[17:03] jsalisbury: prior to that I was running on the saucy kernel for a long long time
[17:04] Can you test a Saucy kernel to see if the panic stops? I can post a link to one, if you don't have one installed already.
[17:05] slangasek, If the panic persists, then it might indicate hardware. If it goes away, we should did deeper.
[17:05] s/did/dig/
[17:06] howdy! the 3.13.0-32-generic update is rendering my Intel NUCs unbootable
[17:07] kirkland, uh oh. Do you get a panic, or blank screen, or something else?
[17:08] jsalisbury: hung on the ubuntu splash screen; never makes it into the desktop
[17:08] jsalisbury: note that this only happens on a cold boot (warm reboots seem to work fine)
[17:08] jsalisbury: and downgrading back to the -24-generic kernel also seems to work fine
[17:08] jsalisbury: I still have the saucy kernel to hand, I'll reboot and see
[17:09] kirkland, can you turn on the screen to get further debugging? I think you just need to remove quiet and splash from GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
[17:10] kirkland, and see if anything is printed.
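The GRUB change suggested at [17:09] can be sketched as a small text transformation. This is only an illustration, not something posted in the channel: the function name `strip_quiet_splash` and the sample file contents are made up here, and on a real system the edited /etc/default/grub only takes effect after running update-grub.

```python
import re

def strip_quiet_splash(grub_text: str) -> str:
    """Drop 'quiet' and 'splash' from the GRUB_CMDLINE_LINUX_DEFAULT line.

    Hypothetical helper for illustration; it works on the text of
    /etc/default/grub and leaves every other line untouched.
    """
    def fix(match: re.Match) -> str:
        # Keep every kernel option except the two that hide boot messages.
        kept = [opt for opt in match.group(2).split()
                if opt not in ("quiet", "splash")]
        joined = " ".join(kept)
        return match.group(1) + '"' + joined + '"'

    return re.sub(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
                  fix, grub_text, flags=re.MULTILINE)

# Sample contents (assumed, not taken from kirkland's machine).
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(strip_quiet_splash(sample))
```

With the two options removed, the kernel prints its boot messages to the console, which is what makes call traces visible.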
[17:10] jsalisbury: done; powering down then booting again
[17:11] kirkland, just don't forget to run update-grub too
[17:14] jsalisbury: okay, I have call traces now
[17:14] jsalisbury: pictures from my phone
[17:14] jsalisbury: looks like it's http://support.sundtek.com/index.php/topic,1600.0.html
[17:14] kirkland, cool, that should help.
[17:17] kirkland, looks like the bug is upstream as well. I'm reading through the forum posts and upstream bug report.
[17:18] kirkland, We probably pulled in the offending commit as part of upstream updates.
[17:18] jsalisbury: okay -- is there a fix upstream yet for the offending commit?
[17:20] kirkland, I see a test patch in the upstream bug report. I guess it might be worth while to test the latest mainline kernel, in case it's being worked from someone else too. It can be downloaded from:
[17:20] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc7-utopic/
[17:25] kirkland, The -24-generic kernel was based off of upstream 3.13.9, so we can also bisect from that version on, if mainline still exhibits the bug.
[17:28] jsalisbury: seems I do not trigger the bug, if I unplug my usb keyboard when booting
[17:28] kirkland, It appears to be xhci related and there were a few changes I can look at closer.
[17:33] kirkland, I'll build a test kernel with some of the xhci changes reverted to see if we can narrow down the cause and not have to bisect. I'll post a link in a bit.
[17:34] kirkland, just to confirm, the mainline kernel still has the bug, if you leave the keyboard connected while booting?
[17:36] jsalisbury: so, the saucy kernel just crashed for me again. Interestingly, while on trusty the crashes seem to be more varied, this saucy crash is the same or similar to the last saucy crash I saw, which is ksm-related
[17:41] slangasek, Hmm, and you used a kernel that never panicked before? It might be worth while to run a memory test real quick on the machine. Any type of HW related errors in the syslog?
[17:41] nothing hw-y in the syslog, no; and it never panicked before earlier this month, around the time the weather turned hot ;P
[17:43] slangasek, heh, blame the weather
[17:44] jsalisbury: the problem has persisted when the weather got better :)
[17:44] slangasek, speaking of weather, does the syslog have any mention of temp readings?
[17:44] slangasek, Just shut the window, lol
[17:44] wat
[17:44] shutting the window is what makes the room hot! :)
[17:45] slangasek, whoops, figured keeping the rain off might help :-)
[17:45] jsalisbury: what would I search for to find temp readings? 'temp' doesn't find anything useful
[17:46] slangasek, you would find something like this: " Core temperature above threshold"
[17:46] or "Core temperature/speed normal"
[17:46] jsalisbury: nada
[17:46] slangasek, maybe the logs rolled and the temps have been ok. Maybe grep the /var/log directory
[17:47] jsalisbury: did that already
[17:49] slangasek: I've found lm-sensors and the 'sensors' util good for monitoring, or simply "cat /sys/class/thermal/thermal_zone*/temp"
[17:49] slangasek, Hard to know if it is HW related if there are no errors. Can you post a picture of the crashes
[17:51] jsalisbury: the fact that the backtraces are varied makes that more difficult. If I get the same ksm crash on saucy again, perhaps I'll take a picture - but saucy is EOL so that hardly helps for getting it fixed
[17:51] and in trusty the backtraces are all over the map
[17:57] jsalisbury: do we have a Launchpad bug for this yet?
[17:57] jsalisbury: do you want me to file one?
[18:00] kirkland, That would be great
[18:00] jsalisbury: memory test> I don't think we have a memtest that works on UEFI-only systems, do we
[18:01] slangasek, hmm, I don't know off hand. Can you not select it from the GRUB menu?
[18:01] nope
[18:01] memtest86+ has grub code that specifically says it doesn't work on UEFI
[18:02] slangasek, Ahh, ok.
[18:02] /etc/grub.d/20_memtest86+:
[18:02] # We need 16-bit boot, which isn't available on EFI.
[18:02] if [ -d /sys/firmware/efi ]; then
[18:02] exit 0
[18:02] fi
[18:03] slangasek, It makes it difficult that the backtraces are not consistent. Do you have to put some load on the machine, or does just booting and waiting trigger it?
[18:03] memtest is difficult with UEFI
[18:04] Because the firmware is using so much more of the memory
[18:04] jsalisbury: the "load" appears to be nothing more than having VMs configured to start at boot
[18:04] You'd probably need something that did ExitBootServices() and then ignored the runtime regions
[18:04] And hoped that there was no periodic SMM code using them
[18:04] * slangasek nods to mjg59
[18:05] and writing that is out of scope for my current problem ;)
[18:06] slangasek, to see if it is HW related, can you try to boot a LiveCD and see if you can get a crash again. That would rule out anything configuration wise.
[18:07] jsalisbury: not conveniently, but I'll put that on my list of things to try
[18:08] slangasek, ok
[18:09] kirkland, Can you try the test kernel here: http://kernel.ubuntu.com/~jsalisbury/kirkland/
[18:09] jsalisbury: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350480
[18:09] Launchpad bug 1350480 in linux (Ubuntu) "[REGRESSION] Kernel update renders Intel NUC (i5-3427) unbootable with USB devices plugged in" [Undecided,New]
[18:10] kirkland, That has four out of the five xhci commits added after 3.13.9 reverted.
[18:12] kirkland, The fifth didn't revert without a backport, so I'll revert that one if the bug still exists with v1 of this test kernel.
[18:14] jsalisbury: installing now
[18:15] kirkland, this one needs the linux-image-extra package as well as linux-image
[18:15] jsalisbury: k
[18:15] jsalisbury: I'm just trying to apport-collect before I reboot
[18:16] kirkland, ack
[18:16] jsalisbury: rebooting
[18:17] jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~DustinTestKernelv1 SMP Wed Jul 30 17:40:39 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[18:17] jsalisbury: booted just fine ;-)
[18:19] kirkland, Hmm, cool. So that means it was one of these four commits:
[18:19] 7ba40e8 xhci: delete endpoints from bandwidth list before freeing whole device
[18:19] 8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb
[18:19] 00bd7b9 xhci: Switch Intel Lynx Point ports to EHCI on shutdown.
[18:19] c349a2e xhci: Prevent runtime pm from autosuspending during initialization
[18:20] kirkland, I'll try to revert only two of them and see if we can narrow it down.
[18:22] smb, I was able to reproduce it right away. Can you toss me your public ssh key, and preferably your IP, and I'll give you access?
[18:23] For reference, it was m1.small, in us-east-1c, off the latest Ubuntu 14.04 PV SSD AMI, straight off
[18:23] Joe_CoT, You should be able to use "ssh-import-id smb" for that
[18:25] nifty
[18:26] So in theory I should be able to get there if you pm me the external host name
[18:26] yeah, give me a minute and I'll have you set up
[18:32] smb, all set. PMed. You have free rein of that server. Just lmk when you're done with it
[18:33] ack
[18:37] smb, that server has ssh allowed in, http and https allowed out. LMK if you need more than that. You can reboot it as many times as you want, but please don't shut it down, as the problem may go away if they switch hardware on me.
[18:38] Joe_CoT, Ok, sure.
[18:51] kirkland, the second test kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/
[18:51] kirkland, if my guess is right, this kernel should still have the bug.
[18:52] kirkland, I think the bug is going to be caused by:
[18:52] 8172925 usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb
[18:52] kirkland, I'll start building a third test kernel with just commit 8172925 reverted.
[19:06] jsalisbury: testing that 2nd one now
[19:09] kirkland, ack
[19:09] jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480v2 SMP Wed Jul 30 18:25:09 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[19:09] jsalisbury: hang on
[19:10] jsalisbury: let me do a hard power down, and up
[19:10] kirkland, ok. I expect it may have the bug
[19:10] jsalisbury: yeah, sorry, that previous warm boot worked fine; it's cold boots that suffer
[19:11] kirkland, good :-)
[19:11] kirkland, A 3rd kernel is building now with only commit 8172925 reverted. I expect it will be the fix
[19:11] jsalisbury: boom
[19:11] jsalisbury: explosions in the sky
[19:12] kirkland, I'm already crafting an email to the patch author. In the mean time, we can request a revert upstream and in Ubuntu, once we verify it's that commit.
[19:13] kirkland, build should be done in a few minutes
[19:14] jsalisbury: okay, I'm ready and waiting :-)
[19:22] kirkland, The kernel is ready at: kernel.ubuntu.com/~jsalisbury/kirkland/
[19:23] kirkland, If it fixes the bug, I'll fire off some email to have it reverted
[19:23] jsalisbury: k
[19:29] jsalisbury: v3 boots fine (cold + warm boot)
[19:29] jsalisbury: Linux OrangeBox01 3.13.0-33-generic #58~lp1350480OnlyCommit8172925Reverted SMP Wed Jul 30 18:56:31 U x86_64 x86_64 x86_64 GNU/Linux
[19:30] kirkland, great. I'll send email upstream and cc you. I'll also request that this gets reverted in the Ubuntu kernel
[19:34] jsalisbury: fantastic, thanks for this :-)
[19:34] kirkland, np
[19:36] Joe_CoT, So I would be off the machine for today. I extracted some data. Interesting that most time info actually is correct. Like uptime and even the formatted time in /var/log/syslog. Just the seconds in the timestamp are off.
Also interesting that while the timestamp in dmesg on my local system is correct, the vcpu_info time actually is the uptime of the host. Apparently that gets corrected by the delta (of tsc
[19:36] timestamp in the same struct and result of readtsc)
[19:36] While on that host it seems to be the same
[19:36] So looks a bit like the correction delta goes wrong and you end up having the host's uptime in dmesg
[19:37] the formatted time in syslog is correct, when using rsyslog. When using syslog-ng, it ends up with logs in october
[19:37] because it's using the time from kmsg
[19:37] but when I reported I brought it down to the core problem, which is the dmesg timestamp
[19:39] If you're done for the day I'll turn it off then. If you need it again lmk. I'll try to be in the channel when I'm in the office, if I disappear give me a poke on the bug or by email.
[19:39] Probably syslog-ng does its own calculation in some way. Just some additional fact. I mean the main issue is as you pointed out, the dmesg timestamp being wrong
[19:40] I think that syslog-ng does: current time - uptime + dmesg time. Which is why it's getting dates in the future
[19:40] Sounds reasonable. Basically offset by the host's uptime
[19:41] and not the guest's
[19:41] And if you do dmesg -T you'll see the same then
[19:41] yeah
[19:42] /proc/uptime ends up with the guest uptime, but /dev/kmsg is using timestamps based on the host uptime
[19:43] Since the field containing the system time is the uptime of the host in both cases, there must be something wrong with how the delta is used. But I won't have enough brain for that to figure out right now.
[19:43] Oh when I say host I meant the host running the guest
[19:44] And we probably are on the same page and I am just too dumb by now. :)
[19:44] lol, ok
[19:45] well if you need back on the server later, lmk. I'm just hoping to have this fixed before I've got PCI auditors rolling through.
Don't really want to explain how our servers went back to the future :D
[19:49] Heh ok, will do. You may always claim they run ahead of time to spot problems sooner. :-P
[19:51] For now I just changed syslog-ng to ignore the time from the kmsg logs. It means that on server startup the entire startup shows up in the logs as the same second. That's better than October.
[19:53] True... unless it is October, though then it would be likely next year in the logs...
[19:53] Btw, I'll write up some summary and will post it to the bug report
[21:59] slangasek, bug 1350035 is preventing any automated testing of Utopic by either CI or the kernel team
[21:59] bug 1350035 in debian-installer (Ubuntu) "Debootstrap warning; Warning: Failure while configuring required packages." [Undecided,Confirmed] https://launchpad.net/bugs/1350035
[23:06] slangasek: bjf: i believe dropping essential:yes from init metapackage would fix it. Which i have done 10h ago, but it's stuck in utopic-proposed because of the alpha2 freeze block. And debootstrap doesn't know how to use overlay repositories, so we need to wait till tomorrow for block to drop, init to migrate and try again.
[23:07] updated bug report.
[23:14] xnox, many thanks for that update
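
A note on the timestamp arithmetic guessed at [19:40] ("current time - uptime + dmesg time"): the sketch below shows why a kernel timestamp based on the host's uptime would land in the future. It assumes that guess is right and is not syslog-ng's actual code; the function name `kmsg_wall_time` and every number (a guest up for 10 minutes, a host up for 75 days) are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def kmsg_wall_time(now: datetime, uptime_s: float, kmsg_stamp_s: float) -> datetime:
    """Wall-clock time for a kernel message, per the formula guessed in the log:
    current time - uptime + dmesg time. Hypothetical helper, not syslog-ng code."""
    boot_time = now - timedelta(seconds=uptime_s)
    return boot_time + timedelta(seconds=kmsg_stamp_s)

# All values below are assumptions for illustration only.
now = datetime(2014, 7, 30, 19, 40)
guest_uptime = 600.0           # what /proc/uptime reports inside the guest
host_uptime = 75 * 86400.0     # uptime of the machine running the guest

# A message logged 12.5 s after the guest booted, stamped correctly:
good = kmsg_wall_time(now, guest_uptime, 12.5)

# The same message if the kernel timestamp is offset by the host's uptime:
skewed_stamp = host_uptime - guest_uptime + 12.5
bad = kmsg_wall_time(now, guest_uptime, skewed_stamp)

print("correct:", good)
print("skewed: ", bad)
print("error:  ", bad - good)
```

Under these assumptions the skewed date is ahead by the difference between host and guest uptime, which matches the description of logs offset by the host's uptime rather than the guest's.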