=== Ursinha is now known as Ursinha-afk
=== Ursinha-afk is now known as Ursinha
=== Ursinha is now known as Ursinha-afk
[09:58] Hi! Anyone here who knows what hard lockup on cpu from watchdog means?
=== mitya57_ is now known as mitya57
=== luc4_mac_ is now known as luc4_mac
=== luc4_mac_ is now known as luc4_mac
=== luc4_mac_ is now known as luc4_mac
[12:23] Hi! Anyone here who knows what hard lockup on cpu from watchdog means?
[12:28] luc4_mac: vaguely
[12:28] luc4_mac: There is a 'watchdog' timer that goes off regularly, to detect when something has stopped responding (i.e. locked up)
[12:29] luc4_mac: What's the message you got and is it in a vm ?
[12:30] penguin42: hi, I suppose you don't remember me. You helped me with a network issue/bug. I'm still experiencing that and I notice in my dmesg that message "watchdog detected hard lockup in cpu 0". I was wondering if that could make my network go down.
[12:31] luc4_mac: The watchdog message is more of a symptom rather than a cause - it says something bad is happening, but not why
[12:31] penguin42: it is my understanding it might reboot some kind of processes when the CPU is overloaded.
[12:32] luc4_mac: It's not as simple as overloaded, if there is a lot of stuff running and the CPU is busy you still shouldn't get that
[12:32] penguin42: for some reason it seems that my old old system is not using DMA (don't know why either) and results overloaded for long periods.
[12:32] luc4_mac: It only happens if the kernel effectively doesn't get a chance to run for a while and that should never happen
[12:33] luc4_mac: Post a full dmesg to pastebin?
[12:34] penguin42: in that case… I was wondering if my network issue could be related to that and in that case if I should add the information to the bugreport. I rebooted, I'll have to search that if it is still in my logs. I noticed anyway it is very frequent.
[12:34] It really shouldn't happen!
[12:34] luc4_mac: When you say it's not using DMA - on hard drive?
[12:35] penguin42: it shouldn't happen that DMA is not used as well, but it seems there are many things not working properly here… yes, the hard drive seems not to be using DMA.
[12:35] what's telling you that?
[12:35] penguin42: when accessing the hard drive CPU is in IO wait almost 100%.
[12:36] also haparm seems to report that.
[12:36] hdparm sorry.
[12:36] luc4_mac: OK, get a full dmesg in a pastebin
[12:36] luc4_mac: the watchdog stuff can happen if the kernel is stuck in a driver for a long time, so if something is going badly wrong with some driver it's less surprising if you're getting a watchdog
=== luc4_mac_ is now known as luc4_mac
[12:44] penguin42: Ok, I found one of those warnings: http://pastebin.com/zciQkwya.
[12:45] luc4_mac: I need the full dmesg
[12:46] penguin42: I found that in a kern.log file. It was starting with that. Maybe it is better if I wait for it to happen again?
[12:46] luc4_mac: Look, I need the full dmesg to be helpful
[12:48] penguin42: do you mean from the boot of the system?
[12:48] luc4_mac: Just run dmesg and put the full output in a pastebin
[12:48] * penguin42 wants to see the stuff where it's detecting and doing stuff with the hardware and disks in particular
[12:49] penguin42: yes, I can do that, but it won't include the warning message because the system has not logged it yet.
[12:49] that's ok
[12:52] penguin42: entire dmesg at the moment: http://pastebin.com/0AaHcwdd.
[12:56] luc4_mac: OK, so my reading on there is that everything is in DMA at that point
[12:57] penguin42: oh… then there should be something else explaining the IO wait...
[12:57] luc4_mac: See next to each of the devices it shows UDMA or MDMA
[13:00] penguin42: this is the bug that is still affecting me anyway: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/997767.
[13:00] Ubuntu bug 997767 in linux "10ec:8139 Network connection rtl8139 lost after some hours of inactivity and comes up again on user interaction" [Medium,Confirmed]
[13:00] penguin42: I installed Ubuntu again, fresh system. After a month or so, the same is happening again.
[13:01] luc4_mac: Watch that dmesg for anything else; my guess is that after a while you'll get some errors, I'm guessing as a result of a hard drive problem and it'll reset the bus and drop out of dma
[13:02] luc4_mac: The important thing is to find the _first_ bad thing that happens in dmesg
[13:03] penguin42: Still anyway hdparm is reporting HDIO_GET_DMA failed: Inappropriate ioctl for device. Is this supposed to happen?
[13:04] no, what exactly is the hdparm command you're giving?
[13:04] penguin42: found in the Ubuntu documentation: hdparm -d /dev/sdb2.
[13:06] luc4_mac: That happens for me as well, it wouldn't surprise me if that's no longer supported now that stuff is going via the /dev/sd stuff
[13:06] penguin42: ah ok, no problems then.
[13:06] luc4_mac: what about hdparm -I /dev/sdb ?
[13:07] luc4_mac: Mine has something like DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 and I think the * indicates the one in use
[13:08] penguin42: can I ask you if it is possible at all that some energy saving is still causing the network shutdown?
[13:09] luc4_mac: Yeh that's a reasonable cause
[13:09] penguin42: so, is there a way for me to be certain that no energy saving is applied?
[13:10] luc4_mac: I'm not too sure about the energy saving stuff - there are loads of different things that do it
[13:10] penguin42: anyway, months ago you suggested to iterate an ifconfig and check what happens in case of network shutdown. What resulted is that ifconfig is iterated and shows many dropped packets. This doesn't seem energy saving to me...
[13:10] luc4_mac: I know things like 'powertop' help you find out what you can turn on to save energy, perhaps look at the docs for it to see what you can turn off
[13:11] penguin42: installing Ubuntu server might be a solution maybe...
[13:11] maybe, maybe not
[13:15] penguin42: do you think I'm heading the right way investigating this watchdog warning to solve my network issue? Or do you think that is unrelated?
[13:16] luc4_mac: the watchdog warning is a bit odd, it's possible that it's related, but the backtrace looked more disk related
[13:17] luc4_mac: The important thing is to see whether the watchdog is the 1st bad thing in the logs or whether there is something else first
[13:20] penguin42: this system is really really weird… What I just noticed is this: if I transfer via ethernet a large file using samba I get less than 500Kb/s and IO wait over 90%. If I transfer it via ssh, I get more than 10Mbit/s and almost no IO wait...
[13:21] is it a large file full of zeros ?
[13:21] I think ssh compresses by default
[13:22] penguin42: no, avi file.
[13:23] hmm ok so that should already be heavily compressed
[13:25] penguin42: ah ah, I got it… different partition :-) if I scp from one partition I get that strange behavior!
[13:26] luc4_mac: And that's on one of your disks and the other partition is on a different one?
[13:26] penguin42: yes, two different disks I think.
[13:27] penguin42: yes, sda* is ok, sdb* is not.
[13:28] luc4_mac: Ok, now do the hdparm -I /dev/sdb - what does the DMA line show?
[13:28] penguin42: so, when transferring from /dev/sda* transfer is fast. From /dev/sdb* I get a system completely overloaded.
[13:29] penguin42: the interesting line is this I think: DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5.
[13:29] hmm well that's still happy
[13:30] luc4_mac: Anything new in the dmesg output yet?
[13:30] penguin42: the system is so overloaded that even the mouse cursor is not moving.
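The DMA check discussed above (reading the `DMA:` line of `hdparm -I` and looking for the starred entry) can be sketched in Python. This is only an illustration: the function name `active_dma_mode` is made up here, the input is the text of the line as quoted in the log rather than live `hdparm` output, and it relies on the reading given in the conversation that the `*` marks the mode in use.

```python
from typing import Optional


def active_dma_mode(hdparm_output: str) -> Optional[str]:
    """Return the mode marked with '*' on the 'DMA:' line, or None if absent.

    Assumption (taken from the conversation): the '*' prefix marks the
    DMA mode currently in use.
    """
    for line in hdparm_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DMA:"):
            # Walk the space-separated mode names after the "DMA:" label.
            for token in stripped[len("DMA:"):].split():
                if token.startswith("*"):
                    return token[1:]
    return None


# The DMA line for /dev/sdb as quoted in the log.
sample = "DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5"
print(active_dma_mode(sample))  # prints: udma5
```

A result of `None` would mean no starred mode was found on the line, which is the case worth investigating further.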
[13:31] penguin42: I'm transferring now overloading the system but dmesg seems to output nothing new.
[13:31] luc4_mac: OK, that shouldn't happen
[13:32] penguin42: this also explains why usb transfer was almost stuck. Transferring from that disk is an issue.
[13:33] luc4_mac: Sure there are no new dmesg entries?
[13:33] penguin42: last line is: [ 86.952063] usb 4-2: USB disconnect, device number 2.
[13:33] penguin42: the same as before.
[13:34] luc4_mac: Hmm ok, it's odd; it's possible that the driver/controller really doesn't like slave drives - if there was an actual faulty cable or disk I'd expect to see some retries/errors in the logs
[13:35] penguin42: maybe I could plug that differently to the mb…
[13:36] luc4_mac: I'd check the master/slave/cable select jumpers on it and the master on that cable, but also if you could try swapping it to your other ide chain (as the only drive) and seeing if it still gets naff performance - that would isolate whether it's the drive or the channel
[13:37] penguin42: doing it now :-(
[13:37] luc4_mac: You might also try running smartctl -a on the drive to see if it's reporting problems, but again if it's actually faulty I'd expect some dmesg content by now
[13:37] dmesg output: [ 3266.264700] sched: RT throttling activated
[13:38] luc4_mac: https://lkml.org/lkml/2012/1/13/60 I think the slow disk, that RT throttling and the watchdog are probably related
[13:39] luc4_mac: It's either some faulty hardware or a dodgy via pata driver
[13:40] penguin42: the disk might actually be faulty yes. I might be 10 years old.
[13:41] luc4_mac: smartctl -a should tell you if the drive is actually faulty
[13:41] ah ah, I meant "it might be 10 years old".
[13:41] luc4_mac: And similarly if you move the drive to be the master alone on your 2nd channel it should help; if the problem goes away then it's unlikely to be the drive
[13:42] penguin42: I don't see any information reporting faulty hardware...
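The advice repeated in this conversation, to find the first bad thing in dmesg rather than the most recent one, can be sketched as a small Python scan over saved dmesg text. Everything here is illustrative: `first_bad_line` and `SUSPICIOUS` are hypothetical names, the first two patterns come from messages quoted in this log, and the third (`I/O error`) is an assumption about what a failing disk might log.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only, not an exhaustive list.
SUSPICIOUS = [
    re.compile(r"hard lockup", re.IGNORECASE),    # watchdog message quoted in the log
    re.compile(r"RT throttling", re.IGNORECASE),  # scheduler message quoted in the log
    re.compile(r"I/O error", re.IGNORECASE),      # assumption: typical disk error text
]


def first_bad_line(dmesg_text: str) -> Optional[Tuple[int, str]]:
    """Return (line_number, line) for the earliest suspicious line, or None."""
    for number, line in enumerate(dmesg_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS):
            return number, line
    return None


# The two dmesg lines quoted in the conversation.
sample = (
    "[   86.952063] usb 4-2: USB disconnect, device number 2\n"
    "[ 3266.264700] sched: RT throttling activated\n"
)
print(first_bad_line(sample))
```

In practice the text would come from something like `dmesg > dmesg.txt`; the pattern list would need extending for whatever driver or hardware is under suspicion.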
[13:42] luc4_mac: Can you pastebin the output of smartctl -a /dev/sdb ?
[13:44] penguin42: yes, here: http://pastebin.com/Rh5k0MEF
[13:45] odd, it says smart support is available but disabled - never seen that before
[13:45] penguin42: yes, I see that… but to be sincere I don't know what smart is.
[13:46] luc4_mac: It's a bunch of testing systems internal to the hard drive to detect when they're going wrong
[13:46] penguin42: ok, now I know :-) so I should enable it to test.
[13:46] luc4_mac: you could try smartctl --smart=on /dev/sdb and then smartctl -a /dev/sdb
[13:47] luc4_mac: I would, and then there are really 3 types of things; 1) some stats 2) Logs of errors 3) Some full tests you can trigger
[13:48] luc4_mac: Like here's my disk http://paste.ubuntu.com/1265675/
[13:48] penguin42: oooohh… I never stop learning… :-) it is better if I pastebin this :-)
[13:50] luc4_mac: In that all the stats are good, there is 'No errors logged' in the error log, and I've not run any of the actual tests
[13:52] penguin42: http://pastebin.com/5pDQsxfi
[13:52] penguin42: it seems like we found the issue.
[13:57] luc4_mac: Yeh that error log looks bad, and the pending sectors is a little high; although the reallocated sector is only 1 - sounds like you have a few bad sectors, although I'm surprised it isn't triggering more errors in dmesg - if it actually fails to read the sector it should get an error in dmesg, it might be taking a few goes to get it
[13:58] luc4_mac: Looks like the drive is a bit hot as well
[13:59] penguin42: shouldn't the bad sectors be ignored and left unused?
[14:00] luc4_mac: Not if you're trying to read data off them
[14:01] luc4_mac: Different drives behave differently; some will give up after a few retries and error it back to the OS (and you'll see it in the logs) some will keep going and just take a heck of a long time to do anything - although it still surprises me that the 1st thing you see is a watchdog/RT error
[14:05] penguin42: unfortunately I think this is not related to the network issue right?
[14:06] luc4_mac: It's unlikely
[14:11] penguin42: any suggestion how I can guess what is wrong with that?
[14:12] luc4_mac: Not really, you need to find something in some diagnostics which changes between it working and failing
[14:16] penguin42: ok, thanks for your help! ;-)
[14:16] np
[14:16] penguin42: it is always interesting to discuss with you!
=== luc4_mac_ is now known as luc4_mac
[14:27] When I click on Dash home my xserver crashes and returns me to the login screen. I've searched but only found one other person with this problem and no solution.
[14:28] What should I look at?
[14:31] What version of Ubuntu and what graphics card?
[14:32] hold on. Nvidia 5200
[14:32] I will need to check, brb
[14:34] how can I tell which version I have?
[14:36] AssociateX: If you click on the cog at the top right and do about this computer, if it doesn't crash then it should show you the number
[14:38] 12.04 lts
[14:40] I'm using the nvidia-173 driver because anything newer will not let flash play with my video card.
[14:43] geforce fx5200 is the card
[14:44] ok, I don't know the Nvidia stuff, you might want to try #ubuntu-x or #ubuntu
[14:45] ok, thank you very much for your time though.
[14:46] how about this, what file should I look at for the error, or what tool would I use? It's been a long time since I've had to use cli.
[14:47] AssociateX: If the X server is crashing then I'd expect to see a backtrace in /var/log/Xorg.0.log.old or /var/log/Xorg.0.log
[14:47] OK, I'm going to go look there. Thank you.
[14:47] AssociateX: Depending what stuff you do with your machine I'd try dropping back to Unity-2d or try the open source Nvidia driver
[14:47] Still crashes there.
[14:48] and just on the Dash home button
[14:48] nothing else
[14:50] /var/log/Xorg.0.log|less llok clean
[14:50] looks*
[14:51] try the .old variant
[14:51] ok
[14:52] nothing in /var/log/ for X
[14:53] ok, when you say X crashes, what do you actually see?
[14:53] the screen blinks/flashes, goes black, and then the login screen shows up.
[14:54] just like you would expect when logging out.
[14:54] sure sounds like an X crash
[14:54] yes
[14:56] nothing in /var/log/ for X <--- oops, I wasn't looking correctly. I have some files to look at. brb
[14:58] Caught signal 11 (Segmentation fault). Server aborting
[14:58] there you go
[14:58] yeah, I wonder what's causing it.
[14:58] almost certainly a bug in the Nvidia driver
[14:59] it should show you a backtrace
[14:59] /usr/bin/X (xorg_backtrace+0x37) [0x80a6707]
[15:00] that's the only thing that shows backtrace in it, I wouldn't know what to do with that though.
[15:00] I should get lynx up and do a paste bin
[15:00] right but there will be some similar lines below it with different names and numbers - that set of lines is the 'back trace' - put them in a pastebin
[15:01] ok, brb
[15:01] AssociateX: Still, you've only got a few options; 1) Use something else that doesn't trigger the crash other than the dash, 2) switch driver
[15:16] http://paste.ubuntu.com/1265827/
[15:16] that should be the pastebin
[15:17] [ 8595.308] Warning: Xalloc: requesting unpleasantly large amount of memory: 0 bytes.
[15:18] what the heck is that?
[15:18] yeh that's weird
[15:19] yeah, I have been using blackbox instead of unity, but I have kids that would like a regular desktop. Maybe I will just install kde or something.
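Pulling the 'back trace' lines out of an X server log, as described above, can be sketched in Python. This is a rough illustration only: `backtrace_frames` and `FRAME` are hypothetical names, and the regular expression is modelled solely on the one frame quoted in this log, so other frame layouts may not match.

```python
import re
from typing import List

# Shape of the single frame quoted in the log:
#   /usr/bin/X (xorg_backtrace+0x37) [0x80a6707]
# Assumption: other frames follow the same "path (symbol+0xoffset) [0xaddress]" layout.
FRAME = re.compile(r"\S+ \(\S+\+0x[0-9a-fA-F]+\) \[0x[0-9a-fA-F]+\]")


def backtrace_frames(xorg_log: str) -> List[str]:
    """Collect every substring of the log text that looks like a backtrace frame."""
    frames = []
    for line in xorg_log.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append(match.group(0))
    return frames


# Lines quoted in the conversation; their order here is illustrative.
sample = (
    "[ 8595.308] Warning: Xalloc: requesting unpleasantly large amount of memory: 0 bytes.\n"
    "/usr/bin/X (xorg_backtrace+0x37) [0x80a6707]\n"
    "Caught signal 11 (Segmentation fault). Server aborting\n"
)
print(backtrace_frames(sample))
```

The collected frames are the set of lines worth putting in a pastebin alongside the "Caught signal" message.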
[15:20] maybe gnome, kde is pretty big
[15:20] thank you again for all of your help
[15:21] np
=== maxb_ is now known as maxb
[15:31] * penguin42 looks at bug 1062159 and wonders why someone would crypt one slice of a RAID0
[15:31] Launchpad bug 1062159 in mdadm "Raid is incorrectly determined as DEGRADED preventing boot in 12.04" [Undecided,New] https://launchpad.net/bugs/1062159
[16:44] * penguin42 wonders what one does with a bug like 1056626
[16:44] bug 1056626
[16:44] Launchpad bug 1056626 in gammu "source distributes personal information" [Undecided,New] https://launchpad.net/bugs/1056626
[17:03] penguin42: Looks like it has been removed from upstream http://blog.cihar.com/archives/2012/09/27/think-twice-making-your-private-data-public/ Good question, though.
[17:31] hjd: I assume there is someone that should be subscribed for 'please remove' type of things if there is some question of privacy or legals - but I've never found who?
[17:43] * penguin42 has sent a request to bugcontrol asking what the right thing to do is
[17:59] huh?
[18:04] penguin42, hjd: re gammu -- a link to the updated upstream would be nice, but not critical; a patch would be very welcome
[18:07] relating to texlive: is there a Debian bug on this? We should try to keep in sync, mostly if preining is acting on it
[18:08] hggdh: It was a more general question on whether there is anyone/thing that tracks license/legal issues
[18:20] penguin42: as far as I can remember, not specifically. But then, who am I, I never dug into the licence arena. Let's wait, worst scenario we ask in -devel
[18:45] hggdh: Yeh, it just seems some things you come across sound erm dodgy and really should be sorted out
=== thomi is now known as thomj
[23:14] how do you document these days that a bug is fixed in ubuntu+1 but needs a backported fix for lucid. Bug 579958 for example.
[23:14] Launchpad bug 579958 in duplicity "Assertion error "time not moving forward at appropriate pace"" [Medium,Fix released] https://launchpad.net/bugs/579958
=== Laibsch1 is now known as Laibsch
[23:21] Laibsch, hmm. Usually you have a link to "Nominate for Series" but this one doesn't. Lucid is a half year away from EOL, so my first recommendation would be to upgrade to 12.04. If you really need it in Lucid then I'd say make a bug report for a backport request
[23:21] yeah, that's the one I was looking for as well
[23:21] hardy is half a year away
[23:21] lucid still has 2,5 years
[23:22] I found the button now, you have to look at the bug as registered in Ubuntu not for upstream
[23:22] and Ubuntu really needs to improve the experience in its stable releases
[23:22] long-term releases
[23:26] Laibsch, once they get this late in the cycle, they'll just try to get backport requests.