genkgo | Hello. We have a huge problem with Ubuntu 14.04 VPS inside a Hyper V platform. Running Windows Server Backup (VSS) changes the filesystem into a read-only filesystem. It is not a specific VPS problem: all three Ubuntu machines have exactly the same problem. In the same cluster we have a CentOS machine that is not having any problem at all. The Ubuntu machines are all on 3.13.0-52-generic. Because the machines are in production, our | 08:34 |
---|---|---|
apw | genkgo, is there a dmesg error at the time of the switch to read-only ? | 08:37 |
genkgo | apw: no, there is no log at the time the machine switches to read-only. That is exactly what makes it so hard. The problems occur randomly (at least I am not able to see a pattern). During some backups we have these logs: http://pastebin.com/MvGuDyRL. But also during other backups we have these logs: http://pastebin.com/sExsZKhV. | 08:42 |
genkgo | apw: but we never see any log before the machine goes into read-only filesystem. those logs only occur during backups that finish successfully and do not cause the filesystem switch. they may be (and I guess so) an indication of other problems. | 08:43 |
apw | genkgo, and then what happens, the filesystem reports a full error and moves the filesystem r/o ? | 08:43 |
genkgo | apw: no errors, the filesystem is in read-only mode. so every service that tries to save files is down. | 08:45 |
genkgo | apw: it is a simple webserver: nginx + apache + php-fpm. our http requests cannot be delivered. | 08:46 |
genkgo | apw: we have tried multiple backup strategies: all fail. and the weird thing is: the centos inside the cluster is doing very fine. | 08:47 |
genkgo | apw: i now see that there is a difference between centos and ubuntu. the ubuntu machine is using ext4 while centos uses xfs. | 08:49 |
apw | genkgo, could you file a bug against linux for me, and i will ask someone who has a hyper-v system to see if they can reproduce the behaviour | 08:52 |
apw | "ubuntu-bug linux" | 08:52 |
genkgo | apw: will do. is there anything you can advise me? | 08:54 |
genkgo | apw: to fix the problem temporarily? | 08:55 |
apw | genkgo, when you say it goes read-only what makes you say it's read-only if we have no diagnostics saying that ? | 08:56 |
genkgo | apw: ok, when I was logged in to the machine, certain commands fail due to read-only filesystem. | 08:57 |
apw | genkgo, and yet the end of dmesg output does not indicate it going read-only ? | 08:58 |
apw | genkgo, could the filesystem be frozen for the backup, i know some of the backup bits do that on hyper-v | 08:59 |
apw | because /boot being ext2 and it not supporting freeze was an issue for a while | 09:00 |
genkgo | apw: hmm, now I am having doubts. I believe I did not look at dmesg while the system was in read-only mode, but only after the reboot. And then there was nothing on read-only. | 09:00 |
apw | genkgo, right, if you just reboot it won't get flushed to a permanent file, if it had gone read-only | 09:01 |
genkgo | apw: alright: so while being in the read-only I should immediately run dmesg | 09:01 |
genkgo | apw: I will do that and see what the logs say. | 09:02 |
apw | genkgo, for sure, as the end of that might indicate something kernel side triggering read-only | 09:02 |
genkgo | apw: I do not know Hyper V well enough to know whether it freezes the filesystem. | 09:03 |
genkgo | apw: I will not file the bug until I have the dmesg output | 09:04 |
apw | genkgo, no, i think our best bet is that there is a kernel diagnostic at the end of dmesg, if you have a dev environment or something you can tickle this in | 09:04 |
genkgo | apw: will try to do that. the problem is the randomness. it is hard to reproduce the issue. last time (last night) the system went into read-only after the last machine was finished with backup. | 09:06 |
apw | genkgo, odd indeed | 09:06 |
genkgo | so there were 4 machines in the backup sequence, no problem at all during backup. when last one finished, another machine got into read-only filesystem. | 09:07 |
apw | genkgo, that sounds pretty odd doesn't it | 09:07 |
apw | the one doing the work is ok, and another is collateral damage | 09:07 |
genkgo | apw: sounds extremely crazy to me. | 09:07 |
genkgo | apw: I think the Hyper V host sends a signal to the guest machines after backup, or something like that. | 09:08 |
genkgo | apw: thanks for the help anyway. lets see what dmesg has to say during read-only filesystem. | 09:12 |
apw | genkgo, yeah, that is one reason i am wondering about the freeze bits, but yes, let's get a dmesg and cat /proc/mounts as well | 09:15 |
apw | genkgo, also make sure we know what kernel version we are talking about in the report | 09:17 |
genkgo | apw: yes, at the moment it is 3.13.0-52-generic | 09:17 |
apw | genkgo, i would also look at whether the backup is requesting fs freeze, as there is definitely a hyper-v interface to ask the kernel to freeze and unfreeze filesystems | 09:18 |
genkgo | apw: what does /proc/mounts tell us | 09:23 |
genkgo | ? | 09:23 |
genkgo | hmm I see :) | 09:23 |
apw | genkgo, lots of things | 09:24 |
apw | including whether we think it is read-only or not | 09:25 |
genkgo | apw: regarding the freeze bit, we tried multiple backup strategies, all failed | 09:25 |
apw | sadly i know next to nothing about VSS backup | 09:26 |
apw | genkgo, how long does the backup take btw, across all the nodes | 09:26 |
genkgo | 1 hour and a few minutes | 09:26 |
apw | and everything is working in parallel with that until the last second when the backup ends, and sometimes one of the members breaks | 09:27 |
apw | well indeed we need to see if there is anything in that dmesg when it occurs, as i suspect your ext4 filesystems are mounted to go r/o on any error | 09:28 |
apw | why xfs wouldn't have the same hissy fit at a failed IO is an interesting question, assuming it sees them too | 09:28 |
genkgo | apw: the backups are not simultaneous, it takes 1 hour to back up all 4 machines in a row | 09:29 |
lifeless | genkgo: do you have a VSS agent running in Ubuntu ? | 09:32 |
genkgo | lifeless: yes, I believe so. | 09:32 |
genkgo | lifeless: /usr/lib/linux-tools/3.13.0-52-generic/hv_vss_daemon is running | 09:33 |
lifeless | genkgo: does it log anything? | 09:33 |
lifeless | also https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx is a little terrifying | 09:34 |
genkgo | lifeless: let me check that, I have not found any logs of the vss daemon before | 09:37 |
genkgo | lifeless: that picture frightens me too! | 09:38 |
genkgo | lifeless: I do not see any hv daemon logs | 09:52 |
apw | genkgo, i do not believe that forms separate logs, it should log to syslog | 09:58 |
apw | i also cannot see how this daemon guarantees it is able to run if for instance it gets paged out during the backup | 09:58 |
apw | nor does it seem to report anything on thaw failures, hrumph, not helpful | 10:00 |
genkgo | apw: we have planned to replicate one of the machines today (we do not want any more downtime) and back up that machine hourly until things go wrong | 10:08 |
genkgo | apw: hopefully i can give additional information afterwards | 10:09 |
genkgo | apw: anyway, thanks a lot so far! | 10:11 |
apw | genkgo, sounds perfect | 10:12 |
=== hugbot is now known as swordsmanz | ||
genkgo | apw: INFO: task rs:main Q:Reg:605 blocked for more than 120 seconds. What could that mean? | 13:30 |
apw | genkgo, that implies a task is unable to finish in the kernel | 13:46 |
genkgo | apw: during boot I also see the following message: init: plymouth-upstart-bridge main process (298) terminated with status 1 | 13:57 |
genkgo | and scsi scan: INQUIRY result too short (5), using 36 | 13:57 |
genkgo | apw: would it make sense to dump the complete output of one of the machines? | 14:05 |
genkgo | complete output of the boot sequence (/var/log/syslog) | 14:05 |
genkgo | apw: http://pastebin.com/zGqiMkAc that's the complete syslog of one of the Ubuntu VPS machines since it booted this morning (07:45 local time). | 14:13 |
genkgo | oh, I did |grep kernel |grep -v UFW | 14:15 |
nessita | jsalisbury, hello again. Regarding LP: #1201528, I reproduced the issue by playing a youtube video. Audio is completely lost, no way to recover unless I reboot. Added to the bug debug logs from pulseaudio. Anything else I can do to help? | 14:20 |
ubot5 | Launchpad bug 1201528 in linux (Ubuntu Saucy) "[INTEL DP55WG,Realtek ALC889] - Audio Playback Unavailable" [Medium,Won't fix] https://launchpad.net/bugs/1201528 | 14:20 |
=== bdmurray_ is now known as bdmurray | ||
jsalisbury | nessita, thanks for the update. I'll review the bug again | 14:50 |
nessita | jsalisbury, thank you! | 14:53 |
jsalisbury | nessita, Just to confirm, you reproduced the bug on Vivid? | 14:54 |
jsalisbury | ** | 14:57 |
jsalisbury | ** Ubuntu Kernel Team Meeting - Today @ 17:00 UTC - #ubuntu-meeting | 14:57 |
jsalisbury | ** | 14:57 |
nessita | jsalisbury, yes, vivid and kernel 3.19.0-16-generic | 14:57 |
jsalisbury | nessita, thanks | 14:58 |
cristian_c | jsalisbury, hello | 14:58 |
cristian_c | jsalisbury, is there any news about the build of that kernel you told me about? | 15:05 |
jsalisbury | cristian_c, not as of yet, but should be soon | 15:21 |
hallyn | jjohansen: apw: danwest reports https://bugs.launchpad.net/apparmor/+bug/1408833 appears to be back in 14.04.2 | 15:24 |
ubot5 | Ubuntu bug 1408833 in AppArmor "broken postinst test for uvtool-libvirt on utopic" [Undecided,Confirmed] | 15:24 |
cristian_c | jsalisbury, ok | 15:26 |
jsalisbury | ## | 16:55 |
jsalisbury | ## Kernel team meeting in 5 minutes | 16:55 |
jsalisbury | ## | 16:55 |
apw | hallyn, how long has .2 been out there ? | 16:57 |
hallyn | not a clue | 16:58 |
hallyn | danwest: could you (or whoever ran into that bug) do an 'apport-collect 1408833' on the affected host? | 16:59 |
hallyn | that should save apw/jjohansen some time (assuming it works) | 16:59 |
apw | hallyn, danwest, the fix we applied still seems to be applied at least | 17:02 |
infinity | apw: It was never fixed on 3.13, afaict, maybe danwest's seeing it on the trusty kernel, not the hwe-u kernel. | 17:08 |
infinity | apw: At least, the bug log implies it was only fixed in 3.16 (and I hope carried forward), no indication that it was backported to older kernels. | 17:09 |
apw | infinity, but .2 had the utopic kernel on ? | 17:11 |
apw | no ? | 17:11 |
smb | apw, the server iso but who knows about cloud-image which maybe they use | 17:12 |
infinity | apw: Well, that depends on how you define ".2", doesn't it? | 17:12 |
infinity | apw: lsb_release on any up-to-date trusty host will tell you it's 14.04.2 | 17:13 |
apw | infinity, ahh good point | 17:13 |
infinity | apw: What kernel you have installed is irrelevant. | 17:13 |
apw | heh ... yay for useless monikers | 17:13 |
infinity | 14.04.2 isn't useless, it's just wrong for people to claim it relates to the HWE stack we happen to release at the same (ish) time. | 17:14 |
infinity | But I stopped fighting that battle a while ago. | 17:14 |
=== jsalisbury changed the topic of #ubuntu-kernel to: Home: https://wiki.ubuntu.com/Kernel/ || Ubuntu Kernel Team Meeting - Tues May 19th, 2015 - 17:00 UTC || If you have a question just ask, and do wait around for an answer! If the question is should I file a bug for something, likely you can assume yes. || Channel logs: http://irclogs.ubuntu.com/ | ||
infinity | smb: Cloud images typically don't use HWE kernels, until a cloud requires it for some reason (like, the Azure precise images moved to lts-s because they had to, and then lts-t) | 17:15 |
smb | infinity, Yeah. I was just thinking about it for the reasons you said for naming that 14.04.2 as well. But forgetting that upgrade also results in the same | 17:17 |
infinity | smb: Yeah, 14.04.2 is a point in time in the archive, the only relation it has to HWE stacks is the ISOs. | 17:19 |
infinity | Which does make it confusing when people talk about it, but the whole HWE stack thing is confusing in general. | 17:20 |
smb | True. More reasons to insist on bug reports with proper data (or it never happened) ... | 17:21 |
=== adrian is now known as alvesadrian | ||
hallyn | danwest: ^ so looks like we need data; thx | 17:37 |
danwest | infinity, HWE is confusing - I still don't truly get it - not obvious what it is (just a kernel??), where and how I get it, etc... | 17:57 |
danwest | hallyn, what data? that apport-collect tries to open a browser which turns out to be something text based like lynx | 17:58 |
danwest | apw, infinity, hallyn: 3.16.0-30-generic is my current kernel | 17:58 |
infinity | danwest: Knowing which kernel is a big help already, yes. If you could at least include that in the bug report. | 18:02 |
danwest | infinity, will do | 18:03 |
=== kees_ is now known as kees | ||
=== eseifert is now known as seiferteric | ||
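
Earlier in the log apw asks for `cat /proc/mounts` alongside dmesg because it shows whether the kernel itself considers a filesystem read-only. A minimal sketch of that check follows, assuming the usual /proc/mounts layout of whitespace-separated device, mount point, filesystem type, and comma-separated options. The function names `read_only_mounts` and `report` are made up for illustration and do not come from the log or from any Ubuntu tool.

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint, fstype) for each entry whose options include 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly so that
        # values such as 'errors=remount-ro' do not count as a hit.
        if "ro" in options.split(","):
            found.append((device, mountpoint, fstype))
    return found


def report(path="/proc/mounts"):
    """Print every read-only mount listed in the given mounts file (assumed Linux path)."""
    with open(path) as handle:
        for device, mountpoint, fstype in read_only_mounts(handle.read()):
            print(f"{mountpoint} ({fstype} on {device}) is mounted read-only")
```

Calling `report()` on the guest while the fault is present, together with saving the dmesg output before rebooting, would capture the state apw asks for. A filesystem that is only frozen for the backup would probably still be listed as `rw` here, so an empty result does not rule out the freeze theory.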