[08:34] Hello. We have a huge problem with Ubuntu 14.04 VPS inside a Hyper-V platform. Running Windows Server Backup (VSS) changes the filesystem into a read-only filesystem. It is not a specific VPS problem: all three Ubuntu machines have exactly the same problem. In the same cluster we have a CentOS machine that is not having any problem at all. The Ubuntu machines are all on 3.13.0-52-generic. Because the machines are in production, our
[08:37] genkgo, is there a dmesg error at the time of the switch to read-only ?
[08:42] apw: no, there is no log at the time the machine switches to read-only. That is exactly what makes it so hard. The problems occur randomly (at least I am not able to see a pattern). During some backups we have these logs: http://pastebin.com/MvGuDyRL. But also during other backups we have these logs: http://pastebin.com/sExsZKhV.
[08:43] apw: but we never see any log before the machine goes into read-only filesystem. those logs only occur during backups that finish successfully and do not cause the filesystem switch. they may be (and I guess so) an indication of other problems.
[08:43] genkgo, and then what happens, the filesystem reports a full error and moves the filesystem r/o ?
[08:45] apw: no errors, the filesystem is in read-only mode. so every service that tries to save files is down.
[08:46] apw: it is a simple webserver: nginx + apache + php-fpm. our http requests cannot be delivered.
[08:47] apw: we have tried multiple backup strategies: all fail. and the weird thing is: the centos inside the cluster is doing very fine.
[08:49] apw: i now see that there is a difference between centos and ubuntu. the ubuntu machine is using ext4 while centos uses xfs.
[08:52] genkgo, could you file a bug against linux for me, and i will ask someone who has a hyper-v system to see if they can reproduce the behaviour
[08:52] "ubuntu-bug linux"
[08:54] apw: will do. is there anything you can advise me?
[08:55] apw: to fix the problem temporarily?
[08:56] genkgo, when you say it goes read-only what makes you say it's read-only if we have no diagnostics saying that ?
[08:57] apw: ok, when I was logged in to the machine, certain commands failed due to read-only filesystem.
[08:58] genkgo, and yet the end of dmesg output does not indicate it going read-only ?
[08:59] genkgo, could the filesystem be frozen for the backup, i know some of the backup bits do that on hyper-v
[09:00] because /boot being ext2 and it not supporting freeze was an issue for a while
[09:00] apw: hmm, now I am having doubts. I believe I did not look at dmesg while the system was in read-only mode, but only after the reboot. And then there was nothing on read-only.
[09:01] genkgo, right, if you just reboot it won't get flushed to a permanent file, if it had gone read-only
[09:01] apw: alright: so while being in the read-only I should immediately run dmesg
[09:02] apw: I will do that and see what the logs say.
[09:02] genkgo, for sure, as the end of that might indicate something kernel side triggering read-only
[09:03] apw: I do not know Hyper-V well enough to know whether it freezes the filesystem.
[09:04] apw: I will not file the bug until I have the dmesg output
[09:04] genkgo, no, i think our best bet is that there is a kernel diagnostic at the end of dmesg, if you have a dev environment or something you can tickle this in
[09:06] apw: will try to do that. the problem is the randomness. it is hard to reproduce the issue. last time (last night) the system went into read-only after the last machine was finished with backup.
[09:06] genkgo, odd indeed
[09:07] so there were 4 machines in the backup sequence, no problem at all during backup. when last one finished, another machine got into read-only filesystem.
[09:07] genkgo, that sounds pretty odd doesn't it
[09:07] the one doing the work is ok, and another is collateral damage
[09:07] apw: sounds extremely crazy to me.
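The advice above (run dmesg while the machine is still read-only, because a reboot loses the kernel ring buffer if it could not be flushed to disk) could be sketched roughly as follows. This is only an illustration: the keyword list and the helper names `suspicious_lines` and `capture_dmesg` are assumptions made for this sketch, not something from the discussion, and the output is printed rather than saved because the disk may not be writable at that point.

```python
import subprocess

# Substrings treated as "interesting". This list is an assumption for the
# sketch, not an exhaustive set of kernel messages.
KEYWORDS = ("read-only", "ext4-fs", "i/o error", "blocked for more than")


def suspicious_lines(dmesg_text, keywords=KEYWORDS):
    """Return the dmesg lines that contain any keyword (case-insensitive)."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def capture_dmesg():
    """Run dmesg and return its output as text (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Typical use on the affected guest, printing to the terminal instead of a file:
#   print("\n".join(suspicious_lines(capture_dmesg())))
```

Printing to the terminal (or copying over the SSH session) avoids depending on a filesystem that may already refuse writes.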
[09:08] apw: I think the Hyper-V host sends a signal to the guest machines after backup, or something like that.
[09:12] apw: thanks for the help anyway. let's see what dmesg has to say during read-only filesystem.
[09:15] genkgo, yeah, that is one reason i am wondering about the freeze bits, but yes, let's get a dmesg and cat /proc/mounts as well
[09:17] genkgo, also make sure we know what kernel version we are talking about in the report
[09:17] apw: yes, at the moment it is 3.13.0-52-generic
[09:18] genkgo, i would also look at whether the backup is requesting fs freeze, as there is definitely a hyper-v interface to ask the kernel to freeze and unfreeze filesystems
[09:23] apw: what does /proc/mounts tell us
[09:23] ?
[09:23] hmm I see :)
[09:24] genkgo, lots of things
[09:25] including whether we think it is read-only or not
[09:25] apw: regarding the freeze bit, we tried multiple backup strategies, all failed
[09:26] sadly i know next to nothing about VSS backup
[09:26] genkgo, how long does the backup take btw, across all the nodes
[09:26] 1 hour and a few minutes
[09:27] and everything is working in parallel with that until the last second when the backup ends, and sometimes one of the members breaks
[09:28] well indeed we need to see if there is anything in that dmesg when it occurs, as i suspect your ext4 filesystems are mounted to go r/o on any error
[09:28] why xfs wouldn't have the same hissy fit at a failed IO is an interesting question, assuming it sees them too
[09:29] apw: the backups are not simultaneous, it takes 1 hour to backup all 4 machines in a row
[09:32] genkgo: do you have a VSS agent running in Ubuntu ?
[09:32] lifeless: yes, I believe so.
[09:33] lifeless: /usr/lib/linux-tools/3.13.0-52-generic/hv_vss_daemon is running
[09:33] genkgo: does it log anything?
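As a rough illustration of what /proc/mounts can tell us here, the sketch below reports, for each mount, whether it is currently mounted read-only and whether it carries the `errors=remount-ro` option (the "go r/o on any error" behaviour suspected above for ext4). The function names `mount_states` and `read_mounts` are made up for this sketch.

```python
def mount_states(mounts_text):
    """Parse /proc/mounts-style text.

    Returns a list of tuples:
    (mount_point, fstype, is_read_only, remounts_ro_on_error)
    """
    states = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        opts = options.split(",")
        states.append(
            (mount_point, fstype, "ro" in opts, "errors=remount-ro" in opts)
        )
    return states


def read_mounts(path="/proc/mounts"):
    """Read the kernel's current view of mounted filesystems."""
    with open(path) as handle:
        return handle.read()


# Typical use on the affected guest:
#   for state in mount_states(read_mounts()):
#       print(state)
```

Comparing this output before and after a backup run would show whether the kernel itself considers the filesystem read-only, which is separate from whatever the application errors suggest.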
[09:34] also https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx is a little terrifying
[09:37] lifeless: let me check that, I have not found any logs of the vss daemon before
[09:38] lifeless: that picture frightens me too!
[09:52] lifeless: I do not see any hv daemon logs
[09:58] genkgo, i do not believe that forms separate logs, it should log to syslog
[09:58] i also cannot see how this daemon guarantees it is able to run if for instance it gets paged out during the backup
[10:00] nor does it seem to report anything on thaw failures, hrumph, not helpful
[10:08] apw: we have planned to replicate one of the machines today (we do not want any more downtime) and backup that machine hourly until things go wrong
[10:09] apw: hopefully i can give additional information afterwards
[10:11] apw: anyway, thanks a lot so far!
[10:12] genkgo, sounds perfect
=== hugbot is now known as swordsmanz
[13:30] apw: INFO: task rs:main Q:Reg:605 blocked for more than 120 seconds. What could that mean?
[13:46] genkgo, that implies a task is unable to finish in the kernel
[13:57] apw: during boot I also see the following message: init: plymouth-upstart-bridge main process (298) terminated with status 1
[13:57] and scsi scan: INQUIRY result too short (5), using 36
[14:05] apw: would it make sense to dump the complete output of one of the machines?
[14:05] complete output of the boot sequence (/var/log/syslog)
[14:13] apw: http://pastebin.com/zGqiMkAc that's the complete syslog of one of the Ubuntu VPS machines since it booted this morning (07:45 local time).
[14:15] oh, I did |grep kernel |grep -v UFW
[14:20] jsalisbury, hello again. Regarding LP: #1201528, I reproduced the issue by playing a youtube video. Audio is completely lost, no way to recover unless I reboot. Added to the bug debug logs from pulseaudio. Anything else I can do to help?
[14:20] Launchpad bug 1201528 in linux (Ubuntu Saucy) "[INTEL DP55WG,Realtek ALC889] - Audio Playback Unavailable" [Medium,Won't fix] https://launchpad.net/bugs/1201528
=== bdmurray_ is now known as bdmurray
[14:50] nessita, thanks for the update. I'll review the bug again
[14:53] jsalisbury, thank you!
[14:54] nessita, Just to confirm, you reproduced the bug on Vivid?
[14:57] **
[14:57] ** Ubuntu Kernel Team Meeting - Today @ 17:00 UTC - #ubuntu-meeting
[14:57] **
[14:57] jsalisbury, yes, vivid and kernel 3.19.0-16-generic
[14:58] nessita, thanks
[14:58] jsalisbury, hello
[15:05] jsalisbury, is there any news about the build of that kernel you told me about?
[15:21] cristian_c, not as of yet, but should be soon
[15:24] jjohansen: apw: danwest reports https://bugs.launchpad.net/apparmor/+bug/1408833 appears to be back in 14.04.2
[15:24] Ubuntu bug 1408833 in AppArmor "broken postinst test for uvtool-libvirt on utopic" [Undecided,Confirmed]
[15:26] jsalisbury, ok
[16:55] ##
[16:55] ## Kernel team meeting in 5 minutes
[16:55] ##
[16:57] hallyn, how long has .2 been out there ?
[16:58] not a clue
[16:59] danwest: could you (or whoever ran into that bug) do an 'apport-collect 1408833' on the affected host?
[16:59] that should save apw/jjohansen some time (assuming it works)
[17:02] hallyn, danwest, the fix we applied still seems to be applied at least
[17:08] apw: It was never fixed on 3.13, afaict, maybe danwest's seeing it on the trusty kernel, not the hwe-u kernel.
[17:09] apw: At least, the bug log implies it was only fixed in 3.16 (and I hope carried forward), no indication that it was backported to older kernels.
[17:11] infinity, but .2 had the utopic kernel on ?
[17:11] no ?
[17:12] apw, the server iso but who knows about cloud-image which maybe they use
[17:12] apw: Well, that depends on how you define ".2", doesn't it?
[17:13] apw: lsb_release on any up-to-date trusty host will tell you it's 14.04.2
[17:13] infinity, ahh good point
[17:13] apw: What kernel you have installed is irrelevant.
[17:13] heh ... yay for useless monikers
[17:14] 14.04.2 isn't useless, it's just wrong for people to claim it relates to the HWE stack we happen to release at the same (ish) time.
[17:14] But I stopped fighting that battle a while ago.
=== jsalisbury changed the topic of #ubuntu-kernel to: Home: https://wiki.ubuntu.com/Kernel/ || Ubuntu Kernel Team Meeting - Tues May 19th, 2015 - 17:00 UTC || If you have a question just ask, and do wait around for an answer! If the question is should I file a bug for something, likely you can assume yes. || Channel logs: http://irclogs.ubuntu.com/
[17:15] smb: Cloud images typically don't use HWE kernels, until a cloud requires it for some reason (like, the Azure precise images moved to lts-s because they had to, and then lts-t)
[17:17] infinity, Yeah. I was just thinking about it for the reasons you said for naming that 14.04.2 as well. But forgetting that upgrade also results in the same
[17:19] smb: Yeah, 14.04.2 is a point in time in the archive, the only relation it has to HWE stacks is the ISOs.
[17:20] Which does make it confusing when people talk about it, but the whole HWE stack thing is confusing in general.
[17:21] True. More reasons to insist on bug reports with proper data (or it never happened) ...
=== adrian is now known as alvesadrian
[17:37] danwest: ^ so looks like we need data; thx
[17:57] infinity, HWE is confusing - I still don't truly get it - not obvious what it is (just a kernel??), where and how I get it, etc...
[17:58] hallyn, what data? that apport-collect tries to open a browser which turns out to be something text based like lynx
[17:58] apw, infinity, hallyn: 3.16.0-30-generic is my current kernel
[18:02] danwest: Knowing which kernel is a big help already, yes. If you could at least include that in the bug report.
[18:03] infinity, will do
=== kees_ is now known as kees
=== eseifert is now known as seiferteric
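Since the 14.04.2 discussion above turns on the point release and the running kernel being independent facts, a bug report is clearer when it states both. A minimal sketch, assuming the usual key=value layout of /etc/lsb-release; the helper names `parse_lsb_release` and `report_line` are hypothetical:

```python
import platform


def parse_lsb_release(text):
    """Parse key=value lines (as found in /etc/lsb-release) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip().strip('"')
    return info


def report_line(lsb_text, kernel=None):
    """Combine the release description with the running kernel version."""
    info = parse_lsb_release(lsb_text)
    # platform.release() returns the running kernel release on Linux.
    kernel = kernel or platform.release()
    description = info.get("DISTRIB_DESCRIPTION", "unknown release")
    return f"{description}, kernel {kernel}"


# Typical use on the affected host:
#   with open("/etc/lsb-release") as handle:
#       print(report_line(handle.read()))
```

Two hosts can print the same release description while running a 3.13 or a 3.16 kernel, which is exactly the ambiguity raised in the channel.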