[08:34] <genkgo> Hello. We have a huge problem with Ubuntu 14.04 VPSes inside a Hyper-V platform. Running Windows Server Backup (VSS) switches the filesystem to read-only. It is not a problem with one specific VPS: all three Ubuntu machines have exactly the same problem. In the same cluster we have a CentOS machine that is not having any problems at all. The Ubuntu machines are all on 3.13.0-52-generic. Because the machines are in production, our
[08:37] <apw> genkgo, is there a dmesg error at the time of the switch to read-only ?
[08:42] <genkgo> apw: no, there is no log at the time the machine switches to read-only. That is exactly what makes it so hard. The problems occur randomly (at least I am not able to see a pattern). During some backups we have these logs: http://pastebin.com/MvGuDyRL. But during other backups we have these logs: http://pastebin.com/sExsZKhV.
[08:43] <genkgo> apw: but we never see any log before the machine goes into a read-only filesystem. Those logs also occur during backups that finish successfully and do not cause the filesystem switch. They may be (and I guess they are) an indication of other problems.
[08:43] <apw> genkgo, and then what happens, the filesystem reports a full error and moves the filesystem r/o ?
[08:45] <genkgo> apw: no errors, the filesystem is in read-only mode. so every service that tries to save files is down.
[08:46] <genkgo> apw: it is a simple webserver: nginx + apache + php-fpm. our http requests cannot be delivered.
[08:47] <genkgo> apw: we have tried multiple backup strategies: all fail. and the weird thing is: the centos inside the cluster is doing very fine.
[08:49] <genkgo> apw: i now see that there is a difference between centos and ubuntu. the ubuntu machines are using ext4 while centos uses xfs.
[08:52] <apw> genkgo, could you file a bug against linux for me, and i will ask someone who has a hyper-v system to see if they can reproduce the behaviour
[08:52] <apw> "ubuntu-bug linux"
[08:54] <genkgo> apw: will do. is there anything you can advise me?
[08:55] <genkgo> apw: to fix the problem temporarily?
[08:56] <apw> genkgo, when you say it goes read-only, what makes you say it's read-only if we have no diagnostics saying that ?
[08:57] <genkgo> apw: ok, when I was logged in to the machine, certain commands failed due to a read-only filesystem.
[08:58] <apw> genkgo, and yet the end of dmesg output does not indicate it going read-only ?
[08:59] <apw> genkgo, could the filesystem be frozen for the backup, i know some of the backup bits do that on hyper-v
[09:00] <apw> because /boot being ext2 and it not supporting freeze was an issue for a while
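For context, a freeze/thaw cycle can be exercised by hand with fsfreeze from util-linux — a minimal sketch, best run on a test VM rather than production, and assuming / and /boot are separate mounts:

    # writes block while the filesystem is frozen; reads keep working
    sudo fsfreeze --freeze /
    sleep 5
    sudo fsfreeze --unfreeze /
    # an ext2 /boot is the classic case that cannot be frozen
    if sudo fsfreeze --freeze /boot; then
        sudo fsfreeze --unfreeze /boot
    else
        echo "/boot does not support freeze"
    fi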
[09:00] <genkgo> apw: hmm, now I am having doubts. I believe I did not look at dmesg while the system was in read-only mode, but only after the reboot. And then there was nothing about read-only.
[09:01] <apw> genkgo, right, if you just reboot it won't get flushed to a permanent file, if it had gone read-only
[09:01] <genkgo> apw: alright: so while the system is in read-only mode I should immediately run dmesg
[09:02] <genkgo> apw: I will do that and see what the logs say.
[09:02] <apw> genkgo, for sure, as the end of that might indicate something kernel side triggering read-only
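A sketch of capturing that evidence before rebooting — the kernel ring buffer survives a remount-ro but not a reboot, and the read-only disk cannot take a new file, so ship it off-box (the ssh target is a placeholder):

    # readable even when the disk is read-only
    dmesg | tail -n 100
    # copy the buffer to another machine instead of writing locally
    dmesg | ssh admin@loghost 'cat > vps-readonly-dmesg.txt'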
[09:03] <genkgo> apw: I do not know Hyper-V well enough to know whether it freezes the filesystem.
[09:04] <genkgo> apw: I will not file the bug until I have the dmesg output
[09:04] <apw> genkgo, no, i think our best bet is that there is a kernel diagnostic at the end of dmesg, if you have a dev environment or something you can tickle this in
[09:06] <genkgo> apw: will try to do that. the problem is the randomness: it is hard to reproduce the issue. last time (last night) the system went into read-only after the last machine finished its backup.
[09:06] <apw> genkgo, odd indeed
[09:07] <genkgo> so there were 4 machines in the backup sequence, no problem at all during the backups. when the last one finished, another machine's filesystem went read-only.
[09:07] <apw> genkgo, that sounds pretty odd doesn't it
[09:07] <apw> the one doing the work is ok, and another is collateral damage
[09:07] <genkgo> apw: sounds extremely crazy to me.
[09:08] <genkgo> apw: I think the Hyper-V host sends a signal to the guest machines after backup, or something like that.
[09:12] <genkgo> apw: thanks for the help anyway. let's see what dmesg has to say during the read-only filesystem.
[09:15] <apw> genkgo, yeah, that is one reason i am wondering about the freeze bits, but yes, let's get a dmesg and cat /proc/mounts as well
[09:17] <apw> genkgo, also make sure we know what kernel version we are talking about in the report
[09:17] <genkgo> apw: yes, at the moment it is 3.13.0-52-generic
[09:18] <apw> genkgo, i would also look at whether the backup is requesting fs freeze, as there is definitely a hyper-v interface to ask the kernel to freeze and unfreeze filesystems
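For reference, the guest-side pieces of that interface can be checked like so (a sketch; hv_vss_daemon and hv_utils are the standard Hyper-V integration components shipped with the Ubuntu kernel):

    # the userspace VSS agent that services freeze/thaw requests from the host
    pgrep -a hv_vss_daemon
    # the kernel module it talks through
    lsmod | grep hv_utils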
[09:23] <genkgo> apw: what does /proc/mounts tell us
[09:23] <genkgo> ?
[09:23] <genkgo> hmm I see :)
[09:24] <apw> genkgo, lots of things
[09:25] <apw> including whether we think it is read-only or not
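For example, one way to pull exactly that out of it:

    # field 4 holds the mount options; a leading "ro" means read-only
    awk '$4 ~ /^ro(,|$)/ {print $1, $2, $4}' /proc/mounts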
[09:25] <genkgo> apw: regarding the freeze bit, we tried multiple backup strategies, all failed
[09:26] <apw> sadly i know next to nothing about VSS backup
[09:26] <apw> genkgo, how long does the backup take btw, across all the nodes 
[09:26] <genkgo> 1 hour and a few minutes
[09:27] <apw> and everything is working in parallel with that until the last second when the backup ends, and sometimes one of the members breaks
[09:28] <apw> well indeed we need to see if there is anything in that dmesg when it occurs, as i suspect your ext4 filesystems are mounted to go r/o on any error
[09:28] <apw> why xfs wouldn't have the same hissy fit at a failed IO is an interesting question, assuming it sees them too
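One way to confirm that suspicion on the Ubuntu guests (the device name /dev/sda1 is an assumption; substitute the real root device):

    # superblock default: "Remount read-only" flips the fs ro on any metadata error
    sudo tune2fs -l /dev/sda1 | grep -i 'errors behavior'
    # an explicit mount option would show up as errors=remount-ro
    grep 'errors=' /proc/mounts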
[09:29] <genkgo> apw: the backups are not simultaneous; it takes 1 hour to back up all 4 machines in a row
[09:32] <lifeless> genkgo: do you have a VSS agent running in Ubuntu ?
[09:32] <genkgo> lifeless: yes, I believe so.
[09:33] <genkgo> lifeless: /usr/lib/linux-tools/3.13.0-52-generic/hv_vss_daemon is running
[09:33] <lifeless> genkgo: does it log anything?
[09:34] <lifeless> also https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx is a little terrifying
[09:37] <genkgo> lifeless: let me check that, I have not found any logs of the vss daemon before
[09:38] <genkgo> lifeless: that picture frightens me too!
[09:52] <genkgo> lifeless: I do not see any hv daemon logs
[09:58] <apw> genkgo, i do not believe that daemon writes separate logs, it should log to syslog
[09:58] <apw> i also cannot see how this daemon guarantees it is able to run if for instance it gets paged out during the backup
[10:00] <apw> nor does it seem to report anything on thaw failures, hrumph, not helpful
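So syslog is the place to look, along these lines (path as on a default trusty install):

    # anything the VSS daemon printed
    grep -i 'hv_vss' /var/log/syslog
    # kernel-side freeze/thaw chatter, if any
    grep -iE 'freeze|thaw' /var/log/syslog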
[10:08] <genkgo> apw: we have planned to replicate one of the machines today (we do not want any more downtime) and back up that machine hourly until things go wrong
[10:09] <genkgo> apw: hopefully i can give additional information afterwards
[10:11] <genkgo> apw: anyway, thanks a lot so far!
[10:12] <apw> genkgo, sounds perfect
[13:30] <genkgo> apw: INFO: task rs:main Q:Reg:605 blocked for more than 120 seconds. What could that mean?
[13:46] <apw> genkgo, that implies a task is unable to finish in the kernel
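That INFO line comes from the kernel's hung-task watchdog, which fires when a task sits in uninterruptible (D) sleep past a timeout, 120 seconds by default. A sketch of how to dig further:

    # the watchdog timeout in seconds
    cat /proc/sys/kernel/hung_task_timeout_secs
    # the blocked task's stack trace follows the INFO line in dmesg
    dmesg | grep -A 20 'blocked for more than'
    # dump all tasks currently in D state (needs sysrq enabled)
    echo w | sudo tee /proc/sysrq-trigger
    dmesg | tail -n 100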
[13:57] <genkgo> apw: during boot I also see the following message: init: plymouth-upstart-bridge main process (298) terminated with status 1
[13:57] <genkgo> and scsi scan: INQUIRY result too short (5), using 36
[14:05] <genkgo> apw: would it make sense to dump the complete output of one of the machines?
[14:05] <genkgo> complete output of the boot sequence (/var/log/syslog)
[14:13] <genkgo> apw: http://pastebin.com/zGqiMkAc that's the complete syslog of one of the Ubuntu VPS machines since it booted this morning (07:45 local time).
[14:15] <genkgo> oh, I did |grep kernel |grep -v UFW
[14:20] <nessita> jsalisbury, hello again. Regarding LP: #1201528, I reproduced the issue by playing a youtube video. Audio is completely lost, no way to recover unless I reboot. Added to the bug debug logs from pulseaudio. Anything else I can do to help?
[14:50] <jsalisbury> nessita, thanks for the update.  I'll review the bug again
[14:53] <nessita> jsalisbury, thank you!
[14:54] <jsalisbury> nessita, Just to confirm, you reproduced the bug on Vivid?
[14:57] <jsalisbury> **
[14:57] <jsalisbury> ** Ubuntu Kernel Team Meeting - Today @ 17:00 UTC - #ubuntu-meeting
[14:57] <jsalisbury> **
[14:57] <nessita> jsalisbury, yes, vivid and kernel 3.19.0-16-generic
[14:58] <jsalisbury> nessita, thanks 
[14:58] <cristian_c> jsalisbury, hello
[15:05] <cristian_c> jsalisbury, are there any news about build of that kernel you told me?
[15:21] <jsalisbury> cristian_c, not as of yet, but should be soon
[15:24] <hallyn> jjohansen: apw: danwest reports https://bugs.launchpad.net/apparmor/+bug/1408833 appears to be back in 14.04.2
[15:26] <cristian_c> jsalisbury, ok
[16:55] <jsalisbury> ##
[16:55] <jsalisbury> ## Kernel team meeting in 5 minutes
[16:55] <jsalisbury> ##
[16:57] <apw> hallyn, how long has .2 been out there ?
[16:58] <hallyn> not a clue
[16:59] <hallyn> danwest: could you (or whoever ran into that bug) do an 'apport-collect 1408833' on the affected host?
[16:59] <hallyn> that should save apw/jjohansen some time (assuming it works)
[17:02] <apw> hallyn, danwest, the fix we applied still seems to be applied at least
[17:08] <infinity> apw: It was never fixed on 3.13, afaict, maybe danwest's seeing it on the trusty kernel, not the hwe-u kernel.
[17:09] <infinity> apw: At least, the bug log implies it was only fixed in 3.16 (and I hope carried forward), no indication that it was backported to older kernels.
[17:11] <apw> infinity, but .2 had the utopic kernel on ?
[17:11] <apw> no ?
[17:12] <smb> apw, the server iso did, but who knows about the cloud image, which maybe they use
[17:12] <infinity> apw: Well, that depends on how you define ".2", doesn't it?
[17:13] <infinity> apw: lsb_release on any up-to-date trusty host will tell you it's 14.04.2
[17:13] <apw> infinity, ahh good point
[17:13] <infinity> apw: What kernel you have installed is irrelevant.
[17:13] <apw> heh ... yay for useless monikers
[17:14] <infinity> 14.04.2 isn't useless, it's just wrong for people to claim it relates to the HWE stack we happen to release at the same (ish) time.
[17:14] <infinity> But I stopped fighting that battle a while ago.
[17:15] <infinity> smb: Cloud images typically don't use HWE kernels, until a cloud requires it for some reason (like, the Azure precise images moved to lts-s because they had to, and then lts-t)
[17:17] <smb> infinity, Yeah. I was just thinking about it for the reasons you said for naming that 14.04.2 as well. But I was forgetting that a simple upgrade also results in the same
[17:19] <infinity> smb: Yeah, 14.04.2 is a point in time in the archive, the only relation it has to HWE stacks is the ISOs.
[17:20] <infinity> Which does make it confusing when people talk about it, but the whole HWE stack thing is confusing in general.
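Concretely, the two facts are independent and easy to compare on any box (example values only; 3.13 is the trusty GA kernel, 3.16 the utopic HWE kernel):

    lsb_release -d   # "Ubuntu 14.04.2 LTS" on any up-to-date trusty host
    uname -r         # which kernel actually runs is a separate fact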
[17:21] <smb> True. More reasons to insist on bug reports with proper data (or it never happened) ...
[17:37] <hallyn> danwest: ^  so looks like we need data;  thx
[17:57] <danwest> infinity, HWE is confusing - I still don't truly get it - not obvious what it is (just a kernel??), where and how I get it, etc...
[17:58] <danwest> hallyn, what data? apport-collect tries to open a browser, which turns out to be something text-based like lynx
[17:58] <danwest> apw, infinity, hallyn: 3.16.0-30-generic is my current kernel
[18:02] <infinity> danwest: Knowing which kernel is a big help already, yes.  If you could at least include that in the bug report.
[18:03] <danwest> infinity, will do