/srv/irclogs.ubuntu.com/2015/05/12/#ubuntu-kernel.txt

genkgo	Hello. We have a huge problem with Ubuntu 14.04 VPS inside a Hyper V platform. Running Windows Server Backup (VSS) changes the filesystem into a read-only filesystem. It is not a specific VPS problem: all three Ubuntu machines have exactly the same problem. In the same cluster we have a CentOS machine that is not having any problem at all. The Ubuntu machines are all on 3.13.0-52-generic. Because the machines are in production, our	08:34
apw	genkgo, is there a dmesg error at the time of the switch to read-only ?	08:37
genkgo	apw: no, there is no log at the time the machine switches to read-only. That is exactly what makes it so hard. The problems occur randomly (at least I am not able to see a pattern). During some backups we have these logs: http://pastebin.com/MvGuDyRL. But also during other backups we have these logs: http://pastebin.com/sExsZKhV.	08:42
genkgo	apw: but we never see any log before the machine goes into read-only filesystem. those logs only occur during backups that finish successfully and do not cause the filesystem switch. they may be (and I guess so) an indication of other problems.	08:43
apw	genkgo, and then what happens, the filesystem reports a full error and moves the filesystem r/o ?	08:43
genkgo	apw: no errors, the filesystem is in read-only mode. so every service that tries to save files is down.	08:45
genkgo	apw: it is a simple webserver: nginx + apache + php-fpm. our http requests cannot be delivered.	08:46
genkgo	apw: we have tried multiple backup strategies: all fail. and the weird thing is: the centos inside the cluster is doing very fine.	08:47
genkgo	apw: i now see that there is a difference between centos and ubuntu. the ubuntu machine is using ext4 while centos uses xfs.	08:49
apw	genkgo, could you file a bug against linux for me, and i will ask someone who has a hyper-v system to see if they can reproduce the behaviour	08:52
apw	"ubuntu-bug linux"	08:52
genkgo	apw: will do. is there anything you can advise me?	08:54
genkgo	apw: to fix the problem temporarily?	08:55
apw	genkgo, when you say it goes read-only what makes you say its read-only if we have no diagnostics saying that ?	08:56
genkgo	apw: ok, when I was logged in to the machine, certain commands fail due to read-only filesystem.	08:57
apw	genkgo, and yet the end of dmesg output does not indicate it going read-only ?	08:58
apw	genkgo, could the filesystem be frozen for the backup, i know some of the backup bits do that on hyper-v	08:59
apw	because /boot being ext2 and it not supporting freeze was an issue for a while	09:00
genkgo	apw: hmm, now I am having doubts. I believe I did not look at dmesg while the system was in read-only mode, but only after the reboot. And then there was nothing on read-only.	09:00
apw	genkgo, right, if you just reboot it won't get flushed to a permanent file, if it had gone read-only	09:01
genkgo	apw: alright: so while being in the read-only I should immediately run dmesg	09:01
genkgo	apw: I will do that and see what the logs say.	09:02
apw	genkgo, for sure, as the end of that might indicate something kernel side triggering read-only	09:02
genkgo	apw: I do not know Hyper V well enough to know whether it freezes the filesystem.	09:03
genkgo	apw: I will not file the bug until I have the dmesg output	09:04
apw	genkgo, no, i think our best bet is that there is a kernel diagnostic at the end of dmesg, if you have a dev environment or something you can tickle this in	09:04
genkgo	apw: will try to do that. the problem is the randomness. it is hard to reproduce the issue. last time (last night) the system went into read-only after the last machine was finished with backup.	09:06
apw	genkgo, odd indeed	09:06
genkgo	so there were 4 machines in the backup sequence, no problem at all during backup. when last one finished, another machine got into read-only filesystem.	09:07
apw	genkgo, that sounds pretty odd doesn't it	09:07
apw	the one doing the work is ok, and another is collateral damage	09:07
genkgo	apw: sounds extremely crazy to me.	09:07
genkgo	apw: I think the Hyper V host sends a signal to the guest machines after backup, or something like that.	09:08
genkgo	apw: thanks for the help anyway. lets see what dmesg has to say during read-only filesystem.	09:12
apw	genkgo, yeah, that is one reason i am wondering about the freeze bits, but yes, let's get a dmesg and cat /proc/mounts as well	09:15
apw	genkgo, also make sure we know what kernel version we are talking about in the report	09:17
genkgo	apw: yes, at the moment it is 3.13.0-52-generic	09:17
apw	genkgo, i would also look at whether the backup is requesting fs freeze, as there is definitely a hyper-v interface to ask the kernel to freeze and unfreeze filesystems	09:18
genkgo	apw: what does /proc/mounts tell us	09:23
genkgo	?	09:23
genkgo	hmm I see :)	09:23
apw	genkgo, lots of things	09:24
apw	including whether we think it is read-only or not	09:25
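
A minimal sketch of the /proc/mounts check apw mentions, in Python. The function name `read_only_mounts` is a hypothetical helper, not something from the discussion; it assumes only the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options, then two numeric fields).

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs whose option list contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; match 'ro' as a whole option only,
        # so that e.g. 'errors=remount-ro' does not count.
        if "ro" in options.split(","):
            result.append((mount_point, fs_type))
    return result


if __name__ == "__main__":
    # Run on the affected guest while the problem is happening, before rebooting.
    with open("/proc/mounts") as f:
        for mount_point, fs_type in read_only_mounts(f.read()):
            print(mount_point, fs_type)
```

Saving this output together with the tail of dmesg before rebooting would give the two pieces of evidence asked for above.
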
genkgo	apw: regarding the freeze bit, we tried multiple backup strategies, all failed	09:25
apw	sadly i know next to nothing about VSS backup	09:26
apw	genkgo, how long does the backup take btw, across all the nodes	09:26
genkgo	1 hour and a few minutes	09:26
apw	and everything is working in parallel with that until the last second when the backup ends, and sometimes one of the members breaks	09:27
apw	well indeed we need to see if there is anything in that dmesg when it occurs, as i suspect your ext4 filesystems are mounted to go r/o on any error	09:28
apw	why xfs wouldn't have the same hissy fit at a failed IO is an interesting question, assuming it sees them too	09:28
genkgo	apw: the backups are not simultaneous, it takes 1 hour to back up all 4 machines in a row	09:29
lifeless	genkgo: do you have a VSS agent running in Ubuntu ?	09:32
genkgo	lifeless: yes, I believe so.	09:32
genkgo	lifeless: /usr/lib/linux-tools/3.13.0-52-generic/hv_vss_daemon is running	09:33
lifeless	genkgo: does it log anything?	09:33
lifeless	also https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx is a little terrifying	09:34
genkgo	lifeless: let me check that, I have not found any logs of the vss daemon before	09:37
genkgo	lifeless: that picture frightens me too!	09:38
genkgo	lifeless: I do not see any hv daemon logs	09:52
apw	genkgo, i do not believe that forms separate logs, it should log to syslog	09:58
apw	i also cannot see how this daemon guarantees it is able to run if for instance it gets paged out during the backup	09:58
apw	nor does it seem to report anything on thaw failures, hrumph, not helpful	10:00
genkgo	apw: we have planned to replicate one of the machines today (we do not want anymore downtime) and backup that machine hourly until things go wrong	10:08
genkgo	apw: hopefully i can give additional information afterwards	10:09
genkgo	apw: anyway, thanks a lot so far!	10:11
apw	genkgo, sounds perfect	10:12
=== hugbot is now known as swordsmanz
genkgo	apw: INFO: task rs:main Q:Reg:605 blocked for more than 120 seconds. What could that mean?	13:30
apw	genkgo, that implies a task is unable to finish in the kernel	13:46
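
A small sketch, in Python, for pulling "blocked for more than N seconds" messages like the one genkgo quotes out of saved dmesg or syslog text. The helper name `find_hung_tasks` and the regular expression are illustrative assumptions based only on the single message quoted above, not a complete parser for kernel hung-task reports.

```python
import re
import sys

# Pattern modelled on the quoted line:
#   INFO: task rs:main Q:Reg:605 blocked for more than 120 seconds.
# The task name itself may contain colons, so the greedy (.+) is anchored
# by the trailing ":<pid> blocked".
HUNG_TASK = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def find_hung_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task message found."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return hits


if __name__ == "__main__":
    # Example: dmesg | python3 this_script.py
    for name, pid, seconds in find_hung_tasks(sys.stdin.read()):
        print(name, pid, seconds)
```
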
genkgo	apw: during boot I also see the following message: init: plymouth-upstart-bridge main process (298) terminated with status 1	13:57
genkgo	and scsi scan: INQUIRY result too short (5), using 36	13:57
genkgo	apw: would it make sense to dump the complete output of one of the machines?	14:05
genkgo	complete output of the boot sequence (/var/log/syslog)	14:05
genkgo	apw: http://pastebin.com/zGqiMkAc that's the complete syslog of one of the Ubuntu VPS machines since it booted this morning (07:45 local time).	14:13
genkgo	oh, I did |grep kernel |grep -v UFW	14:15
nessita	jsalisbury, hello again. Regarding LP: #1201528, I reproduced the issue by playing a youtube video. Audio is completely lost, no way to recover unless I reboot. Added to the bug debug logs from pulseaudio. Anything else I can do to help?	14:20
ubot5	Launchpad bug 1201528 in linux (Ubuntu Saucy) "[INTEL DP55WG,Realtek ALC889] - Audio Playback Unavailable" [Medium,Won't fix] https://launchpad.net/bugs/1201528	14:20
=== bdmurray_ is now known as bdmurray
jsalisbury	nessita, thanks for the update. I'll review the bug again	14:50
nessita	jsalisbury, thank you!	14:53
jsalisbury	nessita, Just to confirm, you reproduced the bug on Vivid?	14:54
jsalisbury	**	14:57
jsalisbury	** Ubuntu Kernel Team Meeting - Today @ 17:00 UTC - #ubuntu-meeting	14:57
jsalisbury	**	14:57
nessita	jsalisbury, yes, vivid and kernel 3.19.0-16-generic	14:57
jsalisbury	nessita, thanks	14:58
cristian_c	jsalisbury, hello	14:58
cristian_c	jsalisbury, is there any news about the build of that kernel you told me about?	15:05
jsalisbury	cristian_c, not as of yet, but should be soon	15:21
hallyn	jjohansen: apw: danwest reports https://bugs.launchpad.net/apparmor/+bug/1408833 appears to be back in 14.04.2	15:24
ubot5	Ubuntu bug 1408833 in AppArmor "broken postinst test for uvtool-libvirt on utopic" [Undecided,Confirmed]	15:24
cristian_c	jsalisbury, ok	15:26
jsalisbury	##	16:55
jsalisbury	## Kernel team meeting in 5 minutes	16:55
jsalisbury	##	16:55
apw	hallyn, how long has .2 been out there ?	16:57
hallyn	not a clue	16:58
hallyn	danwest: could you (or whoever ran into that bug) do an 'apport-collect 1408833' on the affected host?	16:59
hallyn	that should save apw/jjohansen some time (assuming it works)	16:59
apw	hallyn, danwest, the fix we applied still seems to be applied at least	17:02
infinity	apw: It was never fixed on 3.13, afaict, maybe danwest's seeing it on the trusty kernel, not the hwe-u kernel.	17:08
infinity	apw: At least, the bug log implies it was only fixed in 3.16 (and I hope carried forward), no indication that it was backported to older kernels.	17:09
apw	infinity, but .2 had the utopic kernel on ?	17:11
apw	no ?	17:11
smb	apw, the server iso but who knows about cloud-image which maybe they use	17:12
infinity	apw: Well, that depends on how you define ".2", doesn't it?	17:12
infinity	apw: lsb_release on any up-to-date trusty host will tell you it's 14.04.2	17:13
apw	infinity, ahh good point	17:13
infinity	apw: What kernel you have installed is irrelevant.	17:13
apw	heh ... yay for useless monikers	17:13
infinity	14.04.2 isn't useless, it's just wrong for people to claim it relates to the HWE stack we happen to release at the same (ish) time.	17:14
infinity	But I stopped fighting that battle a while ago.	17:14
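
A minimal Python sketch of infinity's point that the release string and the running kernel are separate facts, and that a bug report needs both. The helper name `parse_lsb_release` is hypothetical, and reading `/etc/lsb-release` with `KEY=value` lines is an assumption about a typical Ubuntu install; `platform.release()` is the standard-library call that returns the running kernel release string.

```python
import platform


def parse_lsb_release(text):
    """Parse KEY=value lines (values optionally double-quoted) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip().strip('"')
    return info


if __name__ == "__main__":
    with open("/etc/lsb-release") as f:
        info = parse_lsb_release(f.read())
    # The point release says nothing about which kernel stack is installed,
    # so report the two values side by side.
    print("release:", info.get("DISTRIB_DESCRIPTION"))
    print("kernel: ", platform.release())
```
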
=== jsalisbury changed the topic of #ubuntu-kernel to: Home: https://wiki.ubuntu.com/Kernel/ || Ubuntu Kernel Team Meeting - Tues May 19th, 2015 - 17:00 UTC || If you have a question just ask, and do wait around for an answer! If the question is should I file a bug for something, likely you can assume yes. || Channel logs: http://irclogs.ubuntu.com/
infinity	smb: Cloud images typically don't use HWE kernels, until a cloud requires it for some reason (like, the Azure precise images moved to lts-s because they had to, and then lts-t)	17:15
smb	infinity, Yeah. I was just thinking about it for the reasons you said for naming that 14.04.2 as well. But forgetting that upgrade also results in the same	17:17
infinity	smb: Yeah, 14.04.2 is a point in time in the archive, the only relation it has to HWE stacks is the ISOs.	17:19
infinity	Which does make it confusing when people talk about it, but the whole HWE stack thing is confusing in general.	17:20
smb	True. More reasons to insist on bug reports with proper data (or it never happened) ...	17:21
=== adrian is now known as alvesadrian
hallyn	danwest: ^ so looks like we need data; thx	17:37
danwest	infinity, HWE is confusing - I still don't truly get it - not obvious what it is (just a kernel??), where and how I get it, etc...	17:57
danwest	hallyn, what data? that apport-collect tries to open a browser which turns out to be something text based like lynx	17:58
danwest	apw, infinity, hallyn: 3.16.0-30-generic is my current kernel	17:58
infinity	danwest: Knowing which kernel is a big help already, yes. If you could at least include that in the bug report.	18:02
danwest	infinity, will do	18:03
=== kees_ is now known as kees
=== eseifert is now known as seiferteric
