tomti | What is the easiest way to find the latest official kernel version for a particular ubuntu release? I tried apt-cache show|madison|showpkg but none give what I want | 10:18 |
---|---|---|
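A couple of commands that can answer tomti's question — a sketch, assuming the `devscripts` package is available for `rmadison`, and using trusty as an illustrative release name:

```
# Ask the Ubuntu archive which kernel versions exist per pocket for a release:
rmadison -u ubuntu linux-generic | grep trusty

# Or, on a machine running that release, check the candidate from the archive:
apt-cache policy linux-image-generic
```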
genkgo | afw: We spoke yesterday on the read-only filesystem issue with ubuntu (3.13/ext4) vs centos (3.10/xfs) inside a HyperV platform. We created a new VPS and started to backup the machine hourly. This machine just went into read-only state and I grabbed info from dmesg: http://pastebin.com/48CK60Hi and /proc/mounts http://pastebin.com/WwLej7r7 as you told me. | 10:48 |
genkgo | afw: the machine is still in read-only mode, so if you need more info, please let me know. | 10:50 |
infinity | genkgo: Nothing earlier in dmesg before that? Looks like the (fake) disk driver exploding. | 11:16 |
genkgo | infinity: yes, http://pastebin.com/DTmvZgHS | 11:19 |
genkgo | infinity: the lines I just added happen during every VSS backup | 11:20 |
genkgo | infinity: I explained yesterday to afw that I have four machines, three ubuntu 3.13.0-52 with ext4 and one with centos 3.10.0-123 with xfs. The Ubuntu go into read-only mode randomly while the CentOS machine is doing fine. | 11:22 |
genkgo | infinity: Sorry, they go randomly into read-only while HyperV is creating a backup of the four machines (in a row, not simultaneously). That could be while the backup of the specific machine is created, but also at the end of the complete backup process. | 11:24 |
xnox | .... hyperv backup does issue a freeze on the filesystems and devices. | 11:26 |
xnox | and expects the freeze & unfreeze to work... | 11:26 |
genkgo | lifeless showed me a picture of the HyperV VSS process: https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx confirming information is exchanged between the hyperv cluster and the guest machines at the end of a backup | 11:26 |
genkgo | xnox: yes, that is confirmed on the link I just pasted | 11:27 |
genkgo | but why do the ubuntu machines go into read-only mode while the centos is doing fine? | 11:27 |
genkgo | xnox: and what does the dmesg http://pastebin.com/48CK60Hi output tell me? | 11:28 |
genkgo | except from being in read-only: we knew that, I cannot find a cause in the message. | 11:29 |
xnox | writing the journal was aborted, then time jumps 200ms, the journal is attempted to be read and is incomplete. | 11:30 |
genkgo | ok, and this means filesystem inconsistency, and therefore the ubuntu kernel switches the filesystem to read-only? | 11:30 |
xnox | imho that smells like "sync" did not complete, yet "freeze" returned and the backup kicked in, thus exploding at "thaw" and remounting ro | 11:30 |
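For reference, a quick way to confirm the symptom being described here — the journal abort and the read-only remount — on an affected guest; these are generic checks, not specific to Hyper-V:

```
# Look for the I/O error and journal abort in the kernel log:
dmesg | grep -E 'I/O error|Aborting journal|Remounting filesystem read-only'

# Check whether the root filesystem is currently mounted read-only ("ro" in the options field):
grep ' / ' /proc/mounts
```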
genkgo | xnox: do you have any advice what to do? I have machines in production that are affected by this, causing downtime. | 11:32 |
genkgo | afw asked me yesterday to file a bug if I knew what was going on (dmesg). Now I have that information. Is it a bug? Is it Ubuntu Kernel related? | 11:33 |
xnox | dunno, i would have asked cking to stress test freezing/unfreezing vms under I/O workload to figure out what's going on. | 11:33 |
xnox | it should be reproducible outside hyperv | 11:33 |
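A rough sketch of the stress test xnox is suggesting: freeze and unfreeze a mounted filesystem in a loop while an I/O workload runs against it. The mount point is illustrative, and (as xnox notes further down) `fsfreeze` from util-linux may not be shipped on 14.04:

```
#!/bin/sh
# Illustrative freeze/unfreeze stress loop; MNT is a test filesystem, not /.
MNT=/mnt/test
dd if=/dev/zero of="$MNT/load" bs=1M count=10000 oflag=direct &
for i in $(seq 1 100); do
    fsfreeze --freeze "$MNT"     # block new writes, flush the journal
    sleep 1
    fsfreeze --unfreeze "$MNT"   # thaw and let the workload continue
    sleep 1
done
wait
```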
xnox | not sure who afw is, you mean apw? | 11:34 |
genkgo | xnox: sorry, I mean apw indeed :) | 11:34 |
apw | genkgo, well you have a pretty clear disk error there | 11:35 |
apw | end_request: I/O error, dev sda, sector 65127256 | 11:35 |
apw | that IO failed so the filesystem went offline | 11:35 |
genkgo | apw: ok, so you guess bad hardware? | 11:35 |
apw | genkgo, i would like to see more of the dmesg before that | 11:35 |
apw | genkgo, it is a VM so it is likely not actual h/w failure, it presumably is talking about a virtual disk | 11:36 |
xnox | also it would be interesting to know how hyperv initiates vm freeze... given that we probably lack fsfreeze and xfs_freeze userspace tools in 14.04 | 11:36 |
genkgo | apw: http://pastebin.com/DTmvZgHS contains the other lines (I also showed you yesterday). Would you like me to include boot sequence too? Because there is nothing more in between. | 11:36 |
apw | genkgo, if you showed afw yesterday, then i'd have not noticed | 11:37 |
xnox | genkgo: use pate.canonical.com and show everything =) | 11:37 |
genkgo | hehe :) | 11:37 |
xnox | genkgo: also paste.ubuntu.com works nicer with pastebinit utility ;-) | 11:37 |
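For completeness, the workflow xnox is pointing at: `pastebinit` (from the package of the same name) reads stdin and posts it, defaulting to paste.ubuntu.com on Ubuntu:

```
sudo apt-get install pastebinit
dmesg | pastebinit
```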
apw | genkgo, remind me of the kernel version again | 11:39 |
apw | genkgo, and do the ones which do not fail also report those changed operating definitions | 11:43 |
genkgo | apw: yes, they do | 11:44 |
genkgo | http://paste.ubuntu.com/11112285/ | 11:44 |
apw | i see you are using 3.13 kernels on these hyper-v guests, we are mostly producing images with HWE kernels installed for hyper-v | 11:45 |
apw | because the hypervisor interface is evolving so very fast at the moment | 11:46 |
apw | genkgo, that one also shows an aborted journal | 11:46 |
apw | [66392.076569] end_request: I/O error, dev sda, sector 65127256 | 11:46 |
apw | [66392.076610] Aborting journal on device sda5-8. | 11:46 |
genkgo | apw: correct, this is the full output of dmesg | 11:47 |
genkgo | of the same machine | 11:47 |
apw | or is that a change for each backup, and only the last output is the only one which failed | 11:47 |
genkgo | yeah, we replicated a machine as test machine yesterday, started backup hourly until the system went into read-only, which just happened | 11:47 |
genkgo | this is the full output from boot yesterday until now | 11:48 |
genkgo | apw: we are using 3.13 kernels for all ubuntu machines (the centos one is using 3.10) | 11:48 |
apw | genkgo, is the centos running the same workload as the ubuntu machines in the backup set ? | 11:49 |
xnox | genkgo: .... centos is xfs which always had freeze support, e.g. ext2 only gained freeze support in 3.19 kernel. | 11:50 |
genkgo | yeah, every machine has other purposes and therefore other services, but yeah, I think there is no difference in load | 11:50 |
xnox | genkgo: plus centos version numbers are a bit pointless, as 3.10 can have eons of cherrypicked patches. | 11:50 |
apw | genkgo, i mean are they doing the exact same things? i'd say the one which has failed had an IO in flight when the change request popped out and that has made it go pop | 11:50 |
xnox | and we default to mounting ext2 filesystems with the ext4 driver. so logs are different. | 11:50 |
genkgo | xnox: I noticed we are on version 3.10.0-123, so yeah, I imagined the patches | 11:51 |
xnox | imho you should _only_ be using hwe kernels on hyperv. | 11:51 |
genkgo | apw: no, in that case they are doing really different things | 11:51 |
xnox | apw: centos is using a different filesystem type.... | 11:51 |
genkgo | centos is doing mail (imap and smtp) | 11:51 |
xnox | as in no IO at all... | 11:52 |
genkgo | while two ubuntu machines are handling http requests | 11:52 |
xnox | which logs all the time to disk... | 11:52 |
genkgo | the final ubuntu is helper machine with all kinds of services (tomcat / libreoffice converter etc.) | 11:53 |
genkgo | xnox: so you are saying we should switch filesystem? | 11:54 |
xnox | genkgo: no. | 11:55 |
xnox | genkgo: i am saying it's uneven comparison with centos. oranges and apples. | 11:55 |
genkgo | xnox: ok | 11:55 |
xnox | genkgo: you should switch to our hwe kernels, and check if you can reproduce this with 3.19 - vivid's kernel. | 11:56 |
xnox | genkgo: and azure people want ubuntu to use 3.19 kernel and better... to get the ext2 freeze support | 11:56 |
xnox | cause default server config uses ext2 + lvm volume group and they can't freeze that for backup across the board. | 11:57 |
xnox | on other clouds we default to hwe kernels. e.g. on ec2 and similar. | 11:57 |
genkgo | xnox: ok, I will do that. switching to xfs makes no sense? | 11:59 |
xnox | genkgo: we cannot do that, no. | 11:59 |
xnox | genkgo: we are talking about all ubuntu vms launched in azure, not just your three vms. | 11:59 |
genkgo | xnox: allright, I never meant to talk about all ubuntu vms | 12:00 |
genkgo | xnox: so I leave the fs as ext4 and upgrade to HWE kernels | 12:00 |
genkgo | being 3.19 | 12:01 |
genkgo | xnox: this page does not indicate there is a 3.19 https://wiki.ubuntu.com/Kernel/LTSEnablementStack | 12:02 |
genkgo | xnox: is this the ppa ppa:canonical-kernel-team/ppa I should use? | 12:04 |
xnox | https://launchpad.net/ubuntu/+source/linux-lts-vivid | 12:05 |
xnox | it's in proposed | 12:05 |
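A sketch of how a single package can be pulled from the -proposed pocket on a trusty test box; enabling -proposed system-wide and upgrading everything from it is not advisable, which is why the `-t` flag limits it to this one install:

```
echo 'deb http://archive.ubuntu.com/ubuntu trusty-proposed main restricted universe multiverse' \
    | sudo tee /etc/apt/sources.list.d/trusty-proposed.list
sudo apt-get update
sudo apt-get install -t trusty-proposed linux-generic-lts-vivid
```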
genkgo | xnox: thank you very much for helping me out | 12:06 |
genkgo | I will install it and see what happens | 12:06 |
apw | genkgo, if this is a test box, i would suggest that you run a test using the linux-lts-utopic | 12:17 |
apw | as in theory that is what is being tested in majority in azure | 12:17 |
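As commands, apw's suggestion would look roughly like this on a 14.04 box, using the standard lts-utopic HWE metapackage name:

```
sudo apt-get update
sudo apt-get install linux-generic-lts-utopic
sudo reboot
```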
genkgo | apw: I already installed 3.19 on the test machine, using sudo add-apt-repository ppa:canonical-kernel-team/ppa, sudo apt-get install linux-generic-lts-vivid | 12:18 |
genkgo | hmm, now I am into dependency troubles | 12:26 |
genkgo | hmm, this dependency issue is harder than I had before | 12:37 |
genkgo | dpkg-deb: error: subprocess paste was killed by signal (Broken pipe) | 12:39 |
genkgo | while trying to install tools and cloud tools | 12:39 |
genkgo | trying to overwrite '/usr/bin/perf', which is also in package linux-tools-common 3.13.0-52.86 | 12:44 |
genkgo | I see this was a problem before | 12:44 |
genkgo | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1410278 | 12:44 |
ubot5 | Ubuntu bug 1410278 in linux (Ubuntu) "package linux-cloud-tools-common 3.16.0-29.39 failed to install/upgrade: subprocess installed post-installation script returned error exit status 1" [Medium,Confirmed] | 12:44 |
genkgo | I cannot remove or reinstall 3.19 | 12:46 |
genkgo | xnox: how should I install hv-kvp-daemon-init in combination with vivid kernel? | 13:15 |
genkgo | if I just do apt-get install, it asks me to install the cloud tools of the older kernel | 13:16 |
genkgo | 3.13 | 13:16 |
genkgo | I now have 3.19 + tools + cloud tools | 13:16 |
genkgo | but no hv-kvp-daemon-init | 13:17 |
apw | linux-cloud-tools-lts-vivid perhaps ? | 13:24 |
genkgo | apw: that is already installed | 13:26 |
genkgo | apw: http://paste.ubuntu.com/11113627/ | 13:27 |
genkgo | And I am currently on 3.19.0-17-generic. | 13:28 |
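A couple of checks that show what the PPA upgrade actually installed — the version string is the one reported in this log:

```
uname -r                                  # e.g. 3.19.0-17-generic
dpkg -l 'linux-*lts-vivid*' | grep '^ii'
```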
genkgo | xnox: apw: There is no current release of this source package in The Vivid Vervet (hv-kvp-daemon-init). | 14:04 |
apw | genkgo, hv-kvp-daemon-init should not be needed | 14:04 |
apw | those are carried in the kernel now | 14:05 |
genkgo | ah alright, perfect | 14:05 |
apw | and /usr/sbin/hv_kvp_daemon should start it, and it should be started automatically by upstart | 14:06 |
genkgo | apw: there is a binary over there | 14:06 |
apw | did it start correctly though | 14:07 |
genkgo | apw: it is not in the list of processes, I only see hv_vmbus_con hv_vmbus_ctl | 14:08 |
genkgo | apw: I do see some additional errors in dmesg when booting | 14:09 |
genkgo | visorutil: module is from the staging directory, the quality is unknown, you have been warned | 14:09 |
genkgo | and some visorchannel errors | 14:10 |
apw | genkgo, what does "initctl status | grep hv" say | 14:10 |
genkgo | initctl: missing job name | 14:11 |
apw | sorry initctl list | grep hv | 14:11 |
genkgo | empty | 14:11 |
apw | this is trusty right? so it is running upstart ? | 14:11 |
genkgo | apw: this is vivid | 14:11 |
apw | oh now we are getting confused, i thought it was trusty with lts-vivid installed ? | 14:12 |
genkgo | this is 14.04 with the vivid kernel | 12:12 |
apw | so trusty right | 14:12 |
genkgo | :) yes | 14:12 |
apw | with the hwe vivid kernel | 12:12 |
genkgo | yes | 12:12 |
apw | and "initctl list | head" has jobs listed | 14:13 |
genkgo | apw: yes, there are jobs | 14:13 |
apw | ls -l /etc/init/hv-* | 14:13 |
genkgo | and I installed the kernel by sudo add-apt-repository ppa:canonical-kernel-team/ppa, sudo apt-get install linux-generic-lts-vivid | 14:13 |
apw | and do you have the hv- init configuration ? | 14:14 |
genkgo | ls: cannot access /etc/init/hv-*: No such file or directory | 14:14 |
genkgo | apw: I guess not, before I just installed cloud tools and tools together with the hv daemon | 14:15 |
genkgo | http://apt-browse.org/browse/ubuntu/trusty/main/all/linux-cloud-tools-common/3.13.0-24.46/file/etc/init/hv-kvp-daemon.conf | 14:16 |
genkgo | apw: should I add that file? | 14:16 |
apw | well if you have linux-cloud-tools-lts-vivid installed you should have linux-cloud-tools-common installed as a dependency | 14:16 |
genkgo | apw: I have linux-lts-vivid-cloud-tools-common installed | 14:19 |
genkgo | not linux-cloud-tools-common | 14:19 |
genkgo | if I do, it tries to install the 3.13.0 one | 14:19 |
apw | i don't believe i expect there to _be_ a linux-lts-vivid-cloud-tools-common | 14:20 |
apw | and yes i expect it to use the 3.13 common one, as it is common to _all_ versions | 14:20 |
apw | it only carries the wrapper scripts which are common | 14:20 |
apw | and the same between them all | 14:20 |
genkgo | http://paste.ubuntu.com/11113627/ | 14:20 |
apw | well that seems bust to me | 14:21 |
genkgo | apw, so I should remove the linux-lts-vivid-cloud-tools-common | 14:21 |
genkgo | and install the common one again | 14:22 |
apw | if it will let you yes, as i think the vivid one is empty. it should also not exist | 14:22 |
apw | if it is a dependency of linux-cloud-tools-generic-lts-vivid or whatever you installed, then it is broken | 14:22 |
genkgo | ok, so now I have common tools and cloud tools (3.13.0-52.86) and the vivid kernel | 14:25 |
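At this point, a few checks that would confirm whether the Hyper-V daemons are actually present and running — the job names and paths are the ones discussed above:

```
ls -l /etc/init/hv-*               # upstart jobs shipped by linux-cloud-tools-common
initctl list | grep hv             # job state as seen by upstart
pgrep -l hv_kvp_daemon             # is the KVP daemon actually running?
```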
genkgo | hv-kvp-daemon stop/waiting | 14:25 |
apw | i think this kernel may have broken tools dependencies | 14:25 |
genkgo | same for vss and fcopy daemons | 14:25 |
apw | i am looking at it | 14:25 |
genkgo | apw: I changed the linux-lts-vivid-cloud-tools to the common one | 14:27 |
genkgo | http://paste.ubuntu.com/11114346/ | 14:27 |
genkgo | but the hv daemons are not starting | 14:27 |
apw | yep, and it has deinstalled the actual daemons | 14:28 |
apw | i think this is just broken | 14:28 |
apw | and i am not sure the utopic one is any better | 14:28 |
* apw checks properly | 14:28 | |
genkgo | apw: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1410278 | 14:29 |
ubot5 | Ubuntu bug 1410278 in linux (Ubuntu) "package linux-cloud-tools-common 3.16.0-29.39 failed to install/upgrade: subprocess installed post-installation script returned error exit status 1" [Medium,Confirmed] | 14:29 |
genkgo | apw: is it broken indeed? | 14:59 |
smoser | hey... wonder if someone could confirm my suspicion / conclusion in bug https://bugs.launchpad.net/ubuntu/+source/curtin/+bug/1443542 | 18:00 |
ubot5 | Ubuntu bug 1443542 in curtin (Ubuntu) "curtin race on vivid when /dev/sda1 doesn't exist" [Undecided,Confirmed] | 18:00 |
smoser | maybe, wonder if there is a way to achieve what i want there, without monitoring udev hooks myself or something to that effect. | 18:04 |
apw | smoser, well i can say when you do the reread ioctl the udev message has been queued before we return to you | 19:40 |
apw | whether udev would include pending ones it has not yet read in its idea of pending is still in the air | 19:41 |
smoser | hm.. | 19:42 |
smoser | udevadm settle [options] | 19:42 |
smoser | Watches the udev event queue, and exits if all current events are | 19:42 |
smoser | handled. | 19:42 |
smoser | what else would be the point, apw ? | 19:42 |
apw | smoser, i'd say it ought to see them, to my reading of that english, which is of course not the source code | 19:47 |
apw | smoser, all i can really say for sure is if you did the reread ioctl, and that returned 0, then it will have completed the: | 19:49 |
apw | kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE); | 19:49 |
apw | that is that that has been queued to all listeners | 19:50 |
smoser | apw, k. thanks. | 19:51 |
smoser | now i'm back to not knowing what was wrong. | 19:51 |
smoser | i think you shot my theory | 19:52 |
smoser | rbasak, ^ just fyi. | 19:52 |
apw | smoser, and from what i can see in udev, even if we got out of the kernel and into udevadm settle before udev is woken to read the event, we will read the event before checking if we are idle and responding | 20:02 |
apw | to the settle | 20:02 |
smoser | apw, so i think you're saying that it should work like i originally expected / coded for. | 20:03 |
smoser | a.) echo "2048," | sfdisk /dev/sda | 20:04 |
smoser | b.) blockdev --rereadpt | 20:04 |
smoser | c.) udevadm settle | 20:04 |
smoser | d.) expect /dev/sda1 to exist | 20:04 |
smoser | right? | 20:04 |
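smoser's expected sequence, written out as a runnable sketch; the disk name is illustrative, and the sfdisk line rewrites its partition table, so this is only for a throwaway test device:

```
#!/bin/sh
set -e
disk=/dev/sda
echo "2048," | sfdisk "$disk"     # (a) create one partition starting at sector 2048
blockdev --rereadpt "$disk"       # (b) ask the kernel to re-read the partition table
udevadm settle                    # (c) wait for udev to drain its event queue
test -b "${disk}1"                # (d) expect the partition device node to exist
```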
apw | smoser, though i guess it depends if more than one is produced | 20:04 |
apw | smoser, and whether you are waiting for the second one | 20:05 |
smoser | more than one? | 20:05 |
apw | yes, the event i listed was the "device has changed" for i assume sda in this case | 20:05 |
apw | is it sda1 you are waiting for ? | 20:05 |
smoser | yes. | 20:07 |
smoser | so are you saying that the kernel would emit "device_has_changed(sda)", then return from blockdev, then subsequently emit "device_has_changed(sda1)" ? | 20:08 |
smoser | that would seem unfortunate. | 20:08 |
apw | smoser, oh ... but ... actually the interface for settle is a bit odd, it is actually using a file in /run | 20:08 |
apw | smoser, no it queues them all i believe before returning 0 | 20:08 |
smoser | and then udevadm settle *should* wait until it has processed the entire queue | 20:09 |
smoser | at least it says it will. | 20:09 |
smoser | (or 120 seconds, but i don't think that's the issue here) | 20:09 |
apw | so i think although it is using a file, it is interlocking with udevd by pinging it, so they at least think they are doing the right thing | 20:12 |
apw | do you get the events in the end in your scenario ? | 20:12 |
apw | smoser, ^ | 20:12 |
smoser | well, all i have to go on is the bug at this point. | 20:13 |
smoser | and the code i pointed to | 20:13 |
smoser | apw, thanks for your help. | 20:20 |
rbasak | smoser: I think beisner said he can reliably reproduce it? | 20:20 |
smoser | yeah, but i can't have access at the moment. | 20:20 |
rbasak | I guess maybe the next step is to log udev events and compare the timing of those to the timing of the commands | 20:21 |
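One way to do that: run `udevadm monitor` in the background while the partitioning commands execute, so the event timestamps can be lined up against the command timing afterwards (the log file name is illustrative):

```
udevadm monitor --kernel --udev --property > /tmp/udev-events.log 2>&1 &
monpid=$!
# ... run the sfdisk / blockdev --rereadpt / udevadm settle sequence here ...
kill "$monpid"
```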
smoser | yeah... given apw's assessment, i think maybe we're in a different path than i originally thought. | 20:22 |
rbasak | <apw> that is that that has been queued to all listeners | 20:23 |
rbasak | apw: does that definitely mean that it's visible to udev in userspace by that point? | 20:23 |
apw | rbasak, to my understanding of the netlink code yes | 20:24 |
rbasak | I know that's what you're saying; just want to eliminate the possibility of there being some other queue in kernelspace in the way | 20:24 |
rbasak | OK, thanks. | 20:24 |
rbasak | Then I wonder if there's a race in udev between reading that and handling "settle". | 20:24 |
apw | rbasak, there may be, but it is at least claiming to handle the proposed race | 20:25 |
rbasak | Understood | 20:25 |
apw | rbasak, but i also don't think we have any proof the right thing was actually done yet ... ie that they do appear | 20:25 |
apw | (the events) | 20:25 |
rbasak | I am curious enough to dig into udev's source, but I'm busy this evening | 20:26 |
* rbasak should go | 20:26 | |
smoser | rbasak, apw http://paste.ubuntu.com/11119254/ | 20:54 |
smoser | i can get that to fail. | 20:55 |
smoser | like: | 20:55 |
smoser | BLKRRPART: Device or resource busy | 20:55 |
smoser | waitfor after partition2 failed | 20:55 |
smoser | i think that the script is doing all sane things. | 20:55 |
apw | what says BLKRRPART: De... | 20:56 |
smoser | blockdev | 20:58 |
smoser | i think | 20:58 |
smoser | but i can patch to make sure | 20:58 |
apw | so the wait is bound to fail, as you didn't actually do the partition reload | 20:58 |
apw | which might indicate something has one of the partitions open | 20:59 |
smoser | tomorrow. | 20:59 |
smoser | ? | 20:59 |
apw | if the blockdev failed, then it didn't change anything | 20:59 |
apw | and didn't emit anything to wait for | 20:59 |
smoser | sorry. i have to run. i'll look more tomorrow. | 21:00 |
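Given apw's point that a failed BLKRRPART ioctl emits no event for `udevadm settle` to wait for, one defensive sketch is to retry the re-read until it succeeds before settling (disk name illustrative):

```
disk=/dev/sda
for i in 1 2 3 4 5; do
    blockdev --rereadpt "$disk" && break   # EBUSY if something still holds a partition open
    echo "rereadpt busy, retrying ($i)" >&2
    sleep 1
done
udevadm settle
```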
nessita | jsalisbury, hi, quick question, in the audio bug you mention kernel /v4.1-rc3-vivid/ but I only see v4.1-rc2-vivid in http://kernel.ubuntu.com/~kernel-ppa/mainline/ | 21:08 |
nessita | jsalisbury, shall I try v4.1-rc2-vivid or v4.1-rc3-unstable | 21:08 |
jsalisbury | nessita, I would suggest v4.1-rc3 | 21:58 |
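Testing a mainline build like the one jsalisbury mentions usually means downloading the .deb files from the kernel-ppa mainline directory and installing them with dpkg — the directory and file patterns below are illustrative, not exact:

```
wget -r -l1 -nd -A 'linux-headers-*_all.deb,linux-*generic*amd64.deb' \
    http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-rc3-unstable/
sudo dpkg -i linux-headers-*.deb linux-image-*.deb
```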