tomti | What is the easiest way to find the latest official kernel version for a particular ubuntu release? I tried apt-cache show|madison|showpkg but none give what I want | 10:18 |
---|---|---|
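A couple of commands that can answer tomti's question — a sketch, assuming the `devscripts` package is available for `rmadison`, and using trusty as an illustrative release name:

```
# Ask the Ubuntu archive which kernel versions exist per pocket for a release:
rmadison -u ubuntu linux-generic | grep trusty

# Or, on a machine running that release, check the candidate from the archive:
apt-cache policy linux-image-generic
```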
genkgo | afw: We spoke yesterday on the read-only filesystem issue with ubuntu (3.13/ext4) vs centos (3.10/xfs) inside a HyperV platform. We created a new VPS and started to backup the machine hourly. This machine just went into read-only state and I grabbed info from dmesg: http://pastebin.com/48CK60Hi and /proc/mounts http://pastebin.com/WwLej7r7 as you told me. | 10:48 |
genkgo | afw: the machine is still in read-only mode, so if you need more info, please let me know. | 10:50 |
infinity | genkgo: Nothing earlier in dmesg before that? Looks like the (fake) disk driver exploding. | 11:16 |
genkgo | infinity: yes, http://pastebin.com/DTmvZgHS | 11:19 |
genkgo | infinity: the lines I just added happen during every VSS backup | 11:20 |
genkgo | infinity: I explained yesterday to afw that I have four machines, three ubuntu 3.13.0-52 with ext4 and one with centos 3.10.0-123 with xfs. The Ubuntu go into read-only mode randomly while the CentOS machine is doing fine. | 11:22 |
genkgo | infinity: Sorry, they go randomly into read-only while HyperV is creating a backup of the four machines (in a row, not simultaneously). That could be while the backup of the specific machine is created, but also at the end of the complete backup process. | 11:24 |
xnox | .... hyperv backup does issue a freeze on the filesystems and devices. | 11:26 |
xnox | and expects the freeze & unfreeze to work... | 11:26 |
genkgo | lifeless showed me a picture of the HyperV VSS process: https://msdn.microsoft.com/en-us/library/aa384589(v=vs.85).aspx confirming information is exchanged between the hyperv cluster and the guest machines at the end of a backup | 11:26 |
genkgo | xnox: yes, that is confirmed on the link I just pasted | 11:27 |
genkgo | but why do the ubuntu machines go into read-only mode while the centos is doing fine? | 11:27 |
genkgo | xnox: and what does the dmesg http://pastebin.com/48CK60Hi output tell me? | 11:28 |
genkgo | except from being in read-only: we knew that, I cannot find a cause in the message. | 11:29 |
xnox | writing the journal was aborted, then time jumps 200ms, the journal is attempted to be read and is incomplete. | 11:30 |
genkgo | ok, and this means filesystem inconsistency, and therefore the ubuntu kernel switches the filesystem to read-only? | 11:30 |
xnox | imho that smells like "sync" did not complete, yet "freeze" returned and the backup kicked in, thus exploding at "thaw" and remounting ro | 11:30 |
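For reference, a quick way to confirm the symptom being described here — the journal abort and the read-only remount — on an affected guest; these are generic checks, not specific to Hyper-V:

```
# Look for the I/O error and journal abort in the kernel log:
dmesg | grep -E 'I/O error|Aborting journal|Remounting filesystem read-only'

# Check whether the root filesystem is currently mounted read-only ("ro" in the options field):
grep ' / ' /proc/mounts
```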
genkgo | xnox: do you have any advice what to do? I have machines in production that are affected by this, causing downtime. | 11:32 |
genkgo | afw asked me yesterday to file a bug if I knew what was going on (dmesg). Now I have that information. Is it a bug? Is it Ubuntu Kernel related? | 11:33 |
xnox | dunno, i would have asked cking to stress test freezing/unfreezing vms under I/O workload to figure out what's going on. | 11:33 |
xnox | it should be reproducible outside hyperv | 11:33 |
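A rough sketch of the stress test xnox is suggesting: freeze and unfreeze a mounted filesystem in a loop while an I/O workload runs against it. The mount point is illustrative, and (as xnox notes further down) `fsfreeze` from util-linux may not be shipped on 14.04:

```
#!/bin/sh
# Illustrative freeze/unfreeze stress loop; MNT is a test filesystem, not /.
MNT=/mnt/test
dd if=/dev/zero of="$MNT/load" bs=1M count=10000 oflag=direct &
for i in $(seq 1 100); do
    fsfreeze --freeze "$MNT"     # block new writes, flush the journal
    sleep 1
    fsfreeze --unfreeze "$MNT"   # thaw and let the workload continue
    sleep 1
done
wait
```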
xnox | not sure who afw is, you mean apw? | 11:34 |
genkgo | xnox: sorry, I mean apw indeed :) | 11:34 |
apw | genkgo, well you have a pretty clear disk error there | 11:35 |
apw | end_request: I/O error, dev sda, sector 65127256 | 11:35 |
apw | that IO failed so the filesystem went offline | 11:35 |
genkgo | apw: ok, so you guess bad hardware? | 11:35 |
apw | genkgo, i would like to see more of the dmesg before that | 11:35 |
apw | genkgo, it is a VM so it is likely not actual h/w failure, it presumably is talking about a virtual disk | 11:36 |
xnox | also it would be interesting to know how hyperv initiates vm freeze... given that we probably lack fsfreeze and xfs_freeze userspace tools in 14.04 | 11:36 |
genkgo | apw: http://pastebin.com/DTmvZgHS contains the other lines (I also showed you yesterday). Would you like me to include boot sequence too? Because there is nothing more in between. | 11:36 |
apw | genkgo, if you showed afw yesterday, then i'd have not noticed | 11:37 |
xnox | genkgo: use pate.canonical.com and show everything =) | 11:37 |
genkgo | hehe :) | 11:37 |
xnox | genkgo: also paste.ubuntu.com works nicer with pastebinit utility ;-) | 11:37 |
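For completeness, the workflow xnox is pointing at: `pastebinit` (from the package of the same name) reads stdin and posts it, defaulting to paste.ubuntu.com on Ubuntu:

```
sudo apt-get install pastebinit
dmesg | pastebinit
```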
apw | genkgo, remind me of the kernel version again | 11:39 |
apw | genkgo, and do the ones which do not fail also report those changed operating definitions | 11:43 |
genkgo | apw: yes, they do | 11:44 |
genkgo | http://paste.ubuntu.com/11112285/ | 11:44 |
apw | i see you are using 3.13 kernels on these hyper-v guests, we are mostly producing images with HWE kernels installed for hyper-v | 11:45 |
apw | because the hypervisor interface is evolving so very fast at the moment | 11:46 |
apw | genkgo, that one also shows an aborted journal | 11:46 |
apw | [66392.076569] end_request: I/O error, dev sda, sector 65127256 | 11:46 |
apw | [66392.076610] Aborting journal on device sda5-8. | 11:46 |
genkgo | apw: correct, this is the full output of dmesg | 11:47 |
genkgo | of the same machine | 11:47 |
apw | or is that a change for each backup, and only the last output is the only one which failed | 11:47 |
genkgo | yeah, we replicated a machine as test machine yesterday, started backup hourly until the system went into read-only, which just happened | 11:47 |
genkgo | this is the full output from boot yesterday until now | 11:48 |
genkgo | apw: we are using 3.13 kernels for all ubuntu machines (the centos one is using 3.10) | 11:48 |
apw | genkgo, is the centos running the same workload as the ubuntu machines in the backup set ? | 11:49 |
xnox | genkgo: .... centos is xfs which always had freeze support, e.g. ext2 only gained freeze support in 3.19 kernel. | 11:50 |
genkgo | yeah, every machine has other purposes and therefore other services, but yeah, I think there is no difference in load | 11:50 |
xnox | genkgo: plus centos version numbers are a bit pointless, as 3.10 can have eons of cherrypicked patches. | 11:50 |
apw | genkgo, i mean are they doing the exact same things? i'd say the one which has failed had an IO in flight when the change request popped out and that has made it go pop | 11:50 |
xnox | and we default to mounting ext2 filesystems with the ext4 driver. so logs are different. | 11:50 |
genkgo | xnox: I noticed we are on version 3.10.0-123, so yeah, I imagined the patches | 11:51 |
xnox | imho you should _only_ be using hwe kernels on hyperv. | 11:51 |
genkgo | apw: no, in that case they are doing really different things | 11:51 |
xnox | apw: centos is using a different filesystem type.... | 11:51 |
genkgo | centos is doing mail (imap and smtp) | 11:51 |
xnox | as in no IO at all... | 11:52 |
genkgo | while two ubuntu machines are handling http requests | 11:52 |
xnox | which logs all the time to disk... | 11:52 |
genkgo | the final ubuntu is helper machine with all kinds of services (tomcat / libreoffice converter etc.) | 11:53 |
genkgo | xnox: so you are saying we should switch filesystem? | 11:54 |
xnox | genkgo: no. | 11:55 |
xnox | genkgo: i am saying it's uneven comparison with centos. oranges and apples. | 11:55 |
genkgo | xnox: ok | 11:55 |
xnox | genkgo: you should switch to our hwe kernels, and check if you can reproduce this with 3.19 - vivid's kernel. | 11:56 |
xnox | genkgo: and azure people want ubuntu to use 3.19 kernel and better... to get the ext2 freeze support | 11:56 |
xnox | cause default server config uses ext2 + lvm volume group and they can't freeze that for backup across the board. | 11:57 |
xnox | on other clouds we default to hwe kernels. e.g. on ec2 and similar. | 11:57 |
genkgo | xnox: ok, I will do that. switching to xfs makes no sense? | 11:59 |
xnox | genkgo: we cannot do that, no. | 11:59 |
xnox | genkgo: we are talking about all ubuntu vms launched in azure, not just your three vms. | 11:59 |
genkgo | xnox: allright, I never meant to talk about all ubuntu vms | 12:00 |
genkgo | xnox: so I leave the fs as ext4 and upgrade to HWE kernels | 12:00 |
genkgo | being 3.19 | 12:01 |
genkgo | xnox: this page does not indicate there is a 3.19 https://wiki.ubuntu.com/Kernel/LTSEnablementStack | 12:02 |
genkgo | xnox: is this the ppa ppa:canonical-kernel-team/ppa I should use? | 12:04 |
xnox | https://launchpad.net/ubuntu/+source/linux-lts-vivid | 12:05 |
xnox | it's in proposed | 12:05 |
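A sketch of how a single package can be pulled from the -proposed pocket on a trusty test box; enabling -proposed system-wide and upgrading everything from it is not advisable, which is why the `-t` flag limits it to this one install:

```
echo 'deb http://archive.ubuntu.com/ubuntu trusty-proposed main restricted universe multiverse' \
    | sudo tee /etc/apt/sources.list.d/trusty-proposed.list
sudo apt-get update
sudo apt-get install -t trusty-proposed linux-generic-lts-vivid
```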
genkgo | xnox: thank you very much for helping me out | 12:06 |
genkgo | I will install it and see what happens | 12:06 |
apw | genkgo, if this is a test box, i would suggest that you run a test using the linux-lts-utopic | 12:17 |
apw | as in theory that is what is being tested in majority in azure | 12:17 |
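As commands, apw's suggestion would look roughly like this on a 14.04 box, using the standard lts-utopic HWE metapackage name:

```
sudo apt-get update
sudo apt-get install linux-generic-lts-utopic
sudo reboot
```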
genkgo | apw: I already installed 3.19 on the test machine, using sudo add-apt-repository ppa:canonical-kernel-team/ppa, sudo apt-get install linux-generic-lts-vivid | 12:18 |
genkgo | hmm, now I am into dependency troubles | 12:26 |
genkgo | hmm, this dependency issue is harder than I had before | 12:37 |
genkgo | dpkg-deb: error: subprocess paste was killed by signal (Broken pipe) | 12:39 |
genkgo | while trying to install tools and cloud tools | 12:39 |
genkgo | trying to overwrite '/usr/bin/perf', which is also in package linux-tools-common 3.13.0-52.86 | 12:44 |
genkgo | I see this was a problem before | 12:44 |
genkgo | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1410278 | 12:44 |
ubot5 | Ubuntu bug 1410278 in linux (Ubuntu) "package linux-cloud-tools-common 3.16.0-29.39 failed to install/upgrade: subprocess installed post-installation script returned error exit status 1" [Medium,Confirmed] | 12:44 |
genkgo | I cannot remove or reinstall 3.19 | 12:46 |
genkgo | xnox: how should I install hv-kvp-daemon-init in combination with vivid kernel? | 13:15 |
genkgo | if I just do apt-get install, it asks me to install the cloud tools of the older kernel | 13:16 |
genkgo | 3.13 | 13:16 |
genkgo | I now have 3.19 + tools + cloud tools | 13:16 |
genkgo | but no hv-kvp-daemon-init | 13:17 |
apw | linux-cloud-tools-lts-vivid perhaps ? | 13:24 |
genkgo | apw: that is already installed | 13:26 |
genkgo | apw: http://paste.ubuntu.com/11113627/ | 13:27 |
genkgo | And I am currently on 3.19.0-17-generic. | 13:28 |
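A couple of checks that show what the PPA upgrade actually installed — the version string is the one reported in this log:

```
uname -r                                  # e.g. 3.19.0-17-generic
dpkg -l 'linux-*lts-vivid*' | grep '^ii'
```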
genkgo | xnox: apw: There is no current release of this source package in The Vivid Vervet (hv-kvp-daemon-init). | 14:04 |
apw | genkgo, hv-kvp-daemon-init should not be needed | 14:04 |
apw | those are carried in the kernel now | 14:05 |
genkgo | ah alright, perfect | 14:05 |
apw | and /usr/sbin/hv_kvp_daemon should start it, and it should be started automatically by upstart | 14:06 |
genkgo | apw: there is a binary over there | 14:06 |
apw | did it start correctly though | 14:07 |
genkgo | apw: it is not in the list of processes, I only see hv_vmbus_con hv_vmbus_ctl | 14:08 |
genkgo | apw: I do see some additional errors in dmesg when booting | 14:09 |
genkgo | visorutil: module is from the staging directory, the quality is unknown, you have been warned | 14:09 |
genkgo | and some visorchannel errors | 14:10 |
apw | genkgo, what does "initctl status | grep hv" say | 14:10 |
genkgo | initctl: missing job name | 14:11 |
apw | sorry initctl list | grep hv | 14:11 |
genkgo | empty | 14:11 |
apw | this is trusty right? so it is running upstart ? | 14:11 |
genkgo | apw: this is vivid | 14:11 |
apw | oh now we are getting confused, i thought it was trusty with lts-vivid installed ? | 14:12 |
genkgo | this is 14.04 with the vivid kernel | 12:12 |
apw | so trusty right | 14:12 |
genkgo | :) yes | 14:12 |
apw | with the hwe vivid kernel | 12:12 |
genkgo | yes | 12:12 |
apw | and "initctl list | head" has jobs listed | 14:13 |
genkgo | apw: yes, there are jobs | 14:13 |
apw | ls -l /etc/init/hv-* | 14:13 |
genkgo | and I installed the kernel by sudo add-apt-repository ppa:canonical-kernel-team/ppa, sudo apt-get install linux-generic-lts-vivid | 14:13 |
apw | and do you have the hv- init configuration ? | 14:14 |
genkgo | ls: cannot access /etc/init/hv-*: No such file or directory | 14:14 |
genkgo | apw: I guess not, before I just installed cloud tools and tools together with the hv daemon | 14:15 |
genkgo | http://apt-browse.org/browse/ubuntu/trusty/main/all/linux-cloud-tools-common/3.13.0-24.46/file/etc/init/hv-kvp-daemon.conf | 14:16 |
genkgo | apw: should I add that file? | 14:16 |
apw | well if you have linux-cloud-tools-lts-vivid installed you should have linux-cloud-tools-common installed as a dependency | 14:16 |
genkgo | apw: I have linux-lts-vivid-cloud-tools-common installed | 14:19 |
genkgo | not linux-cloud-tools-common | 14:19 |
genkgo | if I do, it tries to install the 3.13.0 one | 14:19 |
apw | i don't believe i expect there to _be_ a linux-lts-vivid-cloud-tools-common | 14:20 |
apw | and yes i expect it to use the 3.13 common one, as it is common to _all_ versions | 14:20 |
apw | it only carries the wrapper scripts which are common | 14:20 |
apw | and the same between them all | 14:20 |
genkgo | http://paste.ubuntu.com/11113627/ | 14:20 |
apw | well that seems bust to me | 14:21 |
genkgo | apw, so I should remove the linux-lts-vivid-cloud-tools-common | 14:21 |
genkgo | and install the common one again | 14:22 |
apw | if it will let you yes, as i think the vivid one is empty. it should also not exist | 14:22 |
apw | if it is a dependency of linux-cloud-tools-generic-lts-vivid or whatever you installed, then it is broken | 14:22 |
genkgo | ok, so now I have common tools and cloud tools (3.13.0-52.86) and the vivid kernel | 14:25 |
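At this point, a few checks that would confirm whether the Hyper-V daemons are actually present and running — the job names and paths are the ones discussed above:

```
ls -l /etc/init/hv-*               # upstart jobs shipped by linux-cloud-tools-common
initctl list | grep hv             # job state as seen by upstart
pgrep -l hv_kvp_daemon             # is the KVP daemon actually running?
```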
genkgo | hv-kvp-daemon stop/waiting | 14:25 |
apw | i think this kernel may have broken tools dependencies | 14:25 |
genkgo | same for vss and fcopy daemons | 14:25 |
apw | i am looking at it | 14:25 |
genkgo | apw: I changed the linux-lts-vivid-cloud-tools to the common one | 14:27 |
genkgo | http://paste.ubuntu.com/11114346/ | 14:27 |
genkgo | but the hv daemons are not starting | 14:27 |
apw | yep, and it has deinstalled the actual daemons | 14:28 |
apw | i think this is just broken | 14:28 |
apw | and i am not sure the utopic one is any better | 14:28 |
* apw checks properly | 14:28 | |
genkgo | apw: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1410278 | 14:29 |
ubot5 | Ubuntu bug 1410278 in linux (Ubuntu) "package linux-cloud-tools-common 3.16.0-29.39 failed to install/upgrade: subprocess installed post-installation script returned error exit status 1" [Medium,Confirmed] | 14:29 |
genkgo | apw: is it broken indeed? | 14:59 |
smoser | hey... wonder if someone could confirm my suspicion / conclusion in bug https://bugs.launchpad.net/ubuntu/+source/curtin/+bug/1443542 | 18:00 |
ubot5 | Ubuntu bug 1443542 in curtin (Ubuntu) "curtin race on vivid when /dev/sda1 doesn't exist" [Undecided,Confirmed] | 18:00 |
smoser | maybe, wonder if there is a way to achieve what i want there, without monitoring udev hooks myself or something to that effect. | 18:04 |
apw | smoser, well i can say when you do the reread ioctl the udev message has been queued before we return to you | 19:40 |
apw | whether udev would include pending ones it has not yet read in its idea of pending is still in the air | 19:41 |
smoser | hm.. | 19:42 |
smoser | udevadm settle [options] | 19:42 |
smoser | Watches the udev event queue, and exits if all current events are | 19:42 |
smoser | handled. | 19:42 |
smoser | what else would be the point, apw ? | 19:42 |
apw | smoser, i'd say it ought to see them, to my reading of that english, which is of course not the source code | 19:47 |
apw | smoser, all i can really say for sure is if you did the reread ioctl, and that returned 0, then it will have completed the: | 19:49 |
apw | kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE); | 19:49 |
apw | that is that that has been queued to all listeners | 19:50 |
smoser | apw, k. thanks. | 19:51 |
smoser | now i'm back to not knowing what was wrong. | 19:51 |
smoser | i think you shot my theory | 19:52 |
smoser | rbasak, ^ just fyi. | 19:52 |
apw | smoser, and from what i can see in udev, even if we got out of the kernel and into udevadm settle before udev is woken to read the event, we will read the event before checking if we are idle and responding | 20:02 |
apw | to the settle | 20:02 |
smoser | apw, so i think you're saying that it should work like i originally expected / coded for. | 20:03 |
smoser | a.) echo "2048," | sfdisk /dev/sda | 20:04 |
smoser | b.) blockdev --rereadpt | 20:04 |
smoser | c.) udevadm settle | 20:04 |
smoser | d.) expect /dev/sda1 to exist | 20:04 |
smoser | right? | 20:04 |
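smoser's expected sequence, written out as a runnable sketch; the disk name is illustrative, and the sfdisk line rewrites its partition table, so this is only for a throwaway test device:

```
#!/bin/sh
set -e
disk=/dev/sda
echo "2048," | sfdisk "$disk"     # (a) create one partition starting at sector 2048
blockdev --rereadpt "$disk"       # (b) ask the kernel to re-read the partition table
udevadm settle                    # (c) wait for udev to drain its event queue
test -b "${disk}1"                # (d) expect the partition device node to exist
```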
apw | smoser, though i guess it depends if more than one is produced | 20:04 |
apw | smoser, and whether you are waiting for the second one | 20:05 |
smoser | more than one? | 20:05 |
apw | yes, the event i listed was the "device has changed" for i assume sda in this case | 20:05 |
apw | is it sda1 you are waiting for ? | 20:05 |
smoser | yes. | 20:07 |
smoser | so are you saying that the kernel would emit "device_has_changed(sda)", then return from blockdev, then subsequently emit "device_has_changed(sda1)" ? | 20:08 |
smoser | that would seem unfortunate. | 20:08 |
apw | smoser, oh ... but ... actually the interface for settle is a bit odd, it is actually using a file in /run | 20:08 |
apw | smoser, no it queues them all i believe before returning 0 | 20:08 |
smoser | and then udevadm settle *should* wait until it has processed the entire queue | 20:09 |
smoser | at least it says it will. | 20:09 |
smoser | (or 120 seconds, but i don't think that's the issue here) | 20:09 |
apw | so i think although it is using a file, it is interlocking with udevd by pinging it, so they at least think they are doing the right thing | 20:12 |
apw | do you get the events in the end in your scenario ? | 20:12 |
apw | smoser, ^ | 20:12 |
smoser | well, all i have to go on is the bug at this point. | 20:13 |
smoser | and the code i pointed to | 20:13 |
smoser | apw, thanks for your help. | 20:20 |
rbasak | smoser: I think beisner said he can reliably reproduce it? | 20:20 |
smoser | yeah, but i can't have access at the moment. | 20:20 |
rbasak | I guess maybe the next step is to log udev events and compare the timing of those to the timing of the commands | 20:21 |
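One way to do that: run `udevadm monitor` in the background while the partitioning commands execute, so the event timestamps can be lined up against the command timing afterwards (the log file name is illustrative):

```
udevadm monitor --kernel --udev --property > /tmp/udev-events.log 2>&1 &
monpid=$!
# ... run the sfdisk / blockdev --rereadpt / udevadm settle sequence here ...
kill "$monpid"
```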
smoser | yeah... given apw's assessment, i think maybe we're in a different path than i originally thought. | 20:22 |
rbasak | <apw> that is that that has been queued to all listeners | 20:23 |
rbasak | apw: does that definitely mean that it's visible to udev in userspace by that point? | 20:23 |
apw | rbasak, to my understanding of the netlink code yes | 20:24 |
rbasak | I know that's what you're saying; just want to eliminate the possibility of there being some other queue in kernelspace in the way | 20:24 |
rbasak | OK, thanks. | 20:24 |
rbasak | Then I wonder if there's a race in udev between reading that and handling "settle". | 20:24 |
apw | rbasak, there may be, but it is at least claiming to handle the proposed race | 20:25 |
rbasak | Understood | 20:25 |
apw | rbasak, but i also don't think we have any proof the right thing was actually done yet ... ie that they do appear | 20:25 |
apw | (the events) | 20:25 |
rbasak | I am curious enough to dig into udev's source, but I'm busy this evening | 20:26 |
* rbasak should go | 20:26 | |
smoser | rbasak, apw http://paste.ubuntu.com/11119254/ | 20:54 |
smoser | i can get that to fail. | 20:55 |
smoser | like: | 20:55 |
smoser | BLKRRPART: Device or resource busy | 20:55 |
smoser | waitfor after partition2 failed | 20:55 |
smoser | i think that the script is doing all sane things. | 20:55 |
apw | what says BLKRRPART: De... | 20:56 |
smoser | blockdev | 20:58 |
smoser | i think | 20:58 |
smoser | but i can patch to make sure | 20:58 |
apw | so the wait is bound to fail, as you didn't actually do the partition reload | 20:58 |
apw | which might indicate something has one of the partitions open | 20:59 |
smoser | tomorrow. | 20:59 |
smoser | ? | 20:59 |
apw | if the blockdev failed, then it didn't change anything | 20:59 |
apw | and didn't emit anything to wait for | 20:59 |
smoser | sorry. i have to run. i'll look more tomorrow. | 21:00 |
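Given apw's point that a failed BLKRRPART ioctl emits no event for `udevadm settle` to wait for, one defensive sketch is to retry the re-read until it succeeds before settling (disk name illustrative):

```
disk=/dev/sda
for i in 1 2 3 4 5; do
    blockdev --rereadpt "$disk" && break   # EBUSY if something still holds a partition open
    echo "rereadpt busy, retrying ($i)" >&2
    sleep 1
done
udevadm settle
```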
nessita | jsalisbury, hi, quick question, in the audio bug you mention kernel /v4.1-rc3-vivid/ but I only see v4.1-rc2-vivid in http://kernel.ubuntu.com/~kernel-ppa/mainline/ | 21:08 |
nessita | jsalisbury, shall I try v4.1-rc2-vivid or v4.1-rc3-unstable | 21:08 |
jsalisbury | nessita, I would suggest v4.1-rc3 | 21:58 |
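Testing a mainline build like the one jsalisbury mentions usually means downloading the .deb files from the kernel-ppa mainline directory and installing them with dpkg — the directory and file patterns below are illustrative, not exact:

```
wget -r -l1 -nd -A 'linux-headers-*_all.deb,linux-*generic*amd64.deb' \
    http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-rc3-unstable/
sudo dpkg -i linux-headers-*.deb linux-image-*.deb
```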