[10:48] Guys, ubiquity is freezing at 96% while removing gparted
[10:49] what could have broken hardy netboot a week ago? Now I get "No root file system defined" error, but the preseed files haven't been changed..
[10:49] last successful installation was on Mon 14th
[10:50] partman says "no matching physical volumes found"
[10:51] and then "no volume groups found"
[10:52] davmor2: likely bug 251223
[10:52] tjaalton: the log bits you've quoted so far are irrelevant; it would be much better to get the full logs
[10:53] checks
[10:53] davmor2: can you try with union=unionfs on the command line?
[10:53] cjwatson: ok, a sec
[10:55] cjwatson: What sorry?
[10:55] cjwatson: http://users.tkk.fi/~tjaalton/foo/syslog
[10:55] cjwatson: and /partman
[10:55] oops
[10:55] davmor2: was that unclear?
[10:56] permissions fixed
[10:56] I just type union=unionfs in the command line or add it to the ubiquity command or what?
[10:57] *kernel* command line, sorry
[10:57] as in, press F6 at the CD boot loader and add union=unionfs to the end
[10:57] cjwatson: okay no probs 2 ticks
[10:57] if that works, I am going to point and laugh at all the people who told me vociferously that aufs was the bee's knees
[10:58] tjaalton: hmm, shame I can't see what it's 404ing on
[10:58] cjwatson: those should be harmless, visible on a valid installation too
[10:59] given that there are no other relevant errors ...
[10:59] can I have the preseed file too?
[10:59] blink, where did nic-restricted-firmware-2.6.24-19-generic-di go?
[10:59] http://pastebin.ubuntu.com/29907/
[11:00] that's from an old installation
[11:00] ah, never mind, there it is
[11:02] http://users.tkk.fi/~tjaalton/foo/preseed-log
[11:03] cjwatson: drops into initramfs busybox
[11:03] cjwatson: would it help to have a ubiquity --debug report?
[11:04] cjwatson: is there a way to see what packages have been accepted to -updates after Mon 14th?
[11:05] davmor2: no, it isn't a ubiquity problem
[11:06] davmor2: the kernel is crashing
[11:06] :(
[11:06] davmor2: mdz already got debug logs - I can't guarantee it's at the same point but it sounds like the same thing
[11:06] davmor2: standard advice: if the whole system freezes, it ain't ubiquity's fault
[11:06] tjaalton: err, not easily
[11:06] it ain't ubiquity then :)
[11:07] tjaalton: give me a while, I'm doing several things at once here and my attention is fractured
[11:07] cjwatson: ok :)
[11:07] tjaalton: a log with DEBCONF_DEBUG=developer on the kernel command line would probably help
[11:08] cjwatson: ok, trying that
[11:08] tjaalton: partman.recipe.hardy would be useful too
[11:09] cjwatson: added
[11:19] cjwatson: http://users.tkk.fi/~tjaalton/foo/syslog-debug
[11:35] tjaalton: one thing I notice is that lvmok{ } should be $lvmok{ }
[11:36] cjwatson: ok, I'll try changing that
[11:37] argh, I think somebody did what I told them not to do with hardy-proposed
[11:37] oh, hm, no, -updates. WTF?
[11:39] cjwatson: adding $ didn't help :/
[11:39] yeah, I think what's happened is that linux got removed from hardy-updates for some reason, and is only in -security
[11:39] ah...
[11:39] theoretically that ought not to break things, but ...
[11:41] hmm, I'll check our mirror
[11:41] it's not the fault of your mirror
[11:41] (I don't think)
[11:43] maybe because we didn't mirror hardy-security main/debian-installer because it failed to mirror before (using apt-mirror)
[11:43] Please I could use some help, I've finally succeeded in getting my local net install with a preseed option, using hardy-proposed installer with the alternate CD(i386 btw).
[11:43] Unfortunately after reboot the keyboard comes up with diamonds and I can't login.
[11:45] hmm, mirroring those didn't help
[11:47] mons88: ok, I need to see your preseed file
[11:47] mons88: (remove any passwords from it)
[11:48] (sorry, I think we had this conversation before but as you could probably tell I wasn't very awake then)
[11:48] cjwatson: I was poised to test unionfs=unionfs when I saw your note that it's gone :-/
[11:49] cjwatson: had words with the kernel team yet?
[11:49] sure, hope you're feeling better, just a few moments while I connect remotely and I'll paste the file online
[11:49] mdz: I haven't been able to raise them
[11:50] mdz: ogasawara and slangasek brought up the bug last night; the only response as far as I could see was rtg saying that BenC should look at it
[11:50] cjwatson: I made sure pgraner was aware of it yesterday as well
[11:52] * cjwatson has not typically had good experiences with blocking kernel bugs discovered the day before a scheduled release
[11:56] cjwatson: so, do you have an idea what's going on with the archive, and do you need anything from me?-)
[11:57] I think I've fixed it
[11:58] but you won't be able to tell for at least an hour (assuming you're using the master archive not a mirror)
[11:59] cjwatson: ok, I've also noticed that I didn't mirror the amd64 udebs from -security, so I'll try once again
[11:59] just to be sure
[12:01] that certainly wouldn't help
[12:01] but anyway, I've copied linux back to -updates (I believe it was removed "temporarily" to hack around an overrides problem in -security about a week ago, which fits), so give it an hour
[12:01] right.. mirroring those bits did help
[12:02] so, maybe have stub files for debian-installer also in -security in the future?
[12:03] they're already there!
[12:03] no need for them to be stubs, they have actual content
[12:04] now they have, but not when I started mirroring hardy ;)
[12:05] if I'd mirror intrepid now (with apt-mirror), it would fail because intrepid-security does not have main/debian-installer
[12:06] hmm, yeah, there are placeholders for main/binary-i386 but not main/debian-installer/binary-i386
[12:06] anyway, thanks for tracking it down! I can go back on vacation..
[12:06] feel free to file a Soyuz bug
[12:06] (I don't control that stuff directly)
[12:06] ok, I'll do
[12:08] cjwatson: posted my preseed @ http://pastebin.com/m4f4bece8, cheers
[12:10] cjwatson: bug 251454
[12:11] hm, no ubotu
[12:11] is that the same/similar to bug 188492
[12:12] mons88: mine? no: https://bugs.edge.launchpad.net/soyuz/+bug/251454
[12:12] no
[12:13] #
[12:13] d-i partman-auto/disk string /dev/?da
[12:13] WTF
[12:13] if that works it's by dumb luck
[12:13] d-i preseed/early_command string anna-install common-console
[12:13] remove that; you've spelled it wrong (it would be console-common) and even if that worked you don't want it
[12:14] d-i debian-installer/consoledisplay string kbd=lat0-sun16(utf8)
[12:14] I would advise removing that, though I don't know if it will have negative effects
[12:14] mons88: can you boot in recovery mode from the grub menu, and fish out /etc/default/console-setup for me?
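(Editorial sketch, not part of the log.) The preseed review above turns on the four-field line format `owner question type value`. A minimal Python sketch of a checker that splits such lines and flags the two problems pointed out in the review might look like this. The function names and warning texts are hypothetical, and this is not a debian-installer tool; it only encodes what was said in the channel.

```python
def parse_preseed_line(line):
    """Split one preseed line into (owner, question, qtype, value).

    Returns None for blank lines and comments.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split(None, 3)
    if len(parts) < 3:
        raise ValueError("not a preseed line: %r" % line)
    owner, question, qtype = parts[:3]
    value = parts[3] if len(parts) == 4 else ""
    return owner, question, qtype, value


def review_preseed(lines):
    """Return (question, warning) pairs for the issues raised in the review.

    Hypothetical checks, limited to the two remarks made in the channel.
    """
    warnings = []
    for line in lines:
        parsed = parse_preseed_line(line)
        if parsed is None:
            continue
        _owner, question, _qtype, value = parsed
        # A shell-style wildcard in the disk path was called "dumb luck" if it works.
        if question == "partman-auto/disk" and any(c in value for c in "?*["):
            warnings.append((question, "wildcard in disk path"))
        # The package name was misspelled; the real name is console-common.
        if question == "preseed/early_command" and "common-console" in value:
            warnings.append((question, "misspelled package: console-common"))
    return warnings


if __name__ == "__main__":
    sample = [
        "#",
        "d-i partman-auto/disk string /dev/?da",
        "d-i preseed/early_command string anna-install common-console",
    ]
    for question, warning in review_preseed(sample):
        print(question + ": " + warning)
```

Splitting with a maximum of three splits keeps values that contain spaces, such as the `early_command` string, intact in the fourth field.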
[12:16] okay but i had the issue before adding common-console, string kbd=lat0-sun16(utf8) comes from http://www.ubuntu-forum.net/showthread.php?p=5120670 which apparently fixes it
[12:16] I'll try recovery mode now
[12:19] yeeeeees, sort of cargo-cult debugging going on in that forums post
[12:19] namely, applying solutions from four releases back
[12:22] root password at maintenance prompt isn't accepted :(
[12:22] recovery mode shouldn't involve a maintenance prompt, IIRC
[12:23] well, failing that, select your normal boot prompt, press 'e', add 'init=/bin/sh' to the end, and boot
[12:23] hmm, yeah, recovery mode of course *does* involve a maintenance prompt if you set a root password, sorry
[12:27] still seeing diamonds
[12:34] looks like going to a later version of hardy-proposed isn't a good idea, Bug 251344 !
[12:53] would it be an idea to try a later, or much earlier version of the alternate iso?
[12:55] mons88: please be patient - I'm investigating an intrepid alpha 3 blocker at the moment
[12:55] mons88: I will get back to you later on
[12:55] but I can't do everything at once
[12:56] cool, npbs
[12:56] mons88: you mention hardy-proposed - does that mean you're using it?
=== davmor2 is now known as davmor2_lunch
[12:57] yep, kept getting an error about downloading a file from the mirror with the normal hardy net install stuff, so hardy-proposed worked beautifully
[12:57] mons88: that should have been fixed some time ago
[12:57] mons88: grab the new net install image from /dists/hardy/ (we copied it over) and stop using -proposed
[12:58] worth a try anyway
[12:58] okay will do
[13:02] mdz: ok, you were right, if I tell ubiquity not to remove gparted then the same thing just happens in libntfs10.postrm
[13:03] cjwatson: is that the very next one?
[13:04] not quite, it got through jfsutils and ntfsprogs, both of which I think were later
[13:05] those sound familiar, I think they might have been first
[13:07] I'd have expected libntfs10 to be removed right after ntfsprogs, I think
[13:09] libntfs10 has a postrm, the others don't (nor does lupin-casper), but casper does
[13:09] so it seems to be consistently the second postrm, so far
[13:15] (how many times does hw-detect need to be run?)
[13:16] I thought it was only run once
[13:31] cjwatson: I've been stepping ubiquity along and it seems to be run more than once
[13:31] not worrying about it right now though
[13:34] * cjwatson tries commenting out remove_extras
=== davmor2_lunch is now known as davmor2
[13:40] mdz: not that I'm one to talk, but your screen is filthy
[13:40] http://people.ubuntu.com/~mdz/251223/P7230016.JPG
[13:42] none of the photos look anything like root causes though
[13:42] aha, you have a call trace that I don't, I missed that
[13:42] or at least one with symbols
[13:44] I'm very confused as to why mtd is involved
[13:44] cjwatson: it looks fine in normal lighting; the flash reflects off the dust
[13:44] cjwatson: me too
[13:44] maybe that is a red herring
[13:45] it's almost as if there's a superblock refcounting bug in aufs
[13:45] sys_getcwd is something I can imagine the shell doing randomly
[13:47] cjwatson: it's not hw-detect that gets run multiple times, it's update-dev
[13:47] called by both clock-setup and hw-detect
[13:47] and I think other places, judging by how much hal spam I see in syslog
[13:47] cjwatson: there is a *lot* of mounting and unmounting which happens under /target
[13:48] update-dev is needed as a checkpoint in a few places to wait for devices to appear
[13:48] /proc and /sys are mounted while doing certain things, so that's expected
[13:48] it's much simpler to just mount/umount as needed rather than try to keep track of when they're needed
[13:48] cjwatson: yes, I'm just pointing out that the bug could be tickled by all that mount churn
[13:49] (IIRC, there was some udeb code that expected those filesystems to be unmounted)
[13:50] commenting out remove_extras avoids the problem (though of course creates others)
[13:51] mdz: just out of interest, how large is your disk image?
[13:51] I just noticed mine is full
[13:51] cjwatson: 6GB
[13:51] ah, mine's 2.5, so not that
[13:52] (I'm low on disk at the moment)
[13:52] 2.3G used
[13:52] cjwatson: I have an strace -f of the dpkg process
[13:52] http://people.ubuntu.com/~mdz/251223/strace.gz
[13:53] doesn't get very far into the postrm, does it?
[13:54] to put it mildly
[13:54] cjwatson: oh, dpkg is running outside of the chroot. I didn't realize that
[13:54] yes, with --root
[13:57] mdz: where is the sleep 36000 || ... bit coming from in casper.postrm? did you add that?
[13:57] cjwatson: yes
[13:57] easier than trying to catch it at the right time
[13:57] pgraner: hi
[13:58] mdz: hey
[14:00] pgraner: I've updated the bug in LP with the latest
[14:01] mdz: ok, just got benc he will be here shortly.
[14:01] I have to confess that I'm not getting very far, beyond the observation that if I stop ubiquity from removing packages from /target then the bug goes away
[14:04] If I build a unionfs module for someone, will they be able to test this (just woke up, no test rig)?
[14:04] yes
[14:04] if I get the source and some instructions on what I need to build-dep on, I could even put it in casper temporarily ;-)
[14:05] if that turns out to fix it, I'm going to point and laugh so hard at the people who were going ooh-shiny at aufs
[14:06] heh
[14:06] cjwatson: And I'm gonna smack the people that told me it worked
[14:07] cjwatson: 32 or 64-bit?
[14:07] BenC: and what do we do with the people who took the old-working solution away before the new-shiny one was tested? ;-)
[14:07] BenC: 32
[14:08] * unionfs is now known as Agent K
[14:08] mdz: my fear is I took the old solution away because it didn't compile any more and didn't seem worth the effort :(
[14:09] oh dear
[14:16] cjwatson: I'm able to reproduce the bug by booting the live CD, mounting /target and running dpkg --root=...
[14:16] cjwatson: chrooting dpkg instead seems to run fine
[14:16] blink, how is that different as far as the kernel is concerned?
[14:17] cjwatson: one chroot() call instead of a bunch of forks and chroots?
[14:17] ubiquity doesn't actually have much choice here; it needs to use python-apt, which can only really sanely be done in-process, and apt doesn't seem to have an option to chroot the whole of dpkg
[14:18] err, sure, but none of those should have a stateful effect on the fs layer
[14:19] I think unionfs is probably our best hope for a workaround; I'm trying to root cause
[14:20] One thing I'm worried about is that this may be caused by the apparmor stuff
[14:20] anyone tried booting with apparmor disabled?
[14:20] BenC: give me the runes
[14:21] (not really apparmor, but the compatibility parts for the VFS)
[14:21] BenC: I saw one trace which had aa_* in it, but it was after the initial BUG()
[14:21] BenC: my hypothesis above was that it was a refcounting bug in a filesystem somewhere
[14:21] mdz: apparmor=0
[14:21] BenC: I can't think of any other reason why kill_sb would be called from a sys_getcwd path
[14:22] I'm trying apparmor=0 now
[14:22] cjwatson: right, but that may be a bug in the parts that apparmor had to add in order to pass around the mount-point (changes we had to make to unionfs and other filesystems)
[14:22] ah, interesting, ok
[14:22] cjwatson: yeah, that does sound like a put() without a proper get() before it
[14:23] where would I get the apparmor patch to look at while I'm waiting?
[14:23] cjwatson: in git, there's one commit for the plumbing to add apparmor
[14:24] e103a4e81552fc5fea7c21a1d34cabc23bc938cc?
[14:24] UBUNTU: SAUCE: [AppArmor] aufs patches
[14:25] no, maybe not
[14:25] ah, I think that was the only part in lum :)
[14:25] but there is one that touches the rest of the VFS
[14:25] looks like 2ec8408decf6d72c346247e0c437080e9f628fe4
[14:25] 9000+ lines, yum
[14:29] OH!
[14:29] I bet I know what is causing this
[14:29] ?
[14:30] iget() was removed from upstream, and I had to fix up aufs to compile with the new API
[14:30] I bet I did it wrong...there were limited examples of how the other filesystems coped with the change
[14:30] still trying to get unionfs to compile
[14:47] BenC: apparmor=0 doesn't seem to make any difference
[14:51] BenC: anything else I can try yet
[14:51] ?
[14:55] "double fault"
[14:55] that's a new one
[14:55] cjwatson: Nothing I can think of off-hand
[14:55] unionfs is proving to be non-trivial to compile against 2.6.26
[14:56] is 30d866ec206ff71b5f2b626ed3442bd2547da524 the iget thing you were talking about? that's squashfs rather than aufs though
[14:58] my test case is down to "chroot /target /var/lib/dpkg/info/gparted.postrm remove"
[14:59] chroot /target bash
[15:00] this keeps getting weirder
[15:00] cjwatson: I'm pretty sure a whole crapload of stuff has run in the chroot before this point in the installation
[15:00] but even trying to start a shell blows up
[15:00] when are you running your test case?
[15:01] cjwatson: single-user mode on the live cd
[15:01] gosh
[15:01] chroot /target dash?
[15:01] mkdir /target && mount /target && chroot /target bash
[15:01] blows up
[15:01] getting there
[15:01] bearing in mind that bash isn't the default shell
[15:01] cjwatson: chroot /target /bin/true crashes
[15:03] this suggests that that filesystem has been subtly buggered by something just beforehand
[15:06] I can't reproduce this
[15:06] well, not from a desktop on the live CD anyway
[15:07] cjwatson: I created a fresh ext3 filesystem, rsync'd the contents of /rofs into it, shut down, rebooted, mounted it, and reproduced
[15:07] ah, I was working from the previous crashed fs
[15:07] https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/251223/comments/7
[15:07] BenC: ^^ ever seen one like that?
[15:07] I'll try to duplicate, by way of the scientific method
[15:09] cjwatson: you can have my 2.3GB qemu image if you think it would help, but I'm pretty sure it's clean
[15:09] Has this been seen on real hardware as well?
[15:09] I'm sure it won't take long to set up a duplicate
[15:09] * soren crosses fingers
[15:09] almost certainly quicker than downloading 2.3GB
[15:09] davmor2: are you using real hardware or a kvm?
[15:10] soren: yes, that's where I first saw it
[15:10] mdz: Oh, "good".
[15:11] There have been a few people who claimed to have seen data corruption of some sort with kvm 70, but no one could provide any detail and kvm upstream have been unable to reproduce.. I'm glad that's not it.
[15:11] cjwatson: interestingly, chroot /rofs /bin/true works
[15:11] suggests an ext3 bug, which is a bit terrifying
[15:12] mdz: IIRC, double fault means a fault in a fault
[15:12] looks like register corruption kept it from printing the first oops properly
[15:12] and caused another fault
[15:12] Yes, that's indeed what a double fault is.
[15:12] * BenC loves re-entrant crashing
[15:14] what's a statically-linked binary in the base system?
[15:14] "The system was unable to crash properly. Please reboot soon to avoid losing any other crashes"
[15:14] ah, ld_static
[15:14] ld_static -> not base system (though it might be there anyway)
[15:15] aha
[15:15] chroot /target /bin/ld_static -> OK
[15:15] chroot /target /bin/true -> crash
[15:16] chroot /target /sbin/ldconfig ?
[15:16] in fact, perhaps chroot /target /sbin/ldconfig.real
[15:16] rebooting
[15:18] mdz: Hi, just got your Email. What is the problem?
[15:18] * cjwatson goes to stretch back after too long hunched over the laptop
[15:19] cking: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/251223 is the problem
[15:19] cking: bug 251223
[15:19] you might want to check scrollback of this channel on irclogs.ubuntu.com
[15:19] I've got one more file to forward port in unionfs
[15:20] cking: this bug is blocking alpha 3
[15:20] cking: so we can attack this from both sides, could you take this bug from the perspective of finding out what is broken in aufs?
[15:20] cjwatson: ldconfig causes a crash
[15:20] cking: most likely it has to do with the apparmor patches to aufs, but there's an outside chance it is just busted
[15:21] cjwatson: was aufs not being used in Alpha 2?
[15:21] cking: if you can give me a kernel with that patch reverted, I'll test
[15:23] mdz: we can't revert the patch
[15:23] it won't compile without them
[15:23] BenC: what won't?
[15:23] unless you get a kernel that has no apparmor patch as well, but then we wouldn't know if it was aufs or apparmor
[15:23] mdz: aufs
[15:23] BenC: right, revert apparmor as well
[15:24] cjwatson: ldconfig.real runs with no problem
[15:24] I wonder what's different between this ext3 mount point and the squashfs mount point which has exactly the same files in it
[15:25] * cking is trying to extricate himself from a meeting and find somewhere to park himself
[15:25] chroot /target true breaks, chroot /rofs true works
[15:26] cking: if you're unavailable, don't sweat it, I only nagged you because I thought you were the only one around
[15:26] I had just seen your trip report and thought you might have some time to help
[15:27] mdz: I'm in the middle of a power savings talk - it's packed here
[15:27] aufs/unionfs> mdz's test case exercises neither of those, surely?
[15:28] mdz: ldconfig.real> and then does anything work *after* that? I was wondering if building the ld cache would be enough to fix dynamically linked binaries
[15:29] cjwatson: aufs is involved
[15:29] This is going to be difficult to sort out from here, the network connection is slow and my home box is turned off at the moment.
[15:29] my root filesystem is aufs
[15:29] because I'm booted from the cd
[15:29] true
[15:29] seems ever more tenuous though
[15:31] mounting a tmpfs, copying a minimal chroot into it, and chrooting true doesn't break
[15:34] mdz: I can't make a similar setup crash for me
[15:34] cjwatson: how did you create the fs?
[15:35] booted kvm with textonly as a kernel argument and a blank -hda (dd), parted to create a disklabel and partition, sudo mke2fs -j /dev/sda1, sudo mkdir /target, sudo mount /dev/sda1 /target, sudo rsync -av /rofs/ /target/
[15:35] cjwatson: except for creating the partition table with fdisk, that's exactly what I did
[15:39] I almost have unionfs done
[15:39] all the 2.6.26 VFS changes plus apparmor...it's a wonder there aren't more bugs
[15:39] BenC: by the way, what happened to replacing unionfs by aufs? did that happen?
[15:40] cr3: yes
[15:42] cjwatson: http://kernel.ubuntu.com/~bcollins/unionfs.ko
[15:42] cjwatson: that's 32-bit 2.6.26-4-generic
[15:43] cjwatson: if it works, I have the source, and can tell you how to build-dep/compile it
[15:44] soren: as a side note, I have a -virtual package for you to test
[15:44] Woo!
[15:45] soren: 9.1M .deb, 34Meg unpacked
[15:45] soren: I had to mess with your list a little, and module dep's pulled in some extras
[15:45] cr3: we conjecture that that may be the problem :-P
[15:46] BenC: Yes, I kind of expected that. That's fine.
[15:46] BenC: 404
[15:46] BenC: Did you see my question over in #ubuntu-kernel, by the way? About the virtio modules?
[15:46] cjwatson: try now
[15:46] * BenC needed to hit enter still
[15:52] BenC: sorry, crashes at boot
[15:52] I'll try to get you a trace after this call
[15:52] damnit
[15:53] it's a null deref in unionfs_interpose
[15:53] that helps...
[15:53] +0x16f
[15:53] I did an API change there...let me check it
[15:53] err, +0x16f/0x420
[15:54] claims to be in the mount process
[16:00] cjwatson: I see what happened...I didn't do all the right changes to drop read_inode()
[16:00] give me a couple minutes
[16:04] ok
[16:12] cjwatson: new module ready (same place)
[16:17] BenC: not quite sure, but it seems to have hung
[16:18] * BenC isn't quite sure what that means
[16:19] Ah!...found one more thing
[16:19] well, nor am I, all I know is it ain't doing anything
[16:23] cjwatson: new module up
[16:26] it's at least booting now
[16:26] * BenC crosses fingers
[16:27] cjwatson: to be honest, I looked at the aufs code, and I can't see how anyone thinks it's somehow better than unionfs
[16:27] but then, I'm looking at unionfs 1.4 code, not the unionfs 2.3.x-need-to-patch-half-the-vfs patches
[16:27] plus unionfs being in mm says a lot more about it :)
[16:42] installing
[16:43] cjwatson: did the test case pass?
[16:43] I can't reproduce mdz's little test case, but I'm trying the big one (full installation)
[16:43] it's at 64%
[16:43] failed around 96% before
[16:43] ok....good to know that the test passed at least
[16:43] cjwatson: I didn't know about textonly and used single; I'm baffled as to why it didn't happen for you
[16:43] means we're probably on the right track
[16:44] BenC: err, what test? :-)
[16:44] no test has passed yet, short of booting
[16:44] never mind, I misunderstood
[16:52] BenC: same problem with unionfs
[16:53] (unionfs:generic_shutdown_super is in the call trace so I'm sure)
[16:55] cjwatson: can I see a full trace?
[16:55] at least with unionfs I can do a better job of tracing the code and finding out what's wrong
[16:56] And I know it at least worked at some point (in hardy)
[16:56] cjwatson: it turns out I had inadvertently used the i386 squashfs rather than the amd64 one
[16:56] it is still completely bizarre that it caused those crashes though
[16:57] BenC: erm, if you have really good eyesight, http://people.ubuntu.com/~cjwatson/tmp/251223.png
[16:57] BenC: sorry for the font but I was trying to guarantee that the dump plus context would fit on the screen
[16:57] cjwatson: I think I see the problem...but let me check the trace
[17:01] sysfs: Use kill_anon_super
[17:01]
[17:01] Since sysfs no longer stores fs directory information in the dcache
[17:01] on a permanent basis kill_litter_super it is inappropriate and actively
[17:01] wrong. It will decrement the count on all dentries left in the
[17:01] dcache before trying to free them.
[17:01]
[17:01] At the moment this is not biting us only because we never unmount sysfs.
[17:02] that sounds just like our unionfs problem
[17:02] yes, it does rather
[17:02] going to try that for unionfs
[17:02] is that patch in our tree
[17:02] ?
[17:02] because note that we *do* unmount sysfs
[17:02] it is in intrepid
[17:03] it's from Aug 20, 2007
[17:03] so was in hardy too
[17:03] and probably gutsy
[17:03] (not bringing its total mount count down to zero, but nevertheless we mount /target/sys using 'chroot /target mount -t sysfs sysfs sys' rather than as a bind-mount, and then unmount it)
[17:03] cjwatson: new module ready
[17:03] ok
[17:04] cjwatson: that patch probably coincided with sysfs not using dentry cache anymore, so never affected us
[17:04] or maybe it only affected something like 2.6.2x where X is an odd number that we didn't use :)
[17:05] or it broke mysteriously but we never got to the bottom of the problem :)
[17:05] * cjwatson <- cynical
[17:07] installing
[17:08] cjwatson: Not taking the mount count to zero probably meant it didn't affect us either...since kill_sb() wouldn't be called in that case
[17:08] ok, I wasn't sure whether repeated mounts were equivalent to bind-mounts for that purposes
[17:08] s/s$//
[17:24] cjwatson: so far so good?
[17:24] sorry dude, exact same thing
[17:25] fuck
[17:26] it can't be the same backtrace though
[17:26] cjwatson: can you repost the trace?
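(Editorial sketch, not part of the log.) The refcounting hypothesis discussed above ("a put() without a proper get() before it", and teardown only happening once a count reaches zero) can be illustrated with a toy model in Python. This is not kernel code: the `Counted` class and its callback are hypothetical names, and the model only shows why one unbalanced put() can release an object that another holder still believes it owns.

```python
class Counted:
    """Toy reference-counted object; release runs when the count hits zero."""

    def __init__(self, name, on_release):
        self.name = name
        self.count = 1          # the creator holds the first reference
        self.released = False
        self.on_release = on_release

    def get(self):
        if self.released:
            raise RuntimeError("get() on released object " + self.name)
        self.count += 1

    def put(self):
        if self.released:
            raise RuntimeError("put() on released object " + self.name)
        self.count -= 1
        if self.count == 0:
            self.released = True
            self.on_release(self.name)


if __name__ == "__main__":
    torn_down = []

    # Balanced use: a temporary reference is taken and dropped again.
    good = Counted("balanced", torn_down.append)
    good.get()
    good.put()
    print(torn_down)            # nothing released yet

    # Unbalanced use: some code path drops a reference it never took.
    bad = Counted("unbalanced", torn_down.append)
    bad.put()                   # stray put(): count goes 1 -> 0, teardown runs
    print(torn_down)            # the owner has not let go, yet it is gone
```

In the toy model the owner's own later put() fails because the object is already released, which is the same shape of symptom as a teardown path being reached from a call that should only have been reading state.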
[17:28] BenC: http://people.ubuntu.com/~cjwatson/tmp/251223-2.png
[17:28] cjwatson: so with a less weird test rig, I am still able to reproduce the bug with dpkg --root=/target --purge gparted
[17:29] though not with running random programs as with the 32-bit chroot
[17:29] perhaps that's a different bug
[17:29] I think I'm going to assume different
[17:31] cjwatson: hmm...this appears more and more like it's in fuse module, since that's the last module before generic_shutdown_super()
[17:33] I can't think of any fuse filesystems that would be mounted
[17:33] oh, there's $HOME/.gvfs I suppose
[17:34] cjwatson: I think that has to be unmounted prior to rootfs
[17:34] so it may be a unionfs bug that it allows unmounting when there's another fs still mounted under it, but we can work around that in scripting
[17:34] but nothing is being unmounted here
[17:35] this is getting triggered inside gparted.postrm, which does nothing except update-menus
[17:35] the fact that it's hitting unmount paths at all is the bug
[17:35] the crash says "unmount of rootfs rootfs"
[17:35] sorry, I should have said "nothing should be being unmounted here"
[17:35] this is why I was talking about refcounting bugs
[17:35] it's trying to unmount stuff unasked
[17:35] I don't think refcounting causes unmounting
[17:36] initially I thought that it wasn't really gparted.postrm causing it, but we have traces that confirm
[17:36] you can see dpkg reporting that gparted.postrm segfaulted in that last trace
[17:37] ok, let me look deeper
[17:37] it definitely shouldn't have got as far as unmounting /target yet, much less rootfs
[17:45] BenC: yes, I was wondering where that message comes from
[17:45] cjwatson: can you see whether you get that "unmount of rootfs rootfs" in your test?
[17:45] I do
[17:45] it scrolls off too fast for me
[17:45] you can see it in the screenshots I've posted
[17:46] it's been entirely consistent
[17:50] so what is the rootfs in this case?
[17:50] I mean, what filesystem?
[17:50] is it squashfs?
[17:50] does rootfs mean absolute root (initramfs), or whatever we pivoted to, or the current process' root?
[17:51] I thought rootfs generally meant the initramfs
[17:51] hard to say...what are the possibilities at the point of failure?
[17:51] it wouldn't be initramfs anymore, since we've pivoted by now, right?
[17:52] if you look at /proc/mounts, the top item is still labelled rootfs and no others are, and that top item is the initramfs
[17:52] the other possibilities are the root that practically everything else thinks we're using, which is a unionfs composed of squashfs+ext3; and the process that dies happens to be chrooted into an ext3 filesystem
[17:53] I'd be surprised if the kernel referred to anything other than the absolute root initramfs as "rootfs", though - anything else I'd've thought would be "/" at most
[17:53] I wonder if it's squashfs
[17:54] with so many filesystems layered, and the fact that this is a bubble up call to destroy the sb, It's hard to say where it happens
[17:54] indirection might make a problem anywhere in the layering cause this
[17:54] it would probably be a good idea if you got yourself a test rig :)
[17:54] I'm going to be having myself an evening in the not too distant future
[17:55] what's the most simple test case at this point?
[17:57] cjwatson: can we just not remove gparted and see if that lets things finish? :)
[17:57] BenC: I tried telling ubiquity not to dpkg --remove gparted, and it failed on a different package instead
[17:57] it's surprising to me that it's gparted and nothing else (other things surely have to be calling getcwd(), right?)
[17:58] ah
[17:58] libntfs10.postrm
[17:58] gparted is just the unlucky first one
[17:58] we *could* have ubiquity not remove any packages at all, but that would probably be bad
[17:58] it would leave stuff like casper and ubiquity on the target system
[17:58] simplest in the sense of simplest-to-construct (not quickest) is to do an installation with all the defaults
[17:59] a kvm and use-the-whole-disk is sufficient
[17:59] I have a more minimal test case now
[17:59] mdz reckons that creating a filesystem mounted on /target, then rsync -a /rofs/ /target/, then dpkg --root=/target --remove gparted will do the job
[18:00] with a reboot in between the rsync and the dpkg?
[18:00] I was always running dpkg on a fresh boot
[18:00] BenC: I have a test case which only requires a minimal chroot
[18:00] can you reproduce it *outside* the live CD?
[18:01] cjwatson: haven't tried, I doubt it
[18:01] it's just chroot+exec
[18:01] a few times
[18:01] so what's the current minimal test case and I'll brave it on my laptop?
[18:03] cjwatson: is the postrm doing anything important? Maybe we could delete the postrm files of the packages being removed?
[18:03] quick link to an ISO?
[18:03] cjwatson: http://people.ubuntu.com/~mdz/251223/test.c
[18:03] http://cdimage.ubuntu.com/daily-live/current/intrepid-desktop-i386.iso
[18:04] BenC: not much in the cases I've looked at so far, but I'm not very comfortable with that
[18:05] mdz: are you running this on the livecd?
[18:05] BenC: yes
[18:05] continuing to try to minimize it
[18:06] BenC: /mnt needs to have a chroot with a working shell in it
[18:06] going to take me an hour to download this ISO (should have started earlier
[18:06] )
[18:07] execing /bin/sh -c '' triggers the bug, /bin/true doesn't
[18:08] mdz: that crashed my laptop, although only after a lot of iterations
[18:08] bin/sh would do a getcwd
[18:08] several hundred
[18:09] cjwatson: the current version at the same URL does only 3
[18:09] cjwatson: and that crashes 100% reliably for me
[18:09] cjwatson: inside or outside of the live cd environment?
[18:09] in a regular intrepid system
[18:09] oohhh
[18:09] cjwatson: really?
[18:09] that's interesting
[18:09] I just mounted some random ext3 fs I had lying around
[18:09] (I didn't want to introduce further variables with a loop-mount)
[18:10] that will speed up my test cycles tremendously
[18:10] * cjwatson sticks a counter in
[18:13] this time round it hung after 8 iterations
[18:14] cjwatson: if I add a chdir("/") after the chroot, it stops crashing
[18:14] now we're getting somewhere
[18:15] cjwatson: did you do a /proc or /sys mount (or any mounts) on the ext3 fs?
[18:15] BenC: no
[18:16] mdz: could you put http://people.ubuntu.com/~mdz/251223/strace.gz back up?
[18:16] cjwatson: it's still there
[18:17] BenC: basically the same trace except no call trace in syslog
[18:17] mdz: oops, I transposed two digits
[18:17] and I note that dpkg does not chdir()
[18:18] yes, it survives 10000+ iterations for me with a chdir()
[18:18] ok
[18:19] I have a test case now which works without a chroot
[18:19] blink
[18:19] er, with a chroot(2)
[18:19] chdir rmdir?
[18:19] but with an empty chroot
[18:19] ah
[18:19] yay...I crashed
[18:19] so doesn't require any setup
[18:19] cjwatson: can you try the latest test.c?
[18:19] [71309.482997] [] path_put+0x31/0x40 [18:19] [71309.483019] [] sys_getcwd+0x101/0x160 [18:20] now we're getting somewhere [18:20] that is with 2.6.27+ubuntu too [18:20] wonder if I can try a stock kernel and make sure it's not an upstream bug [18:21] this one works without even mounting anything on the directory [18:22] mdz: yeah, fails [18:22] cjwatson: does it fail in 3 iterations for you, or do you need more like you did before? [18:22] weird...if I read the dmesg right [18:22] I'm not sure - the system didn't crash immediately so I thought it needed more, but then later on vi segfaulted and then my terminal ... [18:22] this failed on a regular chroot [18:22] not even a mounted fs [18:22] BenC: right, either works [18:22] interesting [18:23] BenC: this one works without even mounting anything on the directory [18:23] an empty dir will do [18:23] suggesting that it's just the fact of calling chroot() exec*() without an intervening chdir() [18:23] cjwatson: my latest test case doesn't even call exec() [18:23] just getcwd() [18:23] oh, heh [18:24] so getcwd likes to have a cwd [18:24] cjwatson: so we might get away with a workaround after all... [18:24] yay [18:24] dpkg would be more correct if it chdir()d [18:24] mdz: fancy doing your yearly upload? 
:) [18:24] let me try this on a stock kernel before we beat ourselves up [18:24] cjwatson: ;-) [18:24] hehe [18:24] I'll work up a dpkg patch and see [18:25] I suppose maintainer scripts hardly ever rely on relative paths [18:26] if they relied on a relative path outside of the chroot, they'd be darn buggy :-) [18:27] they can't rely on relative paths *at all* or they'd break here [18:27] but of course nothing relies on relative paths because dpkg doesn't chdir :) [18:28] not for running maintscripts anyway [18:28] I think I'll stop crashing my laptop now [18:30] patched dpkg works [18:30] woo [18:30] mdz: sweet [18:31] I'll look at this from the kernel side still [18:31] it's definitely a bug no matter how we work around it [18:31] indeed [18:31] so looks like all the fs stuff was a red herring :) [18:31] tested with --root=/mnt --purge gparted jfsutils lupin-casper ubiquity-casper ubiquity casper [18:31] now that alpha3 isn't blocking on the kernel, I can fix some coffee :) [18:32] cjwatson: yeah...yay for indirection and abstraction [18:32] BenC: is it relevant that the out: path of sys_getcwd does the puts in FIFO order rather than LIFO? [18:32] path_get(&pwd); path_get(&root); ... out: path_put(&pwd); path_put(&root); [18:34] cjwatson: http://people.ubuntu.com/~mdz/251223/dpkg.debdiff [18:35] mdz: looks good, nice touch for reusing an existing string [18:37] I'll investigate that side as well [18:37] cjwatson: will you do a build and confirm that it fixes things for you? [18:37] I'll fire up ubiquity and do a full test [18:38] well, I was going to confirm it by firing up ubiquity and doing a full test :) [18:41] how funny. X won't start in kvm anymore [18:42] amd64 or i386? [18:42] mdz: double-check that you remembered -m? :) [18:42] heh [18:44] cjwatson: that's what I thought, too, but -m 512 still doesn't work [18:44] evand: amd64 [18:46] which init script is it which switches usplash back to PULSATE at some random point in the boot sequence?
[18:48] mdz: I noticed that at some point too...went from progress to pulsate a couple times during boot up [18:48] Pretty sure it only happened on livecd [18:48] ah, perhaps you're running into this: https://bugs.launchpad.net/ubuntu/+source/kvm/+bug/251480 - though I can't imagine why it would suddenly trigger when it was working before. [18:48] IIRC, it was a hard 8.04.1 CD [18:48] *hardy [18:51] evand: except that i386 guests break as well now [18:52] test install with new dpkg running [18:55] ah [18:55] I'm leaving it to you, kvm hates me now [18:55] it survived only long enough to test dpkg manually [18:56] 51% [19:02] cjwatson: using aufs? [19:02] BenC: yes [19:02] evand: could you commit the 1.9.7 release? [19:02] (ubiquity) [19:05] upload it? sure. [19:05] I thought you had already uploaded it [19:05] actually, I meant push, not commit [19:05] nope, not yet [19:06] should I? [19:06] evand: oh, I'm sorry, I thought I'd seen a release commit, but I hadn't, so ignore me [19:06] ok [19:06] I thought it was just one of those committed-but-not-pushed things [19:06] you did for 1.9.6 :), I was in a rush to fix a build failure. [19:07] are you on intrepid now then?
:) [19:07] I'd been holding off updating the autotools since you weren't [19:07] I'm really hoping this kernel bug exists upstream [19:08] indeed I am [19:16] mdz: success [19:16] SHIP IT [19:17] hahaha [19:19] cjwatson: uploaded [19:20] mdz: excellent work on getting that workaround [19:20] I'm booting a stock kernel now to see if I can reproduce it [19:21] cjwatson: a few things changed in sys_getcwd() since hardy, but not the ordering of the puts [19:21] but it may just be one of those things that has gotten exposed through other changes [19:21] so I'll still give it a try [19:22] the VFS did have some locking changes, and quite a few API changes since hardy, so I suspect this is an upstream issue [19:24] BenC: thanks, felt good to get my hands a little dirty [19:25] oem-config: cjwatson * r492 oem-config/ (d-i/manifest debian/changelog): [19:25] oem-config: Automatic update of included source packages: console-setup 1.25ubuntu2, [19:25] oem-config: localechooser 2.03ubuntu1, user-setup 1.20ubuntu3. [19:28] mdz: you won't be the next to leave management for engineering, will you? ;) [19:40] oem-config: cjwatson * r493 oem-config/debian/changelog: releasing version 1.43 [20:00] 98 [20:00] evand, just wanted to see how long are you sticking around for? both tue and wed? [20:01] mario_limonciell: just tried calling you back ;) . Correct, my flight back is the night of the 30th. [20:01] evand, yeah i've got poor reception in doors here, but also have a hard time getting to IRC lately [20:01] evand, okay. i'll make sure you get a CC of the agenda we have together. [20:02] evand, do you have anything particularly you would like to bring up [20:02] and fit into a time slot for me to throw onto the agenda? [20:03] Where you stand with respect to automated installations, and anything you need from us there. [20:03] okay.
that should probably be fine clumped into the installer timeslot as it stands [20:03] indeed [20:03] okay thx [20:09] oem-config: cjwatson * r494 oem-config/debian/ (changelog oem-config.dirs): [20:09] oem-config: Create /var/lib/localechooser directory, otherwise localechooser [20:09] oem-config: completely breaks. [20:10] oem-config: cjwatson * r495 oem-config/ (configure configure.ac): bump to 1.44 [20:11] that sucks...it's not an upstream problem...stock kernel doesn't BUG() out even after letting this run a few thousand times [20:11] oem-config: cjwatson * r496 oem-config/debian/changelog: releasing version 1.44 [21:18] BenC: any luck tracking down the bug? [21:19] mdz: it's definitely apparmor changes to the VFS [21:19] there's some patches to d_path (and thusly sys_getcwd's usage of it) [21:19] BenC: have you sent it to apparmor upstream? [21:19] and they even talk about lazy unmounts [21:20] mdz: not yet, tracking it down a bit more [21:26] aha! [21:26] I think I found the bug [21:30] They had the same patch in hardy, but slightly different...in hardy sys_getcwd passed fail_on_deleted to __d_path() but in the intrepid patches it doesn't [21:30] and that's from upstream svn (not something we goofed up) [21:30] Hopefully this recompile shows it fixes the bug [21:45] is this the best channel for a usplash question? [22:10] kirkland: not really, -devel is fine [22:37] I have a question regarding apt-ftparchive.....so the way i understand it, it takes a pool as input and generates the packages index file in the dist folder. But the pool is only divided along components (main, universe etc.)... how does apt-ftparchive know the distribution (hardy, feisty etc.) to generate the packages file in their respective subloders under dist ? [22:37] *subloders =subfolders [22:46] cjwatson: confirmed successful installation using dpkg from the archive [22:50] i have dvdshrink installed but i don't know how to use it cuz it's installed by the terminal. can anyone help me out?