[02:17] http://lwn.net/Articles/351013/ this is absolutely awesome
[03:29] Hi. I'm having issues with Upstart (as found in Ubuntu Karmic) successfully starting a system when run in an LXC container. I'm hoping someone may have an idea of what I might try to do to fix it.
[03:32] strace says it's hanging on a read from a pipe pretty early into startup, just after trying (and failing) to open a connection to the /var/run/dbus/system_bus_socket UNIX socket.
[04:56] mbt: you should probably try in an ubuntu channel first
[04:59] I doubt "LXC containers" are something officially supported by Ubuntu?
[04:59] LXC support is rather new to Ubuntu, I was unable to find anyone who seemed to know what they were let alone what could be an issue.
[05:00] That said, as I have no clue what's going on, I'm kind of stuck. Currently reading through the upstart sources to see if that will turn on any light bulbs for me.
[05:01] might be useful to ask LXC people too
[05:02] I have no clue where to even find them.
[05:02] probably a permission issue or something
[05:03] mbt: they've sent patches to the kernel... ;-)
[05:03] lol
[05:04] LKML must have their email address ;-)
[05:06] lol, that's true. Something tells me that would take longer than trying to solve by brute force, though.
[05:07] The last time I emailed any kernel maintainer it took weeks and I got brushed off because it couldn't have been a software fault in the kernel.
[05:07] mbt: who?
[05:10] The maintainers for a USB->Parallel Printer cable driver. (Driver would eat a character of output to the printer and require the printer to signal offline/online before printing, weird).
[05:10] I don't remember who it was at this point, though, that was a few years ago.
[05:11] mbt: the trick is to figure out the sociable ones and email the closest one to what your issue is
[05:11] mbt: Ingo is good, Andrew Morton is good....
[05:14] I guess the language of your mail mathers too ;-)
[05:14] matters
[05:14] JanC: If you mean literal language: English.
[05:14] kernel development occurs in English.
[05:14] No exceptions
[05:15] 'your sh***y code is f***ing broken'
[05:15] sadmac_tokyo: more like PovAddict's example won't help ;)
[05:17] "Good day, kind sir. It is my humble regret to inform you that your sh***y code is f***ing broken."
[05:19] sadmac_tokyo: feel free to try it
[05:22] That said, I can't find anything that points to the kernel so I wouldn't have a clue what to even seek out.
[05:22] This same host kernel is running several other systems, including older (by about 70 days) Karmic systems. Just an up-to-date Karmic guest doesn't work. I don't have any way to test upstream components to try to narrow it down further.
[05:23] I once commented on some disgusting code duplication (adding a feature by copy/paste/minor replace), and the project BDFL replied 'if you have better code, let's see it; otherwise shut up'
[05:23] There is some more info at bug 461638 in launchpad.
[05:23] Rather, bug 461438
[05:24] PovAddict: he's right. if you know exactly what's wrong, why not fix it?
[05:24] That's the type of thing I'd send a patch for.
[05:24] mbt: I'd start by comparing package versions, and trying downgrading candidates to the same version as your old-but-working system
[05:25] candidate = package that might possibly be the cause or related
[05:25] PovAddict, I don't know how likely that is to work; there are a whole slew of changes around the way things went in. There's new packages for things like mountall and I don't think rolling back is going to be straightforward.
[05:26] Of course, that was also my fault for not snapshotting the bloody FS first. I always do that. This time, I forgot to. :(
[05:26] let's start with the obvious
[05:27] is it same kernel and same upstart on working and non-working systems?
[05:28] Same kernel (Linux Containers/LXC is like BSD Jail). Host hasn't changed configuration and works with the latest stuff; so upstart on the host works, but not in the container. At least, I think it's upstart, that's what's failing to read something else.
[05:29] Though, the "something else" part I haven't figured out (yet).
[10:26] mbt: the fact that you can strace Upstart scares me
[13:29] :-)
[15:05] Apparently, while strace will work, ltrace does not.
[15:05] * mbt shrugs
[15:34] heh
[15:34] strace isn't supposed to work on init
[15:35] why not?
[15:36] because ptrace() relies the process you are tracing being your child
[15:36] this means that you become the parent of init
[15:36] which means that strace becomes init
[15:36] this is obviously wrong
[15:37] you can attach to an existing process too
[15:38] and in his case, probably upstart is inside the container and strace is outside... or something
[15:38] yes
[15:38] and when you attach to the existing process
[15:38] you become its parents
[15:38] you get SIGCHLD if it does, not its real parent
[15:38] et.c
[15:38] it dies, etc.
[16:45] Keybuk, it can work on init in a container, because in a container, init has PID 1, but outside the container it has a parent.
[16:46] Keybuk, see http://pastebin.com/d55fbfbbb -- those have non-1 PIDs, but when as far as they are concerned, they are PID 1 inside of the container.
[16:47] The problem with ltrace is documented as a bug in its man page, so that's pretty much a dead end for me.
[16:48] And it's not like I can install sysvinit, because Karmic doesn't have it apparently.
[16:48] So I can't even verify that the problem is what I think it is.
[16:49] http://packages.ubuntu.com/search?keywords=sysvinit right, it's gone in karmic
[16:50] Yeah, so I'm feeling rather stuck at the moment.
[16:51] I wonder....
[16:51] If I take a core dump of the failing Upstart process, would anyone know how to extract anything useful about the state of the process from it?
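The point made at [16:45] and [16:46], that a container's init has an ordinary PID on the host while being PID 1 from its own point of view, can be checked from the host on newer kernels. The following is only a sketch under stated assumptions: it relies on the `NSpid:` field of `/proc/<pid>/status`, which exists on Linux 4.1 and later (much newer than the kernels discussed in this log), and `ns_pid` and `PROC_ROOT` are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# ns_pid HOST_PID
# Print the PID that process HOST_PID has inside its innermost PID namespace.
# Assumption: the kernel exposes an "NSpid:" line in /proc/<pid>/status
# (Linux 4.1+). PROC_ROOT is a hypothetical override, used only so the
# function can be pointed at a fake /proc tree when trying it out.
ns_pid() {
    # NSpid lists the PID in each nested namespace, outermost first,
    # so the last field is the PID as seen from inside the container.
    awk '/^NSpid:/ { print $NF }' "${PROC_ROOT:-/proc}/$1/status"
}
```

For a container's init, `ns_pid` called with its host PID would be expected to print 1, while for a process that is not in a nested PID namespace it prints the host PID itself.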
[16:56] I've attached one to https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/461438 though I don't know how useful it will be.
[16:57] yes, absolutely
[16:57] did Upstart core dump?
[16:58] honestly though, you're not really describing your problem
[16:58] on the bug you're talking about mountall outputting errors
[16:58] but here's you're talking about upstart core dumps
[16:58] I'm trying to figure out *what* the problem is. At first, I thought it was mountall. Now it's looking like it's upstart. The problem is the container won't start.
[16:59] do you get a pid 1 in the container?
[16:59] And no, I used gcore to get the process memory.
[16:59] Yes, that coredump is taken from the container's PID 1. That's the *only* process that is running in the container.
[16:59] so the container is clearly starting
[17:00] Alright, the container shell starts, but the stuff in the container does not.
[17:00] if no other processes are running, how do you get error messages from them?
[17:00] because you reported the bug with error messages from something upstart had run
[17:00] so stuff in the container *CLEARLY IS* running
[17:00] Yes, I did. It was running mountall. To see if that was the problem, I replaced mountall with a script that returns 0.
[17:00] did the script return 0?
[17:00] did Upstart see the script return 0 and return exited normally?
[17:00] Now, if you look at the 4th comment on the bug you see where Upstart is hanging.
[17:01] Mountall is the last thing it does, and it goes no further. It sees that it exits normally.
[17:01] After mountall is done, upstart hangs forever, blocked on either a read or a select() call. The data I have collected there seems to be inconsistent.
[17:02] So not only am I confused and not sure where the problem's root cause is (the only thing I *do* know is that it's related to a change in the container; Karmic as it was installed in the container 70 days ago worked just fine; Karmic as was updated 2 days ago does not).
[17:02] But I can't seem to collect enough data to figure out *why* it's hanging.
[17:03] All I know is that it is.
[17:03] And that all Upstart does is start mountall, hostname, and hwclock; nothing else is getting run, and init is the only process that is left running when those are completed.
[17:05] right
[17:05] but that can just mean that Upstart is done
[17:05] did mountall work?
[17:05] do you see events being emitted by mountall?
[17:05] Alright, let's back up just a second here.
[17:07] Comment 4 on the bug shows all there is to show on Upstart's early startup. From there, nothing else in the container that is configured to start, does. Upstart isn't finished because the system never finishes booting; no gettys are spawned, no avahi-daemon, no sshd, nothing. When the container is fully booted, it should have about 20 processes running, including a web server.
[17:08] The last thing to happen in that is mountall exits normally (according to upstart) and it changes state from post-stop to waiting.
[17:08] Now, the container has nothing to mount (all of that is done by lxc-start).
[17:10] and what does mountall say?
[17:10] What do you mean? It is a script that returns 0.
[17:10] Now, here's something else:
[17:10] I disabled the mountall, hwclock and hostname .conf files in /etc/init --- and now, Upstart is doing nothing.
[17:10] It says:
[17:11] root@spicerack:~# exec /sbin/init -v
[17:11] Loading configuration from /etc/init.conf
[17:11] Loading configuration from /etc/init
[17:11] init: Handling startup event
[17:11] And hangs.
[17:11] Attaching to it with gdb (from outside the container) shows that it's blocking on a call to select().
[17:12] err
[17:12] why is mountall a script that returns zero
[17:12] mountall is a C binary
[17:12] You're not paying attention to what I've said. 12 minutes ago, right here, in this conversation, I said that I replaced it with a script that returns success to see if that was the culprit.
[17:12] well, that won't work
[17:13] As everything in the container is already mounted when lxc-start forks and exec's /sbin/init, mountall has nothing to do in the container in the first place.
[17:13] yes it does
[17:13] it had to send the events that the rest of the system is waiting for
[17:13] otherwise nothing else will start
[17:13] rather than an exit 0
[17:13] why not?
[17:13] initctl emit virtual-filesystems
[17:13] initctl emit local-filesystems
[17:13] initctl emit remote-filesystems
[17:13] initctl emit filesystem
[17:13] exit 0
[17:14] you may want || true on the end of those
[17:14] Interesting. So the bug *is* in mountall?
[17:14] or maybe -n
[17:14] no, not at all
[17:14] you're not running mountall
[17:14] I was.
[17:14] you replaced it with a shell script
[17:14] And it was still doing nothing.
[17:14] That's why I replaced it with the shell script, to see if the system *still* did nothing.
[17:14] Let me swap back and see if that has any effect.
[17:16] Okay, swapping back, I still hang, though the output of exec /sbin/init -v has changed: https://bugs.edge.launchpad.net/ubuntu/+source/mountall/+bug/461438/comments/7
[17:17] So mountall is sending events just fine, but upstart is still failing to proceed with the boot process.
[17:17] what does mountall output?
[17:18] mountall hasn't sent the virtual-filesystems event yet
[17:18] let along the filesystem event
[17:18] so nothing will start yet
[17:18] What do you mean, what does mountall output? Everything that is output is there on the link I just pasted. That's it, in its entirety.
[17:19] I don't see any event names in the output so I can't tell what's being fired or not.
[17:21] If I use that script you put here a few minutes ago, though, things seem to work.
[17:22] Well sort-of. I get some things started.
[17:22] ok
[17:22] try mountall --debug
[17:23] Before execing init, or from the script?
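The stand-in suggested at [17:13] and [17:14] can be collected into one small script. This is a sketch of that suggestion, not a tested replacement for mountall: the four event names and the `|| true` come from the log, while the function name `emit_filesystem_events` and the `INITCTL` variable are hypothetical additions that allow a dry run with `echo` in place of `initctl`.

```shell
#!/bin/sh
# Stand-in for mountall in a container whose filesystems were already
# mounted by lxc-start. It does no mounting; it only emits the events
# that other Upstart jobs are waiting for.
# INITCTL is a hypothetical override for dry runs (e.g. INITCTL=echo).
INITCTL="${INITCTL:-initctl}"

emit_filesystem_events() {
    # Event names and their order are taken from the [17:13] messages.
    for event in virtual-filesystems local-filesystems \
                 remote-filesystems filesystem; do
        # "|| true" (suggested at [17:14]) keeps one failed emit from
        # making the whole script report failure.
        $INITCTL emit "$event" || true
    done
}
```

In the stand-in itself, the function would be called once and followed by `exit 0`, matching the sequence pasted at [17:13]. The `-n` mentioned at [17:14] is an alternative to `|| true` that this sketch does not use.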
[17:26] from the mountall script
[17:28] If I do that, it spawns over and over and over again, like a giant fork-bomb.
[17:29] really?
[17:29] Yep. I had to kill the container forcibly.
[17:29] * Keybuk doesn't see respawn in mountall.conf
[17:29] I don't have one in mine, either. I haven't modded any of the *.conf files in /etc/init.
[17:30] But there were 3000 processes when I killed the container.
[17:32] At this point, I have the container running using the initctl statements to fake mountall's presence.
[17:32] I'm going to create a new container to continue debugging in, since I need this one to be up and running.
[17:33] I'll bbiaf, need to switch back to my regular freenode profile.
[17:33] Back.
=== robbiew is now known as robbiew-afk
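The check at [17:29] for a respawn stanza in mountall.conf could be repeated across every job file in one pass. The sketch below assumes job files named `*.conf` in a directory such as `/etc/init`, as in the log; `jobs_with_respawn` is a hypothetical helper name, and the pattern only matches a bare `respawn` line, so `respawn limit ...` lines are deliberately not reported.

```shell
#!/bin/sh
# jobs_with_respawn DIR
# List the job files in DIR that contain a bare "respawn" stanza.
# Hypothetical helper for illustration; DIR would typically be /etc/init.
jobs_with_respawn() {
    # -l prints only the names of matching files; the pattern allows
    # leading and trailing whitespace around the stanza.
    grep -l '^[[:space:]]*respawn[[:space:]]*$' "$1"/*.conf 2>/dev/null
}
```

An empty result for the relevant job would be consistent with what both participants report at [17:29], which suggests looking elsewhere for the cause of the repeated spawning.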