[00:02] davecheney: does this bug still happen? https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/1304167
[00:02] <_mup_> Bug #1304167: syntax error, trusty beta-2 cloud image
[00:02] thumper: maybe you were smoking weed when you wrote the email
[00:02] davecheney: seems like a quite major bug if so
[00:02] wallyworld: nah...
[00:02] although I am wondering if it would help
[00:02] couldn't hurt :-)
[00:02] ha
[00:03] thumper: would it be possible for you to log
[00:04] "%T", err
[00:04] davecheney: sure
[00:04] thanks
[00:04] thumper: yes, the bug is still open
[00:04] it has screwed LXC on any platform that uses apparmor
[00:05] :-(
[00:05] thumper: when you run the destroy-environment, you're not in that directory are you
[00:06] ie; mkdir /tmp/t
[00:06] cd /tmp/t
[00:06] rmdir /tmp/t
[00:06] davecheney: no
[00:06] ok, just checking
[00:07] http://gcc.gnu.org/releases.html
[00:07] gcc 4.9 released
[00:07] but not really
[00:07] if you destroy too close to bootstrap, you don't get it
[00:08] thumper: hmm ok
[00:08] oh...
[00:08] I think I know what it could be...
[00:09] thumper: hold pls
[00:09] when we kill the machine agent with pkill
[00:09] it cleans up after itself
[00:09] we then have a race
[00:09] thumper: right, so things are racing on the directory listing
[00:09] the agent is trying to remove some files
[00:10] and then so does the destroy command
[00:10] http://golang.org/src/pkg/os/error_unix.go
[00:10] so is the agent removing ~/.juju/local ?
[00:10] ie it's not a file
[00:10] but the top level directory itself ?
[00:11] so os.RemoveAll goes to remove ~/.juju/local
[00:11] and the whole thing has been deleted already ?
[00:11] not all of it...
[00:11] but some of it
[00:11] oh...
[00:11] yeah, sometimes all of it
[00:11] yeah...
[00:11] it does
[00:12] *os.SyscallError
[00:12] they are racing to remove the datadir
[00:13] thumper: ok, that should be possible to make a repro
[00:13] i'll do that while i'm waiting for gccgo to compile
[00:13] davecheney: what do you think should happen?
[00:13] 10:12 < thumper> *os.SyscallError
[00:13] ^ is that %T ?
[00:13] yeah
[00:14] cheeky bugger
[00:14] thumper: leave it with me
[00:14] raise an issue maybe
[00:14] i need to make a repro
[00:14] davecheney: you see it as a golang bug?
[00:15] thumper: it won't fit through http://golang.org/src/pkg/os/error_unix.go
[00:16] http://play.golang.org/p/mp5i8GFL47
[00:16] * davecheney goes to find out where that os.SyscallError is coming from
[00:18] thumper: for the moment you're going to have to code around it
[00:18] this won't be fixed in 1.2
[00:18] dir_unix.go
[00:18] 41: return names, NewSyscallError("readdirent", errno)
[00:18] this is where it's coming from
[00:19] * davecheney feels very depressed
[00:19] it's just bugs, bugs, and more bugs
[00:20] davecheney: I'll work around it
[00:21] davecheney: we already ignore errors from two other things that we are racing with
[00:30] thumper: i'll get a repro quick smart
[00:30] i can see where it happens
[00:37] morning davecheney.
[00:37] davecheney: when I run make check on vm I get the following: http://pastebin.ubuntu.com/7246968
[00:37] any hints?
[00:38] thumper: wip on jujud isolation: https://codereview.appspot.com/87130045
[00:38] thumper: cmd/juju and environs/bootstrap are now passing
[00:39] environs/sync is going to take a bit more thought
[00:39] and right now I'm too hungry to think
[00:39] waigani: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1304754
[00:39] <_mup_> Bug #1304754: gccgo on ppc64el using split stacks when not supported

[00:40] davecheney: reading
[00:43] waigani: short version
[00:43] downgrading to an older kernel works around the problem
[00:43] but isn't a fix
[00:43] davecheney: yep, thanks
[00:44] I neeeeed food. bbl
[00:46] thumper: if err, ok := err.(*os.SyscallError); ok { if os.IsNotExist(err.Err) }
[00:46] or something
[00:56] axw: just saw your answer too
[00:57] axw: however the error that is being returned isn't os.IsNotExist
[00:57] axw: as the race is being caught elsewhere
[00:59] thumper: ah, maybe in the Readdir then
[00:59] anyway, there's definitely a race, and you should ignore it I think
[00:59] thumper: lucky(~/devel/issue) % go run issue.go
[00:59] 2014/04/14 10:58:58 creating temporary directories rooted at "/tmp/issue015782153"
[00:59] 2014/04/14 10:58:59 preparing workers
[00:59] 2014/04/14 10:58:59 release the swarm
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] ah... read-dir-int
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] not re-addir-int
[00:59] thumper: raising an issue
[00:59] axw: yeah, that's it
[01:00] I couldn't parse the smashedtogetherwords
[01:05] heh i finally have results for waigani and he's gone
[01:05] but i think his problem was actually the "things randomly die on ppc" bug...
[01:06] things all in all don't look too bad on arm64 actually
[01:08] thumper: https://code.google.com/p/go/issues/detail?id=7776&thanks=7776&ts=1397437695
[01:09] mwhudson: \o/
[01:09] not actually good
[01:09] just not terrible
[01:10] mwhudson: /usr/include/features.h:374:25: fatal error: sys/cdefs.h: No such file or directory
[01:10] any suggestions which package contains this header
[01:10] uh, no, looks basic though
[01:10] hm
[01:11] dpkg -S sez libc6-dev-i386
[01:11] which seems a bit random
[01:11] % dpkg -S /usr/include/sys/cdefs.h
[01:11] libc6-dev-i386: /usr/include/sys/cdefs.h
[01:11] yeah
[01:11] ah
[01:11] um
[01:11] mwhudson: this is compiling gcc 4.9
[01:12] "real" libc6-dev installs it to /usr/include/$triplet/sys/cdefs.h
[01:12] davecheney: from upstream or the deb?
[01:12] mwhudson: upstream
[01:12] mwhudson: our deb produces broken binaries
[01:13] davecheney: on powerpc64 i assume?
[01:13] um, that sounds like something doko should know about :)
[01:13] is this the split stack thing?
[01:13] yup
[01:14] i guess libc6-dev-i386 must be some kind of pre-multiarch thing
[01:14] * davecheney tries patching in some of the arguments from /usr/bin/gcc -v
[01:14] davecheney: "dpkg --listfiles libc6-dev | grep cdefs.h" on your platform?
[01:15] $ dpkg --listfiles libc6-dev | grep cdefs.h
[01:15] /usr/include/powerpc64le-linux-gnu/sys/cdefs.h
[01:15] maybe ./configure got the triplet wrong
[01:16] well, i was wondering why this was such a good compile box
[01:16] clock : 4284.000000MHz
[01:16] ziiing
[01:21] gcc, just keep adding flags until it compiles
[01:22] nope
[01:22] still broke
[01:22] fuck this
[01:22] i'm using symlinks
[01:34] wow. such multiarch
[01:40] mwhudson: ok, here is what I think
[01:40] gccgo on ppc is correctly detecting that split stacks are not supported
[01:40] and using the default 'large' stack model
[01:40] but ..
the stack is still too small
[01:41] i'm bt'in in gdb and at stack frame 1475 with no end in sight
[01:42] haha
[01:42] ok
[01:42] so stack overflow?
[01:42] hmm
[01:42] make that stack frame 3,300
[01:42] is this on the altstack? i.e. while handling a signal?
[01:42] so, in summary, gccgo doesn't give a clean indication when you fall off the end of the stack
[01:42] mwhudson: nope, with split stacks disabled
[01:42] you get a c style stack per goroutine
[01:42] davecheney: that's not what i mean
[01:42] sure
[01:43] but signals are handled on a different stack again
[01:43] (sigaltstack and all that)
[01:43] i think those stacks are smaller?
[01:43] anyways
[01:43] mwhudson: i'm going to say, conditionally, yes
[01:43] mwhudson: the sig handler gets a SEGV
[01:43] davecheney: it's easy ish to make the stacks bigger i think
[01:43] and it blames the topmost stack frame for hitting a nil
[01:44] i found the code that was allocating them
[01:44] when actually all it did was call a function
[01:44] yeah, well, if you fall off the end of the stack it's certainly going to break
[01:44] mwhudson: are you adding -fsplit-stack on aarch64 ?
[01:44] davecheney: no
[01:44] shit, 5,000 stack frames
[01:45] how in gods name could juju use so much stack ...
[01:45] could this "just" be application infinite recursion for some reason?
[01:45] or does the backtrace look reasonable?
[01:46] mwhudson: the latter
[01:46] maybe a dozen frames
[01:46] this is going to be an 8mb stack
[01:46] 18,000 stack frames
[01:47] #31380 0x000000001000522c in main.count ()
[01:47] #31381 0x0000000010005854 in main.main ()
[01:47] <_mup_> Bug #31381: POMsgSet.active_texts assumes POFile.pluralforms is an int
[01:47] <_mup_> Bug #31380: source package sort by version doesn't cope with invalid version numbers
[01:47] that doesn't sound reasonable
[01:47] lolmup
[01:47] #-1
[01:47] although, eh, i guess it works well enough on platforms that do have split stacks
[01:48] mwhudson: most gccgo developers are on amd64
[01:48] when I say most
[01:48] i mean
[01:48] all 1 of them?
[01:48] everyone except you and me and some neckbeard using mips
[01:49] strange this doesn't happen on arm64 though
[01:49] * davecheney goes to talk to ian taylor
[01:49] mwhudson: gccgo src/test/peano.go
[01:49] ./a.out
[01:49] i wouldn't have thought that stack frames would be much bigger on that
[01:49] well yes, that fails on arm64 too
[01:49] i wonder if it is unrelated
[01:49] that gives a straight segfault
[01:49] and the go handler doesn't catch it
[01:49] i wonder if we're barking up the wrong tree
[01:53] mwhudson: i'm thinking these are two different issues
[01:53] [492932.974051] a.out[25065]: bad frame in setup_rt_frame: 000000c20ffaf0e0 nip 0000000010004e0c lr 00000000100051fc
[01:53] ^ this is what running off the stack looks like
[01:53] note nip
[01:54] [2028013.988376] jujud[400]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000
[01:54] ^ this is what a juju segfault on a bad kernel looks like
[01:54] nip and lr are 0
[01:54] something branched to 0 and nuked the lr for good measure
[01:55] well, once you have a disagreement over whether a bit of memory is stack or not, it's not exactly predictable what happens next
[01:55] true
[01:55] but why is the ip 0
[01:55] both cases this is unmapped memory
[01:56] because something stomped
over the link register on the stack, so it branched to lala land when trying to do a procedure return?
[01:56] i don't know the ppc abi but i certainly saw that sort of thing a lot on arm64
[01:56] mwhudson: anything with a LR is probably going to act the same
[01:57] also
[01:57] mwhudson: ok, so if we're not running off the end of the stack
[01:57] and i'm pretty sure we're not
[01:57] then why does the size of the kernel page size affect the result
[02:25] $ pmap -x 969
[02:25] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:25] Address Kbytes RSS Dirty Mode Mapping
[02:25] total kB 0 0 0
[02:25] ---------------- ------- ------- -------
[02:25] well, thanks
[02:28] thumper: juju status returns 0 if there are hook errors
[02:28] axw: sorry, maybe this question is best addressed to you
[02:30] is that a problem?
[02:30] axw: dunno
[02:30] depends what we've promised status will do
[02:30] i know that people want to be able to say 'is this environment ok'
[02:30] $ pmap -x 969
[02:30] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:30] Address Kbytes RSS Dirty Mode Mapping
[02:30] sorry
[02:31] ---------------- ------- ------- -------
[02:31] $ pmap -x 969
[02:31] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:31] Address Kbytes RSS Dirty Mode Mapping
[02:31] oh for fucks sake
[02:31] ---------------- ------- ------- -------
[02:31] total kB 0 0 0
[02:31] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[02:31] 969 root 20 0 1413376 515456 19136 S 9.6 6.2 0:18.51 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:31] yeah I can see the use case, but AFAIK it always just returned 0
[02:31] axw: i think this might be related
[02:31] heavy use of the api server causes RES to rise
[02:36] oh god
[02:36] i hate everything
[02:36] upstart isn't logging the stderr of
jujud-machine-0
[02:41] :cry: SIGQUIT doesn't do what I think on gccgo
[03:06] wallyworld: hangout died
[03:06] wallyworld, axw, waigani: I figured I was done anyway :-)
[03:06] thumper: will take a look at your CL after I finish up on this HA thing
[03:06] axw: ack
[03:07] axw: I first read that as "hating"
[03:07] heh
[03:07] made me chuckle
[03:07] * thumper goes for a brief lie down before his head explodes
[03:13] wallyworld: I found the mockable BuildToolsTarball, what was the other one? bundleTools?
[03:13] yeah BundleTools
[03:13] in environs/tools
[03:13] that isn't mockable?
[03:14] environs/tools/build.go:205
[03:14] you just need to introduce a var
[03:14] make the method lower case
[03:14] make the var upper case
[03:15] ah sure, make it mockable - no problem
[03:19] mwhudson: https://bugs.launchpad.net/juju-core/+bug/1307282
[03:19] <_mup_> Bug #1307282: cmd/jujud: gccgo api server consumes ~500mb of ram on machine-0
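The "introduce a var, make the method lower case, make the var upper case" recipe for making BundleTools mockable can be sketched as follows. The signature and body of bundleTools here are invented for the example (the real function in environs/tools is not shown in the log), and patchBundleTools is a hypothetical test helper.

```go
package main

import "fmt"

// bundleTools is the real implementation, now unexported.
// Its signature here is an assumption made for illustration only.
func bundleTools(dir string) (string, error) {
	return dir + "/tools.tar.gz", nil
}

// BundleTools is the exported package-level variable that callers use.
// Because it is a variable, tests can swap in a fake.
var BundleTools = bundleTools

// patchBundleTools replaces BundleTools with fake and returns a function
// that restores the original. Hypothetical test helper.
func patchBundleTools(fake func(string) (string, error)) (restore func()) {
	orig := BundleTools
	BundleTools = fake
	return func() { BundleTools = orig }
}

func main() {
	restore := patchBundleTools(func(string) (string, error) {
		return "fake.tar.gz", nil
	})
	got, _ := BundleTools("/tmp/build")
	fmt.Println("while patched:", got)

	restore()
	got, _ = BundleTools("/tmp/build")
	fmt.Println("after restore:", got)
}
```

Callers keep writing BundleTools(...) exactly as before, so the change stays local to the package that owns the function.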

[03:22] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:22] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:22] did this get fixed ?
[03:24] waigani: can you send me `uname -a` from your vm ?
[03:24] davecheney: Linux winton-09 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:09:21 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
[03:26] waigani: interesting
[03:26] i'm trying a -24 kernel and I can't get it to crash
[03:26] waigani: did you just upgrade to that kernel ?
[03:26] hmmm
[03:27] waigani: uptime
[03:27] davecheney: 03:27:40 up 1:09, 2 users, load average: 0.00, 0.01, 0.05
[03:28] I did a restart, to see if that helped at all
[03:28] ran make check after, same problem
[03:28] waigani: ok
[03:28] thanks, that makes it concrete
[03:28] dmesg
[03:28] ^^
[03:29] davecheney: http://pastebin.ubuntu.com/7247924/
[03:29] waigani: ta
[03:29] i should have said
[03:29] dmesg | tail
[03:29] waigani: could I ask you to check again
[03:30] davecheney: http://pastebin.ubuntu.com/7247927/
[03:30] sorry
[03:30] the test
[03:30] not the dmesg
[03:30] ah right
[03:30] what i'm looking for is a line like
[03:30] (no worries, this was my fault)
[03:30] 11:54 < davecheney> [2028013.988376] jujud[400]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000
[03:30] ^ should see something like this
[03:31] okay, I'll paste when done and keep an eye out for a line like that
[03:39] waigani: can you ssh-import-id dave-cheney on your vm
[03:39] so I can stooge around your /var/log/
[03:39] and see what kernel you were running before reboot
[03:40] davecheney: already done, your public key is on the vm
[03:41] danka
[03:41] waigani: i have a theory that -24 kernel fixes the issue
[03:41] it's not much of a theory atm
[03:41] davecheney: http://pastebin.ubuntu.com/7247954/
[03:42] davecheney: I have a theory that I did something stupid
[03:42] not so much a
theory as a constant axiom
[03:43] waigani: ubuntu@winton-09:/var/log$ grep '\-generic' dmesg.0 dmesg
[03:43] dmesg.0:[ 0.000000] Linux version 3.13.0-20-generic (buildd@denneed04) (gcc version 4.8.2 (Ubuntu 4.8.2-17ubuntu1) ) #42-Ubuntu SMP Fri Mar 28 09:55:49 UTC 2014 (Ubuntu 3.13.0-20.42-generic 3.13.7)
[03:43] dmesg.0:[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-3.13.0-20-generic root=UUID=30486aa4-f767-4397-ab88-dd0e02e66651 ro console=hvc0 earlyprintk
[03:43] dmesg:[ 0.000000] Linux version 3.13.0-24-generic (buildd@fisher04) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #46-Ubuntu SMP Thu Apr 10 19:09:21 UTC 2014 (Ubuntu 3.13.0-24.46-generic 3.13.9)
[03:43] dmesg:[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-3.13.0-24-generic root=UUID=30486aa4-f767-4397-ab88-dd0e02e66651 ro console=hvc0 earlyprintk
[03:44] looks like you were running -20, then you got -24 when you rebooted
[03:44] waigani: dmesg ?
[03:45] davecheney: http://pastebin.ubuntu.com/7247958/
[03:45] sorry, is that what you meant?
[03:45] waigani: yup
[03:45] interesting
[03:45] all previous panics of this class leave a message in dmesg
[03:46] ok, there could be two unrelated issues
[03:46] waigani: could you log a bug for http://pastebin.ubuntu.com/7247954/
[03:46] tag it gccgo ppc64el
[03:46] davecheney: yep, gladly :)
[03:47] waigani: ta
[03:47] davecheney: I'll just double check that I have not done something stupid in the code.
It *should* be latest trunk
[03:47] waigani: nah
[03:47] this isn't you
[03:47] the panic is happening in /usr/bin/go
[03:47] if you want to investigate
[03:48] apt-get source gccgo-go
[03:48] right, that is what stumps me
[03:48] then have a look at that line in build.go
[03:48] waigani: i ran into that about a week ago
[03:48] that was when the floor fell out from under me
[03:48] lol
[03:48] yep, I know that one
[03:53] davecheney: https://bugs.launchpad.net/juju-core/+bug/1307289
[03:53] <_mup_> Bug #1307289: Go panics when running tests on ppc64

[03:53] waigani: jolly good
[03:57] axw: ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:57] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:57] ^ did this get fixed recently
[03:57] or should I log a bug
[03:58] axw: do you really think that two filtering methods is better than one with a bool?
[03:58] axw: I'll write it and look at the diff
[03:59] thumper: I really do. With that approach you can see without a doubt that nothing can change the behaviour at runtime; with the bool you need to ensure that nothing changes it
[04:00] ok
[04:01] davecheney: wallyworld fixed that already I think
[04:01] axw: right
[04:01] this is 1.17.8 (ish)
[04:01] yeah, fixed in 1.18.1 I believe
[04:01] yeah, fixed in trunk
[04:01] i think I saw a branch last week
[04:01] right o
[04:04] axw: like this http://paste.ubuntu.com/7247988/ ?
[04:05] thumper: yup
[04:05] thumper: comment on countedFilterLine needs fixing
[04:49] ping jam ?
[04:50] davecheney: /wave
[04:50] jam: i think we're eating the elephant from different ends
[04:50] wrt to the api server memory usage
[04:51] I'm not sure I understand
[04:52] thumper: I'm around whenever you would like to hangout
[04:52] jam: ok
[04:52] in trying to trace down the panics i'm seeing
[04:52] i've sort of discovered just how much memory jujud consumes
[04:52] it's horrific
[04:53] my initial results showed about 0.5MB per agent, which wasn't great, but wasn't terrible. but when something gets into a bad situation, I see memory spike terribly
[04:53] jam: gccgo
[04:54] it's more like 250mb per agent
[04:54] two agents per machine
[04:54] at a minimum
[04:54] wow...
[04:54]
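On the 03:58 exchange about two filtering methods versus one with a bool: the code under review (the paste and countedFilterLine) is not shown in the log, so the following is only a guess at the shape of the argument. The names lineFilter, matchLine, countedMatchLine and apply are all hypothetical. The point it illustrates is that the behaviour is chosen once, when the filter is constructed, so there is no flag that something could flip at runtime.

```go
package main

import (
	"fmt"
	"strings"
)

// lineFilter reports whether a log line should be kept.
type lineFilter func(line string) bool

// matchLine keeps lines that contain substr. Hypothetical example filter.
func matchLine(substr string) lineFilter {
	return func(line string) bool {
		return strings.Contains(line, substr)
	}
}

// countedMatchLine behaves like matchLine and also counts kept lines in
// *kept. It is a separate function rather than matchLine plus a bool.
func countedMatchLine(substr string, kept *int) lineFilter {
	match := matchLine(substr)
	return func(line string) bool {
		ok := match(line)
		if ok {
			*kept++
		}
		return ok
	}
}

// apply returns the lines accepted by f.
func apply(lines []string, f lineFilter) []string {
	var out []string
	for _, l := range lines {
		if f(l) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []string{
		"machine-0: started",
		"unit-0: hook failed",
		"machine-0: stopped",
	}
	kept := 0
	fmt.Println(apply(lines, matchLine("machine-0")))
	fmt.Println(apply(lines, countedMatchLine("machine-0", &kept)), kept)
}
```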