[00:02] davecheney: does this bug still happen? https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/1304167
[00:02] <_mup_> Bug #1304167: syntax error, trusty beta-2 cloud image
[00:02] thumper: maybe you were smoking weed when you wrote the email
[00:02] davecheney: seems like a quite major bug if so
[00:02] wallyworld: nah...
[00:02] although I am wondering if it would help
[00:02] couldn't hurt :-)
[00:02] ha
[00:03] thumper: would it be possible for you to log
[00:04] "%T", err
[00:04] davecheney: sure
[00:04] thanks
[00:04] thumper: yes, the bug is still open
[00:04] it has screwed LXC on any platform that uses apparmor
[00:05] :-(
[00:05] thumper: when you run the destroy-environment, you're not in that directory are you
[00:06] ie; mkdir /tmp/t
[00:06] cd /tmp/t
[00:06] rmdir /tmp/t
[00:06] davecheney: no
[00:06] ok, just checking
[00:07] http://gcc.gnu.org/releases.html
[00:07] gcc 4.9 released
[00:07] but not really
[00:07] if you destroy too close to bootstrap, you don't get it
[00:08] thumper: hmm ok
[00:08] oh...
[00:08] I think I know what it could be...
[00:09] thumper: hold pls
[00:09] when we kill the machine agent with pkill
[00:09] it cleans up after itself
[00:09] we then have a race
[00:09] thumper: right, so things are racing on the directory listing
[00:09] the agent is trying to remove some files
[00:10] and then so does the destroy command
[00:10] http://golang.org/src/pkg/os/error_unix.go
[00:10] so is the agent removing ~/.juju/local ?
[00:10] ie it's not a file
[00:10] but the top level directory itself ?
[00:11] so os.RemoveAll goes to remove ~/.juju/local
[00:11] and the whole thing has been deleted already ?
[00:11] not all of it...
[00:11] but some of it
[00:11] oh...
[00:11] yeah, sometimes all of it
[00:11] yeah...
[00:11] it does
[00:12] *os.SyscallError
[00:12] they are racing to remove the datadir
[00:13] thumper: ok, that should be possible to make a repro
[00:13] i'll do that while i'm waiting for gccgo to compile
[00:13] davecheney: what do you think should happen?
[00:13] 10:12 < thumper> *os.SyscallError
[00:13] ^ is that %T ?
[00:13] yeah
[00:14] cheeky bugger
[00:14] thumper: leave it with me
[00:14] raise an issue maybe
[00:14] i need to make a repro
[00:14] davecheney: you see it as a golang bug?
[00:15] thumper: it won't fit through http://golang.org/src/pkg/os/error_unix.go
[00:16] http://play.golang.org/p/mp5i8GFL47
[00:16] * davecheney goes to find out where that os.SyscallError is coming from
[00:18] thumper: for the moment you're going to have to code around it
[00:18] this won't be fixed in 1.2
[00:18] dir_unix.go
[00:18] 41: return names, NewSyscallError("readdirent", errno)
[00:18] this is where it's coming from
[00:19] * davecheney feels very depressed
[00:19] it's just bugs, bugs, and more bugs
[00:20] davecheney: I'll work around it
[00:21] davecheney: we already ignore errors from two other things that we are racing with
[00:30] thumper: i'll get a repro quick smart
[00:30] i can see where it happens
[00:37] morning davecheney.
[00:37] davecheney: when I run make check on vm I get the following: http://pastebin.ubuntu.com/7246968
[00:37] any hints?
[00:38] thumper: wip on jujud isolation: https://codereview.appspot.com/87130045
[00:38] thumper: cmd/juju and environs/bootstrap are now passing
[00:39] environs/sync is going to take a bit more thought
[00:39] and right now I'm too hungry to think
[00:39] waigani: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1304754
[00:39] <_mup_> Bug #1304754: gccgo on ppc64el using split stacks when not supported
[00:40] davecheney: reading
[00:43] waigani: short version
[00:43] downgrading to an older kernel works around the problem
[00:43] but isn't a fix
[00:43] davecheney: yep, thanks
[00:44] I neeeeed food. bbl
[00:46] thumper: if err, ok := err.(*os.SyscallError); ok { if os.IsNotExist(err.Err) }
[00:46] or something
[00:56] axw: just saw your answer too
[00:57] axw: however the error that is being returned isn't os.IsNotExist
[00:57] axw: as the race is being caught elsewhere
[00:59] thumper: ah, maybe in the Readdir then
[00:59] anyway, there's definitely a race, and you should ignore it I think
[00:59] thumper: lucky(~/devel/issue) % go run issue.go
[00:59] 2014/04/14 10:58:58 creating temporary directories rooted at "/tmp/issue015782153"
[00:59] 2014/04/14 10:58:59 preparing workers
[00:59] 2014/04/14 10:58:59 release the swarm
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] ah... read-dir-int
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] 2014/04/14 10:58:59 unexpected error: *os.SyscallError, "readdirent: no such file or directory"
[00:59] not re-addir-int
[00:59] thumper: raising an issue
[00:59] axw: yeah, that's it
[01:00] I couldn't parse the smashedtogetherwords
[01:05] heh i finally have results for waigani and he's gone
[01:05] but i think his problem was actually the "things randomly die on ppc" bug...
[01:06] things all in all don't look too bad on arm64 actually
[01:08] thumper: https://code.google.com/p/go/issues/detail?id=7776&thanks=7776&ts=1397437695
[01:09] mwhudson: \o/
[01:09] not actually good
[01:09] just not terrible
[01:10] mwhudson: /usr/include/features.h:374:25: fatal error: sys/cdefs.h: No such file or directory
[01:10] any suggestions which package contains this header
[01:10] uh, no, looks basic though
[01:10] hm
[01:11] dpkg -S sez libc6-dev-i386
[01:11] which seems a bit random
[01:11] % dpkg -S /usr/include/sys/cdefs.h
[01:11] libc6-dev-i386: /usr/include/sys/cdefs.h
[01:11] yeah
[01:11] ah
[01:11] um
[01:11] mwhudson: this is compiling gcc 4.9
[01:12] "real" libc6-dev installs it to /usr/include/$triplet/sys/cdefs.h
[01:12] davecheney: from upstream or the deb?
[01:12] mwhudson: upstream
[01:12] mwhudson: our deb produces broken binaries
[01:13] davecheney: on powerpc64 i assume?
[01:13] um, that sounds like something doko should know about :)
[01:13] is this the split stack thing?
[01:13] yup
[01:14] i guess libc6-dev-i386 must be some kind of pre-multiarch thing
[01:14] * davecheney tries patching in some of the arguments from /usr/bin/gcc -v
[01:14] davecheney: "dpkg --listfiles libc6-dev | grep cdefs.h" on your platform?
[01:15] $ dpkg --listfiles libc6-dev | grep cdefs.h
[01:15] /usr/include/powerpc64le-linux-gnu/sys/cdefs.h
[01:15] maybe ./configure got the triplet wrong
[01:16] well, i was wondering why this was such a good compile box
[01:16] clock : 4284.000000MHz
[01:16] ziiing
[01:21] gcc, just keep adding flags until it compiles
[01:22] nope
[01:22] still broke
[01:22] fuck this
[01:22] i'm using symlinks
[01:34] wow. such multiarch
[01:40] mwhudson: ok, here is what I think
[01:40] gccgo on ppc is correctly detecting that split stacks are not supported
[01:40] and using the default 'large' stack model
[01:40] but ..
the stack is still too small
[01:41] i'm bt'in in gdb and at stack frame 1475 with no end in sight
[01:42] haha
[01:42] ok
[01:42] so stack overflow?
[01:42] hmm
[01:42] make that stack frame 3,300
[01:42] is this on the altstack? i.e. while handling a signal?
[01:42] so, in summary, gccgo doesn't give a clean indication when you fall off the end of the stack
[01:42] mwhudson: nope, with split stacks disabled
[01:42] you get a c style stack per goroutine
[01:42] davecheney: that's not what i mean
[01:42] sure
[01:43] but signals are handled on a different stack again
[01:43] (sigaltstack and all that)
[01:43] i think those stacks are smaller?
[01:43] anyways
[01:43] mwhudson: i'm going to say, conditionally, yes
[01:43] mwhudson: the sig handler gets a SEGV
[01:43] davecheney: it's easy ish to make the stacks bigger i think
[01:43] and it blames the topmost stack frame for hitting a nil
[01:44] i found the code that was allocating them
[01:44] when actually all it did was call a function
[01:44] yeah, well, if you fall off the end of the stack it's certainly going to break
[01:44] mwhudson: are you adding -fsplit-stack on aarch64 ?
[01:44] davecheney: no
[01:44] shit, 5,000 stack frames
[01:45] how in gods name could juju use so much stack ...
[01:45] could this "just" be application infinite recursion for some reason?
[01:45] or does the backtrace look reasonable?
[01:46] mwhudson: the latter
[01:46] maybe a dozen frames
[01:46] this is going to be an 8mb stack
[01:46] 18,000 stack frames
[01:47] #31380 0x000000001000522c in main.count ()
[01:47] #31381 0x0000000010005854 in main.main ()
[01:47] <_mup_> Bug #31381: POMsgSet.active_texts assumes POFile.pluralforms is an int
[01:47] <_mup_> Bug #31380: source package sort by version doesn't cope with invalid version numbers
[01:47] that doesn't sound reasonable
[01:47] lolmup
[01:47] #-1
[01:47] although, eh, i guess it works well enough on platforms that do have split stacks
[01:48] mwhudson: most gccgo developers are on amd64
[01:48] when I say most
[01:48] i mean
[01:48] all 1 of them?
[01:48] everyone except you and me and some neckbeard using mips
[01:49] strange this doesn't happen on arm64 though
[01:49] * davecheney goes to talk to ian taylor
[01:49] mwhudson: gccgo src/test/peano.go
[01:49] ./a.out
[01:49] i wouldn't have thought that stack frames would be much bigger on that
[01:49] well yes, that fails on arm64 too
[01:49] i wonder if it is unrelated
[01:49] that gives a straight segfault
[01:49] and the go handler doesn't catch it
[01:49] i wonder if we're barking up the wrong tree
[01:53] mwhudson: i'm thinking these are two different issues
[01:53] [492932.974051] a.out[25065]: bad frame in setup_rt_frame: 000000c20ffaf0e0 nip 0000000010004e0c lr 00000000100051fc
[01:53] ^ this is what running off the stack looks like
[01:53] note nip
[01:54] [2028013.988376] jujud[400]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000
[01:54] ^ this is what a juju segfault on a bad kernel looks like
[01:54] nip and lr are 0
[01:54] something branched to 0 and nuked the lr for good measure
[01:55] well, once you have a disagreement over whether a bit of memory is stack or not, it's not exactly predictable what happens next
[01:55] true
[01:55] but why is the ip 0
[01:55] both cases this is unmapped memory
[01:56] because something stomped
over the link register on the stack, so it branched to lala land when trying to do a procedure return?
[01:56] i don't know the ppc abi but i certainly saw that sort of thing a lot on arm64
[01:56] mwhudson: anything with a LR is probably going to act the same
[01:57] also
[01:57] mwhudson: ok, so if we're not running off the end of the stack
[01:57] and i'm pretty sure we're not
[01:57] then why does the size of the kernel page size affect the result
[02:25] $ pmap -x 969
[02:25] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:25] Address Kbytes RSS Dirty Mode Mapping
[02:25] total kB 0 0 0
[02:25] ---------------- ------- ------- -------
[02:25] well, thanks
[02:28] thumper: juju status returns 0 if there are hook errors
[02:28] axw: sorry, maybe this question is best addressed to you
[02:30] is that a problem?
[02:30] axw: dunno
[02:30] depends what we've promised status will do
[02:30] i know that people want to be able to say 'is this environment ok'
[02:30] $ pmap -x 969
[02:30] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:30] Address Kbytes RSS Dirty Mode Mapping
[02:30] sorry
[02:31] ---------------- ------- ------- -------
[02:31] $ pmap -x 969
[02:31] 969: /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:31] Address Kbytes RSS Dirty Mode Mapping
[02:31] oh for fucks sake
[02:31] ---------------- ------- ------- -------
[02:31] total kB 0 0 0
[02:31] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[02:31] 969 root 20 0 1413376 515456 19136 S 9.6 6.2 0:18.51 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
[02:31] yeah I can see the use case, but AFAIK it always just returned 0
[02:31] axw: i think this might be related
[02:31] heavy use of the api server causes RES to rise
[02:36] oh god
[02:36] i hate everything
[02:36] upstart isn't logging the stderr of
jujud-machine-0
[02:41] :cry: SIGQUIT doesn't do what I think on gccgo
[03:06] wallyworld: hangout died
[03:06] wallyworld, axw, waigani: I figured I was done anyway :-)
[03:06] thumper: will take a look at your CL after I finish up on this HA thing
[03:06] axw: ack
[03:07] axw: I first read that as "hating"
[03:07] heh
[03:07] made me chuckle
[03:07] * thumper goes for a brief lie down before his head explodes
[03:13] wallyworld: I found the mockable BuildToolsTarball, what was the other one? bundleTools?
[03:13] yeah BundleTools
[03:13] in environs/tools
[03:13] that isn't mockable?
[03:14] environ/tools/build.go:205
[03:14] you just need to introduce a var
[03:14] make the method lower case
[03:14] make the var upper case
[03:15] ah sure, make it mockable - no problem
[03:19] mwhudson: https://bugs.launchpad.net/juju-core/+bug/1307282
[03:19] <_mup_> Bug #1307282: cmd/jujud: gccgo api server consumes ~500mb of ram on machine-0
[03:22] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:22] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:22] did this get fixed ?
[03:24] waigani: can you send me `uname -a` from your vm ?
[03:24] davecheney: Linux winton-09 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:09:21 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
[03:26] waigani: interesting
[03:26] i'm trying a -24 kernel and I can't get it to crash
[03:26] waigani: did you just upgrade to that kernel ?
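The "introduce a var, make the method lower case, make the var upper case" recipe from [03:14] can be sketched as below. The signature is an assumption made for illustration only (the real BundleTools in environs/tools takes different arguments), and `patchBundleTools` is a hypothetical test helper.

```go
package main

import "fmt"

// bundleTools is the real implementation, now unexported.
// The signature here is invented for the sketch.
func bundleTools() (string, error) {
	return "real-tools.tgz", nil
}

// BundleTools is an exported package-level variable that normally points
// at the real implementation. Callers use BundleTools(); tests can swap it.
var BundleTools = bundleTools

// patchBundleTools replaces BundleTools with fake and returns a function
// that restores the original. (Hypothetical helper.)
func patchBundleTools(fake func() (string, error)) (restore func()) {
	orig := BundleTools
	BundleTools = fake
	return func() { BundleTools = orig }
}

func main() {
	restore := patchBundleTools(func() (string, error) {
		return "fake-tools.tgz", nil
	})
	name, _ := BundleTools()
	fmt.Println(name) // the fake is in effect

	restore()
	name, _ = BundleTools()
	fmt.Println(name) // the real implementation is back
}
```

The fake should only stand in for the one step it replaces; a mock that also performs the caller's work is harder to reason about in tests.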
[03:26] hmmm
[03:27] waigani: uptime
[03:27] davecheney: 03:27:40 up 1:09, 2 users, load average: 0.00, 0.01, 0.05
[03:28] I did a restart, to see if that helped at all
[03:28] ran make check after, same problem
[03:28] waigani: ok
[03:28] thanks, that makes it concrete
[03:28] dmesg
[03:28] ^^
[03:29] davecheney: http://pastebin.ubuntu.com/7247924/
[03:29] waigani: ta
[03:29] i should have said
[03:29] dmesg | tail
[03:29] waigani: could I ask you to check again
[03:30] davecheney: http://pastebin.ubuntu.com/7247927/
[03:30] sorry
[03:30] the test
[03:30] not the dmesg
[03:30] ah right
[03:30] what i'm looking for is a line like
[03:30] (no worries, this was my fault)
[03:30] 11:54 < davecheney> [2028013.988376] jujud[400]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000
[03:30] ^ should see something like this
[03:31] okay, I'll paste when done and keep an eye out for a line like that
[03:39] waigani: can you ssh-import-id dave-cheney on your vm
[03:39] so I can stooge around your /var/log/
[03:39] and see what kernel you were running before reboot
[03:40] davecheney: already done, your public key is on the vm
[03:41] danka
[03:41] waigani: i have a theory that -24 kernel fixes the issue
[03:41] it's not much of a theory atm
[03:41] davecheney: http://pastebin.ubuntu.com/7247954/
[03:42] davecheney: I have a theory that I did something stupid
[03:42] not so much a theory as a constant axiom
[03:43] waigani: ubuntu@winton-09:/var/log$ grep '\-generic' dmesg.0 dmesg
[03:43] dmesg.0:[ 0.000000] Linux version 3.13.0-20-generic (buildd@denneed04) (gcc version 4.8.2 (Ubuntu 4.8.2-17ubuntu1) ) #42-Ubuntu SMP Fri Mar 28 09:55:49 UTC 2014 (Ubuntu 3.13.0-20.42-generic 3.13.7)
[03:43] dmesg.0:[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-3.13.0-20-generic root=UUID=30486aa4-f767-4397-ab88-dd0e02e66651 ro console=hvc0 earlyprintk
[03:43] dmesg:[ 0.000000] Linux version 3.13.0-24-generic (buildd@fisher04) (gcc version 4.8.2
(Ubuntu 4.8.2-19ubuntu1) ) #46-Ubuntu SMP Thu Apr 10 19:09:21 UTC 2014 (Ubuntu 3.13.0-24.46-generic 3.13.9)
[03:43] dmesg:[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinux-3.13.0-24-generic root=UUID=30486aa4-f767-4397-ab88-dd0e02e66651 ro console=hvc0 earlyprintk
[03:44] looks like you were running -20, then you got -24 when you rebooted
[03:44] waigani: dmesg ?
[03:45] davecheney: http://pastebin.ubuntu.com/7247958/
[03:45] sorry, is that what you meant?
[03:45] waigani: yup
[03:45] interesting
[03:45] all previous panics of this class leave a message in dmesg
[03:46] ok, there could be two unrelated issues
[03:46] waigani: could you log a bug for http://pastebin.ubuntu.com/7247954/
[03:46] tag it gccgo ppc64el
[03:46] davecheney: yep, gladly :)
[03:47] waigani: ta
[03:47] davecheney: I'll just double check that I have not done something stupid in the code. It *should* be latest trunk
[03:47] waigani: nah
[03:47] this isn't you
[03:47] the panic is happening in /usr/bin/go
[03:47] if you want to investigate
[03:48] apt-get source gccgo-go
[03:48] right, that is what stumps me
[03:48] then have a look at that line in build.go
[03:48] waigani: i ran into that about a week ago
[03:48] that was when the floor fell out from under me
[03:48] lol
[03:48] yep, I know that one
[03:53] davecheney: https://bugs.launchpad.net/juju-core/+bug/1307289
[03:53] <_mup_> Bug #1307289: Go panics when running tests on ppc64
[03:53] waigani: jolly good
[03:57] axw: ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:57] ERROR loaded invalid environment configuration: storage-port: expected int, got float64(8040)
[03:57] ^ did this get fixed recently
[03:57] or should I log a bug
[03:58] axw: do you really think that two filtering methods is better than one with a bool?
[03:58] axw: I'll write it and look at the diff
[03:59] thumper: I really do.
With that approach you can see without a doubt that nothing can change the behaviour at runtime; with the bool you need to ensure that nothing changes it
[04:00] ok
[04:01] davecheney: wallyworld fixed that already I think
[04:01] axw: right
[04:01] this is 1.17.8 (ish)
[04:01] yeah, fixed in 1.18.1 I believe
[04:01] yeah, fixed in trunk
[04:01] i think I saw a branch last week
[04:01] right o
[04:04] axw: like this http://paste.ubuntu.com/7247988/ ?
[04:05] thumper: yup
[04:05] thumper: comment on countedFilterLine needs fixing
[04:49] ping jam ?
[04:50] davecheney: /wave
[04:50] jam: i think we're eating the elephant from different ends
[04:50] wrt to the api server memory usage
[04:51] I'm not sure I understand
[04:52] thumper: I'm around whenever you would like to hangout
[04:52] jam: ok
[04:52] in trying to trace down the panics i'm seeing
[04:52] i've sort of discovered just how much memory jujud consumes
[04:52] it's horrific
[04:53] my initial results showed about 0.5MB per agent, which wasn't great, but wasn't terrible. but when something gets into a bad situation, I see memory spike terribly
[04:53] jam: gccgo
[04:54] it's more like 250mb per agent
[04:54] two agents per machine
[04:54] at a minimum
[04:54] wow...
[04:54] it's complicated
[04:54] that's way different
[04:54] gccgo when not using split stacks
[04:54] allocates an 8 mb stack from the heap
[04:54] so that puts the heap under a lot of pressure
[04:55] even if large amounts of that 8mb stack remain uncommitted
[04:55] yeah, 8MB per goroutine would be really bad for how much we use it
[04:55] i'm also seeing strange things that make me think when a client disconnects
[04:56] so I think we have a bug that if a client disconnects in a bad way, it cascades into causing an APIServer restart, but I haven't tracked down the exact issues yet.
[04:56] we're not releasing all the server side resources used by the client
[04:56] in my test
[04:56] 3 machines
[04:56] on the manual provider
[04:56] It might just be that it leaves resources behind, right
[04:56] killing the agents on the service units
[04:56] causes memory usage to almost double
[04:57] with gc and 8k stacks, you won't feel a few leaked goroutines
[04:57] with 8mb stacks
[04:57] yup, you'll feel it
[04:58] $ grep -c goroutine /tmp/out
[04:58] 247
[04:58] ^ starts at 169 for 4 agents
[04:58] after a few restarts of the agents we're up to 247
[04:59] jam: ok, with you in 1m
[05:43] axw: http://paste.ubuntu.com/7248220/
[05:43] i don't get it
[05:43] i did destroy-machine as requested
[05:43] the agents are stopped
[05:43] but I can't destroy the environment
[05:48] davecheney: umm
[05:49] davecheney: if they never disappear from state, seems that's a bug. but you can do destroy-machine --force to clean up manually
[05:50] axw: right
[05:51] thumper: the connection seems to have died
[05:52] jam: google tells me my connectivity is experiencing issues
=== vladk|offline is now known as vladk
[05:57] axw: --force doesn't give me any love
[06:00] davecheney: did it return an error or anything?
[06:00] or just silence?
[06:02] silence
[06:02] davecheney: the provisioner should remove the machine from state when it's dead... it's entirely possible that someone has changed the provisioner so that it doesn't work with manual anymore
[06:03] we need a "no provider left behind" act
[06:04] * davecheney reaches for rm
[06:04] davecheney: destroy-environment --force should work as a last resort, if all the machines really are cleaned up
[06:25] ok, some good news, 3.13.0-24 may fix the issue
[06:26] oh
[06:26] nope
[06:26] hmm
[06:26] hard to tell
[06:26] need more information
=== rogpeppe1 is now known as rogpeppe
[06:43] mornin' all
[06:43] morning rogpeppe
[06:43] 'moin
[06:44] jam: hiya
[06:44] davecheney: yo!
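The goroutine counts quoted at [04:58] make the "8k versus 8mb stacks" point from [04:57] easy to put numbers on. The sketch below is only back-of-the-envelope arithmetic: `stackReserveMB` is a hypothetical helper, and it assumes one fixed-size stack per goroutine, which gives an upper bound on what is reserved rather than what is actually resident.

```go
package main

import "fmt"

// stackReserveMB returns how many megabytes are reserved if each of the
// given goroutines is handed a stack of stackKB kilobytes.
// (Hypothetical helper for the arithmetic only.)
func stackReserveMB(goroutines, stackKB int) float64 {
	return float64(goroutines) * float64(stackKB) / 1024
}

func main() {
	// 169 and 247 are the goroutine counts quoted in the log.
	for _, n := range []int{169, 247} {
		fmt.Printf("%d goroutines: %.2f MB with 8 KB stacks, %.0f MB with 8 MB stacks\n",
			n, stackReserveMB(n, 8), stackReserveMB(n, 8*1024))
	}
}
```

As noted at [04:55], much of each 8 MB stack may stay uncommitted, so resident memory can be well below the reserved figure.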
[06:55] morning rogpeppe
[06:55] axw: hiya
[06:55] rogpeppe: landed the EnsureAvailability MP. is there something else you'd like me to look at now?
[06:56] evening
[06:56] axw: there is one thing that would be awesome if we could do
[06:56] axw: currently we can't upgrade to a HA environment
[06:57] ok
[06:57] axw: because there is no mongo user configured on the admin database
[06:57] axw: we need to change EnsureMongoServer to add one
[06:57] axw: (if necessary)
[06:58] rogpeppe: I guess there's a tonne of other things that need to be done for upgrades too, though? like rewriting mongo scripts? or has nate done that already?
[06:59] axw: the mongo upstart script is already written when necessary (well, actually, it's been disabled for the moment, pending this)
[06:59] I see
[06:59] ok, I will take a look
[06:59] axw: to add the admin user, while the service is stopped, we need to start the mongod in non-authenticated mode
[06:59] axw: then add the admin user in that mode
[07:00] thanks
[07:00] axw: before tearing mongod down again and starting it up normally
[07:00] axw: i did manually verify that that does actually work, but i'm afraid i can't remember the exact steps i used
[07:11] rogpeppe: hiya, i have a reflection question for you if you have a moment
[07:11] wallyworld: sure
[07:12] i have a reflect.Value
[07:12] i want to create a nil value pointer
[07:12] eg reflect.ValueOf((*string)(nil))
[07:12] if it were for a *string
[07:12] but i want to do it dynamically
[07:13] reflect.New(val.Type().Elem()) gives me a pointer to a zero value
[07:13] but i want a pointer to nil that i can use with value.Set()
[07:13] make sense?
[07:13] wallyworld: what would the code in normal Go look like?
use T for the type of the value
[07:14] var foo *T
[07:14] foo = nil
[07:14] foo is a field of a struct
[07:15] i have it working using a switch on the field Kind and using reflect.ValueOf((*string)(nil))
[07:15] but i want to do it without that
[07:15] wallyworld: so you want a nil value of the same type as a pointer to the type of the field?
[07:16] yeah, i think so, so that a call to value.Set() works
[07:16] wallyworld: do you want to actually set the value of the field in the struct?
[07:16] yep
[07:17] wallyworld: i don't think you want a pointer, in that case
[07:17] reflect.ValueOf(*mystruct).Elem().FieldByName(fieldName) is what i use to get the value
[07:17] wallyworld: right, well you can just call Set on the result of that
[07:17] so if val is the result of the above
[07:17] i call Set() yes
[07:17] but i can't find out what to pass to Set()
[07:18] wallyworld: a reflect.Value of the same type as the field...
[07:18] reflect.New(val.Type().Elem()) gives a pointer to "" for example
[07:18] i want to do it dynamically
[07:18] wallyworld: are you just trying to set the field to nil?
[07:18] yes
[07:19] i thought i'd need value.Set()
[07:20] wallyworld: val := reflect.ValueOf(mystructptr).Elem().FieldByName(fieldName); val.Set(reflect.Zero(val.Type()))
[07:20] dimitern: morning. We can do a 1:1 if you would like, though officially that's natefinch's responsibility now.
[07:20] but reflect.Zero() gives me "" doesn't it?
[07:21] rogpeppe: ah, it seems to have worked
[07:21] thank you.
for some reason i was thinking reflect.Zero() would give me the wrong thing
[07:22] wallyworld: you did "reflect.New(val.Type().Elem())"
[07:22] note that Elem is an element of the pointed to type
[07:22] vs
[07:22] reflect.New(val.Type())
[07:22] val.Type() is a pointer, val.Type().Elem() is the actual object
[07:22] and the Zero of a pointer is nil
[07:22] the Zero of a string is ""
[07:22] ah ffs, stupid mistake, thanks
[07:22] (11:18:25 AM) wallyworld: reflect.New(val.Type().Elem()) gives a pointer to "" for example
[07:24] yeah, New is exactly equivalent to the language primitive "new"
[07:30] jam, oh is that so
[07:30] jam, well, i can join the regular meeting?
[08:23] fwereade: looks like we made our N^2 problem with CharmURL worse in 1.18 because of the changes to Upgrade now watching the machine's agent version.
[08:23] This one may not matter *quite* as much in practice, if you aren't deploying multiple units to machines.
[08:24] But in my sim tests, we wake up the Upgrader even more often than we wake up the CharmURL
[08:53] jam1, ha
[08:54] jam1, yeah, I think we write something extra to the machine doc now -- dimitern, do I recall correctly?
[08:54] dimitern, btw can we please undo those errors changes?
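The reflect exchange from [07:11] to [07:24] above boils down to "the zero value of a pointer type is nil". A minimal sketch of that, assuming a made-up `config` struct and a hypothetical `clearField` helper (neither comes from juju):

```go
package main

import (
	"fmt"
	"reflect"
)

// config is an invented struct with pointer fields, used only for the sketch.
type config struct {
	Name *string
	Port *int
}

// clearField sets the named field of the struct that ptr points to back to
// its zero value. For a pointer field that zero value is nil, so no switch
// on the field's Kind is needed. It panics if the field does not exist.
// (Hypothetical helper.)
func clearField(ptr interface{}, fieldName string) {
	val := reflect.ValueOf(ptr).Elem().FieldByName(fieldName)
	val.Set(reflect.Zero(val.Type()))
}

func main() {
	name, port := "juju", 8040
	c := &config{Name: &name, Port: &port}

	clearField(c, "Name")
	fmt.Println(c.Name == nil, *c.Port)

	// reflect.New(T) behaves like new(T): it yields a *T pointing at a zero
	// T, which is why reflect.New(val.Type().Elem()) produced a pointer to
	// "" rather than a nil pointer.
	p := reflect.New(reflect.TypeOf(""))
	fmt.Println(p.Kind(), p.Elem().String() == "")
}
```

Passing the struct by pointer matters: `reflect.ValueOf(ptr).Elem()` is addressable, so `Set` is allowed on its exported fields.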
I added a note to the review but it was already landed ofc
[08:54] fwereade: well, we also wake up every 15 min because the instance poller claims the machine has a new address
[08:55] jam1, yeah, indeed
[08:55] fwereade, I'm working on that now as a follow-up
[08:56] jam1, I cannot figure out how to schedule those sorts of fixes though -- unless we carve out X% of time for paying down tech debt and classify it as that
[08:56] fwereade: well, if we have a client that wants us to scale to 10000 units, we can bill them for it, as well
[08:56] fwereade: ATM, I'm mostly focused on "this is where we're at"
[08:58] jam1, I guess :)
[08:58] jam1, clarity on that front is indeed helpful
[08:59] fwereade: "juju status" with 10k machines actually is doing ok performance wise, but nobody wants 10,000 lines of output
[08:59] jam1, indeed
[08:59] so there are quite a few things that would need tweaking to scale to that level
[09:00] fwereade: though for *testing* purposes, the N^2 stuff bites me in the ass a lot. "juju add-unit" to add another 100 units each to 19 machines takes: 200s, 400s, 1200s, 2800s, and I'll let you know when it finishes seconds.
[09:01] jam1: 1-1?
[09:01] jam1, yeah -- I kinda feel like those sorts of issues are... they should work properly *now*
[09:01] rogpeppe: I just need to switch machines, 1 sec
[09:02] jam1, but, ehh, prioritisation :/
[09:03] dimitern: so it looks like Canonical admin got it backwards, it's actually you on my team and roger's on nate's team.
[09:04] dimitern: so I think everyone is still on the same standup for now
[09:06] jam, what team am i supposed to be on?
[09:10] dimitern: so looking at Alexis's email about Nate and Ian, you're on my team
[09:12] jam, yeah, I thought so
[09:56] jam: you've frozen...
[09:56] rogpeppe: I got logged out of my google account somehow
[09:56] end of month?
[09:56] jam: perhaps
[09:58] morning
[10:00] mgz: 1:1?
(just running to the restroom myself)
[10:00] sure, I'll wait for you there
[10:01] ...the hangout, not the restroom
[10:07] wallyworld: I can get TestUpgradeJujuWithRealUpload to pass by patching sync.BuildToolsTarball but not when I patch envtools.BundleTools
[10:08] wallyworld: here is my attempt at mocking out bundleTools: http://pastebin.ubuntu.com/7248910/
[10:08] what is the error?
[10:08] wallyworld: ... and http://pastebin.ubuntu.com/7248930/
[10:08] wallyworld: error uploading tools: no tools uploaded
[10:12] waigani: why is the bundle tools mock uploading tools as well?
[10:12] it shouldn't be doing that
[10:13] wallyworld: good question! I just read the logic, let me give that another go ...
[10:13] that is my guess as to what the error is, as there would be no metadata or anything
[10:13] wallyworld: I basically ripped the logic out of BuildToolsTarball
[10:13] upload tools needs the tarball and also metadata
[10:14] wallyworld: right, let me try again
[10:30] evilnickveitch, the links on https://juju.ubuntu.com/docs/ looked foobared to me - are you aware?
[10:30] jamespage, ooh. they were working yesterday. let me have a look
[10:31] fwereade: morning, are you around?
[10:32] jamespage, hmm. seem to be working for me - was there a particular page or link that wasn't working for you?
[10:33] evilnickveitch, the links on the lhs of the page don't appear for me
[10:33] jamespage, the links are pasted in by a bit of javascript at the end of the page
[10:33] so either the js isn't loading
[10:34] because something is messed up on that page, or the page isn't loaded
[10:34] hmm
[10:34] have you tried refreshing etc?
[10:35] are you sure page has finished loading? some external assets take a while to load sometimes, and the link JS is right at the end
[10:37] evilnickveitch: quick test here on FF show no links too
[10:37] evilnickveitch: jamespage is right
[10:38] TheMue, jamespage okay, I guess mine was fetching from cache.
i will check into it
[10:43] TheMue, jamespage okay, I found the problem, some wonky HTML which prevents the rest of the page loading, it's only on the front page, the others should work fine
[10:43] I will fix it ASAP
[10:44] evilnickveitch: Great, thanks.
[10:45] evilnickveitch: do you not validate? :P
[10:46] mgz, it was the stupid linter that caused the problem :P
[10:46] evilnickveitch: :D
[10:47] perrito666, sorry, I completely missed you there
[10:55] fwereade: happens :)
[10:57] fwereade: still missing the transaction hooks tests but https://codereview.appspot.com/86430043 I did ignore some of your comments because they broke functionality :) but I am willing to re-try once I make sure this goes the right way (although my assert is either broken or making blow an error existing that was not being discovered bc I am failing 5 tests) https://codereview.appspot.com/86430043
[10:58] perrito666, cheers, I'll take a look
[10:59] wallyworld: I exported tools.archive: http://pastebin.ubuntu.com/7249092 (tests pass now)
[10:59] great, in standup, will look later
[11:00] ah
[11:00] had a quick look, looks nice and simple
[11:00] like i'd hoped
[11:01] yeah, just hope it's okay that I've made Archive public - adding noise to the API?
[11:01] anyway, I'll leave it for the review
[11:04] perrito666, rogpeppe has a deepcopy package that may help with cloning
[11:04] fwereade, perrito666: it doesn't work any more
[11:04] rogpeppe, bah
[11:04] fwereade: ah, might be much better than the by-hand copy I am doing there...
[11:04] rogpeppe: :(
[11:04] it was trying to be too clever
[11:05] perrito666: what are you copying?
[11:05] rogpeppe: units and machines
[11:05] perrito666: why?
[11:06] perrito666: is it just for testing?
[11:06] rogpeppe: sorry I was listening on the other side :) no, not just for testing
[11:07] trying to get a copy that ensures me won't change while I am working in it in certain circumstances (I am just making a method of something previously done by hand)
[11:27] perrito666: what are you actually trying to do?
[11:28] rogpeppe, clone state.Machine/Unit -- I commented that it'd be nice to do it properly
[11:28] fwereade: ah
[11:28] rogpeppe, there are a few places we do it in varyingly hackish ways iirc
[11:29] TheMue, jamespage docs should be working now
[11:29] evilnickveitch, ta - next question - do release notes get published on /docs ?
[11:30] jamespage, very good question - not as yet, but I do have a branch that will add them to the reference section. At least for the ones I can find
[11:30] Check back after 7.30pm
[11:31] evilnickveitch, it's something that ceph does quite well upstream
[11:31] jamespage, cool, I will check out what they do. I was just intending to dump them all in newest first order with an index of links at the top
[11:32] fwereade, perrito666: two thoughts: 1) we could probably avoid doing a deep copy of the machineDoc, as we don't allow mutation of pieces inside its components
[11:33] rogpeppe, I suspect that statement is only mostly accurate
[11:33] fwereade, perrito666: 2) if we decided to, it would be easy (but not greatly efficient) to clone by serialising/deserialising through bson
[11:34] rogpeppe, perrito666: ha, I could live with that
[11:35] fwereade: tbh i think it's reasonable to have methods that return mutable values with a stipulation that you should not modify the contents
[11:36] fwereade: (i presume you're thinking about the Jobs method here)
[11:38] fwereade: if we did that, then Clone could be ultra cheap
[11:41] rogpeppe: got around to finishing the last few test failures: https://codereview.appspot.com/87540043
[11:42] mgz: thanks. looking.
[11:43] mgz: LGTM
[11:43] rogpeppe: thanks!
[11:48] oops, upgrade-juju seems to have killed its own environment [11:49] * rogpeppe hates it when that happens [11:53] hmm, this is the second time this morning i've had a live bootstrap fail with this error: [11:53] 2014-04-14 11:53:05 ERROR juju.cmd supercommand.go:299 cannot write file "tools/releases/juju-1.19.0.1-precise-amd64.tgz" to control bucket: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. === psivaa is now known as psivaa-lunch [12:15] rogpeppe, perrito666: if the clone were an internal method with "do not modify result" I'd be fine [12:16] rogpeppe, perrito666: if it's exported there's just way too much opportunity to screw up at a distance [12:16] fwereade: i'm thinking of the Jobs method only [12:16] rogpeppe, does Jobs not copy? it should ;p [12:17] * perrito666 sees another ball coming his way :p [12:17] fwereade: well, if Jobs copies, then why do we need a deep copy of the machine doc? [12:17] rogpeppe, plenty of methods mutate bits and pieces of state [12:17] fwereade: (it doesn't, BTW, but it probably should) [12:18] fwereade: which methods mutate stuff that's pointed to by the machine doc, rather than machine doc fields themselves? [12:18] rogpeppe, and considering current cases misses the point; things will change and if we expose this functionality without insulating the objects frm one another we *will* screw it up [12:19] fwereade: if we make all methods return copies of the underlying data, what is there to screw up? [12:22] fwereade: AFAICS this should be fine: func (m *Machine) Clone() *Machine { m1 := *m; return &m1} [12:23] rogpeppe, various methods write to the document on success [12:23] fwereade: that's fine [12:24] fwereade: the document is stored in the Machine as a value type. 
as long as none of our Machine methods change things that are pointed to by things in the machine doc, we're ok [12:24] rogpeppe, we can be sure that none of them will ever change, say, a slice on the document? [12:24] fwereade: that's not a hard invariant to maintain (it's local) [12:24] fwereade: we can be sure they don't now, and it's not hard to verify that in the future [12:25] rogpeppe, my experience is that it's a very difficult invariant to maintain, even with a team of ultra-smart people half the size of this one [12:25] fwereade: i think it's better than adding to memory pressure and writing a bunch more code that needs to be maintained every time a field is changed. [12:26] rogpeppe, if we're exporting a Clone method, that clone method must deep-copy the data [12:26] rogpeppe, if it's not exported I'm willing to be a bit laxer [12:27] rogpeppe, not because it won;t screw us, because it *will* [12:27] rogpeppe, but because at least the scope of the weirdness will be small enough that we'll have a chance of dealing with it [12:27] fwereade: tbh, i would prefer us to make Machine etc immutable [12:27] fwereade: i don't think we gain much by having methods mutate our local idea of state [12:29] rogpeppe, that's probably a reasonable position, especially considering current usage, but it's not really on the table at the moment [12:29] rogpeppe, in terms of potentially fiddly changes, errgo has a much bigger payoff ;p [12:29] * fwereade needs to go to the airport, hadn;t realised he was flying so early [12:29] * fwereade will say hi again this evening if he can [12:29] rogpeppe: gotta help with my daughter for a bit, probably be 45-60 mins [12:30] fwereade: where are you going? 
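Note on the Clone discussion above: a minimal sketch of the aliasing concern, assuming a cut-down document type. `machineDoc`, its `Jobs` field, and the `ShallowClone`/`DeepClone` names are hypothetical stand-ins for illustration, not juju's actual state types. It shows why the one-line copy is safe for plain fields but not for anything reached through a slice.

```go
package main

import "fmt"

// machineDoc is a hypothetical, cut-down stand-in for the document
// a state.Machine wraps; the real type has many more fields.
type machineDoc struct {
	Id   string
	Jobs []string
}

// Machine holds its document by value, as described in the discussion.
type Machine struct {
	doc machineDoc
}

// ShallowClone is the one-liner proposed above: it copies the struct,
// but the Jobs slice header still points at the same backing array.
func (m *Machine) ShallowClone() *Machine {
	m1 := *m
	return &m1
}

// DeepClone additionally copies the slice, so later in-place writes
// to one Machine's Jobs cannot be seen through the other.
func (m *Machine) DeepClone() *Machine {
	m1 := *m
	m1.doc.Jobs = append([]string(nil), m.doc.Jobs...)
	return &m1
}

func main() {
	orig := &Machine{doc: machineDoc{Id: "0", Jobs: []string{"host-units"}}}
	shallow := orig.ShallowClone()
	deep := orig.DeepClone()

	// Replacing a field on the clone is safe: the doc is a value.
	shallow.doc.Id = "1"
	// Writing through the slice is not: both share one backing array.
	shallow.doc.Jobs[0] = "manage-environ"

	fmt.Println(orig.doc.Id, orig.doc.Jobs[0], deep.doc.Jobs[0])
}
```

The shallow version stays correct only while no method writes through a slice or map held by the document, which is the invariant debated above. The deep version costs an allocation per clone and has to be kept in step with every new reference-typed field.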
[12:30] natefinch: ok [12:30] rogpeppe: well have to wait until he returns to know :p [12:45] ha, i have a machine where provisioning failed (amazon says "Server.InternalError: Internal error on launch") but i can't call retry-provisioning because the machine isn't in an error state [12:52] rogpeppe, mgz, errors package improvements - https://codereview.appspot.com/87560043 - it's a bit big, but most changes are renames [12:53] ...scary [12:56] dimitern: the Suffix field looks like it's not used - is it? [13:01] dimitern: similarly ArgsConstructor doesn't appear to be used [13:01] rogpeppe, it's used in tests only [13:02] dimitern: right. i'm not sure we need to pollute the production code with test-specific functionality. [13:02] rogpeppe, allErrors is unexported - how does it pollute? [13:02] dimitern: it makes the code more complex [13:03] rogpeppe, so you're saying let's have 2 almost identical []struct{} defined - one for testing, the other for production? [13:03] dimitern: i don't think you need the table at all in the production code - i'm just writing up a suggestion [13:04] rogpeppe, if it stays like this there's less chance of forgetting to add a new error type to allErrors and have it tested [13:04] rogpeppe, ok, thanks [13:06] jamespage, Do you know who I can show Bug #1305280 to get an apparmor issue addressed? [13:06] <_mup_> Bug #1305280: juju command get_cgroup fails when creating new machines, local provider arm32 [13:07] hi sinzui, I had some CI things I wanted to work with you on [13:07] hi jam [13:08] sinzui: specifically, looking at the log files, you're using "juju-1.18.0" to do "scp" [13:08] which is known broken for you [13:08] and we released 1.18.1 with that specific fix for you [13:08] though beyond that, "juju scp" always requires the API server to be functioning, which is what is breaking in the "upgrade" test [13:08] so it might be nice if we tried to use raw "scp" if we can. 
[13:08] either try to raw scp first, or try "juju scp" first and fall back [13:09] sinzui: we can get the API IP address from the environment/foo.jenv file [13:09] jam, yes, abentley and I discussed the fallback [13:09] (If we've ever connected successfully, we'll be caching the value there, and we'd like to get to the point where we cache it at the end of bootstrap) [13:10] sinzui: I'm trying to debug the local bootstrap problem. I haven't reproduced it yet, but I'm currently on Trusty [13:10] jam, and we can also update to 1.18.1 today [13:10] so I have to fire up a Precise instance first === psivaa-lunch is now known as psivaa [13:11] jam, interesting. http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/aws-upgrade-trusty/ shows trusty is upgrade fails in parallel to precise [13:12] * sinzui starts update and upgrade [13:14] jam, I can re-run an upgrade test for the cloud and series of your choice [13:14] sinzui: but that is not local [13:14] 1.18.1 is installed [13:15] now [13:15] I'm just starting with trying to fix the local-deploy issue [13:15] so I can try to not fire up a remote machine just to debug upgrade [13:18] sinzui: the main question about local right now is that probably the version of mongod running on precise is different from trusty [13:18] so while I think we also have an upgrade bug [13:18] It might be that bootstrap is failing because trusty has 2.4.9 (which works for us), and Precise is running 2.4.6 or something [13:19] I just realized my test won't work, as local under LXC doesn't work [13:21] dimitern: reviewed (kinda) [13:22] rogpeppe, cheers, I have the next one for you btw :) - it's tiny https://codereview.appspot.com/87470044 [13:23] dimitern: i agree that the mgo/txn docs could be clearer, BTW [13:24] rogpeppe, no doubt about it [13:24] dimitern: LGTM [13:25] rogpeppe, ta! 
[13:44] rogpeppe, updated https://codereview.appspot.com/87560043 - it's nicer now I think [13:44] axw: ha, it seems that 7 maximum parallel try attempts is way too small for real world API dialling [13:45] BTW I now have a functional environment where I destroyed the bootstrap instance [13:45] * axw tries to remember why it's 7 [13:45] cool :) [13:46] * dimitern will bbiab (1h) [13:47] axw: in my environment with 3 state servers, i see 21 addresses cached in the .jenv file... [13:47] rogpeppe: I guess I was thinking one per state server, but we will need more for each address type... [13:47] axw: and because each dial attempt takes ages to time out, we don't get to try the second valid address until it has. [13:48] axw: i'm tempted to just allow unlimited concurrent dials [13:49] rogpeppe: how do you have 21 addresses for 3 state servers? [13:49] why so many? [13:49] axw: http://paste.ubuntu.com/7249818/ [13:49] axw: machine-local addresses, ipv6 addresses, etc [13:50] rogpeppe: we are ignoring the machine-local ones, right? [13:50] that number should get filtered a little, yeah [13:50] I should know the answer to this :) [13:50] axw: no, not in api dial, because we don't currently store the metadata in the jenv [13:50] axw: that needs to be fixed [13:50] ah right, yeah [13:51] rogpeppe: so really, I think we'd have 2*state-server for both public and internal, if we had that [13:51] still, the point remains that you probably want to try dialling all your api server addresses at once, because sod's law says that the one you don't try is the only one that works [13:52] true. there's the private-inside-private scenario to cater for [13:52] axw: probably 4, because DNS-name vs numeric [13:52] rogpeppe: I was thinking public IP & name, but yes we do need to try private too [13:52] axw: yup [13:52] jamespage: sinzui: do we know why cloud-archive:tools only has juju-1.16.3 ? 
[13:53] jam: its called a blocked SRU [13:53] can't get into cloud-tools before you go into saucy [13:54] rogpeppe: I guess we can do unlimited... if we get in trouble, we could try a more complicated initially-short but expanding timeout [13:55] jam: we don't have an MRE yet so I have to detail how to test every bug in full - see bug 1277526 [13:55] <_mup_> Bug #1277526: [SRU] juju-core 1.16.6 point release tracker [13:55] jamespage: ouch. Going from 1.16.6 => 1.18.X is going to be a massive PITA for that. [13:55] jam: there is not SRU for 1.16.6 -> 1.18.x [13:56] jamespage: I realize there isn't (yet), but wouldn't the plan be to have the stable version of Juju in cloud-archive:tools ? [13:56] jam: that happens when cloud-archive:tools get's superceded by the trusty version [13:56] superceeded/replaced [13:57] jam: actually - while I'm thinking about this - how does backup/restore work on 1.16.6? [13:57] I see the update-bootstrap-node stuff mgz did in the bug list [13:57] jamespage: AFAIK it works through all the 1.16's because that is what we wrote it against. [13:58] jam: but for 1.16.x there is no backup or restore plugin? [13:58] jamespage: I think it was added in 1.16.5 ? [13:58] really? [13:58] jamespage: we added it for CTS [13:59] yeah, it was a bit of a fudge for minor version [14:02] rogpeppe: hey, sorry, that took a lot longer than expected, obviously. [14:02] natefinch: i'm just about to go to lunch [14:03] rogpeppe: ok, where are we right now? [14:03] natefinch: i've had a mostly-success with my integrated branch [14:04] natefinch: two things we need to fix: agent.Config.StateInfo needs to return localhost always [14:04] natefinch: api.Open should try all addresses concurrently [14:04] natefinch: oh, and one other one line fix [14:05] natefinch: APIWorker needs to fetch agent config again after dialling [14:06] rogpeppe: ok, I can start working on those. Should I branch off your branch or just do that in a new branch off trunk? 
[14:06] natefinch: i'd just do new branches off trunk [14:07] natefinch: they're all trivial [14:07] rogpeppe: yep, ok [14:15] jamespage: "juju-local" doesn't seem to depend on rsyslog-gnutls [14:16] ah, maybe it does now, but upgrade didn't do it? [14:16] weird [14:16] it does [14:16] jamespage: I thought I did apt-get upgrade, but I had to "apt-get install juju-local" again to get it [14:17] jamespage: anyway, it looks like 1.18.1 does depend on it, so thanks for that, sorry about the confusion [14:17] np [14:26] sinzui: I'm unable to reproduce the "local bootstrap" failure with trunk and cloud-archive:tools version of mongo (2.4.6) [14:26] I see the replicaSet line, but it doesn't fail [14:32] jam, I don't know which bug you are working on. The lxc bug I know of is about apparmor: bug 1305280 [14:32] <_mup_> Bug #1305280: juju command get_cgroup fails when creating new machines, local provider arm32 [14:32] sinzui: https://bugs.launchpad.net/juju-core/+bug/1306212 [14:32] <_mup_> Bug #1306212: juju bootstrap fails with local provider [14:33] sinzui: since I can't reproduce that right now, I'm switching to https://bugs.launchpad.net/juju-core/+bug/1307450 [14:33] <_mup_> Bug #1307450: upgrading from 1.18.1 to 1.19 (trunk) fails (API server stops responding) [14:34] jam: please do [14:38] sinzui: so offhand, we have a different bug, which is that "juju upgrade-juju --upload-tools" doesn't end up putting the tools where the agents can find them. :( [14:39] damn [14:40] sinzui: it looks like it uploads the tools, but doesn't make it readable [14:40] jam, We would be happy if local-provider honoures tools-metadata-url. 
We want to set it to a testing stream since local has to use streams to get tools for different series [14:40] jam, but I won't redirect you for delivering the fastest fix [14:41] sinzui: well this is testing "juju-1.19.0 upgrade-juju --upload-tools" [14:41] which should be working, but something isn't right [14:43] morning all (and good evening) [14:44] sinzui: sorry I couldn't get any farther on this, but I have to EOD [14:44] wallyworld wanted to pick it up in the morning [14:44] Thank you for you time jam [14:44] sinzui: and I think he was the one who did the changes to "upload-tools" so he probably has better insight there [14:47] alexisb: morning alexis (I think the convention is just to use the greeting relative to your own time zone... everyone knows what you mean :) [14:47] morning alexisb [14:48] you're up awfully early [14:48] sinzui: launchpad Q. If I have sensitive all-machines.log, can I upload it as a private attachment? [14:52] jam No private attachment :( [14:53] sinzui: fortunately VIM can global search for the secrets and replace them with XXX without too much trouble [14:57] axw: ping [14:57] alexisb: hiya [14:58] sinzui: hmm... It looks like "juju bootstrap" started creating i386 instances, and you cant "upgrade-juju --upload-tools" with a 64-bit version [14:58] it will let you, but it can't find the i386 tools (for obvious reasons) [14:58] rogpeppe: hey [14:59] jam, not that early 8am for me [14:59] sinzui: can you check if your 1.18.1 bootstrapped instances are i386 ? [14:59] it was for me [14:59] which is also a bug [14:59] axw: the existing code doesn't seem to mention juju-mongodb [14:59] axw: do you know how we should tell if it's available? [14:59] axw: (looking at your comments on https://codereview.appspot.com/86920043 ) [15:00] alexisb: well, you were on a bit earlier, but I did the math wrong. 11 hours makes you 1 hour closer, not 1 hour farther away [15:00] rogpeppe: right. no, I don't. 
I guess it just hasn't been done yet - so that can be TODO [15:00] I thought it was 5:30 ish [15:00] axw: ok, cool [15:01] rogpeppe: this upgrade thing is a massive PITA. may take me a little while yet to come up with a nice solution [15:01] axw: where do the main difficulties lie? [15:02] rogpeppe: upgrade steps require API server & state, API server dies when state gets bounced [15:02] axw: don't do it in upgrade steps [15:02] axw: do it in EnsureMongo [15:02] sinzui: so I have a bit more I can try to go on tomorrow, or *maybe* later tonight depending on how things go. [15:02] axw: where we're already stopping and restarting the service [15:03] jam, okay. I am still looking for the arch that was used [15:03] rogpeppe: I *think* there's a problem then that server.pem may not exist [15:03] err [15:03] maybe not that one [15:03] there was another file that was created on upgrade [15:03] sinzui: I think we have a 1.18.2 Critical bug that 1.18.X no longer prefers amd64 [15:03] axw: EnsureMongoServer is responsible for writing out the files that mongo requires, so we *should* be ok, i think [15:03] rogpeppe: anyway. I did start down that path... I'll keep looking into it tomorrow [15:04] ok [15:04] axw: thanks a lot [15:04] I *think* someone commented that it was because of PPC/ARM64 enablement [15:04] (we can't force amd64, so we let the cloud tell us what to use, but that means if both i386 and amd64 are available we now do i386, when we should do amd64 if possible) [15:04] sleepy time.. night all [15:04] sinzui: I do believe you can force it with: juju bootstrap --constraints="arch=amd64" [15:05] axw: BTW the reason for moving InitiateMongoServer into peergrouper is... [15:05] and now, I really must go spend time with my family :) [15:05] too late! [15:09] jam. CI has started a new round of tests. These will use 1.18.1. I will watch them for arch mismatches [15:11] natefinch: ping [15:11] rogpeppe: hi [15:12] natefinch: hangout? 
[15:12] rogpeppe: sure [15:12] natefinch: https://plus.google.com/hangouts/_/canonical.com/juju-core?authuser=1 [15:39] could someone have a look at this please? we've addressed comments, but it still needs a LGTM and it's a major blocker for HA. https://codereview.appspot.com/86920043/ [15:41] jam: I don't see an arch mismatch deploying 1.18.1. CI doesn't use upload-tools when deploying stable (since upload-tools is officially an developer feature) [15:41] * sinzui tries locally [15:41] dimitern, mgz, jam, ping on the review above that roger posted [15:57] trivial review anyone? https://codereview.appspot.com/87560044 [15:58] dimitern, mgz, jam: ^ [16:03] rogpeppe, looking [16:03] dimitern: ta! [16:04] rogpeppe, i'd swap you for https://codereview.appspot.com/87560043 :) [16:05] dimitern: will do, after i've finished investigating this issue [16:06] rogpeppe, sure, np - just reminding [16:06] rogpeppe, LGTM [16:07] dimitern: we really really need a review of https://codereview.appspot.com/86920043/ if you could muster the energy for it [16:07] dimitern: but thanks for that review too :-) [16:08] rogpeppe, looking that one as well [16:08] dimitern: much appreciated [16:33] rogpeppe: on https://codereview.appspot.com/87560044/ is there something about direct State destruction that we lose with your patch? [16:33] jam: no [16:33] jam: AFIK [16:33] AFAIK [16:36] jam: we only connect to the API if we don't use --force, and in that case we really want to use the usual API connection methods [16:53] rogpeppe, natefinch, that HA CL LGTM with some trivials [16:54] dimitern: thanks muchlu [16:54] y [16:54] rogpeppe, i'll poke you again about https://codereview.appspot.com/87560043 though :) (last time for today) [16:54] dimitern: ok, will look now :-) [16:55] rogpeppe, tyvm! 
[16:58] dimitern: the only comment i might have would be that it might be more idiomatic to have the error types themselves as pointer types, embedding wrapper as a value [16:59] dimitern: in fact, i think that's definitely worth doing [16:59] dimitern: because it means that %#v will work better on errors [16:59] dimitern: so: type notFound {wrapper} [16:59] dimitern: and func (*notFound) new( etc [17:44] rogpeppe, ok, that sgtm [17:45] rogpeppe, did I see LGTM as well? :)\ [17:45] dimitern: i really think those tests could use sorting out [17:45] rogpeppe, which ones? [17:45] dimitern: i've been struggling to understand the logic [17:45] dimitern: errors_test.go [17:45] dimitern: after some effort, i think i've managed to tease out a suggestion [17:45] rogpeppe, for each error in allErrors I add like 20ish cases [17:46] rogpeppe, I didn't want to repeat the same tests for all types and possibly miss something along the way [17:46] dimitern: i know, but the logic is quite a bit more complex than it needs to be [17:46] dimitern: lines 180 to 190 are really hard to follow [17:47] dimitern: and the errorSatisfier type doesn't seem to be doing much any more [17:47] rogpeppe, I confess I kept it only for the String() method [17:48] dimitern: yeah, it feels like a weird holdover [17:48] s/holdover/relic/ [17:48] dimitern: and you don't even need the String method for what you're using it for [17:49] rogpeppe, I need a way to compare 2 satisfiers (== or !=) and i can't do it with func pointers it seems [17:49] dimitern: you could have two nested loops over allErrors [17:50] rogpeppe, isn't that worse than using reflect? 
[17:50] dimitern: then you just need to compare indexes (or perhaps pointers if you prefer) [17:50] dimitern: it's certainly simpler [17:50] dimitern: so i think it's better [17:51] rogpeppe, but I have test.satisfier and allErrors[i].satisfier [17:51] dimitern: you don't need test.satisfier [17:51] rogpeppe, I can't just compare them and the indexes don't matter [17:51] dimitern: the only reason you have that is that you're mixing in nil satisfier tests [17:51] dimitern: they don't really fit, and they complicate all the logic [17:51] rogpeppe, hmm.. [17:52] rogpeppe, I guess I can make a separate set of tests + loop in another test case for nils [17:52] dimitern: i'd move the contextf tests into their own function too [17:52] dimitern: it's really a totally independent function [17:52] rogpeppe, but it needs to loop over allErrors as well [17:53] rogpeppe, ok, can be done separately I agree [17:53] dimitern: not necessarily [17:53] dimitern: its logic is independent of allErrors [17:53] dimitern: you do need to check that each error implements the newer interface, but that's easy to check statically [17:54] rogpeppe, the origin of this CL is the behavior of ErrorContextf - I need to check each error type is preserved [17:54] dimitern: fair enough. but that's a very simple test and loop over allErrors. [17:54] rogpeppe, yeah, but that's an implementation detail that you, as a user of Contextf doesn't need to know [17:55] rogpeppe, exactly [17:55] natefinch: https://codereview.appspot.com/87570043/ <= log the version of mongo as we create the upstart job [17:56] rogpeppe, ok, I appreciate your comments and will look at it a bit later or tomorrow [17:56] * dimitern reached eod [17:56] dimitern: np, sorry for the push-back. [17:57] rogpeppe, not to worry - it was useful :) [18:05] sinzui: just in case it wasn't clear, "juju upgrade-juju --upload-tools" failed because bootstrap picked an i386, but upload-tools can only upload the amd64 that I'm running. 
[18:05] so it was a combination of bug #1304407 [18:05] <_mup_> Bug #1304407: juju bootstrap defaults to i386 [18:06] and bug #1282869 [18:06] <_mup_> Bug #1282869: juju bootstrap --upload-tools does not honor the arch of the machine being created [18:06] o O (clue x 4) [18:07] sinzui: so I'm going to try it again and see if I can reproduce the failing to upgrade (for the right reason) [18:07] sinzui: though it looks like bug #1282869 isn't quite complete, as we fixed "bootstrap" but not "upgrade-juju" [18:07] <_mup_> Bug #1282869: juju bootstrap --upload-tools does not honor the arch of the machine being created [18:12] sinzui: I reproduced the "cannot upgrade to 1.19.0" bug: 2014-04-14 18:11:40 ERROR juju runner.go:220 worker: exited "state": cannot log in to admin database as "machine-0": unauthorized mongo access: auth fails [18:12] natefinch: ^^ [18:26] rogpeppe: if you're still around, found the upgrade bug [18:26] specifically, 1.19.0 always tries to login to the "admin" db [18:26] jam: really? cool. [18:26] but on an upgrade, it doesn't have rights as machine-0 [18:26] jam: oh of course, dammit [18:27] rogpeppe: so... do we back out logging into admin, do we make it "try but be ok if it fails" ? [18:27] rogpeppe: if we aren't going to do the full "upgrade support for HA" then we need to put in hacks [18:27] jam: i think we've got to do the latter [18:27] jam: otherwise HA won't work even when not upgraded [18:28] rogpeppe: so out of curiousit, why are we doing "admin := session.DB(AdminUser)" I realize the name of the db is "admin" but that shouldn't be AdminUser should it? 
[18:28] jam: hmm [18:29] rogpeppe: it is just that we're using the "admin" as the name of the user as the name of the DB [18:29] mostly just a constant that "works" but isn't actually the right named constant [18:30] jam: yeah, it does seem odd [18:31] rogpeppe: k, the other Database names are just hard-coded strings in the function, so I'll follow suit for clarity [18:33] jam: sgtm [18:33] jam: personally i like hard-coded strings anyway - i think they're often clearer [18:35] jam: DB(AdminUser) does seem wrong to me. i don't know what i was thinking. [18:36] rogpeppe: is there an obvious way how to remove an agent from admin? (I'd like to add a test that we come up ok when we can't access 'admin' as we'd run into after upgrade) [18:37] pwd [18:37] afaict we don't do anything with the "admin" db we just logged into [18:37] at least not directly [18:37] the other DB objects in that func are put into the State object [18:37] jam: st.db.SessionDB("admin").RemoveUser(AdminUser) [18:38] rogpeppe: thanks [18:38] well, in this case "RemoveUser(info.Tag)" aka ("machine-0" [18:38] jam: yeah [18:38] jam: no, we don't do anything with the admin db [18:39] jam: but we do need access to it for manipulating the replica set [18:39] rogpeppe: right, it allows you to call particular functions *if* you're logged in [18:39] jam: yeah [18:39] side-effect is on Mongo side [18:41] rogpeppe: presumably we also need to change State.setMongoPassword to allow for AddUser on the "admin" table to fail? 
[18:41] or we shouldn't ever be creating one of those [18:42] since we can't be in HA we shouldn't be adding any machines that would want to [18:42] jam: yeah [18:43] jam: we should change EnsureAvailability to fail if we're not in replica set mode [18:43] jam: that way people can't get themselves into a nasty twist [18:57] rogpeppe: &mgo.LastError{Err:"not authorized to remove from admin.system.users", [18:57] jam: hmm [18:57] jam: i suppose it might have removed the user anyway [18:58] that is after trying to do: [18:58] adminDB := s.state.db.Session.DB("admin") [18:58] password := testing.FakeConfig()["admin-secret"].(string) [18:58] adminDB.Login(AdminUser, password) [18:58] so theoretically ensuring that I'm admin, though I need to check the err code [18:58] auth fails ... [18:59] rogpeppe: from what I can tell, TestingInitialize returns State object that isn't actually logged into the Admin db [18:59] TestingInfo doesn't have a password [18:59] rogpeppe: what is *really* strange is that SetMongoPassword was perfectly happy, which *should* be setting the password in "admiN" [19:00] so you are authed to add people, but not remove them? [19:00] jam: it does seem odd [19:00] jam: mongo has some weird semantics sometimes [19:09] rogpeppe: so I haven't figured out the password for "admin", but I have found that if I call Machine.SetMongoPassword() I can then log into "admin" as "machine-0" with the password I just gave it, and then use *those* credentials to remove the "machine-0" user. [19:09] WTWTTWW WTF? 
[19:09] lol mongo is wacky [19:10] natefinch: yeah, so $CURRENT_USER can add admins, but can't remove them, but you can add one, log in as it, and then do whatever-you-want [19:13] natefinch: apparently the model changed in mongo 2.6: http://docs.mongodb.org/manual/reference/method/db.addUser/ [19:13] rogpeppe: from what I can tell, calling adminDB.RemoveUser("machine-0") removes it completely, and not just from admin [19:14] jam: ha [19:14] jam: so i guess you'll have to remove it, then add it back to the ones you want it on [19:15] rogpeppe: actually, looks like I was just screwing up the password, so I need to try again [19:16] finally, failing in the way I wanted [19:16] and now success [19:16] jam: so mongo wasn't being weird at all, in fact? [19:17] rogpeppe: well, I still have to log in as the agent I just created to delete it [19:17] that is still weird-as-fuck [19:17] jam: ah yes [19:17] but removing it from admin only removes it from admin [19:22] I think we want a juju-local-kvm package to sort of kvm deps. juju-local is lxc centric [19:26] rogpeppe: lp:~natefinch/juju-core/043-localstateinfo [19:31] natefinch: rogpeppe: It makes me wonder if we couldn't just add ourselves if we weren't in admin to start with ... [19:32] jam: i *think* i tried that [19:32] jam: but try it anyway [19:34] rogpeppe: natefinch: sinzui: https://code.launchpad.net/~jameinel/juju-core/soft-login-failure-1307450/+merge/215742 [19:34] I was able to reproduce the upgrade failure with the local provider [19:34] and that patch lets it get further [19:34] jam: codereview? [19:34] \o/ [19:34] rogpeppe: lbox is thinking about it [19:35] sinzui: not that there won't be any other bugs, but the first one I think I got [19:35] rogpeppe: weird, still thinking [19:35] rogpeppe: https://codereview.appspot.com/87730043 [19:36] rogpeppe: natefinch: I'm off to sleep, unfortunately, so if it needs tweaks, I'm sure curtis would appreciate you picking it up from here. 
[19:36] jam: i like "haven't implemented bug #xxxx" - sounds like we want to implement a bug... [19:36] or you can point wallyworld at it when he gets up [19:36] jam:cool [19:36] jam: thanks [19:41] natefinch: FWIW, the last remaining diffs that we haven't already got branches in progress for: http://paste.ubuntu.com/7251459/ [19:41] rogpeppe: wow, that's awesome [21:09] sinzui: so swift remains dead to us. jenkins for charmworld uses juju to update to the newly blessed code, so staging is now stuck and useless. === marrusl is now known as marrusl_afk [21:09] bac: yep [21:09] * bac is sad [21:10] bac: our only option at this time would be to replace the stack under our personal credentials, but we also need different public IPs [21:10] sinzui: why the last part? [21:11] sorry, that was cryptic, sinzui why do we need new public IPs? because we can't wrest them away from the current assignees? [21:11] bac: public IPs are not shareable or transferable between accounts [21:11] hi thumper [21:11] o/ [21:11] sinzui: oh. can they be revoked from orange and given to us? [21:12] bac: if we wanted to preserve the current IPs we need to revoke then hope we get the same ones when we allocate new ips [21:12] oi goi oi [21:12] sinzui: I'm going to test bootstrapping on precise [21:12] or, dios mio as they say here [21:12] sinzui: I have a precise machine here [21:12] bac: my success rate is is 25% in my attempts to get an IP I had in another account [21:12] what is our ppa for precise stuff? [21:13] sinzui: so, what's another RT to update dns in the grand scheme? [21:14] morning all [21:14] sinzui: so it looks like i need to push a change directly to production without running on staging first. guess i'll wait until the morning. [21:15] thumper, while you slept I worked out how to use debug-log [21:16] sinzui: okay... [21:16] bac: yes lets wait till the morning. 
I can think about how to make the machine do an update like the charm would [21:16] thumper, I will ping you when I would like your review. [21:17] sinzui: oh... for docs... [21:17] yeah, this documentation thing still slips by me ... [21:17] sinzui: we should get a summary of the help doc into the actual command line help [21:18] thumper, I agree. Maybe I will make that a topic for vegas [21:24] sinzui: for the local bootstrap test on precise [21:24] sinzui: what is the minimum I need to install? [21:25] thumper, CI uses real precise + juju + juju-local [21:25] sinzui: juju and juju-local from where? [21:26] sinzui: also, do you know which compiler? [21:26] thumper, any recent. I have juju 19 and juju-local 1.18.1. I haven't changed the last package in a while [21:26] sinzui: as I may need to build additional logging [21:27] thumper, good question [21:27] sinzui: let me be clear, the precise box I have currently has no juju deps at all [21:27] * sinzui looks [21:27] sinzui: I'm assuming there is a ppa [21:28] $ apt-cache madison golang-go [21:28] golang-go | 2:1.1.2-2ubuntu1~ctools1 | http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/cloud-tools/main amd64 Packages [21:29] thumper, and if you want very close matches to packages I can offer this...bug I assure you I have't changed local packaging since 1.18.0 [21:29] http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/publish-revision/ws/tmp.fQ6PU5ZxX5/ [21:29] ah ha [21:29] I have precise-updates/cloud-tools in apt [21:29] * thumper installs juju-local [21:30] sinzui: it seems weird to me that jam was able to boot trunk on aws but CI was not [21:30] thumper, I think you have that reversed [21:31] ah... wat? [21:31] thumper, http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/ [21:31] CI can deploy fine [21:31] sinzui: what is the current state of the local provider CI tests? [21:31] which one is that [21:31] jam reports that deploy is using the wrong tools. 
I have not seen that personally or in CI [21:31] local-deploy is red [21:32] it has been broken for a few days. it is not "techincally" aws as we have done this on canonistack too [21:34] hmm, this is a regrettable error when the machine in question is down (actually, its instance has been destroyed): 2014-04-14 21:32:16 ERROR juju.cmd supercommand.go:299 some agents have not upgraded to the current environment version 1.19.0.3: machine-0 [21:35] sinzui: how come the precise-updates cloud tools doesn't have 1.18? [21:35] i think there should probably be a way to force that [21:35] thumper: hiya [21:35] or perhaps another question would be [21:36] why doesn't my machine see it? [21:36] hi rogpeppe [21:36] rogpeppe: I'm wondering if the 'regrettable error' is an understatement for something? [21:36] thumper: well, it means that the environment is now broken - i cannot upgrade it [21:37] thumper, politics [21:37] thumper: it is an understatement, yeah [21:37] rogpeppe: I suppose an error message that says "you're borked, sucks to be you" wouldn't be appreciated [21:37] thumper: at least then i'd know it was deliberate... [21:38] thumper, Ubuntu rejected 1.16.4 (they consider backup and restore a feature). jamespage is still trying to get 1.16.6 into archive for precise to ensure they can upgrade, then go to 1.18.0 [21:38] thumper: it's an interesting situation actually, because usually i'd be able to do destroy-machine --force, but in this case the machine in question is a state server [21:38] thumper, we have never said users can upgrade from 1.16.3 to 1.18.0 [21:38] sinzui: even in the cloud-tools? [21:38] It's not our repo [21:39] * rogpeppe creates a bug [21:40] thumper, I talked with a few people today about it. There is a chance 1.18.1 will become official in the archive when trusty is released and customers cannot upgrade to it [21:40] hmm, actually maybe it's just a bug for me at this moment [21:40] sinzui: aargh... 
that is terrible === vladk is now known as vladk|offline [22:01] sinzui: ok, can confirm that 1.19.0 bootstraps the local provider on my precise machine [22:01] r2626 [22:01] which I can see fails on CI [22:02] sinzui: so the big question now becomes, what is different? [22:03] thumper, well. [22:03] what changed in lp:juju-core r2593 [22:03] thumper, when CI slows I can run the deploy with --debug [22:04] sinzui: that was when the machine agent became responsible for setting up the mongo upstart script [22:05] sinzui: can you capture the mongo logs from the CI machine? [22:05] I wonder if this is the crash that dave had reported [22:06] sinzui: https://bugs.launchpad.net/juju-core/+bug/1306536 [22:06] <_mup_> Bug #1306536: replicaset: mongodb crashes during test [22:07] bugger, CI is trying the local job more than 5 times [22:07] thumper, I think it is related since the logs report it http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/local-deploy/1174/console [22:10] sinzui: there is also the mongo log file [22:11] sinzui: /var/log/upstart/juju-db-tim-local.log is my file [22:11] sinzui: replace for ci user, and for the env name [22:11] thumper, noted. [22:11] sinzui: that way we'll get any extra crash info [22:12] the local upgrade test is still playing so I cannot start the deploy test [22:12] ack [22:12] surely if the upgrade test is running, then the local provider bootstraps? [22:12] or is it taking a long time to fail? [22:13] sinzui: perhaps also worth noting that my precise machine is running i386 [22:13] thumper, 1.18.1 is good. We can bootstrap with stable, we cannot upgrade to unstable [22:13] We are amd64 [22:16] sinzui: http://paste.ubuntu.com/7252245/ [22:16] I can bootstrap now [22:17] sinzui: where? [22:17] On the CI machine [22:17] ?! [22:17] what changed? [22:20] thumper, this is the log of my bootstrap attempt https://pastebin.canonical.com/108508/ [22:20] thumper, I didn't mean CI could pass bootstrap. 
I meant that the env was free for me to bootstrap [22:23] thumper, I didn't get logs in a local dir or juju-jenkins-local [22:24] thumper, maybe this config offends you: https://pastebin.canonical.com/108509/ [22:24] * thumper looks [22:24] what is test-mode? [22:25] what is bootstrap-timeout in? [22:25] oops, ensure-availability shouldn't have done *that* [22:26] thumper, this is mongodb-server https://pastebin.canonical.com/108510/ [22:26] sinzui: log? [22:26] thumper, test-mode tell the charm store to not count the deployment [22:26] no logs [22:26] ^ [22:27] thumper, bootstrap failures don't seem to ever leave logs [22:27] sinzui: same mongo version [22:27] hmm [22:27] thumper, I can try to tail something in another terminal while I bootstrap [22:28] sinzui: can I log into that machine? [22:29] sure [22:30] thumper, ssh -i ./cloud-city/staging-juju-rsa jenkins@54.84.137.170 [22:30] sinzui: I don't have that identity [22:30] thumper, the key is in lp:~sinzui/+junk/cloud-city [22:30] which is shared with you [22:30] * thumper looks [22:31] That also has the env for everything we test [22:32] sinzui: I'm in [22:33] thumper, export GOPATH=/var/lib/jenkins/jobs/local-deploy/workspace/extracted-bin/ [22:33] thumper, export JUJU_HOME=~/cloud-city [22:42] * rogpeppe has an environment that seems reasonably HA [22:44] rogpeppe: \o/ [22:44] thumper: there are still... strangenesses [22:45] thumper: but still, i destroyed the bootstrap instance and everything carried on much as usual [22:45] thumper: sinzui: i am going to land john's recent branch "soft-login-failure-1307450" which fixes an issue preventing upgrade from 1.18 to .19 from working [22:46] rogpeppe, send me a bried summary of how you made it HA via the command line. I think I can reused the backup restore test to instrument a failure of a machine. 
I expect with HA, juju status still works after the failure [22:46] \o/ wallyworld [22:46] sinzui: the requisite branches haven't landed yet [22:46] sinzui: well, i'm going by the description - there may be other issues :-) [22:46] sinzui: there's one which isn't ready to be proposed yet [22:47] sinzui: i can push the branch that i'm testing, if you like [22:49] sinzui: essentially i did this: http://paste.ubuntu.com/7252375/ [22:49] rogpeppe, no rush. I am busy preparing for a release and trying to get juju 1.16.6 in the cloud archive [22:49] rogpeppe, excellent. as I hoped [22:55] * rogpeppe grinds to a halt [22:55] g'night all [22:55] night rogpeppe [22:56] waigani: ttfn [22:56] congrats on HA [22:56] it's not there yet! [22:57] congrats on *almost* HA ;) [22:58] thumper, Can you read my debug-log draft at https://docs.google.com/a/canonical.com/document/d/1BXYrLC78H3H9Cv4e_4XMcZ3mAkTcp6nx4v1wdN650jw/edit === marrusl_afk is now known as marrusl [23:17] why does local provider try to reverse dns on the ip addresses.. [23:18] * hazmat wonders how he got dns-name: 176.52.236.23.bc.googleusercontent.com. === mjs2 is now known as mjs0 [23:20] ha.. yummy! [23:20] rogpeppe, sinzui if you want an additional tester for that.. send me some instructions [23:24] hazmat, thank you [23:27] smoser, you ever seen cloudinit on trusty hang.. i'm in a container.. and the last output is http://pastebin.ubuntu.com/7252494/ but its blocking the rest of the container startup (ssh, etc). [23:36] sinzui: john's branch landed at r2627 so hopefully that might help the upgrade tests pass. we'll see i guess [23:40] hazmat, can you turn cloud-init debug on. [23:40] and get paste. [23:41] hazmat, in /etc/cloud/cloud.cfg.d/05_logging.cfg just turn 'handler_consoleHandler' to be [23:41] level=DEBUG [23:41] rather than [23:41] level=WARNING [23:41] you should see lots more output. [23:42] not sure how you ran that though [23:46] thumper, I think CI will start testing in 15 minutes. 
Do you want me to disable the local tests so that you can use the env as you like? [23:46] sinzui: yeah, for now would be good [23:46] just otp with alexisb [23:48] yes sinzui I was distracting thumper, I am done he is all yours now [23:48] thumper, local is all yours. Say when you are done so that I can re-enable the test [23:48] sinzui: ok [23:48] in poking around now [23:52] sinzui: um... [23:52] sinzui: hangout? [23:52] smoser, ack [23:52] smoser, it was an old version of trusty i was updating. [23:53] i'll see if i can reproduce and log [23:53] thumper, I can 40 minutes. My children want dinner [23:54] ok [23:54] sinzui: can in 40 minutes? [23:54] or only for 40 minutes [23:54] :)