[00:46] <menn0_> grrr functional-ha-recovery still fails
[00:46] <davecheney> aaaaaaaaarh
[00:46]  * menn0_ rolls up his sleeves
[00:51] <menn0_> at least mongo didn't go down in a ball of flames
[00:52] <menn0_> it looks like the test perhaps didn't wait quite long enough
[00:52] <davecheney> \o/ add more timeouts
[00:57] <menn0_> hmmm.... it's already waiting up to 20mins
[00:57] <menn0_> 20mins to bring up 2 state servers
[00:58] <menn0_> should be enough
[00:58] <davecheney> yup, that is far too long
[02:02] <menn0_> davecheney: something definitely went wrong
[02:03] <menn0_> there are signs that the extra state servers were on their way up
[02:03] <menn0_> but they didn't get there
[02:03] <menn0_> and the logs just stop well before the test gave up
[02:03] <menn0_> in fact, the logs are a bit sparse throughout
[02:03] <menn0_> I don't see the EnsureAvailability API call for example
[02:04] <menn0_> but I do see messages from the provisioner that the extra machines came up as a result
[02:04] <davecheney> menn0_: what I was seeing last night, even when mongo wasn't fucked is the _second_ we issue ensure-availability
[02:04] <davecheney> the primary api server goes offline
[02:04] <davecheney> it doesn't stop
[02:04] <davecheney> just stops accepting new connections
[02:04] <davecheney> so the two new state servers get through cloud init
[02:04] <davecheney> we start their jujud upstart jobs
[02:05] <menn0_> I saw that yesterday too but this is different
[02:05] <davecheney> and they just sit there waiting for the api server
[02:05] <menn0_> there are no errors or warnings in the logs
[02:05] <davecheney> they don't even get to the point of establishing a replica set
[02:05] <menn0_> everything looks positively happy
[02:05] <davecheney> they don't even have mongo installed
[02:05] <davecheney> it's like turning on journalling buggers the primary state server
[02:05] <davecheney> and thus the api server connected to it
[02:05] <menn0_> the other thing is, I've manually run this CI test against EC2 and it worked
[02:06] <menn0_> so it could be that it'll work sometimes
[02:06] <menn0_> Curtis sent me the details for logging in to our CI infrastructure
[02:07] <menn0_> I might run the test in Jenkins again and get on the hosts that get created
[02:07] <menn0_> 20 mins is hopefully enough time to nose around
[02:11] <davecheney> menn0_: so you got some working ec2 creds ?
[02:11] <menn0_> yep
[02:11] <menn0_> John set me up
[02:11] <menn0_> there's one canonical funded account
[02:11] <davecheney> given how many new starters we have, it might be valuable to email the internal list and let them know how to get the details
[02:11] <davecheney> rather than everyone figuring it out on their own
[02:11] <menn0_> and he can create user accounts within that
[03:09] <jcw4> waigani: thanks for the review comments
[04:12] <waigani> menn0_ what's the latest?
[04:12] <menn0_> waigani: well, the problem doesn't happen very often
[04:13] <menn0_> I've only seen the functional-ha-recovery jenkins job fail once out of 4 attempts
[04:13] <waigani> ah a heisenbug
[04:13] <menn0_> and the one time it failed the logs don't indicate anything
[04:13] <menn0_> the one suspicious thing is that the logs stop dead several minutes before the test gave up
[04:14] <menn0_> but no errors or warnings before then
[04:14] <menn0_> I'm working on a little script which slurps all the useful logs from all the machines as the test is running
[04:14] <waigani> any idea what command / test is run just before blackout?
[04:15] <menn0_> I have a theory that something went wrong with rsyslog (which feeds all-machines.log, which is the one we have for the failed run)
[04:15] <menn0_> but that the machine-N.log may have continued on
[04:15] <waigani> have you checked the machine-n.log?
[04:16] <menn0_> yes sure, but only for runs where the test succeeds :)
[04:16] <menn0_> for the failed one I wasn't looking as closely
[04:16] <menn0_> Jenkins only archived all-machines.log
[04:16] <menn0_> from machine-0
[04:16] <waigani> ah right
[04:16] <menn0_> and that's the one that seems truncated
[04:16] <menn0_> actually
[04:17] <menn0_> I might make a change to the test so that it archives all the logs from all the machines
[04:17] <menn0_> that's probably more sensible than what I'm doing now
[04:17] <waigani> it's not already doing that? what is it doing now?
[04:17] <menn0_> it just archives all-machines.log and cloud-init.log from machine-0
[04:18] <menn0_> which is perfectly sensible if logging to all-machines.log is working correctly
[04:18] <waigani> right, i see
[04:18] <menn0_> but I have a feeling it isn't (or at least wasn't for the failed run)
[04:19] <menn0_> I think having all the logs for all machines when tests fail is probably a good thing to have at this stage
[04:19] <menn0_> ha
[04:19] <menn0_> the test just ran successfully again...
[04:19] <menn0_> which is great
[04:19] <menn0_> and certainly better than what we had before
[04:20] <menn0_> but it bugs me that we still might have a lurking issue
[04:21] <menn0_> emailing curtis with an update now
[06:16] <dimitern> morning all
[06:37] <voidspace> morning all
[06:53] <dimitern> hey voidspace
[06:53] <voidspace> dimitern: hwy
[06:53] <voidspace> *hey even
[06:54] <dimitern> voidspace, i'll be running tests on MAAS as soon as the talk starting now is done (~1h from now)
[06:54] <voidspace> dimitern: cool
[06:54] <voidspace> thanks
[06:59] <dimitern> jam, jam1, hey, I'm in juju-networking
[07:02] <jam1> dimitern: hi, we're not done with the lightning talks, but we'll be switching to the boardroom soon
[07:02] <dimitern> jam1, ah, ok
[08:48]  * voidspace lurches
[09:03] <TheMue> voidspace: morning, just opened your latest PR
[09:13] <jam1> TheMue: voidspace:
[09:13] <jam1> hey guys
[09:13] <jam1> how's IPv6 stuff looking?
[09:16] <TheMue> jam1: I’m in contact with Serge and Stéphane regarding LXC. Got a mail back with lots of data in it. ;) There seems to be an issue with the bridge for the containers, but I have to go deeper into the pasted command output in the mail.
[09:16] <jam1> TheMue: as in 'br0' doesn't actually support ipv6?
[09:16] <TheMue> jam1: Will do it after finishing voidspaces review.
[09:17] <TheMue> jam1: That could be the worst answer, yes. I hope not.
[09:17] <jam1> TheMue: certainly I expect that the containers can't just come up on lxcbr0 as they default to
[09:17] <jam1> because that isn't actually a bridge onto the outer network
[09:21] <lifeless> jam1: allocating routable ipv6 will need a daemon offering address space to the bridge (and you can then route it onto the exterior net)
[09:24] <jam1> lifeless: in this case, I believe we are just setting the addresses on the individual containers, the bridge itself needs to know the range?
[09:26] <lifeless> jam1: same as dnsmasq is used to offer IP addresses to containers for ipv4, you need something offering it for ipv6
[09:26] <lifeless> jam1: autoconfig will get you local address space only of course
[09:27] <jam1> lifeless: sure, but you can set them with ip manually
[09:28] <jam1> right?
[09:28] <TheMue> jam1, lifeless: yes, currently all addresses on host and in the containers are set manually
[09:28] <jam1> anyway, I have another meeting now, but I'd like to hear your thoughts lifeless, perhaps in about an hour?
[09:29] <lifeless> jam1: I'm at a python user group meetup, but I'll try to catch you later perhaps :)
[09:29] <TheMue> and as said, host A can reach host B, host A can reach container A1 and A2, container A1 and A2 can reach each other, only A1 cannot reach B1
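A minimal sketch of that kind of manual assignment (the addresses, prefix length and interface names here are illustrative, not the actual test setup):

    # inside a container: give eth0 a static IPv6 address and a default route
    ip -6 addr add 2001:db8:0:1::10/64 dev eth0
    ip -6 route add default via 2001:db8:0:1::1 dev eth0
    # on the host: give the bridge an address in the same prefix
    ip -6 addr add 2001:db8:0:1::1/64 dev lxcbr0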
[09:31] <jam1> lifeless: sounds like a good time, have fun
[09:32] <lifeless> TheMue: is there an etherpad or a bug with details?
[09:33] <TheMue> lifeless: not yet, so far we haven’t seen it as a bug, only as a result of my lack of knowledge about ipv6 :D
[09:35] <lifeless> TheMue: ok, so I'm happy to cross-check, if you have a diagram (or full prose) description of the topology (and ip -6 route output etc from the hosts and containers)
[09:36] <TheMue> lifeless: thanks, will add this to my „research doc“ on google and send you the link
[09:37] <dimitern> voidspace, still in a call; upgrading the maas nucs, it takes some time - when done i'm running tests
[09:57] <lifeless> TheMue: cool
[09:58] <waigani> I'm going to be late to hangout - my plugin has crashed, trying to sort it
[10:00] <natefinch> anyone getting on the juju-core team meeting?
[10:01] <TheMue> natefinch: oh, goooood hint *iirks*
[10:14] <voidspace> dimitern: cool, I have some review comments from TheMue anyway
[10:14] <voidspace> TheMue: thanks for the review, useful
[10:16] <TheMue> voidspace: yw and just for info, it’s core meeting time
[10:17] <voidspace> TheMue: ah! I always forget core meeting
[10:17] <voidspace> grabbing coffee and will join
[10:17] <voidspace> thanks
[10:17] <TheMue> voidspace: hehe, this time I almost forgot it too
[10:27] <natefinch> sinzui: are you around?  CI might be unblocked
[10:28] <natefinch> morning alexisb
[10:54] <mattyw> davecheney, http://i.imgur.com/TkWPd9o.jpg
[10:59] <voidspace> TheMue: dimitern: I assume we're not doing standup as well?
[11:00] <dimitern> voidspace, TheMue, let's skip it yeah, unless you need to talk about something specifically?
[11:00] <voidspace> dimitern: no, you both know where I'm at
[11:00] <voidspace> dimitern: I'd like confirmation from you at some point that my branch doesn't screw MAAS
[11:01] <dimitern> voidspace, certainly, I'm trying my best to get my local maas in a usable state - almost there i hope
[11:01] <voidspace> dimitern: haha, ok
[11:01] <voidspace> dimitern: I have minor cleanups to do on that branch anyway
[11:01] <voidspace> dimitern: so I'm not blocked
[11:02] <dimitern> voidspace, sweet!
[11:02] <waigani> git config --global rerere.enabled true
[11:02] <waigani> that's all I need to do?
[11:02] <waigani> if so, we should add that to the contributing doc - just before the instructions to rebase
[11:02] <dimitern> waigani, check that blog post "Rerere your boat"
[11:03] <waigani> dimitern: okay, I'll keep reading
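For reference, the rerere workflow being pointed at looks roughly like this (the remote and branch names are only examples):

    # record and reuse recorded conflict resolutions
    git config --global rerere.enabled true
    # resolve a conflict once during a rebase; rerere remembers the resolution
    git rebase upstream/master
    # recorded resolutions are kept under the repo's rr-cache
    ls .git/rr-cache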
[11:06] <c7z> ma, that's a terrible blog title pun
[11:11] <waigani> mattyw: thanks for the git/D3 link, great idea
[11:12] <mattyw> waigani, I use 2 resources for understanding git. that's the first one - the other one is tasdomas
[11:12] <waigani> mattyw: hehe, I'll have to bookmark that second one
[11:14] <waigani> night all.
[11:35] <wwitzel3> natefinch: you said CI might be unblocked?
[11:46] <sinzui> sorry natefinch  https://bugs.launchpad.net/juju-core/+bug/1350983 is still open. While we got a pass, it failed most of the time we tried
[11:54] <c7z> sinzui: well, I have another plan on that one too
[11:55] <c7z> sinzui: can you confirm it's just bug 1350983 and bug 1347715 left blocking?
[11:55] <sinzui> c7z, yes, just them
[11:56] <sinzui> c7z, there is another critical regression reported by voidspace; we may need to add "ci" to it to make it block: check out bug 1353443
[11:56] <c7z> I'm pretty certain the azure issue is slow disk, but I'm not sure how best to mitigate it, or of the direct cause (as it's not that)
[11:58] <c7z> sinzui: devs seem blocked on the manual provider issue from not being able to reproduce it... but ci does hit it completely reliably
[11:59] <sinzui> c7z, yep. Since 1.20 always passed, and it passed before the problem revisions, I cannot fault the test.
[12:06] <katco> mgz: standup?
[12:09] <voidspace> TheMue: which do you prefer, the first or the second?
[12:09] <voidspace> TheMue: http://pastebin.ubuntu.com/7979232/
[12:09] <perrito666> sinzui: hey, can you give me a hand with azure?
[12:09] <voidspace> TheMue: I like the first as you then have a type for the function parameter
[12:10] <voidspace> I have to reboot
[12:10] <voidspace> screwy driver kills the mouse from time to time - so I currently have no mouse pointer
[12:10] <voidspace> brb
[12:11] <sinzui> perrito666, I have a few minutes between meetings
[12:11] <perrito666> ok Ill be fast
[12:11] <perrito666> I copied one of the setups for azure from cloud city
[12:11] <perrito666> I set up the env variables from azuretoolssrc
[12:12] <c7z> katco: sorry, sec
[12:12] <c7z> timezone is wrong here somehow
[12:12] <perrito666> and I tried a few combinations and I always end with:
[12:12] <perrito666> 2014-08-06 22:12:26 ERROR juju.provider.common bootstrap.go:120 bootstrap failed: waited for 10m0s without being able to connect: Permission denied (publickey).
[12:22] <voidspace> aaand back. But I have to help the wife for a few minutes. Back for realsies shortly.
[12:23] <wwitzel3> realsies .. heh
[12:23] <voidspace> wwitzel3: morning
[12:23] <wwitzel3> voidspace: hola, how are you?
[12:23] <voidspace> wwitzel3: I'm good, but I have to go
[12:23] <voidspace> back in a few
[12:24] <TheMue> voidspace: back from lunch, just seen your question. I prefer the first approach too. but is it do and don’t or can and cannot?
[12:32] <jam2> voidspace: I'm pretty sure that while you have a type for the parameter, you can still pass "myFunc(true)"
[12:35] <jam2> voidspace: http://play.golang.org/p/q4ttwlIC4_
[12:35] <natefinch> jam2: yeah, that's right.  "true" and "false" are constants and thus get converted to whatever special type you want that is derived from bool
[12:35] <perrito666> sinzui: let me know if you can shed some light on my issue
[12:35] <jam1> natefinch: which is also true for enumerated strings, and lets you create an "instance" of your enumerated type that has any arbitrary string
[12:36] <natefinch> jam1: yep, it's kind of a problem at times... it's not too bad for enums based on ints, because randomly passing 8 or whatever into a function looks weird and should get caught by a review
[12:37] <natefinch> jam1, voidspace:  I'm actually -1 on making constants that just mean true or false.  If you want to make the code clearer, don't use a boolean parameter at all
[12:38] <natefinch> jam1, voidspace: just make two functions:  Foo() and FooNoConfig()    and
[12:38] <natefinch> s/and//
[12:39] <natefinch> another option is to use an integer instead of true/false; just because it only has two values doesn't mean it has to be a boolean... no one's going to pass 0 or 1 into your function and make it past a code review (hopefully)
[12:40] <natefinch> but really, the best answer is just two functions that in their implementation pass true/false to a single implementation function
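A small Go sketch of the two points above: an untyped true/false literal still satisfies a named bool type, and the boolean can instead be hidden behind two constructors. All names here are illustrative, not the real networker code.

    package main

    import "fmt"

    // A named bool type does not stop callers from passing a bare literal:
    // the untyped constants true/false convert to WriteConfig implicitly.
    type WriteConfig bool

    const (
        DoWriteConfig   WriteConfig = true
        DontWriteConfig WriteConfig = false
    )

    func newWorker(write WriteConfig) { fmt.Println("write config:", write) }

    // The alternative suggested above: two exported constructors that hide
    // the boolean inside a single unexported implementation.
    func NewWorker()         { newWorkerImpl(true) }
    func NewWorkerNoConfig() { newWorkerImpl(false) }

    func newWorkerImpl(write bool) { fmt.Println("write config:", write) }

    func main() {
        newWorker(true) // compiles despite the named type
        newWorker(DontWriteConfig)
        NewWorkerNoConfig()
    }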
[12:40] <natefinch> ok gotta go.  Bringing my older daughter to a doctor's appointment
[12:40] <TheMue> natefinch: sounds good, and later you can simply add values like „iDontCare“ or „whoKnows“ ;)
[12:41] <TheMue> natefinch: but I would always start with 1, so that an uninitialized variable passed in (value 0) would fail internally, e.g. inside a switch.
[12:48] <jam1> TheMue: voidspace: I'd also *highly* recommend affirmative statements, rather than something like "if !DontFoo"
[12:52] <TheMue> jam1: yep, if statements should be „if isWanted“ or „isPossible“ while the arguments are „want“, „dontWant“, „possible“ or „impossible“
[12:53] <TheMue> jam1: but only for boolean variables, fields, arguments
[12:53] <TheMue> jam1: otherwise the prefixes „is“,“can“ etc don’t match
[12:54]  * TheMue is reminded of the always good readability of smalltalk sources
[13:01] <dimitern> voidspace, hey
[13:01] <dimitern> voidspace, sorry for the long delay
[13:02] <dimitern> voidspace, i had a ~4h fight with my hardware maas, eventually giving up and testing your branch on my kvm-based virtual maas
[13:13] <dimitern> voidspace, network setup looks fine, containers on all nodes, bootstrap included, are addressable on the same subnet (the bridge works)
[13:14] <katco> wwitzel3: did a dumb script: #!/bin/bash\nPASSWORD=$(sudo cat $HOME/.juju/local/agents/machine-0/agent.conf |grep oldpassword |awk '{ print $2 }')\nmongo --ssl -u admin -p $PASSWORD localhost:37017/admin
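The same script, expanded for readability (paths as pasted above, assuming the local provider's default layout):

    #!/bin/bash
    # Pull mongo's admin password out of the machine-0 agent config.
    PASSWORD=$(sudo grep oldpassword "$HOME/.juju/local/agents/machine-0/agent.conf" | awk '{ print $2 }')
    # Connect to the state server's mongod over SSL as the admin user.
    mongo --ssl -u admin -p "$PASSWORD" localhost:37017/admin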
[13:16] <voidspace> dimitern: awesome, that's great news
[13:16] <wwitzel3> katco: nice
[13:16] <dimitern> voidspace, i've just reviewed your PR, LGTM
[13:16] <voidspace> dimitern: cool, thanks
[13:16] <katco> wwitzel3: seems like everyone asks how to do that
[13:17] <voidspace> TheMue: nate's advice doesn't really make sense in this context
[13:17] <voidspace> TheMue: we'd actually need two networkers and two configstate types to do as he suggests
[13:17] <dimitern> katco, ha, awk fan, huh :)
[13:17] <voidspace> TheMue: sometimes a boolean actually makes sense, I disagree with him on that point I think
[13:18] <katco> dimitern: lol i am by no means an awk master. i just struggle through it when it makes sense
[13:18] <voidspace> TheMue: if go had named parameters it would be easier...
[13:18] <TheMue> voidspace: no, not two networkers. two constructors to avoid the parameter
[13:18] <TheMue> voidspace: yeah, named parameters are nice
[13:18] <voidspace> TheMue: but the networker has to pass this parameter down into another function
[13:18] <voidspace> TheMue: so two constructors isn't enough...
[13:18] <voidspace> we still need to store the boolean
[13:18] <dimitern> voidspace, it has a struct-literal syntax for kinda the same thing
[13:19] <voidspace> dimitern: heh, right - could use that I guess
[13:19] <voidspace> dimitern: that's even further down the rabbit hole
[13:19] <dimitern> voidspace, we use this quite a lot with 3-4+ args functions
[13:19] <voidspace> dimitern: right, and there it makes sense
[13:19] <TheMue> voidspace: the default constructor would set a field to the one value, while the second one calls the first one but then changes the field. ;)
[13:19] <voidspace> dimitern: because you can add or remove parameters at will too
[13:20] <voidspace> TheMue: I don't think that's any clearer than just a boolean parameter with named constants
[13:20] <voidspace> that's plenty readable
[13:20] <dimitern> voidspace, I'd just define a couple of int bit flags and | them when calling NewNetworker - everyone's happy
[13:21] <voidspace> dimitern: bit flags!
[13:22] <dimitern> :) why not? networker.DontWriteConfig
[13:22] <dimitern> voidspace, ah, sorry - i've just noticed that's the only bool arg
[13:23] <dimitern> if there were 2, i'd use bit|flags
[13:23] <wwitzel3> where is this code in question?
[13:23] <voidspace> dimitern: you mean bit flags instead of multiple bools
[13:24] <voidspace> yeah, that would work...
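A sketch of the bit-flag variant dimitern describes. Only DontWriteConfig is mentioned in the conversation; the second flag and the function body are invented for illustration.

    package networker

    // Flags are option bits combined with | when constructing a Networker.
    type Flags uint

    const (
        DontWriteConfig Flags = 1 << iota
        DisableBridging // purely illustrative second flag
    )

    // NewNetworker shows how a caller-supplied flag set might be consumed;
    // callers would write e.g. NewNetworker(DontWriteConfig | DisableBridging).
    func NewNetworker(flags Flags) {
        writeConfig := flags&DontWriteConfig == 0
        _ = writeConfig // the real constructor would record this on the worker
    }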
[13:24] <jam1> katco: wwitzel3: fwiw, I think direct DB access as the "admin" user is going to go away eventually; you should be trying to connect as the machine-0 user, IIRC
[13:26] <wwitzel3> jam1: will the machine-0 user have the rights?
[13:26] <dimitern> voidspace, In this case, the easiest thing really is another ctor NewSafeNetworker? { newNetworker(..,false) }, and NewNetworker also calls the implementation, but with true
[13:26] <katco> jam1: thanks for the heads up.
[13:26] <voidspace> I give in
[13:27] <jam1> wwitzel3: a machine agent that has JobManageEnviron will have admin access on the DB
[13:27] <jam1> in HA mode, eventually machine-0 might not, but the chances of that are quite low
[13:27] <lifeless> jam1: so hi
[13:27] <jam1> I'll bring it up if we *actually* want to kill the admin user, but already the "oldpassword" stuff means it isn't quite what it used to be
[13:27] <jam1> hey lifeless
[13:27]  * dimitern wonders how many *degrees* of bikeshedding are there :D
[13:27] <wwitzel3> jam1: ok, just wondering, I know that the issue I just fixed I wouldn't have been able to fix without access to the admin database.
[13:38] <dimitern> voidspace, interestingly, i came across some problems with deploying in lxc containers - slow startup (not using btrfs?), apt-get install failing the same way (can't get the dpkg lock; resolved, --retry fixes it), mysql start hook fails with random mysql startup errors
[13:38] <voidspace> dimitern: that all sounds horrible
[13:39] <voidspace> it never used to be so bad
[13:39] <dimitern> so maybe the lxc package had some regressive changes lately
[13:39] <voidspace> right, maybe
[13:40] <dimitern> voidspace, tell me about it :) combined with a couple of hours trying to get the master nuc on my maas to work (at one point no usb ports worked, i.e. no kbd, no wifi or ethernet)
[13:41] <voidspace> ouch!
[13:42] <TheMue> lifeless: in https://docs.google.com/a/canonical.com/document/d/1wfdGL_vyemT2-ncAB7KIySkKI9HbT8efKeC3Sd8ID0I/edit# the current test setup and status are described
[13:43] <jam1> TheMue: so that doc seems to say that they are both just on "lxcbr0" which is only the local host bridge
[13:43] <TheMue> jam1: yes
[13:43] <jam1> TheMue: you need a bridge that is on eth0 if you want containers on VM1 to be able to see anything on VM2
[13:45] <dimitern> TheMue, that's right, and it needs to be an IPv6 bridge I think
[13:45] <TheMue> jam1: I’ve tried that too, as I found a doc. but after shredding my net this way in the first approach even the second one didn’t work
[13:46] <wwitzel3> perrito666, ericsnow: you guys want to push standup back and wait for nate? Or do it at 10?
[13:46] <TheMue> jam1: so how to add an eth0 bridge for ipv6?
[13:46] <jam1> TheMue: my initial understanding is that installing "bridge-utils" creates a br0 bridge on eth0
[13:47] <jam1> TheMue: though possibly you need: http://xmodulo.com/2013/04/how-to-configure-linux-bridge-interface.html
[13:47] <perrito666> wwitzel3: I guess we can wait
[13:47] <TheMue> jam1: thanks for the link, will take a look
[13:50] <dimitern> TheMue, this one is specifically for lxc+ipv6 - might help: http://blog.toxa.de/archives/606
[13:50] <TheMue> jam1: so am I right that eth0 and the virtual interfaces of the containers have to be added to the bridge
[13:50] <TheMue> dimitern: see my link list
[13:50] <TheMue> dimitern: in the document
[13:50] <dimitern> TheMue, :)
[13:50] <dimitern> haven't checked all of them
[13:51] <TheMue> dimitern: doing it exactly this way made me restore my VM to my snapshot, couldn't reach it anymore *lol*
[13:52]  * TheMue will see how the bridge utils work now
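A rough sketch of the host bridge setup being discussed; the addresses use the IPv6 documentation prefix and the exact stanza is an assumption, not the configuration that was actually tested.

    # /etc/network/interfaces on the host -- illustrative only
    auto br0
    iface br0 inet6 static
        bridge_ports eth0          # enslave eth0 to the bridge
        bridge_stp off
        address 2001:db8:0:1::1
        netmask 64

    # and in each container's LXC config, point its veth at the bridge:
    #   lxc.network.type = veth
    #   lxc.network.link = br0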
[14:00] <lifeless> TheMue: have requested access
[14:02] <TheMue> lifeless: granted
[14:35] <lifeless> TheMue: comments left
[14:35] <lifeless> TheMue: ip -6 neigh show
[14:35] <lifeless> TheMue: is another useful command
[14:35] <c7z> what horseplay is that...
[14:36] <lifeless> TheMue: in short, I think your subnetting is broken - you're putting /64 prefixed addresses on a virtual bridge with a /96 route, we don't expect neighbour discovery outside that /96 to work
[14:36] <lifeless> hth
[14:36] <lifeless> gnight!
[14:37] <TheMue> lifeless: ah, ok, will take a look there, thanks
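To illustrate the mismatch lifeless is pointing at (addresses invented): neighbour discovery is only attempted for destinations the kernel considers on-link, so an address labelled /64 doesn't help if the bridge's route only covers a /96.

    ip -6 addr show dev lxcbr0    # e.g. 2001:db8::1/64  <- address claims a /64
    ip -6 route show dev lxcbr0   # e.g. 2001:db8::/96   <- but only a /96 is on-link
    ip -6 neigh show              # peers outside that /96 never get resolved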
[14:38] <TheMue> currently my vms are rejecting any networking after enslaving eth0 to br0 :(
[14:44] <perrito666> whyyyyy cant I break this/
[14:44] <perrito666> ?
[14:53] <sinzui> perrito666, azure is not seeing the io timeout?
[14:53] <perrito666> sinzui: nope, I am bootstrapping successfully
[14:54] <perrito666> sinzui: I did see the io error in aws a couple of days ago
[14:54] <perrito666> that intrigued me
[15:01] <sinzui> perrito666, we are testing 1.20 for a ppc fix now.
[15:02] <sinzui> perrito666, jam speculates that this error is older, really just a replica set issue, and that the error we now see is just a mutation
[15:02] <perrito666> you mean the io timeout?
[15:02] <sinzui> perrito666, yes
[15:03] <perrito666> do you have any context on that speculation?
[15:17] <jcw4> rick_h__: ping
[15:21] <jam1> bac: https://gist.github.com/jameinel/d0763eb6d8d38cfd64e1
[15:38] <perrito666> sinzui: ok, out of 6 i only got one failure and it was dns related :|
[15:43] <sinzui> perrito666, I am going to force a rebuild of master when 1.20 finishes the test
[15:54] <sinzui> perrito666, did you publish your own streams? did you use --upload-tools?
[15:55] <perrito666> I used upload tools I have my streams published if you want me to try with it
[15:56] <sinzui> perrito666, no need. I am deploying too while I wait for CI to come free
[15:57] <perrito666> too late, it was at the tip of my fingers
[16:14] <perrito666> sinzui: with my own stream also works
[16:15] <sinzui> perrito666, That last built binary worked for me too?
[16:15] <perrito666> sinzui: that is a very hard question for me to answer man :p
[16:15] <sinzui> perrito666, let's just stop testing for now. master will rebuild in about an hour
[16:16] <sinzui> perrito666, s/?/./ It did work for me
[19:29] <rick_h__> jcw4: pong?
[19:33] <jcw4> Hi rick_h__ ; sorry - I don't think you saw my note in IRC a couple days ago?
[19:33] <jcw4> about actions api docs
[19:34] <rick_h__> jcw4: sorry, saw something go by but I've been at a sprint out of the country and haven't peeked at it
[19:34] <jcw4> rick_h__: dropped the ball on that, but here is a WIP pr for documentation... https://github.com/juju/juju/pull/468
[19:34] <rick_h__> jcw4: ty much
[19:34] <jcw4> rick_h__: figured you really wanted that last week, not this week :-/
[19:34] <jcw4> rick_h__: hope you're having a great sprint though :)
[19:34] <rick_h__> wheeeee! :)
[19:35] <jcw4> haha
[19:36] <voidspace> evening all
[19:36] <jcw4> voidspace: 'ello :)
[19:37] <voidspace> o/
[20:09] <wwitzel3> natefinch: ping
[20:11] <natefinch> wwitzel3: pong
[20:12] <wwitzel3> natefinch: so I wanted to further pick your brain about using another syslog library .. is that something I should be looking at for dealing with this all-machines.log issue? Or do we want to worry about that as a separate concern?
[20:14] <wwitzel3> natefinch: I guess that only solves the aggregation on windows, it doesn't solve the rotation problem.
[20:14] <natefinch> wwitzel3: yeah
[20:15] <wwitzel3> natefinch: should I just use logrotate (simplest path) for now?
[20:15] <natefinch> wwitzel3: yep
[20:16] <wwitzel3> natefinch: well that makes it easy :)
[20:16] <wwitzel3> well .. I don't have to make any choices .. it probably will be a pain in the ass because it is rsyslog and logrotate
[20:16] <natefinch>  yeah sorry
[20:16] <wwitzel3> but at least this part was easy :)
[20:17] <wwitzel3> also how in the world should I test this?
[20:17] <wwitzel3> ... wait 1 day and make sure the log rotates .. our tests already take long enough :P
[20:19] <natefinch> lol
[20:19] <natefinch> that wouldn't extend them all that much ;)
[20:19] <wwitzel3> hah
[20:19] <wwitzel3> *cry*
[20:20] <wwitzel3> natefinch: also do I need to request logrotate get added to some ppa/apt repo? cloud something or other?
[20:20] <natefinch> for actual unit tests.... honestly, screw it, this is an external application.  test that we set the config right
[20:21] <natefinch> wwitzel3: it may already be installed
[20:21] <natefinch> wwitzel3: and if not, I expect it'll be in whatever thing is already available
[20:22] <natefinch> i.e. sudo apt-get install should work
[20:22] <wwitzel3> natefinch: sounds good, thanks. starting on it now, I'll add a card for it
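A sketch of the kind of logrotate stanza under discussion; the path matches the all-machines.log mentioned above, but the rotation policy itself is just an example, not what juju ships.

    # /etc/logrotate.d/juju -- illustrative only
    /var/log/juju/all-machines.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        copytruncate    # rotate in place so rsyslog keeps its open file handle
    }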
[20:22] <wwitzel3> is there a lp bug for it too?
[20:22] <natefinch> yes....... somewhere
[20:22] <wwitzel3> I searched but lp never gives me any results when I search
[20:22] <wwitzel3> lol
[20:24] <natefinch> wwitzel3: https://bugs.launchpad.net/juju-core/+bug/1078213
[20:24] <wwitzel3> thank you .. need to start a new site .. lngtfy.com (let nate google that for you)
[20:24] <natefinch> haha... search didn't find it for me, so I opened up the high bugs.... and it happened to be at the top
[20:25] <natefinch> heh reported 21 months ago
[23:58] <jcw4> Fix for a bug I saw cropping up on the CI builds: https://github.com/juju/juju/pull/480
[23:58] <jcw4> tests assuming ordered results
[23:59] <perrito666> sinzui: any news?