/srv/irclogs.ubuntu.com/2014/04/23/#juju-dev.txt

=== 21WAAD3MF is now known as wallyworld
perrito666anyone knows his way around state/open.go ?00:52
wallyworldperrito666: it depends on what you want to know, i might be able to help01:24
stokachucmars: hah guess you figured i based my plugin off yours :)01:29
wallyworldaxw: mornin'. can we have a hangout now instead of in an hour?01:29
stokachuwas gonna give you credit once i got something working01:29
axwmorning wallyworld. sure thing, just give me a moment01:29
axwwallyworld: erm, my sound isn't working. gotta fix that first...01:31
wallyworldok01:31
perrito666wallyworld: tx, sadly my head is falling on the kb so I better hit the bed before I introduce a bug instead of fixing the current one01:36
wallyworldperrito666: np, i'm on  call anyway now. if you have a question, feel free to email to the list or ask again later01:37
perrito666wallyworld: ok, I am more curious about fixing this bug than about going to sleep :p so here I go01:46
perrito666I am trying to fix the restore functionality01:47
perrito666now, at some point the restore calls state.Open(), I tried to replace it by using juju.NetConn and NewConnFromName and in all cases, it times out at mgo.DialWithInfo while trying to Ping()01:48
wallyworldperrito666: ok. there may also be someone else looking into that from juju-core01:49
perrito666"that" being?01:49
wallyworldi think Horacio Durán01:50
perrito666sadly that would be me01:51
wallyworldhe's started to fix some of the backup bugs and was also going to look at restore01:51
wallyworldoh01:51
wallyworldhi01:51
perrito666hi01:51
wallyworldi didn't realise!01:51
wallyworldperrito666: give me a couple of minutes to finish this call01:52
perrito666sure01:52
wallyworldperrito666: sorry, back now02:00
wallyworldi'm not across the restore stuff specifically02:00
perrito666wallyworld: I think the restore part of my explanation can be safely ignored02:01
axwwallyworld: gotta go to the shops for a little while, bbs02:01
perrito666I just provided it for context02:01
wallyworldaxw: sure, np02:01
wallyworldperrito666: so you are looking to, in general, replace calls to state.Open() with juju.NewConn ?02:02
wallyworldto use the api02:02
wallyworldso you definitely have a state server running?02:03
wallyworldapi server even02:03
perrito666wallyworld: well I am pretty sure I do, I try to query mongo by hand and it responds, yet when juju tries to dial it just times out02:05
wallyworldmongo != api server though02:05
wallyworldthe api server listens on port 1707002:06
perrito666true, although I am pretty sure this breaks before getting to state02:06
wallyworldwhat code are you changing?02:06
perrito666well, the existing code calls Open, and Open in turn calls DialWithInfo02:07
wallyworldwhich file?02:07
perrito666DialWithInfo creates a session02:08
perrito666ah sorry02:08
perrito666state/open.go02:08
wallyworldsure, but the caller to that02:08
wallyworldwhich caller of state.Open() is being replaced?02:08
perrito666cmd/plugins/juju-restore/restore.go02:08
perrito666around :18702:09
wallyworldso at the time restore runs, is there a bootstrap node running?02:09
wallyworldi don't think there is02:09
wallyworldah there may be02:10
wallyworldcause looks like it calls rebootstrap()02:10
perrito666there is02:10
wallyworldbut you might find that it is just that the api server has not started yet02:10
wallyworldcause it can take a while to spin up the bootstrap node and then start the services02:11
wallyworldmaybe to see if that's the issue, pause the restore script or add in a big attempt loop to see if it just needs more time02:12
perrito666wallyworld: mm I tried looping on that02:12
perrito666I waited 30 mins total02:12
perrito666that is a lot02:12
wallyworldcan you do a juju status when it fails?02:12
wallyworldie does juju status work?02:12
wallyworldthat would need an api server connection02:12
perrito666mm, it does not02:13
wallyworldso if juju status is broken also, then there's an issue with the bootstrap node02:13
wallyworldyou would need to ssh in and look at the log file02:13
wallyworldcause it could be the node itself starts but then the juju services fail to start02:14
perrito666mm, the service seems to be running, I even restarted it by hand02:15
perrito666in what port should the state server be listening?02:15
stokachu3701702:15
wallyworld1707002:15
wallyworld37017 is ongo02:15
wallyworldmongo02:15
wallyworldperrito666: when you say you restarted the state service by hand, that doesn't make sense to me because the state service runs inside the machine agent - did you start jujud?02:16
perrito666wallyworld: yes02:17
wallyworldand the machine log file is good?02:17
wallyworldand yet juju status fails also02:18
wallyworldthere's gotta be something logged which shows the problem02:18
wallyworlduntil something like juju status is happy, then the code changes to restore.go won't work either02:18
perrito666wallyworld: interesting though, restore is trying to open a state server on 3701702:19
wallyworldthe current restore using state.open()?02:20
wallyworldit will because it connects straight to mongo02:20
wallyworldthe new juju.NewConn() methods instead go via the api server on port 1707002:20
perrito666aghh, juju.NewConn fails just as Open, so something is definitely broken in my recently restored node02:21
stokachuwallyworld: is that in trunk yet?02:22
stokachumy logs show NewConnFromName accessing mongo directly on 3701702:22
wallyworldstokachu: the api server stuff?02:22
stokachuyea02:22
wallyworldyes, been there since 1.1602:22
wallyworldused universally since 1.1802:22
wallyworldperrito666: i'd be surprised and sad if the log files on that node didn't show what was wrong02:23
* perrito666 run the extremely tedious setup script02:24
wallyworldperrito666: it will still be waiting for you tomorrow after you get some sleep :-)02:25
perrito666wallyworld: certainly but now its personal02:26
wallyworldlol02:26
wallyworldfeel free to pastebin logs files if you want some more eyes02:26
* perrito666 paints canonical logo on his face and yells mel gibson style02:26
stokachuwoot i actually a juju plugin to do something in go02:26
perrito666stokachu: I sense a verb missing there :p02:28
wallyworldwould have been funnier if you said "i a missing verb there" :-)02:29
stokachuhah02:29
axwback...02:29
stokachutoo much time looking at juju core code02:29
waiganiwallyworld: axw: I'm here for standup02:30
perrito666wallyworld: my wife is watching tv in spanish next to me, when 2 lang module enabled in my head I lose capacity for witty sentences in both languages02:30
wallyworldwaigani: huh? i thought you were on holidays so we had it early :-)02:30
axwwaigani: we already had it early, weren't expecting you02:30
wallyworldbut we can have another02:30
waigani:(02:30
waiganiI'm in auk airport02:31
waiganiokay, maybe I can talk through what I'm doing?02:31
wallyworldwaigani: sure, i'm in the hangout02:31
axwbrt02:31
perrito666wallyworld: https://pastebin.canonical.com/108967/02:41
wallyworldperrito666: looking, sorry was otp02:44
perrito666on the same note https://pastebin.canonical.com/108968/02:44
wallyworldperrito666: is there any more in machine-0.log?02:47
perrito666wallyworld: well, there is before that although I am not sure if I can distinguish between pre/post restore (restore is a particularly ugly thing)02:52
wallyworldperrito666: what i mean is, after the output you logged. that log looks ok i think. there was one timeout with the api client connecting but that can happen and it appeared to be ok after that but i wanted to be sure by looking at subsequent logging02:53
perrito666nope, after that it just loops with https://pastebin.canonical.com/108969/02:55
wallyworldhmmm, ok. so that says there is an issue with the api server02:56
wallyworldyou may need to enable trace level logging and/or add extra logging to see why it's failing. i wonder if netstat shows the port as open02:57
perrito666tcp        0      1 10.140.171.13:59925     10.150.60.153:17070     SYN_SENT    4001/jujud02:57
wallyworldthat's a different ip address to what is being dialled02:58
wallyworldoh no02:58
wallyworldit's not02:58
perrito666nope, just without the dns name02:59
wallyworldyeah02:59
wallyworldif it were me, i'd have to add lots of extra debug logging at this point to see what's happening as i'm out of ideas02:59
wallyworldbut you can see even internally the machine agent api client can't start03:00
wallyworldso there's a core issue with starting the api server itself03:00
wallyworldaxw: local provider is sorta ok. it doesn't like starting precise containers on trusty although it used to. and if i start a precise container first and it fails, subsequent trusty containers also fail, but starting a trusty container first works03:01
perrito666wallyworld: well, I think the restore step is actually breaking the state api server03:01
perrito666since it works right before03:01
wallyworldlikely03:01
perrito666(restore bootstraps a machine and then untars the backup on top of it)03:01
wallyworldroger wrote all that so i have no insight off the top of my head as to what might be wrong03:01
axwwallyworld: ah ok. there have been a few bugs flying around about host vs. container series mismatch not working03:02
wallyworldaxw: yeah, i'm going to try explicitly setting default series to see if i can get precise to work. but precise failing should not also then kill trusty :-(03:03
perrito666wallyworld: I think there might be something wrong with the backup, tomorrow I will strip one into pieces and see what is wrong, as for me I am now officially out or tomorrow I will be sleeping on the kn at the standup03:04
perrito666kb*03:04
wallyworldnp, good night :-)03:04
axwwallyworld: oh I didn't see that bit... weird03:04
wallyworldyeah03:04
axwwallyworld: I think you can also bootstrap --series=trusty,precise to get it to work03:04
axwnot sure why trying precise would fail trusty tho03:05
wallyworldta, will try that also to try and get a handle on it03:05
* wallyworld -> food03:05
=== wallyworld_ is now known as wallyworld
axwwallyworld: I just pasted the output I see from destroy-environment with manual03:43
axwwallyworld: it's as I expected03:43
wallyworldaxw: i missed it as my laptop got disconnected03:43
axwwallyworld: I mean I pasted it in the bug03:43
wallyworldah, looking03:43
axw#130635703:43
_mup_Bug #1306357: destroy environment fails for manual provider <destroy-environment> <manual-provider> <juju-core:Incomplete> <https://launchpad.net/bugs/1306357>03:43
wallyworldaxw: clearly then i need to get my eyes tested as i had thought i included it all, sorry :-(03:45
wallyworldalthough i wish the last error was first03:45
axwwallyworld: nps. it does kinda get lost down there...03:45
wallyworldas it would read much nicer that way03:45
wallyworldie root cause, followed by option to fix03:46
=== vladk|offline is now known as vladk
axwwallyworld: I'm going to look at fixing these openstack tests. If you do have any spare time, it would still be useful if you could review the placement CL04:06
axwbut if you're busy then that's okay04:06
wallyworldaxw: funny you should mention that - just finished another review and am looking right now04:07
axwwallyworld: cool :)04:07
wallyworldaxw: this is a personal view, but i tend to think that if a method returning a (value, error) returns a err != nil, then the value should be considered invalid. so this bit irks me:04:17
wallyworldif c.Placement != nil && err == instance.ErrPlacementScopeMissing {04:17
wallyworldi would use an out of band signal like a bool or something04:17
axwwallyworld: err was originally nil, that was something william wanted04:17
axwI suppose I could change it to return a nil placement, and have the caller construct one04:18
wallyworldhmmm. is there value in adding a bool to the return values04:18
wallyworldor something04:18
axwI don't really think so, then you may as well just check if the scope has a non-empty scope04:19
wallyworldi sorta think that err != nil meaning the value is bad is kinda idiomatic Go04:19
axwyeah... probably should have just left it as it was04:20
wallyworldchange it since he isn't here :-)04:20
axwwallyworld: I think I will just change it to return a nil Placement, and then the caller will create a Placement with empty scope and the input string as the directive field04:22
wallyworldok04:22
wallyworldi think that sounds good04:22
axwthe caller needs to know the rule anyway, at least this way it's the usual case of nil value iff error04:22
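The "nil value iff error" shape agreed on above can be sketched roughly as follows. The names Placement and ErrPlacementScopeMissing echo the discussion, but the struct fields, the "scope:directive" syntax, and both function names are assumptions made for this example, not the code under review.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Placement is an assumed shape: a scope plus a free-form directive.
type Placement struct {
	Scope     string
	Directive string
}

// ErrPlacementScopeMissing is a stand-in for the error named in the chat.
var ErrPlacementScopeMissing = errors.New("placement scope missing")

// parsePlacement splits "scope:directive" (assumed syntax). When there is
// no scope it returns a nil Placement together with the error, so the
// value is never meaningful when err != nil.
func parsePlacement(s string) (*Placement, error) {
	i := strings.Index(s, ":")
	if i < 0 {
		return nil, ErrPlacementScopeMissing
	}
	return &Placement{Scope: s[:i], Directive: s[i+1:]}, nil
}

// placementFor is the caller-side rule: on a missing scope, build a
// Placement with an empty scope and the whole input as the directive.
func placementFor(s string) (*Placement, error) {
	p, err := parsePlacement(s)
	if err == ErrPlacementScopeMissing {
		return &Placement{Directive: s}, nil
	}
	return p, err
}

func main() {
	for _, in := range []string{"env:zone=a", "host1"} {
		p, err := placementFor(in)
		fmt.Printf("%q -> %+v, err=%v\n", in, p, err)
	}
}
```

The caller still has to know the fallback rule, as axw notes, but no code path inspects a returned value alongside a non-nil error.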
wallyworldsorta best of both worlds04:22
wallyworldta04:23
wallyworldaxw: with these lines in addmachine04:28
wallyworldif params.IsCodeNotImplemented(err) {04:29
wallyworld04:29
wallyworld135 if c.Placement != nil {04:29
wallyworldis there any point trying again if c.Placement is nil?04:29
wallyworldshould it just be a single if ... && ...   ?04:29
axwwallyworld: yes we should try again, because we're calling a new API method04:29
axwwallyworld: client.AddMachines now calls a new API method by default04:30
axwwallyworld: and client.AddMachines1dot18 calls the old one04:30
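A rough sketch of the retry pattern being described: call the new API method first, and fall back to the old one only when the server reports "not implemented". Everything here is hypothetical scaffolding (errNotImplemented stands in for the params.IsCodeNotImplemented check, and the two function arguments stand in for client.AddMachines and client.AddMachines1dot18). It also assumes that a request carrying a placement is rejected on an old server rather than retried, which the quoted lines suggest but the log does not show in full.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented is a stand-in for a server-side "not implemented" code.
var errNotImplemented = errors.New("not implemented")

// addMachine tries newCall, then falls back to legacyCall when the server
// does not know the new method. With a placement set, falling back would
// silently drop it, so this sketch returns an error instead (an assumption).
func addMachine(newCall, legacyCall func() error, hasPlacement bool) error {
	err := newCall()
	if !errors.Is(err, errNotImplemented) {
		return err // success, or an unrelated failure
	}
	if hasPlacement {
		return errors.New("placement is not supported by this server")
	}
	return legacyCall()
}

func main() {
	oldServer := func() error { return errNotImplemented }
	legacy := func() error { return nil }
	fmt.Println("no placement:", addMachine(oldServer, legacy, false))
	fmt.Println("with placement:", addMachine(oldServer, legacy, true))
}
```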
wallyworldoh, right. hadn't got to that bit yet, i recalled it was the same api from earlier review04:30
axwit was, I fixed it :)04:30
wallyworldbut i guess versioning04:30
wallyworldwish we had it04:30
axwindeed04:30
stokachudo i have to invoke "scp" with the ssh.Copy function in utils/ssh?04:32
axwstokachu: the openssh client impl will delegate to scp, if that's what you're asking04:34
stokachuhttps://github.com/battlemidget/juju-sos/blob/master/main.go#L89-L9404:34
stokachuso im trying to replicate juju scp within my plugin04:34
stokachuthis is my log output : http://paste.ubuntu.com/7312090/04:35
stokachui think my actual copyStr is incorrect as i was following was is required by juju scp04:35
* axw looks04:35
stokachuwhat is*04:35
axwstokachu: I think you want the target and source in separate args04:36
stokachuim a newb with golang as well so if i got stupid stuff in there04:36
stokachulemme try that04:37
axwstokachu: i.e. a length-2  slice04:37
stokachuok lemme see if i can make that happen04:37
wallyworldaxw: is there a reason why we store placement as a string and not a parsed object. and hence precheck takes a string and not a parsed struct etc. i would normally look to parse on the way in and then pass around the parsed struct etc so we fail as close to the system boundary as possible. am i missing a design decision?04:38
stokachusweet, gotten farther http://paste.ubuntu.com/7312102/04:39
axwwallyworld: originally I did that, william wanted it changed. it should not get to the environment if the scope doesn't match04:39
stokachuthough maybe i should be using the instance.SelectPublicAddress of machine?04:39
wallyworldaxw: hmmmm. ok. i disagree with william here then :-(04:39
axwstokachu: cool. ahh, "juju scp" does the magic of converting machine IDs to addresses04:40
axwwallyworld: why? the environment should not need the scope04:40
stokachuive got a execssh that i borrowed from someone that uses instance.selectpublicaddress04:40
stokachugoing to try that04:40
wallyworldaxw: what i mean is that the string should be parsed into whatever internal representation makes sense at the system boundary ie a struct of some sort, possibly different to what is used on the client ie minus the scope04:41
axwstokachu: see juju-core/cmd/juju/scp.go, hostFromTarget  -- that's where it maps machine IDs to addresses04:41
wallyworldand internal apis should then use that typed struct04:41
stokachuaxw: ahh i see that now04:42
wallyworldnot an "untyped" string04:42
wallyworldbut, doesn't matter, it's already been changed to get approval04:42
stokachutoo bad expandArgs isnt public04:42
axwwallyworld: the directive string is free-form, so how are you going to do that?04:42
axwwallyworld: it's up to the provider to decide what makes sense in directives04:43
wallyworldaxw: ah bollocks, i was thinking there was more to it than just a string. but you are saying that by the time it's stored, it represents a maas name or whatever04:43
wallyworldthat makes more sense. i hadn't fully re-groked the implementation04:44
axwwallyworld: as far as the infrastructure is concerned, it's an opaque blob of bytes. the provider will interpret it. provider/maas will interpret it as maas-name to start with04:44
wallyworldok04:45
axwwe may converge on some convention, like thing=value04:45
axwaz=uswest-1 or whatever04:45
axwstokachu: it's also worth noting that some providers (e.g. azure) require proxying through machine 004:46
axwstokachu: so you may want to just shell out to "juju scp" if you can...04:46
stokachuaxw: ah good point04:47
stokachucleaner than what im doing04:47
stokachuis there a shell function in juju-core thats exposed?04:47
stokachuor should i just use os.Exec04:47
axwstokachu: os/exec is as good as anything04:48
stokachuaxw: good deal04:48
stokachuill do that instead04:48
axwthere are some utils in juju, but I don't think they'd be useful04:48
stokachucool no worries04:48
wallyworldaxw: yeah, i'm a fan of a little more structure. but none the less, land that f*cker04:48
jamhazmat: fwiw the first line that api-endpoints returns is the one that we last connected to, so if you just do "head -n1" you can get the same output we used to give04:49
axwwallyworld: thanks04:50
wallyworldnp. sorry if i went over old ground04:50
axwnope, that's cool04:50
wallyworldjam: i was going to get your opinion on that bug - i'd like to close now as "invalid" or whatever given the other fix has landed04:51
jamwallyworld: sorry, which bug?04:51
wallyworldjam: the one you just remarked on above04:51
wallyworldbug 131122704:51
_mup_Bug #1311227: juju api-endpoints cli regression on trunk/1.19 <api> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1311227>04:51
jamwallyworld: localhost shouldn't be in the output04:52
jamand I would be fine pruning ipv6 by default04:52
wallyworldjam: it can be for local provider since localhost is the public address for local provider04:53
wallyworldjam: martin's branch does prune ip6 by default04:53
jamwallyworld: sure, I'm not saying don't print localhost when that's the address, but *don't* print localhost for ec204:53
axwwe shouldn't have localhost for ec2, but we would have 127.0.0.1 and that'll get pruned04:54
wallyworldjam: martin's branch probably ensures that's the case, since for ec2 localhost is machinelocal isn't it?04:54
jamwallyworld: hmmm... I don't know that Martin's patch is *quite* right. I'd rather still cache IPv6, but just not display them on api-endpoints04:54
axwwe don't use any scope heuristics for hostnames04:54
jamwallyworld: right, I think his patch is what we want, and we do want to be caching the network scope data instead of just addrs04:54
wallyworldjam: it's ok for now i think since we don't need/use ip6 yet04:54
wallyworldjam: so, i think then that kapil's bug has 2 bits 1. the ip6/127.0.0.1 stuff which martin's bug fixes, and 2. the multiple api address thing which is new and intended04:55
wallyworldso therefore we can mark the bug as invalid04:56
wallyworldright ?04:56
jamwallyworld: so I still think there are bits that we can evolve on api-endpoints. Namely, to change what we cache from just addrs to being the full HostPort content (which includes network scope), and then api-endpoints can grow flags to do --network-scope=public04:57
jamwallyworld: so while I think we've addressed the regression today04:57
jamI don't think the bug is "just closed"04:57
wallyworldsure, but that's not the bug as described04:57
wallyworldwe can get it off 1.19.1 at least04:58
jamwallyworld: right, i think the *regression* portion is stuff that we intend (multiple addresses, even per server), because we think they might be routable04:58
jamand we don't save enough information (yet) to be able to provide --network-scope04:58
wallyworldyep, i don't see any regression at all04:58
jam(and then default it to public)04:58
jamwallyworld: giving private addresses in api-endpoints by default is wrong04:59
jambut "good enough" for now.04:59
jamAnd hazmat has a point about actually grouping the data by server, so you have a feeling for what machine is a fallback04:59
wallyworldok, so let's retarget off 1.19.1 then04:59
jamSGTM05:00
wallyworldjam: 2.0 or 1.20?05:00
wallyworld2.0 i guess?05:00
jamI'd be ok with 2.005:03
waiganiaxw: when I use restore with patchValue I get this error: http://pastebin.ubuntu.com/7312196/05:04
stokachuso heres my latest change using juju scp https://github.com/battlemidget/juju-sos/blob/master/main.go#L89-L9605:05
stokachuand the error output http://paste.ubuntu.com/7312200/05:05
stokachui verified that juju ssh 1 and /tmp/sosreport*xz exists on the machine05:05
waiganianyway, I need to go catch a plane05:07
axwwaigani: sorry, need more context. show me in vegas :)05:08
stokachuaxw: -r doesn't work with machine num it seems05:09
stokachujuju scp 1:/tmp/test . works05:09
stokachubut juju scp -r 1:/tmp/test* . fails05:09
axwstokachu: you need to separate the command out into individual args05:09
axwstokachu: i.e. "juju", "scp", ...05:09
stokachuthis is manually running the command from the shell05:09
axwstokachu: there are some limitations with juju scp, I forget exactly how to pass extra args... lemme see05:10
stokachuhttp://paste.ubuntu.com/7312211/05:10
stokachuthats what ive tested manually05:10
axwstokachu: stick "--" before -r05:13
stokachuaxw: you da man05:14
jamaxw: is that juju 1.16? as 1.18 is a bit broken wrt scp05:14
jamstokachu: in 1.18 (for a while until it gets fixed) args for just scp must come at the end and be grouped05:14
axwjam: well I'm on trunk... I forget which versions do what wrt scp05:15
jamso: juju scp 1:foo 2:bar "-r -o SSH SpecialSauc"05:15
axwjam: what I just described does work on trunk, so presumably on 1.18 too?05:15
stokachuah05:15
axwjam: i.e. I just tested "juju scp -- -r 0:/tmp/foo /tmp/bar"05:15
jamaxw: https://bugs.launchpad.net/juju-core/+bug/1306208 was fixed in 1.18.1 I guess05:16
_mup_Bug #1306208: juju scp no longer allows multiple extra arguments to pass throug <regression> <juju-core:Fix Released by jameinel> <juju-core 1.18:Fix Released by jameinel> <juju-core (Ubuntu):Fix Released> <juju-core (Ubuntu Trusty):Fix Released> <https://launchpad.net/bugs/1306208>05:16
jamaxw: trunk just lets you pass everything, and you shouldn't need "--" I thought05:16
axwyou do need --, otherwise juju tries to interpret the args05:16
jamaxw: fairy nuff05:17
stokachuyea i had to use -- with 1.18.1-trusty05:20
stokachuaxw: that worked :D:D05:20
axwstokachu: cool :)05:21
vladkjam: morning05:29
jammorning vladk, its early for you, isn't it ?05:30
jamwell, early for you to be on IRC :)05:30
fwereadegood mornings05:42
waiganifwereade: morning :)05:50
jammorning fwereade, we've missed you05:53
fwereadewaigani, jam: it's nice to be back :)05:54
waiganiheh, easter holiday?05:54
jambrb05:55
axwhey fwereade05:57
axwfwereade: I was about to approve https://codereview.appspot.com/85040046 (placement directives) - do you want another look first?05:58
fwereadeaxw, I'll cast a quick eye over it :)05:58
axwokey dokey05:58
fwereadeaxw, ok, based on a quick read of your responses I think I'm fine -- my only question is exactly what happens with the internal API change as we upgrade06:01
axwfwereade: the provisioner will be unhappy until it has upgraded06:02
fwereadeaxw, I *think* that it's fine, given that the environment provisioner only runs on the leader state server, and therefore the upgrade happens in lockstep06:02
fwereadeaxw, but other provisioners?06:02
fwereadeaxw, hm, I have a little bit of a concern about error messages during upgrade06:02
axwfwereade: it will be the same for the container provisioners, I think06:02
jamback06:02
* axw checks06:02
fwereadeaxw, *we* might know they're fine06:02
fwereadeaxw, but people who read our logs don't get quite such a sunny prospect of our general competence06:03
jamaxw: so we talked about having EnsureAvailability with a value of say 0 just preserve the existing desired num of servers06:03
jamAFAICT, we never *record* the desired number of servers06:03
jamwe just have a number of things that are running.06:03
axwjam: it's implied by what's in stateServerInfo06:04
jamand we have stuff like WantsVote() but I can't see anywhere that sets NoVote=true to indicate that we no longer want to be voting.06:04
axwjam: len(VotingStateMachineIds)06:04
axwjam: that's done in EnsureAvailability, in state/addmachine.go06:04
jamaxw: sure, but isn't that the actual ones that are voting? I guess it would be an availability check?06:04
fwereadeaxw, this must ofc be balanced against the hassle of maintaining the multiple code paths06:04
axwjam: VotingMachineIds is really the ones that *want* to vote06:05
axwfwereade: just checking still, sorry06:05
fwereadeaxw, np06:05
fwereadeaxw, what I did with the unit agent the other day was just to leave it blocking until the state server it's connected to *does* understand the message, and then continue as usual06:06
axwfwereade: yeah, this is common to all provisioners - it will cause an error on upgrade for container provisioners06:06
axwhmm ok06:06
axwI'll take a look at that code06:06
axwfwereade: worker/uniter?06:06
fwereadeaxw, it's not the best code in the world but it seemed to work06:06
fwereadejust a sec yeah somewhere there06:06
axwfwereade: got it I think06:07
axw            logger.Infof("waiting for state server to be upgraded")06:07
axwyeah okay, I can add that in06:07
fwereadeaxw, cool06:07
* axw senses another need for API versioning imminently06:08
axwalthough I suppose we can just see that fields are zero values...06:08
axwfwereade: yuck, this means threading the tomb all the way through... oh well.06:09
axwI suppose it's for the best06:09
* fwereade glances pointedly at jam re API versioning06:09
* jam ducks and pretends to catch a plane06:09
* fwereade does understand06:10
jamfwereade: I made sure it was in the topics list06:10
fwereadejam, great, thanks :)06:10
axwjam: sorry, back to ensure-ha: if you just send 0 or -1 to state.EnsureAvailability, then it can load st.StateServerInfo() and set numStateServers=len(VotingMachineIds)06:10
jamaxw: I'm going to use 0, because it isn't otherwise valid, and we don't have to worry about negative numbers.06:12
axwsounds good06:12
jamaxw: I was thinking to do that originally, but trying to verify the actual meaning of the various values was ... tricky06:12
axwoh I don't have to thread the tomb, hooray06:12
axwjam: it's not super clear, I agree06:13
jamaxw: I was reading through the code and trying to figure out what the actual invariants are06:13
jamaxw: I was really surprised that ensureAvailabilityIntentions doesn't take into account the new request06:13
jamso we end up with 2 passes at it06:13
jamalso, the WantsVote vs HasVote split is confusing. Probably necessary, but very confusing06:14
axwjam: yeah, we need to know what the existing ones want to do06:14
axwjam: we certainly could do with some developer docs on this06:15
axwI don't understand what the peergrouper does, haven't looked at it at all06:15
axwI know what EnsureAvailability does, but it's easy to forget :)06:16
jamaxw: one advantage of "-1" is that it is odd :)06:16
axwheh06:17
jamaxw: I took out the <= 0 and it still failed, and had to remember 0 is even06:17
jamaxw: non-negative or nonnegative ?06:20
jamour error message currently says >006:21
jamand "greater than or equal to 0" is long06:21
axwjam: non-negative looks good to me06:21
jamthough non-math people won't get non-negative, I guess06:21
axwreally?06:21
jamnumber of state servers must be odd >= 006:21
jamnumber of state servers must be odd and >= 006:21
jam?06:21
axwwill non-math people understand >= ? ;)  sure, I guess so06:22
jamaxw: non-engineering/scientists sort of people don't distinguish "positive" from "nonnegative"06:22
jamaxw: I can't even say "must not be even"... -1 for clarity :)06:23
jamonly not06:23
axwhehe06:23
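The rule discussed above (0 means "keep the current number of voting state servers", anything else must be odd and positive) can be written down in a few lines. This is only a sketch: the function name and signature are invented, currentVoting stands in for len(VotingMachineIds), and the error text is one of the wordings floated in the chat.

```go
package main

import (
	"errors"
	"fmt"
)

// desiredStateServers resolves a requested count. A request of 0 keeps the
// current number of voting servers. Since 0 is itself even, it has to be
// special-cased before the parity check.
func desiredStateServers(requested, currentVoting int) (int, error) {
	if requested < 0 || (requested != 0 && requested%2 == 0) {
		return 0, errors.New("number of state servers must be odd and non-negative")
	}
	if requested == 0 {
		return currentVoting, nil
	}
	return requested, nil
}

func main() {
	for _, n := range []int{0, 3, 4, -1} {
		got, err := desiredStateServers(n, 3)
		fmt.Printf("requested %d -> %d, err=%v\n", n, got, err)
	}
}
```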
axwfwereade: updated https://codereview.appspot.com/85040046/patch/120001/13003506:44
jamaxw: updated "juju ensure-availability" defaults 3 https://codereview.appspot.com/9016004406:58
axwjam: looking06:58
jamaxw: note that I merged my default-series branch in there07:03
jamto get the test cases right07:03
jambut that didn't end up landing in the mean time07:03
axwok07:03
jamso there is a bit of diff that should be ignored, but you can't really add a prereq after the fact07:03
axwjam: reviewed07:12
axwjam, wallyworld: review for a goose fix please https://codereview.appspot.com/9054004307:20
jamlooking07:21
jamaxw: lgtm07:22
axwta07:22
axwfwereade: am I okay to land that branch, or are you still looking?07:24
* axw takes silence as acquiescence07:30
fwereadeaxw, sorry, yes, it looks fine :)07:41
axwcool07:42
axwjam: is the bot awake?07:46
jamaxw: checking07:47
jamaxw: it is currently running on addmachine-placement07:47
jamperhaps there was a queue?07:47
jamit's been going for 14 min07:47
axwokey dokey, thanks07:47
axwI thought my goose one would go through first07:47
jamaxw: I don't think there is relative ordering, and the bot only runs one at a time based on what it finds when it wakes up every minute07:48
jamso if you approve both, but it hasn't seen it07:48
jamthen it will wake up, get the list, and start on one07:48
axwok07:48
axwwheee, placement is in07:53
* axw does the maas bits07:53
* fwereade bbiab07:56
axwjam: the bot does do goose MPs, right?08:29
mgzaxw: it does08:30
mgzwallyworld: thanks for landing my branch08:30
wallyworldmgz: np, pleased to help08:30
wallyworldi also tested with local provider just in case08:30
voidspacemorning all08:37
jam1morning voidspace08:40
jam1axw: so the bot has "landed" your code, but the branch isn't a proper checkout, so it didn't get pushed back to LP08:43
jam1I'll fix it08:43
axwdoh08:43
axwjam1: thanks08:43
jam1axw: should be merged now08:45
mgzright, time to get a train to a plane, see you all next week!08:46
jam1mgz: see you soon08:46
jam1have a good trip08:47
jam1you'll see some of us tomorrow at gophercon, right?08:47
mgzjam1: thanks! and yeah, some this week08:48
jam1axw: lgtm on your dependencies branch09:01
axwjam1: ta09:01
jam1we'll have to make the bot get the latest version, though09:01
jam1fortunately, I know someone who is currently logged in09:01
axw:)09:01
axwI thought the bot updated now?09:01
jam1axw: it runs godeps09:02
jam1but that won't pull in new data09:02
jam1it does do go get -u when you poke config09:02
jam1axw: I can't *quite* get go get -u to not screw up the directory under test09:02
axwjam1: it does godeps? "godeps -u" updates the code thought...?09:03
axwthough*09:03
vladkjam1: please, take a look https://codereview.appspot.com/9058004309:05
vladkI will be offline until meeting09:05
=== vladk is now known as vladk|offline
axwwoop, add-machine <hostname> works... now the fun of updating the test service09:15
jam1axw: it sets the version of an existing tree to that revision. It does not *pull* data from remote sources.09:28
jam1so if it isn't present locally, godeps -u doesn't work09:28
axwjam1: ah right, I see09:28
jam1axw: so I haven't gotten a chance to dig into it thoroughly, but are we writing "/var/lib/juju/system-identity" via cloud-init? Or are we only using the cloud-initty stuff to get it on there via SSH bootstrap ?09:29
axwjam1: yes, that is how it is done now. I'm not a fan09:30
axwjam1: actually...09:31
axwjam1: sorry, no, we SSH in and then put it in place09:31
axwjam1: anything inside environs/cloudinit.ConfigureJuju happens after cloud-init, but only for the bootstrap node09:32
psivaahello, could someone help me build juju from source pls?09:34
psivaaI'm getting http://paste.ubuntu.com/7313347/ when i run go install -v launchpad.net/juju-core/...09:34
voidspacepsivaa: I'm just doing a pull and trying now09:38
voidspacepsivaa: works for me09:38
voidspacepsivaa: so I suspect you're using a "too old" version of Go09:39
voidspacepsivaa: what does "go version" say?09:39
voidspacepsivaa: I'm on 1.2.1 (built from source)09:39
psivaavoidspace: 'go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64' is the output for go version09:39
axwfwereade: maas-name support -> https://codereview.appspot.com/90470044/09:40
jampsivaa: actually that looks like an incompatible version of go crypto09:40
axwfwereade: still need to support it in bootstrap09:40
fwereadeaxw, awesome :)09:40
axw(and add-unit and deploy, but they're coming later)09:40
jampsivaa: if you "go get launchpad.net/godeps" you can run "godeps -u dependencies.tsv" and it should grab the right versions of dependencies09:40
psivaajam: ack, i did 'hg clone https://code.google.com/p/go.crypto/' to get go crypto.09:41
psivaajam: voidspace: thanks. i'll try your suggestion09:41
jampsivaa: gccgo 4.9 should be new enough09:42
jampsivaa: My guess is that go crypto updated their apis, which broke our use of their code09:42
jamand we haven't caught up yet09:43
jamwhich is why we have dependencies.tsv to ensure we can get compat versions09:43
psivaajam: ahh ack, i'll use that. thanks09:43
jampsivaa: if you don't want godeps, then you can hg update --revision 6478cc9340cbbe6c04511280c5007722269108e909:43
jamI think09:43
jampsivaa: looks like just "hg update 6478cc9340cbbe6c04511280c5007722269108e9"09:44
fwereadeaxw, LGTM, it's really nice to see it implemented with such a small amount of new code:)09:48
axwfwereade: :) thanks09:48
axwfwereade: sadly the bootstrap one will be a bit larger - I'll need to change Environ.Bootstrap09:49
fwereadeaxw, sure, but it's absolutely a desirable change, and subsequent ones (like zone on ec2) will themselves then basically come for free :)09:50
axwyup09:50
fwereadevladk|offline, ping me when you're back please -- wondering whether we should really share an identity across state servers, or whether we should be creating one each09:52
=== axw is now known as axw-away
fwereadevladk|offline, ah, forget it, I made bad assumptions in the first reading09:54
=== vladk|offline is now known as vladk
voidspacemy parents have just turned up for coffee10:06
vladkfwereade: ping10:06
voidspacebe afk for 15minutes :-)10:06
fwereadevladk, pong10:06
fwereadevladk, I see we have separate identities, sorry I misread; but I don't see when we'll rerun those upgrade steps. perhaps we'll definitely never need them?10:07
perrito666good soon to be morning everyone10:17
vladkfwereade: I just used a formatter struct, my code does nothing with upgrade. I don't know whether the SSH key will be distributed on tools upgrade. It wasn't my task.10:18
vladkBut the SSH key will be installed on every new machine with a state agent.10:18
vladkShould I investigate what occurs during upgrade?10:18
fwereadevladk, ahh, I see10:19
fwereadevladk, yes, please see if you can find a way to break it by upgrading at a bad time10:20
fwereadevladk, if you can't, then LGTM, just note it in the CL and ping me to give it the official stamp ;)10:21
fwereadeperrito666, heyhey10:21
fwereadeperrito666, sorry I left you hanging last week, I think I managed to send you another review a day or two ago though -- was it useful?10:21
jam1fwereade: AFAIK we don't have different identities, do we?10:22
jam1fwereade: https://codereview.appspot.com/90580043/patch/1/10013 concerns me10:22
jam1are we actually writing that to userdata ?10:22
jam1(exposing the secret ssh id)10:22
jam1I think axw-away claimed that we didn't actually do that during bootstrap10:22
perrito666fwereade: It was, although right now I put that on hold since I am juggling with a brand new set of restore bugs :p10:22
fwereadejam1, it does indeed look like we were, grrmbl grrmbl; but it looks to me like what we do now is generate a fresh id and add that to the system, as one of N keys for the state-server "user", per state-server-machine10:23
fwereadejam1, so I think it's solid -- did I miss something10:24
fwereadeperrito666, ok, great -- I'm here to talk further if you need me10:24
jam1fwereade: I haven't yet found that bit that you're talking about (where we actually generate the new value)10:25
jam1I see the code that if we have the value we write it onto disk10:25
jam1fwereade: but while we remove this: https://codereview.appspot.com/90580043/patch/1/1001210:26
jam1I don't see the SystemPrivateSSHKey being removed from MachineCfg10:26
jam1nor have I yet found anything that creates or populates the contents of identity10:27
jam1but I could easily just be missing it, though I've gone over the patch a few times now10:27
fwereadejam1, hum, yes, I now think I was seeing that bit in the upgrade instructions alone10:27
fwereadejam1, yeah, I think that's the only place -- vladk, thoughts? ^^10:28
fwereadejam1, but fwiw, I suspect that the stuff in cloudinit is actually not in *cloudinit*, only in the bit that gets rendered as a script when we ssh in at bootstrap time10:29
jam1fwereade: and we are calling AddKeys(config.JujuSystemKey, publicKey)  and setting it to exactly 1 key10:29
jam1fwereade: right, so I'm not very sure about the cloudinit stuff because we did the bad thing and punned it10:29
fwereadejam1, AddKeys is meant to *add*, not update -- did that change?10:30
jam1so that sometimes cloud-init is rendered to actual cloud-init10:30
jam1and sometimes it is rendered to a ssh script10:30
jam1fwereade: ah, it might10:30
fwereadejam1, believe me, I told the affected parties when they wrote the environs/cloudinit module *waaay* back in the day -- cloudinit is just one possible output format10:30
fwereadejam1, sadly I was not in an official tantrum-throwing position at that time ;p10:31
jam1fwereade: also, I think we have a point that steps118.go is only run when upgrading from 1.16 to 1.18, so it *won't* be run when upgrading to 1.20 (from 1.18)10:31
jam1but I don't think that actually matters here10:31
jam1as we don't actually need to fix upgrade10:32
jam1because HA is new in 1.19, so we don't have anything that we're upgrading10:32
psivaajam1: jfyi, godeps method made installing from source work for me. thanks10:32
fwereadejam1, I think that, yeah, upgrade is irrelevant except in that it's the one place that actually sets up the keys10:32
jam1fwereade: the issue is that if we are going to give each one a unique identity (which I think is better, fwiw, but I'm not sure if it breaks some assumptions)10:32
jam1I would expect us to see a change in AddMachine()10:32
jam1or EnsureAvailability10:32
jam1fwereade: it sets up the first key10:33
jam1fwereade: I really don't see how his patch would populate the new "identity" field in agent.conf10:33
jam1fwereade: but the fact that we have 3 or 4 types with a StateServingInfo method, and each gets its data from somewhere else10:34
jam1(might be API, might be agent.conf, might be ...)10:34
vladkfwereade, jam1: about https://codereview.appspot.com/90580043/patch/1/1001210:34
vladkThis is a part of ssh-init script construction.10:34
vladkNow the ssh key is passed inside of the agent.conf file. So I removed its direct creation.10:34
jam1vladk: right, I think that line is great10:35
jam1vladk: but I haven't managed to find the part that actually sets the contents of the agent.conf file10:35
vladkhere https://codereview.appspot.com/90580043/patch/1/1000510:35
vladkvia yaml marshaling10:36
jam1vladk: but what is setting it on the struct10:36
jam1(I'm also not sure that we're allowed to change the content of an agent.conf without bumping the format number, but that is a later concern)10:36
jam1vladk: I see a lot of stuff that "if we have the data set" gets it written to the right places, which all looks good10:37
jam1I just haven't managed to find a line that is "SystemIdentity = XXXXX"10:37
jam1vladk: going the route you did, I would expect to see a change in state/addmachine.go10:39
jam1to something in either EnsureAvailability or elsewhere10:39
jam1to create the system-identity data that the machine agent then reads from agent.conf later10:39
vladkjam1: https://codereview.appspot.com/90580043/patch/1/10008 set to StateServingInfo10:40
vladkhttps://codereview.appspot.com/90580043/patch/1/10005 set to formatter of agent.conf10:40
jam1vladk: thanks, fwereade^^ your original assumption is wrong, they all get the same value, and it is being written via cloud-init (from what I can tell)10:41
jam1which is sad news, I believe10:41
jam1vladk: I expected that we would be actually calling an API to get that data during cmd/jujud/machine.go10:41
jam1if we are only reading it from disk10:41
jam1then we wrote it to disk via cloud-init10:41
jam1which means we are passing our ssh secret key to EC210:41
jam1to hand back to us10:41
jam1we got away with it (slightly) with "bootstrap" because bootstrap actually SSH's onto the machine to write those files10:42
fwereadewell fuck10:42
jam1but all other provisioning is done via cloud-init and follow up calls to the API10:42
fwereadehonestly I'd expect us to just generate it at runtime10:42
fwereadejam1, wait, we're writing state-server info to new state servers we provision?10:43
perrito666wwitzel3: can you see me?10:43
jam1fwereade: I had originally thought they should be shared, but honestly, I like your idea to have the agent come up10:43
jam1check that it doesn't have one10:43
jam1generate it10:43
fwereadejam1, that's *all* meant to come over the API10:43
jam1and add the public key only to the list of accepted keys10:43
fwereadejam1, and indeed in this case there's no reason not to do it on the agent10:43
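[Editor's sketch, not part of the log.] The idea floated above (the agent comes up, checks that it has no identity, generates one, and publishes only the public half) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not juju-core's implementation: `ensureIdentity` and `roundTrip` are hypothetical names, the key is a plain RSA key in PEM form, and the step that would add the public key to the accepted keys over the API is left out entirely.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"path/filepath"
)

// ensureIdentity (hypothetical helper) returns the PEM-encoded private key
// stored at path. If the file does not exist yet, it generates a new RSA key,
// writes it with owner-only permissions, and reports created == true.
func ensureIdentity(path string, bits int) (data []byte, created bool, err error) {
	data, err = os.ReadFile(path)
	if err == nil {
		return data, false, nil // an identity already exists; reuse it
	}
	if !os.IsNotExist(err) {
		return nil, false, err
	}
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, false, err
	}
	data = pem.EncodeToMemory(&pem.Block{
		Type:  "RSA PRIVATE KEY",
		Bytes: x509.MarshalPKCS1PrivateKey(key),
	})
	// The private key never leaves the machine; only the public half would
	// be sent elsewhere (that step is not shown here).
	if err := os.WriteFile(path, data, 0600); err != nil {
		return nil, false, err
	}
	return data, true, nil
}

// roundTrip calls ensureIdentity twice on a temporary path to show that the
// second call reuses the key written by the first.
func roundTrip() (createdFirst, createdSecond, same bool, err error) {
	dir, err := os.MkdirTemp("", "identity")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "system-identity")
	first, createdFirst, err := ensureIdentity(path, 2048)
	if err != nil {
		return false, false, false, err
	}
	second, createdSecond, err := ensureIdentity(path, 2048)
	if err != nil {
		return false, false, false, err
	}
	return createdFirst, createdSecond, string(first) == string(second), nil
}

func main() {
	createdFirst, createdSecond, same, err := roundTrip()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(createdFirst, createdSecond, same)
}
```

Generating the key on the machine itself means the secret never has to pass through cloud-init user data, which is the concern raised earlier in the discussion.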
jam1fwereade: *I* don't understand the code very well10:43
jam1we do some crazy shit10:43
jam1about writing agent.conf10:43
jam1and then reading it back in10:43
jam1fwereade: all of the code in machine.go uses agentConfig.StateServingInfo()10:44
jam1fwereade: except line 24010:44
jam1where we call st.Agent().StateServingInfo()10:44
jam1and then call: err = a.ChangeConfig(func(config agent.ConfigSetter) {10:45
jam1config.SetStateServingInfo(info)10:45
jam1})10:45
jam1to get it written to disk10:45
jam1for everything else to read10:45
jam1fwereade: but I *think* there is a bug that you have to have it written to agent.conf first, so that you come up thinking you want to be an API server10:45
jam1fwereade: also see machine.go line 45810:46
jam1that says "this is not recoverable, so we kill it, in the future we might  get it from the API"10:46
jam1there *is* an issue with bootstrap, the first API server obviously has to get it from agent.conf10:46
jam1so there is some bit of we can't just always read from the api10:46
jam1I guess10:46
jam1but the swings and roundabouts make it hard for me to reason10:46
jam1anyway, standup time, switching machines10:47
jamfwereade: standup ?10:48
perrito666Horacio Durán10:59
perrito666jam:10:59
voidspacejam: on the logging, the theory is that all the state servers should have *all* the logging - so when bringing up a new state server it really shouldn't need to connect to *all* state servers to get existing logging. Any one (that is fully active) should do.11:38
jamvoidspace: I understand that, but when you go from 1 to 3, you'll probably see the other api server that is coming up at the same time, and then it is just random-chance if you get the full log or not11:38
jam(similarly going from 3-5)11:39
jamthough not going from degraded-2 to 311:39
voidspacejam: right, so being able to determine if it's fully active or not would help - but if we can't do that then maybe there's no other way11:39
jamvoidspace: I certainly understand why it might work, but my point would still be "we can iron out getting the backlog later, because it isn't the most important thing right now"11:39
voidspacejam: ok, understood11:39
voidspaceconnecting to all state servers and filtering out duplicate logging offends me though11:40
voidspace(and it's O(n^2) if you bring up lots of state servers11:40
jamvoidspace: its O(n) if the data was properly sorted :)11:41
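[Editor's sketch, not part of the log.] The "O(n) if the data was properly sorted" remark can be illustrated with a k-way merge that drops adjacent duplicates. This is only a sketch under assumptions that the log does not confirm: each state server's log is already sorted by timestamp and then by text, and a duplicate is an entry that is byte-for-byte identical. The `entry` type and `mergeDedup` function are hypothetical names.

```go
package main

import "fmt"

// entry is a hypothetical log record; real log lines carry more fields.
type entry struct {
	Time int64 // timestamp, e.g. nanoseconds since the epoch
	Line string
}

// less orders entries by timestamp, then by text, so identical entries
// from different servers end up adjacent in the merged output.
func less(a, b entry) bool {
	if a.Time != b.Time {
		return a.Time < b.Time
	}
	return a.Line < b.Line
}

// mergeDedup merges already-sorted streams and drops repeated entries.
// Each output step scans the k stream heads, so the cost is O(n*k); with
// k capped at a small constant this is linear in the total entry count n.
func mergeDedup(streams ...[]entry) []entry {
	pos := make([]int, len(streams)) // next unread index per stream
	var out []entry
	for {
		best := -1
		for i, s := range streams {
			if pos[i] >= len(s) {
				continue // this stream is exhausted
			}
			if best == -1 || less(s[pos[i]], streams[best][pos[best]]) {
				best = i
			}
		}
		if best == -1 {
			return out // every stream is exhausted
		}
		e := streams[best][pos[best]]
		pos[best]++
		// Identical entries sort next to each other, so comparing with the
		// last emitted entry is enough to filter duplicates.
		if len(out) == 0 || out[len(out)-1] != e {
			out = append(out, e)
		}
	}
}

func main() {
	a := []entry{{1, "a"}, {2, "b"}, {4, "d"}}
	b := []entry{{2, "b"}, {3, "c"}, {4, "d"}}
	for _, e := range mergeDedup(a, b) {
		fmt.Println(e.Time, e.Line)
	}
}
```

One caveat of this approach: a server that legitimately logged the same line twice with the same timestamp would have those entries collapsed into one.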
natefinchdefinitely just ignore the backlog for now. We'll get a real logging framework set up that will do more than rsyslog.  There's a topic for it in Vegas.11:42
jamthough you only ever have 7 state servers (because we use mongo, and mongo has that limit)11:42
voidspaceah11:42
voidspacestill, I'm sure we can do better11:42
natefinchjam: in theory you can have up to 12 as long as only 7 are voting.11:42
vladkjam: 1) do we need different identites on different machines?11:46
vladk2) should I find places where agent.conf is written and where SystemIdentity is assigned?11:46
ghartmanndo we already have any clue why add-machine doesn't work for local providers anymore ?11:46
jamghartmann: I hadn't heard that that was the case11:53
jamis there a bug/context/paste ?11:53
ghartmannI don't get any logs at all11:53
ghartmannthe machines just stick on pending11:53
ghartmannI tried installing on the VM and seen the same issue11:54
ghartmannI decided to roll back to 1.1811:54
ghartmannand it's kinda working11:54
ghartmannI can't boot precise but trusty works11:55
ghartmannby the way11:58
ghartmannI am willing to help but I am struggling a bit on how to debug the code11:58
fwereadeghartmann, sorry, my internet is up and down, I am missing context12:01
fwereadeghartmann, but I would like to help you if I can12:01
ghartmannI am currently using juju for local provider only12:05
ghartmannbest way to prototype and fix charms12:05
ghartmannbut since I updated juju I am unable to start any machines12:05
ghartmannor they start but that takes way too long12:05
ghartmann30 minutes if they do start12:06
fwereadeghartmann, hmm, that "way too long" is really interesting, to begin with it sounded like it might be https://bugs.launchpad.net/juju-core/+bug/130653712:06
_mup_Bug #1306537: LXC provider fails to provision precise instances from a trusty host <deploy> <local-provider> <lxc> <juju-core:Triaged> <juju-quickstart:Triaged> <https://launchpad.net/bugs/1306537>12:06
ghartmannI would imagine that someone has reported it because being unable to start machines is a breaking issue12:08
ghartmannI am trying to understand why this happens and how can I help12:09
fwereadeghartmann, ok, the best way to collect information is to `juju set-env "logging-config=<root>=DEBUG"`; and then to look in /var/log/juju-<envname>12:12
fwereadeghartmann, in fact looking at the lxc code you might want to set juju.container.lxc=TRACE12:14
jam1fwereade: I think if you "juju bootstrap --debug" it does that level of logging, doesn't it ?12:15
jam1DEBUG (not TRACE)12:15
fwereadejam1, yeah, I was assuming an existing environment12:15
fwereadejam1, but if it's not working I guess there's not much reason to keep the old one around12:16
fwereadejam1, and in particular a lot of the lxc stuff is only logged at trace level, I now observe12:16
jam1vladk: so having unique identities is more of a "it would be nice if they did" rather than "they must"12:18
fwereadeghartmann, if you're struggling to find *where* in the code I would start poking around in the container/lxc package -- specifically CreateContainer in lxc.go -- but I'm not sure if that's what you're asking12:18
ghartmannthe debug helps a little bit but it seems it believes that it worked ... "2014-04-23 12:16:50 INFO juju.cmd.juju addmachine.go:152 created machine 4"12:20
jam1ghartmann: created machine is creating a record in the DB for a new machine12:21
fwereadeghartmann, that just indicates that it recorded we'd like to start the container12:21
jam1!= actually started a machine12:21
ghartmannah ok12:21
fwereadeghartmann, it's possible that the provisioner is implicated, but in particular the slowness STM to point to the actual nuts and bolts of the container work12:22
jam1fwereade: so I think his statement was "it isn't working after 30 minutes" which means it hasn't actually worked yet12:22
fwereadejam1, ok, I see :)12:22
jam1fwereade: ghartmann: if it *was* working, it would still need to download the precise/trusty cloud image, but that download should only need to happen once12:22
ghartmannI will try looking on lxc12:23
fwereadeghartmann, do you see any lines mentioning the provisioner in the logs?12:23
fwereadeghartmann, in particular "started machine <id> as instance ..."12:24
ghartmannopening environment local12:24
ghartmannno started machine12:24
ghartmannyou mean on .juju/local/log right ?12:25
ghartmannI am stop starting the machine manually12:26
ghartmannit seems that the machine can't start a network device12:27
fwereadeghartmann, ah! you get a container created but it won't do anything?12:28
ghartmannit seems that the lxc-start doesn't start the machine12:32
ghartmannI will try to get it working first12:33
ghartmannit is something related with the network12:33
ghartmannit seems that the network of the machine doesn't start12:34
ghartmannI will try making it as a bridge12:34
ghartmannwill let you know once I finish it12:34
ghartmannthanks for the ideas12:34
fwereadeghartmann, there's a "network-bridge" setting for the local provider which defaults to lxcbr0 -- that works for most people, but possibly you have a different setup there?12:34
ghartmannI am using the standard12:34
ghartmannbut I will change a few things on my network12:35
ghartmannwill take a while12:35
jamfwereade: so there is a bug that deploying precise on trusty will fail because of "no matching tools found"12:37
jamfwereade: 2014-04-23 12:36:43 ERROR juju runner.go:220 worker: exited "environ-provisioner": failed to process updated machines: cannot start machine 1: no matching tools available12:37
fwereadejam, is that different from the one I linked?12:37
jamfwereade: it might be the root cause of the one linked, I'm not sure12:38
jamfwereade: ghartmann: so one option is to try running "juju bootstrap --series precise,trusty" or possibly "juju upgrade-juju --series=precise,trusty --upload-tools" to see if that gets things unstuck. But for *me* the provisioner is spinning on not creating an LXC instance because it cannot find the right tools12:42
jamif you got past that part12:42
jamfwereade: so it would seem that if the provisioner cannot provision machine 1 because of no tools, it won't try to provision machine 213:21
jam(in this case, the former is precise, the latter is trusty)13:21
fwereadejam, I think the core of it all is tools.HasTools13:23
fwereadejam, oh, wait, it actually can't be here, can it13:23
fwereadejam, but the provisioner task's possibleTools method is all messed up anyway :/13:25
jamfwereade: the check we have that all machines are running the same agent version also fails when you have dead machines (since nil != "1.18.1.1")13:26
jamso you can't use "juju upgrade-juju --upload-tools --series precise,trusty" to trick it13:26
fwereadejam, not without force-destroying the machines, yeah13:26
jamfwereade: but for *me* if I "juju bootstrap -e local --upload-tools --series precise,trusty" it works13:26
jamwithout the --series trick, it gets stuck never finding tools for the precise charm13:27
jamand then never getting to try for the trusty charm13:27
jamseemingly13:27
=== BradCrittenden is now known as bac
fwereadejam, it seems reasonably likely that the provisioner is just failing out on the first one, and then trying again in the same order when it comes back up13:28
jamfwereade: right13:29
jamfwereade: I would have thought the provisioner would fail and keep trying the next one13:29
jamthough perhaps the idea is that if tools aren't available yet, it isn't worth trying until later?13:29
fwereadejam, yeah, unless explicitly handled otherwise we assume that errors might fix themselves if we try again later13:30
fwereadejam, frankly it's insane that the provisioner even knows about tools in the first place13:30
jamfwereade: well, it needs to pass them to cloud init13:33
jamso that the machine that is starting up can get them13:33
jamfwereade: why is that insane ?13:33
fwereadejam, the environ *already knows about the tools*. we *ask it where to find the tools*.13:34
voidspacelunch13:34
fwereadejam, a bit more than a year ago, we managed to refactor some of the way, but not all13:34
jamfwereade: is it intended to stay that way? Given we've talked about object storage in mongo13:34
fwereadejam, tools-in-state would indeed change the picture significantly, it's true13:37
fwereadejam, but even then the provisioner would just be a dumb pipe wrt tools, I think13:38
jamfwereade: I thought "juju destroy-machine --force" was intended to prevent this status:13:39
jam  "2":13:39
jam    instance-id: pending13:39
jam    life: dead13:39
jam    series: trusty13:39
fwereadejam, hmm, yeah, the provisioner ought to be able to kill all the dead machines before it starts worrying about the live ones13:40
jamfwereade: well it is possible that it will get to it soon, but it is stuck downloading the cloud-image template13:40
jamwhich is a few MB13:40
jamlike 100 or so13:40
fwereadejam, btw, I don't suppose you know where that "instance-id: pending" business comes from?13:40
fwereadejam, either we have an instance-id or we don't13:40
jamfwereade: in that particular case, the "trusty-template" fslock was left stale13:41
jamwhen I called "destroy-environment" while not waiting for trusty to come up.13:41
axw-awayjam: just saw your message about system-identity in cloud-init. that test you linked to is a bit misleading; it's running Configure, when it should be running ConfigureBasic13:41
axw-awayjam: IOW, the test does not reflect what we really do on bootstrap13:41
fwereadeoh WTF13:41
jamfwereade: I'm also seeing: 2014-04-23 13:41:08 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "": no instances found13:41
* axw-away goes back away13:42
fwereadejam, looks.like m.InstanceId is not erroring when it should?13:44
jamfwereade: perhaps13:47
jamfwereade: so from what I can sort out, vladk's patch is worth landing. I'm still confused by bits of it (why is it working), but I can accept that it might just be because I don't understand the swings and roundabouts13:53
jamcertainly he said he confirmed that secrets aren't going to EC213:53
jamfwereade: a potential fix for bug #1306537: https://codereview.appspot.com/9064004313:54
_mup_Bug #1306537: LXC local provider fails to provision precise instances from a trusty host <deploy> <local-provider> <lxc> <juju-core:In Progress by jameinel> <juju-core 1.18:In Progress by jameinel> <juju-quickstart:Triaged> <https://launchpad.net/bugs/1306537>13:54
hazmatquestion via email this morning.. local provider (using lxc).. doing deploy --to kvm:0 is supported?13:57
jamhazmat: my understanding is that it has worked, perhaps accidentally but it was working13:58
wwitzel3voidspace: I'm going to grab an early lunch and do an errand and we can sync up with where we are at when I get back.14:00
fwereadejam, I'm worried about that because tim added a hack somewhere else in an attempt to resolve essentially the same problem14:02
fwereadejam, except it's not quite the-same *enough* I guess14:03
jamfwereade: so there is certainly a bit of "this worked for me" vs feeling good about the change. but I have the strong feeling that feeling good about the change means a much bigger overhaul of our internals14:03
jamfwereade: so I filed bug #131167714:04
_mup_Bug #1311677: if the provisioner fails to find tools for one machine it fails to provision the others <provisioning> <status> <ui> <juju-core:Triaged> <https://launchpad.net/bugs/1311677>14:04
jamand looking at it14:04
jam(the startMachines code)14:04
jamit does exit on the first failure14:04
jamand we have the fact that on "normal" provisioning failures14:04
jamwe call "task.setErrorStatus"14:04
jamso if one fails14:04
jamwe mark it failing14:04
jamand then just go back to doing the next thing when we wake up again14:05
jamhowever, if possibleTools fails14:05
jamwe *don't* call setErrorStatus14:05
jamso that machine stays around blocking up all other work14:05
jamfwereade: my concerns. 1) We could try to keep provisioning even on errors, but if we are getting RateLimitExceeded, we really should just shut up and go sleep for a while14:06
jam2) Do we expect that possibleTools is actually going to resolve itself RealSoonNow ?14:06
jamnow that we have the idea of Transient failures, could we treat no tools there ?14:06
fwereadejam, still thinking14:08
fwereadejam, re (1), I really think we have to do the rate-limiting inside the Environ, and use a common Environ for the various workers that need one14:08
jamfwereade: so even with that we are likely to eventually exceed our retries14:09
jam(say we retry up to 3 times, do we want to come back tomorrow?)14:09
jamI don't think we want to block a worker thread completely in Environ for more than ... minutes?14:09
* jam gets called away to actually be part of a family14:10
fwereadejam, if you come back sometime soon: I don't think that tools failure is transient, so I don't think treating it as such will really help -- setErrorStatus is probably the right answer to the problem (apart from anything else, precise/trusty are not the only series people will use even if they are *today*)14:13
fwereadeto *that* problem14:13
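[Editor's sketch, not part of the log.] The behaviour discussed above (a tools lookup failure for one machine blocking every machine after it) and the suggested remedy (record an error status and move on) can be sketched as below. Every name here is illustrative: `machine`, `findTools` and `startMachines` only echo the terms used in the conversation and are not the juju-core types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// machine is a hypothetical, stripped-down record of a machine to provision.
type machine struct {
	ID     string
	Series string
	Status string
}

var errNoTools = errors.New("no matching tools available")

// findTools stands in for the tools lookup: it fails when no tools are
// available for the requested series.
func findTools(available map[string]bool, series string) error {
	if !available[series] {
		return errNoTools
	}
	return nil
}

// startMachines tries every machine in order. A machine whose tools cannot
// be found is marked with an error status and skipped, rather than aborting
// the loop and leaving the remaining machines pending.
func startMachines(machines []*machine, available map[string]bool) (started int) {
	for _, m := range machines {
		if err := findTools(available, m.Series); err != nil {
			m.Status = "error: " + err.Error()
			continue
		}
		m.Status = "started"
		started++
	}
	return started
}

func main() {
	machines := []*machine{
		{ID: "1", Series: "precise"},
		{ID: "2", Series: "trusty"},
	}
	available := map[string]bool{"trusty": true}
	n := startMachines(machines, available)
	fmt.Println("started:", n)
	for _, m := range machines {
		fmt.Println(m.ID, m.Series, m.Status)
	}
}
```

Marking the machine as errored makes the failure visible in status output and keeps the worker from retrying the same machine in the same order forever. Rate limiting, raised as the other concern, would need separate handling.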
natefinchfwereade: definitely, no tools is likely to be a semi-permanent problem for all intents and purposes, certainly not something likely to get fixed within a small number of minutes, which is the most amount of time I can conceive of actually waiting for something to succeed.14:14
hazmatjam, it works, the question is it supported, i thought thumper had said that it was, but various folks are getting mixed signals on it14:21
hazmatso there's some confusion in regard14:21
sinzuijam, fwereade, I think we are 2+ week away from a stable 1.20. I want to try for a 1.18.2 release this week.14:22
natefinchhazmat: it works by accident.  I wouldn't say it is "supported"14:22
jam1sinzui: so my understanding is that there is very strong political pressure to get something out that has HA in a 'stable' release by the end of the week. We don't have to close all the High bugs to get there.14:23
natefinchhazmat: which is to say, I wouldn't rely on it working in the future.14:23
jam1I think we might be able to do a 1.19.1 today14:23
jam1which will be missing debug-log in HA, and backup/restore, I think14:23
jam1but I think we can land Vladk's patch to get "juju run" to work in 1.19.1 and HA14:23
sinzuijam1, You cannot have stable release until after users have given feedback. If I release today, you still don't get feedback until next week14:24
hazmatnatefinch, so if we have folks that need a working solution for lxc and kvm today that need a supported solution, the answer is you're out of luck? and we don't support lxc and kvm in the same local provider.14:24
jam1fwereade: sinzui: alexisb (if around) I'm not the one who has the specifics for why we need HA available for April 25th, can you give more context ?14:24
sinzuijam1, also CI still doesn't pass HA. Someone might need to work with abentley to make the test pass or find the bug that might be in the code14:25
fwereadehazmat, I don't *like* it, but ISTM that it's (1) useful and (2) used, so we don't have any reasonable option for breaking it without providing an alternative14:25
hazmatfwereade, there's an extant bug on the latter to support kvm and lxc containers in the same provider, which would also work, but it's a bit more work.14:25
jam1fwereade: hazmat: I would agree with the "we shouldn't break it without providing another way"14:25
jam1hazmat: you still have the problem with spelling "I want to deploy the next one into KVM", unless we go all the way and make all the things you deploy prefixed14:26
hazmatok.. so supported for now .. till we have something better :-)14:26
hazmatjam, any placement effectively bypasses constraints14:26
hazmatfwereade, jam1, thanks14:27
sinzuijam1, alexisb, fwereade: I am not here to be the voice of idealism. I am the voice of pragmatism. We know developers, users, and CI find bugs, and all three need to affirm the feature works. There is not enough information to call HA stable for release14:27
fwereadejam1, hazmat: or we bite the bullet and get multi-provider environments going; at which point it's just another pseudo-provider and should Just Work14:27
fwereadejam1, hazmat: but I'm not confident that'll happen any time soon14:27
jam1fwereade: then there is the argument that cross-env relations is better than multi-provider ones14:27
jam1fwereade: if only because for most of them, you actually still want to run an agent local to that provider14:28
alexisbjam1, the 4/25 date for the 1.20 release was set because the target for a release with HA is ODS and jamespage needs some time to integrate14:28
hazmatlong term that sounds great, manual provider with cross region worked well enough for most of those cases for me till 1.19 (the address stuff breaks it)14:28
alexisbbut as sinzui points out it has to be ready, which it is not14:29
jam1alexisb: fwiw, it is probably ready enough for jamespage to look into integrating it14:29
alexisbjam1, ok, we should connect with jamespage then14:30
sinzuialexisb, jamespage If you get juju 1.19.1 with HA this week, is that good enough to test?14:30
natefinchjam1, alexisb: that was going to be my thought as well.  There's some edge case stuff that should be fixed, but the main workings are all there14:30
jam1sinzui: though probably we'll want to get 1.19.1 rather than have him running trunk14:30
jam1sinzui: I was trying to assign someone to work on the HA bug today ,I think natefinch is the one that volunteered to get the test running14:30
alexisbsinzui, jam1 how close are we to a 19.1 release?14:30
alexisbI see 2 critical bugs still being worked14:31
sinzuialexisb, jam1, you are actually on schedule for a Friday release14:31
jam1alexisb: one of those should have a patch that should be landing, I don't know for sure why it hasn't14:31
sinzuiI just don't see that release being called 1.2014:31
jam1the other is "juju backup" which is also supposed to have something from perrito666, but may not have to block 1.19.114:31
alexisbsinzui, agreed14:31
jam1sinzui: I agree, I don't think 1.19.1 is 1.2014:31
jam1but it is HA out for testing14:31
* perrito666 feels conjured14:31
jam1to get feedback to drive a proper 1.2014:32
jam1perrito666: so you were working to get "juju backup" to find /usr/lib/juju/bin/mongod when available, did that get done?14:32
alexisbjamespage, would a 1.19.1 development release be enough for you to begin testing and integration?14:32
sinzuijam1 yep14:32
jam1alexisb: I know of 2 things that are just-broken when you run HA (juju debug-log and juju run), but we have a patch for the latter, and wwitzel3 and voidspace on the former.14:33
fwereadejam1, I'm not sure how important it is to have a local state-server in the *long* term, but in the short term it is true that we benefit a lot from it14:33
jam1natefinch: did you get to look into the HA CI test suite? Can you give me an update on it by your EOD, as I can look at it tomorrow.14:34
perrito666jam1: I am actually trying to fix the whole thing together (backup/restore) since the test takes time I try to make the best of it, but I can propose the backup fix alone if you want14:34
sinzuijam1, returning to 1.18.2. You have diligently landed some fixes to it. I think there were a few more bugs that would be lovely to include. May I propose some merges to 1.18 to prepare a 1.18.2 that Ubuntu will love?14:34
natefinchjam1: looking at it now, late start to my day today, but i still have a lot of time to put into it.14:34
jam1perrito666: please never block getting incremental improvements on getting the whole thing. In general everyone benefits as long as it doesn't regress things in the mean time.14:35
fwereadeperrito666, I like small branches -- I know that a backup that can't be restored is no backup at all, but I'd still rather see a few branches that we merge all at once if we have to14:35
jam1sinzui: I have the strong feeling that 1.18 is going to stick in Trusty and we're going to be supporting it for a while.14:35
perrito666ack14:35
jam1sinzui: so while I'm not currently focused on it, because of 1.19 and HA stuff filling my queue14:35
perrito666:)14:35
jam1sinzui: patches seem most welcome to 1.1814:35
fwereadeperrito666, jam1: indeed, the only reason to hold off on landing one of those branches is if it does, in isolation, regress something14:35
alexisbjam1, are you thinking that 1.18 will be the long term solution for Trusty?14:36
sinzuijam1. okay. I will make plans for 1.18.214:36
natefinchsinzui: how do I investigate a CI failure?  I believe functional-ha-recovery-devel is the one I'm supposed to be fixing14:36
jam1alexisb: 1.18 doesn't have HA support, and will likely be missing lots of stuff. I just think that given our track record with actually getting stuff into the main archive, we really can't trust it14:37
sinzuinatefinch, abentley in canonical's #juju is seeing errors like this...http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/functional-ha-recovery-devel/64/console14:38
jam1alexisb: so likely we'll want something like cloud-archive for Trusty that provides the latest set of tools that we like14:38
sinzuinatefinch, abentley believes the problem is the test. it is not waiting for the confirmation that juju is in HA.14:38
jam1but I don't think we can actually expect to get things into the Ubuntu archive.14:38
sinzuinatefinch, abentley will ask for assistance if the test continues to fail after assuring itself that HA is up14:39
natefinchsinzui: cool.  I'm more than willing to help.  I know that working with mongo can be hairy14:39
alexisbjam1, yes we are working with the foundations team/TB to define the process for updating the juju-core package in Trusty14:40
alexisbI don't know yet what the process will be14:40
jam1alexisb: i might be being jaded, but cloud-tools:archive still has 1.16.3 because it never got 1.16.5 landed in Saucy14:40
jam1and that is... 6 months old?14:41
alexisband it could very well come via cloud-tools14:41
jam1alexisb: though again, we've struggled to get stuff in there, too14:43
hazmatare there any tricks to compiling juju with gccgo?14:44
sinzuijam1, alexisb : I thought jamespage had made progress getting juju 1.16.4..1.16.6 in old ubuntu. The issue was the backup and restore plugins...since the backup plugin wasn't in the code, we elected to not package it.14:45
fwereadejam1, re https://codereview.appspot.com/90640043 -- how about fixing environs/bootstrap.SeriesToUpload instead?14:46
jam1sinzui: so cloud-archive:tools still has 1.16.3 as the best you can get: http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/precise-updates/cloud-tools/main/binary-amd64/Packages14:46
alexisbwell HA is really important so we will need to fight the battles to get it into Trusty14:46
jam1fwereade: so instead of LatestLTSSeries it would do AllLTSSeries ?14:47
fwereadejam1, essentially, yeah14:47
fwereadejam1, if we were smart we'd only upload a single binary anyway but I'm not sure we got that far yet14:48
jam1fwereade: so at this point, I think using LatestLTSSeries is still a bit wonky since we really can't expect anything about T+414:48
jam1fwereade: we're not14:48
sinzuialexisb, jam1, we have never tested upgrade from 1.16.3 to 1.18.x. We need to test that if jamespage fails to get 1.16.6 into the cloud-archive...and hope it works14:48
jam1if you bootstrap --debug you can see the double upload14:48
fwereadejam1, yeah, thought so14:48
jam1sinzui: AIUI, the issue was that once Trusty releases, then the version in Trusty becomes the version in cloud-tools, so it will jump from 1.16.3 to 1.18.1 (?)14:50
sinzuijam1, right, that was the jamespage's fear.14:50
jam1fwereade: I would be fine moving it to SeriesToUpload, and *I* would be fine just making that function put Add("precise"), Add("trusty")14:50
jam1fwereade: but *I'm* way past EOD here14:51
fwereadejam1, but regardless, I think we're better off fixing SeriesToUpload (and maybe improving the double-upload, now that it's potentially a triple-upload) than adding another tweak to a code path that is in itself pretty-much straight-up evil in the first place14:51
jam1fwereade: so happy to  LGTM a patch that does that :)14:51
jam1even better that it could *actually* be tested14:52
fwereadejam1, quite so, that was my other quibble there ;)14:52
fwereadejam1, ok, I have a meeting in a few minutes and am not sure I will get to it today myself, but I'll make sure you know if I do14:53
bacsinzui: so the swift fix was a mirage?14:56
sinzuibac: yes14:56
bacdrats14:56
sinzuibac: and the corrupt admin-secret theory is crushed14:57
sinzuibac, also, staging machine-0 has been stuck in hard reboot for a week. I think we can say it is dead.14:58
jam1fwereade: I gave a summary of why vladk's patch works, mostly boiling down to the fact that what we write to the DB is the params.StateServingInfo struct, unlike most of our code which uses separate types for API from DB types15:15
jam1https://codereview.appspot.com/90580043/15:15
jam1vladk: are you able to land that patch today before sinzui can put together a release ?15:16
jam1(and get CI to pass on it, I guess)15:16
vladkjam1: yes15:16
jam1vladk: great15:16
jam1LGTM15:16
jam1vladk: can I ask that you file a "tech-debt" bug to track that we may want to have each API server have their own system identity?15:17
vladkjam1: ok15:17
jam1I think as long as we have the api StateServingInfo we can actually notice who's calling and give them a different value if we want15:17
hazmatit looks like 1.18 branch has deps on both github.com/loggo/loggo and github.com/juju/loggo are those the same ?15:19
jam1hazmat: they need to be only one, otherwise the objects internally are not compatible15:21
jam1it should all be "github.com/juju/loggo"15:21
hazmatjam1, 1.18 stable branch -> state/apiserver/usermanager/usermanager.go:     "github.com/loggo/loggo"15:25
hazmatjam1, thanks.. i'll mod locally15:25
jam1hazmat: please propose a fix if you could15:25
hazmatjam1, sure.. just need to get through the morning15:26
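For the mixed loggo imports above, a small checker can flag files that still use the old path. A minimal sketch using the standard go/parser package; the `staleImports` helper and the sample source are hypothetical, only the two import paths come from the discussion:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// staleImports parses the import section of a Go source file and
// returns the position of every import whose path equals stale.
func staleImports(filename, src, stale string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var found []string
	for _, imp := range f.Imports {
		// Import paths are stored as quoted string literals.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		if path == stale {
			found = append(found, fset.Position(imp.Pos()).String())
		}
	}
	return found, nil
}

func main() {
	// Hypothetical file contents mimicking the stale import.
	const src = `package usermanager

import (
	"github.com/loggo/loggo"
)
`
	hits, err := staleImports("usermanager.go", src, "github.com/loggo/loggo")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, h := range hits {
		fmt.Println(h, "should import github.com/juju/loggo instead")
	}
}
```

Run over every file in a branch, this would catch the case where two copies of the same package are linked in and their types no longer match.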
voidspacejam1: ping, if you have 5 minutes15:26
voidspacejam1: it can wait until tomorrow if not15:27
=== BradCrittenden is now known as bac
voidspaceooh, precise only has version 5 of rsyslog so we can only use the "legacy" configuration format15:42
voidspacelovely15:42
voidspacejam1: cancel my ping :-)15:47
voidspacenatefinch: ping15:47
natefinchvoidspace: howdy15:50
natefinchfwereade: where do I go to approve time off?16:00
perrito666jam1: fwereade sinzui https://codereview.appspot.com/90660043 this fixes the backup part of the issue16:05
perrito666so ptal?16:06
perrito666anyone is encouraged to, although be warned, it's bash16:07
fwereadenatefinch, canonicaladmin.com is all I know16:11
=== vladk is now known as vladk|offline
perrito666does anyone know why we are dragging the logs along in the backup? (and more precisely, why are we restoring them?) I mean I know we might want to back them up for analysis purposes, but restoring the old logs pollutes the information a bit17:04
jam1natefinch: you should be able to log into Canonical Admin and have "Team Requests" under the Administration section17:17
jam1perrito666: if you want to investigate why something failed in the past, you need the log17:18
perrito666jam1: exactly, but if you restore the log from the previous machine you are lying about the current one17:19
jam1perrito666: but it also contains the whole history of your actual environment17:19
jam1vs just this new thing that I just brought up17:19
jam1I would be fine moving the existing file to the side17:19
jam1but all the juicy history is what you are restoring17:19
jam1perrito666: did you test the backup stuff live against a Trusty bootstrap?17:19
jam1perrito666: nate's patch landed at r266217:20
perrito666jam1: sorry I was at the door17:27
perrito666I did, let me re-check that the env that is being back-up actually has the proper mongodb17:27
perrito666jam1: re your comment, I could try to assert MONGO* is executable or fail instead17:30
voidspacegoing jogging, back shortly17:39
jam1perrito666: I don't really think we need to spend many cycles worrying about it.17:41
jam1It may be that just using '-f' will give better failure modes (more obvious if we try to execute something that isn't executable than trying to run a command that isn't in $PATH)17:42
jam1perrito666: anyway, not a big deal, don't spend too much time on it, focus on getting it landed and on to restore17:42
perrito666yea, most likely if you have those and they are not executable you most likely noticed other problems17:42
* perrito666 repeats himself when he stops writing a sentence in the middle and then restarts17:44
jam1that is certainly a common thing17:44
perrito666well I did a version of restore that backs up the old config just so I get to discover what part of our backup restoration breaks the state server17:45
* perrito666 's kingdom for an aws location in south america17:49
=== vladk|offline is now known as vladk
voidspaceEOD folks18:32
voidspaceg'night18:32
perrito666bye18:32
wwitzel3voidspace: see ya18:32
stokachuis juju add-relation smart enough to handle add-relations to non-existent services that may be coming available in the future18:32
stokachufor example if I deploy 3 charms and charm 1 relies on charm 3 so i add the relation during charm 1 deployment18:32
stokachuis it smart enough to retry to add-relations once it sees charm 3 come online?18:33
stokachumarcoceppi: ^ curious if you know this?18:34
marcoceppistokachu: no18:35
stokachumarcoceppi: no to not smart enough or no to you aren't sure?18:35
marcoceppinot smart enough, if you run add-relation then it won't actually work if one of the two services isn't there18:35
stokachuso that makes it difficult for me to put juju deploy <charm>; juju add-relation <charm> <new_charm_not deployed>; juju deploy <new_charm>18:36
marcoceppistokachu: not difficult, impossible.18:37
marcoceppistokachu: you should run add-relation once you have all your services deployed18:37
stokachuso if i deploy an openstack cloud i'd have to deploy all charms, then re-loop through those charms and add-relations18:37
marcoceppistokachu: or, use juju deployer18:38
bloodearneststokachu: or better yet, deploy charms, mount volumes, then add relations, as many charms expect the volumes to be already configured on the joined hook18:39
stokachubloodearnest: interesting ill look into that19:01
bloodearneststokachu: on account of juju having no way yet to detect/react to volumes changing, AIUI19:02
stokachui wonder if it'd be worth it to have add-relations kept in a queue and when a service comes online it just checks for pending19:03
natefinchstokachu: note that you don't need to wait for the charms to be deployed to add relations. You can fire off deploy deploy deploy add-relation add-relation add-relation, and juju will eventually catch up.   It's just that you have to run the deploy command before the add-relation command19:07
stokachunatefinch: yea thats what im doing now19:07
stokachujust iterating through the charms twice is all19:07
natefinchstokachu: iterate through charms once and then through relations once ;)19:08
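The ordering rule above (every deploy command issued before any add-relation, with no need to wait for units to come up) can be sketched as a small command planner. The `relation` type and `plan` function are hypothetical; only `juju deploy` and `juju add-relation` come from the discussion:

```go
package main

import "fmt"

// relation names the two services to relate.
type relation struct{ a, b string }

// plan returns the juju commands in a safe order: all deploys first,
// then all add-relations. It rejects a relation that names a service
// which is not in the deploy list, since add-relation would fail.
func plan(services []string, relations []relation) ([]string, error) {
	known := make(map[string]bool, len(services))
	cmds := make([]string, 0, len(services)+len(relations))
	for _, s := range services {
		known[s] = true
		cmds = append(cmds, "juju deploy "+s)
	}
	for _, r := range relations {
		if !known[r.a] || !known[r.b] {
			return nil, fmt.Errorf("relation %s:%s names a service that is not deployed", r.a, r.b)
		}
		cmds = append(cmds, "juju add-relation "+r.a+" "+r.b)
	}
	return cmds, nil
}

func main() {
	cmds, err := plan(
		[]string{"mysql", "wordpress"},
		[]relation{{"wordpress", "mysql"}},
	)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

This is one pass over the services and one over the relations, rather than two passes over the charms.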
natefinchgotta run, car needs to be inspected, back in 45 mins19:08
=== natefinch is now known as natefinch-afk
=== natefinch-afk is now known as natefinch
sinzuiwwitzel3, natefinch CI cursed the most recent juju because of a unit-test failure on precise. Do either of you think the test can be tuned to be reliable on precise? https://bugs.launchpad.net/juju-core/+bug/131182519:35
_mup_Bug #1311825: test failure UniterSuite.TestUniterUpgradeConflicts <ci> <intermittent-failure> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1311825>19:35
natefinchsinzui: looking19:37
wwitzel3sinzui: also taking a look19:37
natefinchman I hate overly refactored tests19:40
natefinchwwitzel3: can you even tell what sub-test is failing?19:47
natefinchall I see is "step 8" which doesn't tell me diddly19:47
wwitzel3natefinch: not really, I've got as far as fixUpgradeError step19:55
wwitzel3natefinch: but it is all nested so I can't tell in which that is happening19:56
=== vladk is now known as vladk|offline
