=== 21WAAD3MF is now known as wallyworld | ||
perrito666 | does anyone know their way around state/open.go? | 00:52 |
---|---|---|
wallyworld | perrito666: it depends on what you want to know, i might be able to help | 01:24 |
stokachu | cmars: hah gues you figured i based my plugin off yours :) | 01:29 |
wallyworld | axw: mornin'. can we have a hangout now instead of in an hour? | 01:29 |
stokachu | was gonna give you credit once i got something working | 01:29 |
axw | morning wallyworld. sure thing, just give me a moment | 01:29 |
axw | wallyworld: erm, my sound isn't working. gotta fix that first... | 01:31 |
wallyworld | ok | 01:31 |
perrito666 | wallyworld: tx, sadly my head is falling on the kb so I better hit the bed before I introduce a bug instead of fixing the current one | 01:36 |
wallyworld | perrito666: np, i'm on call anyway now. if you have a question, feel free to email to the list or ask again later | 01:37 |
perrito666 | wallyworld: ok, I am more curious about fixing this bug than about going to sleep :p so here I go | 01:46 |
perrito666 | I am trying to fix the restore functionality | 01:47 |
perrito666 | now, at some point the restore calls state.Open(); I tried to replace it with juju.NewConn and NewConnFromName, and in all cases it times out at mgo.DialWithInfo while trying to Ping() | 01:48 |
wallyworld | perrito666: ok. there may also be someone else looking into that from juju-core | 01:49 |
perrito666 | "that" being? | 01:49 |
wallyworld | i think Horacio Durán | 01:50 |
perrito666 | sadly that would be me | 01:51 |
wallyworld | he's started to fix some of the backup bugs and was also going to look at restore | 01:51 |
wallyworld | oh | 01:51 |
wallyworld | hi | 01:51 |
perrito666 | hi | 01:51 |
wallyworld | i didn't realise! | 01:51 |
wallyworld | perrito666: give me a couple of minutes to finish this call | 01:52 |
perrito666 | sure | 01:52 |
wallyworld | perrito666: sorry, back now | 02:00 |
wallyworld | i'm not across the restore stuff specifically | 02:00 |
perrito666 | wallyworld: I think the restore part of my explanation can be safely ignored | 02:01 |
axw | wallyworld: gotta go to the shops for a little while, bbs | 02:01 |
perrito666 | I just provided it for context | 02:01 |
wallyworld | axw: sure, np | 02:01 |
wallyworld | perrito666: so you are looking to, in general, replace calls to state.Open() with juju.NewConn ? | 02:02 |
wallyworld | to use the api | 02:02 |
wallyworld | so you definitely have a state server running? | 02:03 |
wallyworld | api server even | 02:03 |
perrito666 | wallyworld: well I am pretty sure I do, I try to query mongo by hand and it responds, yet when juju tries to dial it just times out | 02:05 |
wallyworld | mongo != api server though | 02:05 |
wallyworld | the api server listens on port 17070 | 02:06 |
perrito666 | true, although I am pretty sure this breaks before getting to state | 02:06 |
wallyworld | what code are you changing? | 02:06 |
perrito666 | well, current existing code calls open, open in turn calls DialWithInfo | 02:07 |
wallyworld | which file? | 02:07 |
perrito666 | DialWithInfo creates a session | 02:08 |
perrito666 | ah sorry | 02:08 |
perrito666 | state/open.go | 02:08 |
wallyworld | sure, but the caller to that | 02:08 |
wallyworld | which caller of state.Open() is being replaced? | 02:08 |
perrito666 | cmd/plugins/juju-restore/restore.go | 02:08 |
perrito666 | around :187 | 02:09 |
wallyworld | so at the time restore runs, is there a bootstrap node running? | 02:09 |
wallyworld | i don't think there is | 02:09 |
wallyworld | ah there may be | 02:10 |
wallyworld | cause looks like it calls rebootstrap() | 02:10 |
perrito666 | there is | 02:10 |
wallyworld | but you might find that it is just that the api server has not started yet | 02:10 |
wallyworld | cause it can take a while to spin up the bootstrap node and then start the services | 02:11 |
wallyworld | maybe to see if that's the issue, pause the restore script or add in a big attempt loop to see if it just needs more time | 02:12 |
perrito666 | wallyworld: mm I tried looping on that | 02:12 |
perrito666 | I waited 30 mins total | 02:12 |
perrito666 | that is a lot | 02:12 |
wallyworld | can you do a juju status when it fails? | 02:12 |
wallyworld | ie does juju status work? | 02:12 |
wallyworld | that would need an api server connection | 02:12 |
perrito666 | mm, it does not | 02:13 |
wallyworld | so if juju status is broken also, then there's an issue with the bootstrap node | 02:13 |
wallyworld | you would need to ssh in and look at the log file | 02:13 |
wallyworld | cause it could be the node itself starts but then the juju services fail to start | 02:14 |
perrito666 | mm, the service seems to be running, I even restarted it by hand | 02:15 |
perrito666 | in what port should the state server be listening? | 02:15 |
stokachu | 37017 | 02:15 |
wallyworld | 17070 | 02:15 |
wallyworld | 37017 is ongo | 02:15 |
wallyworld | mongo | 02:15 |
wallyworld | perrito666: when you say you restarted the state service by hand, that doesn't make sense to me because the state service runs inside the machine agent - did you start jujud? | 02:16 |
perrito666 | wallyworld: yes | 02:17 |
wallyworld | and the machine log file is good? | 02:17 |
wallyworld | and yet juju status fails also | 02:18 |
wallyworld | there's gotta be something logged which shows the problem | 02:18 |
wallyworld | until something like juju status is happy, then the code changes to restore.go won't work either | 02:18 |
perrito666 | wallyworld: interesting though, restore is trying to open a state server on 37017 | 02:19 |
wallyworld | the current restore using state.open()? | 02:20 |
wallyworld | it will because it connects straight to mongo | 02:20 |
wallyworld | the new juju.NewConn() methods instead go via the api server on port 17070 | 02:20 |
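A plain TCP probe is enough to tell the two endpoints apart (mongo on 37017, the API server on 17070). The sketch below uses only the Go standard library; the state-server address is purely illustrative:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probe reports whether a TCP connection to addr can be established.
func probe(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	host := "10.150.60.153" // illustrative state-server address
	fmt.Println("mongo (37017):", probe(net.JoinHostPort(host, "37017")))
	fmt.Println("API server (17070):", probe(net.JoinHostPort(host, "17070")))
}
```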
perrito666 | aghh, juju.NewConn fails just as Open, so something is definitely broken in my recently restored node | 02:21 |
stokachu | wallyworld: is that in trunk yet? | 02:22 |
stokachu | my logs show NewConnFromName accessing mongo directly on 37017 | 02:22 |
wallyworld | stokachu: the api server stuff? | 02:22 |
stokachu | yea | 02:22 |
wallyworld | yes, been there since 1.16 | 02:22 |
wallyworld | used universally since 1.18 | 02:22 |
wallyworld | perrito666: i'd be surprised and sad if the log files on that node didn't show what was wrong | 02:23 |
* perrito666 runs the extremely tedious setup script | 02:24 | |
wallyworld | perrito666: it will still be waiting for you tomorrow after you get some sleep :-) | 02:25 |
perrito666 | wallyworld: certainly but now it's personal | 02:26 |
wallyworld | lol | 02:26 |
wallyworld | feel free to pastebin logs files if you want some more eyes | 02:26 |
* perrito666 paints canonical logo on his face and yells mel gibson style | 02:26 | |
stokachu | woot i actually a juju plugin to do something in go | 02:26 |
perrito666 | stokachu: I sense a verb missing there :p | 02:28 |
wallyworld | would have been funnier if you said "i a missing verb there" :-) | 02:29 |
stokachu | hah | 02:29 |
axw | back... | 02:29 |
stokachu | too much time looking at juju core code | 02:29 |
waigani | wallyworld: axw: I'm here for standup | 02:30 |
perrito666 | wallyworld: my wife is watching tv in spanish next to me; when the 2-language module is enabled in my head I lose the capacity for witty sentences in both languages | 02:30 |
wallyworld | waigani: huh? i thought you were on holidays so we had it early :-) | 02:30 |
axw | waigani: we already had it early, weren't expecting you | 02:30 |
wallyworld | but we can have another | 02:30 |
waigani | :( | 02:30 |
waigani | I'm in auk airport | 02:31 |
waigani | okay, maybe I can talk through what I'm doing? | 02:31 |
wallyworld | waigani: sure, i'm in the hangout | 02:31 |
axw | brt | 02:31 |
perrito666 | wallyworld: https://pastebin.canonical.com/108967/ | 02:41 |
wallyworld | perrito666: looking, sorry was otp | 02:44 |
perrito666 | on the same note https://pastebin.canonical.com/108968/ | 02:44 |
wallyworld | perrito666: is there any more in machine-0.log? | 02:47 |
perrito666 | wallyworld: well, there is before that although I am not sure if I can distinguish between pre/post restore (restore is a particularly ugly thing) | 02:52 |
wallyworld | perrito666: what i mean is, after the output you logged. that log looks ok i think. there was one timeout with the api client connecting but that can happen, and it appeared to be ok after that, but i wanted to be sure by looking at subsequent logging | 02:53 |
perrito666 | nope, after that it just loops with https://pastebin.canonical.com/108969/ | 02:55 |
wallyworld | hmmm, ok. so that says there is an issue with the api server | 02:56 |
wallyworld | you may need to enable trace level logging and/or add extra logging to see why it's failing. i wonder if netstat shows the port as open | 02:57 |
perrito666 | tcp 0 1 10.140.171.13:59925 10.150.60.153:17070 SYN_SENT 4001/jujud | 02:57 |
wallyworld | that's a different ip address to what is being dialled | 02:58 |
wallyworld | oh no | 02:58 |
wallyworld | it's not | 02:58 |
perrito666 | nope, just without the dns name | 02:59 |
wallyworld | yeah | 02:59 |
wallyworld | if it were me, i'd have to add lots of extra debug logging at this point to see what's happening as i'm out of ideas | 02:59 |
wallyworld | but you can see even internally the machine agent api client can't start | 03:00 |
wallyworld | so there's a core issue with starting the api server itself | 03:00 |
wallyworld | axw: local provider is sorta ok. it doesn't like starting precise containers on trusty although it used to. and if i start a precise container first and it fails, subsequent trusty containers also fail, but starting a trusty container first works | 03:01 |
perrito666 | wallyworld: well, I think the restore step is actually breaking the state api server | 03:01 |
perrito666 | since it works right before | 03:01 |
wallyworld | likely | 03:01 |
perrito666 | (restore bootstraps a machine and then untars the backup on top of it) | 03:01 |
wallyworld | roger wrote all that so i have no insight off the top of my head as to what might be wrong | 03:01 |
axw | wallyworld: ah ok. there have been a few bugs flying around about host vs. container series mismatch not working | 03:02 |
wallyworld | axw: yeah, i'm going to try explicitly setting default series to see if i can get precise to work. but precise failing should not also then kill trusty :-( | 03:03 |
perrito666 | wallyworld: I think there might be something wrong with the backup, tomorrow I will strip one into pieces and see what is wrong, as for me I am now officially out or tomorrow I will be sleeping on the kn at the standup | 03:04 |
perrito666 | kb* | 03:04 |
wallyworld | np, good night :-) | 03:04 |
axw | wallyworld: oh I didn't see that bit... weird | 03:04 |
wallyworld | yeah | 03:04 |
axw | wallyworld: I think you can also bootstrap --series=trusty,precise to get it to work | 03:04 |
axw | not sure why trying precise would fail trusty tho | 03:05 |
wallyworld | ta, will try that also to try and get a handle on it | 03:05 |
* wallyworld -> food | 03:05 | |
=== wallyworld_ is now known as wallyworld | ||
axw | wallyworld: I just pasted the output I see from destroy-environment with manual | 03:43 |
axw | wallyworld: it's as I expected | 03:43 |
wallyworld | axw: i missed it as my laptop got disconnected | 03:43 |
axw | wallyworld: I mean I pasted it in the bug | 03:43 |
wallyworld | ah, looking | 03:43 |
axw | #1306357 | 03:43 |
_mup_ | Bug #1306357: destroy environment fails for manual provider <destroy-environment> <manual-provider> <juju-core:Incomplete> <https://launchpad.net/bugs/1306357> | 03:43 |
wallyworld | axw: clearly then i need to get my eyes tested as i had thought i included it all, sorry :-( | 03:45 |
wallyworld | although i wish the last error was first | 03:45 |
axw | wallyworld: nps. it does kinda get lost down there... | 03:45 |
wallyworld | as it would read much nicer that way | 03:45 |
wallyworld | ie root cause, followed by option to fix | 03:46 |
=== vladk|offline is now known as vladk | ||
axw | wallyworld: I'm going to look at fixing these openstack tests. If you do have any spare time, it would still be useful if you could review the placement CL | 04:06 |
axw | but if you're busy then that's okay | 04:06 |
wallyworld | axw: funny you should mention that - just finished another review and am looking right now | 04:07 |
axw | wallyworld: cool :) | 04:07 |
wallyworld | axw: this is a personal view, but i tend to think that if a method returning a (value, error) returns a err != nil, then the value should be considered invalid. so this bit irks me: | 04:17 |
wallyworld | if c.Placement != nil && err == instance.ErrPlacementScopeMissing { | 04:17 |
wallyworld | i would use an out of band signal like a bool or something | 04:17 |
axw | wallyworld: err was originally nil, that was something william wanted | 04:17 |
axw | I suppose I could change it to return a nil placement, and have the caller construct one | 04:18 |
wallyworld | hmmm. is there value in adding a bool to the return values | 04:18 |
wallyworld | or something | 04:18 |
axw | I don't really think so, then you may as well just check if the placement has a non-empty scope | 04:19 |
wallyworld | i sorta think that err != nil meaning the value is bad is kinda idiomatic Go | 04:19 |
axw | yeah... probably should have just left it as it was | 04:20 |
wallyworld | change it since he isn't here :-) | 04:20 |
axw | wallyworld: I think I will just change it to return a nil Placement, and then the caller will create a Placement with empty scope and the input string as the directive field | 04:22 |
wallyworld | ok | 04:22 |
wallyworld | i think that sounds good | 04:22 |
axw | the caller needs to know the rule anyway, at least this way it's the usual case of nil value iff error | 04:22 |
wallyworld | sorta best of both worlds | 04:22 |
wallyworld | ta | 04:23 |
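A hedged sketch of the convention the two settle on here: when err != nil the returned value is invalid, so the missing-scope case returns a nil Placement and the caller builds one itself. The types below are simplified stand-ins for juju-core's instance package, not the real implementation:

```go
package placement

import (
	"errors"
	"strings"
)

// Simplified stand-ins for instance.Placement and instance.ErrPlacementScopeMissing.
type Placement struct {
	Scope     string
	Directive string
}

var ErrPlacementScopeMissing = errors.New("placement scope missing")

// parsePlacement returns a nil Placement whenever err != nil, including the
// missing-scope case, so callers never use a value alongside an error.
func parsePlacement(s string) (*Placement, error) {
	i := strings.IndexRune(s, ':')
	if i < 0 {
		return nil, ErrPlacementScopeMissing
	}
	return &Placement{Scope: s[:i], Directive: s[i+1:]}, nil
}

// placementFromArg knows the rule: on ErrPlacementScopeMissing it builds a
// Placement with an empty scope and the whole input as the directive.
func placementFromArg(arg string) (*Placement, error) {
	p, err := parsePlacement(arg)
	if err == ErrPlacementScopeMissing {
		return &Placement{Directive: arg}, nil
	}
	return p, err
}
```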
wallyworld | axw: with these lines in addmachine | 04:28 |
wallyworld | if params.IsCodeNotImplemented(err) { | 04:29 |
wallyworld | 04:29 | |
wallyworld | 135 if c.Placement != nil { | 04:29 |
wallyworld | is there any point trying again if c.Placement is nil? | 04:29 |
wallyworld | should it just be a single if ... && ... ? | 04:29 |
axw | wallyworld: yes we should try again, because we're calling a new API method | 04:29 |
axw | wallyworld: client.AddMachines now calls a new API method by default | 04:30 |
axw | wallyworld: and client.AddMachines1dot18 calls the old one | 04:30 |
wallyworld | oh, right. hadn't got to that bit yet, i recalled it was the same api from the earlier review | 04:30 |
axw | it was, I fixed it :) | 04:30 |
wallyworld | but i guess versioning | 04:30 |
wallyworld | wish we had it | 04:30 |
axw | indeed | 04:30 |
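A generic sketch of the fallback pattern being reviewed here: call the new API method first and, if the server reports CodeNotImplemented, retry via the 1.18-era call, refusing to retry when a placement directive was requested since the old API cannot express it. The closures stand in for client.AddMachines and client.AddMachines1dot18, and the predicate for params.IsCodeNotImplemented; this is an illustration, not the juju-core code:

```go
package addmachine

import "errors"

// callWithFallback tries the new API call, and only when the server says it is
// not implemented falls back to the 1.18-era call.
func callWithFallback(newCall, oldCall func() error, isNotImplemented func(error) bool, hasPlacement bool) error {
	err := newCall()
	if isNotImplemented(err) {
		if hasPlacement {
			// The old API cannot express placement directives.
			return errors.New("placement is not supported by this API server")
		}
		err = oldCall()
	}
	return err
}
```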
stokachu | do i have to invoke "scp" with the ssh.Copy function in utils/ssh? | 04:32 |
axw | stokachu: the openssh client impl will delegate to scp, if that's what you're asking | 04:34 |
stokachu | https://github.com/battlemidget/juju-sos/blob/master/main.go#L89-L94 | 04:34 |
stokachu | so im trying to replicate juju scp within my plugin | 04:34 |
stokachu | this is my log output : http://paste.ubuntu.com/7312090/ | 04:35 |
stokachu | i think my actual copyStr is incorrect as i was following was is required by juju scp | 04:35 |
* axw looks | 04:35 | |
stokachu | what is* | 04:35 |
axw | stokachu: I think you want the target and source in separate args | 04:36 |
stokachu | im a newb with golang as well so if i got stupid stuff in there | 04:36 |
stokachu | lemme try that | 04:37 |
axw | stokachu: i.e. a length-2 slice | 04:37 |
stokachu | ok lemme see if i can make that happen | 04:37 |
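The fix axw suggests, sketched under the assumption that utils/ssh.Copy takes a slice of arguments plus an options value (the exact signature may differ): source and destination go in as separate slice elements rather than one space-joined string. The host and path are illustrative:

```go
package sos

import (
	"launchpad.net/juju-core/utils/ssh"
)

// copySosreport pulls the report back with two separate arguments, source and
// destination, instead of one space-joined "src dst" string. The Copy
// signature assumed here should be checked against utils/ssh.
func copySosreport(host string) error {
	args := []string{
		"ubuntu@" + host + ":/tmp/sosreport-*.xz", // remote source (illustrative path)
		".",                                       // local destination
	}
	return ssh.Copy(args, nil)
}
```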
wallyworld | axw: is there a reason why we store placement as a string and not a parsed object, and hence precheck takes a string and not a parsed struct etc.? i would normally look to parse on the way in and then pass around the parsed struct so we fail as close to the system boundary as possible. am i missing a design decision? | 04:38 |
stokachu | sweet, gotten farther http://paste.ubuntu.com/7312102/ | 04:39 |
axw | wallyworld: originally I did that, william wanted it changed. it should not get to the environment if the scope doesn't match | 04:39 |
stokachu | though maybe i should be using the instance.SelectPublicAddress of machine? | 04:39 |
wallyworld | axw: hmmmm. ok. i disagree with william here then :-( | 04:39 |
axw | stokachu: cool. ahh, "juju scp" does the magic of converting machine IDs to addresses | 04:40 |
axw | wallyworld: why? the environment should not need the scope | 04:40 |
stokachu | ive got a execssh that i borrowed from someone that uses instance.selectpublicaddress | 04:40 |
stokachu | going to try that | 04:40 |
wallyworld | axw: what i mean is that the string should be parsed into whatever internal representation makes sense at the system boundary ie a struct of some sort, possibly different to what is used on the client ie minus the scope | 04:41 |
axw | stokachu: see juju-core/cmd/juju/scp.go, hostFromTarget -- that's where it maps machine IDs to addresses | 04:41 |
wallyworld | and internal apis should then use that typed struct | 04:41 |
stokachu | axw: ahh i see that now | 04:42 |
wallyworld | not an "untyped" string | 04:42 |
wallyworld | but, doesn't matter, it's already been changed to get approval | 04:42 |
stokachu | too bad expandArgs isn't public | 04:42 |
axw | wallyworld: the directive string is free-form, so how are you going to do that? | 04:42 |
axw | wallyworld: it's up to the provider to decide what makes sense in directives | 04:43 |
wallyworld | axw: ah bollocks, i was thinking there was more to it than just a string. but you are saying that by the time it's stored, it represents a maas name or whatever | 04:43 |
wallyworld | that makes more sense. i hadn't fully re-groked the implementation | 04:44 |
axw | wallyworld: as far as the infrastructure is concerned, it's an opaque blob of bytes. the provider will interpret it. provider/maas will interpret it as maas-name to start with | 04:44 |
wallyworld | ok | 04:45 |
axw | we may converge on some convention, like thing=value | 04:45 |
axw | az=uswest-1 or whatever | 04:45 |
axw | stokachu: it's also worth noting that some providers (e.g. azure) require proxying through machine 0 | 04:46 |
axw | stokachu: so you may want to just shell out to "juju scp" if you can... | 04:46 |
stokachu | axw: ah good point | 04:47 |
stokachu | cleaner than what im doing | 04:47 |
stokachu | is there a shell function in juju-core thats exposed? | 04:47 |
stokachu | or should i just use os.Exec | 04:47 |
axw | stokachu: os/exec is as good as anything | 04:48 |
stokachu | axw: good deal | 04:48 |
stokachu | ill do that instead | 04:48 |
axw | there are some utils in juju, but I don't think they'd be useful | 04:48 |
stokachu | cool no worries | 04:48 |
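A minimal sketch of shelling out to "juju scp" with os/exec, as suggested above; the machine number and path are illustrative, and flags meant for scp itself are placed after "--" so juju does not try to interpret them (the trick worked out later in this log):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Delegate to "juju scp" so provider quirks (e.g. azure proxying through
	// machine 0) are handled for us.
	cmd := exec.Command("juju", "scp", "--", "-r", "1:/tmp/sosreport-*.xz", ".")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("juju scp failed: %v", err)
	}
}
```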
wallyworld | axw: yeah, i'm a fan of a little more structure. but none the less, land that f*cker | 04:48 |
jam | hazmat: fwiw the first line that api-endpoints returns is the one that we last connected to, so if you just do "head -n1" you can get the same output we used to give | 04:49 |
axw | wallyworld: thanks | 04:50 |
wallyworld | np. sorry if i went over old ground | 04:50 |
axw | nope, that's cool | 04:50 |
wallyworld | jam: i was going to get your opinion on that bug - i'd like to close now as "invalid" or whatever given the other fix has landed | 04:51 |
jam | wallyworld: sorry, which bug? | 04:51 |
wallyworld | jam: the one you just remarked on above | 04:51 |
wallyworld | bug 1311227 | 04:51 |
_mup_ | Bug #1311227: juju api-endpoints cli regression on trunk/1.19 <api> <regression> <juju-core:Triaged> <https://launchpad.net/bugs/1311227> | 04:51 |
jam | wallyworld: localhost shouldn't be in the output | 04:52 |
jam | and I would be fine pruning ipv6 by default | 04:52 |
wallyworld | jam: it can be for local provider since localhost is the public address for local provider | 04:53 |
wallyworld | jam: martin's branch does prune ip6 by default | 04:53 |
jam | wallyworld: sure, I'm not saying don't print localhost when that's the address, but *don't* print localhost for ec2 | 04:53 |
axw | we shouldn't have localhost for ec2, but we would have 127.0.0.1 and that'll get pruned | 04:54 |
wallyworld | jam: martin's branch probably ensures that's the case, since for ec2 localhost is machinelocal isn't it? | 04:54 |
jam | wallyworld: hmmm... I don't know that Martin's patch is *quite* right. I'd rather still cache IPv6, but just not display them on api-endpoints | 04:54 |
axw | we don't use any scope heuristics for hostnames | 04:54 |
jam | wallyworld: right, I think his patch is what we want, and we do want to be caching the network scope data instead of just addrs | 04:54 |
wallyworld | jam: it's ok for now i think since we don't need/use ip6 yet | 04:54 |
wallyworld | jam: so, i think then that kapil's bug has 2 bits: 1. the ip6/127.0.0.1 stuff, which martin's branch fixes, and 2. the multiple api address thing, which is new and intended | 04:55 |
wallyworld | so therefore we can mark the bug as invalid | 04:56 |
wallyworld | right ? | 04:56 |
jam | wallyworld: so I still think there are bits that we can evolve on api-endpoints. Namely, to change what we cache from just addrs to being the full HostPort content (which includes network scope), and then api-endpoints can grow flags to do --network-scope=public | 04:57 |
jam | wallyworld: so while I think we've addressed the regression today | 04:57 |
jam | I don't think the bug is "just closed" | 04:57 |
wallyworld | sure, but that's not the bug as described | 04:57 |
wallyworld | we can get it off 1.19.1 at least | 04:58 |
jam | wallyworld: right, i think the *regression* portion is stuff that we intend (multiple addresses, even per server), because we think they might be routable | 04:58 |
jam | and we don't save enough information (yet) to be able to provide --network-scope | 04:58 |
wallyworld | yep, i don't see any regression at all | 04:58 |
jam | (and then default it to public) | 04:58 |
jam | wallyworld: giving private addresses in api-endpoints by default is wrong | 04:59 |
jam | but "good enough" for now. | 04:59 |
jam | And hazmat has a point about actually grouping the data by server, so you have a feeling for what machine is a fallback | 04:59 |
wallyworld | ok, so let's retarget off 1.19.1 then | 04:59 |
jam | SGTM | 05:00 |
wallyworld | jam: 2.0 or 1.20? | 05:00 |
wallyworld | 2.0 i guess? | 05:00 |
jam | I'd be ok with 2.0 | 05:03 |
waigani | axw: when I use restore with patchValue I get this error: http://pastebin.ubuntu.com/7312196/ | 05:04 |
stokachu | so heres my latest change using juju scp https://github.com/battlemidget/juju-sos/blob/master/main.go#L89-L96 | 05:05 |
stokachu | and the error output http://paste.ubuntu.com/7312200/ | 05:05 |
stokachu | i verified that juju ssh 1 and /tmp/sosreport*xz exists on the machine | 05:05 |
waigani | anyway, I need to go catch a plane | 05:07 |
axw | waigani: sorry, need more context. show me in vegas :) | 05:08 |
stokachu | axw: -r doesn't work with machine num it seems | 05:09 |
stokachu | juju scp 1:/tmp/test . works | 05:09 |
stokachu | but juju scp -r 1:/tmp/test* . fails | 05:09 |
axw | stokachu: you need to separate the command out into individual args | 05:09 |
axw | stokachu: i.e. "juju", "scp", ... | 05:09 |
stokachu | this is manually running the command from the shell | 05:09 |
axw | stokachu: there are some limitations with juju scp, I forget exactly how to pass extra args... lemme see | 05:10 |
stokachu | http://paste.ubuntu.com/7312211/ | 05:10 |
stokachu | thats what ive tested manually | 05:10 |
axw | stokachu: stick "--" before -r | 05:13 |
stokachu | axw: you da man | 05:14 |
jam | axw: is that juju 1.16? as 1.18 is a bit broken wrt scp | 05:14 |
jam | stokachu: in 1.18 (for a while until it gets fixed) args for just scp must come at the end and be grouped | 05:14 |
axw | jam: well I'm on trunk... I forget which versions do what wrt scp | 05:15 |
jam | so: juju scp 1:foo 2:bar "-r -o SSH SpecialSauc" | 05:15 |
axw | jam: what I just described does work on trunk, so presumably on 1.18 too? | 05:15 |
stokachu | ah | 05:15 |
axw | jam: i.e. I just tested "juju scp -- -r 0:/tmp/foo /tmp/bar" | 05:15 |
jam | axw: https://bugs.launchpad.net/juju-core/+bug/1306208 was fixed in 1.18.1 I guess | 05:16 |
_mup_ | Bug #1306208: juju scp no longer allows multiple extra arguments to pass throug <regression> <juju-core:Fix Released by jameinel> <juju-core 1.18:Fix Released by jameinel> <juju-core (Ubuntu):Fix Released> <juju-core (Ubuntu Trusty):Fix Released> <https://launchpad.net/bugs/1306208> | 05:16 |
jam | axw: trunk just lets you pass everything, and you shouldn't need "--" I thought | 05:16 |
axw | you do need --, otherwise juju tries to interpret the args | 05:16 |
jam | axw: fairy nuff | 05:17 |
stokachu | yea i had to use -- with 1.18.1-trusty | 05:20 |
stokachu | axw: that worked :D:D | 05:20 |
axw | stokachu: cool :) | 05:21 |
vladk | jam: morning | 05:29 |
jam | morning vladk, its early for you, isn't it ? | 05:30 |
jam | well, early for you to be on IRC :) | 05:30 |
fwereade | good mornings | 05:42 |
waigani | fwereade: morning :) | 05:50 |
jam | morning fwereade, we've missed you | 05:53 |
fwereade | waigani, jam: it's nice to be back :) | 05:54 |
waigani | heh, easter holiday? | 05:54 |
jam | brb | 05:55 |
axw | hey fwereade | 05:57 |
axw | fwereade: I was about to approve https://codereview.appspot.com/85040046 (placement directives) - do you want another look first? | 05:58 |
fwereade | axw, I'll cast a quick eye over it :) | 05:58 |
axw | okey dokey | 05:58 |
fwereade | axw, ok, based on a quick read of your responses I think I'm fine -- my only question is exactly what happens with the internal API change as we upgrade | 06:01 |
axw | fwereade: the provisioner will be unhappy until it has upgraded | 06:02 |
fwereade | axw, I *think* that it's fine, given that the environment provisioner only runs on the leader state server, and therefore the upgrade happens in lockstep | 06:02 |
fwereade | axw, but other provisioners? | 06:02 |
fwereade | axw, hm, I have a little bit of a concern about error messages during upgrade | 06:02 |
axw | fwereade: it will be the same for the container provisioners, I think | 06:02 |
jam | back | 06:02 |
* axw checks | 06:02 | |
fwereade | axw, *we* might know they're fine | 06:02 |
fwereade | axw, but people who read our logs don't get quite such a sunny prospect of our general competence | 06:03 |
jam | axw: so we talked about having EnsureAvailability with a value of say 0 just preserve the existing desired num of servers | 06:03 |
jam | AFAICT, we never *record* the desired number of servers | 06:03 |
jam | we just have a number of things that are running. | 06:03 |
axw | jam: it's implied by what's in stateServerInfo | 06:04 |
jam | and we have stuff like WantsVote() but I can't see anywhere that sets NoVote=true to indicate that we no longer want to be voting. | 06:04 |
axw | jam: len(VotingStateMachineIds) | 06:04 |
axw | jam: that's done in EnsureAvailability, in state/addmachine.go | 06:04 |
jam | axw: sure, but isn't that the actual ones that are voting? I guess it would be an availability check? | 06:04 |
fwereade | axw, this must ofc be balanced against the hassle of maintaining the multiple code paths | 06:04 |
axw | jam: VotingMachineIds is really the ones that *want* to vote | 06:05 |
axw | fwereade: just checking still, sorry | 06:05 |
fwereade | axw, np | 06:05 |
fwereade | axw, what I did with the unit agent the other day was just to leave it blocking until the state server it's connected to *does* understand the message, and then continue as usual | 06:06 |
axw | fwereade: yeah, this is common to all provisioners - it will cause an error on upgrade for container provisioners | 06:06 |
axw | hmm ok | 06:06 |
axw | I'll take a look at that code | 06:06 |
axw | fwereade: worker/uniter? | 06:06 |
fwereade | axw, it's not the best code in the world but it seemed to work | 06:06 |
fwereade | just a sec yeah somewhere there | 06:06 |
axw | fwereade: got it I think | 06:07 |
axw | logger.Infof("waiting for state server to be upgraded") | 06:07 |
axw | yeah okay, I can add that in | 06:07 |
fwereade | axw, cool | 06:07 |
* axw senses another need for API versioning imminently | 06:08 | |
axw | although I suppose we can just see that fields are zero values... | 06:08 |
axw | fwereade: yuck, this means threading the tomb all the way through... oh well. | 06:09 |
axw | I suppose it's for the best | 06:09 |
* fwereade glances pointedly at jam re API versioning | 06:09 | |
* jam ducks and pretends to catch a plane | 06:09 | |
* fwereade does understand | 06:10 | |
jam | fwereade: I made sure it was in the topics list | 06:10 |
fwereade | jam, great, thanks :) | 06:10 |
axw | jam: sorry, back to ensure-ha: if you just send 0 or -1 to state.EnsureAvailability, then it can load st.StateServerInfo() and set numStateServers=len(VotingMachineIds) | 06:10 |
jam | axw: I'm going to use 0, because it isn't otherwise valid, and we don't have to worry about negative numbers. | 06:12 |
axw | sounds good | 06:12 |
jam | axw: I was thinking to do that originally, but trying to verify the actual meaning of the various values was ... tricky | 06:12 |
axw | oh I don't have to thread the tomb, hooray | 06:12 |
axw | jam: it's not super clear, I agree | 06:13 |
jam | axw: I was reading through the code and trying to figure out what the actual invariants are | 06:13 |
jam | axw: I was really surprised that ensureAvailabilityIntentions doesn't take into account the new request | 06:13 |
jam | so we end up with 2 passes at it | 06:13 |
jam | also, the WantsVote vs HasVote split is confusing. Probably necessary, but very confusing | 06:14 |
axw | jam: yeah, we need to know what the existing ones want to do | 06:14 |
axw | jam: we certainly could do with some developer docs on this | 06:15 |
axw | I don't understand what the peergrouper does, haven't looked at it at all | 06:15 |
axw | I know what EnsureAvailability does, but it's easy to forget :) | 06:16 |
jam | axw: one advantage of "-1" is that it is odd :) | 06:16 |
axw | heh | 06:17 |
jam | axw: I took out the <= 0 and it still failed, and had to remember 0 is even | 06:17 |
jam | axw: non-negative or nonnegative ? | 06:20 |
jam | our error message currently says >0 | 06:21 |
jam | and "greater than or equal to 0" is long | 06:21 |
axw | jam: non-negative looks good to me | 06:21 |
jam | though non-math people won't get non-negative, I guess | 06:21 |
axw | really? | 06:21 |
jam | number of state servers must be odd >= 0 | 06:21 |
jam | number of state servers must be odd and >= 0 | 06:21 |
jam | ? | 06:21 |
axw | will non-math people understand >= ? ;) sure, I guess so | 06:22 |
jam | axw: non-engineering/scientists sort of people don't distinguish "positive" from "nonnegative" | 06:22 |
jam | axw: I can't even say "must not be even"... -1 for clarity :) | 06:23 |
jam | only not | 06:23 |
axw | hehe | 06:23 |
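A hedged sketch of the argument handling just agreed on, not the actual state.EnsureAvailability code: 0 means keep the current count (implied by the number of machines that want to vote), and any other value must be a positive odd number:

```go
package ensureavailability

import "errors"

// desiredStateServers resolves the requested number of state servers. A
// request of 0 preserves the current count, e.g. len(VotingMachineIds) from
// the state server info; anything else must be a positive odd number.
func desiredStateServers(requested, currentVoting int) (int, error) {
	switch {
	case requested < 0 || requested%2 == 0 && requested != 0:
		return 0, errors.New("number of state servers must be odd and non-negative")
	case requested == 0:
		return currentVoting, nil
	default:
		return requested, nil
	}
}
```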
axw | fwereade: updated https://codereview.appspot.com/85040046/patch/120001/130035 | 06:44 |
jam | axw: updated "juju ensure-availability" defaults 3 https://codereview.appspot.com/90160044 | 06:58 |
axw | jam: looking | 06:58 |
jam | axw: note that I merged my default-series branch in there | 07:03 |
jam | to get the test cases right | 07:03 |
jam | but that didn't end up landing in the mean time | 07:03 |
axw | ok | 07:03 |
jam | so there is a bit of diff that should be ignored, but you can't really add a prereq after the fact | 07:03 |
axw | jam: reviewed | 07:12 |
axw | jam, wallyworld: review for a goose fix please https://codereview.appspot.com/90540043 | 07:20 |
jam | looking | 07:21 |
jam | axw: lgtm | 07:22 |
axw | ta | 07:22 |
axw | fwereade: am I okay to land that branch, or are you still looking? | 07:24 |
* axw takes silence as acquiescence | 07:30 | |
fwereade | axw, sorry, yes, it looks fine :) | 07:41 |
axw | cool | 07:42 |
axw | jam: is the bot awake? | 07:46 |
jam | axw: checking | 07:47 |
jam | axw: it is currently running on addmachine-placement | 07:47 |
jam | perhaps there was a queue? | 07:47 |
jam | it's been going for 14 min | 07:47 |
axw | okey dokey, thanks | 07:47 |
axw | I thought my goose one would go through first | 07:47 |
jam | axw: I don't think there is relative ordering, and the bot only runs one at a time based on what it finds when it wakes up every minute | 07:48 |
jam | so if you approve both, but it hasn't seen it | 07:48 |
jam | then it will wake up, get the list, and start on one | 07:48 |
axw | ok | 07:48 |
axw | wheee, placement is in | 07:53 |
* axw does the maas bits | 07:53 | |
* fwereade bbiab | 07:56 | |
axw | jam: the bot does do goose MPs, right? | 08:29 |
mgz | axw: it does | 08:30 |
mgz | wallyworld: thanks for landing my branch | 08:30 |
wallyworld | mgz: np, pleased to help | 08:30 |
wallyworld | i also tested with local provider just in case | 08:30 |
voidspace | morning all | 08:37 |
jam1 | morning voidspace | 08:40 |
jam1 | axw: so the bot has "landed" your code, but the branch isn't a proper checkout, so it didn't get pushed back to LP | 08:43 |
jam1 | I'll fix it | 08:43 |
axw | doh | 08:43 |
axw | jam1: thanks | 08:43 |
jam1 | axw: should be merged now | 08:45 |
mgz | right, time to get a train to a plane, see you all next week! | 08:46 |
jam1 | mgz: see you soon | 08:46 |
jam1 | have a good trip | 08:47 |
jam1 | you'll see some of us tomorrow at gophercon, right? | 08:47 |
mgz | jam1: thanks! and yeah, some this week | 08:48 |
jam1 | axw: lgtm on your dependencies branch | 09:01 |
axw | jam1: ta | 09:01 |
jam1 | we'll have to make the bot get the latest version, though | 09:01 |
jam1 | fortunately, I know someone who is currently logged in | 09:01 |
axw | :) | 09:01 |
axw | I thought the bot updated now? | 09:01 |
jam1 | axw: it runs godeps | 09:02 |
jam1 | but that won't pull in new data | 09:02 |
jam1 | it does do go get -u when you poke config | 09:02 |
jam1 | axw: I can't *quite* get go get -u to not screw up the directory under test | 09:02 |
axw | jam1: it does godeps? "godeps -u" updates the code thought...? | 09:03 |
axw | though* | 09:03 |
vladk | jam1: please, take a look https://codereview.appspot.com/90580043 | 09:05 |
vladk | I will be offline until meeting | 09:05 |
=== vladk is now known as vladk|offline | ||
axw | woop, add-machine <hostname> works... now the fun of updating the test service | 09:15 |
jam1 | axw: it sets the version of an existing tree to that revision. It does not *pull* data from remote sources. | 09:28 |
jam1 | so if it isn't present locally, godeps -u doesn't work | 09:28 |
axw | jam1: ah right, I see | 09:28 |
jam1 | axw: so I haven't gotten a chance to dig into it thoroughly, but are we writing "/var/lib/juju/system-identity" via cloud-init? Or are we only using the cloud-initty stuff to get it on there via SSH bootstrap? | 09:29 |
axw | jam1: yes, that is how it is done now. I'm not a fan | 09:30 |
axw | jam1: actually... | 09:31 |
axw | jam1: sorry, no, we SSH in and then put it in place | 09:31 |
axw | jam1: anything inside environs/cloudinit.ConfigureJuju happens after cloud-init, but only for the bootstrap node | 09:32 |
psivaa | hello, could someone help me build juju from source pls? | 09:34 |
psivaa | I'm getting http://paste.ubuntu.com/7313347/ when i run go install -v launchpad.net/juju-core/... | 09:34 |
voidspace | psivaa: I'm just doing a pull and trying now | 09:38 |
voidspace | psivaa: works for me | 09:38 |
voidspace | psivaa: so I suspect you're using a "too old" version of Go | 09:39 |
voidspace | psivaa: what does "go version" say? | 09:39 |
voidspace | psivaa: I'm on 1.2.1 (built from source) | 09:39 |
psivaa | voidspace: 'go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64' is the output for go version | 09:39 |
axw | fwereade: maas-name support -> https://codereview.appspot.com/90470044/ | 09:40 |
jam | psivaa: actually that looks like an incompatible version of go crypto | 09:40 |
axw | fwereade: still need to support it in bootstrap | 09:40 |
fwereade | axw, awesome :) | 09:40 |
axw | (and add-unit and deploy, but they're coming later) | 09:40 |
jam | psivaa: if you "go get launchpad.net/godeps" you can run "godeps -u dependencies.tsv" and it should grab the right versions of dependencies | 09:40 |
psivaa | jam: ack, i did 'hg clone https://code.google.com/p/go.crypto/' to get go crypto. | 09:41 |
psivaa | jam: voidspace: thanks. i'll try your suggestion | 09:41 |
jam | psivaa: gccgo 4.9 should be new enough | 09:42 |
jam | psivaa: My guess is that go crypto updated their apis, which broke our use of their code | 09:42 |
jam | and we haven't caught up yet | 09:43 |
jam | which is why we have dependencies.tsv to ensure we can get compat versions | 09:43 |
psivaa | jam: ahh ack, i'll use that. thanks | 09:43 |
jam | psivaa: if you don't want godeps, then you can hg update --revision 6478cc9340cbbe6c04511280c5007722269108e9 | 09:43 |
jam | I think | 09:43 |
jam | psivaa: looks like just "hg update 6478cc9340cbbe6c04511280c5007722269108e9" | 09:44 |
fwereade | axw, LGTM, it's really nice to see it implemented with such a small amount of new code:) | 09:48 |
axw | fwereade: :) thanks | 09:48 |
axw | fwereade: sadly the bootstrap one will be a bit larger - I'll need to change Environ.Bootstrap | 09:49 |
fwereade | axw, sure, but it's absolutely a desirable change, and subsequent ones (like zone on ec2) will themselves then basically come for free :) | 09:50 |
axw | yup | 09:50 |
fwereade | vladk|offline, ping me when you're back please -- wondering whether we should really share an identity across state servers, or whether we should be creating one each | 09:52 |
=== axw is now known as axw-away | ||
fwereade | vladk|offline, ah, forget it, I made bad assumptions in the first reading | 09:54 |
=== vladk|offline is now known as vladk | ||
voidspace | my parents have just turned up for coffee | 10:06 |
vladk | fwereade: ping | 10:06 |
voidspace | be afk for 15minutes :-) | 10:06 |
fwereade | vladk, pong | 10:06 |
fwereade | vladk, I see we have separate identities, sorry I misread; but I don't see when we'll rerun those upgrade steps. perhaps we'll definitely never need them? | 10:07 |
perrito666 | good soon to be morning everyone | 10:17 |
vladk | fwereade: I just used a formatter struct, my code does nothing with upgrade. I don't know whether the SSH key will be distributed on tools upgrade. It wasn't my task. | 10:18 |
vladk | But the SSH key will be installed on every new machine with a state agent. | 10:18 |
vladk | Should I investigate what occurs during upgrade? | 10:18 |
fwereade | vladk, ahh, I see | 10:19 |
fwereade | vladk, yes, please see if you can find a way to break it by upgrading at a bad time | 10:20 |
fwereade | vladk, if you can't, then LGTM, just note it in the CL and ping me to give it the official stamp ;) | 10:21 |
fwereade | perrito666, heyhey | 10:21 |
fwereade | perrito666, sorry I left you hanging last week, I think I managed to send you another review a day or two ago though -- was it useful? | 10:21 |
jam1 | fwereade: AFAIK we don't have different identities, do we? | 10:22 |
jam1 | fwereade: https://codereview.appspot.com/90580043/patch/1/10013 concerns me | 10:22 |
jam1 | are we actually writing that to userdata ? | 10:22 |
jam1 | (exposing the secret ssh id) | 10:22 |
jam1 | I think axw-away claimed that we didn't actually do that during bootstrap | 10:22 |
perrito666 | fwereade: It was, although right now I put that on hold since I am juggling with a brand new set of restore bugs :p | 10:22 |
fwereade | jam1, it does indeed look like we were, grrmbl grrmbl; but it looks to me like what we do now is generate a fresh id and add that to the system, as one of N keys for the state-server "user", per state-server-machine | 10:23 |
fwereade | jam1, so I think it's solid -- did I miss something | 10:24 |
fwereade | perrito666, ok, great -- I'm here to talk further if you need me | 10:24 |
jam1 | fwereade: I haven't yet found that bit that you're talking about (where we actually generate the new value) | 10:25 |
jam1 | I see the code that if we have the value we write it onto disk | 10:25 |
jam1 | fwereade: but while we remove this: https://codereview.appspot.com/90580043/patch/1/10012 | 10:26 |
jam1 | I don't see the the SystemPrivateSSHKey being removed from MachineCfg | 10:26 |
jam1 | nor have I yet found anything that creates or populates the contents of identity | 10:27 |
jam1 | but I could easily just be missing it, though I've gone over the patch a few times now | 10:27 |
fwereade | jam1, hum, yes, I now think I was seeing that bit in the upgrade instructions alone | 10:27 |
fwereade | jam1, yeah, I think that's the only place -- vladk, thoughts? ^^ | 10:28 |
fwereade | jam1, but fwiw, I suspect that the stuff in cloudinit is actually not in *cloudinit*, only in the bit that gets rendered as a script when we ssh in at bootstrap time | 10:29 |
jam1 | fwereade: and we are calling AddKeys(config.JujuSystemKey, publicKey) and setting it to exactly 1 key | 10:29 |
jam1 | fwereade: right, so I'm not very sure about the cloudinit stuff because we did the bad thing and punned it | 10:29 |
fwereade | jam1, AddKeys is meant to *add*, not update -- did that change? | 10:30 |
jam1 | so that sometimes cloud-init is rendered to actual cloud-init | 10:30 |
jam1 | and sometimes it is rendered to a ssh script | 10:30 |
jam1 | fwereade: ah, it might | 10:30 |
fwereade | jam1, believe me, I told the affected parties when they wrote the environs/cloudinit module *waaay* back in the day -- cloudinit is just one possible output format | 10:30 |
fwereade | jam1, sadly I was not in an official tantrum-throwing position at that time ;p | 10:31 |
jam1 | fwereade: also, I think we have a point that steps118.go is only run when upgrading from 1.16 to 1.18, so it *won't* be run when upgrading to 1.20 (from 1.18) | 10:31 |
jam1 | but I don't think that actually matters here | 10:31 |
jam1 | as we don't actually need to fix upgrade | 10:32 |
jam1 | because HA is new in 1.19, so we don't have anything that we're upgrading | 10:32 |
psivaa | jam1: jfyi, godeps method made installing from source work for me. thanks | 10:32 |
fwereade | jam1, I think that, yeah, upgrade is irrelevant except in that it's the one place that actually sets up the keys | 10:32 |
jam1 | fwereade: the issue is that if we are going to give each one a unique identity (which I think is better, fwiw, but I'm not sure if it breaks some assumptions) | 10:32 |
jam1 | I would expect us to see a change in AddMachine() | 10:32 |
jam1 | or EnsureAvailability | 10:32 |
jam1 | fwereade: it sets up the first key | 10:33 |
jam1 | fwereade: I really don't see how his patch would populate the new "identity" field in agent.conf | 10:33 |
jam1 | fwereade: but the fact that we have 3 or 4 types with a StateServingInfo method, and each gets its data from somewhere else | 10:34 |
jam1 | (might be API, might be agent.conf, might be ...) | 10:34 |
vladk | fwereade, jam1: about https://codereview.appspot.com/90580043/patch/1/10012 | 10:34 |
vladk | This is a part of ssh-init script construction. | 10:34 |
vladk | Now the ssh key is passed inside the agent.conf file, so I removed its direct creation. | 10:34 |
jam1 | vladk: right, I think that line is great | 10:35 |
jam1 | vladk: but I haven't managed to find the part that actually sets the contents of the agent.conf file | 10:35 |
vladk | here https://codereview.appspot.com/90580043/patch/1/10005 | 10:35 |
vladk | via yaml marshaling | 10:36 |
jam1 | vladk: but what is setting it on the struct | 10:36 |
jam1 | (I'm also not sure that we're allowed to change the content of an agent.conf without bumping the format number, but that is a later concern) | 10:36 |
jam1 | vladk: I see a lot of stuff that "if we have the data set" gets it written to the right places, which all looks good | 10:37 |
jam1 | I just haven't managed to find a line that is "SystemIdentity = XXXXX" | 10:37 |
jam1 | vladk: going the route you did, I would expect to see a change in state/addmachine.go | 10:39 |
jam1 | to something in either EnsureAvailability or elsewhere | 10:39 |
jam1 | to create the system-identity data that the machine agent then reads from agent.conf later | 10:39 |
vladk | jam1: https://codereview.appspot.com/90580043/patch/1/10008 set to StateServingInfo | 10:40 |
vladk | https://codereview.appspot.com/90580043/patch/1/10005 set to formatter of agent.conf | 10:40 |
jam1 | vladk: thanks, fwereade^^ your original assumption is wrong, they all get the same value, and it is being written via cloud-init (from what I can tell) | 10:41 |
jam1 | which is sad news, I believe | 10:41 |
jam1 | vladk: I expected that we would be actually calling an API to get that data during cmd/jujud/machine.go | 10:41 |
jam1 | if we are only reading it from disk | 10:41 |
jam1 | then we wrote it to disk via cloud-init | 10:41 |
jam1 | which means we are passing our ssh secret key to EC2 | 10:41 |
jam1 | to hand back to us | 10:41 |
jam1 | we got away with it (slightly) with "bootstrap" because bootstrap actually SSH's onto the machine to write those files | 10:42 |
fwereade | well fuck | 10:42 |
jam1 | but all other provisioning is done via cloud-init and follow up calls to the API | 10:42 |
fwereade | honestly I'd expect us to just generate it at runtime | 10:42 |
fwereade | jam1, wait, we're writing state-server info to new state servers we provision? | 10:43 |
perrito666 | wwitzel3: can you see me? | 10:43 |
jam1 | fwereade: I had originally thought they should be shared, but honestly, I like your idea to have the agent come up | 10:43 |
jam1 | check that it doesn't have one | 10:43 |
jam1 | generate it | 10:43 |
fwereade | jam1, that's *all* meant to come over the API | 10:43 |
jam1 | and add the public key only to the list of accepted keys | 10:43 |
fwereade | jam1, and indeed in this case there's no reason not to do it on the agent | 10:43 |
jam1 | fwereade: *I* don't understand the code very well | 10:43 |
jam1 | we do some crazy shit | 10:43 |
jam1 | about writing agent.conf | 10:43 |
jam1 | and then reading it back in | 10:43 |
jam1 | fwereade: all of the code in machine.go uses agentConfig.StateServingInfo() | 10:44 |
jam1 | fwereade: except line 240 | 10:44 |
jam1 | where we call st.Agent().StateServingInfo() | 10:44 |
jam1 | and then call: err = a.ChangeConfig(func(config agent.ConfigSetter) { | 10:45 |
jam1 | config.SetStateServingInfo(info) | 10:45 |
jam1 | }) | 10:45 |
jam1 | to get it written to disk | 10:45 |
jam1 | for everything else to read | 10:45 |
jam1 | fwereade: but I *think* there is a bug that you have to have it written to agent.conf first, so that you come up thinking you want to be an API server | 10:45 |
jam1 | fwereade: also see machine.go line 458 | 10:46 |
jam1 | that says "this is not recoverable, so we kill it, in the future we might get it from the API" | 10:46 |
jam1 | there *is* an issue with bootstrap, the first API server obviously has to get it from agent.conf | 10:46 |
jam1 | so there is some bit of we can't just always read from the api | 10:46 |
jam1 | I guess | 10:46 |
jam1 | but the swings and roundabouts make it hard for me to reason | 10:46 |
jam1 | anyway, standup time, switching machines | 10:47 |
jam | fwereade: standup ? | 10:48 |
perrito666 | Horacio Durán | 10:59 |
perrito666 | jam: | 10:59 |
voidspace | jam: on the logging, the theory is that all the state servers should have *all* the logging - so when bringing up a new state server it really shouldn't need to connect to *all* state servers to get existing logging. Any one (that is fully active) should do. | 11:38 |
jam | voidspace: I understand that, but when you go from 1 to 3, you'll probably see the other api server that is coming up at the same time, and then it is just random-chance if you get the full log or not | 11:38 |
jam | (similarly going from 3-5) | 11:39 |
jam | though not going from degraded-2 to 3 | 11:39 |
voidspace | jam: right, so being able to determine if it's fully active or not would help - but if we can't do that then maybe there's no other way | 11:39 |
jam | voidspace: I certainly understand why it might work, but my point would still be "we can iron out getting the backlog later, because it isn't the most important thing right now" | 11:39 |
voidspace | jam: ok, understood | 11:39 |
voidspace | connecting to all state servers and filtering out duplicate logging offends me though | 11:40 |
voidspace | (and it's O(n^2) if you bring up lots of state servers) | 11:40 |
jam | voidspace: its O(n) if the data was properly sorted :) | 11:41 |
natefinch | definitely just ignore the backlog for now. We'll get a real logging framework set up that will do more than rsyslog. There's a topic for it in Vegas. | 11:42 |
jam | though you only ever have 7 state servers (because we use mongo, and mongo has that limit) | 11:42 |
voidspace | ah | 11:42 |
voidspace | still, I'm sure we can do better | 11:42 |
natefinch | jam: in theory you can have up to 12 as long as only 7 are voting. | 11:42 |
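The O(n) point above assumes the per-server logs are already sorted, so one merge pass can drop duplicates instead of filtering everything collected from every state server; an illustrative sketch:

```go
package logmerge

// mergeSorted merges two timestamp-sorted slices of log lines and drops exact
// duplicates in a single pass, which is the linear behaviour referred to above.
func mergeSorted(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) || j < len(b) {
		var next string
		if j >= len(b) || (i < len(a) && a[i] <= b[j]) {
			next, i = a[i], i+1
		} else {
			next, j = b[j], j+1
		}
		// Duplicates are adjacent in a sorted merge, so one comparison suffices.
		if n := len(out); n == 0 || out[n-1] != next {
			out = append(out, next)
		}
	}
	return out
}
```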
vladk | jam: 1) do we need different identites on different machines? | 11:46 |
vladk | 2) should I find places where agent.conf is written and where SystemIdentity is assigned? | 11:46 |
ghartmann | do we already have any clue why add-machine doesn't work for local providers anymore ? | 11:46 |
jam | ghartmann: I hadn't heard that that was the case | 11:53 |
jam | is there a bug/context/paste ? | 11:53 |
ghartmann | I don't get any logs at all | 11:53 |
ghartmann | the machines just stick on pending | 11:53 |
ghartmann | I tried installing on the VM and saw the same issue | 11:54 |
ghartmann | I decided to roll back to 1.18 | 11:54 |
ghartmann | and it's kinda working | 11:54 |
ghartmann | I can't boot precise but trusty works | 11:55 |
ghartmann | by the way | 11:58 |
ghartmann | I am willing to help but I am struggling a bit on how to debug the code | 11:58 |
fwereade | ghartmann, sorry, my internet is up and down, I am missing context | 12:01 |
fwereade | ghartmann, but I would like to help you if I can | 12:01 |
ghartmann | I am currently using juju for local provider only | 12:05 |
ghartmann | best way to prototype and fix charms | 12:05 |
ghartmann | but since I updated juju I am unable to start any machines | 12:05 |
ghartmann | or they start but take way too long | 12:05 |
ghartmann | 30 minutes if they do start | 12:06 |
fwereade | ghartmann, hmm, that "way too long" is really interesting, to begin with it sounded like it might be https://bugs.launchpad.net/juju-core/+bug/1306537 | 12:06 |
_mup_ | Bug #1306537: LXC provider fails to provision precise instances from a trusty host <deploy> <local-provider> <lxc> <juju-core:Triaged> <juju-quickstart:Triaged> <https://launchpad.net/bugs/1306537> | 12:06 |
ghartmann | I would imagine that someone has reported it because being unable to start machines is a breaking issue | 12:08 |
ghartmann | I am trying to understand why this happens and how can I help | 12:09 |
fwereade | ghartmann, ok, the best way to collect information is to `juju set-env "logging-config=<root>=DEBUG"`; and then to look in /var/log/juju-<envname> | 12:12 |
fwereade | ghartmann, in fact looking at the lxc code you might want to set juju.container.lxc=TRACE | 12:14 |
jam1 | fwereade: I think if you "juju bootstrap --debug" it does that level of logging, doesn't it ? | 12:15 |
jam1 | DEBUG (not TRACE) | 12:15 |
fwereade | jam1, yeah, I was assuming an existing environment | 12:15 |
fwereade | jam1, but if it's not working I guess there's not much reason to keep the old one around | 12:16 |
fwereade | jam1, and in particular a lot of the lxc stuff is only logged at trace level, I now observe | 12:16 |
jam1 | vladk: so having unique identities is more of a "it would be nice if they did" rather than "they must" | 12:18 |
fwereade | ghartmann, if you're struggling to find *where* in the code I would start poking around in the container/lxc package -- specifically CreateContainer in lxc.go -- but I'm not sure if that's what you're asking | 12:18 |
ghartmann | the debug helps a little bit but it seems it believes that it worked ... "2014-04-23 12:16:50 INFO juju.cmd.juju addmachine.go:152 created machine 4" | 12:20 |
jam1 | ghartmann: created machine is creating a record in the DB for a new machine | 12:21 |
fwereade | ghartmann, that just indicates that it recorded we'd like to start the container | 12:21 |
jam1 | != actually started a machine | 12:21 |
ghartmann | ah ok | 12:21 |
fwereade | ghartmann, it's possible that the provisioner is implicated, but in particular the slowness STM to point to the actual nuts and bolts of the container work | 12:22 |
jam1 | fwereade: so I think his statement was "it isn't working after 30 minutes" which means it hasn't actually worked yet | 12:22 |
fwereade | jam1, ok, I see :) | 12:22 |
jam1 | fwereade: ghartmann: if it *was* working, it would still need to download the precise/trusty cloud image, but that download should only need to happen once | 12:22 |
ghartmann | I will try looking on lxc | 12:23 |
fwereade | ghartmann, do you see any lines mentioning the provisioner in the logs? | 12:23 |
fwereade | ghartmann, in particular "started machine <id> as instance ..." | 12:24 |
ghartmann | opening environment local | 12:24 |
ghartmann | no started machine | 12:24 |
ghartmann | you mean on .juju/local/log right ? | 12:25 |
ghartmann | I am now trying to start the machine manually | 12:26 |
ghartmann | it seems that the machine can't start a network device | 12:27 |
fwereade | ghartmann, ah! you get a container created but it won't do anything? | 12:28 |
ghartmann | it seems that the lxc-start doesn't start the machine | 12:32 |
ghartmann | I will try to get it working first | 12:33 |
ghartmann | it is something related with the network | 12:33 |
ghartmann | it seems that the network of the machine doesn't start | 12:34 |
ghartmann | I will try making it as a bridge | 12:34 |
ghartmann | will let you know once I finish it | 12:34 |
ghartmann | thanks for the ideas | 12:34 |
fwereade | ghartmann, there's a "network-bridge" setting for the local provider which defaults to lxcbr0 -- that works for most people, but possibly you have a different setup there? | 12:34 |
ghartmann | I am using the standard | 12:34 |
ghartmann | but I will change a few things on my network | 12:35 |
ghartmann | will take a while | 12:35 |
jam | fwereade: so there is a bug that deploying precise on trusty will fail because of "no matching tools found" | 12:37 |
jam | fwereade: 2014-04-23 12:36:43 ERROR juju runner.go:220 worker: exited "environ-provisioner": failed to process updated machines: cannot start machine 1: no matching tools available | 12:37 |
fwereade | jam, is that different from the one I linked? | 12:37 |
jam | fwereade: it might be the root cause of the one linked, I'm not sure | 12:38 |
jam | fwereade: ghartmann: so one option is to try running "juju bootstrap --series precise,trusty" or possibly "juju upgrade-juju --series=precise,trusty --upload-tools" to see if that gets things unstuck. But for *me* the provisioner is spinning on not creating an LXC instance because it cannot find the right tools | 12:42 |
jam | if you got past that part | 12:42 |
jam | fwereade: so it would seem that if the provisioner cannot provision machine 1 because of no tools, it won't try to provision machine 2 | 13:21 |
jam | (in this case, the former is precise, the latter is trusty) | 13:21 |
fwereade | jam, I think the core of it all is tools.HasTools | 13:23 |
fwereade | jam, oh, wait, it actually can't be here, can it | 13:23 |
fwereade | jam, but the provisioner task's possibleTools method is all messed up anyway :/ | 13:25 |
jam | fwereade: the check we have that all machines are running the same agent version also fails when you have dead machines (since nil != "1.18.1.1") | 13:26 |
jam | so you can't use "juju upgrade-juju --upload-tools --series precise,trusty" to trick it | 13:26 |
fwereade | jam, not without force-destroying the machines, yeah | 13:26 |
jam | fwereade: but for *me* if I "juju bootstrap -e local --upload-tools --series precise,trusty" it works | 13:26 |
jam | without the --series trick, it gets stuck never finding tools for the precise charm | 13:27 |
jam | and then never getting to try for the trusty charm | 13:27 |
jam | seemingly | 13:27 |
=== BradCrittenden is now known as bac | ||
fwereade | jam, it seems reasonably likely that the provisioner is just failing out on the first one, and then trying again in the same order when it comes back up | 13:28 |
jam | fwereade: right | 13:29 |
jam | fwereade: I would have thought the provisioner would fail and keep trying the next one | 13:29 |
jam | though perhaps the idea is that if tools aren't available yet, it isn't worth trying until later? | 13:29 |
fwereade | jam, yeah, unless explicitly handled otherwise we assume that errors might fix themselves if we try again later | 13:30 |
fwereade | jam, frankly it's insane that the provisioner even knows about tools in the first place | 13:30 |
jam | fwereade: well, it needs to pass them to cloud init | 13:33 |
jam | so that the machine that is starting up can get them | 13:33 |
jam | fwereade: why is that insane ? | 13:33 |
fwereade | jam, the environ *already knows about the tools*. we *ask it where to find the tools*. | 13:34 |
voidspace | lunch | 13:34 |
fwereade | jam, a bit more than a year ago, we managed to refactor some of the way, but not all | 13:34 |
jam | fwereade: is it intended to stay that way? Given we've talked about object storage in mongo | 13:34 |
fwereade | jam, tools-in-state would indeed change the picture significantly, it's true | 13:37 |
fwereade | jam, but even then the provisioner would just be a dumb pipe wrt tools, I think | 13:38 |
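As a loose illustration of the "dumb pipe" idea (hypothetical names, not the juju-core APIs of the time): the environ is the thing that knows where its tools live, and the provisioner only forwards what it is given to cloud-init.

```go
// Hypothetical sketch of the "dumb pipe" pattern: the provisioner never
// inspects tools itself, it simply forwards whatever the environ reports.
// None of these names are the real juju-core APIs of the period.
package main

import "fmt"

// Tools describes a single agent binary tarball.
type Tools struct {
	Version string
	URL     string
}

// ToolsFinder is what the environ would expose: it already knows where
// its tools live, so the provisioner only has to ask.
type ToolsFinder interface {
	FindTools(series, arch string) ([]Tools, error)
}

type stubEnviron struct{}

func (stubEnviron) FindTools(series, arch string) ([]Tools, error) {
	return []Tools{{Version: "1.19.1-" + series + "-" + arch, URL: "https://example.com/tools.tgz"}}, nil
}

// startMachine passes the tools list straight into the cloud-init payload
// without making any decisions about it.
func startMachine(finder ToolsFinder, series, arch string) error {
	tools, err := finder.FindTools(series, arch)
	if err != nil {
		return fmt.Errorf("cannot find tools for %s/%s: %v", series, arch, err)
	}
	fmt.Printf("handing %d tools entries to cloud-init\n", len(tools))
	return nil
}

func main() {
	_ = startMachine(stubEnviron{}, "trusty", "amd64")
}
```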
jam | fwereade: I thought "juju destroy-machine --force" was intended to prevent this status: | 13:39 |
jam | "2": | 13:39 |
jam | instance-id: pending | 13:39 |
jam | life: dead | 13:39 |
jam | series: trusty | 13:39 |
fwereade | jam, hmm, yeah, the provisioner ought to be able to kill all the dead machines before it starts worrying about the live ones | 13:40 |
jam | fwereade: well it is possible that it will get to it soon, but it is stuck downloading the cloud-image template | 13:40 |
jam | which is a few MB | 13:40 |
jam | like 100 or so | 13:40 |
fwereade | jam, btw, I don't suppose you know where that "instance-id: pending" business comes from? | 13:40 |
fwereade | jam, either we have an instance-id or we don't | 13:40 |
jam | fwereade: in that particular case, the "trusty-template" fslock was left stale | 13:41 |
jam | when I called "destroy-environment" while not waiting for trusty to come up. | 13:41 |
axw-away | jam: just saw your message about system-identity in cloud-init. that test you linked to is a bit misleading; it's running Configure, when it should be running ConfigureBasic | 13:41 |
axw-away | jam: IOW, the test does not reflect what we really do on bootstrap | 13:41 |
fwereade | oh WTF | 13:41 |
jam | fwereade: I'm also seeing: 2014-04-23 13:41:08 WARNING juju.worker.instanceupdater updater.go:231 cannot get instance info for instance "": no instances found | 13:41 |
* axw-away goes back away | 13:42 | |
fwereade | jam, looks like m.InstanceId is not erroring when it should? | 13:44 |
jam | fwereade: perhaps | 13:47 |
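A minimal sketch of the kind of guard being discussed, using invented types rather than the real state package: skip machines that have no instance id recorded yet instead of asking the provider about instance "".

```go
// Hypothetical sketch: machines without a provisioned instance are
// filtered out before the provider is queried, so the instance poller
// never asks about an empty instance id. Types and method names are
// illustrative, not the real juju-core state API.
package main

import (
	"errors"
	"fmt"
)

var errNotProvisioned = errors.New("machine is not provisioned")

type machine struct {
	id         string
	instanceID string
}

// InstanceId mimics an accessor that errors when no instance is recorded yet.
func (m *machine) InstanceId() (string, error) {
	if m.instanceID == "" {
		return "", errNotProvisioned
	}
	return m.instanceID, nil
}

// instanceIDsToPoll collects only ids that are actually set.
func instanceIDsToPoll(machines []*machine) []string {
	var ids []string
	for _, m := range machines {
		id, err := m.InstanceId()
		if err != nil {
			continue // not provisioned yet: nothing to poll
		}
		ids = append(ids, id)
	}
	return ids
}

func main() {
	ms := []*machine{{id: "1", instanceID: "i-abc123"}, {id: "2"}}
	fmt.Println(instanceIDsToPoll(ms)) // [i-abc123]
}
```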
jam | fwereade: so from what I can sort out, vladk's patch is worth landing. I'm still confused by bits of it (why is it working), but I can accept that it might just be because I don't understand the swings and roundabouts | 13:53 |
jam | certainly he said he confirmed that secrets aren't going to EC2 | 13:53 |
jam | fwereade: a potential fix for bug #1306537: https://codereview.appspot.com/90640043 | 13:54 |
_mup_ | Bug #1306537: LXC local provider fails to provision precise instances from a trusty host <deploy> <local-provider> <lxc> <juju-core:In Progress by jameinel> <juju-core 1.18:In Progress by jameinel> <juju-quickstart:Triaged> <https://launchpad.net/bugs/1306537> | 13:54 |
hazmat | question via email this morning.. local provider (using lxc).. doing deploy --to kvm:0 is supported? | 13:57 |
jam | hazmat: my understanding is that it has worked, perhaps accidentally but it was working | 13:58 |
wwitzel3 | voidspace: I'm going to grab an early lunch and do an errand and we can sync up with where we are at when I get back. | 14:00 |
fwereade | jam, I'm worried about that because tim added a hack somewhere else in an attempt to resolve essentially the same problem | 14:02 |
fwereade | jam, except it's not quite the-same *enough* I guess | 14:03 |
jam | fwereade: so there is certainly a bit of "this worked for me" vs feeling good about the change. but I have the strong feeling that feeling good about the change means a much bigger overhaul of our internals | 14:03 |
jam | fwereade: so I filed bug #1311677 | 14:04 |
_mup_ | Bug #1311677: if the provisioner fails to find tools for one machine it fails to provision the others <provisioning> <status> <ui> <juju-core:Triaged> <https://launchpad.net/bugs/1311677> | 14:04 |
jam | and looking at it | 14:04 |
jam | (the startMachines code) | 14:04 |
jam | it does exit on the first failure | 14:04 |
jam | and we have the fact that on "normal" provisioning failures | 14:04 |
jam | we call "task.setErrorStatus" | 14:04 |
jam | so if one fails | 14:04 |
jam | we mark it failing | 14:04 |
jam | and then just go back to doing the next thing when we wake up again | 14:05 |
jam | however, if possibleTools fails | 14:05 |
jam | we *don't* call setErrorStatus | 14:05 |
jam | so that machine stays around blocking up all other work | 14:05 |
jam | fwereade: my concerns. 1) We could try to keep provisioning even on errors, but if we are getting RateLimitExceeded, we really should just shut up and go sleep for a while | 14:06 |
jam | 2) Do we expect that possibleTools is actually going to resolve itself RealSoonNow? | 14:06 |
jam | now that we have the idea of Transient failures, could we treat "no tools" as one there? | 14:06 |
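A rough sketch of the fix implied above, with hypothetical names standing in for the provisioner task's internals: a tools-lookup failure gets setErrorStatus like any other per-machine error and the loop continues, so one tool-less machine no longer blocks the rest.

```go
// Illustrative sketch, not the actual startMachines code: treat a tools
// failure like a normal provisioning failure (mark the machine, move on)
// instead of returning early and starving the remaining machines.
package main

import "fmt"

type machine struct{ id, series string }

// Assumption for the example: tools exist only for trusty.
var toolsAvailable = map[string]bool{"trusty": true}

func possibleTools(m machine) error {
	if !toolsAvailable[m.series] {
		return fmt.Errorf("no matching tools available for %s", m.series)
	}
	return nil
}

func setErrorStatus(m machine, err error) {
	fmt.Printf("machine %s marked failed: %v\n", m.id, err)
}

func startMachine(m machine) error {
	fmt.Printf("machine %s started\n", m.id)
	return nil
}

// startMachines keeps going after individual failures instead of
// returning on the first one, so machine 2 is not starved because
// machine 1 has no tools.
func startMachines(machines []machine) {
	for _, m := range machines {
		if err := possibleTools(m); err != nil {
			setErrorStatus(m, err) // previously this path returned early
			continue
		}
		if err := startMachine(m); err != nil {
			setErrorStatus(m, err)
			continue
		}
	}
}

func main() {
	startMachines([]machine{{"1", "precise"}, {"2", "trusty"}})
}
```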
fwereade | jam, still thinking | 14:08 |
fwereade | jam, re (1), I really think we have to do the rate-limiting inside the Environ, and use a common Environ for the various workers that need one | 14:08 |
jam | fwereade: so even with that we are likely to eventually exceed our retries | 14:09 |
jam | (say we retry up to 3 times, do we want to come back tomorrow?) | 14:09 |
jam | I don't think we want to block a worker thread completely in Environ for more than ... minutes? | 14:09 |
* jam gets called away to actually be part of a family | 14:10 | |
fwereade | jam, if you come back sometime soon: I don't think that tools failure is transient, so I don't think treating it as such will really help -- setErrorStatus is probably the right answer to the problem (apart from anything else, precise/trusty are not the only series people will use even if they are *today*) | 14:13 |
fwereade | to *that* problem | 14:13 |
natefinch | fwereade: definitely, no tools is likely to be a semi-permanent problem for all intents and purposes, certainly not something likely to get fixed within a small number of minutes, which is the most amount of time I can conceive of actually waiting for something to succeed. | 14:14 |
hazmat | jam, it works; the question is whether it's supported. i thought thumper had said that it was, but various folks are getting mixed signals on it | 14:21 |
hazmat | so there's some confusion in that regard | 14:21 |
sinzui | jam, fwereade, I think we are 2+ weeks away from a stable 1.20. I want to try for a 1.18.2 release this week. | 14:22 |
natefinch | hazmat: it works by accident. I wouldn't say it is "supported" | 14:22 |
jam1 | sinzui: so my understanding is that there is very strong political pressure to get something out that has HA in a 'stable' release by the end of the week. We don't have to close all the High bugs to get there. | 14:23 |
natefinch | hazmat: which is to say, I wouldn't rely on it working in the future. | 14:23 |
jam1 | I think we might be able to do a 1.19.1 today | 14:23 |
jam1 | which will be missing debug-log in HA, and backup/restore, I think | 14:23 |
jam1 | but I think we can land Vladk's patch to get "juju run" to work in 1.19.1 and HA | 14:23 |
sinzui | jam1, You cannot have stable release until after users have given feedback. If I release today, you still don't get feedback until next week | 14:24 |
hazmat | natefinch, so if we have folks that need a working, supported solution for lxc and kvm today, the answer is you're out of luck? and we don't support lxc and kvm in the same local provider. | 14:24 |
jam1 | fwereade: sinzui: alexisb (if around) I'm not the one who has the specifics for why we need HA available for April 25th, can you give more context ? | 14:24 |
sinzui | jam1, also CI still doesn't pass HA. Someone might need to work with abentley to make the test pass or find the bug that might be in the code | 14:25 |
fwereade | hazmat, I don't *like* it, but ISTM that it's (1) useful and (2) used, so we don't have any reasonable option for breaking it without providing an alternative | 14:25 |
hazmat | fwereade, there's an extant bug on the latter, to support kvm and lxc containers in the same provider, which would also work, but it's a bit more work. | 14:25 |
jam1 | fwereade: hazmat: I would agree with the "we shouldn't break it without providing another way" | 14:25 |
jam1 | hazmat: you still have the problem with spelling "I want to deploy the next one into KVM", unless we go all the way and make all the things you deploy prefixed | 14:26 |
hazmat | ok.. so supported for now .. till we have something better :-) | 14:26 |
hazmat | jam, any placement effectively bypasses constraints | 14:26 |
hazmat | fwereade, jam1, thanks | 14:27 |
sinzui | jam1, alexisb, fwereade: I am not here to be the voice of idealism. I am the voice of pragmatism. We know developers, users, and CI find bugs, and all three need to affirm the feature works. There is not enough information to call HA stable for release | 14:27 |
fwereade | jam1, hazmat: or we bite the bullet and get multi-provider environments going; at which point it's just another pseudo-provider and should Just Work | 14:27 |
fwereade | jam1, hazmat: but I'm not confident that'll happen any time soon | 14:27 |
jam1 | fwereade: then there is the argument that cross-env relations is better than multi-provider ones | 14:27 |
jam1 | fwereade: if only because for most of them, you actually still want to run an agent local to that provider | 14:28 |
alexisb | jam1, the 4/25 date for the 1.20 release was set because the target for a release with HA is ODS and jamespage needs some time to integrate | 14:28 |
hazmat | long term that sounds great, manual provider with cross region worked well enough for most of those cases for me till 1.19 (the address stuff breaks it) | 14:28 |
alexisb | but as sinzui points out it has to be ready, which it is not | 14:29 |
jam1 | alexisb: fwiw, it is probably ready enough for jamespage to look into integrating it | 14:29 |
alexisb | jam1, ok, we should connect with jamespage then | 14:30 |
sinzui | alexisb, jamespage If you get juju 1.19.1 with HA this week, is that good enough to test? | 14:30 |
natefinch | jam1, alexisb: that was going to be my thought as well. There's some edge case stuff that should be fixed, but the main workings are all there | 14:30 |
jam1 | sinzui: though probably we'll want to get 1.19.1 rather than have him running trunk | 14:30 |
jam1 | sinzui: I was trying to assign someone to work on the HA bug today, I think natefinch is the one that volunteered to get the test running | 14:30 |
alexisb | sinzui, jam1 how close are we to a 19.1 release? | 14:30 |
alexisb | I see 2 critical bugs still being worked | 14:31 |
sinzui | alexisb, jam1, you are actually on schedule for a Friday release | 14:31 |
jam1 | alexisb: one of those should have a patch that should be landing, I don't know for sure why it hasn't | 14:31 |
sinzui | I just don't see that release being called 1.20 | 14:31 |
jam1 | the other is "juju backup" which is also supposed to have something from perrito666, but may not have to block 1.19.1 | 14:31 |
alexisb | sinzui, agreed | 14:31 |
jam1 | sinzui: I agree, I don't think 1.19.1 is 1.20 | 14:31 |
jam1 | but it is HA out for testing | 14:31 |
* perrito666 feels conjured | 14:31 | |
jam1 | to get feedback to drive a proper 1.20 | 14:32 |
jam1 | perrito666: so you were working to get "juju backup" to find /usr/lib/juju/bin/mongod when available, did that get done? | 14:32 |
alexisb | jamespage, would a 1.19.1 development release be enough for you to begin testing and integration? | 14:32 |
sinzui | jam1 yep | 14:32 |
jam1 | alexisb: I know of 2 things that are just-broken when you run HA (juju debug-log and juju run), but we have a patch for the latter, and wwitzel3 and voidspace on the former. | 14:33 |
fwereade | jam1, I'm not sure how important it is to have a local state-server in the *long* term, but in the short term it is true that we benefit a lot from it | 14:33 |
jam1 | natefinch: did you get to look into the HA CI test suite? Can you give me an update on it by your EOD, as I can look at it tomorrow. | 14:34 |
perrito666 | jam1: I am actually trying to fix the whole thing together (backup/restore) since the test takes time I try to make the best of it, but I can propose the backup fix alone if you want | 14:34 |
sinzui | jam1, returning to 1.18.2. You have diligently landed some fixes to it. I think there were a few more bugs that would be lovely to include. May I propose some merges to 1.18 to prepare a 1.18.2 that Ubuntu will love? | 14:34 |
natefinch | jam1: looking at it now, late start to my day today, but i still have a lot of time to put into it. | 14:34 |
jam1 | perrito666: please never block getting incremental improvements on getting the whole thing. In general everyone benefits as long as it doesn't regress things in the mean time. | 14:35 |
fwereade | perrito666, I like small branches -- I know that a backup that can't be restored is no backup at all, but I'd still rather see a few branches that we merge all at once if we have to | 14:35 |
jam1 | sinzui: I have the strong feeling that 1.18 is going to stick in Trusty and we're going to be supporting it for a while. | 14:35 |
perrito666 | ack | 14:35 |
jam1 | sinzui: so while I'm not currently focused on it, because of 1.19 and HA stuff filling my queue | 14:35 |
perrito666 | :) | 14:35 |
jam1 | sinzui: patches seem most welcome to 1.18 | 14:35 |
fwereade | perrito666, jam1: indeed, the only reason to hold off on landing one of those branches is if it does, in isolation, regress something | 14:35 |
alexisb | jam1, are you thinking that 1.18 will be the long term solution for Trusty? | 14:36 |
sinzui | jam1. okay. I will make plans for 1.18.2 | 14:36 |
natefinch | sinzui: how do I investigate a CI failure? I believe functional-ha-recovery-devel is the one I'm supposed to be fixing | 14:36 |
jam1 | alexisb: 1.18 doesn't have HA support, and will likely be missing lots of stuff. I just think that given our track record with actually getting stuff into the main archive, we really can't trust it | 14:37 |
sinzui | natefinch, abentley in canonical's #juju is seeing errors like this...http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/functional-ha-recovery-devel/64/console | 14:38 |
jam1 | alexisb: so likely we'll want something like cloud-archive for Trusty that provides the latest set of tools that we like | 14:38 |
sinzui | natefinch, abentley believes the problem is the test. it is not waiting for the confirmation that juju is in HA. | 14:38 |
jam1 | but I don't think we can actually expect to get things into the Ubuntu archive. | 14:38 |
sinzui | natefinch, abentley will ask for assistance if the test continues to fail after assuring itself that HA is up | 14:39 |
natefinch | sinzui: cool. I'm more than willing to help. I know that working with mongo can be hairy | 14:39 |
alexisb | jam1, yes we are working with the foundations team/TB to define the process for updating the juju-core package in Trusty | 14:40 |
alexisb | I don't know yet what the process will be | 14:40 |
jam1 | alexisb: i might be being jaded, but cloud-tools:archive still has 1.16.3 because it never got 1.16.5 landed in Saucy | 14:40 |
jam1 | and that is... 6 months old? | 14:41 |
alexisb | and it could very well become via cloud-tools | 14:41 |
jam1 | alexisb: though again, we've struggled to get stuff in there, too | 14:43 |
hazmat | are there any tricks to compiling juju with gccgo? | 14:44 |
sinzui | jam1, alexisb : I thought jamespage had made progress getting juju 1.16.4..1.16.6 in old ubuntu. The issue was the backup and restore plugins...since the backup plugin wasn't in the code, we elected to not package it. | 14:45 |
fwereade | jam1, re https://codereview.appspot.com/90640043 -- how about fixing environs/bootstrap.SeriesToUpload instead? | 14:46 |
jam1 | sinzui: so cloud-archive:tools still has 1.16.3 as the best you can get: http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/precise-updates/cloud-tools/main/binary-amd64/Packages | 14:46 |
alexisb | well HA is really important so we will need to fight the battles to get it into Trusty | 14:46 |
jam1 | fwereade: so instead of LatestLTSSeries it would do AllLTSSeries ? | 14:47 |
fwereade | jam1, essentially, yeah | 14:47 |
fwereade | jam1, if we were smart we'd only upload a single binary anyway but I'm not sure we got that far yet | 14:48 |
jam1 | fwereade: so at this point, I think using LatestLTSSeries is still a bit wonky since we really can't expect anything about T+4 | 14:48 |
jam1 | fwereade: we're not | 14:48 |
sinzui | alexisb, jam1, we have never tested upgrade from 1.16.3 to 1.18.x. We need to test that if jamespage fails to get 1.16.6 into the cloud-archive...and hope it works | 14:48 |
jam1 | if you bootstrap --debug you can see the double upload | 14:48 |
fwereade | jam1, yeah, thought so | 14:48 |
jam1 | sinzui: AIUI, the issue was that once Trusty releases, then the version in Trusty becomes the version in cloud-tools, so it will jump from 1.16.3 to 1.18.1 (?) | 14:50 |
sinzui | jam1, right, that was the jamespage's fear. | 14:50 |
jam1 | fwereade: I would be fine moving it to SeriesToUpload, and *I* would be fine just making that function put Add("precise"), Add("trusty") | 14:50 |
jam1 | fwereade: but *I'm* way past EOD here | 14:51 |
fwereade | jam1, but regardless, I think we're better off fixing SeriesToUpload (and maybe improving the double-upload, now that it's potentially a triple-upload) than adding another tweak to a code path that is in itself pretty-much straight-up evil in the first place | 14:51 |
jam1 | fwereade: so happy to LGTM a patch that does that :) | 14:51 |
jam1 | even better that it could *actually* be tested | 14:52 |
fwereade | jam1, quite so, that was my other quibble there ;) | 14:52 |
fwereade | jam1, ok, I have a meeting in a few minutes and am not sure I will get to it today myself, but I'll make sure you know if I do | 14:53 |
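A minimal sketch of the SeriesToUpload-style change under discussion, assuming precise and trusty are the LTS series of interest; this is illustrative, not the actual environs/bootstrap code.

```go
// Illustrative sketch: upload tools for every supported LTS series plus
// the environment's default series, de-duplicated, rather than only the
// latest LTS. Not the real environs/bootstrap implementation.
package main

import "fmt"

// Assumption: the LTS series of the time.
var ltsSeries = []string{"precise", "trusty"}

func seriesToUpload(defaultSeries string) []string {
	seen := make(map[string]bool)
	var out []string
	add := func(s string) {
		if s != "" && !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	for _, s := range ltsSeries {
		add(s)
	}
	add(defaultSeries)
	return out
}

func main() {
	fmt.Println(seriesToUpload("trusty")) // [precise trusty]
}
```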
bac | sinzui: so the swift fix was a mirage? | 14:56 |
sinzui | bac: yes | 14:56 |
bac | drats | 14:56 |
sinzui | bac: and the corrupt admin-secret theory is crushed | 14:57 |
sinzui | bac, also, staging machine-0 has been stuck in hard reboot for a week. I think we can say it is dead. | 14:58 |
jam1 | fwereade: I gave a summary of why vladk's patch works, mostly boiling down to the fact that what we write to the DB is the params.StateServingInfo struct, unlike most of our code, which uses separate types for the API and the DB | 15:15 |
jam1 | https://codereview.appspot.com/90580043/ | 15:15 |
jam1 | vladk: are you able to land that patch today before sinzui can put together a release ? | 15:16 |
jam1 | (and get CI to pass on it, I guess) | 15:16 |
vladk | jam1: yes | 15:16 |
jam1 | vladk: great | 15:16 |
jam1 | LGTM | 15:16 |
jam1 | vladk: can I ask that you file a "tech-debt" bug to track that we may want to have each API server have their own system identity? | 15:17 |
vladk | jam1: ok | 15:17 |
jam1 | I think as long as we have the api StateServingInfo we can actually notice who's calling and give them a different value if we want | 15:17 |
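To illustrate jam1's point about shared types (field names invented for the example): most code maps an API params struct onto a separate document type, where a missed field is silently dropped in translation, whereas a single shared struct survives the round trip intact.

```go
// Hedged illustration of the shared-struct point; field names here are
// invented for the example and are not the real params.StateServingInfo.
package main

import "fmt"

// Usual pattern: distinct wire and storage types with an explicit mapping,
// where any field the mapping forgets is silently lost.
type userParams struct{ Name string }
type userDoc struct{ Name string }

func docFromParams(p userParams) userDoc { return userDoc{Name: p.Name} }

// StateServingInfo-style pattern: one struct serves both the API and the DB,
// so every field survives the round trip.
type stateServingInfo struct {
	APIPort   int
	StatePort int
	SharedKey string
}

func main() {
	fmt.Println(docFromParams(userParams{Name: "admin"}))
	fmt.Println(stateServingInfo{APIPort: 17070, StatePort: 37017, SharedKey: "..."})
}
```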
hazmat | it looks like the 1.18 branch has deps on both github.com/loggo/loggo and github.com/juju/loggo; are those the same? | 15:19 |
jam1 | hazmat: it needs to be only one, otherwise the objects internally are not compatible | 15:21 |
jam1 | it should all be "github.com/juju/loggo" | 15:21 |
hazmat | jam1, 1.18 stable branch -> state/apiserver/usermanager/usermanager.go: "github.com/loggo/loggo" | 15:25 |
hazmat | jam1, thanks.. i'll mod locally | 15:25 |
jam1 | hazmat: please propose a fix if you could | 15:25 |
hazmat | jam1, sure.. just need to get through the morning | 15:26 |
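The fix amounts to a one-line import change in state/apiserver/usermanager/usermanager.go, roughly as below; the logger name shown is assumed rather than copied from the file.

```go
// Sketch of the import-path fix being discussed for the 1.18 branch:
// use the juju-owned loggo repository everywhere.
// The logger name below is an assumption, not taken from the source.
package usermanager

import "github.com/juju/loggo" // was "github.com/loggo/loggo"

var logger = loggo.GetLogger("juju.state.apiserver.usermanager")
```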
voidspace | jam1: ping, if you have 5 minutes | 15:26 |
voidspace | jam1: it can wait until tomorrow if not | 15:27 |
=== BradCrittenden is now known as bac | ||
voidspace | ooh, precise only has version 5 of rsyslog so we can only use the "legacy" configuration format | 15:42 |
voidspace | lovely | 15:42 |
voidspace | jam1: cancel my ping :-) | 15:47 |
voidspace | natefinch: ping | 15:47 |
natefinch | voidspace: howdy | 15:50 |
natefinch | fwereade: where do I go to approve time off? | 16:00 |
perrito666 | jam1: fwereade sinzui https://codereview.appspot.com/90660043 this fixes the backup part of the issue | 16:05 |
perrito666 | so ptal? | 16:06 |
perrito666 | anyone is encouraged to, although be warned, it's bash | 16:07 |
fwereade | natefinch, canonicaladmin.com is all I know | 16:11 |
=== vladk is now known as vladk|offline | ||
perrito666 | does anyone know why we are dragging the logs along in the backup? (and more precisely, why are we restoring them?) I mean, I know we might want to back them up for analysis purposes, but restoring the old logs pollutes the information a bit | 17:04 |
jam1 | natefinch: you should be able to log into Canonical Admin and have "Team Requests" under the Administration section | 17:17 |
jam1 | perrito666: if you want to investigate why something failed in the past, you need the log | 17:18 |
perrito666 | jam1: exactly, but if you restore the log from the previous machine you are lying about the current one | 17:19 |
jam1 | perrito666: but it also contains the whole history of your actual environment | 17:19 |
jam1 | vs just this new thing that I just brought up | 17:19 |
jam1 | I would be fine moving the existing file to the side | 17:19 |
jam1 | but all the juicy history is what you are restoring | 17:19 |
jam1 | perrito666: did you test the backup stuff live against a Trusty bootstrap? | 17:19 |
jam1 | perrito666: nate's patch landed at r2662 | 17:20 |
perrito666 | jam1: sorry I was at the door | 17:27 |
perrito666 | I did, let me re-check that the env that is being backed up actually has the proper mongodb | 17:27 |
perrito666 | jam1: re your comment, I could try to assert MONGO* is executable or fail instead | 17:30 |
voidspace | going jogging, back shortly | 17:39 |
jam1 | perrito666: I don't really think we need to spend many cycles worrying about it. | 17:41 |
jam1 | It may be that just using '-f' will give better failure modes (more obvious if we try to execute something that isn't executable than trying to run a command that isn't in $PATH) | 17:42 |
jam1 | perrito666: anyway, not a big deal, don't spend too much time on it, focus on getting it landed and on to restore | 17:42 |
perrito666 | yeah, if you have those and they are not executable you most likely noticed other problems already | 17:42 |
* perrito666 repeats himself when he stops writing a sentence in the middle and then restarts | 17:44 | |
jam1 | that is certainly a common thing | 17:44 |
perrito666 | well, I did a version of restore that backs up the old config, just so I get to discover which part of our backup restoration breaks the state server | 17:45 |
* perrito666 's kingdom for an aws location in south america | 17:49 |
=== vladk|offline is now known as vladk | ||
voidspace | EOD folks | 18:32 |
voidspace | g'night | 18:32 |
perrito666 | bye | 18:32 |
wwitzel3 | voidspace: see ya | 18:32 |
stokachu | is juju add-relation smart enough to handle add-relations to non-existent services that may become available in the future | 18:32 |
stokachu | for example if I deploy 3 charms and charm 1 relies on charm 3 so i add the relation during charm 1 deployment | 18:32 |
stokachu | is it smart enough to retry to add-relations once it sees charm 3 come online? | 18:33 |
stokachu | marcoceppi: ^ curious if you know this? | 18:34 |
marcoceppi | stokachu: no | 18:35 |
stokachu | marcoceppi: no to not smart enough or no to you aren't sure? | 18:35 |
marcoceppi | not smart enough, if you run add-relation then it won't actually work if one of the two services isn't there | 18:35 |
stokachu | so that makes it difficult for me to put juju deploy <charm>; juju add-relation <charm> <new_charm_not deployed>; juju deploy <new_charm> | 18:36 |
marcoceppi | stokachu: not difficult, impossible. | 18:37 |
marcoceppi | stokachu: you should run add-relation once you have all your services deployed | 18:37 |
stokachu | so if I deploy an openstack cloud I'd have to deploy all charms, then re-loop through those charms and add relations | 18:37 |
marcoceppi | stokachu: or, use juju deployer | 18:38 |
bloodearnest | stokachu: or better yet, deploy charms, mount volumes, then add relations, as many charms expect the volumes to be already configured on the joined hook | 18:39 |
stokachu | bloodearnest: interesting ill look into that | 19:01 |
bloodearnest | stokachu: on account of juju having no way yet to detect/react to volumes changing, AIUI | 19:02 |
stokachu | i wonder if it'd be worth it to have add-relations kept in a queue and when a service comes online it just checks for pending | 19:03 |
natefinch | stokachu: note that you don't need to wait for the charms to be deployed to add relations. You can fire off deploy deploy deploy add-relation add-relation add-relation, and juju will eventually catch up. It's just that you have to run the deploy command before the add-relation command | 19:07 |
stokachu | natefinch: yea thats what im doing now | 19:07 |
stokachu | just iterating through the charms twice is all | 19:07 |
natefinch | stokachu: iterate through charms once and then through relations once ;) | 19:08 |
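A small sketch of the ordering natefinch describes, assuming a bootstrapped environment and the juju CLI on PATH; charm and relation names are placeholders.

```go
// Sketch of the deploy-first, relate-second ordering discussed above:
// fire off every deploy, then every add-relation, without waiting for
// units to come up. Assumes the juju CLI is available; charm and
// relation names are placeholders.
package main

import (
	"log"
	"os/exec"
)

func juju(args ...string) {
	cmd := exec.Command("juju", args...)
	cmd.Stdout, cmd.Stderr = log.Writer(), log.Writer()
	if err := cmd.Run(); err != nil {
		log.Fatalf("juju %v: %v", args, err)
	}
}

func main() {
	charms := []string{"mysql", "wordpress"}
	relations := [][2]string{{"wordpress", "mysql"}}

	// Deploys go first; juju queues the work and catches up later.
	for _, c := range charms {
		juju("deploy", c)
	}
	// add-relation only needs the services to exist, not to be started.
	for _, r := range relations {
		juju("add-relation", r[0], r[1])
	}
}
```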
natefinch | gotta run, car needs to be inspected, back in 45 mins | 19:08 |
=== natefinch is now known as natefinch-afk | ||
=== natefinch-afk is now known as natefinch | ||
sinzui | wwitzel3, natefinch CI cursed the most recent juju because of a unit-test failure on precise. Do either of you think the test can be tuned to be reliable on precise? https://bugs.launchpad.net/juju-core/+bug/1311825 | 19:35 |
_mup_ | Bug #1311825: test failure UniterSuite.TestUniterUpgradeConflicts <ci> <intermittent-failure> <test-failure> <juju-core:Triaged> <https://launchpad.net/bugs/1311825> | 19:35 |
natefinch | sinzui: looking | 19:37 |
wwitzel3 | sinzui: also taking a look | 19:37 |
natefinch | man I hate overly refactored tests | 19:40 |
natefinch | wwitzel3: can you even tell what sub-test is failing? | 19:47 |
natefinch | all I see is "step 8" which doesn't tell me diddly | 19:47 |
wwitzel3 | natefinch: not really, I've got as far as fixUpgradeError step | 19:55 |
wwitzel3 | natefinch: but it is all nested so I can't tell in which that is happening | 19:56 |
=== vladk is now known as vladk|offline |