[00:01] wallyworld_: why is bootstrap doing sync tools ? [00:01] this is ec2 [00:01] not sure. it should only do that if it can't find any. i'll take a look [00:02] davecheney: well, it was looking for 1.15 [00:02] which it couldn't find [00:02] so you need --upload-tools [00:02] sure, but 1.12 exists in the s3 bucket [00:02] otherwise it will sync [00:03] but different minor version [00:03] 12 = 15 [00:03] != [00:03] wallyworld_: is that important [00:03] 1.12.0-precise-amd64 tools exist in the public s3 bucket [00:03] this is an ec2 environment [00:03] this is a regression [00:04] no - we can't guarantee 12 is compatible with 15 [00:04] we talked about this remember? [00:04] you asked the change be held till after 1.13 was released [00:04] yes, we talked [00:04] but i don't see how this is related [00:04] i am using 1.15 client [00:04] you are bootstrapping using juju 1.15 right? [00:05] 1.12 tools exist in the public bucket [00:05] why is bootstrap syncing the 1.12 tools [00:05] shouldn't it refuse to run saying there are no 1.15 tools ? [00:05] sync tools is dumb [00:05] it needs work [00:05] should I raise a bug ? [00:05] it is doing what it has always done and grab whatever tools it can find - the algorithm needs work [00:06] the problem is that folks in CTS will be using the devel version and they won't understand why their bootstrap times on ec2 have blown out [00:06] sure, raise a bug - it's being worked on as we speak [00:06] as part of the tools work [00:06] wallyworld_: i thought the change was exact match or TFO [00:06] TFO? [00:08] bootstrap requires an exact match, but if it can't find one, it does a sync tools. and sync tools needs some love. it's all in progress [00:09] i guess the expectation was that people running from source ie dev version, should always know to do an upload-tools [00:09] upload-tools = problem solved [00:10] /s/FTO/GTFO [00:11] ok, upload-tools it is [00:12] sorry about that - upload tools is a consequence of exact version match [00:12] i think it should do it automatically for dev versions [00:12] with an appropriate message [00:17] wallyworld_: 2013-09-10 00:15:00 INFO juju.environs.tools tools.go:131 filtering tools by series: precise [00:17] 2013-09-10 00:15:00 INFO juju.environs.tools tools.go:53 no architecture specified when finding tools, looking for any [00:17] ^ why does it say that [00:17] amd64 is the default [00:18] davecheney: the tools finding code it calling "findtools" without specifying an architecture - that's how it's always been. but now, with the simplestreams metadata, tools info is keyed of arch and series, so it logs that message [00:19] so the code which calls FindTools needs to be looked at to see if it can get access to an arch to pass in [00:19] davecheney: where is the default arch specified - i can't recall right now [00:20] and if amd64 is not found, doesn't it default back to i386? [00:20] in which case passing in nothing for arch is ok? since it will look for tools matching any arch? [00:23] bigjools: ffs. gwacl panics when asked for the url of a non-existent file :-( [00:24] wallyworld_: dunno, sounds wrong [00:24] the default is precise/amd64 [00:24] and the tools lookup clearly shows it is scoping by precise [00:24] wallyworld_: \o/ [00:24] it shold also scope by amd64 [00:25] davecheney: probs. but i'm not sure how arm or i386 would be specified [00:25] wallyworld_: I hear you're an expert in Go, you could fix it! [00:25] bigjools: have i told you today? [00:25] lol [00:25] actual lol [00:25] wallyworld_: ok [00:25] good point [01:06] bigjools: can you add me the the gwacl-hackers team, and +1 this? https://code.launchpad.net/~wallyworld/gwacl/storage-error-not-panic/+merge/184704 [01:06] wallyworld_: yes yes [01:06] bigjools: thanks, and we use the bot is assume? [01:06] ie just set to approved and the bot will work [01:06] wallyworld_: I hope so. [01:07] we'll find out :-) [01:07] wallyworld_: it requires one +1 review [01:07] juju-core does now too [01:07] for a little while [01:10] wallyworld_: you need to run " make format" [01:10] bigjools: i ran go fmt and it fucked everything [01:10] i had to revert and start again :-( [01:10] wallyworld_: yes, use make format :) [01:10] WHY??? [01:10] what's wrong with go fmt [01:10] because [01:10] because tabs are fucking evil [01:10] that is the problem with standards; everyone wants their own [01:11] there are so many to choose from [01:11] yeah agree. wish i had known that [01:11] ... the Aristocrats! [01:11] before if*cked myself [01:11] bzr has this really useful "commit" thing [01:12] bigjools: yes, but one does not expect to have to run it before doing a go fmt [01:12] changes pushed [01:13] wallyworld_: I never ever trust tools [01:13] ever [01:13] bigjools: that's why i don't trust you [01:13] and he's here ALL WEEK [01:13] yep :-D [01:17] wallyworld_: anyway I already told you once we used 4 space indents in gwacl and you said that was awesome [01:18] bigjools: and you expect me to remember that after many bottles of red have passed under the bridge? [01:18] and months of "go fmt" muscle memory [01:18] hahaha [01:18] we left our present [01:19] brb, someone just rang my doorbell [01:20] hey wait, why is that bag on fire [01:20] snork [01:23] o/ [01:23] * thumper makes a smoothie for lunch [01:27] davecheney: you wanna LGTM this? https://codereview.appspot.com/13620043/ [01:30] I wanna [01:30] done [01:30] LGTM [01:30] fire when ready [01:31] many thanks [01:32] thank you sir [02:13] wallyworld_: ping [02:13] hi [02:14] wallyworld_: got a few minutes for a hangout? [02:14] thumper: sure, just give me 5 [02:20] thumper: ok, free now [02:20] ok [02:21] wallyworld_: https://plus.google.com/hangouts/_/6b1e2ea0df710aa62ee610909253df3e89de4c9b?hl=en [03:21] somehow my branch revision just disappeared :( [03:21] * axw_ starts the null provider again [03:23] ?! [03:23] axw_: disappeared how? [03:23] nfi. [03:23] I committed [03:23] then did a go build [03:23] then my files are all gone, and the revision is back where it was before [03:24] fortunately all the hard stuff is already uploaded [03:33] weird === axw__ is now known as axw [04:22] axw: ing [04:22] p [04:24] thumper: pong [04:24] axw: are you backporting bug 1222664 to the 1.14 branch? [04:24] <_mup_> Bug #1222664: maas provider's instance is not a Stringer [04:24] not this very moment, but I will [04:24] if you are busy [04:24] I can do it [04:24] won't take long at all [04:24] then you can ack it [04:25] if you don't mind, trying to unbugger my stream [04:25] kk [04:25] will do, right now [04:25] thanks [04:27] happy birthday thumper! Today is also my son's birthday. [04:28] jam: thanks, and happy birthday to your son :) [04:28] outed! happy birthday thumper [04:29] you sly bugger [04:33] axw, davecheney: https://code.launchpad.net/~thumper/juju-core/backport-bug-1222664-to-1.14/+merge/184718 I've kept it out of lbox for simplicity [04:33] thumper: approved, thank you [04:33] np [04:34] * thumper looks at the last issue [04:44] davecheney: ping [04:44] thumper: looking [04:44] davecheney: I don't need you to look [04:44] I need to talk to you [04:44] ack [04:44] davecheney: this one with the race to start [04:44] yup [04:44] you talk about the provisioner, but I don't see that in the logs [04:44] do you mean the state connection? [04:45] this is the machiner on machine 0 [04:45] which has the provisioner worker [04:45] my terminilogy is old [04:45] you mean the machine agent [04:45] the provisioner is a task, as is the machiner [04:46] * thumper thinks how to solve this sanely [04:46] thumper: the agent ? [04:46] is trying to conncet to a service that it hosts itself [04:46] hence, a gordian knot [04:46] * thumper nods [04:47] I think I have a plan [04:47] * thumper tries it [04:47] i guess that plan doesn't include making the api server a different process ? [04:48] https://bugs.launchpad.net/juju-core/+bug/1218329 [04:48] <_mup_> Bug #1218329: Update default environment.yaml for Azure to use Precise for default-series [04:48] ^ is this really fix committed on 1.14 ? [04:48] i see no branch [04:49] I didn't branch [04:49] davecheney: yes it is there [04:49] I checked the last commit [04:49] I'll help axw understand the process later [04:49] when I'm done [04:49] ok [04:49] cool [04:49] perhaps we should write up a backporting howto [04:49] just checking [04:49] and shove it in the tree [04:49] davecheney: I see the problem with the machine agent [04:49] * thumper thinks [04:52] davecheney: bug #1218329 was landed, let me see if I can dig up the branch [04:52] <_mup_> Bug #1218329: Update default environment.yaml for Azure to use Precise for default-series [04:53] jam: i can see the commit, it'll do [04:53] davecheney: https://code.launchpad.net/~axwalk/juju-core/lp1218329-azure-released-images/+merge/183381 [04:53] it is pretty small :) [04:54] jam: that is not the commit [04:54] that is the trunk commit [04:55] davecheney: I don't quite understand your "that is not the commit". The specific fix for bug 1218329 is just to change those 3 lines. [04:55] <_mup_> Bug #1218329: Update default environment.yaml for Azure to use Precise for default-series [04:56] it is about having a good "juju init" value for azure, which was delayed until we had proper images upload to azure [04:57] jam: i am lookin for the mp for the merge onto the 1.14 branch [04:58] davecheney: it hasn't ever landed on 1.14 [04:58] Fix Committed is landed-in-trunk [04:59] jam: it's there, I just didn't do whatever the normal procedure is [04:59] jam: ok, i'm trying to get to the bottom of why it says ' [04:59] davecheney: http://bazaar.launchpad.net/~juju/juju-core/1.14/revision/1738 [04:59] thank you [05:00] i cannot find a mp for that [05:00] so i cannot link it to the issue [05:00] davecheney: there isn't one [05:00] that was my problem [05:00] according to axw [05:00] * axw nods [05:00] one problem with making the series branches not owned by the bot, is that people end up with direct commit access. [05:01] jam: how can we fix that ? [05:01] davecheney: we can change the owner and I can add the bot to controlling that branch [05:01] excellent, sgtm [05:03] thumper: is there a way to change the owner of a branch to a person where you aren't the direct owner? [05:03] davecheney: worst case, I create a bot branch, and just change what "lp:juju-core/1.14" points to [05:03] sgtm [05:04] jam: not sure I understand [05:04] but right now, I only have groups *I'm* in as potential owners, and the bot is intentionally not in ~juju so he can't touch things he hasn't been given direct access to. [05:04] thumper: I want the go-bot to own an existing branch [05:04] and the go-bot is a person? [05:04] but he isn't in ~juju, and I'm not the bot if I'm ~jameinel [05:04] thumper: yes [05:04] jam: you need to get an lp-admin to do that [05:04] you can't [05:04] thumper: then I'll just create another branch [05:04] you *used* to be able to hand branches to people [05:04] however [05:04] but I think that was considered a csecurity hole [05:04] you can pass it through an intermediate team [05:04] if you are both a member of the same team [05:05] you change it to the team [05:05] they change from the team to themselves [05:05] yes, it was a security hole [05:05] we closed it [05:06] thumper: I thought the plan was to just have confirmation by both parties [05:06] so someone gives it to you, but you have to 'accept' it. [05:06] but meh [05:06] nah [05:06] too much work [05:14] davecheney: how do I ask if a channel is closed? [05:14] thumper: you cannot [05:14] we normally use a tomb to provide that [05:15] select on a closed channel succeeds immediately yes? [05:18] yes [05:18] select { [05:18] case c, ok := <- ch: ; if !ok // closed [05:18] you can also do [05:19] c, ok := <- ch outside select and it will block until ch is closed [05:19] but there is no isClosed(ch) builtin [05:20] c, ok := <- ch will return if closed or something on the channel, right? [05:20] y [05:27] davecheney: https://code.launchpad.net/juju-core/1.14 is now pointing at ~go-bot/juju-core/1.14 [05:28] thumper: I deleted the ~juju/juju-core/1.14 branch, so it deleted your MP against it. But I really didn't want to have 2 1.14 branches that might cause confusion [05:28] ok [05:28] it had landed [05:28] thumper: yeah, if you want it for posterity, you can submit it against the new branch and mark it merged :) [05:29] nah, don't care [05:33] davecheney: so the change from ian was explicitly requested from fwereade_ about "juju bootstrap" will only support exact Major.Minor versions. [05:33] so for dev releases you have to use '--upload-tools' [05:34] because of the chance that we break bootstrap between minor versions (which we've done) [05:34] davecheney: I'm happy to have you in the conversation (I'm more on the "lets make it easier to be cross version compatible, rather than stricter"). And it doesn't make a *huge* difference for finally released things. [05:35] but it does mean you can't use dev to bootstrap stable [05:41] mgz: when you're up and around, I need to get some info from you to control the go-bot config. (You can't access the root bot with just the admin password, you have to have the cert.pem files as well) [05:41] mgz: also, I can't just test CACerts by renaming my own certs, because what we need to test is the bootstrap on the remote machine and "wget" *there* not on my local machine. [05:42] (I can set up a test-suite HTTPS server with self-signed certs pretty easily, so I have 'local' testing done) [05:51] * thumper gives up on subtlety [05:58] * thumper adds a sleep [06:02] thumper: how did you arrive at 3s ? [06:02] :) [06:02] would it be easier to just retry 1 time to connect rather than sleeping? [06:02] as in, try to connect for 3s [06:02] rather than sleep [06:02] have you looked at the code.. [06:03] hmm... [06:04] perhaps [06:04] thumper: line 147 [06:04] honestly, I didn't think of it from that side :) [06:04] when it says "failed to connect" error, just retry 1 time, or retry for 3s or whatever [06:04] and EOD now [06:04] an exercise for the reader [07:50] mornin' all [07:51] morning [07:51] morning [07:52] thumper not around anymore? hmm, ok, have to congratulate him later. [08:00] jam: ping [08:03] axw: do you know anything about the background for this, by any chance? https://codereview.appspot.com/13412047 [08:04] rogpeppe1: I think there's a race between the API server starting up, and something trying to connect to it; the connection fails, and brings down the process [08:04] or something like that. [08:04] axw: that's what the comment in the CL seems to imply, but i can't quite see how it can actually happen [08:04] * axw looks [08:05] axw: AFAICS if the API worker fails to connect to the API, it will finish with a non-fatal error, and be restarted after a little while by the top level runner [08:06] axw: that's how i *intended* it to work anyway :-) [08:08] hey rogpeppe1 [08:08] jam: hiya [08:08] jam: see my question to axw above [08:08] rogpeppe1: right now, when the APIWorker fails it restarts everything [08:08] https://bugs.launchpad.net/ubuntu/+source/charm-tools/+bug/1223225 - this needs fixing. [08:08] there is a bug on it [08:08] <_mup_> Bug #1223225: charm-tools needs to stop recommending juju [08:09] jam: really? [08:09] jam: if the connection to the API fails, it should not restart everything [08:09] rogpeppe1: bug #1220027 [08:09] <_mup_> Bug #1220027: worker/provisioner: cannot restart cleanly due to hard dependency on api server [08:09] jam: the top level runner does not use allFatal [08:09] rogpeppe1: the openAPIAs code has DialTimeout of 0 [08:09] so it always restarts [08:10] jam: that should be fine, because the top level runner does *not* exit when one of its tasks exits [08:10] jam: so the APIWorker task should be restarted until the API server is available, no? [08:11] rogpeppe1: the issue is the "provisioner" is triggering restarts because it can't connect to the API [08:11] that is what the bug report claims [08:11] jam: why is the provisioner trying to connect to the API? [08:11] rogpeppe1: I haven't debugged this [08:12] I'm just conveying the context I have so far [08:12] jam: shouldn't it be using the API connection that is opened by openAPIState? [08:12] jam: sure, thanks [08:12] rogpeppe1: I see in worker.Runner.run code that does check isFatal [08:12] rogpeppe1: state.OpenState (something) returns both an api and state connection [08:12] am I looking at the right thing? [08:12] everything needs a connection to the api [08:12] maybe also the api [08:12] axw: isFatal is slightly-less than allFatal [08:12] jam: well, it then goes and calls killAll(workers) [08:12] davecheney: yeah, the API also (initially) needs a connection to the API, except on the bootstrap node, obviously [08:13] axw: see the definition of isFatal in jujud/agent.go for which errors are considered fatal at the top level [08:13] rogpeppe1: this creapt in late in the 1.11.x cycle [08:14] it is present in 1.12.0 [08:14] davecheney: what crept in, sorry? [08:14] rogpeppe1: the race condition triggering restarts [08:14] rogpeppe1: ah sorry, I misread your statement before - I thought you said it didn't call isFatal [08:15] mgz: poke [08:15] mgz [08:15] jam, davecheney: i'm not sure that https://bugs.launchpad.net/juju-core/+bug/1220027 is a bug at all [08:15] <_mup_> Bug #1220027: worker/provisioner: cannot restart cleanly due to hard dependency on api server [08:16] jam, davecheney: i *think* it's working as intended [08:16] In case you are actually around now: [08:16] (9:41:32) jam: mgz: when you're up and around, I need to get some info from you to control the go-bot config. (You can't access the root bot with just the admin password, you have to have the cert.pem files as well) [08:16] (9:42:00) jam: mgz: also, I can't just test CACerts by renaming my own certs, because what we need to test is the bootstrap on the remote machine and "wget" *there* not on my local machine. [08:17] rogpeppe1: "but it causes extended delays" sounds like it is more than a 3s [08:17] delay [08:17] jam: it does, yes. but i don't see more than a 3 second delay in the bug report [08:18] davecheney: you've marked the bug as "papercut" - could you explain a bit more about why it's a particular problem? [08:18] rogpeppe1: so davecheney ran into this at someone's site (hence the papercut issue) so I'm guessing there could be more background. I agree that the particular log snippet only takes 3s from start to finish, but note the first line is "restarting "state" [08:18] rogpeppe1: I think the first line is pretty key [08:19] something caused the "state" worker to restart [08:19] 1:46:09 is restarting state in 3s, then 1:46:11 is restarting api in 3s, then 1:46:12 is starting "state" again. [08:19] rogpeppe1: papercut is (one of the many) bugs used by various parts of the company to indicate that this bug affects customers [08:20] the name is the least describtive, but comes from SABFDL [08:20] davecheney: yeah, i'm aware of the name - i just wondered how this was affecting customers [08:20] rogpeppe1: the process will eventually start up correctly, as the orer of job manage environ jobs is not specified [08:20] TheMue: as you're on call today, can you look at my goose branches? https://codereview.appspot.com/13379047/ and https://codereview.appspot.com/13396048/ [08:21] jam: yeah, that's odd, and something which probably isn't addressed by the proposed fix AFAICS [08:21] jam: sure [08:21] TheMue: thanks [08:21] davecheney: have you got the full log from when the problem happens? [08:21] davecheney: I think what rogpeppe1 is getting at, is that the bouncing provisioner might be a symptom rather than a cause. [08:21] however rogpeppe1: if the provisioner fails to start, won't it bounce the "state" worker, [08:21] causing the API server to bounce [08:22] causing them to get into a restart dance? [08:22] so the issue is that something that *isn't* being run by the APIWorker is depending on the API, which then kills the API servere [08:22] server [08:22] so it can't start itself. [08:22] jam: could you please take a look at https://codereview.appspot.com/13522043/ for me? need a lgtm here too. ;) [08:23] jam: ah, that seems wrong [08:23] jam: i thought the provisioner only talked to state [08:23] rogpeppe1: you can generate one for yourself [08:23] just deploy a few services on ec2 [08:24] and watch the machine-0 log [08:24] davecheney: it happens every time you deploy a service? [08:24] TheMue: so that one is "support empty strings locally as meaning "set this to the empty strings", but preserve the API path which means "empty strings ==> default value" ? === ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: TheMue | Bugs: 9 Critical, 122 High - https://bugs.launchpad.net/juju-core/ [08:25] jam: yep [08:26] rogpeppe1: pretty reliably at the moment [08:27] davecheney: hmm, weird - i will take a deeper look. [08:27] rogpeppe1: ta [08:30] jam: both reviewed [08:34] fwereade_: rogpeppe1, TheMue: I'm not going to make the standup today (it is my son's 6th b-day party at school). so you can make dimitern crack the whip if you want. [08:34] jam: ok [08:34] jam: ok, have sweet-cake fun :-) [08:34] jam: and enjoy the b-day [08:39] ahasenack: ping [08:39] axw: hi [08:39] ahasenack: hi. stupid question - how did the /etc/init/juju* files get into your clean container? [08:39] axw: oh, that is run inside the container? [08:39] axw: I ran that on my host [08:39] ahh [08:40] yes, on the container [08:40] I'll have to run again then [08:40] axw: so my env is bootstrapped using lxc [08:40] ahasenack: ok, that'll be a problem. you can't add-machine an already bootstrapped machine [08:40] axw: I will bring up a new container with lxc-create and lxc-start, one that was never touched by juju [08:40] cool [08:40] axw: should that work? [08:40] on my laptop: [08:40] juju bootstrap (lxc) [08:41] jam: thx for the review, will add some tests [08:41] lxc-create && lxc-start [08:41] then juju add-machine ssh:that-new-container-I-created-manually [08:41] ahasenack: yes if the container is clean, it should work [08:41] ok, that's what I thought I did, but let me do it carefully again [08:42] ahasenack: thanks. I'm doing the same now [08:50] axw: got the same error [08:51] axw: inside the container there is no /etc/init/juju* [08:51] axw: I created the container fresh, precise [08:51] started it [08:51] juju bootstrapped (lxc), juju status shows only bootstrap [08:51] ran add machine [08:51] and got that error [08:52] ok. that doesn't make any sense to me :( [08:52] I just did all that and it worked up until apparmor got in my way [08:53] let me paste what I did [08:56] axw: http://pastebin.ubuntu.com/6087049/ [08:58] ahasenack: thanks, I'll try with your configuration and see what I get [08:58] axw: that ls /etc/init/ test, it's run inside the container really? [08:59] because I do have such files in my host, of course [08:59] /etc/init/juju-agent-andreas-local.conf and /etc/init/juju-db-andreas-local.conf [08:59] ahasenack: yes, it'll ssh and execute ls /etc/init/juju* [09:00] "ls /etc/init/ | grep juju.*\\.conf || exit 0" to be precise [09:07] it checks the output of that, or the exit code? [09:07] anyway, [09:07] $ ssh 10.0.3.230 "ls /etc/init/ | grep juju.*\\.conf || exit 0" [09:07] Warning: Permanently added '10.0.3.230' (ECDSA) to the list of known hosts. [09:07] andreas@nsn7:~/canonical/source/landscape/production$ echo $? [09:07] 0 [09:07] $ [09:37] jam, davecheney: how about this as an alternative fix? https://codereview.appspot.com/13640043 [09:38] jam, davecheney: i haven't decided on a good way of testing it yet [09:50] axw: I don't understand this, this will always return 0 [09:50] const checkProvisionedScript = "ls /etc/init/ | grep juju.*\\.conf || exit 0" [09:50] ah, you look for the output [09:50] return len(strings.TrimSpace(string(out))) > 0, nil [09:51] what is wrong with [09:51] I wonder if it's my ssh warning that is messing with the output [09:51] ls /etc/init/juju*conf ? [09:52] "Warning: Permanently added '10.0.3.230' (ECDSA) to the list of known hosts." [09:52] I get that [09:52] or, to be more specific [09:52] $ ssh 10.0.3.230 "ls /etc/init/ | grep juju.*\\.conf || exit 0" [09:52] Warning: Permanently added '10.0.3.230' (ECDSA) to the list of known hosts. [09:52] that "Warning" line is making len(string(out)) > 0 be true [09:53] out, err := cmd.CombinedOutput() [09:53] because of that probably [09:53] combined [09:53] axw: ^^^ [09:53] yeah, we dont want combined [09:53] that is wrong [09:55] davecheney: your thoughts on https://codereview.appspot.com/13640043 would be appreciated [09:56] rogpeppe1: i looked at it [09:56] i'm not qualified to comment [09:57] sounds like it would just be easier to have the api server ignore other errors [09:57] davecheney: the API server isn't seeing any errors [09:58] this is why i am not qulified to review this patch [09:58] davecheney: it's being taken down because another state worker is going down [09:58] davecheney: np [09:59] rogpeppe1: i think that is the error I was talking about [09:59] davecheney: ah - but that's a provisioner error, no? [09:59] but the solutoin I would like to see is job manage environe ensure a startup order [10:00] we know the allfatal/shutdown/restart path works [10:00] i'd rather not change it [10:00] all that needs to be done is ensure the api server is running first [10:05] davecheney: we could do that, but i'd prefer to fix the underlying problem, which is that we don't need to take the api server down at all [10:06] davecheney: fwereade_ has been asking me to fix the allFatal thing for ages [10:06] fwereade_: your thoughts on https://codereview.appspot.com/13412047/ would be good to have, please [10:07] davecheney: the jujud logic needs to be able to cope with an API server that is only sporadically available anyway [10:07] * fwereade_ looks [10:07] rogpeppe1: then making the api server connection retry sounds like the best approach [10:07] fwereade_: oops, wrong CL [10:07] fwereade_: https://codereview.appspot.com/13640043/ [10:07] the not trying behavior that currently exists sounds like it is too chummy with other workers in its process [10:08] davecheney: it does retry currently [10:08] ahasenack: ahh [10:08] thank you :) [10:09] rogpeppe1: i don't think that is correct, but am not in a position to comment authoratively [10:10] davecheney: the retrying is what prints the "worker: restarting "api" in 3s" log msgs [10:10] rogpeppe1: i was not clear enough [10:10] not restarting the api process [10:10] /s/process/jo [10:10] job [10:11] but retrying the connection _to_ the api [10:11] davecheney: i'm not sure i see the distinction [10:11] davecheney: the first thing APIWorker does is try to connect to the API [10:11] davecheney: and when it fails, the outer runner will wait for 3 seconds, then try again [10:12] rogpeppe1: ok, then something must be wrong, because looking at the log, if the connection fails, allFatal is triggered and everyone shuts down [10:12] if what I just wrote is not true [10:12] then close the bug [10:12] it's not a bug [10:12] davecheney: allFatal is not used in the outer level runner [10:13] i don't think i can add anything more useful here [10:13] i don't know the code [10:13] i can only describe what I see in the log file [10:13] davecheney: it would be clearer with the full log file, i think [10:13] rogpeppe1, https://codereview.appspot.com/13640043/ looks pretty sane, I think [10:13] fwereade_: thanks [10:14] fwereade_: any ideas for a good way to test it? [10:14] rogpeppe1, custom mgo test runner that exposes a kill method..? [10:14] rogpeppe1: https://bugs.launchpad.net/juju-core/+bugs?field.searchtext=restart&search=Search&field.status%3Alist=NEW&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.assignee=&field.bug_reporter=&field.omit_dupes=on&field.has_patch=&field.has_no_package= [10:14] best we have [10:15] rogpeppe1, would be nicest if we didn't have to have the whole stack in place ofc, but I don't see a clear way to get there from here in a reasonable amount of time... [10:17] ahasenack: FYI, https://bugs.launchpad.net/juju-core/+bug/1223277 [10:17] <_mup_> Bug #1223277: manual provisioning should ignore ssh warnings when detecting presence of /etc/init/juju* [10:17] axw: right [10:50] plan b! [10:50] yep [10:51] standing [10:51] let's go in this order: dimitern, TheMue, wallyworld, jam, mgz, rogpeppe1, fwereade_ ? [10:51] :) [10:52] +1 [10:52] dimitern: what's the problem? [10:52] dimitern: i'm in the standup [10:52] rogpeppe1: no sound [10:52] hangout is not working [10:52] dimitern: ok [10:52] so I'll go first then [10:52] dimitern: jam cannot participate [10:53] ah? ok [10:53] dimitern: b-day of his 6yo [10:53] oh, jam congrats then! [10:53] ...oooOOO( like thumper today, only older *smile* ) [10:53] i'm succeeding in talking to nate on the hangout [10:53] hey it seems like it's better in the hangout now [10:54] standup: so, I've landed a couple of prereq branches in the uniter api yesterday, and stated working on migrating the uniter code from state to the api, still working on it [10:54] standup: I think I'll have something to propose later today. that's me - TheMue? [10:55] dimitern: try joining the standup? [10:55] dimitern: sorry, hangout [10:56] dimitern: we've got sound now [11:03] rogpeppe1: wrt https://codereview.appspot.com/13640043/, would that actually solve the problem? One issue here is that we *can't* connect to the API so calling .Ping() will still say "unable to connect" which will cause us to restart. [11:05] jam: it might be easier to chat directly; perhaps we could go through this some time after the standup? === TheRealMue is now known as TheMue [11:44] jam: i think it would solve the problem, which is caused by workers under StateWorker dying and causing their peers to die - that will ping the mongo state, not the API state, so the API server should remain up [11:45] jam: i changed the allFatal in APIWorker for symmetry - it doesn't actually help to fix the bug we're trying to address [11:48] rogpeppe1: apiserver is a StateWorker [11:48] ah, I think I see your point [11:48] it will notice it still has a connection to mongo [11:48] jam: yeah === gary_poster|away is now known as gary_poster [12:26] rogpeppe1: https://codereview.appspot.com/13512049 [12:26] rogpeppe1: that's the apiuniter package, as discussed on the standup [12:28] dimitern: LGTM [12:33] rogpeppe1: thanks! [12:33] rogpeppe1: I think I've beaten my previous 8000-line trivial diff record by far :D [12:38] lunch === marcoceppi_ is now known as marcoceppi [14:21] rogpeppe1: you all set with that StateWorker testing? [14:22] natefinch: i'm going to propose a branch that adds this to worker/runner.go: http://paste.ubuntu.com/6088175/ [14:23] natefinch: and we can layer testing onto that [14:23] fwereade_: how does that sound to you? [14:23] * fwereade_ looks [14:24] rogpeppe1, lgtm [14:24] rogpeppe1, I might even make it a Thing not a DebugThing, and implement one that did the logging for normal usage [14:24] fwereade_: that's an interesting idea actually [14:25] fwereade_: although in that case i'd probably make it an argument to NewRunner [14:27] fwereade_: i'm not sure actually - there are more log messages than just those events, and i'm not sure i want to either lose log messages or turn them *all* into events [14:30] rogpeppe1: that looks good to me. I agree that making it a Thing is a good idea. I'd change Start to Starting, though. Also, I agree that we don't want everything to be an event. These are the major events, obviously. If we need more we can make more. [14:31] natefinch: Starting sgtm [14:32] natefinch: the problem i have with making it a Thing and not including all the events is that different events on the same runner will be logged in quite different ways. [14:32] natefinch: which may well make the log messages less obvious (and they're not that obvious as it is) [14:33] rogpeppe1: Why does the logging have to change at all? Can we not just add this and leave the actual logging as-is? [14:34] natefinch: mostly because i can't think of a decent use for this that doesn't involve testing or debugging - i don't want our code to start relying on it in subtle and hard-to-reason-about ways. [14:35] natefinch: if we leave the logging as it is, i think i'd be happier with it left as a DebugThing, so people don't see it later and start leveraging it for evil witchery [14:36] rogpeppe1: thats a valid concern. [14:37] rogpeppe1: maybe keep it as debug only for now, and if it turns out we need it for something later (probably unlikely), we can always promote it to non-debug. And since promoting it is non-trivial, it'll take more thought than a casual call to it [14:38] rogpeppe1: do you need the full pointer to the runner? That kind of invites people to twiddle with the runner, rather than just passing an identifier that can't be used for witchcraft [14:39] (or at least makes witchcraft more difficult) [14:39] natefinch: i could pass fmt.Sprintf("%p", runner) i suppose. [14:40] natefinch: but given that it's only for testing, i'm ok sending the whole pointer, i think [14:41] rogpeppe1: fair enough [14:50] jam, fwereade_: other things I should be working on? [14:52] natefinch, I would be most grateful if you would take a quick look at https://bugs.launchpad.net/juju-core/+bug/1218616 and see what we need to do [14:52] <_mup_> Bug #1218616: all-machines.log is oversized on juju node [14:52] natefinch, I don't recall the CL that reportedly added log rotation, but I may just be ignorant [14:54] fwereade_: sure thing, I'll take a look. [14:55] natefinch, there may or may not be a related issue with machine-0.log contents being repeatedly spammed into all-machines.log [14:56] natefinch, I'm not sure whether that one ever made it to the top of anyone's queue [14:56] fwereade_: that bug you linked to links to this one, which has a fix.... not sure if it actually rolls the log... doesn't look like it from a casual glance: https://bugs.launchpad.net/juju-core/+bug/1195223 [14:56] <_mup_> Bug #1195223: juju all-machines.log is repetitive and grows unbounded [14:56] natefinch, there was definitely an issue with repeated unit-log spam into all-machines.log, but that one was resolved [14:58] natefinch, yeah, that may be the one [14:58] natefinch, rsyslog stuff happens both at cloudinit time for machine logs, and in a deployer context for unit logs [15:00] fwereade_: there are a bunch of bugs about not rotating logs... I don't think it's been fixed, but I'll double check in the code [15:02] natefinch, thanks [15:43] fwereade_: so it's how I feared [15:44] dimitern, that sounds ominous [15:44] fwereade_: finally everything builds with the api, but the tests will need some refactoring [15:44] dimitern, I presume they'll need to use backend syncs across theboard [15:44] fwereade_: mocking will really help to separate api parts from state parts [15:44] fwereade_: that's that perhaps (haven't even reached that far) [15:45] fwereade_: but the problem is a whole lot of things fail due to permission denied [15:45] dimitern, ok, but I'm inclined to imagine they're just bugs [15:45] fwereade_: we have to re-login to the api at some places to be able to pass the test as written [15:46] dimitern, that's an issue on the api side not refreshing auth when required, surely? there's the delayed getAuthFunc malarkey to address that scenario, I think [15:46] fwereade_: well, at least I can now say, save for 2 CLs which I discovered are missing, the API is indeed complete [15:47] fwereade_: I can propose these today, and continue on the struggle [15:47] fwereade_: no, no - it's just trying to access freshly added unit [15:47] dimitern, the login failures sound like they're detecting api server bugs, and fixes for those will be great candidates for individual CLs before you land the big one [15:48] fwereade_: and the api assumes we are working with a specific unit the entire sesson [15:48] fwereade_: not like create this - try it out - scrap it - create a new one - do that.. [15:48] dimitern, ahh... yeah, I see [15:49] dimitern, so we actually should be logging in again in those cases [15:49] fwereade_: yeah [15:49] dimitern, thanks [15:49] fwereade_: it's possible there would be some lurking bugs as well [15:49] fwereade_: for example I discovered the really weird way filter_test.go is written - never noticed before it's not in uniter_test package, but in uniter [15:50] dimitern, quite so [15:50] fwereade_: hence, the suites, etc. are exported [15:50] dimitern, I forget precisely the combination of forces that led me to believe that was the least evil thing to do at the time [15:50] fwereade_: or maybe not [15:50] dimitern, huh, that shouldn't happen in a _test.go file [15:50] fwereade_: yeah [15:51] dimitern, it's one of those things that's crying out for its own package, but the winds were not favourable at the time [15:51] dimitern, and go does tend to make package contents a bit hard to extract [15:51] dimitern, one for our Copious Free Time [15:51] fwereade_: the major issues I'm having is that all public or intra-package types need to use the api types, not the state types [15:52] fwereade_: and the tests use only the state types to setup scenarios, etc. [15:52] dimitern, cath's made mushroom risotto; when you can't stand it any more, come over and help us eat it? [15:52] fwereade_: and I need to bring up the full facade just to get an apiRelUnit instance to give to the thing that needs it for example [15:53] dimitern, understood; tedious :( [15:53] dimitern, if you spot opportunities to ease your burden with interfaces, take them :) [15:53] fwereade_: mm sgtm, but I'll perhaps give up in an hour or so and lay down [15:53] fwereade_: I need a freash head for that - but I might have spotted some places [15:54] dimitern, don't exhaust yourself [15:54] fwereade_: I'll just extract these two CLs from the whole mess of like 30 changed files and propose them, so I can scratch the API completely [15:55] dimitern, perfect [15:56] natefinch, fwereade_, TheMue: worker/runner introspection - a prereq to doing the testing right in the earlier branch: https://codereview.appspot.com/13493044 [15:56] rogpeppe1: looking [16:00] rogpeppe1: I'm really not a fan of having a testing-only exported type in the system. Can you put the type in a _test file so it won't be exposed outside of test-time? [16:00] natefinch: nope, i'm afraid not [16:00] natefinch: that won't be available to cmd/jujud [16:00] natefinch: i'm not a fan either [16:01] rogpeppe1, reviewed [16:01] natefinch: i suppose an alternative might be to mock worker.NewRunner inside cmd/jujud [16:03] rogpeppe1: looking [16:03] rogpeppe1, or to make the hooks a first-class feature of Runner, and just mock out the DRA we pass in in the jujud tests [16:04] fwereade_: DRA? [16:04] rogpeppe1, DebugRunnerActions [16:04] rogpeppe1: that's a possibility. There has to be a better way, even if they're also suboptimal.. [16:05] fwereade_: i think making the hooks a first class feature is just as bad [16:05] fwereade_: there's really no reason to use them other than for testing/debugging [16:06] fwereade_: and by making them a first-class feature we invite abuse [16:06] rogpeppe1: It's going to stab me in the pancreas every time I see DebugRunnerActions in the godoc [16:06] rogpeppe1, it means you don't need to expose funky global config in distant packages in order to test jujud [16:06] natefinch, that's another reason to make it Proper Code imo [16:06] natefinch: i kinda want it to be funky, because it *is* funky [16:06] s/natefinch/fwereade_/ [16:07] rogpeppe1, the globalness contributes a pretty serious amount to the funkness [16:07] fwereade_: I'd rather it be real code that invites abuse than funky code that we assume people won't use, but invites the same abuse. But I think there must be a third way [16:08] natefinch, maybe we could implement AOP for go by way of promiscuous code generation :) [16:09] fwereade_: ha! tempting.... (not really :) [16:10] fwereade_: how about mocking worker.NewRunner in cmd/jujud instead? [16:11] fwereade_: i.e. var newRunner = worker.NewRunner [16:11] fwereade_: and replace it for testing [16:12] fwereade_: actually var newRunner workerRunner = worker.NewRunner [16:14] rogpeppe1, so you just check that appropriate StartWorker and StopWorker calls are made? [16:14] fwereade_: yeah [16:14] rogpeppe1, I'd be fine with that [16:15] fwereade_: and we can also check that the functions are called at the right time too [16:15] fwereade_: we've got just as much of a handle on things as worker/runner.go, tbh [16:15] fwereade_: i'd still have the original worker.Runner inside the type - we'd just wrap it so we know what's going on [16:36] hi, question: i'm finding that every time i destroy an environment and want to bootstrap a new one, i get the error of "no tools available", and i have to run the juju sync-tools command every time, that takes ages [16:36] is there any way to avoid it? [16:44] yolanda, are you on maas? [16:44] fwereade_ , no, canonistack [16:46] yolanda, blast, I can't find it [16:46] yolanda, I think there's a public-bucket you should be able to use [16:52] fwereade_, where should i look at? it's a waste of time to always sync tools [16:53] yolanda: hm, you should just pick up the cloud-wide sumple streams stuff I'd have thought [16:54] trivial review anyone? https://codereview.appspot.com/13381045 [16:55] looking [16:58] dimitern: seems fine as a stopgap. what's the end api we're actually envisioning? [16:58] mgz: that's pretty much it - I'm preparing the last CL now [16:59] mgz: I have a huge branch about the uniter api migration I'm splitting up [16:59] mgz: this and the following CL are the only needed api changes [17:04] fwereade_, mgz, dimitern, natefinch, jam: a new testing helper - there are lots of places in the tests that could be simplified by using this: https://codereview.appspot.com/13651043 [17:04] i put it in checkers because it has no dependencies, but it could be argued that it would be better in testing [17:05] rogpeppe1: can you paste some examples how it simplifies things? [17:06] dimitern: ok, i'll propose a branch that uses it in some places [17:09] rogpeppe1: what happens if you screw up and try to set a value to something that isn't assignable to it? [17:09] natefinch: you get a panic [17:09] natefinch: (which i think is reasonable in a testing context) [17:09] rogpeppe1: can you throw in a test that makes sure that's true? [17:09] rogpeppe1: absolutely [17:10] natefinch: good point, will do [17:19] fwereade_: I never said explicitly, but we're definitely not rolling the all-machines log. I'm new to rsyslog, so coming up to speed on how to translate our current configuration to one that rolls the logs. Also, any suggestions as to rolling policies are welcome, otherwise I'll just make something up :) [17:24] another really small review? https://codereview.appspot.com/13648044/ [17:25] mgz: ? [17:28] dimitern: reviewed [17:28] dimitern: here's the example you asked for: https://codereview.appspot.com/13512051 [17:28] rogpeppe1: thanks, looking [17:29] rogpeppe1: that's nice [17:30] dimitern: thanks. i've been thinking about doing it for ages, but a change i was making today finally persuaded me it was worth doing [17:30] rogpeppe1: LGTMed [17:30] dimitern: thanks [17:31] * dimitern needs to stop for now [17:32] g'night all! [17:37] dimitern: see ya [17:37] natefinch: reproposed with test for panic when not assignable [17:38] natefinch: i'm presuming it LGTY? [17:38] rogpeppe1: yeah lemme look real quick [17:40] natefinch: thanks [17:40] rogpeppe1: LGTM :) [17:40] natefinch: [17:40] natefinch: thanks! [17:46] * rogpeppe1 must go now [17:46] g'night all [17:46] rogpeppe1: gnight! [18:16] o/ X-warrior [18:16] \o [18:17] well I will be leaving soon, anyway I will drop the idea so you guys can think about it a little bit and we can discuss it later. [18:20] I'm willing to add Elastic IP control to juju, but the question is is elastic ip control inside juju scope? Does it sound like a good feature? [18:26] I'm leaving now, talk to you later guys :D [20:54] morning [21:01] morning [21:01] taking daughter to the doctor shortly [22:39] morning [22:41] o/ [22:42] thumper: how's the daughter? [22:42] antibiotics for ear infection [22:42] hoping it doesn't rupture [22:42] doc said if it was going to, it would do so in the next 24 hours [22:42] eugh, my son just had that too [22:42] but nothing they can do about it really [22:43] ruptured mine once when I blew my nose [22:43] not nice [22:44] not much too it really, just get a bang and hearing goes funny. Fixes itself in no time [22:44] s/too/to/ [23:14] * thumper headdesks [23:34] davecheney, wallyworld: https://code.launchpad.net/~thumper/juju-core/task-race-on-api/+merge/184726 [23:34] https://codereview.appspot.com/13412047 has the diff wrong [23:34] the lp review has it right [23:34] for some reason lbox chooses an incorrect parent to do the diff from [23:34] thumper: ping [23:35] hi rogpeppe1 [23:35] you just missed a post [23:35] davecheney, wallyworld: https://code.launchpad.net/~thumper/juju-core/task-race-on-api/+merge/184726 [23:35] https://codereview.appspot.com/13412047 has the diff wrong [23:35] the lp review has it right [23:35] rogpeppe1: does a retry in the agent open bit [23:35] thumper: i've got 5 minutes before my lovely partner finishes her TV crime program - fancy a quick hangout about it? [23:36] thumper: or happy to talk here whatever [23:36] thumper: i saw your post (that's kinda why i dropped in) [23:36] which ever is easiest [23:36] happy to hangout [23:37] thumper: higher bandwidth might be good (but a couple beers down, so excuse me in advance :-]) [23:37] :) [23:37] rogpeppe1: https://plus.google.com/hangouts/_/40d498494100dea75532416a284f2bd890dd7a29?hl=en [23:45] davecheney,wallyworld: unping re review, I'll take it down as unneeded, rogpeppe1 is going to fix [23:46] ok