[00:00] davecheney: in a meeting, be with you soon [00:06] wallyworld_: done [00:06] davecheney: tyvm, will look after my meeting [00:08] sinzui: how long does ci take to go through its tests? [00:08] sinzui: I'm wondering how long before we know if ericsnow's branch fixes the problem [00:09] aboput 18 minutes atm [00:09] davecheney: that is just to land, I'm talking about the 'ci' tag being taken off the bug so it unblocks landings [00:14] davecheney: I could but I dont understand what that means [00:15] perrito666: i don't either [00:15] is something broken ? [00:15] i'm completely lost as to where we left our discussion [00:15] lol, ok, you answered a review request for a wrapper around chmod with an email asking something [00:16] rings a bell? [00:16] right, [00:16] yes, why do we need this ? [00:17] davecheney: so I answered that mail but in summary, chmod, even though it "runs" on windows, does not work [00:17] ok, but why do we need to chmod files ? === kadams54 is now known as kadams54-away [00:19] davecheney: yes, we do sometime ago I did a check with the cloudbase guys to see what was not working on windows workloads and that one was one of the outstanding [00:19] ok, my question is can we solve this problem by removing the cause ? [00:19] chmod might be busted on windows [00:19] but if we can remove the requirement to change permissions [00:20] then that also solves the problem [00:20] well its used for charms and It seems necessary [00:21] how is it used for charms ? [00:21] come on man, work with me here [00:21] davecheney: sorry I am re-reading the code as we speak, hold === kadams54-away is now known as kadams54 [00:25] davecheney: I dont fully see all implications but, except for one case where it it actually there for windows, it seems that we can remove/do only for linux the chmods [00:26] to me that sounds like a better solution [00:29] davecheney: Ill have a talk with fwereade tomorrow and see if he remembers why we decided to do this instead of nuke all appearances of chmod, Ill be glad to se this go (and with windows tests in place it might never come back) [00:30] perrito666: so this is a charm hook that needs to chmod a file ? [00:33] davecheney: the uses of Chmod=? [00:34] yup [00:34] i'm still digging for the why [00:38] there are around 26, for what I see almost half are tests simmulating lack of permissions or similar situation, the rest are giving more permissions to certain files, there is one case in environs/config.go that corrects a possible wrong permission on a file and the rest I would have to look [00:39] we use a lot chmod for my taste [00:42] so, looking at the existing chmod most if not all are of no consequence for windows, I worry a bit about future uses, now we actually need to care for windows workloads and if only os.Chmod would panic as file.Chmod does I would not fear that it might silently crawl under our radar [00:43] I might be just over engineering [00:44] thumper trunk looks to be in bad shape. Lots of tests failed [00:44] perrito666: i'm only interested in charm hooks that use chmod [00:44] imo there shuld be none [00:44] thumper, I am retesting those that looks like cloud failures [00:44] sinzui: ta [00:49] thumper, but ha-backup-restore is not fixed, the bug has mutated. I add the new error message https://bugs.launchpad.net/juju-core/+bug/1398837 [00:49] Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf [00:49] * thumper groans === kadams54 is now known as kadams54-away [00:50] thumper, sinzui: I already have http://reviews.vapour.ws/r/573/ that should address that EOF issue [00:50] hurray === kadams54-away is now known as kadams54 [00:51] it just needs a review [00:56] ericsnow: I'm looking at it [00:56] thumper: thanks [00:56] ericsnow: but I don't see how your change improves anything [00:56] ericsnow: can you explain? [00:57] thumper: if we notice a dropped connection we reconnect and try again [00:58] ericsnow: a minor change then land it [00:58] thumper: k [00:59] ericsnow: http://reviews.vapour.ws/r/573/ not lgtm, yet [00:59] k [01:00] 12noon ppl [01:00] oh [01:00] the meeting is actually at 2pm [01:00] ignore me [01:00] or one pm [01:01] calendars are hard, let's go shopping [01:01] davecheney: I would really like, but most imports are closed to make people buy local for christmas [01:01] :p [01:15] davecheney: I replied to your comment [01:15] davecheney: basically, I'm pretty sure that code should stick around [01:18] axw: standup? [01:18] davecheney: but I'll do it cleaner [01:18] ok [01:18] axw: oops sorry, you're not here [01:22] davecheney: http://reviews.vapour.ws/r/531/diff/# needs a full review [01:22] thumper: on it [01:22] cheers [01:24] anastasiamac: any ETA on your blocking branch? I don't really want to put my machine branch up for review until I have that merged in === _sEBAs_ is now known as sebas5384 [01:32] thumper: today?.. [01:32] * anastasiamac fingers xrossed [01:34] davecheney: I've updated http://reviews.vapour.ws/r/573/ [01:34] * thumper crosses fingers too [01:35] don't forget to add your things to the agenda [01:35] https://docs.google.com/a/canonical.com/document/d/1eeHzbtyt_4dlKQMof-vRfplMWMrClBx32k6BFI-77MI/edit [01:35] if you don't, then I get to talk for the whole time [01:35] and you probably want to avoid that [01:36] ericsnow: ok [01:41] ericsnow: review done [01:42] i'm not happy with the specialised logic inside getBackupTargetDatabases [01:42] the DBSession interface should describe what it needs [01:42] if it needs a .Copy method [01:42] then the mock needs to implement that as well [01:43] davecheney: it has to match mgo's Session.Copy which returns *mgo.Session, making the interface method kind of goofy [01:44] davecheney: but I agree with you :) [01:44] ericsnow: perhaps this is not the right place for a mock then [01:45] however unpleasent that is [01:45] davecheney: you may be right :( [01:46] davecheney: in the meantime for the sake of opening the landing bot... [01:46] ericsnow: please raise a bug against the next milestone [01:46] davecheney: k [01:56] I need a second reviewer for http://reviews.vapour.ws/r/573/ [01:56] thumper: wallyworld_ menn0 ? [01:56] sure [02:04] my cal says we have the team meeting now, but no one is here? [02:05] ah, just my crappy connection === kadams54 is now known as kadams54-away [02:27] thumper: I finally found someone that would do an affogato en in argentina :) I only had to go 90Km away to get it [02:28] haha [02:52] davecheney: I replied to you comments [02:52] wallyworld_: I'm pretty sure the dropped session (the io.EOF) is due to the replicaset functionionality of HA [02:53] davecheney: one key change with putting the feature flags into juju/utils is that the flags themselves had to be agnostic of any particular environment variable [02:53] so it can't really return a map unless we pass the key in [02:53] which we could make a helper to do... [02:54] ericsnow: ok, np. i'm not across it enough. is only one retry enough? [02:54] wallyworld_: I just updated the patch to retry 10 times [02:54] wallyworld_: but still only for io.EOF (which mgo returns when the connection drops) [02:55] davecheney: I'm beginning to think that we should have the feature flag init in each of the main blocks [02:56] davecheney: then we wouldn't need the test for osenv [02:56] ericsnow: so why not wait until replicaset is up before doing backup? [02:56] don't we do that elsewhere? [02:57] wallyworld_: I don't know about elsewhere, but I wouldn't mind waiting until the replicaset is up [02:57] wallyworld_: wouldn't that but in the CI test script though? [02:57] s/but/be/ [02:58] our code needs to wait [02:58] if backup invoked, it needs to ensure replicaset is up before proceeding [02:58] wallyworld_: k [02:58] like i think state server might do when starting [02:58] that's IMHO [02:59] but seems like the right thing to do [02:59] wallyworld_: okay === kadams54-away is now known as kadams54 [03:00] wallyworld_: I'm not well versed in the HA stuff but what you're saying makes sense [03:00] ericsnow: nate knows all about HA etc :-) [03:00] wallyworld_: lovely :P [03:01] wallyworld_: I'm pretty sure y'all don't want to wait until Nate is back online before I get this pushed :) [03:01] I'm here :) [03:01] why would I not be here? :) [03:01] magic [03:01] ericsnow: i'd rather not alnd a bad solution [03:01] lol [03:01] haha [03:02] natefinch: wallyworld_ is suggesting that we wait for replicaset to finish coming up before running backup [03:02] natefinch: how do we test for that? [03:02] wallyworld_: agreed [03:04] ericsnow: so, iirc, it's non-trivial [03:04] natefinch: not the right answer :P [03:05] wallyworld_: I'm still going to be out for a while longer. I'll be working late tonight [03:05] axw: np at all [03:11] ericsnow: michael did a bunch of work on that relatively recently, you should talk to him in the morning [03:12] natefinch: k [03:13] ericsnow: I think there were updates to the replicaset code and/or tests in order to determine that. IIRC it was something like ping until we get something reasonable back. [03:13] natefinch: got it [03:15] natefinch: keep in mind that for backup we can do the check on the API server side, so we have access to state [03:16] ericsnow: yeah, the replicaset tests assume DB access too. There's just no actual flag saying "replicasets are up" === kadams54 is now known as kadams54-away === kadams54-away is now known as kadams54 === kadams54 is now known as kadams54-away [04:37] axw: ping [04:39] https://gist.github.com/anonymous/de06b097d25d690b684f after seeing this log i'm pretty sure i'm not able to use kvm [04:39] hehe [05:04] wallyworld_: I've updated http://reviews.vapour.ws/r/573/ [05:05] ok, looking, wow, must be late for you [05:05] wallyworld_: is that what you meant about checking if HA is ready? [05:05] wallyworld_: well, I hate leaving CI blocked [05:08] ericsnow: looks ok i think, but backup doesn't seem to call WaitUntilReady() and haEnabled() always returns true [05:09] wallyworld_: isn't HA always enabled (even if not utilized), i.e. the --replset option is always passed to mongod [05:09] wallyworld_: do you think WaitUntilReady would be more appropriate than IsReady? [05:10] true, so why the haEnabled() function? and what about older environments? or are they upgraded to ha ? [05:10] i think they are upgraded [05:10] wallyworld_: the new backups only applies to 1.22+ [05:11] wallyworld_: and haEnabled gets patched to return false in the tests (since we don't use HA there) [05:11] wallyworld_: is "haEnabled" the wrong name? (perhaps "replSetEnabled"? [05:11] so then that's a bit misleading, the block of code should be extracted from create [05:12] and put inside a func [05:12] and that func should be what's patched [05:12] wallyworld_: fair enough [05:12] but isn't that what WaitUntilReady is for? [05:13] if you just call WaitUntilReady, it should all be fine, just patch WaitUntilReady? [05:13] and then IsReady doesn't need to be exported [05:15] wallyworld_: it depends on it we want backup to fail immediately if HA isn't ready or if we make users wait [05:15] oh right i see [05:15] wallyworld_: currently it fails immediately, but it sounds like you would rather we take the waiting approach [05:15] for now just do what's needed to unblock so you csn goto bed [05:16] wallyworld_: I could drop the WaitUntilReady func [05:16] k [05:16] yes, drop that for now [05:16] jut the minimum, but then come back and fix if needed [05:16] i'd just like to see the code extracted [05:16] so the haEnabled() can be dropped [05:16] k [05:16] doing it now [05:16] ty [05:25] wallyworld_: pong [05:26] axw: quick hangout? [05:26] sure [05:26] 1:1 one [05:26] ok [05:45] wallyworld_: hypothetically the openstack provider could do bad things if your environment name contains regexp meta characters [05:46] axw: do we allow that? i thought env names were constrained [05:46] to valid chars [05:46] maybe... trying to find where [05:48] wallyworld_: seems we just check that it doesn't contain "/" [05:48] oh [05:48] in that case i need to do more in the azure one also [05:48] I'll see what openstack allows for machine names... [05:52] azure is ok [05:52] "alphanumeric characters and underscores are valid in the name" [05:55] hmm, can't find any info about it on openstack... [05:57] possibly this? https://github.com/openstack/nova/blob/master/nova/api/validation/parameter_types.py#L61 [05:58] which includes . [05:59] wallyworld_: updated http://reviews.vapour.ws/r/573/ (and added tests) [06:00] looking [06:04] ericsnow: +1, but i just realised, the CI script might still fail [06:04] FWIW, I plan on following up with mfoord in the morning [06:04] wallyworld_: how so? [06:04] as it will still see an error [06:04] it will try to backup and get a "not ready, try again later" error [06:04] wallyworld_: oh, duh [06:04] as opposed to a EOF error [06:04] wallyworld_: dang it [06:05] i think we can ask that the script be changed [06:05] maybe [06:05] or else we will need that retry loop [06:05] but you go to bed, i'll follow up [06:05] wallyworld_: and change it to a CI bug rather than a core bug? [06:05] maybe, i'll have to ask [06:06] i can see both sides of the argument [06:06] wallyworld_: k, I'll get the merge started [06:06] sure, tyvm [06:06] and then we can add to it if needed to put in the wait until ready [06:06] wallyworld_: that WaitUntilReady function is still in the commit history ;) [06:07] indeed :-) [06:07] wallyworld_: because I'm sneaky like that :) [06:07] lol [06:08] wallyworld_: okay, it's running the merge CI right now [06:08] ty :-) [06:08] wallyworld_: I'll leave it in your hands (thanks!) [06:08] night night [06:22] wallyworld_: have a cheeky glass of red for me [06:22] wish i could [06:22] i'm still working [06:23] i was saying good bye to eri [06:23] c [06:23] axw: here's the gwacl branch https://code.launchpad.net/~wallyworld/gwacl/prefix-service-match/+merge/243620 [06:23] i need to modify juju to pass in a separator [06:24] looking [06:33] axw: i might do as you suggest, i can just abandon the gwacl branch [06:33] wallyworld_: I kinda wish those convenience functions in gwacl weren't there [06:34] yeah [06:34] that one in particular is trouble, obviously [06:34] indeed === fuzzy_ is now known as Fuzai [07:02] anyone else need a review ? [07:04] axw: http://reviews.vapour.ws/r/580/ [07:04] looking [07:04] davecheney: i do, but andrew can look as we've discussed befre hand [07:14] wallyworld_: done, fwiw [07:14] thanks dave :-) [07:17] wallyworld_: done also [07:17] thanks [07:18] i checked the azure doc, service names don't allow meta chars [07:18] but i'll add the quoting to be sure [07:18] wallyworld_: they don't *now*, but who's stopping them from changing that? [07:18] sure [07:27] jam1: you ok for storage meeting a bit later? [07:27] wallyworld_: yep [07:27] jam1: great [07:28] jam1: axw: i may be a minute or 2 later as i have to drive my wife to the city and am not sure of traffic [07:28] i should be back on time [07:28] okey dokey [08:00] wallyworld_: when you have a moment, can you see if this makes sense to you? https://docs.google.com/a/canonical.com/document/d/1-9ZPfdgpkj2R9mBG_tlSclGGyK3tRpMf2L4C37mzYD8/edit#bookmark=id.g0p2mahykmz [08:44] morning [09:24] morning all [09:27] dimitern: morning [09:28] morning voidspace, TheMue [09:28] dimitern: voidspace: o/ [09:39] TheMue: hiya [10:01] jam1, voidspace, standup? [10:02] dimitern: just working on a meeting, will be there soon [10:02] ok [10:02] dimitern: omw [10:05] * fwereade_ out for a bit [10:21] morning [10:33] perrito666: morning [10:33] perrito666: heya [10:35] wallyworld_: axw: are we back in https://plus.google.com/hangouts/_/canonical.com/juju-storage [10:35] ? [10:35] omw [11:38] * dimitern out for a 1h [13:02] perrito666: if you get a chance care to look at this one: http://reviews.vapour.ws/r/583/ [13:02] perrito666: you're probably more familiar with this code than others, but it's a simple change [13:02] perrito666: restores (and tests) some work by ericsnow [13:03] voidspace: looking [13:03] perrito666: thanks [13:09] voidspace: only nits not worth mentioning, LGTM, but it is not worthy that I LGTM since you will need for david tomorrow to ship it :p so look for a better source of lgtmness [13:10] perrito666: ok, cool - thanks [13:12] anyone else who fancies an easy review [13:12] http://reviews.vapour.ws/r/583/ [13:12] I'm off to lunch === jcw4 is now known as jw4 [14:45] sinzui: where are we on those blockers? [14:45] natefinch, still taking stock. I just reported https://bugs.launchpad.net/juju-core/+bug/1399229 [14:45] Bug #1399229: win client cannot get status after bootstrap [14:53] sinzui: can we get a --debug run on that? [14:53] perrito666, I am trying. I need to setup another win machine first [14:55] natefinch, I suspect ha is brittle, but we have so many hot issue, I am going to get you better info for the current bugs than investigate ha [14:58] natefinch, the ha issues are connect shutdown/refused talking to the api server, which is the current problem with the backup-restore test, so maybe we already have a bug tracking the problem [15:02] perrito666, damn it. We cannot get a debug with a matching server because the testing streams are for 1.21. I will try a bootstrap anyway and hope for a reproduction [15:02] sinzui: hold, I have both windows and a fake stream [15:02] tell me how to try that [15:03] perrito666, The test is just a bootstrap into aws, proof that it can bring up and talk to a state-server [15:03] sinzui: ok, that is trunk? [15:03] * perrito666 goes again [15:03] perrito666, yes [15:03] ok, firing up windows [15:11] can you believe it? compiling juju in your machine and inside a vm is a bad idea [15:15] dimitern, do we really need to backport bug 1397376 to 1.20? the stakeholders are pointing to 1.20 as an example that works for them [15:15] Bug #1397376: maas provider: 1.21b3 removes ip from api-endpoints [15:22] ok sinzui please dont hate me for the stupid thing I am about to ask [15:22] where should I store environments.yaml in windows? [15:22] sinzui, I wasn't sure, let me check if the logic differs in 1.20 [15:23] perrito666, /users//.juju/environments.yaml and you need a .ssh/id_rsa [15:23] perrito666: there should be an environmental variable for /users/.... forget what it is [15:23] perrito666: perhaps %HOME% [15:24] sinzui, confirmed - 1.20 is not affected by the same address ordering issue [15:25] \o/ [15:26] asdkjhaskdjhna ksdhjask windows will not allow me to name something .juju from the gui [15:27] oh you have to name it .juju. [15:27] and windows removes the dot at the end [15:28] perrito666: you need to turn off the "hide extensions" thingy, I think... I've never had a problem using wacky extensions in windows before. [15:29] natefinch: the issue was not with the extension is that the file began with . [15:29] so, no, windows is not looking for %HOME%/.juju [15:30] oh yeah, you can't make a file in the UI that starts with a ..... sorry [15:30] er with a . that is [15:30] perrito666: %USERPROFILE%\.juju\environments.yaml ? [15:30] %UserProfile% [15:30] looking [15:30] I thought we made juju home on windows just "Juju" [15:31] natefinch: let see [15:32] perrito666: yeah it's %APPDATA%/Juju [15:33] dimitern, you still around? [15:34] alexisb: my chrome just froze, will be back on the call in a minute.... [15:34] bootstraping [15:34] sinzui: jw4 katco natefinch tx for your help [15:34] alexisb, yes [15:34] perrito666: anytime perrito666 [15:35] on the cross team call, was curious on status on the api endpoints bugs [15:45] natefinch, axw targeted https://bugs.launchpad.net/juju-core/+bug/1388860 to 1.20.14. It is a backport of a fix. He suggests it because there is another bug reporting that the issue does indeed affect the 1.20 series. Can you get someone to look into the backport now, and if it is ugly, then maybe we shouldn't...just go head with the release. [15:45] Bug #1388860: ec2 says agent-state-info: 'cannot run instances: No default subnet for availability zone: ''us-east-1e''. (InvalidInput)' [15:45] [15:46] alexisb, i'm working on a fix now [15:46] alexisb, it turned out to be more complicated to test than to fix :) - I should be ready later today or tomorrow morning [15:47] sinzui: natefinch http://pastebin.ubuntu.com/9368819/ [15:47] perrito666, try again, that is aws's sucky mirrors [15:48] sinzui: I am about ot [15:48] to [15:48] perrito666, oh.. [15:48] perrito666, you can tell juju to not do updates and upgrades! [15:49] * perrito666 had to step down a moment to go order food, which requires me standing on a bench on the backyard, single place in the house with cell reception [15:49] perrito666, you can improve your chances if success: [15:49] enable-os-upgrade: false [15:49] enable-os-refresh-update: false [15:49] ^ we do that with canonistack because its network is unreliable [15:50] sinzui: is the test running that way? [15:50] dimitern, awesome, thank you for turning that around so quickly [15:50] man, it really sucks we're still using launchpad for bugs... it makes everything so much more manual [15:50] I would like to stay as close as possible to the tests [15:50] sinzui: do you happen to have a link to the PR that fixed https://bugs.launchpad.net/juju-core/+bug/1388860 for 1.21? [15:50] Bug #1388860: ec2 says agent-state-info: 'cannot run instances: No default subnet for availability zone: ''us-east-1e''. (InvalidInput)' [15:50] [15:51] alexisb, no worries [15:51] perrito666, no, but I think it is somewhat irrelevant because we have fresh images in aws, charms need updated and upgrades, but this failure is about talking to the state server [15:51] sinzui: running second bootstrap with your options added [15:53] natefinch, I do not, I will look [15:53] I have a fix for issue 1398837 [15:53] waiting for a review [15:53] sinzui: I found it [15:54] bug 1398837 [15:54] Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf [15:54] sinzui: looks to be a trivial change [15:55] natefinch, I see the change, but not the actual pull request [15:55] https://github.com/juju/juju/search?q=1388860&type=Issues&utf8=%E2%9C%93 [15:56] amazing, a search that actually works the way you'd expect :) [15:56] natefinch, https://github.com/juju/juju/commit/9e1f40588eb6befcc543ae64e15cf7d8b11fd090 [15:56] natefinch, I was slow, I looked at the date and read the commits [15:56] sinzui: that's the one. Super simple. Want me to backport it? [15:56] natefinch, please do. pretty please [16:05] odd, ECONREFFUSEd [16:05] sinzui: regarding 1398837, part of the problem is that the failing test does not wait long enough for HA to be ready before trying to run backup [16:05] ericsnow, okay, what time should we set? [16:06] sinzui: I'm not sure how long it takes for HA to get ready, but I see that voidspace's patch has a timeout of 60 seconds [16:07] what! [16:07] voidspace: thanks for taking that over, by the way [16:07] ericsnow, I think we wait 10-20 minutes [16:07] sinzui: uff, /proc/self/fd/9: 9: exec: varlibjujutoolsmachine-0/jujud: not found [16:07] someone broke paths [16:07] ericsnow: no problem, care to review it? [16:07] sinzui: that test has a total runtime of about 15 minutes [16:08] perrito666, \o/, or old nemesis win paths [16:08] voidspace: sure, though it looks strangely familiar :) [16:08] sinzui: I wonder how in the universe does a windows client affects that path [16:09] ericsnow, I am looking at the lib now and hoping for an easy timeout change [16:09] sinzui: k, thanks [16:09] ericsnow: heh [16:10] ericsnow: sinzui: in my investigations I could find *no* deterministic way to tell when a replicaset is ready [16:10] perrito666, the win module is convoluted. it switched between native path separators and posix depending on the code *knowing* that it will be executed against the state server [16:10] voidspace, :( [16:10] ericsnow: sinzui: even connecting separately to all of them and waiting until the configuration from *all members* reports that they're all ready wasn't enough after a reconfigure [16:10] ericsnow: sinzui: at which point I gave up [16:10] voidspace: I wanted to ask you about that (the IsReady function I added) [16:10] ericsnow: sinzui: this fix will definitely help *sometimes* [16:10] nnnnnailed [16:10] sinzui: https://github.com/juju/juju/commit/ad420d9 [16:11] ericsnow: I have been down this road and this will help sometimes [16:11] ericsnow: but sometimes they report ready and the next operation can still fail [16:11] ericsnow: although your problem is with initiation - my problem was with reconfigure [16:11] ericsnow: so it's likely to be better [16:11] ericsnow: the Initiate function could call WaitUntilReady [16:12] voidspace: ah, cool [16:12] ericsnow: I didn't make that change as I was focussed on fixing the specific problem [16:13] sinzui: Ill try to fix it [16:13] voidspace: ack [16:15] natefinch, perrito666: standup? [16:17] ericsnow: trying... google doesn't like me [16:17] google is not your friend... [16:17] natefinch: use firefox [16:18] it was trying to join as my gmail account [16:18] ericsnow: why did you remove WaitUntilReady? [16:19] ericsnow: a timeout in the script won't help without a retry loop [16:19] voidspace: we weren't using it so wallyworld_ asked me to remove it [16:19] voidspace: right, a timeout won't help but a sleep will :) [16:20] ericsnow: ah, it was wallyworld_ who asked me to put it back in [16:20] this morning [16:20] voidspace: yeah, I told him it was in the commit history still :) [16:20] ericsnow: a retry loop is better than a sleep, surely? [16:20] ericsnow: I'm agnostic on it - up to sinzui really [16:21] voidspace: sure, but that's up to sinzui [16:21] whether we fix it in juju or they fix it in their test harness [16:21] ericsnow: ok [16:23] adding WaitUntilReady to Initiate slows the test suite down a lot [16:23] I'll tell you by how much when it actually finishes! [16:24] voidspace: I noticed that each test I added for IsReady added about 12 seconds to the tests [16:24] well, from 244 seconds to 365 - including one failure (probably needs a mock) [16:24] ericsnow: heh, ouch [16:24] ericsnow: my tests for WaitUntilReady are fast because they all mock IsReady... [16:27] voidspace: nice [16:27] and use a timeout of 1 second... [16:28] voidspace, the test sets the timeout for ha to 1200...could something else be timing out before then? [16:28] well obviously something is [16:29] sinzui: backup was set to hard fail if the replicaset was not ready [16:29] voidspace: IsReady may need a tweak (it will return an error for anything but io.EOF) [16:29] sinzui: so it's not a timeout you need as much as a *retry* if it fails for that reason [16:29] ericsnow: that sounds correct to me [16:29] ericsnow: what other error would you expect? [16:29] voidspace, retry of what, status, ha? [16:29] sinzui: of the backup itself I think [16:29] sinzui: that's where the failure was IIUC [16:30] oh, backup, doh [16:30] voidspace: "dial tcp 172.31.3.149:17070: connection refused" [16:30] ericsnow: ah [16:30] ericsnow: what's the actual error type? [16:30] ericsnow: or should I do a match for "connection refused"? [16:30] voidspace: not sure, sinzui noted it in bug 1398837 [16:30] Bug #1398837: cannot extract configuration from backup file: "var/lib/juju/agents/machine-0/agent.conf [16:31] ericsnow, the bug is really about the test failing. if we mark it fix released, we replace it with another bug that backup tests still fail [16:31] sinzui: +1 [16:32] ericsnow: hmmm... that bug actually reports that restore fails [16:32] ericsnow: are you sure that a WaitUntilReady in create will fix that? [16:33] voidspace: are you sure? "ERROR:root:Command '['juju', 'backup']' returned non-zero exit status 1" [16:33] ericsnow: ah, that's further down the bug report [16:33] "the bug has mutated" [16:34] voidspace: the original restore error was caused by backup though [16:35] ericsnow: ok, fair enough [16:35] natefinch: you really need to work in your unmuting skills [16:36] voidspace: it's all because I stuck a "juju backups create" call in the backup plugin script a couple weeks back [16:36] voidspace: however, these are real issues that need addressing at some point so might as well be now [16:46] ericsnow: so this would be my fix for the connection refused issue [16:46] ericsnow: http://pastebin.ubuntu.com/9369537/ [16:46] no, hang on [16:47] if errors.Cause(err) == io.EOF || (err != nil && strings.Contains(err.Error(), "connection refused")) { [16:47] sinzui: so should we apply Michael's fix (http://reviews.vapour.ws/r/583/) or will you be able to add retries/sleep to the CI test script (the HA one) around the backup call? [16:48] ericsnow, I am still looking at extending the timeout [16:48] voidspace: I so hate testing for strings in err.Error() :( [16:48] sinzui: I'm not sure a timeout will help [16:49] sinzui: it has to wait somehow for HA to be ready before running backup (or retry the backup if it fails) [16:50] sinzui: voidspace's fix would probably help, but I'd rather not do that *just* for the sake of the CI test if we can help it [16:50] ericsnow, yep I see ensure-availability as the issue, I will report a new bug about this, close the backup bug and hope the patch works [16:50] voidspace: but you're probably right [16:51] heh [16:51] voidspace: I was relying just on io.EOF for IsReady because of the precedent elsewhere in the replicaset code (I don't have a very good knowledge of the problem-space otherwise) [16:53] voidspace: davechaney had suggested checking for other kinds of failures but I didn't find any examples of that elsewhere in juju so I stuck with just io.EOF [16:53] voidspace: so I'm good with checking for "connection refused" (my dislike of checking err.Error() aside) [16:53] right [16:53] cool [16:54] I've pushed it and we can let the PR lie until we get a definite decision [16:54] I'm returning to IP address stuff [16:54] voidspace: k [16:54] sinzui: thanks! [16:55] sinzui: regardless, good came of the bug (though as the cost of CI being blocked for an extra day) [16:56] voidspace: I'll put up a separate patch just for the "connection refused" part of that so that it's not conflated with the WaitUntilReady part [16:57] ericsnow: ah, I pushed that [16:57] ericsnow: it doesn't do any harm is my thinking... [16:57] voidspace: that's okay [16:58] sinzui: natefinch found it, running tests now for proposal [17:03] voidspace, ericsnow I reported https://bugs.launchpad.net/juju-core/+bug/1399277 about the ha issue, I add a line for beta4, because I /think/ this will help, but we can discuss it as out of scope [17:03] Bug #1399277: ensure-availability is not reliable [17:03] voidspace: thanks! [17:04] sinzui: ^ [17:04] sinzui: it applies to 1.22 as well, right? (where we are seeing the CI backups failures) [17:05] ericsnow, yess 1.22 is really hurting [17:05] sinzui: :( [17:15] rick_h_: to confirm where should we log bugs for jujucharms.com? [17:15] https://github.com/CanonicalLtd/jujucharms.com/issues arosales_ [17:15] arosales_: updated bug link is landed and will be in monday release [17:18] rick_h_: thanks [17:18] arosales_: np, ty for the bug reports [17:19] np, thanks for responding to them :-) [17:44] uff this patch is awfully hard to revert [17:45] perrito666: ...subsequent changes? [17:46] mgz_: yes most likely [17:50] Can someone review my backport? It's a very small change: http://reviews.vapour.ws/r/584/ [17:54] ahh the old "on call reviewers are both done for the day by 8am" [17:54] mgz_: can you look? ^ [17:54] natefinch: on it [17:55] natefinch: is the indent of that switch off or is that just reviewboard messing with me? [17:57] mgz_: that's how gofmt likes it. [17:57] just double checked [17:58] personally, I prefer the cases to be indented, but it does then cause double indent for the stuff under the case, so.... yeah. *shrug* [17:58] natefinch: I reviewed that patch [17:58] I think I may just be misreading the html output... it looks like the returns are differently indented, but there are change-indent marks that implies otherwise [17:59] lgtm otherwise [17:59] mgz_: oh I see what you mean [18:00] mgz_: the left hand side code is only under a single "if", the right hand side is under an "if" and a switch, so it is indented more. [18:00] yup [18:01] I honestly didn't even see the indent marks at first. It's nice that it doesn't mark the whole thing as just different when it's just the different indent. [18:04] natefinch: we should keep track of these (things people *like* about RB) [18:04] natefinch: all I ever hear is what annoys people (which is typical) :P [18:05] ericsnow: is easy, all that is not annoying we like [18:05] the whole "make a ton of changes and then publish as one action" is pretty fantastic.... [18:05] natefinch: agreed [18:06] perrito666: heh [18:08] ericsnow: being able to expand all files with a single click is also pretty awesome. I hate that I can't do that in github's diffs [18:29] natefinch: what do you think of this as a conversion function - dotted quad (IP address) to decimal? [18:29] natefinch: [18:29] the whole "make a ton of changes and then publish as one [18:29] oops... [18:30] natefinch: http://pastebin.ubuntu.com/9370769/ [18:30] voidspace: FYI: http://reviews.vapour.ws/r/585/ (check other "conn dropped" conditions) [18:30] natefinch: just wondering if it has any chance of passing review [18:30] ericsnow: LGTM [18:31] natefinch: the address is already validated, so no need to handle that potential error case [18:31] although it should work fine anyway as ParseInt would fail [18:32] and the *caller* will be constructing a sensible error message from that failure [18:34] ooh, bug [18:34] they need zero padding [18:34] grr... [18:43] right, g'night all [18:43] voidspace: net.IP has some interesting stuff [18:43] natefinch: ah, I'll check [18:43] natefinch: I need int representations [18:43] natefinch: it maybe that I don't need to write them myself [18:44] voidspace: I always try to write as little as possible myself :) [18:44] natefinch: it doesn't have a ToDecimal [18:44] or equivalent [18:45] nor the reverse [18:45] I need both [18:45] voidspace: I'll look around to see if something's already out there [18:45] voidspace: don't want to keep you from EOD [18:45] ok [18:46] natefinch: thanks, if you see something an email would be awesome [18:46] voidspace: will do [18:46] natefinch: they're not hard to write functions - just converting via string functions is a little icky [18:46] natefinch: decimal to dotted quad is less icky [18:46] g'night [19:23] for the fantastic chance to unlock CI http://reviews.vapour.ws/r/586/ [19:23] it is extremely trivial [19:23] has been tested on windows and linux [19:28] perrito666: will this work if deploying a unit to a windows machine? [19:29] natefinch: I dont know, I cannot deploy one of those [19:29] but if not, I am ok with it since it is a regression [19:29] and also https://bugs.launchpad.net/juju-core/+bug/1399322 [19:29] Bug #1399322: ToolsDir should be series based [19:29] I opened a bug to have someone solve that doubt [19:34] perrito666: let's ship it [19:35] aghh what was the _fixes thing? [19:35] perrito666: $$fixes-XXXXX$$ [19:36] tx ericsnow [19:36] sent [19:37] I think the comments just need "fixes-123456" in them, and then $$whatever$$ works [19:45] ok, since the fix is hopefully being merged, I am going to step down for a moment [19:57] morning folks [19:58] thumper: morning, shot you an email about rescheduling next week if you get time to peek [19:58] rick_h_: hey [19:59] rick_h_: can you do two days earlier and an hour later? [20:00] man I hate mongo [20:00] sinzui: what's the status of CI? [20:00] thumper: I can [20:00] thumper: can you email that back and let urulama respond as well? [20:00] kk [20:00] ty [20:00] thumper, master is blocked by 2 regressions, both with fixes due soon [20:02] * thumper sighs === kadams54 is now known as kadams54-away === kadams54-away is now known as kadams54 === kadams54 is now known as kadams54-away [20:50] grr... [20:54] gc.Not is so fucked [20:54] thumper: could u cast ur eyes over my changes for block commands [20:54] anastasiamac: sure [20:55] thumper: the PR is so huge now, m thinking to break it into smaller pieces [20:55] thumper: block functionality itself [20:55] and than a PR for each command [20:55] as long as your first commands are add and remove machine I don't care :-) [20:55] thumperbut if it's all good as it is now, I'd rather commit my monster without breaking it apart [20:56] thumper: :) [20:56] thumper: lets he how it reviews to u atm [20:56] I'm already going to have to do a mega-conflict merge [20:56] /s/he/see [20:56] thumper: :-( [20:56] that's fine [20:56] I'm used to it [20:56] thumper: but beta u than me :) [20:56] sure [20:56] :) [20:57] thumper: m off to sort kids and co... [21:00] kk [21:15] I think I'll fetch a coffee and my almond croisant before starting this review [21:15] it may take a while [21:28] ericsnow: ping [21:28] menn0: o/ [21:29] ericsnow: howdy. [21:29] ericsnow: what's the state of play with the ensure-availability/backup issue? [21:29] ericsnow: looks like you have a Ship It for PR 1271 [21:30] ericsnow: is that PR not particularly important? [21:31] menn0: last I was aware, sinzui was going to see about tweaking the function-ha-backup-restore test so that HA is up before backup runs (or something along those lines) [21:31] ericsnow: ok cool [21:31] menn0: that PR isn't important for unblocking CI [21:32] ericsnow: there's also voidpspace's PR 1269 [21:32] ericsnow: which seems relevant but hasn't been merged yet [21:32] menn0: it was just something we noticed would be good to do at some point (so it can wait until things are unblocked) [21:33] ericsnow: ok so we're really just waiting on sinzui's change to get CI unblocked [21:34] menn0: yeah, Michael's patch (http://reviews.vapour.ws/r/583/) should help but I think the current behavior is correct for users (so I'd rather we not land that patch just for the sake of CI) [21:34] menn0, not quite, because I don't know how to make ci know when ha is ready [21:34] menn0: yeah, and he had said he might change that bug to be a CI bug IIRC, which would unblock us [21:35] sinzui: can you just throw a sleep in there before calling backup or put that call to backup in a retry loop [21:35] ? [21:36] ericsnow, sleeping for 5 minutes the running backup can still fail. [21:36] sinzui: yuck [21:36] ericsnow, I need to poll something that means we are ready [21:36] sinzui: how long does it take for HA to be ready? [21:37] sinzui: I'm not the best resource for finding that polling solution but I'll give it some thought [21:38] natefinch: do you have any ideas off-hand on how sinzui might poll for HA-ready (from a script) [21:39] ericsnow, it is variable. Our code calls a method named wait_for_ha() we don't start backup until [21:40] ericsnow, We read status http://pastebin.ubuntu.com/9372831/ [21:40] ericsnow, menn0 is there something else to read about juju *really* being ready [21:41] sinzui: I wish I knew. I believe that can sometimes pass and mongo can still somehow not be ready-ready. [21:42] could is "juju run" something on the state-server or the other voting machines? [21:43] s/could is/could 1/ [21:43] * sinzui gives up [21:44] well, I think testing got lucky. I think the replica-set was ready this time [21:45] natefinch, ericsnow I will add a 5m sleep if you think it will fix the issue 80% of the time [21:46] sinzui: If it's close enough that it just succeeded as-is, then I'd expect such a sleep (hacky as it is) would help a bunch [21:46] * sinzui adds hack [21:46] sinzui: I would not expect a lot of variability in how long it takes for HA to come up [21:48] sinzui, ericsnow, natefinch: wouldn't voidspaces change also greatly reduce the odds of the replicaset not being ready? [21:48] menn0, I think so [21:48] sinzui: of course that doesn't solve the problem of accurately/reliably introspecting the HA status, but that the point of but 1399277, right? [21:49] menn0: if I were a user I would rather it fail when the replicaset isn't ready than have it wait [21:50] menn0: however, I'd expect the odds of this issue affecting actual users to be remote [21:51] menn0: I'd rather we didn't apply voidspace's patch just for the sake of CI (more -0 than -1) [21:52] ericsnow: what, you don't think a lot of people will do "juju bootstrap && juju ensure-availability && juju backup"? :) [21:52] natefinch: :) [21:52] ericsnow: ok, well if backup reports a clear error immediately about the replicaset not being ready can't the test detect that and retry a few times [21:52] natefinch: you can't be too careful :) [21:52] natefinch: however, there's a chance this could bite someone if they did " juju ensure-availability && juju backup" on an existing env, no? [21:53] menn0: that was what voidspace suggested when we discussed it [21:54] sinzui, menn0: the error message is "HA not ready; try again later" [21:54] (apiserver/backups/create.go) [21:56] ericsnow: I'm not too concerned if someone does "juju ensure-availability && juju backup" and the backup fails with "HA not ready" (or something similar). [21:56] natefinch: right [21:56] natefinch: that's the point of what I did yesterday [21:57] ericsnow: well that's great. I'm fine with there being some times when backup can't be done. [22:00] Gotta run [22:07] sinzui: would it be reasonable to drop the "ci" tag from bug 1399277? [22:07] Bug #1399277: ensure-availability is not reliable [22:07] no, the test has to consistent;y pass [22:10] sinzui: I meant since it's more of a CI bug but having a more reliable way to know when HA is ready is still something we want to get [22:10] ericsnow, hell no. enterprises script this out like we do [22:10] sinzui: good point [22:11] ericsnow, This issue is new, so I think something bad has happened. Regardless, I am adding a sleep [22:12] sinzui: what changed is I updated the "juju backup" plugin to call "juju backups create" [22:13] sinzui: but that was a few weeks ago so if this issue is new as of a matter of days then yeah [22:14] ericsnow, yeah :( If I add a 5 minutes sleep, the test suite also sleeps. I need to do more work [22:14] sinzui: :( [22:15] sinzui: what about a loop around the backup call that checks for the "HA not ready; try again later" message? [22:16] ericsnow, :/ doable but maybe award not all juju have this problem. We test 18 and 20 too [22:17] sinzui: can you not query status to determine if the ha servers are ready? or they are marked ready before replicaset is actually ready? [22:17] perrito666, we do! status said has vote, so we started backup [22:17] perrito666: they are already doing that: http://pastebin.ubuntu.com/9372831/ [22:18] ok so status is lying [22:22] ok,EOD my brain is fried [22:24] sinzui: before I leave, where can I see https://bugs.launchpad.net/juju-core/+bug/1399229 job? does it run on the same CI? [22:24] Bug #1399229: win client cannot get status after bootstrap [22:24] being windows [22:25] perrito666, this is the job, and your commit is the top result http://juju-ci.vapour.ws:8080/job/win-client-deploy/ [22:25] :D [22:26] I go in peace then, have a nice night everybody