[00:03] menn0: bet you can reproduce if you try to do ssh-agent forwarding and try to bootstrap juju using a remote machine
[00:04] mgz: ok i'll try that
[00:05] menn0: look at the configuration for the job
[00:05] mgz: i did notice that it took a long time for the bootstrap command to think it could connect when I could already connect using ssh myself in another terminal
[00:06] mgz: maybe openssh on vivid has changed in terms of timeouts or retries
[00:10] wallyworld, mgz: I wonder if we should always be passing "-F /dev/null" to ssh to take the user's .ssh config out of the equation?
[00:10] I guess that might cause other problems when people have config that is required for ssh to work for them.
[00:10] right, we've always subtly depended on some bits of ssh config
[00:11] really I think we should explicitly take some bits and otherwise have completely isolated env to run ssh in
[00:16] that sounds sensible
[00:16] mgz: but first i'd like to be able to repro the issue so we're not guessing
[00:17] menn0: so, I think we just want to run the underlying workspace-run command, using the real vivid-slave, as that job is
[00:17] menn0: bzr branch lp:workspace-runner
[00:17] ...how is this new code actually harder to run manually than the old way >_<
[00:18] wwitzel3: do i not have the latest code? error: missing WORDPRESS_DB_HOST and MYSQL_PORT_3306_TCP environment variables
[00:18] Did you forget to --link some_mysql_container:mysql or set an external db
[00:18] with -e WORDPRESS_DB_HOST=hostname:port?
[00:18]
[00:21] * menn0 looks
[00:22] mgz: sorry, family wants me. All we need to do is copy one of the jobs again and force it to download the wily juju to the vivid machine
[00:23] sinzui, mgz: although that would be useful to try, I suspect it's less about the juju version and more about the ssh config and openssh version of the test host
[00:24] I can't repro the issue from my vivid host, bootstrapping a trusty env on ec2
[00:24] menn0: Host 10.* 192.168.*
[00:24] StrictHostKeyChecking no
[00:24] UserKnownHostsFile /dev/null
[00:24] User ubuntu
[00:24] IdentityFile /var/lib/jenkins/cloud-city/staging-juju-rsa
[00:24] ^ all the config
[00:25] sinzui: no global config?
[00:25] menn0: it's what ships with vivid
[00:26] ok
[00:26] good to know
[00:26] Host *
[00:26] SendEnv LANG LC_*
[00:26] HashKnownHosts yes
[00:26] GSSAPIAuthentication yes
[00:26] GSSAPIDelegateCredentials no
[00:26] sinzui: i'll try with that config locally in case it's relevant
[00:27] mgz: so what does workspace-runner do?
[00:27] mgz: never mind, saw the readme
[00:27] :)
[00:28] menn0: vivid-slave is unique in that it doesn't have a jenkins on it, instead we use the workspace runner to execute the commands via ssh. Jenkins also uses the same ssh connection string (because we knew that worked)
[00:31] sinzui: ok.
[00:32] menn0: would you like to visit the machine?
[00:33] sinzui: I wonder if the whole running juju commands over ssh setup is the root cause here. not that you shouldn't be able to but that might be what's triggering the different behaviour for the vivid tests.
[00:33] sinzui: yes please
[00:34] menn0: okay, here's what I'm trying
[00:35] running workspace-run directly, as that job would be doing, with a few test things filled in
[00:35] first, ssh config for me, note key from cloud-city
[00:36] menn0: try ubuntu@vivid-slave.vapour.ws
[00:36] http://paste.ubuntu.com/11923154
[00:36] mgz: Why are you up? I am already burned out
[00:36] then config json file for the runner command like so http://paste.ubuntu.com/11923178
[00:37] then running this command in the workspace runner directory, changing cfg.json and s3cfg paths as needed http://paste.ubuntu.com/11923163
[00:37] that's running deploy-job and stuck on ssh currently
[00:38] I bet if I do basically the same bootstrap that the runner is from inside vivid-slave with my creds fiddled, it would not hang
[00:38] sinzui: it's hot. also, I have to do the drying up before going to bed and I'm putting it off.
[00:39] menn0: that make sense at all?
[00:39] mgz: yep, makes sense
[00:40] mgz: i was just waiting to see if another bootstrap attempt was going to work. Was using the ssh config that sinzui pasted in.
[00:40] mgz: it did work. i'll try your instructions
[00:42] sinzui: I can get to vivid-slave. thanks.
[00:44] mgz: workspace-runner is doing it's think now
[00:44] thing
[00:44] so, on vivid-slave, I can ssh to both the 52. address that the workspace runner is stuck connecting to, *if* I add -i param with the key from the local cloud-city dir
[00:48] mgz: could that be the problem then?
[00:48] mgz: the ssh config file Host section that includes the key is only for "10.* 192.168.*"
[00:49] menn0: let me check what jenkins master has exactly for this
[00:50] mgz: it certainly doesn't
[00:54] jenkins master has way more, ... seems reasonable to try making the vivid-slave config hosts bits on host * and maybe adding agent forwarding as well
[00:55] mgz: do you want me to try that
[00:55] ?
[00:56] mgz: using workspace-runner I was able to repro the 10min timeout
[00:57] menn0: trying this change on vivid-slave http://paste.ubuntu.com/11923214
[00:58] note I have not touched the identity stuff, if it's id related I expect it to still hang
[00:58] mgz: looks good
[00:58] mgz: I'm going to bet it's the identity stuff
[00:58] mgz: but it's good to pin it down
[00:59] mgz: if this is the problem we need to change juju to report the ssh failures it's seeing (and swallowing) during bootstrap
[00:59] if this fails, will make it use the defined key locally on the slave
[00:59] agent-forwarding can work, but can also not.
[00:59] okay, that failed, but failed fast
[01:00] mgz: what was the error?
[01:00] "TLS handshake timeout" refreshing addresses, get https://ec2.us-east-1.amazonaws.com?...
[01:01] mgz: hmmm, that's not related to what we're looking at
[01:01] mgz: that's an error talking to the API
[01:01] (Amazon's API)
[01:01] nope, probably just bad handling of intermittent https
[01:01] juju shouldn't fail there
[01:02] mgz: no it shouldn't
[01:02] should just let the retry loop continue
[01:02] mgz: we should file a bug about that
[01:02] mgz: but also try again with the current ssh config
[01:02] now... that didn't actually do a destroy-environment
[01:04] okay, gets done before starting next run, we do handle that
[01:05] going again, end of last run: http://paste.ubuntu.com/11923231
[01:08] mgz: want me to file the bug for the TLS issue?
[01:11] menn0: please do
[01:15] okay, okay that timed out.
[01:15] what's somewhat annoying is juju doesn't log what key it thinks it's using
[01:16] vivid-slave *has* the cloud-city key copied into ~/.ssh/id_rsa
[01:16] so I presume that... but I guess it's then not using the same key to try and connect?
[01:19] mgz: the cloud-city key and the id_rsa key are not the same. i just diffed them.
[01:20] well, that'll be it then
[01:21] mgz: i've just added debug logging to the SSH connection attempt code used by bootstrap so that at least we can see why the attempts are failing
[01:21] mgz: i'll get that merged today
[01:22] mgz: maybe symlink the key? (and remove mention of it from the ssh config file to avoid confusion?)
[01:22] although ... ssh can be picky about the perms on the key files so that might be tricky
[01:24] how did I manage to get in... I guess it must be creating with the cloud-city key but then trying to connect with whatever the hell this key in ~/.ssh is
[01:27] menn0: ln -s just made my current run get in and start installing stuff
[01:28] mgz: \o/
[01:28] I'm going to invalid off this bug and make you a nice new small one on our ssh output being a pain to debug
[01:29] mgz: assign that new one to me b/c I already have a fix for it
[01:29] Our current blocker then is https://bugs.launchpad.net/juju-core/+bug/1477355
[01:29] Bug #1477355: MachineSuite.TestDyingMachine fails on windows
[01:29] mgz: one last thing: when you ran into that TLS handshake issue, Juju didn't clean up after itself, is that right?
[01:30] menn0: right, and neither did our scripts, but I *think* that's as designed now? somewhere I think we're setting the leave-failed-bootstrap-around thing
[01:31] oh ok
[01:31] i'll mention it as a possible issue on the bug at least
[01:33] mgz: here's bug 1477357
[01:33] Bug #1477357: EC2 API TLS handshake failure aborts bootstrap
[01:36] menn0: bug 1477358
[01:36] Bug #1477358: No output from ssh problems during bootstrap
[01:38] menn0: sinzui: axw is going to fix that new blocker above
[01:43] Bug #1477157 changed: Broken windows dependencies
[01:43] Bug #1477293 changed: Bootstrap fails to connect on vivid/go 1.3
[01:43] Bug #1477355 opened: MachineSuite.TestDyingMachine fails on windows
[01:43] Bug #1477357 opened: EC2 API TLS handshake failure aborts bootstrap
[01:43] Bug #1477358 opened: No output from ssh problems during bootstrap
[01:47] mgz: cheers
[01:48] wallyworld: https://github.com/juju/juju/pull/2867
[01:48] please
[01:48] looking
[01:50] axw: do you need to verify that branch actually passes tests on windows, or are you sure anyway?
[01:51] axw: we don't need a linux build directive, right? the compiler is smart enough to know what to do?
[01:51] mgz: I'm 99% sure, enough that it would be a productivity killer to set up a VM to test in windows, rather than just watch CI
[01:51] wallyworld: correct
[01:52] mgz: yeah, i think the pr will fix the issue
[01:55] wallyworld: it's 2013 because that's just moved code
[01:55] ah, doh, sorry
[01:55] axw: 99% is high enough for me to not volunteer either. you saw the LoopUtilSuite failures at the bottom of the page as well as the /proc/1/cgroup ones?
[01:56] mgz: doh, no I did not. I'll fix that separately.
[01:56] actually, the bot didn't see my PR yet
[01:56] I'll update it
[01:56] * wallyworld should have noticed too
[01:56] separate branch and fixes-1477255 to land both is fine.
[01:57] s/2/3/
[02:11] wallyworld: PTAL, I added another commit
[02:21] axw: will do as soon as team meeting finishes :-)
[02:21] oh crap
[02:41] thumper: didn't want to hold up the meeting with our inane small talk
[02:41] that's for our 1:1 tomorrow
[02:41] :)
[02:41] oh yeah, can't make that
[02:41] \o/
[02:41] wallyworld: wanna do it now?
[02:41] sure
[03:19] Bug #1474195 changed: juju 1.24 memory leakage
[03:23] wallyworld, axw: http://reviews.vapour.ws/r/2248/diff/#
[03:24] menn0: looking
[03:26] menn0: shipit
[03:27] axw: cheers
[03:28] i'm going to get that into 1.24 as well given that it's tiny but potentially quite useful
[03:31] Bug #1474195 opened: juju 1.24 memory leakage
[03:34] katco: you would need to pull either my branch or eric's branch to get the envvars fixes
[03:34] katco: neither of them have landed on the feature branch yet
[03:34] Bug #1474195 changed: juju 1.24 memory leakage
[03:54] niemeyer: are you around?
[03:55] natefinch: Depends.. :-)
[03:56] niemeyer: quick question that I'm pretty sure I know the answer to. Is there any way to force goyaml to write out strings with surrounding quotes, presuming the quotes are not strictly necessary?
[03:57] natefinch: Nope.. Not right now
[03:58] niemeyer: ok, I was pretty sure. Thanks.
[03:58] natefinch: np
[04:04] waigani_: do we not need to upgrade the last-login/conn info? it's not really important, right?
[04:37] katco: so I put the wordpress image up in s3 for now and updated the charm to pull it down from there and load it
[04:37] katco: it pulls fairly quick from an EC instance
[04:41] axw: yeah, that is what I was thinking. But fair call, I'll double check with thumper
[05:06] wallyworld: if this is about changing the last login / connection time, then yes, I would prefer an upgrade step
[05:06] thumper: waigani_ perhaps?
[05:06] sorry wallyworld
[05:06] :-)
[05:06] meant waigani_
[05:28] katco: just did an end to end with the wpm charm pulling the image tar from S3 and it worked great
[09:01] fwereade: standup?
[09:11] dooferlad: did you ping me here on irc? somehow notification failed
[09:11] TheMue: I pinged you yesterday, but got on with other fixes. No worries.
[09:13] dooferlad: ah, ok, will take a look now
[09:29] dooferlad: so, done.
[09:29] https://github.com/juju/juju/wiki/MongoDB-and-Consistency expanded a little
[09:30] comments, corrections, critiques actively sought
[09:31] fwereade: http://reviews.vapour.ws/r/2252/
[09:32] fwereade: I added azure as well since I was gonna start with it today and I realized it might have the same problem
[09:32] Bug #1477464 opened: juju does not support custom signed image metadata
[09:34] bogdanteleaga, the thing I want to be certain of is that we will reject unsigned metadata that purports to come from streams.c.c
[09:35] bogdanteleaga, but I'm afraid I don't have the bandwidth to think about this properly today -- so wallyworld is surely going to be a better reviewer than me
[09:36] TheMue: thanks!
[09:36] dooferlad: yw
[10:07] hi code devs, when do you expect trunk to be unlocked?
[12:04] axw, wallyworld: do either of you know why we have such weird restrictions on sets of status data?
[12:04] axw, wallyworld: specific messages only allowed for allocating
[12:04] axw, wallyworld: extra data forbidden here and there
[13:14] * dooferlad is back online again. First a server problem (power button broke off its wires so couldn't turn back on) then the ISP flakes out. Yay.
[13:18] Bug #1477293 opened: Bootstrap fails to connect on vivid/go 1.3
[13:56] mgz: Can I delay us for 5 min?
[13:57] xwwt: no problem, poke me when you're free
[13:57] Bug #1477293 changed: Bootstrap fails to connect on vivid/go 1.3
[13:57] mgz: ty
[14:17] mgz: sorry, I am back now. If you still have time to meet, let me know.
[14:17] xwwt: let's go
[14:34] sinzui: I was able to finally get into the call
[14:49] ericsnow: ping
[14:49] wwitzel3: hey
[14:50] ericsnow: currently, when running the launch command, the register call internally errors and complains about at least one bad arg in bulk api call
[14:50] wwitzel3: k
[14:50] ericsnow: so checking launch for a non-zero exit and setting status is getting hung up there
[14:52] wwitzel3: is there any indication on why that bulk call is having a problem?
[14:52] ericsnow: that one of the args in the bulk call is bad ;P
[14:53] wwitzel3: just what I though :)
[14:53] thought
[14:53] wwitzel3: sounds like a bug to me
[14:54] ericsnow: error: &params.Error{"", "at least one bulk arg has an error"}
[14:54] wwitzel3: also, why is the plugin succeeding even though it failed to start?
[14:54] ericsnow: it didn't fail to start
[14:54] ericsnow: it fails to register
[14:55] ericsnow: this started happening after the flush fix went in
[14:55] wwitzel3: I thought the point was that status is "Running" even though it failed to start
[14:55] wwitzel3: ah, flush
[14:56] ericsnow: no, the launch command is returning a non-zero exit, because register fails, but the container is successfully run
[14:56] wwitzel3: in the case of the bulk call, the specific problem will be in the Error field of one or more of the individual results
[14:56] wwitzel3: k
[14:57] Bug #1474291 opened: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout
[15:25] bodie_: happy birthday :)
[16:09] I'm not sure who's working on storage, but I'm wondering if a storage-list command is planned, for running within a hook/action context?
[16:10] aisrael, I would *think* so; axw would know for sure
[16:11] aisrael, he won't be awake for... 6 or 7 hours I think?
[16:11] fwereade: Ok, thanks! I'll follow up with him before I eod.
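As ericsnow notes, "at least one bulk arg has an error" only says that something failed, and the detail lives in the Error field of the individual results. The sketch below shows one way to surface those per-item errors. It uses hypothetical types modelled loosely on a bulk-call result, not juju's actual params package, and `combinedError` is an illustrative name.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical result shapes: one ErrorResult per bulk arg, each carrying
// its own optional error.
type Error struct {
	Code    string
	Message string
}

type ErrorResult struct {
	Error *Error
}

type ErrorResults struct {
	Results []ErrorResult
}

// combinedError returns nil if every item succeeded. Otherwise it returns an
// error naming each failing arg by index with its own message, instead of
// only the generic "at least one bulk arg has an error".
func (r ErrorResults) combinedError() error {
	var msgs []string
	for i, res := range r.Results {
		if res.Error != nil {
			msgs = append(msgs, fmt.Sprintf("arg %d: %s", i, res.Error.Message))
		}
	}
	if len(msgs) == 0 {
		return nil
	}
	return errors.New(strings.Join(msgs, "; "))
}

func main() {
	res := ErrorResults{Results: []ErrorResult{
		{},
		{Error: &Error{Message: "invalid process ID"}},
	}}
	fmt.Println(res.combinedError()) // arg 1: invalid process ID
}
```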
[16:16] aisrael, axw would love feedback, so please do send mail or ping him when he is online
[16:16] aisrael, wallyworld can also help
[16:17] alexisb: ack, will do. I've been building a charm to put storage through its paces, particularly as it applies to benchmarking. I'm sure I'll have some feedback to give.
[16:17] awesome
[16:56] ericsnow: wrapping up some touches on the charm now, did you get a chance to look at the register/apiserver bulk thing?
[16:57] wwitzel3: sorry, didn't realize
[17:15] Bug #1477464 changed: juju does not support custom signed image metadata
[17:15] Bug #1320312 opened: fallback to unsigned stream metadata may have security issues
[17:43] That feeling when you realize the conversion you're doing you've already written 2 weeks ago.
[18:18] Bug #1477709 opened: default config lacks state-port
[18:26] wwitzel3: ping
[18:26] lazyPower: pong
[18:26] wwitzel3: have you seen this before? http://paste.ubuntu.com/11926451/
[18:27] happy to file a bug, just trying to figure out what happened so it's not a pebkac issue living in the tracker.
[18:27] lazyPower: doesn't matter if it is, the juju command should never panic and barf in the console for a user ;)
[18:27] lazyPower: so it is a bug, for sure
[18:27] fair enough
[18:28] lazyPower: looks like a result of the google environment missing a config value
[18:28] lazyPower: it should probably print something like .. "missing "
[18:28] ;)
[18:30] rule #1 of software - never show a stack trace to the user
[18:31] and especially true of Go where panics are almost always due to a programmer error.
[18:31] are the docs right, it's project-id and not project_id? Seems odd to have one config option that doesn't conform
[18:32] https://bugs.launchpad.net/juju-core/+bug/1477712
[18:32] Bug #1477712: GCE provider dumps stacktrace when missing a config option/value
[18:32] bug for reference
[18:33] lazyPower: yep, project-id, auth-file
[18:34] lazyPower: OR project-id, private-key, client-email, client-id
[18:34] ok note: when supplying the json file, no stacktrace was emitted, it behaved
[18:35] it seems to be related to embedding the data in the environments.yaml vs supplying the auth-file
[18:35] lazyPower: can you add that in the ticket real quick
[18:35] ah and if it's dashes, the docs are wrong
[18:35] then the docs are wrong, it is for sure dashes
[18:36] ack, i'll tear down and try again with dashes and see if it chokes again
[18:36] and fix the docs while i'm in here
[18:37] thanks wwitzel3
[18:37] np, thank you
[18:37] you are a gentleman and a scholar
[18:42] Bug #1477712 opened: GCE provider dumps stacktrace when missing a config option/value
[19:13] natefinch: sorry meeting running over, we'll probably cancel?
[19:14] katco: that's fine. I still have a bunch of work to do anyway
[19:22] wwitzel3: took me a minute to circle back, but that was it. convert from underscores to dashes and the panic goes away.
[19:29] wwitzel3: hey can you hop in moonstone rq?
[20:23] katco: ping
[20:30] wwitzel3: pong
[20:31] katco: hey, was stuffing food in my face, still want to hangout?
[20:31] wwitzel3: nah the moment hath passed
[20:31] katco: k, sorry
[20:31] wwitzel3: wasn't anything too important
[20:31] wwitzel3: no worries at all
[20:34] Bug #1431286 changed: juju bootstrap fails when http_proxy is set in environments.yaml
[20:47] morning folks
[20:47] wwitzel3: hey there
[20:47] thumper: hey
[20:48] wwitzel3: LXD now supports cloud images as of 0.14
[20:48] wwitzel3: I'll be chatting with stgraber next week about any other bits we need
[20:48] oh, thumper, that is my cue to leave
[20:49] * thumper tips hat
[20:49] thumper: that is great
[20:49] wwitzel3: so Friday work to resume after annecy
[20:49] thumper: ok, I'll be sure to touch base with you on your Friday standup after Annecy
=== natefinch is now known as natefinch-afk
[22:44] axw: Sorry, your fix for the windows test is incomplete. I updated https://bugs.launchpad.net/juju-core/+bug/1477355 with the error windows now sees
[22:44] Bug #1477355: MachineSuite.TestDyingMachine fails on windows
[22:54] wallyworld, perrito666: please both of you give this CL some close attention when you can: http://reviews.vapour.ws/r/2255/
[22:54] fwereade_: sure
[22:54] aye
[22:55] fwereade_: btw, i saw this morning i had missed some messages last night - the reason for only allowing status data sometimes etc - NFI, the semantics were in place before the workload status changes, they were simply preserved with the new work
[22:56] wallyworld, awesome
[22:57] wallyworld, I remember them being developed, I see no reason to keep them, if you don't either then awesome :)
[22:57] i can't see any reason ottomh
[22:58] fwereade_: "no longer mixes txn and non-txn writes to statushistory collection"
[22:58] hasn't perrito666 already done that?
[22:59] wallyworld, it includes perrito666's backport
[22:59] ok
[22:59] wallyworld, added when I thought I'd finish earlier, and he hadn't done it yet
[22:59] wallyworld, I can suck up the merge if his lands
[22:59] np, ta just reading the mp description
[23:00] wallyworld, and it's a *shitty* diff :(
[23:00] wallyworld, you will probably get more out of cloning and grepping for Status in state
[23:00] fwereade_: the nowToTheSecond() stuff - that was cargo culted from elsewhere. i believe mongo has timestamp issues
[23:01] issues with precision in vs precision out
[23:01] fwereade_: if you added the same change I think it would be dumb to merge my backport
[23:01] wallyworld, that's why you convert to nanoseconds and store an int64 ;p
[23:01] perrito666, fair point :)
[23:01] fwereade_: yuk
[23:01] wallyworld, better than silently discarding precision imo
[23:01] depends on the use case
[23:02] wallyworld, I think "don't discard precision until you have to" is a pretty solid principle
[23:02] most dbs i've worked with handle timestamps properly
[23:02] nfi why mongo doesn't
[23:02] wallyworld, it keeps ms accuracy I think?
[23:03] "have to" - do we really need to know a machine status changed at 12:31:22.374747473
[23:03] I would have expected it to store with more precision, 0-padded
[23:03] wallyworld, I would also suggest that that is a better source of ordering than the sequence thing
[23:03] fwereade_: it may be ms, not sure, but i seem to recall (maybe incorrectly) there was an issue reading back out?
[23:04] wallyworld, which is one extra db write/read-result, on the same doc, every time we write status history
[23:04] wallyworld, there is a certain amount of whiny editorialising in the comments
[23:04] fwereade_: it's not so much ordering but also the culling of old records, i can't recall the details now
[23:04] wallyworld, I have Opinions about that too ;)
[23:05] i'm sure you do :-)
[23:05] wallyworld: maybe you want to try a quick followup branch to address a complication in the windows fix https://bugs.launchpad.net/juju-core/+bug/1477355. I can pause CI for a few hours to ensure we test master again.
[23:05] Bug #1477355: MachineSuite.TestDyingMachine fails on windows
[23:06] sinzui: yeah saw that :-( i'll get axw to fix asap
[23:06] bah
[23:07] wallyworld: I am going to force CI to make it pause for up to 5 hours. We can manually enable build-revision if we get a fix in sooner
[23:07] ok, tyvm
[23:10] fwereade_: why was all the status validation (including for old stuff like machine status) removed from helper functions in status.go?
[23:10] wallyworld, it moved to the methods on the entities themselves
[23:10] hmmm
[23:10] wallyworld, which now call down to a common implementation
[23:11] wallyworld, we can fight this out in annecy -- but I think you are very wrong in your view that the business rules need to be deeper embedded in state
[23:11] i would have preferred the validation to be kept in a separate method
[23:11] like we do for configs etc
[23:11] i don't think that
[23:12] at all
[23:12] state is for persistence
[23:12] the model and validation is separate
[23:12] the old implementation had the validation deep in state
[23:12] wallyworld, I am glad then, you seemed to be suggesting that earlier
[23:12] i am not defending that
[23:12] no
[23:12] wallyworld, this has it slightly less deep
[23:12] +1 to that
[23:13] katco: pushed up the latest charm that works against the latest feature-proc-mgmt branch
[23:13] i do think we are in violent agreement, just some tinking on the details
[23:13] wallyworld, with a view to pulling it out to the facades, because this is a situation where there's no need to weave business rules in with persistence
[23:13] fwereade_: but yeah, let's get into this next week
[23:13] katco: I had to remove the current retval check from process-launch because I still haven't fixed the issue with register
[23:13] wwitzel3: k
[23:13] wallyworld: http://reviews.vapour.ws/r/2256/
[23:13] wwitzel3: i'll try it out asap
[23:13] katco: but it is all there if you wanted to do another deployment and have the updated status-history
[23:14] axw: looking
[23:14] wallyworld: confirmed it builds in windows, using go 1.5 cross compiling
[23:14] axw: lgtm
[23:14] sinzui: fix on the way
[23:14] faboo
[23:15] wallyworld, and it was also a bit of a reaction to the N types and N methods and N funcs all just doing the same incorrect serialization prep plus a bit of validation, far from where the data entered the package
[23:15] fwereade_: sadly that tends to be *everywhere*
[23:16] our code has sort of devolved as it has grown organically
[23:16] especially when the api layer was added
[23:17] wallyworld, yeah, people have to look after it as they go
[23:18] wallyworld, but I'm not sure what you mean about validation done deep in state
[23:18] wallyworld, do you mean consistency concerns? lots of them for sure
[23:18] fwereade_: sorry, had to join meeting
[23:19] wallyworld, but the tolerable pattern is ExportedMethod() { Validation(); TransactionLoop() }
[23:21] dammit late again
[23:21] * fwereade_ bed
[23:38] thank you axw. I see the merge into master
[23:39] sinzui: great, now for the real test ;)
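The "tolerable pattern" fwereade_ describes, ExportedMethod() { Validation(); TransactionLoop() }, can be sketched as follows. All names are hypothetical. The status rules are illustrative rather than juju's real ones, and `errAborted` stands in for a transaction whose assertions failed because something else changed the document first.

```go
package main

import (
	"errors"
	"fmt"
)

// errAborted is a hypothetical stand-in for an aborted transaction.
var errAborted = errors.New("transaction aborted")

type Status string

const (
	StatusAllocating Status = "allocating"
	StatusStarted    Status = "started"
	StatusError      Status = "error"
)

// validateStatus keeps the business rules in one place, outside the
// persistence code. These particular rules are illustrative only.
func validateStatus(s Status, message string, data map[string]interface{}) error {
	switch s {
	case StatusAllocating, StatusStarted:
		if len(data) > 0 {
			return fmt.Errorf("status %q does not accept extra data", s)
		}
	case StatusError:
		if message == "" {
			return errors.New("error status requires a message")
		}
	default:
		return fmt.Errorf("unknown status %q", s)
	}
	return nil
}

// transactionLoop re-runs attempt while it reports errAborted, so each try
// can re-read current state and rebuild its operations.
func transactionLoop(maxAttempts int, attempt func(n int) error) error {
	for n := 0; n < maxAttempts; n++ {
		if err := attempt(n); !errors.Is(err, errAborted) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d aborted attempts", maxAttempts)
}

// SetStatus follows ExportedMethod() { Validation(); TransactionLoop() }:
// bad input is rejected before any database work is attempted.
func SetStatus(s Status, message string, data map[string]interface{}, write func(n int) error) error {
	if err := validateStatus(s, message, data); err != nil {
		return err
	}
	return transactionLoop(3, write)
}

func main() {
	write := func(n int) error {
		if n == 0 {
			return errAborted // simulate a concurrent change on the first try
		}
		return nil
	}
	fmt.Println(SetStatus(StatusStarted, "", nil, write))                              // <nil>
	fmt.Println(SetStatus(StatusAllocating, "", map[string]interface{}{"x": 1}, write)) // status "allocating" does not accept extra data
}
```

Keeping `validateStatus` separate from the loop is what would let it later move out to the API facades, as suggested above.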