[01:33] davecheney: question for you [01:33] wallyworld_: shoot [01:34] on trunk, i try this: juju set-env some-bool-field=true [01:34] it fails [01:34] expected bool, got string("true") [01:34] o_o [01:34] have you seen that? [01:35] i haven't use that command [01:35] certainly never with bool fields [01:35] do we even support them ? [01:35] which charm ? [01:35] there's code there to parse a string to bool, but it appears to not be called at the right place [01:35] this is setting an env config value [01:35] ahh [01:35] i bet nobody ever tried [01:35] cf. the horror show that is the environment config [01:35] and updating it after the fact [01:36] yeah, appears so :-( [01:36] time for a bug report [01:36] or it could be fallout from moving to api [01:36] could be [01:36] the only bool env field I know of is [01:36] use-ssl [01:36] thanks, just wanted to check before raising a bug [01:36] or the use-insecure-ssl [01:36] i think you've got a live one [01:36] there's also development [01:36] ]and a new one i am doing [01:36] provisioner-safe-mode [01:37] which will tell provisioner not to kill unknown instances [01:37] wallyworld_: i think nobody has ever tried to change a boolean env field after deployment [01:37] :-( [01:38] we've only even had that use insecure ssl one and you need that to be set for bootstrapping your openstack env [01:38] ok, ta. bug time then [02:26] sinzui: any word on 1.16.5 / 1.17.0 ? [02:47] wallyworld_, axw: if you're unable to get into garage MaaS, you could probably ask bigjools nicely if you can use his equipment. [02:48] jam: he's been busy supporting site [02:48] and only has small micro servers [02:48] axw: from what I inferred when Nate got access, it was essentially smoser just "ssh-import-id nate.finch" as the shared user on that machine. [02:49] wallyworld_: sure, but I don't think we're testing scaling, just that the backup restore we've put together works [02:49] w/ MaaS [02:49] sure, but we need at least 2 virtual instances, not sure how well that will be handled [02:50] wallyworld_: well, I wasn't suggesting using VMs on top of his MaaS, just using the MaaS [02:51] wallyworld_ you already have access [02:51] yes i do. i was waiting for your on site support effortd to wind down [02:51] consider it down [02:51] you seemed stressed enough, didn;t want to add to it [02:51] hi bigjools [02:51] when the guy you're helping f*cks off mid-help, I consider it done. [02:51] ouch [02:52] :/ [02:52] wallyworld_: you could come round as well if you want direct access [02:52] so i'm currently working on one of the critical bugs [02:52] bigjools: so do you know someone who already has Garage MaaS access to the shared user? From what I can tell the actual way you get added is by adding your ssh key to the "shared" account [02:52] was hoping to get that done before i looked at the restore doc [02:53] "needing to be in the group" seems like a red herring [02:53] jam: I have access, want me to add anyone? [02:53] me and axw :-) [02:53] me please [02:53] bigjools: axw, wallyworld_, and ? [02:53] heh [02:53] lp ids please [02:53] wallyworld [02:53] ~axwalk [02:53] bigjools: I haven't done the other steps, but ~jameinel is probably good for my long term health [02:55] ok you're all in [02:56] I am having a lunch break, if you need me wallyworld_ can just call me [02:56] hooray. 
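The set-env failure at the top of the log ("expected bool, got string") is what you get when the raw command-line string reaches validation without being parsed first; the log notes the parse-to-bool code exists but is not called in the right place. A minimal sketch of that missing coercion step, using a hypothetical coerceBool helper rather than juju's actual code path:

```go
package main

import (
	"fmt"
	"strconv"
)

// coerceBool is a hypothetical helper: values arriving from
// `juju set-env key=value` are plain strings, so a bool-typed environment
// setting needs an explicit conversion before type validation sees it.
func coerceBool(raw string) (interface{}, error) {
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return nil, fmt.Errorf("expected bool, got string(%q)", raw)
	}
	return b, nil
}

func main() {
	for _, raw := range []string{"true", "false", "yes"} {
		v, err := coerceBool(raw)
		fmt.Println(v, err)
	}
}
```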
thanks bigjools [02:56] np [02:57] axw: the other bit that I've seen, is that you might have a *.mallards.com line in your .ssh/config with your normal user, but you need to still use the other User shared [02:58] if the *.mallards line comes first, it overrides the individual stanza [02:58] jam: I explcitly tried logging in as shared@ [02:58] i can ssh in now [02:58] it works for me now too [02:59] axw: so I think you have to be in the iom-maas to get into loquat.canonical.com, but to get into maas.mallards you just get added to the shared account [03:00] that would seem to be the case [03:00] axw: as in, I'm trying and can't get to loquat [03:00] axw: can you update the wiki? [03:00] jam: right, you need to get IS to do that [03:00] I would get rid of the "host maas.mallards" line in favor of the *.mallards line [03:00] jam: sure - "step 3: ask bigjools to add you to the shared account"? ;) [03:01] axw: ask someone who has access to run "ssh-import-id $LPUSERNAME" [03:01] as the shared user [03:01] axw: Hopefully we can make it a big enough warning for IS people to realize they aren't managing that acccount [03:03] thanks for setting them up bigjools [03:04] * jam is off to take my son to school [03:50] ping -> https://code.launchpad.net/~dave-cheney/goose/001-move-gccgo-specific-code-to-individual/+merge/196643 [03:55] axw: thanks for the review [03:55] now I can close this issue [03:55] nps [04:04] davecheney: you forgot to set a commit message on lp:~dave-cheney/goose/goose [04:04] https://code.launchpad.net/~dave-cheney/goose/goose/+merge/196471 [04:04] I'll put one in there [04:05] it should get picked up in 1 min [04:05] and then you can approve your above branch [04:06] ah [04:06] thanks [04:06] i was wondering whta was going on [04:06] i didn't realise you added the commit message for me [04:15] * axw froths at the mouth a little bit [04:15] wtf is going on with garage maas === philipballew is now known as philip [05:30] axw: isn't maas server supposed to be localhost given Nate's instructions? [05:30] You're generally supposed to be running a KVM (virtual MaaS) system just on one of the nodes [05:31] in Garage Maas [05:31] the main reason we use g-MaaS is because the nodes there have KVM extensions and are set up for it [05:31] in theory you could do it on your personal machine [05:31] jam: maas-server gets inherited by the nodes [05:32] they'll just try to contact whatever you put in there [05:32] (e.g. localhost) [05:32] you need to put in an absolute address [05:33] axw: ah, sure. So 10.* whatever, but not 'localhost' [05:33] yup [05:33] axw: I can imagine that maybe bootstrap works, or some small set of things, but then it doesn't actually work together [05:33] seems like the provider should be able to figure it out itself, but I dunno the specifics [05:34] jam: bootstrap doesn't even work- the node comes up, but the cloud-init script tries to grab tools from localhost [05:35] axw: well "juju bootstrap" pre-synchronous works, right? Just nothing else does :) [05:35] "the command runs and exits cleanly" [05:35] yes :) [06:02] wallyworld_: how's bug #1254729 coming? [06:02] <_mup_> Bug #1254729: Update Juju to make a "safe mode" for the provisioner [06:03] jam: we hit small bug where juju set-env something-boolean={true,false} [06:03] didn't work as expected [06:03] I saw that part, didn't know you were working on it with him [06:03] i think wallyworld_ is in that rabbit hold atm [06:03] jam: been stuck on some stuff inside the provisioner task. 
i think i've got a handle on it. issues with knowing about dead vs missing machines [06:03] when I saw, [06:03] You could cheat and make it an int [06:03] i mean wallyworld_ [06:04] and when i say we, i mean ian [06:04] yeah me [06:04] :) [06:04] * davecheney ceases to 'help' [06:04] wallyworld_: so you mean we "should kill machines that are marked dead" but not "machines which are missing" ? [06:04] davecheney: thanks for being supportive [06:04] yeah [06:05] sort of [06:05] we have a list of instance ids [06:05] wallyworld_: I'm guessing thats "we asked to shutdown a machine, wait for the agent to indicate it is dead, and then Terminate" it [06:05] and knowing which of those are dead vs missing is the issue, due to how the code is constructed [06:05] but we were detecting that via a mechanism that wasn't distinguishing an instance-id we don't know about from one that we asked to die [06:06] wallyworld_: I don't think you mean "missing", I think you mean "extraneous" [06:06] yeah [06:06] the code was destroying the known instance id too soon [06:06] agent for $INSTANCE-ID is now Dead => kill machine, unknown INSTANCEID => do nothing. [06:26] jam: I've just started a new instance in MAAS manually - shouldn't machine-0 be killing it? [06:26] it's been there for a little while now, still living [06:26] axw: you're using 1.16.2+ ? [06:26] jam: 1.16.3 [06:26] axw: did you start it manually using the same "agent_name" ? [06:27] jam: yeah, I used my juju-provision plugin [06:27] jam: do you know how I can confirm that it's got the same agent_name? [06:27] axw: some form of maascli node list [06:28] axw: it has been a while for me, might want to ask in #maas [06:28] jtv and bigjools should be up around now [06:28] nodes list doesn't seem to show it [06:28] ok [06:28] axw: if nodes list doesn't list it, it sure sounds like it isn't running [06:29] jam: no I mean it doesn't show agent_name [06:29] the node is there in the list [06:30] axw: try "maascli node list agent_name=XXXXX" [06:30] it looks like it isn't rendered, but if supplied it will be used as a filter [06:32] that worked [06:33] jam: the new one does have the same agent_name [06:33] axw: so my understanding is that we only run the Provisioner loop when we try to start a new unit. You might try add-unit or something and see if it tries to kill of the one you added [06:33] ah ok [06:33] thanks [06:38] axw: did it work? [06:38] jam: not exactly; I tried to deploy to an existing machine. it only triggers if a machine is added or removed [06:38] makes sense [06:39] anyway, it was removed [06:39] so I'll go through the rest of the steps now [06:39] axw: so I've heard talk about us polling and noticing these things earlier, but with what ian mentioned it actually makes sense [06:39] the code exists there to kill machines that were in the environment but whose machine agents were terminated [06:40] yup [06:40] and it had the side effect of killing machines it never knew about [06:40] which we decided to go with [06:50] * axw watches paint dry [06:54] axw: ? [06:54] provisioning nodes does not seem to be the quickest thing [06:55] axw: provisioning in vmaas I would think would be reasonably quick, no? 
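The dead-versus-"missing" exchange above (jam's point that the second category is really "extraneous") boils down to a two-way split over the running instances. A minimal sketch with hypothetical types, not the provisioner's real ones: instances whose machine is Dead are stopped because we asked them to die, while instances state never knew about are only reaped when safe mode is off.

```go
package main

import "fmt"

type life int

const (
	alive life = iota
	dead
)

type machine struct {
	id         string
	instanceID string
	life       life
}

// partitionInstances splits the running instance ids into those belonging to
// dead machines (we asked them to die, so terminate them) and those state has
// no record of at all (the "extraneous" ones that safe mode leaves alone).
func partitionInstances(running []string, machines []machine) (toStop, unknown []string) {
	byInstance := make(map[string]machine)
	for _, m := range machines {
		byInstance[m.instanceID] = m
	}
	for _, inst := range running {
		m, ok := byInstance[inst]
		switch {
		case !ok:
			unknown = append(unknown, inst)
		case m.life == dead:
			toStop = append(toStop, inst)
		}
	}
	return toStop, unknown
}

func main() {
	machines := []machine{
		{"0", "i-alive", alive},
		{"1", "i-dead", dead},
	}
	toStop, unknown := partitionInstances([]string{"i-alive", "i-dead", "i-rogue"}, machines)
	fmt.Println("stop (dead):", toStop, "unknown (safe mode decides):", unknown)
}
```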
[06:55] jam: it's likely the apt-get bit that's slow, but *shrug* [06:56] it's definitely not quick [06:56] I will investigate later [06:58] axw: the fix for that is to smuggle the apt-cache details into your environment [06:58] however when you're on one side of the world [06:58] and the env is on the other [06:58] it's unlikely that there is a good proxy value that will work for both you and your enviornment [07:19] davecheney: garage maas is in Mark S's garage, so I think it would be both reasonably close and have decent bandwidth to the datacenter (I could be completely wrong on that) [07:31] * jam heads to the grocery store for a bit [07:58] mgz, rogpeppe: any updates re agent-fixing scripts? [08:23] fwereade: i've got a script that works, but i don't know whether mgz wanted to use it or not [08:23] fwereade: i phrased it as a standalone program rather than a plugin, but that wouldn't be too hard to change [08:23] rogpeppe, I don't see updates to the procedure doc explaining exactly how to fix the agent and rsyslog configs [08:24] rogpeppe, documenting exactly how to fix is the most important thing [08:24] rogpeppe, scripting comes afterwards [08:24] rogpeppe, sorry if that wasn't clear [08:24] fwereade: ah, ok, i'll paste the shell scripty bits into the doc [08:24] rogpeppe, <3 [08:25] fwereade: I just finished running the process (manually) on garage MAAS [08:25] I keep writing garaage [08:25] anyway [08:25] all seems to be fine [08:25] I missed rsyslog, now that I think of it [08:25] axw, ok, great [08:26] fwereade: sent out an email with the steps I took [08:26] axw, if you can be around for a little bit, would you follow rog's instructions for fixing those please, just for independent verification? [08:26] fwereade: sure thing [08:27] axw, so did the addressupdater code not work? [08:27] fwereade: the what? [08:27] axw, you said you fixed addresses in mongo [08:27] ah, maybe I didn't need to do that bit? [08:27] axw, rogpeppe: addresses should update automatically once we're running [08:27] ok [08:28] rogpeppe, can you confirm? [08:28] fwereade, axw: it seemed to work for me [08:29] rogpeppe: no worries, I was just poking in the database and thought I'd have to update - I'll put a comment in the doc that it was unnecessary [08:30] fwereade: hmm, i realised i fixed up the rsyslog file, but didn't do anything about restarting rsyslog... [08:31] axw, well, technically, we don't know it was unnecessary [08:32] axw, rogpeppe: I am a little bit baffled that the "one approach" notes seem to have been used instead of the main doc [08:32] fwereade: i didn't suggest that [08:32] fwereade: my mistake, I just picked up the wrong thing [08:32] rogpeppe, I know you didn't suggest that bit [08:33] fwereade: i thought dimitern had some notes somewhere, but i haven't seen them [08:33] rogpeppe, they're linked in the main document [08:33] rogpeppe, axw, dimitern: fwiw I have no objection to writing your own notes for things, this is good [08:34] fwereade: just trying to fill in the hand wavy "do X in MAAS" bits :) [08:34] rogpeppe, axw, dimitern: but if they don't filter back into updates to the main doc -- and if they're left lying around without a big link to the canonical one -- we end up with contradictory information smeared around everywhere [08:34] sure [08:35] rogpeppe, axw, dimitern: eg axw trying to use rogpeppe's incorrect mongo syntax [08:36] fwereade: tbh dimitern's isn't quite right either, currently [08:36] dimitern: shall i update it to use $set ? 
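For the "$set" fix dimitern and rogpeppe settle on above, the same style of update written against the mgo driver juju-core uses looks like this. A sketch only: the dial address skips the --ssl and admin-auth handling discussed elsewhere in the restore doc, and the collection, _id and field being set are illustrative.

```go
package main

import (
	"log"

	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

func main() {
	// 37017 is juju's mongod port; a real session would also need the SSL
	// and authentication settings covered in the restore document.
	session, err := mgo.Dial("localhost:37017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	machines := session.DB("juju").C("machines")
	// $set changes only the named field instead of replacing the whole
	// document, which is the point of preferring it here.
	err = machines.Update(
		bson.M{"_id": "0"},
		bson.M{"$set": bson.M{"addresses": []bson.M{{"value": "10.0.0.9", "type": "ipv4"}}}},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```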
[08:36] rogpeppe, dimitern: fixing your notes is fine if you want [08:36] fwereade: my notes were fixed when you mentioned the problem FWIW [08:37] fwereade: I'll run through the main doc and see if I can spot any problems [08:37] fwereade: it was just a copy/paste failure [08:37] rogpeppe, dimitern, axw: but the artifact we're meant to have *perfect* by now is the main one [08:37] rogpeppe, I don't mind what notes you make, so long as it's 100% clear that they're not meant to be used by anyone else, and they link to the canonical document [08:38] rogpeppe, and I'm pretty sure mramm and I were explicit about using something that understands yaml to read/write yaml files [08:39] rogpeppe, sed, for all its joys, is not aware of the structure of the document;) [08:39] fwereade: does it actually matter in this case? we know what they look like and how they're marshalled, and the procedure leaves everything else unaffected - it's pretty much what you'd do using a text editor [08:40] fwereade: i wanted to use something that didn't need anything new installed on the nodes [08:40] fwereade, sorry, just catching up on emails [08:40] fwereade, yes, the $set syntax should work [08:40] fwereade: and i'm not sure that there's anything yaml-savvy there by default [08:41] rogpeppe, crikey [08:41] rogpeppe, well if that's the case I withdraw my objections [08:42] rogpeppe: pyyaml is required by cloud-init, so it's on there [08:42] rogpeppe, objections backin force [08:42] but... IMHO sed is fine here [08:43] * axw makes everyone hate him at the same time [08:43] * rogpeppe leaves it to someone with less rusty py skills to do the requisite yaml juggling [08:43] * fwereade flings stones indiscriminately [08:43] axw: did you check with anyone in #maas if maas-cli still doesn't support uploading? The post from allenap was from June (could be true, and you could have experienced it first hand) [08:43] rogpeppe, did you hear from mgz at all yesterday? [08:43] jam: the bug is still open, so I didn't [08:43] but [08:44] I couldn't get it to work [08:44] fwereade: briefly - he'd been offline, but i didn't see his stuff [08:45] fwereade: mgz posted his plugin to the review queue [08:46] fwereade: I'll just update the address in mongo back to something crap and make sure the addressupdater does its job; so far the main doc is fine, tho I had to add the quotes into the mongo _id value filters [08:46] axw: as long as its "I tried and couldn't, then I found the bug" I'm happy. vs if it was "I found the bug, so I didn't try" [08:47] jam: defintely the former :) [08:48] axw, thanks for fixing the main doc :) [08:48] np [08:48] axw, and let me know if the address-updating works as expected [08:48] will do [08:49] jam, axw: it doesn't support uploading still [08:50] bigjools: thanks for confirming [08:50] thanks bigjools [08:51] fwereade: confirmed, addressupdater does its job [08:51] sorry for the confusion [08:52] axw: did you have to set LC or LC_ALL when doing mongodump ? [08:53] axw: or is it (possibly) set when you ssh into things [08:53] jam: I did not, but I didn't check if it was there already; I'll check now [08:54] thx [08:54] not set to anything [08:54] dunno why it didn't affect me [08:55] axw: one thought is that you only have to set it if you don't have the current lang pack installed (which a cloud install may not have) ? not really suer [08:58] jam, rogpeppe: hey, I just thought of something [08:58] fwereade: oh yes? [08:58] ? 
[08:59] jam, rogpeppe: we should probably be setting *all* the unit-local settings revnos to 0 [08:59] fwereade: i thought of something similar yesterday actually, but not so nice [08:59] fwereade: that would be a good thing to do [09:00] rogpeppe, yeah, it was inspired by your comments yesterday, it just took a day for it to filter through [09:00] fwereade: I don't actually know what revnos you are talking about. Mongo txn ids? [09:00] fwereade: that gets you unit settings, but what about join/leave? [09:00] jam: the unit agent stores some state locally [09:00] jam: so that it can be sure to execute the right hooks, even after a restart [09:01] rogpeppe, join/leave should be good, the hook queues reconcile local state against remote [09:01] fwereade: great [09:01] fwereade: do config settings need anything special? [09:01] rogpeppe, config settings should also be fine thanks to the somewhat annoying always-run-config-changed behaviour [09:01] rogpeppe, we have a bug for that [09:02] fwereade: currently we can treat it as a useful feature :-) [09:02] rogpeppe, indeed :) [09:03] axw: when you did your testing, did you start machine-0 before updating the agent address in the various units? [09:03] fwereade: it would be interesting to try to characterise the system behaviour when restoring at various intervals after a backup [09:03] jam: no, I started it last [09:04] fwereade: e.g. when the unit/service was created but is not restored [09:04] jam: sorry, I'll add that step in :) [09:04] jam: actually [09:04] I lie [09:04] I did start it first [09:04] fwereade: i suspect that's another case where we really don't want to randomly kill unknown instances [09:04] axw, dimitern, rogpeppe, mgz, *everyone* -- *please* be *doubly* sure that you test the canonical procedure [09:05] axw: actually we *wanted* to do it last [09:05] rogpeppe, well, there's no way to restore those things at the moment anyway [09:05] jam, why? [09:05] axw: so update machine-0 config, start it, then go around and fix the agent.conf [09:05] fwereade, ok, i'm starting a fresh test with the canonical procedure now [09:05] fwereade: didn't you want to split "fixing up mongo + machine-0" from "fixing up all other agents" ? [09:05] fwereade: agreed, but the user might have important data on those nodes [09:06] jam: yeah that's what I did, sorry [09:06] axw: sorry, "when I say do it last" it was confusing what thing "it" is [09:06] axw: start jujud-machine-0 should come before updating agent.conf [09:06] jam, I think we suffered a communication failure -- you seemed to be suggesting he should fix agent confs before starting the machine 0 agent [09:06] thanks [09:07] fwereade: yes. I think we all agree on what should be done :) [09:07] jam: I fixed mongo, started machine-0, fixed provider-state, fixed agent.conf [09:07] axw: I'm copying some of your maas specific steps into the doc [09:07] cool [09:07] rogpeppe, this is true, hence https://codereview.appspot.com/32710043/ -- would you cast your eyes over that please? [09:08] fwereade: looking [09:08] rogpeppe, there's not much opportunity to fix them, it's true [09:08] fwereade: I used my plugin to provision the new node; how are people expected to do it without it (and get a valid agent_name)? [09:08] rogpeppe, and the rest of the system should anneal so as to effectively freeze them out [09:09] rogpeppe: fwereade: are we actually suggesting run "initctl stop" rather than just "stop foo" ? 
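Circling back to the sed-versus-"something that understands yaml" debate above: the structured edit fwereade asks for is only a few lines in any language with a YAML library. A sketch in Go using goyaml (juju-core's YAML package at the time); on the nodes themselves the python-yaml route mentioned above is the realistic option, and the agent.conf field edited here is illustrative.

```go
package main

import (
	"fmt"
	"log"

	"launchpad.net/goyaml"
)

func main() {
	// Stand-in for an agent.conf; only the shape matters here.
	input := []byte("stateaddresses:\n- 10.0.0.1:37017\napiaddresses:\n- 10.0.0.1:17070\n")

	// Parse, change one field, re-serialize: unlike a sed regex, unrelated
	// keys keep their values and the output is guaranteed to be valid YAML.
	var conf map[interface{}]interface{}
	if err := goyaml.Unmarshal(input, &conf); err != nil {
		log.Fatal(err)
	}
	conf["stateaddresses"] = []string{"10.0.0.9:37017"}

	out, err := goyaml.Marshal(conf)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", out)
}
```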
[09:09] jam, I don't think so [09:09] we do it differently at different points in the file [09:09] fwereade: yeah [09:09] jam, where didinitctlcomefrom? [09:09] "sudo start jujud-machine-0" but [09:09] "for agent in *; do initctl stop juju-$agent" [09:09] fwereade: in the main doc, I think rogpeppe put it [09:09] I'll switch it [09:09] jam: i generally prefer "initctl stop" rather than "stop" as i think it's more obvious, but that's probably just me [09:10] jam: the two forms are exactly equivalent i believe [09:10] rogpeppe, it's just you :) [09:11] rogpeppe, i preferred service stop xyz before, but now i find stop xyz or start xyz pretty useful [09:11] rogpeppe, and i don't think they are quite equivalent [09:11] * rogpeppe thinks it was rather unnecessary for upstart to take control of all those useful verbs [09:11] rogpeppe: honestly, I think they are at least roughly equivalent, but we should be consistent in the doc [09:11] dimitern: no? [09:12] dimitern: [09:12] % file /sbin/stop [09:12] /sbin/stop: symbolic link to `initctl' [09:12] main problem *I* had with "service stop" is I always wanted to type it wrong "service stop mysql" vs "service mysql stop" [09:12] I still am not sure which is correct :) [09:12] rogpeppe, initctl is the same as calling the script in /etc/init.d/xyz {start|stop|etc..} [09:12] jam: initctl stop mysql [09:12] rogpeppe, whereas start/stop and service are provided by upstart [09:13] dimitern: yeah, I confirmed rogpeppe is right that stop is a symlink to initctl [09:13] at least on Precise [09:13] dimitern: i believe that stop is *exactly* equivalent to initctl stop [09:14] dimitern: try man 8 stop [09:14] rogpeppe, hmm.. seems right [09:14] dimitern: (it doesn't even mention the aliases) [09:14] dimitern: that's why i like using initctl, as it's in some sense the canonical form [09:14] rogpeppe: can you double check the main doc again. I reformatted the text, and reformatting regexes is scary :) [09:14] rogpeppe, but again, I usually am too lazy to type more, if I can type less :) [09:14] https://docs.google.com/a/canonical.com/document/d/1c1XpjIoj9ob_06fvvGJz7Jm4qS127Wtwd5vw_Jeyebo/edit# [09:15] dimitern: this is a script :-) [09:15] jam: looking [09:15] rogpeppe: actually, it is a document describing what we want other people to type [09:15] again, it doesn't matter terribly, but we should be consistent [09:15] jam: i don't expect anyone to actually type that [09:16] rogpeppe: that is what this doc *is about* actually [09:16] rogpeppe: right down what the manual steps are to get things working [09:16] and then maybe we'll script it later [09:16] jam: i realise that, but surely anyone that's doing it will copy/paste? [09:16] jam: rather than manually (and probably wrongly) type it all out by hand [09:16] rogpeppe: well, C&P except they have to edit bits, and its actually small, so they'll just type it, and ... [09:16] jam: i wouldn't trust anyone (including myself) to type out that script by hand [09:17] like "8.3" ADDR=<...>" [09:17] they *can't* just C&P [09:17] jam: i deliberately changed it so that the only bit to edit was that bit [09:18] rogpeppe, btw for the copy/paste to work we need to use the correct arguments, like --ssl instead of -ssl ;) [09:18] dimitern: good catch, done [09:23] so... have we stopped the "stay on the hangout" bit of the day ? [09:25] fwereade: I used my plugin to provision the new node; how are people expected to do it without it (and get a valid agent_name)? 
[09:25] jam, i for one find it a bit distracting tbo [09:25] (just wondering if I should proceed to fix it or not) [09:26] axw: maascli acquire agent_name=XXXXX [09:27] jam: ah :) [09:27] then I shall just let that code sit there for now [09:28] jam: do you think it's worth putting that in the doc? [09:34] axw: well, if you don't mind testing it and finding the exact right syntax, then I'd like it in the doc [09:35] jam: I'll see what I can do before the family gets home [09:36] jam, wallyworld_: reviewed https://codereview.appspot.com/32710043/ [09:36] fwereade: ^ [09:38] rogpeppe: i'll read your comments in detail - the changes i made were what i found i had to do to make the tests pass [09:39] wallyworld_: what was failing? [09:39] otherwise it had issues distinguishing between dead vs extra instances [09:39] a number of provisioner tests [09:39] concerning seeing which instances were stopped [09:40] your proposed code may well work also [09:40] wallyworld_: so that original variable "unknown" didn't actually contain the unknown instances? [09:40] jam: there are other things that StartInstance does for MAAS too, like creating the bridge interface [09:41] wallyworld_: i would very much prefer to change as little logic as possible here [09:41] rogpeppe: it also contained dead ones i think from memory [09:41] cause the dead ones were removed early from machines map [09:41] jam: tho I guess this is moot if they're just doing a bare-metal backup/restore [09:43] axw: so... we should now if this stuff works by going through the checklist we've created. If we really do need something like juju-provision, then we should document it as such. [09:44] jam: the problem is that step 1 is vague as to how to achieve the goal [09:45] axw: so 1.1 in the main doc is about "provision an instance matching the existing as much as possible" [09:45] jam: yeah, how? maybe it's obvious to people seasoned in maas, I don't know [09:46] axw: as in *we need to put it in there* to help people [09:46] it may be your juju-provision [09:46] it may be "maascli do stuff" [09:46] it may be ? [09:46] but we shouldn't have ? in that doc :) [09:46] jam: ok, we're on the same page now: that is what my question was before [09:46] i.e. is there some other way to do this, or do we still need juju-provision [09:48] axw: so we are focused on "manual steps you can do today" in that document, though referencing "there is a script over here you can use" [09:50] jam: ok. well, fwiw that plugin works fine now, so if we can't figure out something better, there's that [09:51] rogpeppe: so i needed to leave the dead machines in the machine map until the allinstances had been checked, so that the difference between nachine map and allinstances really represented unknown machines. after that the dead ones could be processed [10:07] jam: didn't get anywhere with maas-cli; I need to head off now, I'll check in later [10:07] axw: np [10:07] axw: have a good afternoo [10:07] afternoon [10:20] wallyworld_: ok - i'd assumed that unknown really was unknown. i will have a better look at your CL in that light now [10:20] fwereade: i've added a script to change the relation revnos [10:21] rogpeppe, cool, thanks [10:22] fwereade, the procedure as described checks out [10:24] dimitern, awesomesauce [10:24] fwereade, for ec2 ofc, haven't tried the maas parts [10:27] rogpeppe, great, thanks [10:27] dimitern, would you run rog's new change-version tweak against your env too please? [10:28] fwereade, what's that tweak? 
[10:28] dimitern, in the doc: if [[ $agent = unit-* ]] [10:28] then [10:28] sed -i -r 's/change-version: [0-9]+$/change-version: 0/' $agent/state/relations/*/* [10:28] fi [10:28] dimitern, to be run while the unit agent's stopped [10:28] dimitern, it'll trigger a whole round of relation-changed hooks [10:29] dimitern, should be sufficient to bring the environment back into sync with itself even if it was backed up while not in a steady state [10:29] fwereade, i'll try that [10:30] fwereade, wait, which doc? machine doc? [10:31] dimitern, in the canonical source-of-truth doc, in section 8, with the scripts rog write [10:31] fwereade, ah, ok [10:34] fwereade, i can see the hooks, seems fine [10:34] dimitern, sweet [10:46] fwereade, rogpeppe, mgz, jam, TheMue, natefinch, standup time [10:50] mgz, jam, TheMue: https://plus.google.com/hangouts/_/calendar/am9obi5tZWluZWxAY2Fub25pY2FsLmNvbQ.mf0d8r5pfb44m16v9b2n5i29ig [10:53] TheMue: ^^ ? if you want to join [10:53] mgz: ^^ [11:48] fwereade: pushed some changes. wrt the question - can we call processMachines when setting safe mode - what machine ids would i use in that case? [11:48] cause normally the ids come from the changes pushed out by the watcher [11:59] wallyworld_: i think you could probably get all environ machines and use their ids [12:00] rogpeppe: i considered that but in a large environment the performance could be an issue [12:00] wallyworld_: no worse than the provisioner bouncing [12:00] wallyworld_: and this is something that won't happen very often at all, i'd hope [12:01] hmmm ok [12:01] wallyworld_: um, actually... [12:01] i'll look into it [12:01] wallyworld_: perhaps you could pass in an empty slice [12:02] then it wouldn't pick up any dead machines, but may not matter [12:02] wallyworld_: i don't think we'll do anything differently with dead machines between safe and unsafe mode [12:03] wallyworld_: the thing that changes is how we treat instances that aren't in state at all, i think [12:03] i thought about using a nil slice and thought it may be an issue but i can't recall why now. i'll look again [12:04] wallyworld_: BTW you probably only need to call processMachines when provisioner-safe-mode has been turned off [12:05] yep, figured that :-) [12:15] * TheMue => lunch [12:28] wallyworld_: reviewed. sorry for the length of time it took. [12:28] np, thanks. 
i'll take a look [12:32] rogpeppe: with the life == Dead check - if i remove it, wont' we encounter this line else if !params.IsCodeNotProvisioned(err) { [12:32] and exit with an error [12:33] wallyworld_: i don't *think* it's an error to call InstanceId on a dead machine [12:33] well, it will try and find an instance record in the db and fail [12:33] or maybe not [12:34] i think it will only fail once the machine is removed [12:34] i just don't see the point of a rpc round trip [12:34] wallyworld_: i think it will probably work even then [12:34] when it is not needed [12:35] wallyworld_: it is strictly speaking not necessary, yes, but your comment is only necessary because the context that makes the code correct as written is not inside that function [12:36] wallyworld_: it only works if we *know* that stopping contains all dead machines [12:36] yeah it sorta is - the population of stopping and processiing of that [12:36] ok,i see your point [12:36] but [12:37] the comment clears up any confusion [12:37] wallyworld_: i'd prefer robust code to a comment, tbh [12:37] and i hate invoking rpc unless necessary, and we are trusting that we either get an instance id or that specific error always and we are not sure [12:38] calling rpc unnecessarily can be unrobust also [12:38] wallyworld_: i believe it's premature optimisation [12:38] wallyworld_: correctness is much more important here [12:39] eliminating rpc is never premature optimisation [12:39] especially when we can have 1000s of machines [12:39] wallyworld_: *any* optimisation is premature optimisation unless you've measured it [12:39] except for networking calls [12:39] wallyworld_: none of this is on a critical time scale [12:39] they can be indeterminately long [12:39] wallyworld_: it's all happening at leisure [12:40] but, it is a closed system and errors/delays add up [12:40] wallyworld_: look, we're getting the instance ids of every single machine in the environment [12:40] wallyworld_: saving calls for just the dead ones seems like it won't save much at all [12:41] wallyworld_: if we wanted to save time there, we should issue those rpc's concurrently [12:41] fwereade: for the fix for "destroy machines". I'd like to warn if you supply --force but it won't be supported, should that go via logger.Warning or is there something in command.Context we would use? [12:41] there is a Context.Stderr [12:42] rogpeppe: can we absolutely guarantee that for all dead/removed machines, instanceid() will return a value or a not provisioned error? [12:42] jam, I'd write it to context.Stderr, yeah [12:42] wallyworld_: assuming the api server is up, yes [12:43] rogpeppe, wallyworld_: NotFound? [12:43] rogpeppe: we don't have any way to make our RPC server pretend an API doesn't actually exist, right? 
[12:43] fwereade: can't happen [12:43] it would be nice for testing backwards compat [12:43] fwereade: look at the InstanceId implementation [12:44] fwereade: i wouldn't mind an explicit IsNotFound check too though, for extra resilience [12:44] rogpeppe, looks possible to me [12:44] fwereade: if (err == nil && instData.InstanceId == "") || (err != nil && errors.IsNotFoundError(err)) { [12:44] fwereade: err = NotProvisionedError(m.Id()) [12:44] rogpeppe, I'm looking at apiserver [12:44] looks like it will return not found [12:45] looking at api server [12:45] fwereade: ah, it'll fetch the machine first [12:45] that's my issue [12:45] wallyworld_: in which case, check for notfound too [12:45] hence the == dead check [12:45] seems rather fragile [12:45] wallyworld_: will the == dead check help you? [12:45] yes, because that short circuits the need for getting instance if [12:46] id [12:46] so we don't need to guess error codes [12:46] fwereade: I can give a warning, or I can make it an error, thoughts? (juju destroy-machine --force when not supported should try just plain destroy-machine, or just abort ?) [12:46] wallyworld_: can't the machine be removed anyway, even if the machine is not dead? it could become dead and then be removed [12:46] jam, I'd be inclined to error, myself, tbh [12:47] rogpeppe: if it is not dead, there is also processing for that elsewhere [12:48] wallyworld_: i think this code is preventing you from calling processMachines with a nil slice [12:49] which code specifically? [12:49] wallyworld_: the "if m.Life() == params.Dead {" code [12:50] save me looking, how? [12:50] wallyworld_: because stopping doesn't contain *all* stopping machines (your comment there is wrong, i think) [12:51] wallyworld_: it (i *think*) contains all dead machines that we've just been told had their lifecycle change [12:51] yes [12:51] wallyworld_: and this is what makes me think that the code is not robust [12:51] but that's the the current processing does [12:51] looks at changed machines [12:51] wallyworld_: no [12:51] wallyworld_: task.machines contains every machine, i think, doesn't it? [12:52] yes, i meant the ids [12:52] stopping is populated from the ids [12:53] wallyworld_: so, if there's a dead machine that's not in the ids passed to processMachines, its instance id will be processed as unknown, right? [12:53] i think so [12:53] but it would have previous triggered [12:53] wallyworld_: so this code will be wrong if you pass an empty slice to processMachines, yes? [12:54] i'd have to trace it through [12:54] wallyworld_: (which is something that would be good to do) [12:54] wallyworld_: please write the code in such a way that it's obviously correct [12:54] wallyworld_: (which the current code is not, IMHO) [12:54] obviously is subjective [12:55] wallyworld_: ok, *more* obviously :-) [12:55] wallyworld_: "If a machine is dead, it is already in stopping" is an incorrect statement, I believe. Or only coincidentally correct. And thus it seems wrong to me to base the logic around it. [12:56] if a changing machine is dead it is in stoppting [12:56] that assumption still needs to be true [12:56] regardless of if i take out the == dead check [12:57] wallyworld_: thanks [12:57] wallyworld_: "if a changing machine is dead it is in stoppting" is not the invariant you asserted [12:57] what for? 
[12:57] wallyworld_: making the change [12:57] i haven't yet [12:57] wallyworld_: oh, sorry, i misread [12:57] still trying to see if i can rework it [12:58] wallyworld_: i think this code should be robust even in the case that there are dead machines that were not in the latest change event [12:59] oh dammit [12:59] yes. i wonder what the code used to do, i'll look at the old code [13:00] hmm, this code is the only code that removes machines, right? [13:00] i think so [13:01] at first glance, i'm not sure if the old code was immune to the issue of ids not containing all dead machines [13:02] the old code looks like it used to rely on dead machines being notified via incoming ids [13:02] wallyworld_: i *think* it was [13:02] wallyworld_: it certainly relied on that [13:03] so i'm doing something similar here then [13:03] wallyworld_: but the unknown-machine logic didn't rely on the fact that all dead machines were in stopping [13:03] wallyworld_: which your code does [13:03] hmmm. [13:03] wallyworld_, rogpeppe: I'd really prefer to avoid further dependencies on machine status, the pending/error stuff is bad enough as it is [13:03] wallyworld_, fwereade: BTW i can't see any way that a machine that has not been removed could return a not-found error from the api InstanceId call, can you? [13:04] fwereade: i'm not quite sure what you mean there [13:04] rogpeppe, wallyworld_: the "stopping" sounded like a reference to the status -- as in SetStatus [13:04] fwereade: nope [13:05] rogpeppe, wallyworld_: ok sorry :) [13:05] fwereade: i'm talking about the stopping slice in provisioner_task.go [13:05] fwereade: and in particular to the comment at line 288 of the proposal: [13:05] what he said [13:05] // If a machine is dead, it is already in stopping and [13:05] 289 // will be deleted from instances below. There's no need to [13:05] 290 // look at instance id. 
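On wallyworld_'s question above, whether InstanceId is guaranteed to return either a value or a not-provisioned error: the more defensive shape rogpeppe suggests also tolerates not-found, so a machine being removed before the provisioner gets to it is harmless. A self-contained sketch with stand-in error helpers; in juju-core these checks would be params.IsCodeNotProvisioned plus an explicit not-found test rather than the locals defined here.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the API error classification; see the note above.
var (
	errNotProvisioned = errors.New("machine not provisioned")
	errNotFound       = errors.New("machine not found")
)

func isNotProvisioned(err error) bool { return err == errNotProvisioned }
func isNotFound(err error) bool       { return err == errNotFound }

// instanceIDIfAny treats "not provisioned" and "already removed" identically:
// in both cases there is simply no instance id for the provisioner to act on.
func instanceIDIfAny(getInstanceID func() (string, error)) (string, bool, error) {
	id, err := getInstanceID()
	switch {
	case err == nil:
		return id, true, nil
	case isNotProvisioned(err), isNotFound(err):
		return "", false, nil
	}
	return "", false, err
}

func main() {
	_, ok, err := instanceIDIfAny(func() (string, error) { return "", errNotFound })
	fmt.Println(ok, err) // false <nil>: nothing to stop, and not an error
}
```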
[13:06] rogpeppe, wallyworld_: wrt machine removal: destroy-machine --force *will* remove from state, but I'd be fine just dropping that last line in the cleanup method and leaving the provisioner to finally remove it [13:06] fwereade: this discussion is stemming from my remark on that comment [13:06] fwereade: that would be much better [13:06] one place to remove is best [13:06] fwereade: otherwise we can leak that machine's instance id [13:06] fwereade: if we're in safe mode === gary_poster|away is now known as gary_poster [13:07] rogpeppe, wallyworld_: I saw I'd done that the other day and thought "you idiot", for I think exactly the same reasons, consider a fix for that pre-blessed [13:07] rogpeppe: i think i can see your point [13:08] wallyworld_: phew :-) [13:08] that stopping won't contain all dead machines [13:08] sorry, it's late here, i'm tired, that's my excuse :-) [13:08] wallyworld_: np [13:08] wallyworld_: thing is, it *probably* does, but i don't think it's an invariant we want to rely on implicitly [13:09] i was originally worried about the error fragility [13:09] wallyworld_: especially because we can usefully break that invariant to good effect (by passing an empty slice to processMachines) [13:09] i'm still quite concerned about all the rpc calls we make (in general) [13:10] wallyworld_: an extra piece of code explicitly ignoring a not-found error too would probably be a good thing to add [13:10] ok [13:10] wallyworld_: well, me too, but you'll only be saving a tiny fraction of them here [13:10] yeah, we really need a bulk instance id call - i thought all our apis were supposed to be bulk [13:11] putting remote interfaces on domain objects eg machine is also wrong, but thats another discussion [13:12] imagine a telco with 10000 or more machines [13:12] wallyworld_: they are, kinda, but a) we don't make them available to the client and b) we don't implement any server-side optimisation that would make it significantly more efficient [13:12] well, here the provisioner is a client [13:12] task [13:12] wallyworld_: if we had 10000 or more machines, we would not want to process them all in a single bulk api call anyway [13:13] wallyworld_: indeed [13:13] sure, but that optimisation can be done under the covers [13:13] the bulk api can batch [13:13] so bottom line - we can't claim to scale well just yet [13:13] more work to do [13:14] wallyworld_: to be honest, just making concurrent API calls here would yield a perfectly sufficient amount of speedup, even in the 10000 machine case, i think [13:14] wallyworld_: without any need for more mechanism [13:14] you mean using go routines? [13:15] wallyworld_: yeah [13:15] well, that could happen under the covers [13:15] but we need to expose a bulk api to callers [13:15] wallyworld_: i'm not entirely convinced. [13:15] and then the implementation can decide how best to do it [13:16] wallyworld_: the caller may well want to do many kinds of operation at the same time. 
bulk calls are like vector ops - they only allow a single kind of op to be processed many times [13:16] wallyworld_: that may not map well to the caller's requirements [13:16] yes, which is why remote apis need to be desinged th match the workflow [13:16] wallyworld_: agreed [13:16] ours are just a remoting layer on top of server methods [13:17] which is kinda sad [13:17] wallyworld_: which is why i think that one-size-fits all is not a good fit for bulk methods [13:17] wallyworld_: actually, it's perfectly sufficient, even for implementing bulk calls [13:17] all remote methods should be bulk, but how stuff is accumulated up for the call is workflow dependent [13:18] wallyworld_: it's just a name space mechanism [13:18] anytime a remote method call is O(N) is bad [13:18] wallyworld_: there are many calls where a bulk version of the call is inevitably O(n) [13:18] it should't be if designed right [13:19] to match the workflow [13:19] wallyworld_: if i'm adding n services, how can that not be O(n) ? [13:19] what i mean is - if you have N objects, you don't make N remote calls to get info on each one [13:19] i don't mean the size of the api [13:19] but the call frequency [13:20] to get stuff done [13:20] wallyworld_: if calls can be made concurrently (which they can), then the overall time can still be O(1) [13:20] the client should not have to manually do that boiler plate [13:20] wallyworld_: assuming perfect concurrency at the server side of course :-) [13:20] wallyworld_: now that's a different argument, one of convenience [13:21] so imagine if you downloaded a file and the networking stack made you as a client figure out how to chunk it [13:21] wallyworld_: personally, i think it's reasonable that API calls are exactly as easy to make concurrent as calling any other function in Go [13:21] no - rpc calls should never be treated like normal calls [13:21] wallyworld_: it does [13:22] networked calls are always different [13:22] wallyworld_: i disagree totally [13:22] so, you've never read the 7 falicies of neworked code or whatever that paper is called? [13:22] wallyworld_: any time you call http.Get, it looks like a normal call but is networking under the hood. 
[13:23] wallyworld_: we should not assume that it cannot fail, of course [13:23] wallyworld_: and that's probably one of the central fallacies [13:23] people know http get is networked at do tend to programme aroud it accordingly [13:23] wallyworld_: but a function works well to encapsulate arbitrary network logic [13:24] wallyworld_: sure, you should probably *know* that it's interacting with the network, but that doesn't mean that calling a function that interacts with the network in some way is totally different from calling any other function that interacts in some way with global state [13:24] wallyworld_: in a way that can potentially fail [13:25] it is different - networks can disappear, have arbitary lag, different failure modes etc etc [13:25] the programming model is different [13:25] wallyworld_: not really - the function returns an error - you deal with that error [13:26] it is different at a higher level that that [13:26] wallyworld_: i don't believe that any network interaction breaks all encapsulation [13:26] see http://www.rgoarchitects.com/files/fallacies.pdf [13:26] wallyworld_: which is what i think you're saying [13:27] wallyworld_: i have seen that [13:27] wallyworld_: i'm not sure how encapsulating a networking operation in a function that returns an error goes against any of that [13:27] the apis design, error handling and all sorts of other things are different when dealing with networked apis [13:28] the encapsulation isn;t the issue [13:28] it's the whole api design [13:28] and underlying assumptions abut how such apis can be called [13:28] wallyworld_: i don't understand [13:29] case in point - it might make sense to call instanceId() once per 10000 machines when inside a service where a machine domain object is colocated, but it is madness to do that over a network [13:30] the whole api decomposiiton, assumptoons about errors, retries etc needs to be different for networked apis [13:30] wallyworld_: so, there's no reason that where we need it, we couldn't have State.InstanceIds(machineIds ...string) as well as Machine.InstanceId [13:31] we should never have machine.InstanceId() - networked calls do not belong on domain objects but services [13:31] wallyworld_: well, it's certainly true that some designs can make that necessary; eventual consistency for one breaks a lot of encapulation [13:31] thats the big mistake java made with EJB 1.0 [13:31] and it took a decade to recover [13:32] wallyworld_: what's the difference between machine.InstanceId() and InstanceId(machine) ? [13:32] domain objects encapsulate state; they shouldn't call out to services [13:33] dimitern: trivial review of backporting your rpc.IsNoSuchRPC to 1.16: https://codereview.appspot.com/32850043 [13:33] the first example above promotes single api calls [13:33] which is bad [13:33] wallyworld_: and the second one doesn't? [13:33] wallyworld_, looking [13:33] the second should be a bulk call on a service [13:34] wallyworld_: even if it doesn't make sense to be a bulk call? [13:34] wallyworld_, the diff is messy [13:34] wallyworld_: anyway, i think this is somewhat of a religious argument :-) [13:34] dimitern: did you mean jam ? [13:34] wallyworld_: we should continue at some future point, over a beer. 
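rogpeppe's point above, that issuing the per-machine calls concurrently buys most of the speed-up without any new API mechanism, is easy to sketch. The lookup callback below is a stand-in for a call such as fetching a machine's instance id; the semaphore keeps the number of in-flight requests bounded so a 10000-machine environment does not open 10000 connections at once. Errors are dropped for brevity; a real task would record them.

```go
package main

import (
	"fmt"
	"sync"
)

// concurrentLookups fans per-id calls out over goroutines, capping how many
// run at a time, and collects the successful results.
func concurrentLookups(ids []string, limit int, lookup func(string) (string, error)) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(ids))
		sem     = make(chan struct{}, limit) // caps in-flight calls
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			if inst, err := lookup(id); err == nil {
				mu.Lock()
				results[id] = inst
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return results
}

func main() {
	ids := []string{"0", "1", "2", "3"}
	out := concurrentLookups(ids, 2, func(id string) (string, error) {
		return "i-" + id, nil // pretend API call
	})
	fmt.Println(out)
}
```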
[13:35] rogpeppe: it always makes sense to provide bulk calls, and if there happens to be only one, just pass that in as a single elemnt array [13:35] yes [13:35] jam, oops yes [13:35] wallyworld_: i'm distracting you :-) [13:35] yes [13:35] :-) [13:35] dimitern: the diff looks clean here, is it because of unified vs side-by-side? [13:35] i've seen too many systems fall over due to the issues i am highlighting [13:36] I have "old chunk mismatch" in side-by-side but it looks good in unified, I think [13:36] ugh, it is targetting trunk [13:36] jam, yeah, the s-by-s diff is missing [13:36] I thought I stopped it in time [13:36] dimitern: so I'll repropose, lbox broke stuff [13:36] you can look at the unified diff, and that will tell you what you'll see in a minute or so [13:36] jam, cheers [13:40] dimitern: https://codereview.appspot.com/32860043/ updated [13:41] jam, lgtm, thanks [13:47] dimitern, fwereade: if you want to give it a review, this is the "compat with 1.16.3" for 1.16.4 destroy-machines, on the plus side, we *don't* have to fix DestroyUnit because that API *did* exist. (GUI didn't think about Machine or Environment, but it *did* think about Units) [13:47] https://codereview.appspot.com/32880043 [13:47] jam, looking [13:51] jam, lgtm [13:51] fwereade: do you want to give an eyeball if that seems to be a reasonable way to do compat code? We'll be using it as a template for future compat [13:52] jam,will do, we have that meeting in a sec [13:52] fwereade: sure, but it is 1hr past my EOD, and my son needs me to take him to McDonalds :) [13:53] jam, ok then, I will look as soon as I can, thanks [13:53] fwereade: no rush on your end [13:53] I think it is ~ok, though I'd *love* to actually have tests that compat is working [13:53] rogpeppe: more changes pushed. but calling processMachines(nil) hangs the tests so that bit is not there yet [13:54] sinzui: maybe we could do cross version compat testing in CI for stuff we know changed? [13:54] I could help write those tests [13:54] wallyworld_, might processMachines(nil) be a problem if the machines map is empty? [13:54] wallyworld_: looking [13:55] wallyworld_: could you propose again? i'm getting chunk mismatch [13:55] fwereade: could be, i haven't traced through the issue yet fully. not sure how much further i'll get tonight, it's almost midnight and i'm having trouble staying awake [13:55] wallyworld_, ok, stop now :) [13:55] wallyworld_, tired code sucks [13:56] wallyworld_, landing it now will not make the world of difference [13:56] yep. i don't have to be tired to write sucky code :-) [13:56] wallyworld_, fwereade: i could try to take it forward. mgz is now online so can probably take the bootstrap-update stuff forward [13:56] wallyworld_, ;p [13:56] rogpeppe, wallyworld_, mgz: if that works for you all, go for it [13:56] or, it probably doesn't make much difference, as fwereade says [13:56] rogpeppe: i pushed again [13:56] wallyworld_: thanks [13:58] wallyworld_: you need to lbox propose again. [13:58] wallyworld_: oh, hold on! [13:58] a thrid time? [13:58] wallyworld_: page reload doesn't work, i now remember [13:58] * wallyworld_ hates reitveld [13:59] wallyworld_: ah, it works, thanks! [13:59] wallyworld_: that bit is really shite, it's true [13:59] wallyworld_: i saw a proposal recently to fix the upload logic [13:59] hope they land it soon [14:00] wallyworld_: it would be nice if the whole thing was a little more web 2.0, so you didn't have to roundtrip to the server all the time. 
[14:00] yeah [14:01] that also messes up browser history [14:03] jam, I had the same idea. I added it to my proposal of what we want to see about a commit in CI https://docs.google.com/a/canonical.com/spreadsheet/ccc?key=0AoY1kjOB7rrcdEl3dWl0NUM3RzE2dXFxcGxwbVZtUFE&usp=drive_web#gid=0 [14:06] wallyworld_: i think i know why your processMachines(nil) call might be failing [14:06] ok [14:07] wallyworld_: were you calling it from inside SetSafeMode? [14:07] yeah [14:07] wallyworld_: thought so. that's not good - it needs to be called within the main provisioner task look [14:07] s/look/loop/ [14:08] ok [14:08] wallyworld_: so i think the best way to do that is with a channel rather than using a mutex [14:08] rogpeppe: but setsafemode is called from the loop [14:08] wallyworld_: it is? [14:09] ah, provisioner loop [14:09] wallyworld_: yup [14:09] not provisioner task [14:09] wallyworld_: indeed [14:09] save me tracing through the code, why does it matter? [14:10] wallyworld_: because there is lots of logic in the provisioner task that relies on single-threaded access (all the state variables in environProvisioner) [14:10] wallyworld_: that's why we didn't need a mutex there [14:11] makes sense [14:13] wallyworld_: you'll have to be a bit careful with the channel (you probably don't want the provisioner main loop to block if the provisioner task isn't ready to receive) [14:14] yeah, channels can be tricky like that [14:14] if anyone has a moment, i would appreciate a review of this trivial that resolves two issues with manual provider, https://code.launchpad.net/~hazmat/juju-core/manual-provider-fixes [14:15] wallyworld_: this kind of idiom can be helpful: http://paste.ubuntu.com/6479150/ [14:16] rogpeppe: thanks, i'll look to use something like that [14:16] wallyworld_: it works well when there's a single producer and consumer [14:17] hazmat: i'll look when the diffs are available. codereview would be more conventional. [14:18] rogpeppe, doh. [14:18] rogpeppe, its a 6 line diff fwiw [14:18] hazmat: lp says "An updated diff will be available in a few minutes. Reload to see the changes." [14:19] http://bazaar.launchpad.net/~hazmat/juju-core/manual-provider-fixes/revision/2095 [14:19] * hazmat lboxes [14:22] rogpeppe, https://codereview.appspot.com/32890043 [14:24] hazmat: axw_ might have some comments on the LookupAddr change. [14:24] rogpeppe, what it was doing previously was broken [14:24] hazmat: it looks like it was done like that deliberately. [14:24] hazmat: agreed. [14:25] rogpeppe, yes deliberately broken, i've already discussed with axw [14:25] hazmat: it should at the least fall back to the original address [14:25] rogpeppe, it hangs indefinitely [14:25] hazmat: ok, if you've already discussed, that's fune [14:25] fine [14:25] rogpeppe, and there's no reason for requiring dns name [14:25] hazmat: hmm, hangs indefinitely? [14:26] hazmat: ah, if it doesn't resolve, then WaitDNSName will loop [14:26] hazmat: yeah, i think that's fair enough. the only thing i was wondering was if something in the manual provider used the address to name the instance [14:26] hazmat: but even then, a numeric address should be fine [14:27] yes.. slavish adherence to name is name, when name is actually address and the api should get renamed. [14:27] to name is the issue [14:27] hazmat: yeah. [14:28] * hazmat grabs a cup of coffee [14:28] hazmat: i think the api was originally named after the ec2 name [14:37] sinzui: * vs x is ? [14:37] stuff that is done, vs proposed ? 
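The idiom rogpeppe pastes above (paste.ubuntu.com/6479150) is not preserved in this log. A minimal sketch of one common shape for it, assuming a single producer and a single consumer: a buffered "kick" channel of capacity one lets something like SetSafeMode ask the task's main loop to re-run processMachines without ever blocking the sender.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Capacity 1: if a poke is already pending, further pokes are dropped,
	// so the producer never blocks waiting on the consumer.
	kick := make(chan struct{}, 1)

	// Consumer: stands in for the provisioner task's main loop.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range kick {
			fmt.Println("processing machines")
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Producer: stands in for SetSafeMode asking the loop to re-check.
	poke := func() {
		select {
		case kick <- struct{}{}:
		default: // a poke is already queued; nothing to do
		}
	}

	for i := 0; i < 5; i++ {
		poke()
	}
	close(kick)
	<-done
}
```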
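On the manual-provider review above: the behaviour rogpeppe and hazmat converge on is to fall back to the plain address when reverse DNS has nothing, instead of insisting on a name (the insistence that made WaitDNSName loop). A sketch of that fallback, not the code in the actual merge proposal:

```go
package main

import (
	"fmt"
	"net"
)

// nameOrAddr prefers a reverse-DNS name for the host but falls back to the
// address itself when no PTR record exists, rather than treating that as fatal.
func nameOrAddr(addr string) string {
	if names, err := net.LookupAddr(addr); err == nil && len(names) > 0 {
		return names[0]
	}
	return addr
}

func main() {
	fmt.Println(nameOrAddr("8.8.8.8"))    // usually has a PTR record
	fmt.Println(nameOrAddr("192.0.2.10")) // TEST-NET-1, no PTR: prints the address
}
```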
[14:37] or stuff that is done but failing tests [14:39] sinzui: if you can give me a template or some sort of process to write tests for you, I can do a couple [14:40] jam, in15 minutes I can [14:52] rogpeppe, thanks for the review, replied and pushed. [14:53] hazmat: looking [14:54] hazmat: LGTM [15:06] sinzui: no rush on my end. I'm EOD and just stopping by IRC from time to time [15:07] jam, okay, I will send an email to the juju-dev list so that the knowledge is documented somewhere [15:10] is there a way to move a window that's off the screen back onto the screen? I know windows tricks to do it, but not linux. (and I know about workspaces, I'm not using them) [15:11] natefinch: i enabled workspaces for that reason only [15:11] rogpeppe: heh, well, maybe I should turn them back on [15:15] natefinch: if you click on the workspace icon in the bar you'll get all four and can move windows [15:16] TheMue: I had workspaces off.... I think Ubuntu just gets confused when I go from one monitor to multiple monitors and back again === teknico_ is now known as teknico [15:17] natefinch: computers don't have to have more than one monitor *tryingToSoundPowerful* [15:17] ;) [15:17] haha [15:18] And I turned off workspaces because the keyboard shortcuts don't work :/ [15:52] * rogpeppe goes for a bite to eat [16:03] are we doing 2 LGTM for branches or one? [16:04] one [16:05] natefinch, thanks [16:50] is there a known failing test in trunk? [16:50] ie cd juju-core/juju && go test -> http://pastebin.ubuntu.com/6479834/ [16:50] hazmat, which one? [16:51] hazmat: thats a pretty common sporadic failure, yes. [16:51] hazmat, yeah, that's known [16:51] hazmat, it's pretty random to reproduce [16:51] hazmat: if you have a way of reliably reproducing it, i want to know [16:51] k, it seems to happen fairly regularly for me [16:52] rogpeppe, atm on my local laptop i can reproduce every time.. generating verbose logs atm [16:53] hazmat: do you get it when running the juju package tests on their own? [16:53] rogpeppe, here's verbose logs on the same http://paste.ubuntu.com/6479841/ [16:53] rogpeppe, yes i do [16:53] hazmat: and this is on trunk? [16:53] hazmat, can you check your /tmp folder to see and suspicious things - like too many mongo dirs or gocheck dirs? [16:53] rogpeppe, if i just run -gocheck.f "DeployTest*" i don't get failure [16:54] dimitern, not much in /tmp three go-build* dirs [16:54] rogpeppe, yes on trunk [16:54] hazmat, ok, so it's not related then [16:55] hazmat, running a netstat dump of open/closing/pending sockets to mongo might help [16:55] hazmat: is it always TestDeployForceMachineIdWithContainer that fails? [16:56] rogpeppe, checking.. its failed a few times on that one.. every time.. not sure [16:56] rogpeppe, yeah.. it does seem to happen primarily on that one [16:56] hazmat: how about: go test -gocheck.f DeploySuite ? [16:56] rogpeppe, i think that works fine.. its just testing the whole package that fails [16:56] yeah. that works fine [16:57] hmm [16:57] hazmat: i'd quite like to try bisecting to see which other tests cause it to fail [16:57] rogpeppe, hold on a sec.. your cli for gocheck.f results in zero tests [16:57] hazmat: oops, sorry, DeployLocalSuite [16:58] hazmat: go test -gocheck.list will give you a list of all the tests it's running [16:58] yeah.. all tests pass [16:58] if running just that suite [16:58] hazmat: ok... [16:58] hazmat: how about go test -gocheck.f 'DeploySuite|ConnSuite' ? 
[16:59] rogpeppe, thanks for the trip re -gocheck.list [16:59] rogpeppe, that fails running both different test failure DeployLocalSuite.TestDeploySettingsError [16:59] same error [17:00] hazmat: good [17:00] hazmat: now how about go test -gocheck.f 'DeploySuite|^ConnSuite' ? [17:00] rogpeppe, fwiw re Deploy|Conn -> http://paste.ubuntu.com/6479877/ [17:01] hazmat: oops, that doesn't match what i thought it would [17:02] rogpeppe, yeah.. it runs both still [17:02] rogpeppe, you meant this ? go test -v -gocheck.vv -gocheck.f 'DeployLocalSuite|!NewConnSuite' [17:02] hazmat: ok, instead of juggling regexps, how about putting c.Skip("something") in the SetUpSuite of all the suites except NewConnSuite, ConnSuite and DeployLocalSuite? [17:03] hazmat: no, i was trying to specifically exclude ConnSuite [17:03] rogpeppe, thats what it does [17:03] rogpeppe, that cli only runs deploy local suite tests [17:07] hazmat: hopefully you can then run go test and it'll still fail [17:08] rogpeppe, so it passes with 'NewConnSuite|ConnSuite' and fails if i add |DeployLocalSuite [17:08] hazmat: then we can try skipping NewConnSuite [17:08] k [17:08] rogpeppe, fails with ConnSuite|DeployLocalSuite [17:09] hazmat: woo [17:10] hazmat: does anything change if you comment out the "if s.conn == nil { return }" line in ConnSuite.TearDownTest ? [17:13] rogpeppe, no.. still fails with ConnSuite|DeployLocalSuite and that part commented out [17:13] hazmat: ok, that was a long shot :-) [17:14] hazmat: could you skip all the tests in connsuite, then gradually reenable and see when things start failing again? [17:15] rogpeppe, sure [17:15] hazmat: hold on, i might see it [17:15] hazmat: try skipping just TestNewConnFromState first [17:16] hazmat: oh, no, that's rubbish [17:16] hazmat: ignore [17:16] hazmat: but ConnSuite does seem to be an enabler for the DeployLocalSuite failure, so i'd like to know what it is that's the trigger [17:17] rogpeppe, lunch break, back in 20 [17:17] hazmat: k [17:20] back, and walking through the tests [17:23] rogpeppe, interesting.. i added a skip to the top of every test method in ConnSuite, and it still fails when doing ConnSuite|DeployLocalSuite [17:24] hazmat: ah ha! i wondered if that might happen [17:25] hazmat: what happens if you actually comment out (or rename as something not starting with "Test") the test methods in ConnSuite? [17:27] rogpeppe, what i'm seeing when it happens on my machine, is that the SetUpTest (or SetUpSuite - can't remember exactly) is the thing that fails [17:27] dimitern: which SetUpTest? [17:27] which causes one of a few tests to fail [17:27] rogpeppe, odd.. that gets a failure (deploymachineforceid), but effectively renaming all the tests negates the suite so... it should be equivalent to running DeployLocalSuite by itself.. which still works for me. [17:27] rogpeppe, DeployLocalSuite - always [17:27] hmm.. rerunning gets failure on DeployLocalSuite.TestDeployWithForceMachineRejectsTooManyUnits [17:27] dimitern: i'm very surprised it's SetUpTest, because i don't think that checks for state connection closing [17:28] hazmat: that's which which tests commented out? 
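rogpeppe's suggestion above, putting c.Skip in SetUpSuite instead of juggling -gocheck.f regexps, looks like the following. The suite and test are stand-ins, and the import path is the launchpad one juju-core used at the time.

```go
package juju_test

import (
	"testing"

	gc "launchpad.net/gocheck"
)

// Hook gocheck into go test so the registered suites run.
func TestPackage(t *testing.T) {
	gc.TestingT(t)
}

type ConnSuite struct{}

var _ = gc.Suite(&ConnSuite{})

// Skipping from SetUpSuite disables every test in this suite while the rest
// of the package's suites keep running, which is what bisection needs.
func (s *ConnSuite) SetUpSuite(c *gc.C) {
	c.Skip("disabled while bisecting the intermittent juju package failure")
}

func (s *ConnSuite) TestSomething(c *gc.C) {
	c.Fatal("never runs while the suite is skipped")
}
```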
[17:28] TearDownTest that fails for me [17:28] s/which/with/ [17:28] dimitern: i think it's usually TearDownTest because that calls MgoSuite.TearDownSuite [17:28] rogpeppe, yes that's with tests prefixed with XTest, the suite doesn't show up at all in -gocheck.list [17:28] TearDownTest, of course [17:29] hazmat, rogpeppe, ha, yes - it was TearDownTest in fact with me as well [17:29] hazmat: interesting [17:29] hazmat: so just to sanity check, you still see failures if you comment out or delete all except SetUpSuite and TearDownSuite in ConnSuite? [17:30] k [17:30] but can't reproduce it consistently - maybe one in 10 runs, but maybe not, and only when I run all the tests from the root dir [17:30] dimitern: i can't reproduce it even that reliably [17:30] dimitern: which why i get excited when someone can :-) [17:30] which is why... [17:31] rogpeppe, yeah.. still fails [17:31] rogpeppe, even with everything commented but the suite setup/teardown [17:31] hazmat: now we're starting to get suitably weird [17:31] rogpeppe, and still passes if i run DeployLocalSuite in isolation [17:31] hazmat, version of go? [17:31] 1.1.2 [17:32] maybe it's something related to the parallelizing of tests that gocheck does? [17:32] hazmat: again to sanity check, does it pass if you comment out the MgoSuite.(SetUp|TearDown)Suite calls in ConnSuite? [17:32] i can switch versions of go if that helps.. i was running trunk of go for a little while, but it's pretty broken with juju (and go trunk) [17:32] hazmat, no, i'm on 1.1.2 as well [17:32] hazmat: please don't switch now! [17:32] :-) ok [17:33] :) [17:33] * dimitern brb [17:33] hazmat: (though FWIW i'm using go 1.2rc2) [17:33] rogpeppe, i had lots of issues with ec2/s3 and trunk.. (roughly close to 1.2rc2) couldn't even bootstrap [17:34] which is why i walked back to 1.1.2 [17:34] hazmat: weird. i've had no probs. [17:34] hazmat: i hope you filed bug reports [17:34] rogpeppe, something for another time.. no i didn't.. i've fallen out of the habit of filing bug reports.. i should get back into it [17:34] rogpeppe, so that still fails with the MgoSuite teardown/setup calls commented in ConnSuite [17:35] hazmat: oh damn [17:35] hazmat: now that's even weirder [17:35] hazmat: what if you comment out the LoggingSuite calls? [17:36] hazmat: (leaving ConnSuite as a do-nothing-at-all test suite) [17:36] rogpeppe, sorry i think i missed something on the mgo teardown, revisiting [17:36] i had commented it out in setup/teardown on test not suite [17:37] commenting out setup/teardown on suite first [17:37] er.. on test [17:38] sinzui, re this bug, it's reproducible for me with JUJU_ENV set.. currently marked incomplete https://bugs.launchpad.net/juju-core/+bug/1250285 [17:38] <_mup_> Bug #1250285: juju switch -l does not return list of env names [17:38] okay.. still fails with test tear/setup commented.. moving on to mgo comments in suite tear/setup [17:39] and still fails with mgo commented in connsuite tear/setup [17:39] hazmat: given that there are no tests in that suite, i wouldn't expect test setup/teardown to make a difference [17:39] hazmat: in connsuite suite setup/teardown? [17:39] rogpeppe, yeah.. i suspect it's actually an issue in DeployLocalSuite, and running with anything additional catches it.
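For context on why TearDownTest keeps coming up: juju-core's conn suites are composed from embedded fixtures, and the teardown chain is where shared mongo state gets checked and reset. A simplified sketch of that shape (the embedded fixtures here are empty stand-ins; the real LoggingSuite and MgoSuite do much more):

    package demo_test

    import (
        "testing"

        gc "launchpad.net/gocheck"
    )

    func TestPackage(t *testing.T) { gc.TestingT(t) }

    // Empty stand-ins for the real fixtures.
    type LoggingSuite struct{}

    func (s *LoggingSuite) SetUpTest(c *gc.C)    {}
    func (s *LoggingSuite) TearDownTest(c *gc.C) {}

    type MgoSuite struct{}

    func (s *MgoSuite) SetUpTest(c *gc.C) {}

    // In the real fixture, teardown is where leftover sessions or credentials
    // from an earlier test tend to surface, because it resets the shared mongod.
    func (s *MgoSuite) TearDownTest(c *gc.C) {}

    // The composed suite must chain the fixture calls explicitly; these chained
    // calls are what is being commented out piece by piece in the bisection above.
    type ConnSuite struct {
        LoggingSuite
        MgoSuite
    }

    var _ = gc.Suite(&ConnSuite{})

    func (s *ConnSuite) SetUpTest(c *gc.C) {
        s.LoggingSuite.SetUpTest(c)
        s.MgoSuite.SetUpTest(c)
    }

    func (s *ConnSuite) TearDownTest(c *gc.C) {
        s.MgoSuite.TearDownTest(c)
        s.LoggingSuite.TearDownTest(c)
    }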
[17:40] hazmat, I will test that bug again, oh and I think you and rogpeppe are looking at the mgo test teardown that affects me [17:40] hazmat: i think so too, but i can't see how running LoggingSuite.SetUpTest and TearDownTest could affect anything [17:40] rogpeppe, for ref here's my current connsuite http://paste.ubuntu.com/6480040/ [17:41] ConnSuite is basically empty with only suite tear/setup methods that do nothing [17:41] hazmat: oh, i thought you were skipping NewConnSuite (and the other suites) [17:41] rogpeppe, i'm only running go test -v -gocheck.vv -gocheck.f 'ConnSuite|DeployLocalSuite' [17:41] hazmat: that will still run NewConnSuite [17:42] hazmat: could you comment out or delete or skip NewConnSuite? [17:42] hazmat: or just comment out line 46 [17:42] oh.. [17:42] rogpeppe, sorry for the confusion then.. okay, backtracking [17:43] hazmat: np, it's so easy to do when trying to search for bugs blindly like this. [17:46] rogpeppe, so correctly running just ConnTestSuite and DeployLocalSuite works [17:47] hazmat: ok, so... you know what to do :-) [17:47] indeed [17:47] hazmat: thanks a lot for going at this BTW [17:47] hazmat: it's much appreciated [17:48] rogpeppe, np.. it's annoying having intermittent test failures, esp with async CI merges [17:48] hazmat: absolutely [17:51] natefinch, fwereade: I have pushed juju tagged 1.16.2 plus the juju-update-bootstrap command to lp:~juju/juju-core/1.16.2+update [17:53] mgz, great, thanks -- I've got to be off, I'm afraid, would you please reply to the mail so ian knows where to go? and nate, please test when you get a mo [17:53] natefinch, I'll try to be back on to hand over to ian at least [17:53] fwereade: no problem [17:54] fwereade: replying to your hotfix branch email now [17:56] rogpeppe, so it's not an exact test failure, it's some subset of the NewConnSuite .. still playing with it, but this is the current minimal set of tests to failure http://pastebin.ubuntu.com/6480107/... [17:57] hazmat: if you could get to a stage where you can't remove any more tests without it passing, that would be great [18:00] hazmat: actually, i have a glimmer of suspicion. each time you run the tests, could you pipe the output through timestamp (go get code.google.com/p/rog-go/cmd/timestamp). i'm wondering if there's something time related going on in the background. [18:01] hazmat: it's probably nothing though [18:02] there's a certain amount of randomness to it.. so it's quite possible [18:21] rogpeppe, so i think i have some progress. i can get both suites running reliably minus one test.. TestConnStateSecretsSideEffect [18:22] hazmat: cool [18:23] hazmat: so if you skip that test and revert everything else, everything passes reliably for you? [18:25] just leaving that one test commented out, the entire package test suite succeeds (running everything 5 times to account for intermittent failures) [18:25] yeah.. reliably passes minus that test [18:25] hazmat: great [18:26] * hazmat files a bug to capture it [18:26] hazmat: out of interest, what happens if you comment out the SetAdminMongoPassword line? [18:27] fwiw filed as https://bugs.launchpad.net/juju-core/+bug/1255207 [18:27] <_mup_> Bug #1255207: intermittent test failures on package juju-core/juju [18:29] rogpeppe, that seems to do the trick, still verifying.. found a random panic.. on Panic: local error: bad record MAC (PC=0x414311) but unrelated i think [18:29] hazmat: i *think* that's unrelated, but i have also seen that. [18:31] rogpeppe, yeah..
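The culprit that emerges here, TestConnStateSecretsSideEffect, is a test whose side effect (setting the admin mongo password) outlives the test and trips up a later suite's teardown. A self-contained analogy of that failure mode, with a plain package variable standing in for the mongod admin password that SetAdminMongoPassword mutates (suite names are illustrative):

    package demo_test

    import (
        "testing"

        gc "launchpad.net/gocheck"
    )

    func TestPackage(t *testing.T) { gc.TestingT(t) }

    // adminPassword stands in for process-wide shared state: anything one
    // suite writes here is visible to every suite that runs after it.
    var adminPassword string

    type SecretsSuite struct{}
    type LaterSuite struct{}

    var _ = gc.Suite(&SecretsSuite{})
    var _ = gc.Suite(&LaterSuite{})

    func (s *SecretsSuite) TestSecretsSideEffect(c *gc.C) {
        adminPassword = "secret" // the side effect under discussion
        // The one-liner fix amounts to undoing it before returning;
        // delete this defer and LaterSuite's teardown starts failing.
        defer func() { adminPassword = "" }()
    }

    func (s *LaterSuite) TestSomething(c *gc.C) {}

    // A later suite's teardown only succeeds while the shared state is back
    // in its default, unauthenticated form.
    func (s *LaterSuite) TearDownTest(c *gc.C) {
        c.Assert(adminPassword, gc.Equals, "")
    }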
passed 20 runs with that one-liner fix [18:31] hazmat: could you paste the output of go test -gocheck.vv with that fix please? [18:31] pwd [18:32] also verified i can still get the error with the line back in.. output coming up [18:33] rogpeppe, http://paste.ubuntu.com/6480306/ [18:34] hazmat: ok, line 667 is what i was expecting [18:35] hazmat: there's something odd going on with the mongo password logic [18:35] hazmat: what version of mongod are you using, BTW? [18:35] 2.4.6 [18:35] hazmat: ahhh, maybe that's the difference [18:36] hazmat: where did you get it from? [18:36] hazmat: i'm using 2.2.4 BTW [18:36] rogpeppe, 2.4.6 is everywhere i think.. [18:36] rogpeppe, it's the package in saucy and it's in the cloud-archive [18:36] tools pocket [18:36] hazmat: ah, i'm still on raring [18:37] cloud-archive tools pocket means that's what we use in prod setups on precise.. [18:37] rogpeppe, driveby: it's what we install everywhere and should be using ourselves [18:38] fwereade: i know, but i had an awful time upgrading to raring (took me weeks to recover) and i've heard that saucy has terrible battery life probs [18:38] fwereade: and i really rely on my battery a lot [18:38] not really noticed anything bad [18:38] the btrfs improvements are very nice [18:38] with the new kernel [18:39] battery life impact seems pretty minimal, but maybe a few percent [18:39] rogpeppe, alternatively you can just install the latest mongodb [18:40] hazmat: for the moment, i'd like to do that. [18:40] hazmat: i can't quite bring myself to jump off the high board into the usual world of partially completed and broken OS installs [18:42] rogpeppe: for one data point - my battery life isn't terrible.... it's hard for me to judge on the new laptop, but it seems within range of what is expected. perhaps slightly lower than what people were seeing on windows for my laptop, but not drastically so. [18:42] natefinch: that's useful to know. i currently get about 10 hours, and a little more usage can end up as a drastic reduction in life [18:43] natefinch: and certainly at one point in the past (in quantal, i think) i only got about 2 hours, and i really wouldn't like to go back there [18:43] natefinch: still, my machine has been horribly flaky recently [18:43] rogpeppe, understood, i used to feel that way.. atm i tend to jump onto the new version during the beta cycle.. the qa process around the distro has gotten *much* better, things are generally pretty stable during the beta/rc cycles.... i don't generally tolerate losing work due to desktop flakiness. [18:43] natefinch: perhaps saucy might improve that [18:43] rogpeppe, what's your battery info like? [18:43] hazmat: battery info? [18:45] rogpeppe, upower -d [18:45] it will show design capacity vs current capacity on your battery if your battery reports it through acpi [18:46] hazmat: cool, didn't know about that [18:46] hazmat: http://paste.ubuntu.com/6480352/ [18:46] ummm.. you should be getting way more than 2hrs [18:47] hazmat: i do, currently [18:47] rogpeppe, i use powertop to get a gauge of where my battery usage is going [18:47] hazmat: but some time in the past i didn't [18:47] hazmat: currently i get about 10h [18:47] and i have some script i use when i unplug to get extra battery life by shutting down extraneous things. [18:47] hazmat: which means i can hack across the atlantic, for example [18:48] hazmat: usually i shut down everything and dim the screen, which gets me a couple more hours [18:50] yeah.. getting off topic..
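Picking up the mongod 2.2.4-versus-2.4.6 question from earlier in this exchange: one way to make the server version visible while chasing a version-dependent failure is to ask the driver for its build info. A small sketch using the mgo driver of the time (the localhost address is an assumption; the test harness normally starts its own mongod on a random port):

    package main

    import (
        "fmt"
        "log"

        "labix.org/v2/mgo"
    )

    func main() {
        // Assumes a mongod listening on the default local port.
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatalf("cannot dial mongod: %v", err)
        }
        defer session.Close()

        info, err := session.BuildInfo()
        if err != nil {
            log.Fatalf("cannot get build info: %v", err)
        }
        // Logging this alongside a failing run makes it obvious whether the
        // 2.2 vs 2.4 difference is in play.
        fmt.Printf("mongod %s (at least 2.4: %v)\n", info.Version, info.VersionAtLeast(2, 4))
    }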
but switching out to saucy really shouldn't do much harm to battery life, i haven't really noticed anything significant (intel graphics / x220) [18:54] hazmat: do you use a second monitor? [18:56] rogpeppe: multi-monitor support is not ubuntu's strong suit. I just had to put my laptop to sleep and then open it back up after unplugging two monitors, otherwise my laptop screen was blank :/ [18:56] rogpeppe: or at least, it's not a strong suit on the two recent laptops I've had [18:57] natefinch: it works ok for me usually, except the graphics driver acceleration goes kaput about once a day [18:57] rogpeppe: it's only really a problem for me when I add or remove monitors. Steady state works fine for me. [18:58] natefinch: adding and removing works ok usually. i was really interested to see if hazmat had the same issue as me, 'cos his hardware is pretty similar [18:59] rogpeppe: ahh [18:59] rogpeppe: what laptop do you have, anyway? 10 hours is impressive [18:59] natefinch: lenovo x220 [19:01] very nice. I get about 4-5 hours on battery... I probably should have gone for the bigger battery in this thing that would have given me 6-8. [19:01] natefinch: you've got a much bigger display, i think [19:01] rogpeppe, i do use a second monitor [19:02] rogpeppe: yeah, mine's 15.6" and hi-res [19:02] rogpeppe, i typically only use one external screen and turn off the internal.. i used to do two internal screens (with a docking station) [19:02] er.. two external [19:02] works pretty well for me [19:02] one screen, wow, I wouldn't be able to do it :) [19:02] hazmat: hmm, i think i'm the only person that ever sees the issue [19:03] natefinch, one.. 24 inch screen works well enough for me. [19:03] hazmat: i reported the bug ages ago, but i probably reported it to the wrong place. never saw any feedback. [19:04] natefinch, i've had that issue, the screen is still there, though.. i just enter a password to get past the unrendered screen saver password prompt, and i'm back to the desktop.. it's basically a wake from monitor shutdown.. [19:04] er.. monitor power saving mode [19:04] hazmat: yeah, if I close the laptop lid and reopen it, it seems to sort itself out. Just kind of annoying. [19:04] not very common anymore, but still annoying.. and led to me accidentally typing my password into the active window (irc) a few weeks ago. [19:06] rogpeppe, the x220 tricks out quite nicely.. i added an msata card for lxc containers and 16gb of ram as upgrades this year.. also picked up the slice battery, but not clear that was as useful.. but with it, roughly 16hrs of battery life (mine is a bit more degraded than yours on capacity) [19:06] hazmat: haha, I did the same thing, into an IT-specific facebook group, no less [19:06] a bit annoyed they're moving to a max of 12gb of ram on the x240 and x440 [19:07] natefinch, the m3800 / xps looks pretty nice, just not sure about that screen res issue at the os level. i assume you're just playing around with the scaling to make things usable? [19:08] hazmat: yeah, I set the OS font to 150%, set the cursor to be like triple normal size, and zoom in on web pages.... it's actually not terrible [19:09] hazmat: and it is a really really sharp display [19:10] hazmat: and the build quality overall is exceedingly nice. It feels really sturdy, but surprisingly thin and light for being a pretty beefy machine [19:10] btw, is there a way to get ubuntu to turn off the touchpad while I'm typing? I palm-click constantly [19:10] natefinch: msata is just a solid state drive, right?
[19:11] rogpeppe: msata is just the interface type and size, but yes, there are no spinning msatas that I know of. [19:11] yeah.. too small for spinning rust [19:12] rogpeppe: electrically, it's just a different shaped plug from regular sata.... exact same specs etc, you can mount an msata in a regular sata drive by just hooking up the wires correctly [19:12] hazmat: so there's room on an x220 for one of those in addition to the usual drive? [19:12] rogpeppe, yes [19:12] ahh, cool, yeah, my xps15 has that too [19:12] though at the expense of the larger battery [19:12] rogpeppe, i dropped a 128gb plextor m5 in.. needs a keyboard removal though, but it's pretty straightforward, youtube videos cover it [19:12] er rather, the 2.5" drive is at the expense [19:13] hazmat: cool. i'm a little surprised there's space in there! [19:13] rogpeppe, there's some additional battery draw, in terms of finding a perf compromise.. the msatas are super tiny [19:13] rogpeppe, http://www.google.com/imgres?imgurl=http://www9.pcmag.com/media/images/357982-will-ngff-replace-msata.jpg%3Fthumb%3Dy&imgrefurl=http://www.pcmag.com/article2/0,2817,2409710,00.asp&h=275&w=275&sz=64&tbnid=D6nAHdfDO9YioM:&tbnh=127&tbnw=127&zoom=1&usg=__fRuk3l4RfCrNCEY6gQ32RZaHaA8=&docid=uliVfmMKZbEonM&sa=X&ei=3fKUUrXUDaiusASxiYCYDw&ved=0CDwQ9QEwAw [19:13] ugh.. google links [19:14] heh [19:24] I reported Bug #1255242 about a CI failure that relates to an old revision. Upgrading juju on hp cloud consistently breaks mysql [19:24] <_mup_> Bug #1255242: upgrade-juju on HP cloud broken in devel [19:43] dammit, my mouse cursor disappeared. [19:47] sinzui: a comment posted to bug #1255242 [19:47] <_mup_> Bug #1255242: upgrade-juju on HP cloud broken in devel [19:47] I need to go to bed now [19:48] sinzui: I don't doubt we have a problem, but from all indications this isn't an *upgrade* bug, because Upgrade is never triggered in that log file [19:49] jam, yes, the issue is confusing, which is why we spent so long looking into it ourselves [19:50] Line 50 is: 50:juju-test-release-hp-machine-0:2013-11-26 15:06:39 DEBUG juju.state.apiserver apiserver.go:102 <- [1] machine-0 {"RequestId":6,"Type":"Upgrader","Request":"SetTools","Params":{"AgentTools":[{"Tag":"machine-0","Tools":{"Version":"1.17.0-precise-amd64"}}]}} [19:50] which is machine-0 telling itself that its version is 1.17.0 [19:51] sinzui: ERROR juju runner.go:220 worker: exited "environ-provisioner": no state server machines with addresses found [19:52] is probably a red herring [19:52] I think it is the environ-provisioner waking up before the addresser [19:52] jam, thank you for the comment. I think I see a clue. The bucket has a date string in it and we increment it because I think it can contain cruft. That date is not even close to now. So our HP tests might be dirty. It also relates to our concern that we want juju clients to bootstrap matching servers. [19:52] so it tries to see what API servers to connect to, but the addresser hasn't set up the IP address yet [19:52] * sinzui arranges for a test with a new bucket [19:53] sinzui: 2013-10-10 does look a bit old [19:53] * jam goes to bed [19:55] sinzui: ok, I thought I was going.... I'm all for being able to specify what version you want to bootstrap, "juju bootstrap --agent-version=1.16.3" or something like that. I don't think users benefit from it over getting the latest patch (1.16.4) when their client is out of date. [19:56] jam, fab.
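On the "worker: exited" line being dismissed as a red herring: juju's agents run workers under a runner that restarts them when they exit with an error, so a worker that starts before its dependency is ready (here, the provisioner before the addresser) logs a failure and simply comes back later. A generic, much-simplified sketch of that restart-on-error pattern (not the real runner implementation):

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errNoAddresses = errors.New("no state server machines with addresses found")

    // provisioner stands in for the environ-provisioner: it fails fast if the
    // API server addresses it depends on have not been published yet.
    func provisioner(addresses func() []string) error {
        addrs := addresses()
        if len(addrs) == 0 {
            return errNoAddresses
        }
        fmt.Println("provisioner running against", addrs)
        return nil
    }

    // runWorker restarts the worker after a short delay whenever it exits with
    // an error, which is why the early failure shows up in the log yet is
    // harmless once the addresser catches up.
    func runWorker(name string, work func() error) {
        for {
            if err := work(); err != nil {
                fmt.Printf("worker: exited %q: %v\n", name, err)
                time.Sleep(100 * time.Millisecond)
                continue
            }
            return
        }
    }

    func main() {
        attempts := 0
        addresses := func() []string {
            attempts++
            if attempts < 3 {
                return nil // the addresser has not published anything yet
            }
            return []string{"10.0.3.1:37017"}
        }
        runWorker("environ-provisioner", func() error { return provisioner(addresses) })
    }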
I will arrange another run of the test with a clean bucket [19:59] wallyworld_, fwereade: i've sent an email containing a branch and some comments on my progress [20:06] * rogpeppe is done for the day [20:06] g'night all [20:36] woot, just got 666666 as my otp 2fa code [20:49] sinzui: abentley: do you guys have a good jenkins backup/restore config setup in place? [20:49] hazmat: lol, now if only it was fri-13th [20:50] rick_h_: No. [20:50] abentley: ok, so much for cribbing :P [21:02] jam: We never released 1.16.4 because it would have introduced an API incompatibility. It's not safe to assume that agent 1.16.4 is compatible with client 1.16.3. This is not a theoretical risk. It very nearly happened. [21:48] mgz: you around?
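sinzui's point about 1.16.4 is why bootstrapping an agent version that exactly matches the client (rather than silently taking the newest patch release) is the safer default. A hedged sketch of that selection policy; the function name and the --agent-version override are illustrative, not juju's actual API:

    package main

    import "fmt"

    // chooseAgentVersion picks which agent tools to deploy at bootstrap:
    // prefer an exact match for the client's version, since even a patch
    // release is not guaranteed to be API-compatible; an explicitly
    // requested version (e.g. a hypothetical --agent-version flag) wins.
    func chooseAgentVersion(client, requested string, available []string) (string, error) {
        want := client
        if requested != "" {
            want = requested
        }
        for _, v := range available {
            if v == want {
                return v, nil
            }
        }
        return "", fmt.Errorf("no agent tools matching %s in %v", want, available)
    }

    func main() {
        avail := []string{"1.16.3", "1.16.4", "1.17.0"}
        fmt.Println(chooseAgentVersion("1.16.3", "", avail))       // 1.16.3 <nil>
        fmt.Println(chooseAgentVersion("1.16.3", "1.17.0", avail)) // 1.17.0 <nil>, only when explicitly requested
    }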