=== gary_poster is now known as gary_poster|away
[02:23] * thumper misses list comprehension in go
[03:02] thumper: mr ocr, i have a branch which hooks up simplestreams mirrors support for tools https://codereview.appspot.com/13952043
[03:07] for fuck sake, our landing bot has been shot down
[03:07] shut
[03:07] ah maintenance i think
[03:13] hi wallyworld_
[03:13] hi
[03:13] I'll look shortly
[03:13] i hope canonistack is back soon
[03:13] np
[03:20] wallyworld_: I thought it was back already?
[03:20] bradm: when i nova list it says the instances are shutdown
[03:20] +--------------------------------------+----------------------+---------+-------------------------+
[03:20] | ID                                   | Name                 | Status  | Networks                |
[03:20] +--------------------------------------+----------------------+---------+-------------------------+
[03:20] | 4829b364-72ad-4ee7-a21c-3ba640f28854 | juju-gobot-machine-0 | SHUTOFF | canonistack=10.55.32.55 |
[03:20] | 97a7c226-a195-4014-9df5-c998bba3a491 | juju-gobot-machine-3 | SHUTOFF | canonistack=10.55.32.52 |
[03:20] +--------------------------------------+----------------------+---------+-------------------------+
[03:21] wallyworld_: yeah, the compute node being rebooted will do that
[03:21] bradm: it would not have been in the procedures to restart stuff that was running?
[03:22] wallyworld_: I wasn't directly involved, but it would seem not to be the case
[03:22] :-(
[03:22] this is the second time our instances have been broken :-(
[03:22] you can't just power it on?
[03:22] i'm not sure how
[03:23] i assume there's a nova command
[03:23] i'll take a look
[03:23] nova start
[03:23] yeah, trying that now
[03:24] bradm: back running, seems quicker perhaps
[03:25] wallyworld_: probably, there's likely hardly anyone elses instances going :)
[03:25] \o/
[03:26] the compute nodes are pretty beefy machines
[03:26] they're just being overcommitted by a lot
[03:29] I'll chase up what happened internally, the announcements did say things would be restarted
[03:29] but that definitely appears not to be the case, or at least not consistently
[03:31] thanks :-)
[03:33] wallyworld_: something is wrong with the gobot
[03:33] no mongod
[03:33] :-(
[03:34] i'm not familiar with how it is set up sadly
[03:34] we need more monday gods
[03:34] mon-god
[03:34] yeah
[03:34] although stopping and starting should not have affected it you'd think
[03:36] fwiw with my dinky little juju test env on lcy02 the reboot didn't break it, its back up and going
[03:37] oh good
[03:38] thumper: i had a quick look - mongod is in /usr/local/bin and /usr/local/bin is in the path so i'm not sure
[03:41] * thumper -> haircut
[03:50] thumper, testing saucy local fwiw
[03:52] thumper, is there a particular version of interest? trunk i assume?
[04:16] thumper, fails for me.. although looks like diff issue, namely the upstart job needs a wait between dropping an upstart template to disk and starting till inotify triggers and register with upstart.
[04:32] this is very interesting, a default juju bootstrap on lcy02 fails, since the instance type isn't big enough
[04:34] mongodb shuts itself down saying there's not enough space
[04:53] bradm: that started happening about a week ago for some reason, i think folks are looking into it
[04:53] wallyworld_: I can tell you why
[04:54] wallyworld_: the default instance is a m1.tiny, which has a 2G /
[04:54] ok :-)
[04:54] juju used to be ok in 1G
[04:54] or even 512
[04:54] wallyworld_: I just bootstrapped with more, mongodb alone uses 3G
[04:54] wallyworld_: its disk thats the issue, not memory
[04:54] serious? the landing bot bootstrap machine used to be a 512M instance
[04:54] ah disk
[04:55] i thought you were talking about ram
[04:55] still, juju should not pick tiny on canonistack
[04:55] I can't say why mongodb suddenly wants all your disk, but that seems to be the case
[04:56] bootstrap it with a m1.tiny and you'll see, check the logs in /var/log/mongodb
[04:56] it pretty clearly says it needs more disk
[04:56] there's some issue in how juju is choosing the instance
[04:56] it used to work. it should be picking small
[04:56] i'm not sure of the current status though but it is being looked at
[04:56] yeah, not sure where it changed, but thats the fix, to bootstrap with constraints that give you a bigger disk
[04:59] is that whats happening with your gobot? needs more disk for mongodb?
[05:01] I wonder if mongodb should be using the smallfiles option
[05:02] bradm: could be, but it was running fine before the shutdown
[05:04] wallyworld_: /var/log/mongodb/mongodb.log should make it pretty clear
[05:05] bradm: yeah, true. i'm tied up trying to get some coding finished, but i'll look soon
[05:15] wallyworld_: cool, I can do some more testing myself once I've gotten this charm done
[05:16] bradm: ok. i'm flat out right now as i'm off from tomorrow for a week and am trying to get everything done before i go.
i'll hopefully be able to look a bit later
[05:17] wallyworld_: actually, I'm off next week too :)
[05:17] \o/
[05:17] going anywhere?
[05:18] yeah, my parents have taken our son for a holiday, we're driving up there to pick him up and spend some time with them
[05:18] its one of our first times (outside of hospital) that we've been away from him, its interesting
[05:19] how old is he?
[05:19] 6
[05:19] yeah, we didn't spend time away from our son for a few years also
[05:20] there are medical issues with him too, so we're probably a bit more protective than normal
[05:20] yeah, i can understand that
[05:20] he had 2 open heart surgeries before he was 5
[05:25] wow
[05:25] glad he's ok
[05:25] yeah, he's pretty good given what he's had to go thru
[05:25] how about you? going anywhere interesting?
[05:26] hervey bay to watch whales, then to frazer island for a few days
[05:26] looking forward to it
[05:27] ahh, nice - I've been to hervey bay whale watching before, lots of fun
[05:28] yeah, me too about 10 years ago
[05:28] with kid #1. now with kid #2
[05:29] we're starting to think along those lines for holidays as the boy gets older, he might actually get a bit more out of it
[05:29] yep. we took kid #1 to nz when he was 4 and he remembers nothing. what a waste
[05:30] it'd be pointless for us before now, we always seemed to spend a good portion of the year with him in and out of hospitals
[05:32] that's a shame, i hope he gets well asap
[05:32] he's been really good this year
[05:32] usually a flu would mean a trip to hospital, this year so far things have been good
[05:34] \o/
[05:35] ohh, there's two mongodb running in my juju env
[05:35] and its the non juju one taking up all the space
[05:37] the juju started one has --smallfiles, the other one doesn't
[05:38] ah
=== thumper is now known as thumper-afk
[06:13] wallyworld_: as I'm working through some other things, I came across this question. How does "juju bootstrap --upload-tools" work today w/ openstack.
Doesn't it put the tools in your private bucket, which should *not* be world readable?
[06:13] jam: yes
[06:13] it puts them in private bucket
[06:14] wallyworld_: right, and both cloud-init and Upgrader just use a "wget" to get the tools
[06:14] no Auth
[06:14] jam: bot is down since the canonistack maintenance. i haven't had a chance to look deeply, but running tests says it can't find mongod in path, but mongod is in /usr/local/bin and that dir is in the path from what i can see
[06:15] jam: it uses a temp url
[06:15] which is publicly readable
[06:15] jam: i've tested bootstrapping with upload tools and simplestreams and it works fine
[06:16] unless i've missed something
[06:17] jam: when getting to tools URL, it does a storage.URL() which for environs storage returns a url from which anyone can read
[06:17] wallyworld_: we don't have temp urls on canonistack, IIRC. I'm worried we're actually making our private containers world readable
[06:17] wallyworld_: sounds like the bot is using the old tarball
[06:17] it should use mongodb from ppa:juju/stable
[06:17] wallyworld_: when working it out originally, we decided it was ok that the "public-bucket-url" had to be world readable
[06:18] I don't expect it to be any different with tools-url
[06:18] but I'm seriously suspecting that we should be able to "juju bootstrap --upload-tools" on Canonistack
[06:18] jam: i'll have to check but it all seemed to work ok
[06:18] upload tools is now automatic
[06:18] wallyworld_: you mean sync-tools ?
[06:18] no, upload
[06:18] again, I think if it *is* working, we have a security hole
[06:19] i don't recall explicitly setting permissions on the control bucket
[06:21] wallyworld_: "swift stat $PRIVATE-BUCKET" has ".r:*,.rlistings"
[06:21] wallyworld_: :(
[06:21] hmmm. the tool stuff doesn't set that i'm pretty sure
[06:21] wallyworld_: I don't know *who* is setting it, but it is wrong, and it means private tools won't work when we "fix" it.
[06:22] wallyworld_: I think the "auto-upload-tools" stuff creates a bucket and sets it world readable
[06:22] i'll have to check
[06:23] jam:
[06:23] containerName: ecfg.controlBucket(),
[06:23] // this is possibly just a hack - if the ACL is swift.Private,
[06:23] // the machine won't be able to get the tools (401 error)
[06:23] containerACL: swift.PublicRead,
[06:23] this was put in in january
[06:23] by dimiter i think
[06:23] wallyworld_: with nobody realizing "you can't get the tools, but you're exposing all of your secrets to the world" ?
[06:24] I'm not 100% sure what goes in the private bucket
[06:24] as I don't think we put creds there.
[06:24] So it *might* be ok
[06:24] we put the state file there
[06:24] wallyworld_: which is just the IP address, right?
[06:24] i'd have to check but i don't think creds go there
[06:24] i *think* so
[06:29] wallyworld_: I think the only actually private thing is potentially private charms
[06:30] As I'm pretty sure we put the charm data in there
[06:30] however
[06:30] that *also* needs to be accessible via "wget" because of how we removed Environ creds from the Uniter agents.
[06:30] so it could be worse i guess
[06:31] fwereade: I need to chat with you about this.
[06:31] wallyworld_, fwereade: security bug #1231278
[06:31] ok
[06:31] (mup won't find it because it is private)
[06:32] wallyworld_, jam, reading back
[06:32] We just need a discussion, because there is certainly a "vulnerability vs not working at all" that we have to sort through.
[06:32] fwereade: G+ might be appropriate
[06:33] wallyworld_: bigjools seems to be enjoying himself without you so far :)
[06:33] jam, well, I'm here, if the sight of a dressing gown will not be damaging to your sensibilities
[06:33] jam: how do you know?
[06:33] wallyworld_: he's posting pics of the great barrier reef on G+
[06:33] fwereade: my eyes, my poor, poor eyes
[06:33] fwereade: started: https://plus.google.com/hangouts/_/26fdcf993421ca83a1cf0b1a3ddd35772695e493
[06:33] jam: ah ok. that social networking thing i ignore
[06:34] fwereade: you could just turn the camera off :)
[06:44] https://code.launchpad.net/~dave-cheney/juju-core/158-lp-1210407/+merge/187675
[06:56] axw: thanks for your review, see my comments
[06:56] looking
[07:04] davecheney: will lgtm, just curious about this: "we don't reboot machines" -- it doesn't work?
[07:05] I get your point though - it doesn't really matter
[07:05] axw: if you reboot a machine it gets a new ephemeral ip
[07:05] and at that point, nothing works
[07:09] axw: why do you say twice ?
[07:09] I get your point. It just feels wrong to do it twice when it only ought
[07:09] to be done once. But, given that's not really possible... LGTM.
[07:09] "
[07:09] davecheney: *if* it were able to reboot
[07:10] it's idempotent though, so doesn't matter.
[07:10] axw: fair point
[07:11] also, bootcmd http://cloudinit.readthedocs.org/en/latest/topics/examples.html
[07:11] does what runcmd does
[07:11] it has the same firstboot properties
[07:12] davecheney: ok, then the doc comment on juju-core/cloudinit/Config.AddBootCmd is wrong :)
[07:12] axw: ok, i'll fix that in a followup
[07:13] davecheney: in that page, the comment for bootcmd has a hidden gem: " * bootcmd will run on every boot"
[07:14] urgh
[07:14] oh well
[07:14] care factor, quite small
[07:14] this may fix the azure disk suckage
[07:15] but while reading that page
[07:15] where does it say runcmd is only run once ?
[07:15] * axw shrugs
[07:16] i'm glad we've arrived at this place
[07:16] it says it in juju-core, but that's maybe not authoritative
[07:17] axw, best I can tell, we've never tried
[07:17] everyone knows rebooting an ec2 instance will screw it
[07:18] no worries, it's not a big deal
[07:18] fwereade: when you have a moment, would you mind expanding on your comments here? https://codereview.appspot.com/13832045/
[07:19] axw, sure
[07:19] I've changed the authentication stuff around a bit to allow HTTP GETs & HTTPS PUTs; wanted to know what you meant first, though, in case I was expending too much effort on this...
[07:21] axw, I was just ruminating that *if* cert distribution proves to be some sort of hassle (mainly because the CLI still needs direct storage access for deploy/upgrade-charm) we *could* use ssh storage for the manual provider and filesystem storage for the local one, because the clients that need write access should already have the information needed to set up the appropriate Storage types
[07:22] ah, right
[07:22] fwereade: does anything other than CLI need write access?
[07:22] axw, the API server itself may do
[07:23] axw, but (assuming non-HA, anyway) that's doable via the filesystem
[07:23] fwereade: ok.
it can write directly, given it's local, so that's fine
[07:23] yep
[07:23] ok, so yes I did expend too much effort
[07:23] oh well
[07:23] axw, well, you expended it too early, at least
[07:23] :)
[07:23] axw, but no real harm done, I think
[07:25] axw, we would ideally like to not depend on provider storage at all but that's not an immediate plan
[07:26] ok
[07:27] axw, what we will need to do soon, though, is start exposing storage access via the API, so that an API-only CLI can still upload charms from local repos
[07:27] fwereade: also, when you're not busy, would you please look at my latest replies on these two: https://code.launchpad.net/~axwalk/+activereviews
[07:27] fwereade: I was wondering if/when storage would be API based
[07:27] axw, will do
[07:28] thanks
[07:34] fwereade: actually, my changes to httpstorage aren't for naught
[07:34] they'll allow GETs to not require a self-signed cert
[07:34] forgot that important bit :)
[07:34] axw, I don't think they are, indeed
[07:34] fwereade: I mean, changes I haven't pushed yet
[07:34] axw, ah right -- cool then :)
[07:34] I've been changing things today
[07:38] axw, https://codereview.appspot.com/13632046/ LGTM
[07:39] fwereade: thanks. the error is tested in jujutest/livetests
[07:39] axw, I was just thinking of direct tests for New/Is
[07:40] fwereade: ok, then no. I'll add some before landing
[07:40] axw, cheers
[07:42] eh that package has no tests... time to add some
[07:48] mornin' all
[07:59] morning rogpeppe1
[07:59] axw: hiy
[08:00] a
[08:00] :-)
[08:04] axw, https://codereview.appspot.com/13255051/ nearly LGTM, take a look and let me know your thoughts
[08:04] rogpeppe1, morning
[08:04] fwereade: yo!
[08:04] fwereade: thanks, reading
[08:12] fwereade: replying now, but yeah, there is currently no handling of destruction for bootstrap nodes
[08:12] the others can be destroyed as usual
[08:13] I wasn't really sure where to draw the line with "null" :)
[08:24] fwereade: i don't quite understand this comment: https://codereview.appspot.com/13912043/diff/1/environs/configstore/disk.go#newcode134
[08:25] Hi jam, hi mgz… would any of you have time to talk about what seems to be a serious bug in the MAAS provider (bug 1229275).
[08:25] fwereade: i *think* the only time we add attributes is when we call Prepare
[08:25] <_mup_> Bug #1229275: juju destroy-environment also destroys nodes that are not controlled by juju
[08:25] rvba: oops!
[08:25] rvba: yup, but I'll need to get on a bus in a sec
[08:27] rogpeppe1, it's not really actionable, especially in the light of our later discussions
[08:27] fwereade: ok, cool
[08:27] rogpeppe1, prepare chooses bootstrap-state and writes it; bootstrap uses exactly that
[08:27] fwereade: yup
[08:27] rogpeppe1, it may involve some light massage of bootstrap responsibilities vs prepare responsibilities
[08:27] rogpeppe1, but nbd
[08:27] So basically, juju destroys all the instances it gets back from the provider's instances() method, and that is basically all the instances.
[08:28] rvba, that looks like a critical to me
[08:28] Critical indeed.
[08:28] rvba, how does the maas provider mark instances as controlled by itself?
[08:28] rvba: the provider's Instances method should not be returning instances it didn't itself create
[08:28] fwereade: it doesn't
[08:28] rogpeppe1: that's the problem indeed.
[08:28] rvba: the other providers take care to avoid that
[08:29] rvba, well, crap -- as someone maasy, how would you recommend we do so?
[08:30] fwereade: if this needs to be addressed on the MAAS side, then the easiest way is probably to set a tag on the nodes. A tag identifying the juju environment.
[08:31] Out of curiosity, how do the other providers do it?
[08:32] rvba: either by looking at the security groups or the name attached to the instances, I believe
[08:33] instances that env controls are given names juju-ENVNAME-*
[08:36] Right… that's how the Azure provider works too now that I think of it.
[08:38] mgz, rvba: fwiw envname is bad
[08:39] mgz, rvba: long-term, envname can only ever be a local alias for the actual environment uuid
[08:39] it's all a little dodgy, but I don't like the alternatives much
[08:39] mgz, rvba: and we've already had problems with two people using the same env name and same provider credentials
[08:40] it's easy to say "don't do that then"
[08:40] well, we should check for that on bootstrap and blow up
[08:40] right, I need to get on bus
[08:40] but that's not as helpful as designing things such that we don't have to do so in the first place
[08:41] rvba, I need to take a break for a bit, but... actually just a mo
[08:41] rvba, how does juju not destroy those other instances first?
[08:42] rvba, the provisioner will be asking for all instances and culling those it doesn't recognise
[08:42] rvba, so *starting* a juju environment should also kill everything else
[08:42] rvba, as should upgrading it
[08:42] rvba, do you know if that's the case?
[08:43] fwereade: I just tested it, that's not what happens.
[08:43] (I'm testing with the latest trunk)
[08:44] rvba, ok, that's weird
[08:45] rvba, if it's not culling unknown instances it implies that actually AllInstances is reporting the right ones
[08:46] fwereade: that should happen during bootstrap right?
[08:48] fwereade: well, you also have to run "juju status" first or it can't poll the Provider at all yet
[08:48] we've had many "run status and everything dies" bugs :)
[08:50] fwereade: I simply tested running "juju bootstrap", is the culling supposed to happen there or later, for instance when the bootstrap node comes up?
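Note on the naming convention above: the rule that instances an environment controls are named juju-ENVNAME-* amounts to a prefix filter over whatever the cloud returns. The Go sketch below only illustrates that idea and is not the provider code; ownedBy and the sample instance names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ownedBy is a hypothetical helper: it keeps only the instance names that
// follow the juju-ENVNAME-* convention for the given environment name.
func ownedBy(envName string, names []string) []string {
	prefix := "juju-" + envName + "-"
	var owned []string
	for _, name := range names {
		if strings.HasPrefix(name, prefix) {
			owned = append(owned, name)
		}
	}
	return owned
}

func main() {
	// Made-up listing mixing one environment's machines with other nodes.
	all := []string{
		"juju-gobot-machine-0",
		"juju-gobot-machine-3",
		"unrelated-db-server",
		"juju-other-machine-0",
	}
	// Only the two gobot machines should be considered for destruction.
	fmt.Println(ownedBy("gobot", all)) // [juju-gobot-machine-0 juju-gobot-machine-3]
}
```

A filter like this is only as good as the name: an environment called "a" would also claim the instances of one called "a-b", which fits the concern above that envname is a poor identifier compared with the environment uuid.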
[08:50] rvba, jam speaks truth, you need to connect once before the bad things will happen
[08:50] rvba, it'll happen when the provisioner starts running
[08:50] Okay, testing that now.
[08:50] (node is installing)
[08:51] rvba, which will happen just after the first command that connects
[08:51] rvba, cheers
[08:51] axw, responded, let me know what you think of the Destroy error question
[08:51] bbiab
[08:53] fwereade: yes sorry, I agree Destroy should return an error for now
[08:56] fwereade: you were right, culling did happen.
[09:02] rvba, well, ok, the good thing here is we don't have to worry about backward compatibility then, because nothing (sensible) we do can make the situation any worse
[09:05] axw, this right here ^ is a reason for an EnvironUpgrader that acts directly on the environment (independent of the should-it-hit-state discussion)
[09:06] * axw reads back
[09:07] axw, short version: maas instances are not tied back to their environment, and getting instances from maas gets *all* instances, not all instances in the *environment*
[09:07] * axw nods
[09:07] and destroys them all
[09:07] fwereade: where's the link to EnvironUpgrader?
[09:08] I didn't get much sleep last night, so a little slower than usual today
[09:08] axw, sorry, we were chatting about it in the state-upgrades thread
[09:09] axw, your contention was that it should connect to state
[09:09] axw, I think that's the wrong way round
[09:10] axw, *but* that adding an optional upgrade method to environ might be a good idea for other reasons
[09:10] fwereade: for example, so you could add a tag to the maas nodes that you control?
[09:10] axw, exactly so :)
[09:11] axw, or indeed so we could correct the envname problem (above) for the other providers
[09:11] fwereade: your latest reply clarifies things for me, and yes, much nicer to not manipulate state from environ
[09:11] axw, great
[09:15] fwereade: I've updated https://codereview.appspot.com/13255051/
[09:16] okay if I handle Destroy properly in a followup?
[09:19] axw, absolutely
[09:20] axw, LGTM
[09:20] thanks
[09:20] fwereade: I'll get the last of the httpstorage stuff in next, then get onto Destroy
[09:21] axw, perfect, tyvm
[09:21] fwereade: and then Prechecker wireup
[09:22] great -- that one's going to be a bit interesting, I think, we should plan how we get it in there ahead of time
[09:22] * fwereade bbiab again, see you all at the meeting
[09:23] me too, I need a break. bbl
=== thumper-afk is now known as thumper
[09:48] rogpeppe1: I've realized that I really don't like mornings
[09:49] thumper: that's taken you a while :-)
[09:49] thumper: i've realised that i forgot (again!) about our chat last night
[09:49] My head just isn't in it that early
[09:50] I should go check the agenda
[10:36] https://bugs.launchpad.net/bugs/1229275 is that actually Critical ?
[10:36] <_mup_> Bug #1229275: juju destroy-environment also destroys nodes that are not controlled by juju
[10:36] seems High at best
[10:36] especially given "nobody is assigned to it"
[10:58] fwereade, there it is https://codereview.appspot.com/13963043 - first part, the secrets blanking will follow
[11:02] dimitern: the other way around
=== gary_poster|away is now known as gary_poster
[11:18] dimitern, would you take a really quick look at https://bugs.launchpad.net/juju-core/+bug/1229286 ?
it feels somewhat likely to be unitery
[11:18] <_mup_> Bug #1229286: debug-log and boolean options are broken in trunk
[11:19] fwereade, looking
[11:19] dimitern, the config bits specifically
[11:20] dimitern, may be helpful to confer with TheMue, he was touching config recently
[11:20] fwereade, I haven't tried juju set when live testing the api uniter
[11:21] fwereade, just debug-hooks and relation-set/get
[11:21] dimitern, yeah, I should have thought of that
[11:21] dimitern, in fact the stuff you're doing is as critical as this regardless
[11:21] TheMue, is there any likelihood you'll be able to look into it this pm?
[11:22] fwereade: yep, will do
[11:23] fwereade: lunch in a few moments, but then
[11:23] fwereade, did debug-log show the hooks output before?
[11:23] frankban, hey
[11:23] TheMue, cool, thanks, please just verify what's happening with set vs config-changed
[11:24] frankban, about that bug ^^
[11:24] frankban, have you tried using debug-hooks instead?
[11:24] dimitern: no
[11:24] dimitern, frankban: re logging you need to enable that logging in env config now
[11:24] dimitern, frankban: thumper knows exactly
[11:25] frankban, debug-hooks will show you if config-changed got fired
[11:25] dimitern: as I mentioned in the bug description, I am pretty sure that config-changed is called
[11:25] dimitern, frankban: it is due to logging changes that were made recently to make things more "productiony"
[11:25] bootstrap with --debug
[11:26] or --log-config==DEBUG
[11:26] thumper: cool, good to know
[11:26] or whatever you want
[11:26] this log config then propagates to all the agents
[11:26] thumper: so, by default, hooks output is not displayed in the debug log, correct?
[11:26] ah, good to know
[11:26] can be updated using "juju set-env log-config=blah"
[11:26] frankban: correct
[11:26] only warning and errors
[11:27] used to be debug for everything
[11:27] I'll write an email for juju-dev tomorrow to explain the changes
[11:27] and hooks
[11:27] not juju hooks
[11:27] but how to do other logging stuff
[11:28] dimitern: so, the real bug is about boolean options: it seems they are always set to false
[11:28] thumper: thanks for the clarification
[11:29] np
[11:29] frankban, hmm.. TheMue, can this be relevant to your recent config changes?
[11:30] dimitern: should not, only empty settings have been touched
[11:30] dimitern: will take a look after lunch
[11:30] * TheMue => lunch
[11:35] fwereade, did you have a chance to look at https://codereview.appspot.com/13963043 ?
[11:35] dimitern, been in meetings I'm afraid, i'll try to fit in in before I go for lunch
[11:36] fwereade, ok
[11:41] dimitern, did we not have an implementation for Upgrader that swapped out 127.0.0.1?
[11:41] dimitern, erDeployer
[11:41] fwereade, that's from there
[11:42] fwereade, it's not swapping anything
[11:42] fwereade, and it actually works like proposed - live tested on ec2
[11:42] dimitern, I see, ok, no quibbles with what we're doing
[11:43] dimitern, but would you please pull the common implementation of those methods out into a common type we can embed, like the other shared functionality?
[11:43] dimitern, I can live with that as an *immediate* followup
[11:44] fwereade, even though it's going away as soon as we have machine addresses?
[11:44] dimitern, we're still going to need to do the same thing in the same two places, aren't we?
[11:45] fwereade, I'll do it in this CL, not too much to do I think
[11:45] dimitern, we'd just stop using an environ to do so, surely
[11:45] dimitern, that's even better :)
[11:47] thanks
[11:48] * fwereade quick lunch
[11:52] dimitern: https://codereview.appspot.com/13964043/ looks pretty much the same as the one you set back to WIP and were going to resubmit. Did you mark the wrong one?
[11:52] https://code.launchpad.net/~dimitern/juju-core/145-apiserver-provisioner-blank-secrets/+merge/187577 looks just like https://code.launchpad.net/~dimitern/juju-core/147-apiprovisioner-blank-env-secrets/+merge/187738
[11:53] dimitern: maybe you meant to reject https://code.launchpad.net/~dimitern/juju-core/146-apiprovisioner-addresses/+merge/187719 ?
[11:55] jam, no, it has almost the same description and diff, but different prereq
[12:38] TheMue, when you get back would like to know how https://bugs.launchpad.net/juju-core/+bug/1224568 is doing
[12:38] <_mup_> Bug #1224568: Improve hook error reporting
[12:39] gary_poster: it's almost done, one smaller CL is missing. after investigating the problem of frankban i'll continue (tests are missing)
[12:40] awesome thanks TheMue @
[12:40] !
[12:52] frankban: ping
[12:55] TheMue: pong
[12:55] frankban: the boolean value, how is it configured?
[12:57] TheMue: I saw every boolean values set to False, both if they are true by default (in config.yaml) and when they are set to True using "juju set".
Hope that answers your question
[13:00] frankban: the setting makes me wonder, there has been a change in getting handling nil values when default is set
[13:00] frankban: the change happened with rev 1800
[13:01] TheMue: it is possible, I saw this problem in trunk, but it works as usual reverting to 1750
[13:02] TheMue: the bug includes instructions to dupe, I'd ensure this is not something wrong in my local configuration before investigating
[13:02] frankban: so if 1799 would be ok and 1800 not we've got it ;)
[13:04] frankban: the change has been to omit nil values if default is set. and this may be interpreted as false
[13:05] TheMue: the weird thing is that it seems the value is False in the hooks execution even when you explicitly set an option to true (and the default is false)
[13:05] frankban: are you still on 1750 or back on trunk
[13:05] TheMue: 1750
[13:06] frankban: the hook execution part is strange
[13:06] frankban: take a look at http://bazaar.launchpad.net/~go-bot/juju-core/trunk/revision/1800, get.go line 52 (the rest are tests)
[13:07] TheMue: so rev 1800 has "if option.Default != nil { info["value"] = option.Default" which seems to be the only change. Otherwise we leave value untouched.
[13:07] frankban: yes, exactly
[13:08] frankban: before that change the map contains the key "value", only with a value nil
[13:10] frankban: so with a quick hack on your 1750 to behave here like the 1800 and showing the same errors shows that it's a shitty CL :/
[13:11] TheMue: so you duped?
[13:11] frankban: yes, i would revert it then
[13:12] frankban: but it would help me if you make that quick hack test to be sure that this is the correct concluion
[13:12] conclusion
[13:16] TheMue: are you sure the problem is there? AFAICT ServiceGet works correctly (the correct values are showed, i.e.
in the GUI (and the GUI takes that information using the API)
[13:17] frankban: no, i'm not sure, that's so far the only change i've found regarding config later than 1750
[13:17] TheMue, there's another biiig one
[13:17] TheMue, uniter working via API
[13:18] frankban: so you see the correct values in GUI? fine
[13:18] TheMue: yes
[13:18] frankban: ok, will investigate there (uniter)
[13:24] fwereade, updated https://codereview.appspot.com/13963043
[13:25] dimitern, cheers
[13:28] dimitern, nice and clean, LGTM
[13:29] fwereade, thanks
[13:29] dimitern, remind me what else is on your plate after that one? the blanking?
[13:30] got a lead on our memory/tiny booting issues, bug 1227425 may be related
[13:30] <_mup_> Bug #1227425: Cloud images do not need apt-xapian-index
[13:30] TheMue, ah-ha
[13:31] TheMue, a true boolean is being reported to the uniter as ""
[13:31] fwereade, I realized we no longer need StateAddresses() and APIAddresses() on agent.Config, so I'll remove these as well
[13:31] dimitern, nice
[13:31] dimitern, thanks
[13:31] fwereade: i'm currently digging in the uniter
[13:31] fwereade: where are you
[13:32] TheMue, add a boolean to testing/repo/series/wordpress/config.yaml
[13:33] TheMue, find the uses of assertYaml in uniter_test.go
[13:38] shit
[13:38] config data is getting squeezed through map[string]string and we didn't spot because we didn't have tests involving non-string config settings at the sharp end
[13:39] a small MP that might speed up tests slightly: https://codereview.appspot.com/13968043/
[13:39] TheMue: revno 1800 works well fwiw.
trying 1845 now
[13:40] fwereade: testing it, just had to change something in my test code ;)
[13:40] frankban: aha
[13:40] TheMue, frankban, dimitern: state/apiserver/uniter/uniter.go:509
[13:40] TheMue, frankban, dimitern: those are not relation settings and are most definitely not a map[string]string
[13:41] TheMue, frankban, dimitern: this is critical
[13:41] so sval, _ := v.(string) is killing booleans?
[13:42] fwereade, hmm
[13:42] fwereade, ok, so we need map[string]interface{} there?
[13:42] dimitern, yeah
[13:43] wow
[13:43] dimitern, the confusing range of configgy/settingsy types with their selection of arbitrarily different rules is deeply depressing to me
[13:43] fwereade, if it's only that, it's easy enough to fix the API
[13:44] dimitern, bad luck for getting caught up in it (and I probably reviewed it too :/)
[13:44] dimitern, I believe so
[13:44] dimitern, we did release with the uniter api active, didn't we?
[13:45] fwereade, we did
[13:45] dimitern, still, upgrading the return type won't actually hurt
[13:45] dimitern, or will it
[13:45] dimitern, what happens if we try to deserialize a map[string]interface{} with mixed values into a map[string]string?
[13:46] fwereade, it ignores non-strings?
[13:46] dimitern, that'd be nice, and I think it might, but we should check
[13:47] fwereade, I mean - non-strings get empty string values
[13:47] dimitern, that would mean behaviour wouldn't change
[13:48] fwereade, I can do a CL that changes the result of ConfigSettings() to params.ConfigResults (new type - like SettingsResults, but with params.Config instead)
[13:49] dimitern, can we give them explicit ConfigSettingsResults and RelationSettingsResults names please?
[13:49] dimitern, and name the types they use ConfigSettings and RelationSettings
[13:49] fwereade, well, ConfigResult is used by the provisioner actually, for environ config result
[13:50] dimitern, fwereade, TheMue, natefinch, mgz, jam: environment file extension: anyone want to weigh in?
https://codereview.appspot.com/13969043 [13:50] fwereade, we can change these, but that means even more api incompatibility [13:50] dimitern, type names are arbitrary, aren't they? where's the incompatibility? [13:50] dimitern, field names are a problem [13:50] fwereade, protocol on-the-wire might change? [13:51] fwereade, or not, ok [13:51] dimitern, if they suck we just have to eat it up and hope we learn from our mistakes :) [13:51] fwereade, next CL will be about that then [13:51] dimitern, I think it's even more important than the secret-masking tbh [13:52] dimitern, this is a pretty devastating regression [13:53] rogpeppe1: reviewed [13:53] fwereade, I'm done with the provisioner for now - submitted the first for landing, the second one is next, and while waiting I'll tend to the uniter [13:54] * fwereade throws flowers before dimitern's path [13:54] I like jenv because if we decide we don't like yaml anymore, we can put something else in there. I do sorta have a hatred for prefixing things with j, just due to an inordinate amount of time exposed to java crap [13:55] * natefinch isn't bitter though... [13:55] fwereade, (if we ask Captain Hindsight for advice it'll be:) we would've caught this if we had tests for non-string settings [13:55] thank you, Captain Hindsight! [13:56] dimitern, perfectly correct [13:56] fwereade, so I'll look about adding some [13:58] dimitern, stick to local unit tests for the bit you change, for now, please -- I consider this critical and don't want to release with it *again* ;p [13:58] dimitern, changing the uniter tests to exercise it may be noisy [13:59] dimitern, they must ofc be done but they'll delay landing the fix [13:59] fwereade, ok [14:00] dimitern, that said, hmm, how do we test in the api? [14:01] hey [14:01] looking at https://codereview.appspot.com/13962043/ [14:01] dimitern, if we use wordpress' config settings [14:01] rather than disabling certificate checking ... 
[14:01] wouldn't it be better to add the certificates ? [14:01] it seems juju would know them. [14:02] cloud-init has config that explicitly allows adding certificates that should then be accepted. [14:02] hazmat, ^ ? [14:02] jam, smoser makes an interesting point ^ [14:03] dimitern, anyway: if we are using wordpress as the "standard" testing charm [14:03] http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/view/head:/doc/examples/cloud-config-ca-certs.txt [14:04] fwereade, I have some simple charms I can use [14:04] dimitern, we should probably just add all config types to it and so gently encourage people testing to actually check them all [14:04] dimitern, you may find that the uniter is tightly coiled around the fake wordpress charm [14:05] dimitern, but, eh, that's the next branch anyway, I'll stop distracting you [14:06] smoser, I think the encompassing issue may be that some clouds don't even have certs configured [14:06] is that possible ? [14:06] ignorance being exposed.... [14:07] but when i go to some https sight with firefox [14:07] it says "Hey, this doens't look right". You want to get the certificate and trust it ? [14:07] smoser, I have only second-hand "knowledge", inferred from the conversations of those who know more than me [14:07] can't juju client just do the "get the certificate" bit. and then launch instances with that. [14:08] mgz, IIRC you were doing ugly things to induce certificate errors recently I recall -- did I misunderstand your saying you'd been removing the certs temporarily and things had still worked? [14:34] fwereade, the fix is done, testing now [14:38] fwereade: jam had done testing along those lines, but only for the client side so far I think (as it's harder to screw up the certs on a booted node and check that works) [14:43] mgz, fwereade: I ssh'd into the node and messed up the certs for testing the patch I proposed. 
[14:43] fwereade, smoser: While I like adding the functionality to allow a new known cert, I don't think it has the same user impact [14:43] because digging up the cert and adding it to the config is far more complex than just shoving a "false" in there when you are testing. [14:43] so I'd be happy to add support for custom certs [14:44] but I think we still need the "disable" ability [14:45] jam, not necessarily [14:45] see my comment about firefox above [14:45] firefox bsaically allows me to say 'false' for checking of that server. and it does the rest. [14:46] i've actually done this once before on a project for exlicitly this reason. i figured out how firefox did what it does... how it gets the certificate and did that. and inserted that certificate. [14:47] i do see the point about this being "testing" and that https is likely only used without certificates on "test" scenarios. [14:48] mgz: just one question about the tag solution: if you upgrade a juju deployment that was created before we used the tags and then use a version of juju which uses the tags to filter out machines, your deployment will be broken. What's the policy to solve that kind of upgrade problems in juju? [14:50] hm, good question [14:50] that would be the case with either solution [14:51] True. [14:51] we could use compat code that detected the hey, no tag named after our environment, assume old behaviour of all machines are ours [14:52] but that may not be the best way [14:52] rbva, mgz: we are getting closer to sanity for upgrades, but there's little so far [14:52] mgz: that seems like the only solution [14:52] rvba, I was tending towards mgz's suggestion myself... it's bad but I don't see alternatives [14:53] Well, another solution is to have juju detect that there is no tag, and then create it and attach all the nodes it knows about to it. 
[14:54] we'd need to be doubly sure that destroy-enviornment *twice* wouldn't then go and delete all maas nodes anyway [14:54] rvba, mgz, tag only the machines that have instance ids assigned in state? [14:54] because hey, the second time there's no tag named after our env, so everything must be ours, so wipe it... [14:54] fwereade: yes [14:55] mgz: the second time, no machine id will be stored, so no machine removed. [14:55] rvba, mgz: that can't happen automaticaly within the environment though [14:56] it seems an easy enough disaster to avoid [14:57] rvba, mgz: yeah -- axw has a lot on his plate right now but he seems enthusiastic about doing the long-overdue upgrade stuff in the near future [15:00] mgz: maybe the first solution (explicitly supporting the old behavior) is simpler after all. [15:02] fwereade: out of curiosity, why doesn't juju itself keeps track of the machines it owns? [15:02] keep* [15:03] rvba, it does -- but Destroy is entirely internal to the environment, which is itself expected to keep track of its own machines and differentiate between those in and out of the environment [15:04] rvba, it would indeed be possible to have written it such that juju had to specify all the instances it knew about [15:04] anyone seen this local provider error: http://paste.ubuntu.com/6159055/ [15:04] rvba, but I think that would make it very hard for juju to effectively reap instances that it needed to itself [15:04] it used to work fine a week ago [15:05] loaded invalid environment configuration: storage-port: expected int, got 8040 [15:06] dimitern, that looks kinda like an int has been inappropriately coerced to a string somewhere, doesn't it [15:06] fwereade: I don't want to bother you with that, but I don't really understand. If juju has the list of all the machine it owns, it can pass it to the environment when destroying it. 
[15:06] machines* [15:06] fwereade, it does [15:07] But that's not the way it works now so we have to fix the MAAS provider anyway :). [15:07] rvba, if we start an instance but fail to record it against a machine, we want to automatically trash that instance [15:08] fwereade: hum, I see. [15:08] rvba, I will try to make the situation clearer than it currently is in the writing-a-provider doc I'm working on [15:08] Cool [15:17] mgz: what's the status of the VPC-only bug? [15:22] mgz, if you read the bug report, it states in the description how to get enabled with that on an existing account [15:22] https://bugs.launchpad.net/juju-core/+bug/1221868 [15:22] <_mup_> Bug #1221868: juju broken with ec2 and default vpc [15:22] its took about 2biz days [15:26] fwereade, ping [15:26] dimitern, pong [15:27] fwereade, how do you suggest to live test that thing? so far I tried ec2 live testing and calling juju set svc flag=True, calls config-changed in a debug hooks session and config-get shows it as expected [15:27] dimitern, that sounds solid [15:27] dimitern, but that local provider thing is really alarming [15:28] fwereade, I'll check on trunk to see if it's my branch or it's broken [15:28] dimitern, thanks [15:31] ah, tests pass [15:37] fwereade, same effect in trunk [15:37] args... couple annoying bugs in goyaml..... unmarshaling "" into a *string makes the string nil (not an empty string), and unmarshalling [] into a slice gives you a nil slice (not an empty slice). PITA [15:37] fwereade, so the local provider was broken earlier [15:37] * fwereade freaks out at dimitern but wants to chat to nate for a moment [15:38] natefinch, that's annoying [15:38] fwereade: yeah, we already had one workaround in constraints [15:38] natefinch, I'm sure there was a similar bug with goyaml in the past [15:40] fwereade: yeah, we had to set up a whole SetYAML method because the containertype was getting unmarshaled as nil instead of empty. 
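For context on why nil-versus-empty matters here, a small sketch of the distinction in Go itself. It uses encoding/json from the standard library as a stand-in, so it does not reproduce the goyaml behaviour described above; `describe` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// describe reports whether a slice is nil, empty, or populated.
func describe(s []string) string {
	switch {
	case s == nil:
		return "nil"
	case len(s) == 0:
		return "empty"
	default:
		return "populated"
	}
}

func main() {
	var unset []string    // nil slice: "no value given"
	cleared := []string{} // empty, non-nil slice: "explicitly nothing"

	fmt.Println(describe(unset), describe(cleared)) // prints: nil empty

	// Both have length zero, so code that only checks len cannot tell them apart.
	fmt.Println(len(unset) == len(cleared)) // prints: true

	// A serializer can preserve the difference; encoding/json does.
	a, _ := json.Marshal(unset)
	b, _ := json.Marshal(cleared)
	fmt.Println(string(a), string(b)) // prints: null []
}
```

When a decoder collapses the two cases, any code that treats nil as "not set" and empty as "set to nothing" needs a workaround of the kind mentioned above.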
[15:40] natefinch, ouch -- do you know if there's a goyaml bug for that? [15:40] can someone else try bootstrapping a local environment from trunk and deploying anything, to see if all-machines.log shows this error http://paste.ubuntu.com/6159055/ [15:41] fwereade: didn't look like it when I perused the bug list (only 13 bugs listed) [15:44] TheMue, rogpeppe1, jam, mgz ^^ ? [15:45] and please make sure you did go install . in cmd/juju and jujud/, and use --upload-tools on bootstrap [15:49] hazmat: thanks, I'm just not certain I want to do that on the shared bzr account, how disruptive was it for you? [15:53] fwereade, there's the fix https://codereview.appspot.com/13908044 [15:54] mgz, seamless, just pick a region your not using [15:54] mgz, you have to clear out ec2 resources in that region (ie no running instances, also good to clear out groups) [15:54] ah, that does seem good [15:55] mgz, so i take it then there hasn't been any progress on this? we really need it for 1.16.. [15:55] i ran into two users last week, who couldn't use juju on ec2.. [15:56] fwereade: now there are bugs [15:56] natefinch, thanks [15:58] ok, so no one wants to try to reproduce the local provider issue, i'm filling a bug [16:01] fwereade, do you have a revision that you want to release as 1.15.0? [16:02] sinzui, I am very worried that I do not, because dimitern's problem seems pretty critical to me [16:03] fwereade, okay. That's fine. Is there a bug I can track [16:03] sinzui, dimitern is filing it as we speak [16:04] fab. Thank you. [16:07] fwereade, sinzui: there it is bug 1231543 [16:07] <_mup_> Bug #1231543: upgrader startup failure with local provider [16:09] Thank you dimitern [16:10] dimitern, would you please mark that critical and start investigating? TheMue, are you on something else or can you assist reproing? 
[16:10] fwereade, it's filed as critical [16:10] fwereade, and I'm looking at it [16:10] fwereade, the uniter fix is proposed already [16:10] dimitern, you anticipate my micromanagement with aplomb and panache [16:11] dimitern, I'm about to LGTM it I think [16:12] dimitern, yep, LGTM, just one tweak needed [16:12] fwereade, ok, will tend to it afterwards [16:13] fwereade: can do tomorrow morning, have to reactivate the matching VM (not enough space anymore on disk) [16:15] fwereade: currently I'm fighting with a called but non-existing constructor *sigh* [16:16] * TheMue still will propose now, so the changes can be reviewed [16:18] * fwereade is taking a short family break but will return anon [16:19] shit, propose will not work with the missing function :( [16:21] dimitern: i'll start to setup my testing vm now [16:22] dimitern: will you not any findings in the issue to that i can support you after setup later [16:24] cu later [16:24] TheMue, so far I tested it happens in trunk and r1885, will go further [16:26] dimitern, mgz, jam, natefinch: next stage in environment info storage, reviews appreciated please: https://codereview.appspot.com/13970043 [16:26] fwereade: ^ [16:30] dimitern: ping [16:30] ok, so it doesn't happen as far as r1844, going back up [16:30] rogpeppe1, pong [16:30] rogpeppe1, I'm up to my elbows into the local provider atm [16:30] dimitern: i'm just wondering about API connections and how they can find the API addresses to store locally [16:31] rogpeppe1, expand a bit please [16:31] dimitern: so, the plan is that when we make an API connection, we find out the current set of API addresses and store that locally in a .jenv file [16:31] rogpeppe1, how about if they change after that? 
[16:32] dimitern: we refresh the cache each time we connect [16:32] dimitern: and fall back to environ config info if the connection fails [16:32] rogpeppe1, sgtm [16:32] dimitern: but we need to find out the current set of API addresses so we can store them [16:33] dimitern: and i'm thinking of an API call that's available to anyone that can access the API that returns them [16:33] rogpeppe1: it could be returned from Login [16:33] rogpeppe1, so like a Login call [16:34] jam: that's an interesting idea [16:34] jam: i quite like that actually. [16:34] jam: then api.Open can cache it, so it can be retrieved by a later call [16:34] jam: so we don't have to change the type sig [16:34] something like that, yeah [16:35] ah, there's a problem, i think [16:36] jam: i *think* that State.APIAddresses just returns the same IP addresses that mongo peers use to talk to each other [16:36] jam: which probably won't be public IP addresses [16:36] rogpeppe1, they aren't [16:36] damn. i guess i'll need to fix that first [16:37] rogpeppe1, but with the addresser stuff coming up it might not be needed [16:37] machine addressability [16:37] dimitern: go on... how does that help? [16:37] rogpeppe1, machines will know their own addresses (public, private, all) [16:38] dimitern: go on [16:38] rogpeppe1, and you can query state for them, and there will be a worker to update them as needed [16:38] rogpeppe1, mgz is working on that I think for some time [16:38] dimitern: so to find the API addresses, you do a search for all machines with ManageState, then query their addresses? [16:39] s/ManageState/JobManageState/ [16:39] rogpeppe1, yes [16:39] rogpeppe1, and for other potential new jobs we have [16:39] dimitern: that seems somewhat inefficient. wouldn't it be a linear scan? [16:40] rogpeppe1, who needs to know? 
[16:40] dimitern: it'll happen every time someone connects to the API [16:40] rogpeppe1, and currently it happens thorough the StateInfo [16:40] dimitern: i was thinking that we'd have a doc in mongo which held the API addresses, then some agent would maintain that [16:40] through [16:41] rogpeppe1, that might be an addition to the addressability stuff, or even orthogonal to it [16:41] dimitern: i think it's orthogonal, yes [16:42] hmm, how does a machine's public address get filled in now? by the provisioner, i guess [16:42] rogpeppe1: that's the idea [16:43] not sure what you mean by "linear scan" though [16:43] mgz: well, if i want to find out the addresses of all machines that are state servers, how should i do it? [16:43] rogpeppe1, not really [16:44] rogpeppe1, the unit's addresses are set by the uniter, but the machine addresses are taken from the environment [16:44] query out machines that have the stateserver bit set in mongo, and pull the address? [16:44] rogpeppe1, by the provisoner, but it doesn't set them anywhere yet [16:44] mgz: won't that be a linear scan through all machines? [16:44] having a seperate table with addresses of state servers doesn't *sound* faster to me [16:45] but is also perfectly possible, it's just a denormalisation [16:46] fwereade, I found the culprit - the issue in bug 1231543 starts to happen in r1877 [16:46] <_mup_> Bug #1231543: upgrader startup failure with local provider [16:46] mgz: to me it sounds like one fetch of a document in a single document collection, versus a scan through potentially many hundreds [16:47] mgz: but... i think that for the time being it's probably fine [16:47] mgz: storing the addresses separately is an optimisation really. [16:48] dimitern: hmm, so the uniter API has PublicAddress and SetPublicAddress. is there any particular reason for that? 
[16:49] rogpeppe1, the uniter sets these on startup [16:50] dimitern: what i mean is: why have the PublicAddress method if it's only there to pass its result to SetPublicAddress? [16:51] dimitern: (which also gives a compromised uniter the potential freedom to muck with its reported public address, something you probably don't want) [16:51] rogpeppe1, the uniter needs both to set public/private addresses of a unit, and to read them [16:52] dimitern: why's that? [16:52] rogpeppe1, the addresses shouldn't be on a unit at all - they should be on a machine, but that's that [16:52] dimitern: i'm wondering about an API call, say Start, which informs the API that the uniter has started [16:52] rogpeppe1, because public-address is one of the relation settings set automatically when entering scope for example [16:53] dimitern: ah, good point, so we need PublicAddress [16:53] rogpeppe1, the API very well knows when the unit agent connects, and starts a pinger now [16:53] dimitern: in that case, that's probably the moment that the public and private addresses should be set [16:54] rogpeppe1, perhaps, if we're not using a separate worker for that [16:54] rogpeppe1, and setting them on the machine, not on the unit [16:54] dimitern: yeah [16:55] dimitern: but the point is that we could remove that stuff from ModeInit, i think [16:55] dimitern: hmm, except not right now of course [16:56] dimitern: because it really does get the public address from the provider [16:56] dimitern: ok, ignore my stupidity [16:56] I've added an explaination to bug 1227533 about our memory woes the last week [16:56] <_mup_> Bug #1227533: Juju fails to bootstrap if memory is lower than 1GB [16:56] now I must depart, farewell! [16:56] mgz: one mo, please? [16:57] one mo while I close things :) [16:57] rogpeppe1, there's a todo about it in mode init [16:57] mgz: kapil was asking about the status of the VPC-only bug... 
[16:57] dimitern: yeah, i understand that now :-) [16:58] rogpeppe1, ...and a few other places, and there's the tech-dept bug 1205371 [16:58] <_mup_> Bug #1205371: state.Addresses and APIAddresses need better implementation [16:58] dimitern: hmm, so there's no way of finding out a machine's public address currently unless it has a unit on it? [16:58] rogpeppe1: it's the next on my list, but haven't started yet, saw his comments earlier [16:58] mgz: ok, cool [16:58] will tackle the registration stuff at least tomorrow [16:58] okay, now must fly [17:00] * dimitern is totally puzzled how r1877 could lead to that local provider issue [17:22] i'd love a review of https://codereview.appspot.com/13970043/ if anyone has a little time [17:22] rogpeppe1: I can take that [17:23] natefinch: ta muchly [17:23] natefinch: [17:33] dimitern, thanks, Iwill meditate upon 1877 [17:34] dimitern, "The simplestreams tools metadata includes a sha256..."? [17:35] rogpeppe1: what's the difference between done := make(chan struct{}) [17:35] go func() { info.BootstrapConfig(); done <- struct{}{} }() [17:35] <-done [17:35] and just calling info.BootstrapConfig() in the current goroutine? They both just block waiting for bootstrapconfig to finish, right? 
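The two forms being compared, written out as a runnable sketch. `bootstrapConfig` is a hypothetical stand-in for info.BootstrapConfig; both wrappers return only after it has finished, and the visible difference is which goroutine the call runs on.

```go
package main

import "fmt"

// bootstrapConfig is a hypothetical stand-in for info.BootstrapConfig.
func bootstrapConfig(log *[]string) {
	*log = append(*log, "bootstrap config done")
}

// direct calls the function in the current goroutine.
func direct(log *[]string) {
	bootstrapConfig(log)
}

// viaGoroutine runs the call in a new goroutine and blocks on the channel
// until that goroutine signals completion.
func viaGoroutine(log *[]string) {
	done := make(chan struct{})
	go func() {
		bootstrapConfig(log)
		done <- struct{}{}
	}()
	<-done // the send/receive pair also orders the write to *log before the read
}

func main() {
	var a, b []string
	direct(&a)
	viaGoroutine(&b)
	fmt.Println(len(a), len(b)) // prints: 1 1
}
```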
[17:35] natefinch: ha, there is a subtle difference, but it's just a debugging remnant [17:35] natefinch: i'll revert it [17:35] natefinch: 2 points if you can tell me why i did it :-) [17:36] rogpeppe1: if you had a panic in bootstrap config it would make the call stack a lot shorter [17:36] natefinch: close [17:38] rogpeppe1: could be something to do with the scheduler, but that seems too subtle to matter [17:38] natefinch: nah [17:38] natefinch: it's to do with gocheck [17:38] natefinch: if you panic, then gocheck catches it and distorts things [17:38] natefinch: so by panicing in a goroutine you get a much cleaner idea of what's going on at that momen [17:38] t [17:39] ahh ok [17:40] rogpeppe1: I presume you'll take out the log messages in there as well [17:40] natefinch: yes [17:40] k\ [17:42] rogpeppe1: btw, is "erewhemos" someone misspelling "somewhere" backwards, or something that actually makes more sense? [17:42] natefinch: the former :-) [17:42] rogpeppe1, sweet! i'll remember that trick next time i'm fighting tests panic [17:43] rogpeppe1: ha, ok. I thought so, but you never know [17:43] natefinch: just a nonsense name that's unlikely to be confused with anything in the production code [17:43] natefinch, I'm sorry about that, there was a satirical work by samuel butler called "erewhon" which is not *quite* "nowhere" backwards [17:43] natefinch, it seemed like a good idea at the time [17:43] fwereade, yes that's whati found so fr [17:44] dimitern, just to be crystal clear: 1876 works, 1877 does not? [17:44] fwereade: we're in the *distopia* right? [17:44] dystopia, sorry [17:44] fwereade: haha, ok. 
not up on my Victorian authors [17:44] rogpeppe1, heh [17:44] fwereade, that's what I see, but I'll double check, just a minute [17:49] fwereade, indeed [17:49] fwereade, and the error now makes sense 2013-09-26 17:48:00 ERROR juju runner.go:211 worker: exited "upgrader": cannot set agent tools for machine 0: empty size or checksum [17:51] fwereade, but, interestingly the coercing error is not there in 1877 [18:02] natefinch: still waiting for that review, BTW :-) [18:03] rogpeppe1: still doing it. Had to stop in the middle for a little bit. Almost done :) [18:03] natefinch: np [18:08] fwereade, so the other error starts to show in my r1884 that switches to api provisioner [18:13] fwereade: do you know what stage mgz is at with the addressing stuff? [18:13] rogpeppe1: done [18:13] fwereade: i just started hacking up the publisher/addresser worker, then realised that he might already have done/nearly done it [18:13] natefinch: thanks [18:18] rogpeppe1, I'm afraid I do not actually know, i was kinda expecting a CLfrom him today [18:19] fwereade: i need that, or something like it, to cache the API addresses [18:19] dimitern, ah ok [18:19] dimitern, so the upgrader thing appears to be a problem [18:19] fwereade: this is the sketch of the code i just wrote: http://paste.ubuntu.com/6159815/ [18:19] fwereade, yeah [18:20] fwereade: oops, this is better: http://paste.ubuntu.com/6159817/ [18:20] dimitern, I thought all we were meant to be setting was a version, not a whole tools [18:21] fwereade, and the other thing - it doesn't seem to be an int coerced to string, it's an int - I debugged so far as to say the provisionerAPI returns the correct map[string]interface{} in worker/WaitForEnviron [18:21] dimitern, oh, ffs, is it possibly a json problem? definitely an int and not a float? 
[18:23] rogpeppe1, sorry, I have only skim-read it, but I think it may well have overlap [18:23] fwereade, trying to see exactly what now [18:24] fwereade: yeah, if he's doing an addresser worker, it almost certainly will [18:24] fwereade: well, i'll keep it around in case [18:24] any idea why this error? ERROR juju.provider.local environ.go:482 could not install machine agent service: exec ["start" "juju-agent-dimitern-local"]: exit status 1 (start: Job is already running: juju-agent-dimitern-local) [18:25] time to stop for the day [18:26] dimitern, aw hell, that really should be fixed for 1.16 too, we don't seem to shut down local envs cleanly [18:27] fwereade, hmm - we *are* stopping them, but the upstart job remained and it though "because it's there, it must be running" [18:28] dimitern, looks like we're calling StopAndRemove though [18:30] fwereade, hmm.. it get's deeper [18:30] fwereade, so now the upstart job hangs [18:30] fwereade, that's why the bootstrap doesn't complete and I terminated it [18:31] g'night all [18:31] might be back later, actually [18:31] rogpeppe1, see you soon [18:32] dimitern, "cannot install, already running" seems to imply that it really was running [18:32] dimitern, and was thus not properly cleaned up [18:33] fwereade, believe me, ps xa | grep juju was the first thing I did - no results, even as root [18:33] fwereade, just the upstart job was there [18:33] dimitern, very strange [18:38] fwereade, so the mongo hangs at bootstrap [18:38] fwereade, and that fails the whole thing [18:38] fwereade, it's indeed running now, and the error is correct [18:39] dimitern, ok, so we have *some* sort of poorly characterized local provider cleanup problem [18:39] fwereade, and even upstart believes jujud job is running [18:39] fwereade, and I can't see it [18:40] dimitern, and a clear current issue: that we're recording full agent tools including hashes for no clear reason, when all we really care about it the binary version they're running 
[18:40] dimitern, concur> [18:42] fwereade, not sure I get you there [18:43] dimitern, so the problem seems to be that we're setting *tools* on the agent, rather than just setting the binary version which is all anyone cares about AFAIK [18:44] dimitern, and we can't set tools because we didn't record the hash we downloaded and verified [18:44] dimitern, and it seems a bit pointless to report it back to juju when juju told it to us in the first place [18:45] fwereade, yes, that seems likely [18:48] fwereade, I have to stop though.. lest my head explodes :/ [18:48] dimitern, no worries at all, you are already above and beyond [18:49] dimitern, is there a specific bug for the tools issue? [18:49] fwereade, don't know [18:49] fwereade, I added the one for the upgrader, but this seems unrelated [18:50] dimitern, the upgrader was what I meant by the tools issue [18:50] fwereade, bs, actually the upgrader error is about tools, the other errors were different [18:50] fwereade, :) [18:50] dimitern, I think there is one for screwy local-env destruction [18:51] fwereade, maybe [19:25] fwereade: the point of setting tools on the agent was so that it was possible to make available that information in the status, so you could know exactly what s/w was running on each machine [19:27] rogpeppe1, ok, so we *should* have to record and write into the tools dirs the hashes of the original tarballs? 
[19:27] fwereade: yes [19:29] rogpeppe1, I dob't really see how that helps anyone [19:30] fwereade: when debugging stuff it means you have an unambiguous record of what is being run where, which i *think* could be very useful at times [19:30] fwereade: for reproducibility and diagnosis of difficult issues in a highly distributed environment [19:31] fwereade: and i don't really see why it should be a hard thing to do, though i haven't read through the discussion above, so i don't know what the current issue is [19:32] rogpeppe1, it looks like we're barfing when calling SetAgentTools because the tools in state now demand a hash [19:32] fwereade: and you can't have a Tools with an empty hash? [19:32] rogpeppe1, apparently not [19:33] rogpeppe1, it seems to be demanding that if there's a URL, there must be a size and checksum [19:34] fwereade: oh yes, checkToolsValidity [19:34] rogpeppe1, but not barfing if there's no URL [19:34] rogpeppe1, when I *thought* we always wrote a URL [19:35] rogpeppe1, but ofc do not necessarily have the original tgz available and so can't always manage size/hash [19:35] rogpeppe1, (not that we do, even when we do, AFAIK -- maybe that changed somewhere?) [19:42] fwereade: sorry, computer just crashed [19:43] fwereade: last thing i was was "it seems to be demanding that if there's a URL, there must be a size and checksum" [19:43] s/was/saw was/ [20:26] sigh.... goyaml doesn't differentiate between nil slices and empy slices :/ === sidnei` is now known as sidnei [22:22] fwereade: hiya, saw the email about the error, i can take a look [22:22] wallyworld, tyvm [22:22] any clues to get me started? i see a few comments in the bug [22:22] could it be related to the env split up? [22:25] grr [22:25] I have the upgrader constantly bouncing [22:25] any one else noticed? [22:25] wallyworld: fwereade: ??? 
http://paste.ubuntu.com/6160651/ [22:26] thumper, https://bugs.launchpad.net/juju-core/+bug/1231543 [22:26] <_mup_> Bug #1231543: upgrader startup failure with local provider [22:26] thumper, wallyworld is looking at it now dimitern has I think stopped [22:26] thumper: that error looks like tools checksum is failing to be calculated [22:27] kk [22:27] I'm trying to chase the lxc issues [22:27] fwereade: thumper's error message mentions checksums, whereas bug says something about ports [22:27] wallyworld, that is also a problem [22:28] yeah, so issues \o/ [22:28] 2 [22:28] wallyworld, but the tools checksum is easier to get a handle on and isolate [22:28] the tools one is my fault [22:28] if i can't easy find it i can just disable the checksum check for now [22:28] wallyworld, so do we now write out size/sha256 into the tools dir when we unbundle? [22:28] we do [22:29] but for some reason the checksum is not getting passed down the api [22:29] wallyworld, I bet we just miss it in the local provider then [22:29] wallyworld, or is it happening everywhere? [22:29] it could be that the tools are being read from the old place which means no checksum [22:29] wallyworld, although, hmm, yeah exactly [22:29] fwereade: i tested bootstrapping on ec2, hp etc with the new stuff and it works [22:29] wallyworld, I'm a little scepticalabout the value of recording all that in state anyway [22:30] wallyworld, cool [22:31] fwereade: we recorded the url in state, from which a tools stuct is made. and that tools struct is used to find a tools tarball. so it needs the checksum [22:32] wallyworld, we only ever call SetAgentTools in code that has already been extracted from the tarball in question [22:33] fwereade: i'll have to re-read the code - what do we use the agent tools stored in state for? the tools info from SetAgentTools? [22:33] wallyworld, not much [22:34] fwereade: we should get around to fixing the tools for the local provider [22:34] so i could drop the checksum requirement. 
i thought it was needed somewhere, can't recall though [22:34] wallyworld, that said, minimal changes good, I am not encouraging you to rewrite and would most favour a simple tweak to the local providr that made sure it wrote its tools dir properly [22:34] rather than the upload-tools malarky we do now [22:34] thumper, oh, god, yes we should [22:34] fwereade: however I'm not sure what the best way is [22:35] thumper, I'm quite sure we can harmonise it with all the simplestreams stuff [22:35] I hope so [22:35] fwereade: when you say "not much" - is there a simple explanation of why we store the tools url and version in state? [22:36] wallyworld, the version we need for status [22:36] wallyworld, series is duplicated, a machine should already know its own series [22:36] why the url? [22:36] wallyworld, and for that matter arch should always be in hardware characteristics too [22:37] do we ever use the url to fetch tools? [22:37] if not, i can drop the need for imsisting on checksum [22:37] wallyworld, I was asking rogpeppe -- I hope I am not mischaracterising him to say that it's there just in case it turns out to be useful one day [22:38] wallyworld, SetAgentTools is, as far as I'm aware, purely a record of what the agent reports itself to be running [22:38] well [22:38] wallyworld, url and checksum and size are not, I think, exposed anywhere [22:38] not sure i agree with recording all that extra info just to report a version [22:39] wallyworld, all that detail in (once) state.Tools would have been great if we'd ever stored an environment's available tools in state [22:39] fwereade: would you object if i zero out url and checksum in set agent tools [22:40] if we have a url and not the checksum, that is not something we should encourage [22:40] wallyworld, because then we could just grab the tools for a particular machine with a trvial query, get the url and size and checksum, and hand them straight over [22:40] wallyworld, well [22:40] or i could find out why 
checksum is missing [22:40] wallyworld, the url really just indicates "this is where we got them from" [22:41] ok, i'll see how it pans out. for the release, where we need something done, it may just be easier to drop the mandatory checksum requirement [22:41] and fix next week [22:41] wallyworld, indeed, if that's what it comes to then so be it [22:42] cause the other issue sounds more tricky [22:42] wallyworld, fwereade: I'll look at the port int issue [22:42] thumper, <3 [22:42] wallyworld: if you want to tackle the checksum thing [22:42] yes indeed [22:43] heh, interesting, [22:43] fwereade: i'm also part way through ripping out all legacy tools support - that will need to be landed after 1.15 when all clouds have had simplestreams tools uploaded by the release team [22:43] I can see from the rpc logging that the value is being sent through as an int [22:43] * thumper digs [22:44] what the actual fuck... [22:45] wallyworld, awesome news [22:45] thumper, that sounds less awesome [22:45] * thumper just digging [22:45] * wallyworld needs a coffee [22:56] thumper: how do i reproduce your issue? [22:56] wallyworld: all I did was bootstrap the local provider [22:56] ok [22:56] I did try to deploy some things [22:56] before I checked the logs [22:56] so not entirely sure [22:56] np thanks [22:56] but I feel just bootstrap is enough [22:57] I also feel that my problem may be shadowing yours [22:57] should be easy to find then hopefully [22:57] so you might not get yours fixed [22:57] until mine is [22:57] let's find out [23:07] hmm... 
[23:07] I think I know what it is, but it is weird [23:07] and not sure why it hasn't broken before this [23:07] if it is what I think it is [23:08] * fwereade wants to watch, but is going to bed instead [23:08] gn all [23:10] fwereade: night [23:10] thumper: i found the spot where SetAgentTools was passing in incomplete tools [23:11] wallyworld: cool, I've found out where the validate is failing, but unsure as to why [23:11] but i'm not sure i have the size and checksum info at that point to pass in also [23:11] it really is just passing in a version wrapped in a tools struct which seems silly [23:14] thumper: ah, actually i think when local provider starts up, the tools hack it uses might not be recording the checksum etc, so when that info is read back later, it is missing [23:14] * wallyworld is guessing [23:14] how do I get the type of something printed out? [23:14] %T [23:14] fmt.Println("%T", thing) [23:14] Printf [23:16] stabby!!!!!!!!!!!!!!!!!!1 [23:16] error used to be : storage-port: expected int, got 8040 [23:16] added type info [23:16] guess what? [23:16] storage-port: expected int, got float64(8040) [23:16] this is why it is failing [23:16] FFS [23:17] is it because json serialization only has float64? [23:17] how do we fix this in a non sucky way? [23:18] * thumper wonders how the api port is handled [23:18] * thumper digs [23:18] stabby stabby [23:18] the difference is: [23:18] schema.Int [23:18] vs [23:18] schema.ForceInt [23:18] guess which is which? [23:20] huh?
[23:20] I change it, now I get a panic [23:22] thumper: you need a custom json unmarshaller i think [23:22] for the struct [23:22] no, found it [23:22] you wouldn't believe it if I told you [23:22] well, you might [23:22] schema.Int -> int64 [23:23] schema.ForceInt -> int [23:23] wtf [23:23] ok, that fixes it [23:23] \o/ [23:24] thumper: save me some time - can you point me to where the local provider does its tools hacky thing to find the tools to bundle [23:24] it does the default --upload-tools bit [23:24] what do you mean exactly? [23:25] for some reason, the tools struct passed to bootstrap is (i think) missing the checksum info [23:25] i need to find out how that is happening [23:27] just working backwards to find it [23:28] probably the possible tools created by the upload-tools stuff [23:28] at a guess [23:35] * thumper proposes a couple of branches [23:36] thumper: found it, fixed, testing [23:37] https://codereview.appspot.com/14005043/ is just logging tweaks [23:37] the local environ did not implement CustomToolsSource interface [23:37] so it did not find tools using simplestreams, and defaulted to legacy [23:37] which means no checksums [23:38] https://codereview.appspot.com/14006043 is the fix for the config [23:38] ah [23:38] * thumper goes to set commit messages in prep [23:39] * thumper waits for review [23:39] almost time for lunch [23:40] thumper: done with one comment [23:42] added a little context [23:43] wallyworld: the new test failed with the expected same error output to the log file [23:43] changed the schema, and all good \o/ [23:43] yay [23:43] thumper: i'll be proposing a fix soon, maybe you can look after lunch [23:44] ok [23:44] * thumper is heading into town to lunch with veebers [23:44] wallyworld: once you review the actual fix, you can approve it [23:44] I'm hoping you won't find any issue [23:45] * thumper -> lunch [23:45] ok
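The storage-port failure discussed above (a value sent as an int arriving as float64(8040)) can be reproduced outside juju with a small sketch. It assumes the config travels as JSON and is decoded into a map[string]interface{}, which is where Go's encoding/json turns every number into float64. The names forceInt and decodePort are hypothetical helpers written for this illustration; they are not juju-core's schema package, and the real schema.ForceInt may behave differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// forceInt is a hypothetical coercion helper (NOT juju-core's schema.ForceInt).
// It accepts int, int64, or a float64 with no fractional part, and returns an int.
func forceInt(v interface{}) (int, error) {
	switch n := v.(type) {
	case int:
		return n, nil
	case int64:
		return int(n), nil
	case float64:
		if n != math.Trunc(n) {
			return 0, fmt.Errorf("expected int, got %T(%v)", v, v)
		}
		return int(n), nil
	}
	return 0, fmt.Errorf("expected int, got %T(%v)", v, v)
}

// decodePort is a hypothetical helper: it decodes a JSON config into a generic
// map and coerces the "storage-port" entry to int.
func decodePort(data []byte) (int, error) {
	var cfg map[string]interface{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return 0, err
	}
	return forceInt(cfg["storage-port"])
}

func main() {
	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(`{"storage-port": 8040}`), &cfg); err != nil {
		panic(err)
	}
	// Printf (not Println) is needed for the %T verb to take effect.
	fmt.Printf("%T(%v)\n", cfg["storage-port"], cfg["storage-port"])

	port, err := forceInt(cfg["storage-port"])
	fmt.Println(port, err)
}
```

A strict check that only accepts int or int64 would reject the decoded value, which matches the "expected int, got float64(8040)" error quoted in the log; coercing whole-number floats is one way around it, under the assumptions stated above.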