[00:20] Hi davecheney
[00:22] sinzui: s'ok, wallyworld_ answered my question
[00:45] ericsnow: ok, done with dinner, taking a look at 708
[00:50] waigani_: I'm being bitten by the missing envuser now too
[00:50] waigani_: because I changed the 'add service' code to look for them :-|
[00:50] waigani_: how far away are you?
[00:50] thumper: just about done
[00:50] coolio
[00:51] thumper: I didn't see your messages but I implemented exactly what you suggested - even the param name :)
[00:51] waigani_: I didn't say them on the PR, just here
[01:02] thumper: currently if you pass in creator: "eric" to MakeUser it does not create a local user for eric
[01:02] waigani_: I think that is fine for now
[01:02] thumper: this now fails when you try to specify eric as the creator of the environuser
[01:03] waigani_: again, probably ok
[01:03] waigani_: fall back to the environment owner
[01:03] if not specified
[01:04] thumper: you mean if doesn't exist as a local user?
[01:04] because it is being specified in the params
[01:04] I mean that if you pass a value in explicitly to the factory, it is expected to work
[01:05] if you haven't set it up right, then it is the test's fault
[01:05] not the factory
[01:05] thumper: right, so if you pass in "eric" as the creator you should have created "eric" as a local user?
[01:05] yes, that is what I'm saying
[01:06] got it, I'll update the test in that case
[01:15] thumper: do we still need factory.MakeEnvUser ?
[01:16] yes
[01:16] waigani_: there will be cases where we want an envuser, but they are not local
[01:16] all users are local users
[01:17] right, of course
[01:21] thumper: just fixing up all the call sites now, there are a few
[01:27] waigani_: here is one for your TODO list:
[01:27] func (s *Service) GetOwnerTag() string {
[01:27] from state/service.og
[01:27] s/og/go/
[01:28] please make it return a names.UserTag
[01:29] coffee time
[01:29] thumper: okay
[01:30] thumper: should the user now have a func to get the envuser?
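Editorial sketch of the TODO above (changing `GetOwnerTag` from a plain string to a typed tag). This is a minimal, self-contained illustration, not the juju code: `UserTag`, `parseUserTag` and `service` are hypothetical stand-ins for `names.UserTag`, its parser and `state.Service`, and the `user-<name>` string format is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// UserTag is a hypothetical stand-in for names.UserTag: a typed
// wrapper so callers cannot confuse a user tag with any other string.
type UserTag struct{ name string }

// String renders the tag in the assumed "user-<name>" form.
func (t UserTag) String() string { return "user-" + t.name }

// parseUserTag converts a stored string into a UserTag. The
// "user-<name>" format is an assumption made for this sketch.
func parseUserTag(s string) (UserTag, error) {
	const prefix = "user-"
	if !strings.HasPrefix(s, prefix) || len(s) == len(prefix) {
		return UserTag{}, fmt.Errorf("%q is not a valid user tag", s)
	}
	return UserTag{name: strings.TrimPrefix(s, prefix)}, nil
}

// service is a hypothetical stand-in for the state.Service document.
type service struct{ ownerTag string }

// GetOwnerTag returns the typed tag instead of the raw string. The
// error should never fire for well-formed documents, but returning
// it is gentler than panicking if the stored data is ever wrong.
func (s *service) GetOwnerTag() (UserTag, error) {
	return parseUserTag(s.ownerTag)
}
```

With this shape a caller writes `tag, err := svc.GetOwnerTag()` and decides locally what a malformed stored value should mean.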
[01:37] waigani_: I don't think so
[01:37] thumper: https://github.com/juju/juju/pull/702
[01:42] * thumper looks
[01:46] waigani_: one change and one question
[01:47] thumper: api.Open does not return a NotFound err
[01:47] I tried to satisfy and it failed
[01:47] is that error from api.Open?
[01:47] waigani_: that's fine, that is why I asked :)
[01:47] right
[01:48] although...
[01:48] by the time it hits api.Open
[01:48] it should be "permission denied"
[01:48] and nothing else
[01:49] * thumper adds another comment
[01:49] thumper: why perm denied? I'm giving the user perms in the test
[01:49] waigani_: no, in the general case
[01:50] and you aren't giving the user perms, you are explicitly testing that they can't get in
[01:50] the error result should be "permission denied"
[01:52] right, because it's more info than you should share to say that the user is not found
[01:54] right
[02:03] oops, sorry wallyworld_, thought I had updated my blobstore when I added it to dependencies.tsv...
[02:03] no worries
[02:09] davecheney: I have rockne-02 up with the deb locally
[02:10] davecheney: but I can't remember how to install a deb
[02:10] anyone?
[02:12] sudo dpkg -i package.deb
[02:13] bcsaller: ta
[02:15] thumper: how the mighty have fallen
[02:15] davecheney: I don't claim to be mighty with dpkg
[02:15] nor have ever
[02:17] hmm...
[02:17] juju bootstrap tells me port 37017 is in use
[02:17] how can I get netstat to tell me if this is true
[02:17] I did 'netstat -a'
[02:17] but that didn't show the port in use
[02:17] am I missing something?
[02:18] actually, I see it now
[02:18] hmm...
[02:18] how to find out the process?
[02:22] davecheney: if I can bootstrap with the 1.18.4 deb specified with the local provider, and do status, is that verified fixed?
[02:23] mwhudson: hey, around?
[02:23] thumper: yes
[02:24] thumper: yes, i think so
[02:26] davecheney: cool
[02:56] thumper: are you sure rockne doesn't have 64k pages ?
[02:56] if you hit it with the apt-get upgrade hammer
[02:56] it will be running 64k pages
[02:56] davecheney: yes, looked
[02:56] welp, shitter
[02:57] davecheney: I just did upgrade, and not dist-upgrade
[02:57] you want me to try that?
[02:57] nope
[02:57] uname -a
[02:57] * thumper is sshing in again
[02:57] getconf PAGESIZE
[02:57] Linux rockne-02 3.13.0-18-generic #38-Ubuntu SMP Mon Mar 17 21:41:16 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
[02:58] ah...
[02:58] wat
[02:58] * thumper did that just before
[02:58] but got a different result
[02:58] ubuntu@rockne-02:~$ getconf PAGESIZE
[02:58] 65536
[02:58] textbook definition of insanity
[02:58] it was 4096 when I looked just before
[02:58] maybe that was your own host
[02:59] perhaps
[02:59] i reckon it's not an issue
[02:59] you did the test right
[02:59] * thumper is bootstrapping again
[03:00] where did it fail last time?
[03:01] * thumper tags verification-done
[03:01] once any juju process had been running for > 5 mins
[03:01] juju ssh some unit
[03:01] wait for 20 mins
[03:01] no crash, all good
[03:02] oh, it has to run for some time?
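Editorial sketch of the page-size check above. `getconf PAGESIZE` reported 65536 on rockne-02 where 4096 had been expected, and a length that is a whole number of 4 KiB pages is not necessarily a whole number of 64 KiB pages. The Go below is a minimal illustration of that arithmetic; the helper names are hypothetical, and only `os.Getpagesize` is a real standard-library call.

```go
package main

import "os"

// currentPageSize reports the running kernel's page size, the same
// value `getconf PAGESIZE` prints (65536 on rockne-02 in the log).
func currentPageSize() int { return os.Getpagesize() }

// isPageMultiple reports whether n bytes is a whole number of pages
// for the given page size.
func isPageMultiple(n, pageSize int) bool {
	return pageSize > 0 && n%pageSize == 0
}

// roundUpToPage rounds n up to the next multiple of pageSize, which
// is what code releasing memory has to do before handing a length
// to the kernel.
func roundUpToPage(n, pageSize int) int {
	return (n + pageSize - 1) / pageSize * pageSize
}
```

Code that hard-codes 4096 passes `isPageMultiple(n, 4096)` yet fails the same check against `currentPageSize()` on a 64k-page kernel, which is why the host's actual page size mattered for the verification.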
[03:02] hmm
[03:02] * thumper bootstraps it and waits
[03:04] yup, the bug is when the scavenger runs, it will try to munmap(2) an area of memory that isn't a multiple of the page size
[03:04] this shows up on agents
[03:04] and using juju ssh as the juju ssh parent process just sits there quietly
[03:14] thumper: PTAL https://github.com/juju/juju/pull/709
[03:16] thumper: double underpants, check out dmesg
[03:16] make sure there are no oddball kernel messages there
[03:16] that's the canonical check
[03:17] davecheney: well, machine 0 has been up over 15 minutes
[03:17] had 'watch juju status` running
[03:18] nup, that won't show it
[03:18] juju status only runs for a few seconds
[03:18] so either the jujud daemons crash
[03:18] dmesg seems fine
[03:19] look, it's fixed
[03:19] it's been fixed for months
[03:19] if you use the right compiler
[03:19] * thumper has marked the bug as verified
[03:19] job done
[03:19] next
[03:26] any thoughts on why I can run lxc containers from inside a docker container but that the local provider fails to dial the state server on bootstrap?
[03:27] bcsaller: sounds like the networking is all fucked up
[03:29] davecheney: I was able to lxc-create/start etc. I manually brought up the lxcbr0 in the container and that seemed to work in the raw lxc case. w/o the bridge bootstrap was failing much sooner
[03:29] so it still might be, but I'm not sure that it is
[03:34] bcsaller: what addresses and networks do the various components have ?
[03:35] davecheney: juju.state open.go:101 connection failed, will retry: dial tcp 127.0.0.1:37017: connection refused
[03:35] is the failure I'm seeing x100
[03:35] so it's not getting very far I think
[03:36] I put lxcbr0 on 10.0.4.1
[03:37] and eth0 in the container is a 172. address
[03:39] menn0: did you figure out why your test was passing when you didn't think it should?
[03:40] davecheney: eh, looks like there still might be some issues with the lxc-container networking as well, so I'll keep debugging the setup
=== allenap_ is now known as allenap
=== psivaa_ is now known as psivaa
[04:12] _thumper_: do we need to handle the error from ParseUserTag? s.doc.OwnerTag is guaranteed to be in the right format, right?
=== _thumper_ is now known as thumper
[04:13] * thumper thinks of how to best handle this...
[04:15] waigani_: as much as I find it a little frustrating, I think the only real approach is to return (names.UserTag, error)
[04:15] and handle the error in the places where we need to
[04:16] which is exactly one place
[04:16] thumper: yep
[04:16] we shouldn't ever get an error
[04:16] but I'd rather return an error that may one day be real
[04:16] than panic
[04:16] yeah, for sure
[04:30] waigani_: what line ?
[04:31] davecheney: https://github.com/juju/juju/pull/713
[04:31] davecheney: state/service.go:628
[04:34] waigani_: is it too late to not call the document OwnerTag
[04:34] 'cos it's not
[04:35] davecheney: nope, what would you like it called?
[04:35] anything, as long as it doesn't end with Tag
[04:35] there are two reasons for this
[04:35] 1. the data in there is not in tag string format
[04:35] 2. william has decreed that tags shall not be stored in the database
[04:36] GetOwner ?
[04:36] thumper: ^?
[04:36] sgtm
[04:37] davecheney: unfortunately it is indeed a string version of a tag
[04:37] davecheney: and I think that 2. is flexible if it refers to a generic entity
[04:37] but in this case it certainly doesn't
[04:37] it is only a user
[04:38] ok, if it is a tag
[04:38] so it is a little more complicated
[04:38] then it should be called OwnerTag and it must be passed through ParseUserTag
[04:38] there was the suggestion to remove it altogether
[04:39] and clean it up
[04:39] thumper: fair enough
[04:39] i don't know the background
[04:39] just eating what's in front of me
[04:39] it was an early attempt to deal with permissions
[04:39] * thumper nods
[04:39] s/eating/digesting
[04:42] waigani_: this is turning out to be much more of a PITA than I wanted
[04:42] * thumper is considering the whole kill it approach
[04:42] thumper: doing last round of testing
[04:42] nuke it from orbit
[04:42] it is the only way to be sure
[04:43] thumper: you want me to drop the branch?
[04:44] waigani_: I
[04:44] ugh
[04:44] I'm thinking we may be throwing good effort after bad
[04:44] and we should perhaps just clean up the mess
[04:44] ooooh
[04:44] rather than pushing it into a nice pile in the corner
[04:45] I'd like to clarify with fwereade
[04:45] waigani_: however, removing it has more changes
[04:45] as all the deploy helpers now take a service owner
[04:45] that we would no longer need
[04:45] thumper: I'm just about done with this, shall I finish it off and push it up for reference if nothing else?
[04:46] waigani_: if you like, and we should get input from fwereade
[04:46] waigani_: don't spend too much more on it though
[04:46] understood
[04:46] waigani_: instead look at auditing the user manager functions that we have
[04:47] thumper: okay, where should I start with that?
[04:47] waigani_: look at what functions are implemented,
[04:47] compare CLI, api client, api server
[04:47] and state
[04:47] and look at strings vs. tag usage
[04:47] ah right, got it
[04:48] I know there isn't consistency, but I want to know how inconsistent we are
=== Guest9121 is now known as wallyworld
[04:50] axw_: can you connect to cloud-images.ubuntu.com ?
[04:50] wallyworld: yep
[04:50] sigh, i can't :-(
[04:55] thumper: sorry, just saw this... yes I figured out why that test was passing - the test setup was wrong so it was passing for the wrong reason
[04:56] menn0: ok, in which case you should be good to go
[04:56] menn0: cheers
[05:05] axw_: can you run "juju metadata validate-images" for me to look up a precise image id on ec2, since i can't access cloud-images
[05:05] seems there's a routing issue :0(
[05:05] sure
[05:06] axw_: ah, got connectivity again
[05:06] okey dokey
=== urulama-afk is now known as urulama
[05:21] axw_: it appears there's a problem with trunk - i bootstrap with default-series=precise and machine 0 comes up ok. i deploy a charm, and machine 1 can't start: "no matching tools available"
[05:21] hrm
[05:21] I'll take a look
[05:21] wallyworld: which provider?
[05:21] ok, ta
[05:21] aws
[05:21] and are you doing --upload-tools?
[05:22] yep
[05:22] hm, weird. ok
[05:22] and also --upload-series=precise,trusty
[05:22] that shouldn't do anything anymore
[05:22] i'm running from a utopic client
[05:22] thought so, just did it in case
[05:23] you should get a deprecation warning for --upload-series... you did right?
[05:23] yeah, i did
[05:23] ok. I'll try and repro in a sec
[05:36] wallyworld: what did you try to deploy?
[05:36] ubuntu?
[05:36] mysql
[05:36] you didn't specify series?
[05:37] no
[05:37] k
[05:37] "1":
[05:37] agent-state-info: no matching tools available
[05:37] instance-id: pending
[05:37] series: precise
[05:38] wallyworld: just worked for me... :(
[05:38] wallyworld: can you check cloud-init-output.log on machine-0 for lines saying "Adding tools"
[05:38] ok, i'll try again a bit later and try and reproduce
[05:38] i may have destroyed, i'll check
[05:39] wallyworld: oh I have an idea what it might be
[05:39] ok
[05:39] if you uploaded, then your uploaded tools will have series=utopic.. does our code know about utopic already?
[05:39] actually, probably does...
[05:40] should do, but i wanted precise tools
[05:40] wallyworld: yeah, what happens is the CLI uploads the tools it can build, and the bootstrap machine explodes them into each of the series of the same OS
[05:40] by "the tools it can build" I mean the local series
[05:41] hrm, actually it should be the series of the bootstrap machine not the local machine... will have to check it's doing the right thing
[05:42] checking machine-0, the only tools entry in cloud-init-output is 3b20f9692616c75f4df7326aed49efcfe520cbdeddeb39b8e19a59696e2975f8 /var/lib/juju/tools/1.21-alpha1.1-precise-amd64/tools.tar.gz
[05:42] wallyworld: nothing saying "Adding tools"
[05:43] ?
[05:43] not that i can see
[05:43] ok... can you please cat /var/lib/juju/tools/1.21-alpha1.1-precise-amd64/downloaded-tools.txt
[05:44] {"version":"1.21-alpha1.1-precise-amd64","url":"file:///tmp/juju-tools260863187/tools/releases/juju-1.21-alpha1.1-utopic-amd64.tgz","sha256":"3b20f9692616c75f4df7326aed49efcfe520cbdeddeb39b8e19a59696e2975f8","size":8198295}
[05:45] ah look
[05:45] utopic
[05:45] right, that's a bug
[05:45] thanks
[05:45] yet machine 0 is precise
[05:46] yeah, that URL is wrong and precise doesn't know about utopic, so it doesn't know it's Ubuntu
[05:59] wallyworld: just live testing a fix now, do you want a patch while I write a unit test?
[05:59] axw_: it's ok, i have been able to test what i needed
[05:59] cool
[05:59] mongo syslog is being spammed :-(
[06:00] i've reduced it, but it's still logging regularly about authenticating a user
[06:01] hmm, actually that URL shouldn't make a difference, only the version should. hrrmmm.
[06:01] I'll try faking my series
[06:26] wallyworld: can you please review https://github.com/juju/utils/pull/28
[06:27] * axw_ checks OCR
[06:27] asleeping
[06:27] if you're too busy, I can wait
[06:28] master is not happy with the apt retries though
=== kwmonroe_ is now known as kwmonroe
[06:53] wallyworld: I can't reproduce the issue. I've forced my local series to utopic, still nothing. That URL doesn't matter, I was misremembering what it was used for
[06:54] bootstrapped ec2 with default-series=precise, and deployed mysql with no issue
[07:02] morning
[07:02] dimitern, ping?
[07:04] axw_: hmmmm, ok. i'll try again a bit later
[07:05] tasdomas, hey
[07:05] dimitern, you pinged me yesterday - was afk at that moment
[07:05] wallyworld: CI doesn't look particularly happy either, though.
[07:06] axw_: looks like the upgrade jobs at first glance
[07:06] tasdomas, yes, it was about the port ranges work, we'll be inheriting from you :)
[07:06] dimitern, right - I'm addressing fwereade's comments as we speak
[07:07] tasdomas, can you give me a quick status update?
[07:07] dimitern, fixing up the PR (https://github.com/juju/juju/pull/517)
[07:08] dimitern, it's a large PR, fwereade requested that it be split up into smaller ones, unfortunately I won't be able to do that
[07:08] tasdomas, right, so how much time do you need?
[07:09] dimitern, to finish fixing the PR?
[07:09] tasdomas, I can perhaps take over and finish it if you don't have the time?
[07:10] tasdomas, I heard your team is focusing on other things now
[07:10] dimitern, that would be great
[07:11] dimitern, I'll finish what I am working on at the moment
[07:11] dimitern, do you want to have a hangout to discuss the port ranges work? Or do you want a small write-up on what's been done and what still needs to be done?
[07:11] tasdomas, ok, cool, I'll have a look to remember what's what and how to continue
[07:12] tasdomas, what works better for you?
[07:12] dimitern, ok, ping me if you have any questions
[07:12] dimitern, it doesn't really make a difference for me
[07:12] dimitern, whatever works best for you
[07:13] tasdomas, ok, then I'd rather have the writeup summary, as I'm doing like 3 things now :)
[07:14] dimitern, ok - you'll have it by lunch time (2-3 hours)
[07:14] tasdomas, thanks!
[07:14] dimitern, no, thank you
[07:15] dimitern, also, I've updated the PR https://github.com/juju/juju/pull/667 - when you have a sec, could you take a look?
[07:15] tasdomas, sure, looking
[07:33] tasdomas, LGTM
[07:33] dimitern, thanks - I'll update the error message before landing
[07:34] tasdomas, sweet!
[07:44] wallyworld: I have charms deploying without provider storage :) needs some polishing and more testing before I can propose anything
[07:45] also upgrade steps required this time
[08:01] morning
[08:07] TheMue, morning
[08:09] dimitern: regarding the last comment yesterday: yes, the suite is running twice, once for v0 and once for v1, during the first run the test for a function introduced with v1 is skipped
[08:09] dimitern: this way it's easy to check if v1 doesn't break compatibility to v0
[08:09] TheMue, yeah, I've seen this, but doesn't that seem an awkward way of running the tests?
[08:10] TheMue, how is that better than having 2 separate v0- and v1-only suites?
[08:12] dimitern: I thought about it, but then you 1st need a base test you can embed into the real ones, and then 2nd you have one for v0 and one for v1 with almost the same content, in my case only one additional test. that's lots of redundant code
[08:12] dimitern: because each new version has to ensure that it doesn't break existing functionality
[08:17] TheMue, ok, that sounds good to me
[08:18] dimitern: yeah, spent some time yesterday how to organize it best and to see, where the lowest dependencies exist
[08:19] TheMue, cheers
[08:20] jam: would also like to discuss it with you, mast of API versioning :D
[08:23] s/mast/master/
[08:30] TheMue, well, new versions are surely there *because* we want to break existing functionality -- when things don't change, yes, you get a duplicated test; but when they do I think it will be very hard to adapt that style of test
[08:30] TheMue, I understand where you're coming from
[08:33] TheMue, might it make most sense to have per-method suites? so then you can run the same per-method suite against multiple versions, hopefully minimising duplication without falling into a situation where adding a new version involves adding a new layer of special-casing to an over-general full-facade suite?
[08:33] fwereade: yes, I simply want to ensure that all functions of a former version work like before while those which are added or changed surely behave different
[08:34] fwereade: could you please expand a bit?
[08:34] fwereade: did you take a look into my proposal?
[08:34] TheMue, so, the concern is that having a single full suite with one special-case for one new method is defining the direction we'll take in the future
[08:34] simply to synchronize better
[08:34] TheMue, next method will be another special case
[08:35] TheMue, and then next version there's a change in functionality for some method
[08:35] TheMue, and whoever implements it will... add another special case
[08:36] TheMue, and *very soon indeed* it will become straight-up impossible to understand what's happening in this single godlike test suite that actually tests slightly different things for all the api versions
[08:36] hazmat: I'm assuming you succeeded in building tokumx, but I've been struggling a bit. Did you grab their source control branches? What version? And did you use cmake or scons, as it looks like they want to switch to cmake (mongo itself uses scons), but I keep running into errors trying to build 1.5.0
[08:36] TheMue, I haven't seen the code we're talking about, though, I'm just going by what you said above
[08:38] fwereade: please take a look here: https://github.com/TheMue/juju/blob/capability-detection-for-networker/apiserver/machine/machiner_test.go
[08:39] fwereade: and I would like to see an outline of a per-method suite. this term sadly doesn't tell me a lot. ;)
[08:39] TheMue: a "Suite" object for each method, rather than one "Suite" for each Facade
[08:39] TheMue, you have a suite that tests all the methods, but special-cases some of them
[08:39] TheMue, I'm suggesting having lots of suites, defining our expectations of the behaviour of a single method each
[08:40] TheMue, and registering explicitly only the tests we actually want to run
[08:40] jam: ah, thanks
[08:40] TheMue, rather than mixing the what-to-test in with the how-to-test
[08:44] * TheMue tries to imagine what the code base will look like for a number of methods that are robust over time.
[08:45] so a v0 test would be embedded into a v1 test and so on, and only when it breaks, e.g. at v7, a new implementation would be made?
[08:46] my goal is a good compromise of test reuse and flexibility for changes over time.
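Editorial sketch of the "per-method suite" idea described above: one check per API method, explicitly registered against the facade versions expected to share that behaviour, instead of one full-facade suite with version switches inside it. Everything here is hypothetical and self-contained (the `Life` method, the `machinerV0`/`machinerV1` fakes, the `registerAll` helper); the real tests would use the project's test framework and real facades.

```go
package main

import "fmt"

// Narrow interfaces: each per-method check only sees the one method
// it is about. Life and GetMachines are illustrative method names.
type lifeGetter interface{ Life(tag string) (string, error) }
type machinesGetter interface{ GetMachines(tag string) (string, error) }

// machinerV0 and machinerV1 are hypothetical fake facade versions.
// GetMachines only exists from v1 onwards.
type machinerV0 struct{}

func (machinerV0) Life(tag string) (string, error) { return "alive", nil }

type machinerV1 struct{}

func (machinerV1) Life(tag string) (string, error)        { return "alive", nil }
func (machinerV1) GetMachines(tag string) (string, error) { return tag, nil }

// checkLife states what Life must do, for any version that has it.
func checkLife(api lifeGetter) error {
	life, err := api.Life("machine-0")
	if err != nil {
		return err
	}
	if life != "alive" {
		return fmt.Errorf("Life: got %q, want %q", life, "alive")
	}
	return nil
}

// checkGetMachines states what GetMachines must do.
func checkGetMachines(api machinesGetter) error {
	got, err := api.GetMachines("machine-0")
	if err != nil {
		return err
	}
	if got != "machine-0" {
		return fmt.Errorf("GetMachines: got %q, want %q", got, "machine-0")
	}
	return nil
}

// registerAll is the explicit registration: which check runs against
// which versions. A new version is added by listing it next to every
// method it still supports; a new method gets its own check.
func registerAll() []error {
	var failures []error
	for _, api := range []lifeGetter{machinerV0{}, machinerV1{}} {
		if err := checkLife(api); err != nil {
			failures = append(failures, err)
		}
	}
	for _, api := range []machinesGetter{machinerV1{}} {
		if err := checkGetMachines(api); err != nil {
			failures = append(failures, err)
		}
	}
	return failures
}
```

The checks contain no version numbers at all; "what to test" lives in the check functions and "which versions" lives only in the registration, which is the separation asked for in the discussion.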
[08:46] TheMue, I'd rather avoid embedding anything at all anywhere really
[08:47] TheMue, I'm imagining there'd be a TestGetMachines suite, which gets set up to run its tests against v1 of the API
[08:47] so let's say we have 5 suites for a v0, I add a new method, now have e.g. 6 suites for v1, and then in v2 I add two more and change one ...
[08:48] TheMue, and all the other suites test against both v1 and v0
[08:48] fwereade: no embedding, code duplication instead?
[08:48] TheMue, where did I suggest we duplicate code?
[08:48] fwereade: that's why I ask
[08:49] TheMue, you write one suite, that is capable of testing that some method implementation acts as expected
[08:49] fwereade: simply to get better aware of your thoughts ;)
[08:49] TheMue, you then feed all the facade versions that you expect to have that behaviour into that suite
[08:49] TheMue, so adding a new version is a matter of adding the new version to the suite for each method it still uses
[08:50] TheMue, new method? new suite, targeting just that facade
[08:50] fwereade: ok, that's what I'm doing (when I get your word right), but for the whole suite with more than one method to test
[08:50] TheMue, yes
[08:50] aaaaaaah
[08:50] TheMue, I just want more granularity
[08:51] fwereade: instead of using the skipping or evel switches based on the version number inside the tests
[08:51] TheMue, I think (particularly for the bigger facades) full-facade suites will become unmanageable really alarmingly fast
[08:51] fwereade: sounds cool
[08:53] TheMue, on a separate note, what does Machiner need GetMachines for?
[08:55] TheMue, ah, whether something's on a manual provider? what do we use that for?
[08:55] fwereade: IIRC for stuff that was on the Agent API but doesn't do anything for Unit agents, and thus is a Machiner responsibility
[08:55] fwereade: the whole branch is about the need for a safe networker. and here we need the information if a machine is provisioned manually. first approach has been to retrieve the information extra, as it isn't needed so often.
[08:56] jam, TheMue: then that is *definitely* not a machiner responsibility -- the machiner doesn't start the networker
[08:56] fwereade: but review and discussion feedback has been to not make an extra call, so I changed the way we retrieve a machine info on the client side of the API
[08:58] TheMue, jam: this feels like it should be a job, as communicated by the agent api, rather than tacking it onto an unrelated purpose-specific facade
[08:58] TheMue, jam: am I confused about something?
[08:58] fwereade: so, previously there was an API on Agent that was giving you the Life of the entity you wanted, and a bunch of other Machine related stuff that didn't make sense for Unit agents.
[08:58] fwereade: what is the task of the machiner API?
[08:58] fwereade: GetEntity IIRC, looking
[08:59] fwereade: naive, by taking the term "Machiner" I would expect machine related API calls, like retrieving information about a machine
[08:59] TheMue, set the machine to dead once it's marked as dying, and shut down
[08:59] TheMue, it also sends network addresses once on startup which is a bit yucky
[09:00] TheMue, the facades are all worker-specific
[09:00] TheMue, they should be exactly what's needed for a remote worker to fulfil its (ideally *single*) responsibility
[09:02] fwereade: here's my problem from a maintenance perspective. wanting to do something related to machines it always pulls me to the term "Machine" or "Machiner", but never to something called "Agent"
[09:02] fwereade: so today we have AgentGetEntitiesResult which has 1 field that is actually shared, and then 2 fields that aren't meaningful for Unit agents, we would have been adding a 3rd. It felt better to split that out for Machine-Agent specific responsibilities.
[09:02] I see your point that Machiner is the worker, not the Machine-Agent api
[09:03] but do we have a Facade for just machine agents (vs all agents in general), do we want one? Is it just better to pull it out of Agent.GetEntities and make it something Agent.GetMachineDetails sort of thing?
[09:03] fwereade: ^^
[09:03] jam, TheMue: IMO the separate existence of unit agents is the anomaly -- making the agent api more machine-agenty doesn't seem to me to be a particularly major issue, because it echoes where we want to go anyway
[09:03] jam, fwereade: so maybe there's a need for two facades: "Machiner"/"MachineWorker" and "Machine"
[09:03] TheMue, I don't think so
[09:03] TheMue, what's the worker that uses "Machine"
[09:03] ?
[09:04] fwereade: are only workers using the API?
[09:04] TheMue, and agents; and external clients; but essentially, yes
[09:04] TheMue, and an agent is almost a special case of a worker
[09:05] TheMue, it's the "worker" that starts other workers
[09:05] TheMue, and what we have hitherto done is (1) use the Jobs to figure out what to start
[09:05] TheMue, or (2) pull hacky shite out of the agent config instead
[09:06] TheMue, the latter is not good
[09:06] fwereade: so there is currently a bunch of code in api/agent/state.go that claims to be talking about an "Entity" but has stuff like "Entity.Jobs()" which returns []params.MachineJobs
[09:06] which doesn't fit very well on a generic "Entity" object.
[09:06] jam, agreed, that's not nice
[09:07] fwereade: I think the sentiment was let's pull it into something for Machine agents, and it got put over on Machiner. I think I'm in agreement that it shouldn't go there, but where *should* it go
[09:07] fwereade: ok, maybe here's my mistake, as to me the API is for more than just the worker. it's an API. and if I want to talk about machines I need somewhere to talk to.
[09:07] jam, IMO making the agent code more machine-agenty is a far lesser sin
[09:07] TheMue: the Facades design is about 1 Facade per worker
[09:07] so it isn't talking about Machines
[09:08] it is more that *if* your Worker needs to know about Machines then your corresponding Facade will have a Machines API call
[09:08] TheMue: eg, we won't have "juju" the CLI client talking to the Machine facade.
[09:08] TheMue, if there's functionality that two separate facades need, you implement it separately from both, and embed (or passthrough if there's a different method name)
[09:09] TheMue, the individual facades control auth, and if their functionality is unique it's generally in there too
[09:09] fwereade: ic, thanks
[09:09] TheMue, shared implementations are in apiserver/common, and need a GetAuthFunc (supplied by the facade) to determine how they can be called
[09:12] * TheMue is astonished how this turns. looks like an almost pushed PR needs larger changes again. already had an LGTM and the change of the test code only has been to check how to better test the API :D
[09:13] fwereade: so in my case you would place that GetMachine() at the agent API?
[09:21] TheMue, sorry, back: I'm wondering why we are not expressing an agent's responsibilities with *jobs*
[09:21] TheMue, that's what they're for after all
[09:22] TheMue, this feels like just another case of exposing inappropriate information to the agents
[09:23] fwereade: ok, fine for me, but my use case is: I need an information about a machine
[09:23] TheMue, why is it ok for the agent to know what sort of provider it's running on?
[09:23] fwereade: I need to know if a machine has been provisioned manually, because then always a safe networker is needed
[09:24] fwereade: we don't talk about the provider, but the machine
=== rvba` is now known as rvba
[09:24] fwereade: e.g. a manually provisioned machine on ec2
[09:24] TheMue, I thought you were asking about its provider type
[09:24] fwereade: or openstack
[09:24] TheMue, -> you know about providers in the machine agent
[09:24] fwereade: sorry, expressed myself badly, no
[09:24] TheMue, -> you are breaking layering
[09:25] TheMue, surely the agent should know *nothing* about why or how it was provisioned
[09:25] fwereade: *sigh*
[09:26] TheMue, I'm sorry to architect-tantrum at you
[09:26] TheMue, but
[09:26] TheMue, we have jobs, which we're meant to use
[09:26] fwereade: *rofl* no problem
[09:26] TheMue, we have dirty hacks that get around jobs, that we kinda had to do because we "designed" the system without an api layer, and were hamstrung by compatibility
[09:27] fwereade: so, dear architect, what's your idea for determining if a safe or "non-safe" networker has to be used?
[09:27] TheMue, we introduce new jobs, and use those to determine what workers to run
[09:28] TheMue, the bad-but-once-acceptable way to do it is the explicit checking based on provider type and/or machine id (that we have still not managed to excise from jujud)
[09:29] TheMue, the right way to do it now is to get rid of *all* those special cases, and use the fact that we can now change the api meaningfully to express the set of responsibilities that a machine agent can have, or not have
[09:29] fwereade: the idea has been to let the providers decide by implementing it as an environment capability
[09:30] TheMue, sure, but that happens somewhere in the api server, and the machine agent shouldn't know or care
[09:31] fwereade: otherwise if this is a kind of job decided by the API server, then for each new provider implementation the server side API possibly has to be changed too. do I understand you right?
[09:31] davecheney: thanks for the review
[09:32] fwereade: because this also is a breaking for me, the idea of clean provider interfaces so that provider implementations can be plugged in and exchanged
[09:32] TheMue, the two reasonable approaches I can see are (1) new job, that the MA uses to start appropriate workers; or (2) putting it in the Networker facade, such that the client side knows whether to run "safely" or not
[09:32] TheMue, would you expand a little on what you expect to change there?
[09:32] wwitzel3: np
[09:33] TheMue, isn't it still just a matter of the provider exposing whether you can safely mess with network interfaces on its machines?
[09:33] TheMue, but we use that to figure out the machine jobs
[09:33] TheMue, and we do that in a component that's allowed to know about providers
[09:34] TheMue, then we express it to the agents in a form that's easy for the agents to consume
[09:34] TheMue, which may or may not match the underlying internal data model
[09:34] fwereade: hmm, maybe I lost you here
[09:35] TheMue, would you explain what change to the provider interfaces you're worried about?
[09:36] fwereade: nothing on the provider interfaces, only that the Agent API has to know about the existing providers and what they need to decide whether they need a safe or non-safe networker (I hate this term ;) )
[09:37] TheMue: so I think for what *we're* trying to accomplish, having a JobManageNetworks would be perfectly appropriate for deciding what kind of Networker we want to run
[09:37] and whether that Job gets added can be based on whether the machine was manually provisioned.
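Editorial sketch of the job-based approach proposed above: the side that knows how a machine was provisioned decides its jobs once, and the agent picks its networker purely from those jobs. Only the name `JobManageNetworks` comes from the discussion; `MachineJob`, `JobHostUnits`, `machineInfo`, `jobsFor` and `networkerMode` are hypothetical names for this self-contained illustration, not juju's actual types.

```go
package main

// MachineJob names one responsibility a machine agent may be given.
type MachineJob string

const (
	// JobHostUnits is included only so the job list is never empty
	// in this sketch.
	JobHostUnits MachineJob = "JobHostUnits"
	// JobManageNetworks is the job suggested in the discussion: the
	// agent may manage the machine's network configuration.
	JobManageNetworks MachineJob = "JobManageNetworks"
)

// machineInfo holds facts only the provisioning side should know.
type machineInfo struct {
	ManuallyProvisioned bool
}

// jobsFor runs where provider and provisioning details are allowed
// to be known. Manually provisioned machines do not get
// JobManageNetworks, so their network setup is left alone.
func jobsFor(m machineInfo) []MachineJob {
	jobs := []MachineJob{JobHostUnits}
	if !m.ManuallyProvisioned {
		jobs = append(jobs, JobManageNetworks)
	}
	return jobs
}

// networkerMode runs in the agent. It sees only the jobs, never the
// provider type or how the machine was provisioned.
func networkerMode(jobs []MachineJob) string {
	for _, j := range jobs {
		if j == JobManageNetworks {
			return "full"
		}
	}
	return "safe"
}
```

The question "should the networker run safely here?" is still asked, but it is asked once on the side that has the facts and recorded as a job, so the agent contains no special cases.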
[09:37] TheMue, maybe it's the agent api, maybe it's done at the state level [09:37] TheMue: so what we care about is whether we should be managing /etc/network/interfaces [09:37] TheMue, all I care about at this point is that we not leak that information onto the agents themselves [09:37] jam: to decide it we need to now which provier, which machine (bootstrap or not) and if it is manually provisioned [09:38] TheMue: but that can be done at provisioning time, rather than when the agent is starting up [09:38] TheMue, but you *cannot know those things in the agent* if you care about coupling and layering and the consequences of ignoring those considerations [09:38] TheMue: so we remove all of the special case inside the code, and just have it told by the thing that actually knew that information originally. [09:39] * fwereade brb, don't stop talking [09:40] TheMue: at least, AIUI, I also think we should bring dimitern in on this conversation. [09:40] fwereade: if the API allows to retrieve the needed information (in a generic way, GetMachines is also used instead of the old way to retrieve information about a machine on the client side) we can provide all needed information [09:40] But the idea is that when you want to ask the question "should X run in Y circumstance" that question can still be asked, we just need to ask it earlier and record it as whether or not an agent will be assigned a Job [09:41] i'm here [09:41] * dimitern reads a lot of scrollback [09:41] dimitern: followed this interesting discussion? ;) [09:42] TheMue: so for example, ContainerType is also a bad API [09:42] instead, it should be a JobRunLXCProvisioner [09:42] TheMue, nope I'm afraid, I'm trying to write a manual procedure for making an addressable container in ec2 and maas [09:42] or something along those lines. 
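The layering argued for above can be sketched roughly as follows: the component that is allowed to know how a machine was provisioned records the answer as a job, and the agent only inspects its job list. This is a minimal sketch, not juju's actual code; `MachineJob`, `jobsFor` and `useSafeNetworker` are hypothetical names, and `JobManageNetworking` is borrowed from the discussion rather than from an existing API.

```go
package main

import "fmt"

// MachineJob is a hypothetical stand-in for the job values sent to an agent.
type MachineJob string

const (
	JobHostUnits        MachineJob = "JobHostUnits"
	JobManageNetworking MachineJob = "JobManageNetworking" // name taken from the discussion
)

// jobsFor runs on the server side at provisioning time. It is the only
// place in this sketch that knows whether a machine was manually provisioned.
func jobsFor(manuallyProvisioned bool) []MachineJob {
	jobs := []MachineJob{JobHostUnits}
	if !manuallyProvisioned {
		// Only machines we provisioned ourselves get their
		// network configuration managed.
		jobs = append(jobs, JobManageNetworking)
	}
	return jobs
}

// useSafeNetworker runs in the agent. It knows nothing about providers or
// provisioning; it only checks which jobs it has been given.
func useSafeNetworker(jobs []MachineJob) bool {
	for _, j := range jobs {
		if j == JobManageNetworking {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("manual machine, safe networker:", useSafeNetworker(jobsFor(true)))
	fmt.Println("provisioned machine, safe networker:", useSafeNetworker(jobsFor(false)))
}
```

The question "should X run in Y circumstance" is still asked, but it is asked once, earlier, by something that already holds the answer.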
[09:42] jam: there maybe already several ones, yes [09:43] jam: maybe my thoughts of an API, what I understand as an API, are a bit naive [09:43] TheMue: at least as I am "channelling my inner fwereade" the idea is that we can look at the questions we're asking, and figure out if they are appropriate or whether someone else should just be giving the answer. [09:43] TheMue: it isn't so much about API vs not API [09:43] but what questions should be asked and who is responsible for knowing the answer. [09:44] fwereade, TheMue, jam: my concerns align pretty well with " TheMue, and *very soon indeed* it will become straight-up impossible to understand what's happening..." [09:44] * jam has to go take the dog out before it gets messy, brb [09:44] jam: this discussion has been in the beginning, the whole change has been about adding an environment capability implemented by the providers to decide, which networker to use [09:46] dimitern, I've shared a doc with you [09:46] dimitern: we're not talking about testing anymore, more about responsibilities [09:47] dimitern, and pushed my latest changes to the port ranges PR [09:48] dimitern: what information are retrieved from where so that the thing currently implemented as environment capability can decide which networker to start [09:48] axw_: just got back from soccer; say message; niiiice [09:48] saw [09:49] dimitern: or if the networker can decide it internally by communicating with the Agent API which then decides based on provider, machine id, and manual provisioning which one to take [09:49] dimitern: so (1) passing information to client/worker and decide there or (2) passing information to according API and decide there? [09:50] TheMue, dimitern, jam: cmd/jujud/machine.go:507 [09:50] wallyworld: I'm rewinding a bit to improve things, but it shouldn't be too far off [09:50] ok [09:50] // TODO(axw) 2013-09-24 bug #1229507 [09:50] // Make another job to enable storage. [09:50] // There's nothing special about this. 
[09:50] Bug #1229507: create a machine job for machines/environments that provide local storage [09:51] * axw_ slinks into the shadows [09:51] TheMue, dimitern, jam: the other place we do it is in deciding whether to start the authentication worker [09:52] TheMue, dimitern, jam: I *think* those are the existing dependencies in jujud [09:52] axw_, you couldn't do it then, we still had to worry about sending jobs that agents didn't understand [09:52] ah yes [09:53] axw_, I think we're fine now, because we can implement a new Jobs method that can send more values [09:53] axw_, and be sure that nobody's going to call it without being prepared [09:53] tasdomas, thanks, I got it, will look a bit later [09:54] * dimitern is still catching up to the current discussion.. [09:55] dimitern, short version: the agents must not know about providers! (oh, and we shouldn't jam agent methods onto the machiner) [09:55] fwereade, I'm +100 for this [09:56] fwereade: inside the PR the agents DON'T know about the provider [09:56] fwereade: they simply delegate the decision to the current provider by using an environment capability [09:56] fwereade, I mean agents not knowing about providers, but capabilities implemented by providers and checked by the agent? [09:57] dimitern: yes, as we discussed, this is how it works inside the PR [09:57] dimitern: but I don't need to tell you, you know it :) [09:58] TheMue, you have added an IsManual field to the api [09:58] TheMue, that is *explicit* information about the provider type, exposed to the agent [09:58] fwereade: please, no, not the provider [09:58] TheMue, the agent now needs to care about what it means for something to be a manual provider [09:58] fwereade: it's about whether it is provisioned manually, even in ec2, openstack, azure ...
[09:59] fwereade, not exactly, IsManual is about a machine being manually provisioned or not [09:59] fwereade: it's not about the manual provider, definitely not [09:59] TheMue, dimitern: what provisions manual machines? [09:59] TheMue, well, it kinda is [09:59] a manual machine can technically be in a non-manual provider environment [09:59] fwereade, but it's the property of a machine, isn't it? [10:00] axw_, right, but that machine's provider is not, say, ec2 [10:00] it is at the moment, because we don't have per-machine providers [10:00] we have a per-environment one [10:00] axw_, dimitern, TheMue: we weren't able to explicitly tag machines with their provider, it's true [10:01] axw_, dimitern, TheMue: remind me, how do we prevent the provisioner trying to do things with those machines? [10:01] jam, i built it in a trusty cloud container [10:01] fwereade: it doesn't know about those instances, so it leaves them alone [10:01] fwereade, they are already provisioned perhaps? [10:01] i.e. have instance id [10:01] axw_, dimitern, TheMue: hmm, makes sense -- and when they die? [10:01] jam, i can give you my binary, i think i have instructions somewhere as well extracted from my bash history [10:02] fwereade: maybe again a leak of information on my side. what is the intention of Machine.IsManual in state? [10:02] jam, http://paste.ubuntu.com/8298506/ [10:02] jam, re build recipe [10:02] fwereade: I don't understand the question [10:03] fwereade, they are destroyed as usual, but you mean how the provisioner doesn't reap them? [10:03] the provisioner doesn't destroy those instances, because they're not things under its management [10:03] again because it doesn't know about the instance IDs [10:04] axw_, ah ok, we ask the provider for instances with X ids, we get back errpartialinstances and ignore the missing ones?
[10:05] fwereade: I'm afraid my memory on specifics is a bit hazy [10:05] jam, and my binary (w ssl) is @ https://www.dropbox.com/s/dbcrgahxxyt8buv/tokumx-1.5.0-linux-x86_64.tgz?dl=0 [10:05] fwereade: but that sounds about right. [10:05] fwereade: TBH, I think a job is just as applicable to manually provisioned machines as it is to manual provider type [10:05] jam, i'd give it a go with the compile again using the build recipe (lxc container) [10:06] or clean env [10:06] I haven't looked at the PR in question though [10:07] hazmat: I thought I was using exactly the same thing, but I'm getting: http://paste.ubuntu.com/8298512/ [10:08] jam, sorry can't help more than that atm, in the middle of a sprint and pair programming [10:08] fwereade, so to summarize my point, having IsManual or the machiner facade is a *good* thing I think, it's not provider-specific; this allows us to define the capability across providers; my only contention was with the way this is tested wrt api versions [10:08] fwereade: the returned IsManual talks about juju/state/machine.go:270. will this function only return true when we use a manual provisioner? [10:10] hazmat: np [10:10] dimitern: yeah, here fwereade had a good idea for test that will go in next (when we solved the current topic). per-method suites running for the respective versions. so no huge suite and no skipping or branching inside [10:10] I'll try it again [10:10] dimitern: I like it [10:10] hazmat: thanks for the pointers [10:11] TheMue, ah, good - reading scrollback again to get context [10:13] dimitern: yes, it's a good approach, especially when API change more and more over time [10:14] dimitern: so I certainly think from a "what type of networker do we run" it fits better on a Jobs basis. 
[10:15] jam, thinking about it now, yes it indeed does work better as a job, but there's one caveat [10:16] jam, it's not just about the networker, that's why I keep forgetting to ask TheMue to rename the capability to "RequiresSafeNetworking" perhaps, to point out that it applies to all networking (incl. what we do in cloud-init on maas) [10:16] jam: so not the the-provider-knows-it-best-approch, but the a-central-instance-called-api-knows-it-best-approach? [10:17] jam: here I dislike the idea, that the logic on the server-side has to know about the provider [10:18] jam: in (1) we retrieve information from the server-side and let the provider (capabilty) decide, in (2) we send information to the server-side and make the decision there [10:19] jam: in both solutions information of one side are passed to the other side [10:19] Hmmm... it's looking increasingly like we're stuffed with ipv6 support without some help from mgo [10:19] TheMue, the api *definitely* knows what provider is used btw [10:19] jam: and as an old friend of bottom-up I like it more to pass information from the server to the client then vice-versa [10:20] dimitern: yes, but is this good? [10:20] voidspace, no luck finding a workaround for ipv6 format to pass as arg? 
[10:20] TheMue: my point is that at the time you do provisioning you determine whether it is safe to control networking on that machine, and then record that information as a Job [10:20] TheMue, good or bad, it's unavoidable [10:20] dimitern: mgo keeps the cluster addresses from the addresses we pass in [10:21] dimitern: hehe, maybe I'm just to old-schooled bottom-up [10:21] dimitern: and mgo requires ipv6 addresses in one format (the correct one) and mongo requires another format [10:21] dimitern: so neither works, and as far as I can tell so far I can't work around it at the level above [10:21] dimitern: still digging into exactly how mgo gets the cluster addresses [10:21] but it's not simple code [10:21] voidspace, so does it seem like a bug in mgo? [10:22] jam: hmm, decision on client, storage as Job, retrieval when needed? did I get you right? that sounds like a clean approach to me [10:22] dimitern: well probably a bug in Mongo that mgo needs to work around [10:22] dimitern: not *really* [10:22] dimitern: it's a bug in mongo that mgo makes it impossible to work around :-) [10:23] TheMue: well, it would still be mostly determined inside the API Server (I believe), as that's the thing where you are saying "add this manually provisioned machine to your list of machines to control" [10:23] dimitern: mgo requires the *right format* (because it just passes addresses through to net.Dial functions) [10:23] dimitern: but mongo can't work with them [10:23] I guess no-one is using mgo with ipv6 [10:23] voidspace: so we write our own "dial" functions, so we can patch them as needed, can't we? [10:23] voidspace: I haven't seen this particular bug that you're describing, is it in the traceback? [10:24] jam: mgo code calls net.DialTimeout - from the Go standard librarty [10:24] jam: I worked out why the ipv6 test fails sometimes [10:24] *library [10:24] jam: so a "different Dial function" [10:24] jam: can we patch the Go standard library? 
[10:24] jam: if calling Set causes a primary renegotiation (doesn't seem to happen every time) [10:24] jam: then mgo calls syncServers [10:25] jam: this uses net.DialTimeout to check it can reach servers [10:25] voidspace: line 114 of mongo/open.go [10:25] jam: it is determined indirectly by adding more information than today, but the decision itself is done on the client side based on this information [10:25] we define our "what do you use to call mongo" [10:25] jam: this is a call to net.DialTimeout inside mgo [10:25] and is different from the Dial functions we use [10:25] jam: mgo is using it to check that cluster members are up [10:25] voidspace: k, so arguably the mgo bug is (a) that it isn't using our dial function [10:26] because we're using TLS anyway, so Dial would really fail [10:26] I guess you could connect to the port [10:26] it doesn't fail [10:26] yeah, it's just a connect [10:26] but you couldn't talk to MongoD there [10:26] net.DialTimeout requires ipv6 addresses to be in the form [::1]:port [10:26] with square brackets [10:26] and if you don't have them the dial fails with "too many colons in address" [10:27] but mgo discards the actual error and just reports "no reachable servers" [10:27] jam: however due to the mongo bug we discovered a while ago, we can't start an ipv6 replicaset unless we use the address form *without* square brackets [10:28] voidspace: so this is line 399 of mgo.v2/cluster.go ? [10:28] "dial with UDP and a 10s timeout "? [10:28] jam: yep [10:28] jam: I added an extra log line there and you see the error...
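The address-format problem described above can be reproduced without dialing anything. The sketch below uses `net.SplitHostPort` from the Go standard library, which applies the same host:port syntax rules; that it is exactly the check `net.DialTimeout` performs internally is an assumption here. `checkDialAddr` and `bracketed` are hypothetical helper names, and port 37017 is just the mongo port mentioned earlier in the log.

```go
package main

import (
	"fmt"
	"net"
)

// checkDialAddr reports whether addr parses as a host:port pair the way
// the Go net package expects. IPv6 hosts must be wrapped in brackets.
func checkDialAddr(addr string) error {
	_, _, err := net.SplitHostPort(addr)
	return err
}

// bracketed builds a dialable address; net.JoinHostPort adds the square
// brackets whenever the host contains a colon.
func bracketed(host, port string) string {
	return net.JoinHostPort(host, port)
}

func main() {
	for _, addr := range []string{"::1:37017", "[::1]:37017", "localhost:37017"} {
		if err := checkDialAddr(addr); err != nil {
			fmt.Printf("%-16s rejected: %v\n", addr, err)
			continue
		}
		fmt.Printf("%-16s ok\n", addr)
	}
	fmt.Println(bracketed("::1", "37017"))
}
```

Surfacing the parse error like this, instead of discarding it, is what makes the failure distinguishable from a genuinely unreachable server.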
[10:29] voidspace: so I think the fix is that we tweak the "getKnownAddrs" code to handle the mongo ipv6 badness [10:29] so change cluster.getKnownAddrs [10:29] so that if it sees an address like "fe08::1:12345" [10:29] jam: or even just resolveAddr [10:29] it knows to call that "[fe08::1]:12345" [10:29] voidspace: I'd rather have real addresses in memory as much as possible [10:29] jam: we have to be careful that "fixed addresses" don't leak back to mongo [10:30] and only translate at the exact "talking to mongo" boundary [10:30] because mongo can't parse them in that format [10:30] voidspace: sure, but I'd still rather have a very clear "this is where we're translating for mongo" and it should live in mgo, and we should get rid of our hack-arounds in juju [10:31] jam: but it's largely serialised configs we're sending [10:31] voidspace: certainly you would agree that "mgo" should be where it knows the details of how Mongo works [10:31] jam: so "fixing at the boundary" means de-serialising and re-serialising [10:31] voidspace: all of the replicaset code was intended to live inside mgo [10:31] voidspace: natefinch implemented in Juju as a prototype to see it working [10:31] with the intent that it migrates into mgo proper [10:31] jam: we serialise replicaset configs and just call session.Run w [10:31] once the API seemed to be appropriate and working [10:32] voidspace: so for some amount we can have it in "replicaset" as that is "logically" mgo code [10:32] jam: yes, but mgo passes serialised data straight through [10:32] our replicaset code can strip out the square brackets from member addresses - and add them back in [10:32] voidspace: so if my last lines weren't clear [10:33] "it should be in mgo, but 'replicaset' can be treated as mgo code" [10:34] but that isn't sufficient because getKnownAddrs needs to change too [10:34] voidspace: this is one of those cases where if we had separated out the "struct for serialization" from the "struct in memory that you use to 
get stuff done" it would be clearer. [10:34] right, but then mgo would need specific code for every possible mongo command [10:34] voidspace: so userSeeds, dynaSeeds, servers.Slice should likely all already have real [dead:beef::1] addresses. [10:35] by "should", you mean "need fixing"? [10:35] voidspace: as in "logically should be done", and probably needs a patch, yes. [10:36] voidspace: at least, if I was doing the code, I would want our in-memory representation to hold "correct" values, and translate at the boundary [10:36] like you do for UTF8 / Unicode / byte strings / user encodings. [10:36] so long as there's no way for an "unfixed" address to leak [10:37] voidspace: it will, but you can treat that as a bug [10:37] and as mgo is a low level driver that basically allows us to send whatever to mgo, it's very hard to guarantee that [10:37] just like user encodings often leak all over the place [10:37] voidspace: so as you can always call session.Run sure, there are cases where the user has to do the work, but mgo should be the abstraction over 90% of that. [10:38] well, yes - and we can try and patch our code everywhere we find holes and fix all the bugs as we find them [10:38] or we have one function to do the fix and we speak mongo native addresses everywhere else [10:38] voidspace: I don't think we want to think in terms of Mongo bad-ipv6 addresses [10:38] as then *those* leak in our code [10:38] as they have already done here [10:38] and we know those are bad formats [10:38] I'd rather have good addresses leak [10:38] than bad ones [10:39] heh [10:39] well, I don't disagree [10:39] voidspace: hence why you try to make what you keep around "correct" as much as possible, because at best it exposes bugs in other people's stuff when you accidentally hand them the right thing. 
[10:39] just that a general fix *really* means a layer over session.Run and deserialising and checking all commands [10:39] voidspace: I don't think we have to fix session.Run [10:39] you don't really fix "exec.Command" [10:39] and we *still* need to fix mgo as well *anyway* [10:40] you have to layer over the top of it [10:40] fixing the replicaset functions to translate at the boundary is easy enough [10:40] so I'll go down this path [10:41] voidspace: so my grep for "\.Run(" only points to state.Presence [10:41] and replicaset [10:41] and *replicaset* is meant to be in mgo eventually, so it is allowed and must be made correct. [10:42] and state.Presence is doing something that mgo actually added support for [10:42] well, maybe it didn't expose it [10:42] There is a "cluster.isMaster" function [10:42] that calls session.Run("ismaster") which is what we are duplicating in our code. [10:43] voidspace: so again, think of the "replicaset/" directory as though it should live in mgo.v2 [10:43] jam: sure, that's not the issue [10:43] and I think you can see the layering that I'm proposing. [10:43] jam: we *know* that fixing resolveAddr solves the immediate problem without risk of leaking "wrong" addresses [10:43] calling session.Run from user code means 'mgo' isn't doing its job [10:43] voidspace: maybe, but I think it is still the wrong fix [10:43] jam: the fix you're suggesting is a lot more work *and* a much higher risk of introducing tricky bugs [10:44] voidspace: it means we maybe sort of sometimes think in terms of almost IPv6 addresses in memory.
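A rough sketch of the "translate at the boundary" idea: keep bracketed addresses in memory and convert only when talking to mongo. The two helpers below, `fromMongoAddr` and `toMongoAddr`, are hypothetical names and not part of mgo or juju. The sketch assumes the last colon always separates the port, which holds for the `host:port` strings discussed here but not for bare IPv6 addresses without a port.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// fromMongoAddr converts an address as mongo reports it ("fe08::1:12345")
// into the bracketed form the Go net package expects ("[fe08::1]:12345").
// Addresses that are already bracketed, or that contain at most one colon
// (IPv4 or hostname), are returned unchanged.
func fromMongoAddr(addr string) string {
	if strings.HasPrefix(addr, "[") || strings.Count(addr, ":") < 2 {
		return addr
	}
	i := strings.LastIndex(addr, ":")
	return net.JoinHostPort(addr[:i], addr[i+1:])
}

// toMongoAddr is the inverse: it strips the brackets from an IPv6
// host:port pair so the address can be handed to mongo.
func toMongoAddr(addr string) string {
	host, port, err := net.SplitHostPort(addr)
	if err != nil || !strings.Contains(host, ":") {
		// Not a bracketed IPv6 address; leave it alone.
		return addr
	}
	return host + ":" + port
}

func main() {
	good := fromMongoAddr("fe08::1:12345")
	fmt.Println(good)
	fmt.Println(toMongoAddr(good))
	fmt.Println(fromMongoAddr("localhost:12345"))
}
```

Keeping both directions in one small pair of functions gives a single, clearly named place where the mongo format exists, which is the point both sides of the argument agree on.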
[10:44] that we could be playing whack-a-mole with in production for our customers [10:44] and requiring new releases of mgo to solve, so out of the teams hands for actually delivering a fix [10:45] voidspace: resolveAddr is a bugfix to mgo [10:45] I think that's an invalid argument [10:45] jam: we do one bugfix instead of n [10:45] voidspace: I don't tihnk we have N [10:45] where n is potentially unbounded :-) [10:45] we know that today nobody is using IPv6 with mgo and mongo [10:45] because it doesn't work [10:45] right [10:46] voidspace: I really think you're overstating it [10:46] having correct addresses in memory *is the fix* [10:46] jam, voidspace: hangout? [10:46] jam: I guess in terms of encoding, you're suggesting mixing encoded / decoded - with mojibake risk [10:46] I suggesting we stay decoded... [10:46] and while MGO does allow you to poke at the internals of the DB, it isn't *how user code is meant to look* [10:46] *encoded [10:46] dammit [10:46] getting my metaphors wrong [10:46] voidspace: TheMue: joining [10:47] jam: I may well be overstating it [10:47] jam: I'll talk to Gustavo about it [10:48] TheMue: dimitern: neither my mic nor camera are working [10:48] dimitern: I had to plug in my headphones, and I think the sound settings ar ewrong, brb [10:49] jam: each time you're frozen [10:49] TheMue: strange, as I can follow you guys just fine [10:51] TheMue: dimitern: voidspace: k, I'll type to respond to you guys [10:51] but I can follow you without problem [10:51] dimitern: so what's up with you today [10:52] dimitern: feel free to run the meeting since people can't follow me well [10:52] my webcam works fine [10:52] "cheese" uses it no problem [10:52] it's chrome [10:52] *grrr* [10:52] voidspace: hehe, I almost never use chrome anymore, but the fox [10:57] dimitern: do you agree that the Networker worker shouldn't be deciding what mode to be run in, but it should be a Job ? 
=== mup_ is now known as mup [10:58] jam, the networker does not decide on its own, it's started in either safe mode or not [10:58] TheMue: not that grouping for Facades [10:59] dimitern: sure but it makes sense (to me) that the Agent doesn't decide its tasks, but is given them [10:59] and the logic of whether that task should be run is determined elsewhere. [10:59] and encoded as "Jobs" [10:59] jam, but using a job works for me, except that little quirk about disabling cloud-init scripts for maas [11:00] dimitern: so there is a bit of duplicating logic, but only because we want to get rid of the cloud-init step anyway [11:00] sorry, rejoining [11:00] dimitern: we should still do bridging in the Networker, IMO [11:01] dimitern: *today* what we have been doing in cloud-init should be done in the Networker [11:01] dimitern: irrespective of the new MaaS api [11:02] dimitern: we can do the same logic we have today [11:02] which is "always bridge eth0" [11:02] but grow into better logic [11:06] voidspace: so they only go in via "replicaSet" [11:07] voidspace: so the issue is that server.Addr still has the bad ipv6 address [11:08] voidspace: so from what I can see server.Addr is the one that we pass to newServer [11:09] so cluster.server() is the other place that is setting it [11:09] voidspace: and that is being called by spawnSync [11:09] which got the result of resolveAddr [11:09] and got that addr [11:10] ultimately from an IsMaster call [11:10] which is again ReplicaSet related, and we should be able to patch it at that level [11:11] jam: which newServer? 
[11:11] voidspace: so I think we can patch line 140 of cluster.go to know it needs to translate back [11:11] mgo/server.go newServer [11:11] Add a check there [11:11] that we don't have an invalid IPv6 address [11:11] ah, we have a newServer too [11:11] which doesn't take an address [11:12] I've done a pull on my mgo so I can look at the latest version [11:12] so our lines aren't matching up [11:12] let me go back [11:12] * jam goes to pick up my son [11:25] voidspace: so looking at 'master' [11:25] mgo.v2/server.go newServer [11:25] is where Addr seems to be getting set [11:25] (I didn't find another spot) [11:25] that seems to be called from cluster.go 394 "server()" and the addr is passed in [11:26] voidspace: and that is only being called by line cluster.go 457 [11:26] that addr comes from the call to spawnSync [11:26] which gets it from knownAddrs or from a hosts list [11:27] hosts is from the result of syncServer, which gets it from a results object [11:27] which is the result of calling ismaster [11:27] getKnownAddrs doesn't talk to mongo, but just pulls together all of the objects it already has in memory [11:27] so I'm reasonably comfortable [11:28] saying the patch could be: [11:28] a) add a trap in server.go newServer that doesn't allow Addr to be an invalid IPv6 address (can probably use net.ParseAddr for that) [11:28] b) fix cluster.go line 136 isMaster call to call, fill the result object, and then fix the result object to have valid addresses [11:29] c) fix our replicaset/ package to do similar things [11:29] c-i) we duplicate IsMaster, so we need to duplicate the fix [11:30] c-ii) CurrentConfig probably needs fixing [11:30] c-iii) Not sure about CurrentStatus, but probably [11:31] c-iv) And Initiate and applyRelSetConfig would need fixing [11:31] though probably that is a helper that takes a Config [11:32] mungeIpv6Addresses(*Config) [11:51] voidspace: my machine just locked up. 
It is working well enough to lock the screen, but all text entry fields are not working. [11:51] voidspace: [11:51] so I don't know if you got my earlier message and whether it made sense [11:54] fwereade, ping? [11:54] mattyw, pong [12:09] fwereade: so if we add a JobManageNetworking, that requires an API bump, doesn't it? [12:14] late good morning [12:15] jam1, yeah, I think it does [12:15] jam1, we don't really want to confuse old clients [12:15] jam1, even if they would probably handle it with a minor logged whine [12:15] mmmmm, wine [12:15] :) [12:16] fwereade: I think they would actually just casually discard it because the checks I see have an empty "default:" section. [12:17] but yes [12:17] I'm fine saying that it must be a new API when the set of values can change [12:17] jam1, ah, I thought I remembered them logging an "unknown job" -- but indeed, I think we agree anyway ;) [12:17] jam1: i ran an ensure-availability test with my mongo login changes - the new state servers appeared to correctly start and juju status shows everything is ok. all-machines log looks ok too [12:17] wallyworld: are we sure that clusterAdmin is respected with 2.4 ? [12:18] because I'm sure machine-0 is the one that is setting up the replicaset [12:18] jam1: this is on a trusty state server [12:18] which is mongo 2.4.9 i believe [12:19] the changes are only a band aide anyway :-( [12:19] i can't see a way to turn it off [12:21] wallyworld: so at this point it seems like we'd have to dig into the mongo code and figure out why it is emitting the warning, and I'm guessing it is a bug in mongo. 
[12:21] jam1: yeah, i did link 2 very similar bugs in the juju-core bug report [12:21] mongo bugs that is [12:21] any they are marked as targetted at 2.7 [12:21] and [12:22] so i can't see any fix coming for 2.4 [12:22] wallyworld: I agree that mongo won't fix it, I'm guessing it isn't something we can fix ourselves, unless we can do some post-config on syslog [12:29] jam1: hey [12:29] voidspace: heya [12:29] jam1: sorry, missed your messagess [12:30] jam1: pretty sure I saw all your messages [12:30] voidspace: k. does it sound reasonable ? [12:30] jam1: yep [12:30] CurrentConfig and CurrentStatus definitely need fixing, plus Add and Set [12:31] I think Add/Set end up using the same apply helper [12:31] or maybe just fixing applyRelSetConfig (which I've renamed applyReplSetConfig because it annoyed me) would do Add and Set automatically [12:31] right [12:31] they take a config which has Members and it's Members that needs fixing [12:32] voidspace: so I'd rather not mutate Members, but instead use an internal munged Members to pass on [12:32] jam1: yep [12:32] voidspace: though that depends on whether you get a Members or a *Members [12:32] jam1: although I think we create the config [12:33] voidspace: then it can just be the config thing that we mutate [12:33] jam1: it's internal [12:33] which I think was my "mungeIPv6Addresses(*config)" suggestion [12:33] jam1: the replicaset changes are easy enough [12:33] it's the mgo ones that are more funky, but you've done a lot of the work tracing it for me [12:34] voidspace: certainly you have to confirm with gustavo for the mgo ones, but I think they're straightforward and limited in scope [12:34] cool [12:35] and stick well to the "translate at the point that is known to give/need bad information" [12:35] so long as that doesn't proliferate too far [12:36] well, it is all the stuff that talks about the replicaset config, I think [13:05] ericsnow: ping me when you are back please [13:09] has anyone here used the juju 
publish command? [13:14] rogpeppe: I didn't even realize it already existed === jheroux_away is now known as jheroux [13:39] perrito666: I'm here [13:42] perrito666: let me guess, you have another PR you want me to "accidentally" merge [13:42] ericsnow: mm, so you are the go to guy for those things :p [13:42] * perrito666 makes a note [13:43] nah, I wanted to make sure that with what is merged I can already work on restore integration to your code [13:44] perrito666: yep, the only missing parts are the high-level abstraction and the API server facade [13:44] perrito666: neither should have any relationship with the restore implementation [13:45] ericsnow: did you pr the API server facade [13:45] ? [13:45] perrito666: it depends on 708, which is up for review right now [13:46] it has been lgtmd, hasnt it? [13:49] perrito666: needs sign-off from wwitzel3's review mentor (or a full reviewer) [14:04] natefinch: standup? [14:08] perrito666: oops, coming [14:15] natefinch: is there a standard trick for a "right split" on strings, given there's no strings.SplitRight function? [14:15] natefinch: other than revers, split, reverse again [14:17] voidspace: use strings.LastIndex? [14:18] natefinch: ah cool, that will do nicely [14:18] thanks [14:20] natefinch: and is there a function to split a string at an index point? [14:22] voidspace: foo, bar := baz[:x], baz[x:] [14:22] natefinch: thanks [14:22] nice and easy [14:22] as it was easy, I assumed Go didn't support it... === urulama is now known as urulama-afk [14:32] natefinch: could you change the juju org github OAuth app URL to "https://reviews.vapour.ws/oauth/"? 
[14:33] ericsnow: sure
[14:34] ericsnow: done, though that was only a case-change from OAuth to oath
[14:34] oauth that is
[14:35] natefinch: ah, cool
[14:36] natefinch: that URL still won't work until I get SSL working, but I can wait to switch "apps" until then
[14:36] natefinch: right now it's using the app I registered on my own github account, which obviously is only a short-term solution
[14:38] natefinch: ping, I'd like to ask a couple of questions if you don't mind
[14:39] natefinch: I need to ensure I'm working on a copy of a struct and I don't know this area of Go well enough to know if I already am or not
[14:39] natefinch: (because I want a mutated copy of the struct but don't want the caller to see the change)
[14:39] I haven't actually asked the question yet. I don't expect you to know just from that...
[14:41] natefinch: http://pastebin.ubuntu.com/8300392/
[14:42] natefinch: just constructing my own version for play.golang.org to find out...
[14:43] voidspace: the first rule of Go is that everything is passed by value
[14:43] right
[14:43] except slices
[14:43] and therefore maybe iterating over a slice
[14:43] and the *call* is constructing a slice too
[14:43] voidspace: nope, they're passed by value, it's just that the value is a pointer to an array
[14:43] voidspace: sorry, brb
[14:43] pass by value where the value is a pointer is what python does
[14:43] which never copies
[14:44] so that doesn't elucidate...
=== Ursinha is now known as Ursinha-afk
[14:49] natefinch: trying it with play.golang.org shows me it's a copy
[14:49] natefinch: I *assume* it's the call and not the iteration that copies (?)
[14:50] although I can test that as well
[14:50] nope, the iteration copies too
[14:50] unless you have a slice of pointers I guess
[14:53] voidspace: back
[14:54] natefinch: so the iteration definitely returns a copy
[14:54] natefinch: and so does the call
[14:54] everything is always a copy unless you're dereferencing a pointer. The trick is that slice[0] is dereferencing a pointer
[14:54] right, but iterating over the slice isn't
[14:54] voidspace: correct
[14:54] but slice[0] = foo
[14:54] is that creating a pointer
[14:54] I guess it must be
[14:55] slice[0] is dereferencing the pointer to the backing array and setting its value to foo
[14:55] natefinch: that's clear to me now, thanks
[14:55] voidspace: cool
[14:55] natefinch: I needed to be sure I had a copy because I want to mutate the value
[14:56] natefinch: in replicasets we now have "good ipv6 addresses" and "bad ipv6 addresses"
[14:56] ahrh
[14:56] interesting
[14:56] natefinch: we always want to use good addresses, but mongo only works with bad ones
[14:56] natefinch: the bug causing the ipv6 replicaset test to be unreliable is due to the fact that we *have* to use the format "::1:1234" for mongo
[14:57] natefinch: but mgo calls net.Dial(addr) when it does syncServers
[14:57] natefinch: and for net.Dial(addr) you *must* use the form [::1]:1234
[14:57] voidspace: well that's a kick in the pants
[14:57] natefinch: so the test would pass if we didn't cause a syncServers and would fail if we did
[14:57] natefinch: which seems to be random :-)
[14:57] natefinch: it needs a fix in mgo
[14:58] natefinch: but we're going to ensure that in replicaset (i.e. our side of the code) we only use and see the "good format"
[14:58] i.e. *with* square brackets
[14:58] voidspace: can we not use a struct with renderers for the different formats?
[14:58] perrito666: the address is in the serialised bson
[14:58] perrito666: and mgo stores its own concept of server addresses
[14:58] perrito666: so no
=== Ursinha-afk is now known as Ursinha
[15:07] * perrito666 tries lite ide
[15:08] perrito666: it's ok. It makes using gdb less painful, but it's still not great
[15:08] natefinch: to be honest I usually only use the code navigation features on ides
[15:11] perrito666: ahh, the only reason I tried lite ide was the gdb integration. As an editor it's kinda meh
[15:15] * TheMue came back to the good old vim after trying Sublime text for some time
[15:16] * katco coughs. https://www.youtube.com/watch?v=DubEaS0lMqE
[15:17] TheMue: I always go back to vim
[15:17] but every now and then I need to take a stroll out of my comfort zone to remind myself why I use vim
[15:18] I technically have atom installed... I started it up once..... but haven't really played with it
[15:19] perrito666: hehe, good argument
[15:19] katco: trust me, RMS dressed as a saint is the opposite of good marketing for your editor
[15:20] perrito666: tongue in cheek :p i don't try to market my editor haha
[15:21] katco: I want to compare a value to a set of possible values
[15:21] katco: is there anything more elegant than
[15:21] (entry.State == PrimaryState || entry.State == SecondaryState || entry.State == ArbiterState)
[15:23] voidspace: switch entry.State { case PrimaryState, SecondaryState, ArbiterState: myfunc() }
[15:24] katco: cool, thanks
[15:25] voidspace: any time :)
[15:31] mm, why no editor offers a package navigation instead of file navigation
[15:32] perrito666: vim with tagbar allows a kind of, at least inside the current file as scope. here you can navigate over types, fields, functions etc
[15:33] TheMue: yup, so far I got, but what I meant is the way to navigate in the packages of a project
[15:35] perrito666: yes, I know, but sadly here I don't have a better answer yet
[15:35] TheMue: I am trying to mod ninja-ide to support go, I presume that I will eventually get there and will be able to navigate packages
[15:35] * TheMue still wants his old Smalltalk platforms back *sniff*
[15:36] perrito666: write a good vim plugin, I'll use it
[15:36] TheMue: I prefer to cut my fingers, vim plugin lang is awful
[15:37] perrito666: I've got at least my own little plugin giving me the most important commands at my fingertips
[15:37] perrito666: vimscript isn't nice, yes, but it works. but afaik you can use python too
[15:37] perrito666: or lua?
[15:42] fwereade, ping
[15:44] alexisb, heyhey
[15:44] alexisb, sorry about yesterday, public holiday
[15:46] fwereade, yep, I saw that post my pong
[15:46] ping
[15:46] I would like to meet today if possible
[15:47] alexisb, do you have a couple of minutes now? otherwise it will need to be later
[15:47] are you free post the actions call?
[15:48] alexisb, I'm catching up with bodie now because I need to be away at 6
[15:48] fwereade, what time are you available later?
[15:50] alexisb, to be safe, let's say 3 hours from now on the hour?
[15:50] alexisb, hope it'll be quicker
[15:50] alexisb, but probably better to have an actual time
[15:50] fwereade, ack, 3 hours is fine
[15:50] I am not in a hurry but would like to catch you this week before you are out
[15:54] alexisb, yeah, it's been a while, I meant to come to our 1:1 yesterday but then was out and completely forgot
[15:55] fwereade, no worries at all
[15:55] I sent an invite and I am flexible if that doesn't work
[16:03] natefinch: I totally spaced our 1-on-1
[16:03] natefinch: you have time later?
[16:03] ericsnow: heh it's ok, I was busy anyway. later is fine
[16:03] natefinch: when is good for you?
[16:05] ericsnow: pretty much any time except for the next hour
[16:05] natefinch: let's go in 2 hours then
[16:06] ericsnow: cool
[16:23] fwereade, jam: i see that ian has changed the blobstore to use sha384, which is great. i wonder what you think about using sha256 instead (a minor change now) so that the hashes match the current hashes used for local charm caching. this would make migration considerably more straightforward.
[16:24] wallyworld: ^
[16:56] woo, back standing
[16:57] that was a crappy two weeks of sitting :(
[16:58] wwitzel3: sweet you unpacked the legs finally?
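Going back to the 14:39 to 14:58 exchange on struct copies and IPv6 address formats, here is a minimal sketch of both points together. It assumes a hypothetical member type and a hypothetical helper withDialAddresses; neither is the actual replicaset code. The range variable is a copy of each element, so mutating it does not touch the caller's slice, and net.JoinHostPort produces the bracketed form that net.Dial expects.

```go
package main

import (
	"fmt"
	"net"
)

// member is a hypothetical stand-in for a replica set member record.
type member struct {
	Host    string // e.g. "::1"
	Port    string // e.g. "1234"
	Address string // rendered host:port form
}

// withDialAddresses (hypothetical) returns mutated copies of the members.
// The caller's slice is left untouched because m is a copy of each element.
func withDialAddresses(members []member) []member {
	out := make([]member, 0, len(members))
	for _, m := range members {
		// JoinHostPort adds square brackets when the host contains a colon.
		m.Address = net.JoinHostPort(m.Host, m.Port)
		out = append(out, m)
	}
	return out
}

func main() {
	orig := []member{{Host: "::1", Port: "1234", Address: "::1:1234"}}
	fixed := withDialAddresses(orig)
	fmt.Println(orig[0].Address)  // ::1:1234
	fmt.Println(fixed[0].Address) // [::1]:1234
}
```

Writing through an index instead (members[i].Address = ...) would change the caller's data, which is the slice[0] case discussed at 14:54.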
[16:58] :p
[16:59] :)
[16:59] they arrived after standup
[17:24] natefinch: wow, it really sucks as an editor, when hitting ctrl+s if there is nothing to edit it will just write s
[17:24] s/edit/save
[17:25] perrito666: wow, I hadn't noticed that
[17:26] they are doing a Qt pattern for key handling which should not be used in this case lol
[17:30] natefinch: can you start a strategy multiple times?
[17:31] voidspace: I don't know
[17:32] natefinch: heh, me neither
[17:32] natefinch: guess I'm about to find out
[17:32] natefinch: well, it either worked or succeeded on first attempt so didn't need to check for HasNext
[18:06] ericsnow:
[18:06] natefinch: coming
[18:24] right EOD
[18:24] natefinch: https://github.com/juju/juju/pull/708
[18:51] cmars: could you take a look at https://github.com/juju/juju/pull/708?
[18:55] ericsnow: I'm currently reviewing, btw, but certainly welcome more eyes.
[18:55] natefinch: sorry, I thought you were running an errand
[19:00] ericsnow: doesn
[19:01] ericsnow: I was just going to the freezer, not the store :) I can see the confusion, though.
[19:01] natefinch: :)
[19:01] ericsnow, back from lunch, will take a look soon
[19:01] uhh ice cream
[19:01] cmars: thanks
[19:01] mattyw, fwereade restarting chrome for a hangout
[19:01] perrito666: well, my wife *is* pregnant :)
[19:02] cmars, ack
[19:02] natefinch: so you got all the possible combos?
[19:02] perrito666: heh
[19:02] perrito666: nah, just chocolate.
[19:10] rick_h_: I sent an email to the juju-dev list asking for use cases around Juju Actions
[19:11] jcw4: awesome
[19:11] rick_h_: per our discussion last week I'm particularly interested in the GUI perspective... thanks :)
[19:12] jcw4: will do, we've got a couple of charms in progress we'd use actions for
[19:12] rick_h_: perfect!
[19:17] * cmars is looking at ericsnow's backup PR
[19:18] ericsnow, i'm not familiar with the backup story in juju, but i'll try to pick it up from context in the code. anything else that might be helpful (bugs, docs, etc)?
[19:18] cmars: https://github.com/juju/juju/blob/master/doc/backup_and_restore.txt
[19:18] sweet, thanks
[19:19] cmars: basically that PR is the barrier between the backups implementation and the rest of juju
[19:22] * perrito666 said he would buy one of those new phones and suddenly there is a horde of floss advocates comparing him to every possible traitor in history
[19:22] it's a good thing I said nothing about the watch thing
[19:40] jcw4: so this doc you linked seems more about the actions api vs examples of 'actions a charm would implement'
[19:40] jcw4: which are you kind of looking for?
[19:40] ericsnow: you got lgtmd, meeerge
[19:41] :p I want to see the next pr
[19:41] perrito666: there's a CI blocker
[19:41] ...
[19:41] life
[19:52] perrito666: in the meantime, you can already review the next patch at https://github.com/ericsnowcurrently/juju/pull/4
[19:54] cmars: would you mind reviewing the next PR instead ^^^
[19:54] ericsnow, sure
[19:54] cmars: thanks
[19:56] ericsnow: wait, in none of those I see the actual backups command, aren't you missing one pr?
[19:57] rick_h_: right now we're trying to nail down the API, but examples of actions a charm would implement are valuable too.
[19:57] perrito666: that PR exposes the Create method on the new Backups facade
[19:58] jcw4: ok, will make some space for our notes/such and you can pull it in as you need.
[19:58] muchas gracias!
[20:04] sinzui: how soon do you think we'll know on the re-testing for bug #1366802?
[20:04] Bug #1366802: juju.-gui fails with a config-changed error when used under juju 1.21alpha
[20:11] ericsnow, It is fixed, but there is a more catastrophic regression being reported now
[20:12] hi
[20:12] sinzui: lovely
[20:14] "juju bootstrap" doesn't work because juju seems to compare non-localized-error-string against localized-error-string ... https://github.com/juju/juju/blob/master/environs/sshstorage/storage.go#L254
[20:15] so why not using the errno ?
=== ChanServ changed the topic of #juju-dev to: https://juju.ubuntu.com | On-call reviewer: see calendar | Open critical bugs: 1367431
[20:16] natefinch, can you get someone looking at bug 1367431?
[20:16] Bug #1367431: Juju upgrade times out, never completes
[20:22] pafounette: ug, that's some ugly code
[20:23] natefinch, yeah :) and I can't "bootstrap" just because my hosts don't return english error messages
[20:23] pafounette: that's definitely a bug we need to fix. My apologies for the problems it's causing.
[20:24] natefinch, thanks :)
[21:13] menn0: morning
[21:13] menn0: not sure if it is related to the bug natefinch just passed on
[21:13] menn0: but I consistently get a jujud upgrade test failure locally
[21:13] menn0: could be related to my changes though
[21:13] menn0: want me to check master?
[21:13] thumper: unit test or running a manual upgrade?
[21:13] unit test
[21:14] interesting
[21:14] but I'm not sure if I even have your branch merged in actually
[21:14] * thumper checks
[21:14] menn0: actually, I don't
[21:14] so it won't be that
[21:14] ok
[21:15] do you have the failure details handy? I might still be able to figure out what's happening.
[21:15] menn0: sure, will pastebin
[21:15] thumper: wasn't there discussion somewhere about supporting sub-commands for juju sub-commands?
[21:15] menn0: do you think that my branch that was merged yesterday can be the culprit?
[21:15] menn0: I haven't looked at the CI failures yet but given the timing it seems probable
[21:16] menn0: http://paste.ubuntu.com/8303344/
[21:16] ericsnow: yes
[21:16] ericsnow: I'm trying to get direction from sabdfl
[21:16] ericsnow: we use it sometimes now
[21:16] perrito666: wow, that just keeps on giving
[21:17] ericsnow: but specs that have been put forward recently get sent back with "use top level commands"
[21:17] thumper: I'm adding a new juju backups command that will have its own subcommands
[21:17] menn0: do remember that a couple of steps were reverted there and you had a comment on how the order of some steps were reverted in order
[21:17] ericsnow: fwereade and I (and some others) do prefer subcommands, so we're trying to get definitive feedback
[21:17] thumper: I'd rather not have to roll my own support for that if I can avoid it :)
[21:18] ericsnow: there are examples already in our code
[21:18] thumper: which commands?
[21:18] ericsnow: look at the user command
[21:18] perrito666: yep
[21:18] thumper: thanks
[21:18] ericsnow: although it is disabled on the release branch
[21:18] as we mess around with the api
[21:19] thumper: no worries
[21:19] perrito666: sorry, I misunderstood the first thing you said. I think it's more likely that it's my big upgrade sync branch, not yours.
[21:19] perrito666: but I don't know much at this stage.
[21:19] thumper: is that the only test that's failing?
[21:19] menn0: I am eod but count me in for additional help
[21:19] perrito666: thanks
[21:20] menn0: yeah
[21:20] menn0: I'm looking at it now
[21:20] perrito666: I think once I'm able to reproduce the problem locally I should be able to get it sorted pretty quickly.
[21:21] thumper: that test is one of the older ones (from before my time) although the code it's running has certainly changed plenty recently
[21:21] thumper: it's quite strange that it's just that test failing
[21:21] * thumper nods
[21:21] thumper: definitely try with master on your machine as a first check
[21:23] menn0: ping me if you need any help with this bug
[21:23] thumper: will do thanks
[21:23] * thumper pulls master
[21:27] cmars: FYI, I've added you as an admin on reviewboard
[21:27] ericsnow, sweet, thanks
[21:35] menn0: I've grabbed an updated master, and currently the tests are stuck on cmd/jujud
[21:35] I'm guessing they may time out later...
[21:36] menn0: but this could be handy for reproducibility
[21:36] thumper: I'll try current master on my machine
[21:37] menn0: I'm wondering if this is related to a change from axw where the tools are now in the environ storage
[21:37] menn0: or gridfs or wherever it went
[21:38] thumper: could be. curtis wrote on the ticket for the CI blocker that axw's merge for that seemed to be where things broke.
[21:38] menn0: this one https://github.com/juju/juju/pull/700
[21:39] thumper: the jujud upgrade tests on master pass on my machine
[21:39] thumper: trying all the cmd/jujud tests now
[21:39] failed here
[21:40] thumper: wonderful :(
[21:40] 3 failed
=== jheroux is now known as jheroux_away
[21:42] menn0: confirm your tip hash?
[21:42] thumper: 203a10db796649043a1162df35d6cf96a14b4798
[21:42] which is pull #681 merge?
[21:42] if so, we have the same version
[21:43] * thumper reruns tests
[21:43] thumper: that's the one
[21:43] thumper: all cmd/jujud tests pass btw
[21:43] seems like a race condition then
[21:43] thumper: so there's something environment at play
[21:43] I got three failures
[21:43] perhaps
[21:43] environmental I mean
[21:43] either environmental or racy
[21:44] menn0: run the tests five times
[21:44] thumper: will do
[21:44] thumper: i've dug into the CI failure logs a bit for the "local upgrade on trusty" job
[21:45] thumper: machine-0 upgrades fine, machine-1 upgrades fine (but there's a rsyslog issue) and machine-2 doesn't upgrade because it can't download the tools
[21:45] thumper: so it's looking more likely that it's the tools in gridfs change that's causing the CI issues
[21:46] hmm...
[21:46] why can't machine-2 get the tools?
[21:46] ha
[21:46] not environmental
[21:46] pass that time here
[21:46] this gets repeated over and over:
[21:46] well that sucks
[21:46] 2014-09-09 19:26:38 INFO juju.worker.upgrader upgrader.go:167 fetching tools from "https://10.0.1.1:17070/environment/558e5fc8-f707-45d6-8066-0698e5ac2e4e/tools/1.21-alpha1.1-trusty-amd64"
[21:46] 2014-09-09 19:26:38 INFO juju.utils http.go:66 hostname SSL verification disabled
[21:46] 2014-09-09 19:26:41 ERROR juju.worker.upgrader upgrader.go:157 failed to fetch tools from "https://10.0.1.1:17070/environment/558e5fc8-f707-45d6-8066-0698e5ac2e4e/tools/1.21-alpha1.1-trusty-amd64": bad HTTP response: 400 Bad Request
[21:46] in the machine-2 logs
[21:47] * menn0 goes to run those unit tests again
[21:54] runs twice
[21:54] then failed in one place
[21:55] machine_test.go:701:
[21:55] through machine_test.go:909:
[21:57] thumper: I've just run all the cmd/jujud tests 5 times without failure
[21:57] try again?
[21:58] thumper: menn0: just getting up to speed, anything I can do?
[21:58] wallyworld: we have 2 problems, possibly related
[21:58] wallyworld: the gridfs tools patch is causing CI errors
[21:58] thumper: intermittent?
[21:58] wallyworld: but I also get race conditions in cmd/jujud tests
[21:58] wallyworld: seems to pass on only one architecture
[21:58] hmmmm, ok
[21:59] wallyworld: almost all the upgrade related CI tests are failing
[21:59] i'll start looking at CI, will likely make more progress once andrew comes online
[21:59] wallyworld: I've been looking at the logs for the CI failures, particularly local provider upgrades on trusty
[22:00] wallyworld: there's 3 machines in the env. machine-0 and machine-1 upgrade fine
[22:00] wallyworld: but machine-2 can't download the tools
[22:00] interesting
[22:00] wallyworld: even though machine-1 is downloading from the same URL
[22:00] wallyworld: at about the same time
[22:00] :-(
[22:00] wallyworld: bad HTTP response: 400 Bad Request
[22:01] so not a 404
[22:01] or a 500
[22:01] nope
[22:01] the server thinks the client is sending a bad request
[22:02] and yet it will be the same request for machine 1 or 2
[22:02] wallyworld: indeed!
[22:02] wallyworld: that's what's strange
[22:02] awesome
[22:03] wallyworld, thumper: I'm going to try and repro the CI failure locally
[22:03] i'll start digging as well, just need a coffee first
[22:03] wallyworld: thumper: and if that pans out, try ripping out axw's change
[22:04] wallyworld, thumper: I thought it was going to be related to my big upgrade sync merge but it's really not looking like that now
[22:04] menn0: you can leave it to me to look if you want to get back to other tungs
[22:04] things
[22:05] wallyworld: that might make sense
[22:05] no use all of us being tied up
[22:05] wallyworld: oh and another thing
[22:05] wallyworld: a possible other problem I noticed in the CI failure logs
[22:05] you sound like Columbo
[22:06] wallyworld: after machine-1 upgraded (successfully) the rsyslog worker was borked
[22:06] 2014-09-09 19:17:32 INFO juju.worker runner.go:261 start "rsyslog"
[22:06] 2014-09-09 19:17:32 ERROR juju.worker runner.go:219 exited "rsyslog": x509: cannot validate certificate for 10.0.1.1 because it doesn't contain any IP SANs
[22:06] 2014-09-09 19:17:32 INFO juju.worker runner.go:253 restarting "rsyslog" in 3s
[22:06] machine-0 was fine after upgrade
[22:06] sounds unrelated
[22:06] and machine-2 didn't manage to upgrade
[22:06] i have no idea what an IP SAN is
[22:06] wallyworld: yep I think it's unrelated but is yet another thing to sort out
[22:06] yeah :-(
[22:06] no doubt related to the recent work in this area
[22:07] yup
[22:07] I don't know what an IP SAN is either
[22:07] cmars: I'm going to stand you up today
[22:07] cmars: next week?
[22:07] thumper, no prob
[22:07] menn0: thanks for looking
[22:07] I could guess what a SAN IP is but that makes no sense in terms of the rsyslog worker :)
[22:08] wallyworld, IP SAN = subjectAltName
[22:08] you have to use a different x509 field to issue a cert for an IP addr
[22:08] ok
[22:09] maybe we should change the logs to say subjectAltName instead of SAN
[22:09] hopefully the network guys know how to fix
[22:09] menn0, i think that message comes from crypto/tls
[22:09] thumper: do you want to try running the jujud unit tests with andrew's change removed?
[22:09] or crypto/x509
[22:09] cmars: right
[22:09] cmars: so not so easy to change
[22:10] menn0: will do, otp just now
[22:10] thumper: or indeed mine
[22:10] thumper: kk
[22:16] wallyworld, katco does safe-mode get converted to provisioner-harvest-mode during upgrades?
[22:16] sinzui: yes
[22:16] although the default is different
[22:17] thank you wallyworld. so long as the value set transitions to the new scheme, I don't need to document madness
[22:17] That is the only good news I have had today
[22:17] :-(
[22:18] CI is not happy with trunk
[22:18] wallyworld: my frustration is around the intermittent failures I get on upgrade
[22:18] wallyworld: but I believe they may be due to mongo not starting
[22:18] yeah :-(
[22:18] wallyworld: and may well under the covers be the standard mongo failures
[22:18] hard to get logging for that.
[22:18] well awesome
[22:19] * thumper sees where extra logging could go
[22:19] sinzui: i haven't checked - are you guys setting up a test run on jenkins using mongo 2.6
[22:19] not yet wallyworld
[22:19] ok
[22:27] I'm looking at this ticket https://bugs.launchpad.net/juju-core/+bug/1365623, is there any reason we can't just add a --force to juju run and skip the acuireHookLock step? Is there more to it than that? Or should that work?
[22:27] Bug #1365623: juju run with option to bypass hook queue
[22:31] wwitzel3: seems fine, as long as it only works at the machine level, not charm
[22:43] bugger
[22:48] * thumper drums fingers while waiting for the jujud tests
[22:48] five good in a row
[22:48] * thumper has messed with logging
[22:48] the more I mess with the tests, the more inclined I am to write my juju-test plugin
[22:50] thumper: i had a *very* brief look earlier, and it seems jujud represents the vast majority of our intermittent failures now
[22:50] a bit early to tell for sure
[22:51] i'm very tempted to remove the test retry on landing
[22:51] I'm going to poke a bit longer
[22:51] +1 on that
[22:51] ok, i'll jfdi
[22:53] gah
[22:53] tests aren't failing now
[22:55] thumper: sinzui: i have removed the --retry flag from the landing tests. let's see how that pans out
[22:58] wallyworld, thank you, I think you are taking a courageous step
[22:58] wallyworld: EOD, sent you an email w/ latest
[22:59] wallyworld: any insight into the tools upgrading errors?
[23:00] I did test upgrade, all worked... :/
[23:02] thumper: ok, thanks
[23:03] sinzui: we can easily revert, but i hope the landing tests will pass much more often now
[23:03] katco: thank you, have a good evening
[23:03] axw_: not yet sadly
[23:04] wallyworld: thanks, have a good day wallyworld and axw_ (and everyone just coming on)
[23:11] cheers katco, good night
[23:31] axw_: there's a difference between 1.20 and 1.21 - the tools fetching in 1.20 uses utils.GetHTTPClient(hostnameVerification), whereas 1.21 uses utils.GetNonValidatingHTTPClient()
[23:31] maybe that could explain the 400
[23:32] when 1.20 is trying to fetch the new tools
[23:32] just a guess, but i can't see anything else to go on
[23:33] wallyworld: that's intentional; we can't validate the API server for HTTPS
[23:34] I just tested (again) upgrading 1.20.7 to 1.21-alpha1
[23:34] testing on ec2 now
[23:34] worked on local
[23:34] axw_: but 1.20 is talking to the state server http endpoint to get the tools
[23:35] using the validating client and https
[23:35] wallyworld: ah yes, but the 1.21 API server always tells the client to disable verification
[23:35] there's an API call that is used first to find the URL, and a flag to decide whether validation is done
[23:35] ok, so we are sure hostnameVerification=false
[23:36] pretty sure we'd see something other than 400 if validation failed anyway
[23:36] fairly, I'll double check
[23:36] yeah, i'm just clutching at straws a bit
[23:37] wallyworld: yep, in apiserver/common/tools.go, there's a TODO to remove the flag in 1.22
[23:37] ok
[23:39] axw_: the logs seem to show that the only tools fetch that succeeds is the one to get them from http://juju-dist.s3.amazonaws.com
[23:39] waigani: I'm pretty sure the "blank space before comment" rule only applies when it follows another code block.
[23:39] it seems all of the calls to the state server http fail
[23:39] wallyworld: yeah...
[23:40] axw_: oh really?
[23:41] axw_: so directly after a func sig is okay?
[23:41] i gotta agree with axw_ here, waigani
[23:41] too much whitespace is horrible
[23:41] so you can see where one logical set of operations begins and another ends
[23:41] wallyworld, axw_: so what exactly is the rule, as I've been reviewed to add whitespace before comments
[23:42] waigani: the only times I've been told that (before) is when there's a bunch of fields together in a struct, and no space between them/comments
[23:42] axw_: that makes sense, I'll go with that for now
[23:42] when declaring an interface, you need whitespace before each doc comment for the methods
[23:42] cheers
[23:42] also when folling a code block
[23:43] following
[23:43] davecheney: comment added, FWIW, but no vanguard to poke
[23:43] wallyworld: I've added some changes to the jujud tests, and now I can't get any failures
[23:43] good right?
[23:43] wallyworld: so I guess that is good, but ever so slightly concerning
[23:43] wallyworld: yeah...
[23:43] I'll propose
[23:44] thumper: what sort of changes?
[23:44] after lunch...
[23:44] ok
[23:44] wallyworld: bah, upgrade on ec2 worked too :|
[23:44] jujud gained a setup logging method that did the lumberjack stuff
[23:44] that replaced the default logger
[23:44] now we use the default logger in the tests to send things through c.Log
[23:44] thumper: that's concerning that a logging change affects stuff like that
[23:44] so I mocked it out for the tests
[23:45] wallyworld: yes... that is why I said concerning
[23:45] i see now that i have the info :-)
[23:45] what I was trying to do was to capture the logging output
[23:45] instead of having the tests write it to a file
[23:45] heisenburg :-)
[23:45] now I could no longer reproduce
[23:45] yeah
[23:45] I'll submit a patch after lunch
[23:45] ok
[23:45] a fuck it
[23:45] I'll do it now
[23:45] then it may be landed by lunch
[23:46] axw_: do you want me to assign bug 1367431 to you? I've added some detail about what we know so far.
[23:46] Bug #1367431: Juju upgrade times out, never completes
[23:47] menn0: sure, thanks
[23:49] axw_: it seems that the sendError(400) used by the tools http server is not also logging the error passed to it, so we are blind as to the root cause
[23:49] maybe we need to add extra logging
[23:49] yep
[23:50] let's do that and land it as a fixes-blah
[23:50] then we can see the errors
[23:50] will get onto it
[23:50] ok
[23:50] wallyworld: https://github.com/juju/juju/pull/717
[23:50] wallyworld: let's do this...
[23:50] looking
[23:51] axw_: done
[23:51] thanks
[23:52] axw_: i think we should log the error server side in the sendError(), as well as when it is received client side
[23:53] wallyworld: changing client won't help atm, as it's old code. I will update server
[23:53] axw_: this may not be relevant but perrito666 merged some changes to API server login handling yesterday. that's about restricting API calls during restore but it may have inadvertently caused what you're seeing.
[23:53] menn0: maybe, though I've pulled master and can't repro yet
[23:53] wallyworld: if you are happy, please add the merge flags, I'm going to lunch
[23:53] thumper: will do
[23:54] ta muchly
[23:54] thumper: will need to wait till landings unblocked though
[23:58] wallyworld: actually there is a slightly useful error message that narrows it down a bit
[23:58] "bad HTTP response" means the API server failed to find the tools locally, and failed to find them remotely
[23:58] paste it?
[23:59] axw_: i could see from that (400) where in the code it is being generated, but not exactly why
[23:59] menn0: that was not supposed to be merged so you might revert it without asking too
[23:59] hence the need for extra logging