[00:10] Bug #1590045 changed: Uniter could not recover from failed juju run [01:09] thumper: menn0 fwereade https://github.com/juju/juju/pull/5578 === natefinch-afk is now known as natefinch [01:13] *decrement [01:15] "The coordination via stopped is not reliably observable, and hence not tested" [01:16] (and didn't work :) [01:16] fixed mis spelling [01:18] davecheney: well, it looks reasonable to me [01:18] davecheney: do all the tests now pass? [01:19] thumper: the cmd/jujud/agent test that was previously failing because the catacomb had not stopped the workers it was in charge of by the time the test is shut down [01:20] I'd really like to see a failing test in the catacomb package that this changes fixes. relying on a test far away to assert the correctness of the change is less than ideal. [01:21] natefinch: sure, https://github.com/juju/juju/pull/5564#issuecomment-224468739 [01:21] here is the failing test [01:22] you can see in the panic output that the firewalls and pinger are still running and were started from the catacomb [01:22] davecheney: I think what natefinch means is an explicit test on catacomb [01:22] there is an explicit test [01:22] small test [01:22] if the worker has not shut down, the catacomb tests will not pass [01:22] there is an explicit test for catacomb.Wait() [01:23] and did that one intermittently fail? [01:23] 'cos the catacomb tests before weren't testing shit [01:23] now the are [01:23] I fixed the code to match the test [01:23] * thumper sighs [01:23] but there wasn't a failing catacomb test [01:24] nope, see the comment in the PR [01:24] this wasn't tested [01:24] now itis [01:24] now it is [01:24] not explicitly just on catacomb without the agent bollocks [01:25] go func() { [01:25] defer catacomb.tomb.Done() [01:25] defer catacomb.wg.Wait() [01:25] catacomb.Kill(plan.Work()) [01:25] }() [01:26] ^ the catacomb is not dead until wg.Wait() drops to zero [01:26] wg.Wait cannot drop to zero until all the workers have exited [01:39] davecheney: http://reports.vapour.ws/releases/4039/job/run-unit-tests-trusty-amd64-go1-6/attempt/637 [01:39] pprof thingy [01:40] davecheney: I think that is related to what you added, yes? [01:40] I have NFI what is wrong though [01:40] just saw it fly past [01:43] thumper: yes that is the pprof facility we added a while back [01:43] i think axw had the last hack at fixing a related bug [01:43] I need to chat at some stage, perhaps next week, at how to hook into that :) [01:45] thumper: its on the wiki [01:46] what? docs? who reads these days :) [01:46] thumper: https://github.com/juju/juju/wiki/pprof-facility [01:46] ta [01:49] thumper: are we good ? [01:49] is it possible to hook into the listener to add additional output paths? [01:50] i.e. extra stats? [01:50] nope, sorry that's all the runtime exposes [01:50] hmm... ok [01:50] if you want to do something more than the bits we get for free from the runtiem [01:51] that's more than a non trivial amount of work [02:28] axw_: whenever you have time, this stores cloud region against model, and cloud name in controller doc http://reviews.vapour.ws/r/5023/ next one will use separate controllerconfig [02:38] thumper: we really should address those various TODO schema change items in the code base but we probs will run out of time :-( [02:42] * perrito666 runs the whole suite and fires netflix [02:53] yes :( [02:59] axw_: GCE region fix: http://reviews.vapour.ws/r/5024/ [03:02] yay, someone fixed my bug! 
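The exchange above turns on the ordering in that pasted snippet: because defers run last-in-first-out, wg.Wait() completes before tomb.Done() is called, so the catacomb cannot report itself dead until every adopted worker has actually exited, not merely been asked to stop. Below is a minimal, self-contained sketch of that invariant; it is not the real worker/catacomb code, and fakeWorker is a hypothetical stand-in for a juju worker.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // fakeWorker is a hypothetical stand-in for a juju worker: Kill asks it to
    // stop (and returns immediately), Wait blocks until it has really finished.
    type fakeWorker struct {
        name string
        done chan struct{}
    }

    func (w *fakeWorker) Kill() {
        go func() {
            time.Sleep(50 * time.Millisecond) // pretend shutdown takes a while
            close(w.done)
        }()
    }

    func (w *fakeWorker) Wait() { <-w.done }

    func main() {
        var wg sync.WaitGroup
        workers := []*fakeWorker{
            {name: "firewaller", done: make(chan struct{})},
            {name: "pinger", done: make(chan struct{})},
        }

        // One goroutine per adopted worker: wg.Done fires only after Wait
        // returns, i.e. only after the worker has really exited.
        for _, w := range workers {
            wg.Add(1)
            go func(w *fakeWorker) {
                defer wg.Done()
                w.Kill()
                w.Wait()
            }(w)
        }

        finished := make(chan struct{}) // stands in for catacomb.tomb.Done()
        go func() {
            defer close(finished) // deferred, so it runs only after...
            wg.Wait()             // ...every worker has been waited for
        }()

        <-finished
        fmt.Println("catacomb-like container is dead; no workers left running")
    }

This is why the panic output in the linked test could show the firewaller and pinger still running before the fix: nothing forced the shutdown path to wait for each worker's own Wait before declaring the catacomb dead.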
:) sounds like it was worse than just lying about where we were bootstrapping [03:03] anastasiamac: so, what happens if you bootstrap without specifying a region? [03:03] anastasiamac: have you tested live? [03:03] yes [03:04] natefinch: u will have to specify region in ur clouds. i think u cannot by-pass it now anyway... [03:04] axw_: prior to change, I'd always end up on us-central, after change i've bootstrapped in us-east :D [03:05] axw_: (as expected) [03:05] why is having a default bad? I mean, sure, don't override what someone explicitly specifies... but do we require other clouds specify regions? [03:06] natefinch: we have defaults at a higher level now [03:06] natefinch: in clouds.yaml, the first region is the default unless you specify one [03:06] and we always pass that into the provider [03:06] natefinch: because, like in this case, at the provider level, it's easy to omit getting mandatory property from bootstrap config [03:07] (two places for a default => subtle bugs) [03:07] natefinch: m not planning to remove default from 1.25, only planning to add copy param logic :D [03:09] axw_: ok, sure, yeah, not having multiple defaults is fine.... but clouds.yaml isn't really a default, it's a configuration. This means the user always has to manually specify a region, right? [03:09] natefinch: no. the first region specified in clouds.yaml, for each region, is the default unless you specify one. you can set a default region yourself, otherwise it's the first in the list [03:10] natefinch: run "juju show-cloud aws", and the first region in there should be the default for aws [03:10] (unless you have set one with "juju set-default-region") [03:12] axw_: I'm confused, I thought clouds.yaml was what we called the thing you had to write out by hand to pass into add-cloud. Is that a generated file we create on disk? [03:13] natefinch: sorry, I was being a bit non-specific. there's ~/.local/shared/juju/public-clouds.yaml (public clouds), and ~/.local/shared/juju/clouds.yaml (personal clouds, ones you added with add-cloud) [03:14] natefinch: the public-clouds.yaml file won't be there by default, it's also built into the client [03:14] natefinch: but "juju update-clouds" will pull down a file in that spot if there's been updates, so a client can refer to new clouds/regions without upgrading [03:15] axw_: ahh, thanks, ok. that was the context I was missing :) [03:18] natefinch: btw that intemediate clouds.yaml to feed into add-cloud should be going away (or at least relegated), as add-cloud will eventually (soon?) be made interactive [03:19] axw_: thank goodness. the less yaml I have to write, the better :) [03:19] yep, the aim is to stop editing files by hand [03:38] * thumper is on kid duty [03:38] I'm going to be afk for a while, but back after being a taxi tonight [03:41] davecheney: oh, also, will miss standup tomorrow morning, on airport pick up duty === thumper is now known as thumper-afk [03:45] cd . [04:03] thumper-afk: right ok [04:03] no worries [04:20] natefinch: shutdown -h [04:21] davecheney: heh [04:22] davecheney: for your consideration: https://github.com/juju/juju/blob/master/worker/provisioner/provisioner_task.go#L409 [04:27] natefinch: you'll have to try harder than that if you're trying to shock me [04:28] the MustCompile is a nice touch [04:28] very devil may care [04:28] I just can't even. 
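For readers without the context axw is giving: region defaults now live at the clouds.yaml level rather than in each provider. The following is an illustrative personal clouds.yaml of the kind fed to add-cloud; the cloud name, endpoints and regions are made up and the exact schema may vary by version, but per the description above the first region listed acts as the default unless one is set with "juju set-default-region".

    clouds:
      mystack:
        type: openstack
        auth-types: [userpass]
        endpoint: https://keystone.example.com:5000/v3
        regions:
          east:
            endpoint: https://keystone-east.example.com:5000/v3
          west:
            endpoint: https://keystone-west.example.com:5000/v3

Here "east" is listed first, so "juju show-cloud mystack" should report it as the default region, and "juju set-default-region mystack west" would override that.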
So much wrong in that one little line [04:51] axw_: i'm not sure we should include cloud region in migration data, since model could be stored to a different cloud/region [04:56] wallyworld: I think migration expects to be between controllers in the same region [04:56] wallyworld: the machines/agents/etc. remain where they are, and are redirected to the new controller [04:56] wallyworld: they're not destroyed and recreated [04:57] ok [05:06] anyone having problems with statesuite.TestPing? TearDownTest seems to hang forever [05:07] log is just full of [LOG] 0:55.041 DEBUG juju.mongo connection failed, will retry: dial tcp 127.0.0.1:39150: getsockopt: connection refused [05:07] why does touching anything cause it to berak [05:07] https://bugs.launchpad.net/juju-core/+bug/1590645 [05:07] Bug #1590645: worker/catacomb: test failure under stress testing [05:07] top tip: this was already broken [05:07] ^ that's master [05:08] Bug #1590645 opened: worker/catacomb: test failure under stress testing [05:09] nope, sorry, pebkac [05:09] gah, testping doesn't even DO anything [05:23] menn0: could you take a second look at https://github.com/juju/juju/pull/5578 [05:24] I found a bug in stress testing that I have now fixed [05:24] menn0: https://github.com/juju/juju/pull/5578/commits/36c3e7f8bd9435bf1cccd2480b4f921bcb6345d9 [05:29] davecheney: looking [05:38] davecheney: the core change looks good but we still disagree about test suites [05:38] Bug #1590645 changed: worker/catacomb: test failure under stress testing [05:43] menn0: fine i'll roll back that change [05:43] it's not worth having a fight about [05:43] davecheney: kk [05:43] menn0: done [05:43] removed that commit from the PR [05:43] it's pretty insane that our api client tests take many many minutes to run [05:44] davecheney: ship it already :) [05:45] menn0: with pleasure [05:46] menn0: as you say, it was obvious in retrospect (now we can see the solution) [05:46] we _have_ to wait for both goroutines to finish [05:47] davecheney: yeah totally [05:49] the previous stopped channel connected both goroutines in one direction, but didn't in the other [05:59] welp. Not getting these tests passing tonight. Sorry wallyworld. I keep running into random tests that time out after 10 minutes, which is sorta killing the development cycle here. I'm even using gt to avoid retesting code I know hasn't changed. [05:59] ok, can you push what you have? [06:00] wallyworld: sure [06:00] ty [06:03] wallyworld: made a WIP PR to make it easier to see the diffs: https://github.com/juju/juju/pull/5583 [06:03] ty [06:03] i'll try and look after soccer [06:03] heh: +547 −5,744 [06:03] \o/ [06:04] and most of that plus is really just a file rename or two [06:04] ok, bedtime. [06:08] * redir goes to bed too! === frankban|afk is now known as frankban [07:14] davecheney, would you explain http://reviews.vapour.ws/r/5022/diff/# to me please? I see that the original second goroutine can outlive the catacomb; but I can't see how Kill()ing an already-finished worker triggers session-copy panics [07:14] davecheney, ...oh, dammit, is this presence again? [07:18] Anyone know why when I ask Azure for 8 GB RAM & 50 GB disk, I get 13 GB RAM and 29 GB disk, but it claims to have mem=14336M root-disk=130048M ? [07:20] dimitern, do you know about the DocID and tests for same around the linklayerdevices code? [07:21] fwereade: what about it? 
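The 05:46-05:49 exchange above is the heart of the fix: a one-way "stopped" signal between the two goroutines is not enough, because shutdown has to observe that both of them have finished. A small sketch of the repaired shape follows; it is just the pattern, not the actual catacomb code.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        stopped := make(chan struct{})

        wg.Add(2)
        go func() { // goroutine A: runs the main work, then tells B it has stopped
            defer wg.Done()
            // ... do the work ...
            close(stopped)
        }()
        go func() { // goroutine B: reacts to A stopping, e.g. kills remaining children
            defer wg.Done()
            <-stopped
            // ... clean up ...
        }()

        // The one-way "stopped" signal is still there, but teardown now waits
        // for both goroutines, so neither can outlive the shutdown.
        wg.Wait()
        fmt.Println("both goroutines have exited")
    }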
[07:21] dimitern, why it's exported, and why it includes the model-uuid [07:22] fwereade: it shouldn't be exported [07:22] fwereade: sorry about that [07:22] dimitern, it happens ;) [07:22] fwereade: but why is it surprising that it includes model-uuid as prefix? [07:23] dimitern, fixed that already, mainly just checking something wasn't planning to build on it [07:23] dimitern, because state isn't meant to know that stuff -- the multi-model stuff does it for you [07:24] dimitern, or, it should -- I was wondering if there was something about it that meant it didn't quite fit [07:24] fwereade: well, I wasn't that comfortable with the multi-model stuff then I guess, wanted to test it excplictly [07:25] dimitern, just checking, though: you aren't using those DocIDs as anything other than _ids, right? not e.g. storing them in fields for subsequent lookup? [07:26] fwereade: I'm using mostly global keys [07:27] fwereade: and the DocID IIRC is only used for txn.Op{Id:} and FindId() [07:27] dimitern, (global keys without any model-uuid prefix, right?) [07:27] dimitern, cool, thanks for the orientation [07:28] fwereade: without, except for the parent/child refs [07:28] bugger [07:28] fwereade: let me have a look to remind myself.. [07:29] dimitern, thanks [07:30] Also, 9 minutes to bootstrap that instance in Azure - is that expected? [07:31] blahdeblah: 1.25 or 2.0? [07:32] fwereade: nope, so for the refs the docid is used literally as given, no assumptions on prefix [07:32] 1.25.5 [07:32] axw_: ^ [07:32] Looks like no matter what you ask for in a root disk, you get whatever Microsoft decides you need, which is 31.5 GB raw. [07:32] dimitern, can you point me at the code you're looking at? [07:32] fwereade: linklayerdevices_refs.go [07:33] blahdeblah: it's been a while since I looked at the old provider, will have to go spelunking [07:33] axw_: Is it something where we need to specify instance type instead of mem/disk constraints? [07:33] blahdeblah: re slowness: yes, sadly that's expected [07:33] How does this cloud still exist? :-\ [07:34] dimitern, ok, refs looks safe, it's explicit but doesn't need to be [07:34] blahdeblah: I *think* the root disk size is the same for all instance types, will need to check [07:34] dimitern, what about lldDoc.ParentName? [07:35] fwereade: that can be a global key [07:35] axw_: I suppose I should log a bug saying that there's no indication that the root-disk constraint is not honoured then... [07:35] dimitern, whoa, ParentName lets docids leak out too? [07:36] blahdeblah: I think it may actually be related to the images that Canonical publishes [07:36] fwereade: well, not quite the docid, just the gk [07:36] axw_: That affects the size of sda presented to the OS? Seems unlikely... [07:36] dimitern, still [07:36] blahdeblah: well the images have the size of the root disk in the name ... [07:37] But surely that would simply affect the size of the partition created on the disk, not the disk size itself... [07:37] blahdeblah: maybe, depends on whatever Hyper-V does. I don't know, I'll have a poke around [07:37] Thanks - appreciated. [07:37] fwereade: looking at the code I don't see a good reason to export ParentName() actually.. as there is a ParentDevice() which is more useful outside of state anyway [07:38] dimitern, excellent, I'll drop it if I can [07:38] axw_: I'll have a poke around for relevant bugs [07:38] dimitern, still sl. 
struggling to get my head around what changes I could/should make to get around the internal test failures [07:38] fwereade: please don't just drop it - it will still be needed inside the package for refs checks and some other internal logic, but unexporting it should be fine I think [07:39] dimitern, sorry, that's what I meant [07:39] dimitern, so, to step back for context [07:39] fwereade: ok [07:39] dimitern, I'm trying to extract a smaller, less stateful, type from State [07:39] fwereade: sounds challenging :) [07:40] dimitern, the clean line at the moment seems to be {database, model-uuid} [07:40] dimitern, and I've tacked on only a few methods -- getCollection, runTransaction, and the docID/localID translators [07:40] dimitern, this ofc means that the hacked-up state no longer produces the correct answers [07:41] dimitern, because the implementation detail of *how* we calculate docID has changed [07:41] dimitern, but I am deeply reluctant to just "fix" that *State [07:41] fwereade: I'm afraid I do have a bunch of internal tests for LLDs that verify the docID format :/ [07:42] dimitern, the biggest question actually [07:42] fwereade: I trust the multi-model code better now at least :) [07:42] dimitern, is: can I just drop those internal tests? is there any functionality that isn't covered by external ones? [07:42] dimitern, so many of them are working with an unconfigured state anyway... ;p [07:43] dimitern, first stab at multi-model you did have to care about model-uuid [07:43] fwereade: those tests that check the docID includes model uuid prefix? sure - I think those are unnecessary anyway [07:43] dimitern, really, all the internal tests [07:43] fwereade: but the assumptions on the globalKey format for LLDs is important [07:45] dimitern, why so? they strike me as the purest of implementation details [07:45] fwereade: I aimed for 100% coverage while writing the code, some bits of it are not possible to test externally, but the internal tests gave me confidence that code is exercised [07:45] dimitern, if it's not possible to test it externally, why does it matter? [07:46] dimitern, by definition, surely, that makes it an implementation detail [07:46] fwereade: well, re gk format - container LLDs are intentionally restricted to only having a parent device on the same machine as the container and only of type bridge [07:47] axw_: FWIW, seems like it might be region-dependent. If I bootstrap in debug mode, I get a bunch of error messages about Basic_A[1-4] and Standard_G[1-5] not being available in US East [07:49] dimitern, ok, but aren't those restrictions exercisable via the exported api? doesn't matter what strings we use, it's the restrictions on the live types we export that we need to verify [07:50] fwereade: sorry, I've looked at the internal tests again; ISTM most can be either moved to the non-internal suites or simply dropped [07:54] fwereade: those that could be moved include tests on simple getters or exported, related helper funcs (e.g. IsValidLinkLayerDeviceName) [07:56] dimitern, ok, I'll see what I can do, thanks [07:56] blahdeblah: "Azure supports VHD format, fixed disks." -- https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-about-disks-vhds/ [07:57] fwereade: I believe tests around parent gk format can be tested externally, it was just easier to test them with a 'fakeState' (and not bring up the whole stack with the real one) [07:57] blahdeblah: the old provider should not allow you to specify 50GB in the first place. 
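To make the 07:39-07:40 description concrete: the type being extracted pairs the database with a model UUID and owns the docID/localID translation, so individual state code never handles the prefix itself. A rough sketch follows; the "<model-uuid>:<local-id>" format and the method names are assumptions for illustration, not the exact juju implementation.

    package main

    import (
        "fmt"
        "strings"
    )

    // modelBackend is a hypothetical version of the smaller, less stateful type
    // being split out of *State: just a database handle plus a model UUID.
    type modelBackend struct {
        modelUUID string
        // db Database // elided: getCollection/runTransaction would hang off this
    }

    // docID turns a model-local _id into the fully qualified one stored in mongo.
    func (b modelBackend) docID(localID string) string {
        return b.modelUUID + ":" + localID
    }

    // localID strips the model prefix again; ids from other models are returned
    // unchanged so the mismatch stays visible to the caller.
    func (b modelBackend) localID(docID string) string {
        prefix := b.modelUUID + ":"
        if !strings.HasPrefix(docID, prefix) {
            return docID
        }
        return strings.TrimPrefix(docID, prefix)
    }

    func main() {
        b := modelBackend{modelUUID: "deadbeef-0bad-400d-8000-4b1d0d06f00d"}
        id := b.docID("some#local#key") // made-up local id
        fmt.Println(id, "->", b.localID(id))
    }

With a split like this, collections and transactions go through the backend, and nothing outside it needs to know the prefix format, which is why the internal tests that assert on it add little.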
the new one doesn't [07:59] axw_: When you say "old provider", you mean the one in 1.25.x ? [07:59] dimitern, as in, we can externally test that we can't create bad LLDs? [07:59] blahdeblah: yep [07:59] axw_: So if I need more than 31.5 GB root disk, what can I do? [07:59] blahdeblah: it does not exist in 2.0. Azure completely changed their APIs [07:59] dimitern, is there something specific about the gks that I'm missing? [08:00] blahdeblah: so with the old world, I think you could do it with some hoop jumping: take the existing VHD, resize it and upload to your private storage. then you'd need custom image metadata [08:00] blahdeblah: not sure why the images have such a small size in the first place tho really [08:01] blahdeblah: possibly because with "premium storage", you pay by the disk size, rather than usage [08:01] axw_: When you say "old world", you mean "The only production-ready version currently supported"? :-) [08:01] blahdeblah: yes. [08:01] fwereade: well, there's one thing... [08:01] axw_: So can we specify instance type or something like that to get around this? [08:01] Or are we just stuck with 31.5 GB? [08:02] blahdeblah: nope, it's down to the images I'm afraid. all instance types have the same max limit for OS disk sizes [08:02] fwereade: to test e.g. you can't add a parent device to a container device without using the parent's globalkey as LLDArgs.ParentName [08:02] axw_: OK - thanks for your time. [08:03] fwereade: you have to use the gk, which is then verified to exist on the same host machine as the container and it's a bridge [08:03] dimitern, don't we have some other way of identifying the parent? [08:04] fwereade: well, we only need its name and its host machine, which is conveniently part of the LLD's gk [08:05] fwereade: and in other cases (e.g. adding a new child to existing parent device on the same machine) we just use the plain parent device name in LLDArgs.ParentName [08:06] dimitern, sorry, when is a new child *not* added to an existing parent on the same machine? [08:06] fwereade: but hey! fortunately, that logic with LLDArgs.ParentName being a gk is only really used in one place usable externally [08:07] fwereade: when it's on a different machine (e.g. in a container, while the parent is on the host machine of that same container) [08:08] fwereade: SetContainerLinkLayerDevices is the only place currently where we use a gk-based ParentName for adding devices [08:12] fwereade: and if LLD.ParentName() is unexported, and ParentDevice() used instead (as it is IIRC), we won't leak gks outside of state [08:13] dimitern, I remain a bit suspicious of the still-extant ability to specify gk-ParentName from outside [08:16] dimitern, could we, e.g., have always-simple-ParentDeviceName, and ParentMachineID, in LLDArgs? [08:16] fwereade: that can be rectified, assuming SetLinkLayerDevices() rejects ParentName set to a gk, and SetContainerLLDs can bypass that check internally [08:17] fwereade: that's perhaps better - future-proof and more explicit [08:18] dimitern, cheers [08:19] fwereade: btw, now that you've had a chance to look at the approach I used for setting LLDs and LLDAddresses in bulk [08:19] fwereade: how do you like it? [08:20] dimitern, only just looking at that side of it now... [08:22] dimitern, just reading setDevicesAddresses... doesn't the providerID stuff have that index gotcha? 
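Sketching the 08:16 suggestion: a plain ParentDeviceName plus an explicit ParentMachineID, instead of a ParentName that is sometimes a device name and sometimes a state-internal global key. The struct and the validation helper below are hypothetical, not juju's real state.LinkLayerDeviceArgs; the rule encoded is the one described above, that a container device's parent must be a bridge on the container's host machine.

    package sketch

    import "errors"

    // LinkLayerDeviceArgs is a hypothetical variant of the args struct.
    type LinkLayerDeviceArgs struct {
        Name string
        Type string // e.g. "ethernet", "bridge", "vlan"

        // Instead of an overloaded ParentName:
        ParentDeviceName string // always a plain device name, e.g. "br-eth0"
        ParentMachineID  string // set only when the parent lives on another machine (the container case)
    }

    // validateContainerParent captures the restriction discussed above: when a
    // container device names a parent on another machine, that machine must be
    // the container's host, and the parent must be a bridge. parentType is what
    // a lookup of the parent device would report.
    func validateContainerParent(hostMachineID, parentType string, args LinkLayerDeviceArgs) error {
        if args.ParentMachineID == "" {
            return nil // parent (if any) is on the same machine; nothing extra to check
        }
        if args.ParentMachineID != hostMachineID {
            return errors.New("parent device must be on the container's host machine")
        }
        if parentType != "bridge" {
            return errors.New("parent device on the host machine must be a bridge")
        }
        return nil
    }

This leaves SetLinkLayerDevices free to reject global keys outright while SetContainerLinkLayerDevices passes the host machine explicitly, which is the "future-proof and more explicit" outcome dimitern settles on.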
I see some checking but it's not immediately clear that it's enough [08:22] fwereade: it was a noble attempt at first :) but it got complicated as parent/child refs were added.. and my inability to construct a single set of []txn.Op that can insert and verify new parents and new children of those in the same set of ops [08:23] fwereade: now providerIDsC is used for subnets, spaces, LLDs, and LLDAs, and the indexing issue is handled there [08:23] dimitern, yeah, the "you don't need to assert what you're doing in this txn" thing honestly only makes life harder [08:24] dimitern, ok... but the providerID memory-checks aren't guaranteed to run at all here, are they? and I don't think they're asserted at apply-time either [08:24] fwereade: yeah txn.DocMissing being the only possible option for inserts can really force you to re-think.. and I do understand why it's the only option [08:26] * dimitern takes a look [08:27] dimitern, no, wait, setDevicesAddressesFromDocsOps seems to cover that [08:27] fwereade: so the ProviderID on LLDAddrDoc is just for convenience, it's not enforcing integrity like it used to; the asserts on PrIDsC do [08:29] dimitern, sorry, I seem to be slow thing morning [08:30] dimitern, we have some logic around :592 that checks for dupe provider ids -- what does it do that sDAFDO doesn't? [08:31] fwereade: it validates whether ErrAborted occured due to the assert added in sDAFDO [08:31] dimitern, won't reinvoking sDAFDO catch those anyway? [08:33] Bug #1590671 opened: Azure provider does not honour root-disk constraint [08:33] fwereade: well, it looks like sDAFDO doesn't validate, just asserts docMissing [08:34] fwereade: sorry, I need to talk to some contractors that just arrived - back in ~15m [08:34] dimitern, np [08:35] jam: happy birthday [08:36] dimitern, should networkEntityGlobalKeyOp theoretically be doing that check? if it's just a FindId().Count() I'd not too worried, even if we do ask it a bunch of times... [08:41] fwereade: the problem is the goroutine does, w.Stop() and then returns, marking the waitgroup as done [08:41] Stop doens't actaully stop, it just asks to stop so the worker can still be in the process of shutting down when the mongo connection is torn down [08:59] thanks voidspace [09:02] davecheney, right; but that's why the wg.Done happened after the Wait in the first goroutine -- how does the wg complete with a running worker? the only race I see is the possibility of a late Kill in the second goroutine [09:04] davecheney, and I accept *that's* wrong, because it's too trusting of Kill not to do anything untoward [09:05] davecheney, but if I understand correctly, you're saying that workers live too long, not that late Kills of already-dead workers are the problem? [09:06] davecheney, and I don't see how that was happening, because we always Wait()ed for the worker before we call wg.Done [09:08] davecheney, (I am assuming you're talking about Kill, which doesn't wait, rather than Stop, which does wait but wasn't used in the original) [09:16] fwereade: I suppose so [09:16] fwereade: but since networkEntityGlobalKeyOp is used for subnets, and spaces as well, it should be done carefully [09:17] dimitern, ack [09:17] dimitern, (also, I'm pretty sure machineProxy is evil vs extracting ops-composers and using them in both places) [09:20] fwereade: yeah, machineProxy was added mostly for convenience, is it evil because it assumes the LLD and the machine are always on the same session? 
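On the providerIDsC point: with mgo/txn, the only assert you can pair with an Insert is txn.DocMissing, so uniqueness of provider IDs across subnets, spaces, devices and addresses falls out of inserting a reservation document keyed by the provider ID and letting the transaction abort on a duplicate. A minimal sketch follows; the collection name and id format are assumptions, not the exact juju schema.

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/txn"
    )

    type providerIDDoc struct {
        ID string `bson:"_id"`
    }

    // insertProviderIDOps returns the op that reserves a provider ID: if some
    // other entity already claimed it, the DocMissing assert fails and the
    // whole transaction aborts (the ErrAborted case being validated above).
    func insertProviderIDOps(docID string) []txn.Op {
        return []txn.Op{{
            C:      "providerIDs",  // shared by subnets, spaces, devices and addresses
            Id:     docID,          // e.g. "<model-uuid>:linklayerdevice:<provider-id>" (assumed format)
            Assert: txn.DocMissing, // the only assert allowed alongside Insert
            Insert: providerIDDoc{ID: docID},
        }}
    }

    func main() {
        ops := insertProviderIDOps("deadbeef:linklayerdevice:prov-123")
        fmt.Println(len(ops), "op(s); the insert is asserted with txn.DocMissing")
    }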
[09:21] dimitern, more that it's pretending to be a valid machine and it's really not [09:22] dimitern, it's trusting to luck that an empty machine doc won't trigger a bad-memory-state failure [09:23] fwereade: fair point [09:23] dimitern, the garbage data is *kinda* ok, in that the asserts *should* trap any problems, but you've still got an invalid *Machine lying around and it makes me nervous ;p [09:24] Bug #1590689 opened: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:24] dimitern, anyway, I think I have to consider most of this a derail, I'm adding a bug [09:25] fwereade: FWIW it's never lying around - only used to call LLD(name) on it in 2 places [09:25] fwereade: but still [09:26] dimitern, yeah, it's not how it's used right that bothers me so much as how someone will one day use it wrong ;) [09:26] (considering all of that was designed and implemented in a week..) [09:27] fwereade: indeed; "don't be clever, be obvious" [09:27] :) [09:27] dimitern, I apologise for the length of this letter, I did not have time to make it shorter ;) [09:28] fwereade: which one? [09:29] dimitern, it's a quote from someone-or-other that seemed tangentially relevant [09:30] dimitern, "shorter" *sort of* maps to "more obvious", so long as it's the right sort of shortness [09:30] fwereade: ah :) I was thinking of "clean code" [09:31] dimitern, that is rather more directly relevant than pascal, indeed ;) [09:33] Bug #1590689 changed: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:36] Someone needs to come up with a way of hedging Pascal's wager against Roko's Basilisk. [09:36] babbageclunk, ssh! [09:37] ;p [09:37] Sorry you guys. [09:42] Bug #1590689 opened: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:42] Bug #1590699 opened: LinkLayerDeviceArgs exposes globalKeys outside state [09:46] hey dimitern, got a moment for more stupid questions? [09:46] babbageclunk: always :) [09:48] dimitern: So I've added my vlans to the eni on the maas controller and restarted. The routes *seem* ok, but I still can't ping any of the vlan addresses on the hosts. [09:48] babbageclunk: what have you tried pinging? [09:49] dimitern: 192.168.10.2 [09:49] babbageclunk: that's a node's address and you pinged from maas? [09:50] dimitern: yup [09:51] dimitern: on one node I can see that in ip route, and on the maas node I still get destination host unreachable trying to ping that address. [09:51] babbageclunk: do you have ip forwarding enabled on maas? [09:51] dimitern: Do I need to set the vlan up in the VMM virtual network config? [09:52] dimitern: Where do I check for ip forwarding? [09:52] babbageclunk: sudo sysctl -a | grep ip_forward [09:52] dimitern: oh, ok - nothing in the ui? [09:53] dimitern: net.ipv4.ip_forward = 1 [09:53] babbageclunk: the kvm node has 1 or more NICs, each of which connected to a libvirt bridge; the bridge is layer-2, so it will pass both tagged and untagged traffic [09:53] dimitern: ok] [09:54] babbageclunk: on the node page in maas though, you need to have a physical NIC on the untagged vlan and 1 or more vlan NICs [09:55] dimitern: for the rack controller? [09:55] dimitern: yes, I've got those. [09:55] babbageclunk: can you paste e/n/i from your rack controller? [09:56] babbageclunk: where are the nodes and rack ctrl located? 
[09:57] babbageclunk: if the kvm nodes are on the rack ctrl machine, you need bridges there as well; if the rack ctrl is itself a kvm and it's sitting on your machine, along with all the nodes [09:57] dimitern: Ooh, looking back at your eni example I think I see the problem [09:57] dimitern: (well, a problem) [09:57] dimitern: http://pastebin.ubuntu.com/17139880/ [09:58] babbageclunk: ...then you need to enable forwarding on your machine as well, and the bridges will be on your machine [09:58] dimitern: I've left off the /24 part of the address [09:58] dimitern: The rack controller and the nodes are all kvms on my machine/ [09:58] babbageclunk: yeah! :) it's either e.g. /24 or netmask is needed [09:59] babbageclunk: omitting both means /32 IIRC [09:59] dimitern: ok, fixing and bouncing [10:00] dimitern: oops, meeting [10:00] dimitern: thanks again! I owe you about an infinitude of beers by this poing. [10:00] gah, point [10:01] babbageclunk: heh, keep 'em coming :D [10:03] anastasiamac: meeting? [10:12] anastasiamac: can you try again pushing? [10:15] dimitern: fatal: unable to access 'https://github.com/juju/juju/': The requested URL returned error: 403 [10:15] dimitern: ;( [10:16] anastasiamac: so you shouldn't be able to push to juju/juju - so why is it trying? [10:16] dimitern: ofco - i have not created a branch \o/ [10:17] thank you!! like i said - one of these days :D [10:17] anastasiamac: :) [10:57] dimitern, that didn't help. :( [10:58] babbageclunk: adding the /24 ? [10:58] dimitern: yup [10:59] dimitern: What were you saying about bridging above? If the controller and the nodes are all sibling kvms do I need bridges in the controller eni? [11:00] dimitern: There are bridges in the node ENIs, but I don't know whether they're needed or working. [11:00] babbageclunk: juju created those [11:00] dimitern: Ah, right. [11:01] babbageclunk: but yeah, the libvirt bridges where both maas and the nodes connect to are on your machine I guess then [11:01] babbageclunk: how are those bridges (networks) configured? [11:02] babbageclunk: in vmm [11:02] babbageclunk: hmm also - is there anything in /e/n/interfaces.d/ on the rack machine? [11:04] dimitern: my host eni is basically empty - presumably handled by something else? [11:04] dimitern: no, nothing in the rack controller interfaces.d [11:04] babbageclunk: yeah, usually network manager handles that [11:07] babbageclunk: re rack's e/n/i.d/ - ok; please paste the output of `virsh net-list` and `virsh net-dumpxml ` for each of those? [11:09] (or only those relevant to that maas rack) [11:11] dimitern: http://pastebin.ubuntu.com/17141198/ [11:13] babbageclunk: ok, so your maas2 network on ens3 on the rack, but what's 10.10.0.0/24 then? [11:13] s/network on/network is on/ [11:14] babbageclunk: that looks like the issue - eth1.10 on the rack [11:14] babbageclunk: is this the pxe subnet for the nodes? [11:18] dimitern: That's a holdover from a previous space experiment (I think it was for reproducing a bug maybe?) - should I just get rid of it? [11:18] dimitern: Don't understand your last question. [11:18] dimitern: Is what the pxe subnet? [11:18] Bug #1590732 opened: unnecessary internal state tests for networking [11:19] babbageclunk: the pxe subnet is the one the nodes boot from, it cannot be on a tagged vlan; paste the output of `maas interfaces read `, for the node you where unable to ping? 
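For reference, the /etc/network/interfaces detail that bit here: a static stanza needs either CIDR notation on the address or a separate netmask line (the 09:59 comment notes that omitting both means /32). A made-up example stanza for a tagged VLAN interface; names and addresses are illustrative only.

    auto eth1.10
    iface eth1.10 inet static
        # either CIDR notation on the address ...
        address 192.168.10.1/24
        # ... or a plain address with an explicit netmask:
        # address 192.168.10.1
        # netmask 255.255.255.0
        # requires the vlan package; often implied by the eth1.10 name anyway
        vlan-raw-device eth1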
[11:20] babbageclunk: it the node interfaces table in the ui the pxe interface is marked with a selected radio button [11:20] s/it/in/ [11:20] dimitern: hang on, I rebooted the controller after the eni change. [11:27] dimitern: http://pastebin.ubuntu.com/17141467/ [11:27] dimitern: pxe's always on the physical interface. [11:28] babbageclunk: ok, thanks, I think I see a few things that might be the cause [11:28] dimitern: Or do you mean there can't be any tagged vlans hanging off the pxe interface? [11:30] babbageclunk: I think the gateway_ip is wrong for the 192.168.150.0/24 subnet is wrong [11:32] dimitern: Really? Wouldn't that have stopped them from getting to the internet? [11:32] dimitern: The 150 subnet was set up automatically at MAAS install time. [11:33] babbageclunk: I believe the maas rack's ip should be the gateway for the nodes (i.e. gateway ip on the 192.168.150.0/24 subnet in maas), as the .1 address is the libvirt bridge [11:33] babbageclunk: the rack itself needs to use the .1 ip on ens3 as gateway [11:35] babbageclunk: i.e. try updating the 192.168.150.0/24 subnet to use 192.168.150.2 as gateway ip, and then deploy a node directly from maas (no juju) and see how it comes up and whether you can ping its IP from maas and vs [11:35] vv. [11:35] dimitern: ok. (vv?) [11:37] babbageclunk: sorry; vice verse (?) [11:37] dimitern: Ah, right. [11:37] I was sure it's written as versa but erc syntax check underlines it.. [11:38] dimitern: I can't change the subnet while things are using it, I'll have to kill my juju nodes. [11:38] dimitern: Oh, maybe that's just a limitation in the ui? [11:39] babbageclunk: ah, I suppose it's like that for safety [11:39] Do you think I should blow away my maas install and start again entirely clean? [11:40] babbageclunk: no need to go that far I think - just release the nodes, update the subnet's gw and try deploying I guess should fix it, hopefully [11:43] dimitern: ok, doing that now. [11:46] babbageclunk: I need to go out btw.. but should be back in 1h or so [11:47] dimitern: ok - I'm going to go for a run soon too. Thanks. [11:47] babbageclunk: np, good luck :) and I'll ping you when I'm back [11:52] dimitern: :) [12:13] dimitern: you're out, but yay! I deployed a machine and can ping it from the controller on the vlan IP and vice versa. [12:13] dimitern: rerunning my bootstrap now and going for a run. === akhavr1 is now known as akhavr [13:47] jam: you around? [14:01] anyone wanna review this, which deprecates some fields that wanted to go a long time ago in charm.Meta? https://github.com/juju/charm/pull/212 [14:10] rogpeppe1: LGTM with a couple doc tweaks [14:10] natefinch: thanks [14:22] babbageclunk: nice! [14:55] Bug #1590821 opened: juju deploy output is misleading [15:10] dimitern: Hmm - now juju run is hanging on me. [15:12] dimitern: huh - correction, juju run --unit hangs. juju run --machine is working fine. [15:17] babbageclunk: try with --debug ? [15:20] dimitern: http://pastebin.ubuntu.com/17145175/ [15:21] dimitern: Ooh, I saw something in the juju run docs about show-action-status [15:21] babbageclunk: try `juju run --unit haproxy/0 -- 'echo hi' [15:22] dimitern: that shows the unit-level jobs I've been running as pending, while the machine ones are completed. 
[15:22] babbageclunk: the extra args are confusing 'juju run' I think [15:22] dimitern: no difference [15:23] babbageclunk: ok, so if the unit agent hasn't set the status to started and it's still pending, something didn't work [15:23] babbageclunk: try looking for issues in the unit-haproxy-0.log ? [15:23] dimitern: I did have some work at the start! I can see them completed in the show-action-status output. [15:23] dimitern: Ok, checking the log [15:24] babbageclunk: since juju run is actually handled by the uniter, it needs to run in order to do anything [15:25] dimitern: At the end of the log I see a lot of "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped [15:26] babbageclunk: hmm [15:26] babbageclunk: this sounds bad, but it might be orthogonal.. any other errors? [15:28] dimitern: no, nothing that sounds exciting. Blocks of http://pastebin.ubuntu.com/17145415/ [15:29] dimitern: periodically [15:29] babbageclunk: ok, how about the output of juju show-machine? [15:31] dimitern: http://pastebin.ubuntu.com/17145473/ [15:31] babbageclunk: hmm [15:32] babbageclunk: well, the only thing left to check is `juju status --format=yaml` I guess [15:32] dimitern: I'll look in the controller logs too. [15:33] babbageclunk: yeah, it might help if e.g. the spaces discovery wasn't completed by the time the unit started to get deployed.. [15:34] dimitern: nothing interesting in juju status --format=yaml [15:34] babbageclunk: wait a sec... why machine 0's address is 192.168.10.2 ? I'd have expected to see 192.168.150.x ? [15:35] babbageclunk: wasn't the .10. subnet on a tagged vlan 10 ? [15:36] babbageclunk: or maybe it still is, but just the .10. address sorted before the .150. one and was picked as "preferred private address".. [15:36] dimitern: yeah, it is on the vlan [15:36] dimitern: I had to run sshuttle for juju ssh to work. [15:37] dimitern: .10 is the internal vlan [15:38] babbageclunk: do you mind a quick HO + screen share - it will be quicker to diagnose [15:38] ok [15:38] dimitern: juju-sapphire? [15:38] babbageclunk: joining yesterday's standup [15:38] yeah [15:39] babbageclunk: you appear frozen.. [15:40] dimitern: yeah, my whole machine hung [15:40] ooh :/ [15:42] babbageclunk: same thing.. you might be having the same issues as I had before downgrading the nvidia driver from proposed to stable [15:43] dimitern: yay, back again! [15:43] Ok, maybe not screen sharing - what about tmate? [15:44] I haven't used it [16:12] dooferlad: ping [16:19] frobware: two minutes [16:20] dooferlad: can you jump into sapphire HO when ready - thx [16:36] dooferlad: (reverse-i-search)`check': git checkout f0b4d55bd98e5d1a9089399dc7ecee2c75ecc6a8 add-juju-bridge.py [17:02] ahh state tests.... my old nemesis [17:06] heh, I had time for lunch with dessert and possibly a coffee and the suite is still running [17:09] man, my tests do not like to run in parallel... apiserver and state tests both barf when I run all tests, but run fine if I run them by themselves. === frankban is now known as frankban|afk [17:53] I wish we had a "please test my branch on CI because I don't trust the tests running on my own machine" button [17:56] natefinch: I use the other machine [18:00] hey cmars - were you able to access that windows system from sinzui for bug 1581157? 
[18:00] Bug #1581157: github.com/juju/juju/cmd/jujud test timeout on windows [18:00] cherylj, haven't tried it yet [18:06] cherylj, ok, i can rdp into it [18:06] cherylj, how would i reproduce the hang there? not familiar with the CI setup [18:07] cmars: what I've done in the past is looked at what the GOPATH is [18:07] to see where the src might be [18:07] is GOPATH set? [18:07] cherylj, it is, but there's no $env:GOPATH/src [18:08] cmars: okay, then you'll need to scp a tar of the src [18:08] I've used the ones generated by CI, one sec [18:08] cherylj, hmm.. i could just use go get to grab the source [18:09] cmars, it's easier to get the tarball :) [18:09] in my experience anyway [18:09] http://data.vapour.ws/juju-ci/products/version-4043/build-revision/build-4043/juju-core_2.0-beta9.tar.gz [18:09] get that then scp it to the windows machine [18:10] I wish juju status just had a -v to alias --format=yaml [18:10] cmars: I had to use a path like this in the scp: $ scp file_windows.go Administrator@:/cygdrive/c/Users/Administrator [18:12] ha: https://bugs.launchpad.net/juju-core/+bug/1575310 [18:13] Bug #1575310: Add "juju status --verbose". [18:15] Anyone have an inordinate amount of free time? I need a review: http://reviews.vapour.ws/r/5027/ Files changed: 89 +547 -5744 [18:17] ericsnow, ^^^ [18:18] natefinch: looking [18:18] it's all pretty mechanical, honestly [18:28] cherylj, ok. i don't know how to untar on windows, but go get worked fine. i've got a loop of `go test ./cmd/jujud/agent` running in a loop, 30 times, on that machine. it's tee'ing output to a log file, will check it after lunch [18:29] it's running against latest master [18:29] cmars: yeah, you have to use that '7z' utility that sinzui mentioned in the email [18:34] Bug #1590909 opened: github.com/juju/juju/state: undefined: cloudSettingsC [18:45] hmm.... I don't think juju deploy mysql --to lxd is working [18:45] ericsnow: is lxd as a container type supposed to work on trusty machnies? [18:46] natefinch: I'd expect it to (given LXD is set up correctly) [18:46] natefinch: and Go is 1.3+ [18:47] ericsnow: I mean, like, deploy mysql --to lxd in AWS, ... should spin up a new machine and deploy a lxd container to it [18:47] and deploy mysql to that container [18:47] natefinch: not sure [18:47] natefinch: it should work, but only if the container is also trusty [18:47] erm, maybe [18:48] ericsnow: so far, looks like no... juju add-machine lxd works [18:48] that's how it was with lxc anyway [18:48] n/m me [18:48] heh [18:48] lemme try something that uses xeniel [18:48] xenial [18:50] ...that's a no. [18:50] it never creates the base machine [18:50] damn [18:50] brb [19:02] sigh nil vs non-nil mismatch; obtained ; expected [19:06] natefinch: fix-it-then-ship-it === ericsnow is now known as ericsnow-afk === redir is now known as redir_afk [19:19] sinzui: do we test deploying to containers in CI? [19:20] mgz: ^ [19:20] natefinch: all the time [19:21] natefinch: lxd network still fails [19:21] sinzui: ahh... umm.. shouldn't that be like a blocking bug or something? [19:22] natefinch: we bring it up several times a week. We are told it isn't as hot as the other bugs...but I am sure when lxc is removed. it will be hot [19:22] sinzui: lol, yeah... I'm hesitant to remove lxc if we have no replacement (other than kvm) [19:23] natefinch: kvm isn't a replacement because public clouds don't support it [19:24] sinzui: ahh, well doubly so then [19:29] natefinch: We have bundles that deploy to lxd. the workloads work. lxd mostly works. 
its networking is broken though. juju cannot ssh into it as it can with lxc. [19:29] sinzui: ahh, ok... my current branch doesn't work, so that's interesting. I'll retry with master to make sure I know what it expect [19:30] sinzui: by doesn't work, I mean that if you deploy, no base machine is ever created, so the fcontainer is never created, so the service is never deployed. But it sounds like that's probably my own fault on my branch [19:32] natefinch: specific things are still an issue with lxd, but the common stuff works. you may either have borked something in your branch or have hit a specific sequence of steps that don't [19:32] most of lxd tests just throw a bundle at juju, only one or two use add-machine/--to [19:32] mgz: since I changed 89 files around container stuff, it's probably my fault. I'll double check [19:54] hmmm ERROR failed to bootstrap model: cannot start bootstrap instance: cannot run instances: Request limit exceeded. (RequestLimitExceeded) [19:55] wonder if my previous aws deployment was retrying the machine deployment too much [19:57] from previous controller: machine-0: 2016-06-09 18:05:00 WARNING juju.apiserver.provisioner provisioninginfo.go:526 encountered cannot read index data, cannot read data for source "default cloud images" at URL https://streams.canonical.com/juju/images/releases/streams/v1/index.sjson: openpgp: signatur [19:57] e made by unknown entity while getting published images metadata from default cloud images [19:58] unit mysql/0 cannot get assigned machine: unit "mysql/0" is not assigned to a machine [19:59] that is uh, not the most useful error stack [20:00] reconfiguring logging from "=DEBUG" to "=WARNING;unit=DEBUG" ... is this our default logging level now? [20:00] bbl [20:00] that seems... odd === thumper is now known as thumper-afk [20:29] Bug #1590947 opened: TestCertificateUpdateWorkerUpdatesCertificate failures on windows [20:34] natefinch: afaik, it's always been at that logging level [20:34] cherylj: weird... maybe I've had a custom logging level set in my environments.yaml for so long that I didn't realize. It seems like a crazy log level [20:35] yes [20:35] I agree [20:35] cherylj: I mean... not showing info drops a lot of useful context on the floor... and unit=debug? What? I'll file a bug. [20:37] does unit even work? would't it need to be juju.unit? [20:41] mgz, sinzui: FWIW, juju deploy ubuntu --to lxd does not create base machines using master, at least for me (just tried on GCE since AWS was mad at me) [20:41] * natefinch files another bug [20:42] natefinch: mad at all of us. none of us could launch an instance in us-east-1 for the last two hours [20:44] sinzui: ahh, ok, I thought it was something my code had triggered accidentally [20:50] Bug #1590958 opened: Juju's default logging level is bizarre [20:57] btw, looks like add-machine lxd works, and I can then deploy --to 0/lxd/0 ... it's just the straight deploy foo --to lxd that isn't working [20:58] natefinch: yeah, that sounds possible, we may just not do that in functional tests [20:59] mgz: yep [20:59] gotta run, will bbl === natefinch is now known as natefinch-afk [21:11] Bug #1590960 opened: juju deploy --to lxd does not create base machine [21:11] Bug #1590961 opened: Need consistent resolvable name for units === ericsnow-afk is now known as ericsnow [21:58] katco: looks like you have a plan for capturing CLI interactions? [21:58] wallyworld: CLI interactions? 
no, API requests [21:58] ah sorry, yeah [21:58] that's what i meant [21:59] wallyworld: it looks like the RPC stuff already has a concept of listening to what goes on, but unfortunately it's constrained to a single type [21:59] wallyworld: so i'm expanding on that a little [21:59] katco: could you outline your idea and email the tech board just to ensure they are in the loop [22:00] wallyworld: ...seriously? i don't think this is a radical change... [22:00] ok. i was just a little cautious messing with the rpc stuff but i guess it's ok [22:01] wallyworld: i can email them. i don't think it's a breaking change. it's probably easier to just point them at a diff though === thumper-afk is now known as thumper [22:02] katco: so the plan is to use export the RequestNotifier and use that? [22:03] wallyworld: no, the plan is to support multiple observers, one of which will remain the RequestNotifier [22:03] wallyworld: the other one will be the audit observer [22:04] ok, and you need to export RequestNotifier so you can manage it as an observer [22:04] wallyworld: well it's in an internal package, so it's scope isn't any different [22:04] wallyworld: but i wanted to encapsulate all the observers so they're not cluttering apiserver [22:05] ok, thanks for clarifying, just trying to get up to speed [22:06] Bug #1590909 changed: github.com/juju/juju/state: undefined: cloudSettingsC [22:06] wallyworld: np, lmk if you have any more qs [22:07] will do ty [22:07] gave your pr +1 [22:09] wallyworld: ta === redir_afk is now known as redir [22:26] rick_h_: you around? [22:28] wallyworld, rick_h_is out today [22:28] ah, ok [22:28] wallyworld, send him mail [22:28] ywp, will do [22:28] he has been responding occasionally [22:29] wallyworld: for status I am using ModelName() and ControllerName() to get the local names, but how can I get cloud? [22:29] thumper: get ControllerDetails() [22:29] that should have cloud in it IIRC [22:29] k [22:30] yep, and it has region [22:30] gah [22:30] JujuConnSuite is a PITA [22:30] where are the tools it can use defined? [22:30] jeez, you just worked that out [22:30] I need a defined version number [22:30] a known, static version number [22:30] thumper: there's UploadFakeTools helpers maybe [22:31] i think that's what stuff uses [22:31] ugh... [22:31] too much hastle [22:31] it's ugly [22:31] just let me choose a known version [22:31] do you recall where they are? [22:31] patch current.Version [22:31] i can look [22:34] great, my ISP cannot properly route me to argentinian ubuntu mirrors but can do to us ones [23:33] yay the upgrade to xenial sort of worked [23:34] "sort of" :) [23:34] I had to finish it the old way [23:35] \o/ there is wisdom in old ways :)
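To flesh out the observer plan described at 21:58-22:05: the RPC layer reports events to a single observer, and one implementation simply fans out to many, so the existing RequestNotifier and a new audit recorder can both listen without the apiserver knowing about either. A hypothetical sketch follows; the interface, method names and package are made up, not juju's actual API.

    package observer

    import "time"

    // RequestHeader is a stand-in for whatever identifies an API request.
    type RequestHeader struct {
        RequestID uint64
        Facade    string
        Method    string
    }

    // Observer sees apiserver request/response traffic. RequestNotifier and an
    // audit observer would both implement it.
    type Observer interface {
        ServerRequest(hdr RequestHeader, body interface{})
        ServerReply(hdr RequestHeader, body interface{}, took time.Duration)
    }

    // Multiplexer fans one stream of events out to any number of observers.
    type Multiplexer struct {
        observers []Observer
    }

    func NewMultiplexer(observers ...Observer) *Multiplexer {
        return &Multiplexer{observers: observers}
    }

    func (m *Multiplexer) ServerRequest(hdr RequestHeader, body interface{}) {
        for _, o := range m.observers {
            o.ServerRequest(hdr, body)
        }
    }

    func (m *Multiplexer) ServerReply(hdr RequestHeader, body interface{}, took time.Duration) {
        for _, o := range m.observers {
            o.ServerReply(hdr, body, took)
        }
    }

Since the Multiplexer itself satisfies Observer, the RPC code keeps dealing with exactly one observer, which fits the "not a breaking change" framing above.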