[00:10] Bug #1590045 changed: Uniter could not recover from failed juju run [01:09] thumper: menn0 fwereade https://github.com/juju/juju/pull/5578 === natefinch-afk is now known as natefinch [01:13] *decrement [01:15] "The coordination via stopped is not reliably observable, and hence not tested" [01:16] (and didn't work :) [01:16] fixed mis spelling [01:18] davecheney: well, it looks reasonable to me [01:18] davecheney: do all the tests now pass? [01:19] thumper: the cmd/jujud/agent test that was previously failing because the catacomb had not stopped the workers it was in charge of by the time the test is shut down [01:20] I'd really like to see a failing test in the catacomb package that this changes fixes. relying on a test far away to assert the correctness of the change is less than ideal. [01:21] natefinch: sure, https://github.com/juju/juju/pull/5564#issuecomment-224468739 [01:21] here is the failing test [01:22] you can see in the panic output that the firewalls and pinger are still running and were started from the catacomb [01:22] davecheney: I think what natefinch means is an explicit test on catacomb [01:22] there is an explicit test [01:22] small test [01:22] if the worker has not shut down, the catacomb tests will not pass [01:22] there is an explicit test for catacomb.Wait() [01:23] and did that one intermittently fail? [01:23] 'cos the catacomb tests before weren't testing shit [01:23] now the are [01:23] I fixed the code to match the test [01:23] * thumper sighs [01:23] but there wasn't a failing catacomb test [01:24] nope, see the comment in the PR [01:24] this wasn't tested [01:24] now itis [01:24] now it is [01:24] not explicitly just on catacomb without the agent bollocks [01:25] go func() { [01:25] defer catacomb.tomb.Done() [01:25] defer catacomb.wg.Wait() [01:25] catacomb.Kill(plan.Work()) [01:25] }() [01:26] ^ the catacomb is not dead until wg.Wait() drops to zero [01:26] wg.Wait cannot drop to zero until all the workers have exited [01:39] davecheney: http://reports.vapour.ws/releases/4039/job/run-unit-tests-trusty-amd64-go1-6/attempt/637 [01:39] pprof thingy [01:40] davecheney: I think that is related to what you added, yes? [01:40] I have NFI what is wrong though [01:40] just saw it fly past [01:43] thumper: yes that is the pprof facility we added a while back [01:43] i think axw had the last hack at fixing a related bug [01:43] I need to chat at some stage, perhaps next week, at how to hook into that :) [01:45] thumper: its on the wiki [01:46] what? docs? who reads these days :) [01:46] thumper: https://github.com/juju/juju/wiki/pprof-facility [01:46] ta [01:49] thumper: are we good ? [01:49] is it possible to hook into the listener to add additional output paths? [01:50] i.e. extra stats? [01:50] nope, sorry that's all the runtime exposes [01:50] hmm... ok [01:50] if you want to do something more than the bits we get for free from the runtiem [01:51] that's more than a non trivial amount of work [02:28] axw_: whenever you have time, this stores cloud region against model, and cloud name in controller doc http://reviews.vapour.ws/r/5023/ next one will use separate controllerconfig [02:38] thumper: we really should address those various TODO schema change items in the code base but we probs will run out of time :-( [02:42] * perrito666 runs the whole suite and fires netflix [02:53] yes :( [02:59] axw_: GCE region fix: http://reviews.vapour.ws/r/5024/ [03:02] yay, someone fixed my bug! 
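The exchange above turns on the ordering in that pasted snippet: because defers run last-in-first-out, wg.Wait() completes before tomb.Done() is called, so the catacomb cannot report itself dead until every adopted worker has actually exited, not merely been asked to stop. Below is a minimal, self-contained sketch of that invariant; it is not the real worker/catacomb code, and fakeWorker is a hypothetical stand-in for a juju worker.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // fakeWorker is a hypothetical stand-in for a juju worker: Kill asks it to
    // stop (and returns immediately), Wait blocks until it has really finished.
    type fakeWorker struct {
        name string
        done chan struct{}
    }

    func (w *fakeWorker) Kill() {
        go func() {
            time.Sleep(50 * time.Millisecond) // pretend shutdown takes a while
            close(w.done)
        }()
    }

    func (w *fakeWorker) Wait() { <-w.done }

    func main() {
        var wg sync.WaitGroup
        workers := []*fakeWorker{
            {name: "firewaller", done: make(chan struct{})},
            {name: "pinger", done: make(chan struct{})},
        }

        // One goroutine per adopted worker: wg.Done fires only after Wait
        // returns, i.e. only after the worker has really exited.
        for _, w := range workers {
            wg.Add(1)
            go func(w *fakeWorker) {
                defer wg.Done()
                w.Kill()
                w.Wait()
            }(w)
        }

        finished := make(chan struct{}) // stands in for catacomb.tomb.Done()
        go func() {
            defer close(finished) // deferred, so it runs only after...
            wg.Wait()             // ...every worker has been waited for
        }()

        <-finished
        fmt.Println("catacomb-like container is dead; no workers left running")
    }

This is why the panic output in the linked test could show the firewaller and pinger still running before the fix: nothing forced the shutdown path to wait for each worker's own Wait before declaring the catacomb dead.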
:) sounds like it was worse than just lying about where we were bootstrapping [03:03] anastasiamac: so, what happens if you bootstrap without specifying a region? [03:03] anastasiamac: have you tested live? [03:03] yes [03:04] natefinch: u will have to specify region in ur clouds. i think u cannot by-pass it now anyway... [03:04] axw_: prior to change, I'd always end up on us-central, after change i've bootstrapped in us-east :D [03:05] axw_: (as expected) [03:05] why is having a default bad? I mean, sure, don't override what someone explicitly specifies... but do we require other clouds specify regions? [03:06] natefinch: we have defaults at a higher level now [03:06] natefinch: in clouds.yaml, the first region is the default unless you specify one [03:06] and we always pass that into the provider [03:06] natefinch: because, like in this case, at the provider level, it's easy to omit getting mandatory property from bootstrap config [03:07] (two places for a default => subtle bugs) [03:07] natefinch: m not planning to remove default from 1.25, only planning to add copy param logic :D [03:09] axw_: ok, sure, yeah, not having multiple defaults is fine.... but clouds.yaml isn't really a default, it's a configuration. This means the user always has to manually specify a region, right? [03:09] natefinch: no. the first region specified in clouds.yaml, for each region, is the default unless you specify one. you can set a default region yourself, otherwise it's the first in the list [03:10] natefinch: run "juju show-cloud aws", and the first region in there should be the default for aws [03:10] (unless you have set one with "juju set-default-region") [03:12] axw_: I'm confused, I thought clouds.yaml was what we called the thing you had to write out by hand to pass into add-cloud. Is that a generated file we create on disk? [03:13] natefinch: sorry, I was being a bit non-specific. there's ~/.local/shared/juju/public-clouds.yaml (public clouds), and ~/.local/shared/juju/clouds.yaml (personal clouds, ones you added with add-cloud) [03:14] natefinch: the public-clouds.yaml file won't be there by default, it's also built into the client [03:14] natefinch: but "juju update-clouds" will pull down a file in that spot if there's been updates, so a client can refer to new clouds/regions without upgrading [03:15] axw_: ahh, thanks, ok. that was the context I was missing :) [03:18] natefinch: btw that intemediate clouds.yaml to feed into add-cloud should be going away (or at least relegated), as add-cloud will eventually (soon?) be made interactive [03:19] axw_: thank goodness. the less yaml I have to write, the better :) [03:19] yep, the aim is to stop editing files by hand [03:38] * thumper is on kid duty [03:38] I'm going to be afk for a while, but back after being a taxi tonight [03:41] davecheney: oh, also, will miss standup tomorrow morning, on airport pick up duty === thumper is now known as thumper-afk [03:45] cd . [04:03] thumper-afk: right ok [04:03] no worries [04:20] natefinch: shutdown -h [04:21] davecheney: heh [04:22] davecheney: for your consideration: https://github.com/juju/juju/blob/master/worker/provisioner/provisioner_task.go#L409 [04:27] natefinch: you'll have to try harder than that if you're trying to shock me [04:28] the MustCompile is a nice touch [04:28] very devil may care [04:28] I just can't even. 
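For readers without the context axw is giving: region defaults now live at the clouds.yaml level rather than in each provider. The following is an illustrative personal clouds.yaml of the kind fed to add-cloud; the cloud name, endpoints and regions are made up and the exact schema may vary by version, but per the description above the first region listed acts as the default unless one is set with "juju set-default-region".

    clouds:
      mystack:
        type: openstack
        auth-types: [userpass]
        endpoint: https://keystone.example.com:5000/v3
        regions:
          east:
            endpoint: https://keystone-east.example.com:5000/v3
          west:
            endpoint: https://keystone-west.example.com:5000/v3

Here "east" is listed first, so "juju show-cloud mystack" should report it as the default region, and "juju set-default-region mystack west" would override that.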
So much wrong in that one little line [04:51] axw_: i'm not sure we should include cloud region in migration data, since model could be stored to a different cloud/region [04:56] wallyworld: I think migration expects to be between controllers in the same region [04:56] wallyworld: the machines/agents/etc. remain where they are, and are redirected to the new controller [04:56] wallyworld: they're not destroyed and recreated [04:57] ok [05:06] anyone having problems with statesuite.TestPing? TearDownTest seems to hang forever [05:07] log is just full of [LOG] 0:55.041 DEBUG juju.mongo connection failed, will retry: dial tcp 127.0.0.1:39150: getsockopt: connection refused [05:07] why does touching anything cause it to berak [05:07] https://bugs.launchpad.net/juju-core/+bug/1590645 [05:07] Bug #1590645: worker/catacomb: test failure under stress testing [05:07] top tip: this was already broken [05:07] ^ that's master [05:08] Bug #1590645 opened: worker/catacomb: test failure under stress testing [05:09] nope, sorry, pebkac [05:09] gah, testping doesn't even DO anything [05:23] menn0: could you take a second look at https://github.com/juju/juju/pull/5578 [05:24] I found a bug in stress testing that I have now fixed [05:24] menn0: https://github.com/juju/juju/pull/5578/commits/36c3e7f8bd9435bf1cccd2480b4f921bcb6345d9 [05:29] davecheney: looking [05:38] davecheney: the core change looks good but we still disagree about test suites [05:38] Bug #1590645 changed: worker/catacomb: test failure under stress testing [05:43] menn0: fine i'll roll back that change [05:43] it's not worth having a fight about [05:43] davecheney: kk [05:43] menn0: done [05:43] removed that commit from the PR [05:43] it's pretty insane that our api client tests take many many minutes to run [05:44] davecheney: ship it already :) [05:45] menn0: with pleasure [05:46] menn0: as you say, it was obvious in retrospect (now we can see the solution) [05:46] we _have_ to wait for both goroutines to finish [05:47] davecheney: yeah totally [05:49] the previous stopped channel connected both goroutines in one direction, but didn't in the other [05:59] welp. Not getting these tests passing tonight. Sorry wallyworld. I keep running into random tests that time out after 10 minutes, which is sorta killing the development cycle here. I'm even using gt to avoid retesting code I know hasn't changed. [05:59] ok, can you push what you have? [06:00] wallyworld: sure [06:00] ty [06:03] wallyworld: made a WIP PR to make it easier to see the diffs: https://github.com/juju/juju/pull/5583 [06:03] ty [06:03] i'll try and look after soccer [06:03] heh: +547 −5,744 [06:03] \o/ [06:04] and most of that plus is really just a file rename or two [06:04] ok, bedtime. [06:08] * redir goes to bed too! === frankban|afk is now known as frankban [07:14] davecheney, would you explain http://reviews.vapour.ws/r/5022/diff/# to me please? I see that the original second goroutine can outlive the catacomb; but I can't see how Kill()ing an already-finished worker triggers session-copy panics [07:14] davecheney, ...oh, dammit, is this presence again? [07:18] Anyone know why when I ask Azure for 8 GB RAM & 50 GB disk, I get 13 GB RAM and 29 GB disk, but it claims to have mem=14336M root-disk=130048M ? [07:20] dimitern, do you know about the DocID and tests for same around the linklayerdevices code? [07:21] fwereade: what about it? 
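The 05:46-05:49 exchange above is the heart of the fix: a one-way "stopped" signal between the two goroutines is not enough, because shutdown has to observe that both of them have finished. A small sketch of the repaired shape follows; it is just the pattern, not the actual catacomb code.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        stopped := make(chan struct{})

        wg.Add(2)
        go func() { // goroutine A: runs the main work, then tells B it has stopped
            defer wg.Done()
            // ... do the work ...
            close(stopped)
        }()
        go func() { // goroutine B: reacts to A stopping, e.g. kills remaining children
            defer wg.Done()
            <-stopped
            // ... clean up ...
        }()

        // The one-way "stopped" signal is still there, but teardown now waits
        // for both goroutines, so neither can outlive the shutdown.
        wg.Wait()
        fmt.Println("both goroutines have exited")
    }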
[07:21] dimitern, why it's exported, and why it includes the model-uuid [07:22] fwereade: it shouldn't be exported [07:22] fwereade: sorry about that [07:22] dimitern, it happens ;) [07:22] fwereade: but why is it surprising that it includes model-uuid as prefix? [07:23] dimitern, fixed that already, mainly just checking something wasn't planning to build on it [07:23] dimitern, because state isn't meant to know that stuff -- the multi-model stuff does it for you [07:24] dimitern, or, it should -- I was wondering if there was something about it that meant it didn't quite fit [07:24] fwereade: well, I wasn't that comfortable with the multi-model stuff then I guess, wanted to test it excplictly [07:25] dimitern, just checking, though: you aren't using those DocIDs as anything other than _ids, right? not e.g. storing them in fields for subsequent lookup? [07:26] fwereade: I'm using mostly global keys [07:27] fwereade: and the DocID IIRC is only used for txn.Op{Id:} and FindId() [07:27] dimitern, (global keys without any model-uuid prefix, right?) [07:27] dimitern, cool, thanks for the orientation [07:28] fwereade: without, except for the parent/child refs [07:28] bugger [07:28] fwereade: let me have a look to remind myself.. [07:29] dimitern, thanks [07:30] Also, 9 minutes to bootstrap that instance in Azure - is that expected? [07:31] blahdeblah: 1.25 or 2.0? [07:32] fwereade: nope, so for the refs the docid is used literally as given, no assumptions on prefix [07:32] 1.25.5 [07:32] axw_: ^ [07:32] Looks like no matter what you ask for in a root disk, you get whatever Microsoft decides you need, which is 31.5 GB raw. [07:32] dimitern, can you point me at the code you're looking at? [07:32] fwereade: linklayerdevices_refs.go [07:33] blahdeblah: it's been a while since I looked at the old provider, will have to go spelunking [07:33] axw_: Is it something where we need to specify instance type instead of mem/disk constraints? [07:33] blahdeblah: re slowness: yes, sadly that's expected [07:33] How does this cloud still exist? :-\ [07:34] dimitern, ok, refs looks safe, it's explicit but doesn't need to be [07:34] blahdeblah: I *think* the root disk size is the same for all instance types, will need to check [07:34] dimitern, what about lldDoc.ParentName? [07:35] fwereade: that can be a global key [07:35] axw_: I suppose I should log a bug saying that there's no indication that the root-disk constraint is not honoured then... [07:35] dimitern, whoa, ParentName lets docids leak out too? [07:36] blahdeblah: I think it may actually be related to the images that Canonical publishes [07:36] fwereade: well, not quite the docid, just the gk [07:36] axw_: That affects the size of sda presented to the OS? Seems unlikely... [07:36] dimitern, still [07:36] blahdeblah: well the images have the size of the root disk in the name ... [07:37] But surely that would simply affect the size of the partition created on the disk, not the disk size itself... [07:37] blahdeblah: maybe, depends on whatever Hyper-V does. I don't know, I'll have a poke around [07:37] Thanks - appreciated. [07:37] fwereade: looking at the code I don't see a good reason to export ParentName() actually.. as there is a ParentDevice() which is more useful outside of state anyway [07:38] dimitern, excellent, I'll drop it if I can [07:38] axw_: I'll have a poke around for relevant bugs [07:38] dimitern, still sl. 
struggling to get my head around what changes I could/should make to get around the internal test failures [07:38] fwereade: please don't just drop it - it will still be needed inside the package for refs checks and some other internal logic, but unexporting it should be fine I think [07:39] dimitern, sorry, that's what I meant [07:39] dimitern, so, to step back for context [07:39] fwereade: ok [07:39] dimitern, I'm trying to extract a smaller, less stateful, type from State [07:39] fwereade: sounds challenging :) [07:40] dimitern, the clean line at the moment seems to be {database, model-uuid} [07:40] dimitern, and I've tacked on only a few methods -- getCollection, runTransaction, and the docID/localID translators [07:40] dimitern, this ofc means that the hacked-up state no longer produces the correct answers [07:41] dimitern, because the implementation detail of *how* we calculate docID has changed [07:41] dimitern, but I am deeply reluctant to just "fix" that *State [07:41] fwereade: I'm afraid I do have a bunch of internal tests for LLDs that verify the docID format :/ [07:42] dimitern, the biggest question actually [07:42] fwereade: I trust the multi-model code better now at least :) [07:42] dimitern, is: can I just drop those internal tests? is there any functionality that isn't covered by external ones? [07:42] dimitern, so many of them are working with an unconfigured state anyway... ;p [07:43] dimitern, first stab at multi-model you did have to care about model-uuid [07:43] fwereade: those tests that check the docID includes model uuid prefix? sure - I think those are unnecessary anyway [07:43] dimitern, really, all the internal tests [07:43] fwereade: but the assumptions on the globalKey format for LLDs is important [07:45] dimitern, why so? they strike me as the purest of implementation details [07:45] fwereade: I aimed for 100% coverage while writing the code, some bits of it are not possible to test externally, but the internal tests gave me confidence that code is exercised [07:45] dimitern, if it's not possible to test it externally, why does it matter? [07:46] dimitern, by definition, surely, that makes it an implementation detail [07:46] fwereade: well, re gk format - container LLDs are intentionally restricted to only having a parent device on the same machine as the container and only of type bridge [07:47] axw_: FWIW, seems like it might be region-dependent. If I bootstrap in debug mode, I get a bunch of error messages about Basic_A[1-4] and Standard_G[1-5] not being available in US East [07:49] dimitern, ok, but aren't those restrictions exercisable via the exported api? doesn't matter what strings we use, it's the restrictions on the live types we export that we need to verify [07:50] fwereade: sorry, I've looked at the internal tests again; ISTM most can be either moved to the non-internal suites or simply dropped [07:54] fwereade: those that could be moved include tests on simple getters or exported, related helper funcs (e.g. IsValidLinkLayerDeviceName) [07:56] dimitern, ok, I'll see what I can do, thanks [07:56] blahdeblah: "Azure supports VHD format, fixed disks." -- https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-about-disks-vhds/ [07:57] fwereade: I believe tests around parent gk format can be tested externally, it was just easier to test them with a 'fakeState' (and not bring up the whole stack with the real one) [07:57] blahdeblah: the old provider should not allow you to specify 50GB in the first place. 
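To make the 07:39-07:40 description concrete: the type being extracted pairs the database with a model UUID and owns the docID/localID translation, so individual state code never handles the prefix itself. A rough sketch follows; the "<model-uuid>:<local-id>" format and the method names are assumptions for illustration, not the exact juju implementation.

    package main

    import (
        "fmt"
        "strings"
    )

    // modelBackend is a hypothetical version of the smaller, less stateful type
    // being split out of *State: just a database handle plus a model UUID.
    type modelBackend struct {
        modelUUID string
        // db Database // elided: getCollection/runTransaction would hang off this
    }

    // docID turns a model-local _id into the fully qualified one stored in mongo.
    func (b modelBackend) docID(localID string) string {
        return b.modelUUID + ":" + localID
    }

    // localID strips the model prefix again; ids from other models are returned
    // unchanged so the mismatch stays visible to the caller.
    func (b modelBackend) localID(docID string) string {
        prefix := b.modelUUID + ":"
        if !strings.HasPrefix(docID, prefix) {
            return docID
        }
        return strings.TrimPrefix(docID, prefix)
    }

    func main() {
        b := modelBackend{modelUUID: "deadbeef-0bad-400d-8000-4b1d0d06f00d"}
        id := b.docID("some#local#key") // made-up local id
        fmt.Println(id, "->", b.localID(id))
    }

With a split like this, collections and transactions go through the backend, and nothing outside it needs to know the prefix format, which is why the internal tests that assert on it add little.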
the new one doesn't [07:59] axw_: When you say "old provider", you mean the one in 1.25.x ? [07:59] dimitern, as in, we can externally test that we can't create bad LLDs? [07:59] blahdeblah: yep [07:59] axw_: So if I need more than 31.5 GB root disk, what can I do? [07:59] blahdeblah: it does not exist in 2.0. Azure completely changed their APIs [07:59] dimitern, is there something specific about the gks that I'm missing? [08:00] blahdeblah: so with the old world, I think you could do it with some hoop jumping: take the existing VHD, resize it and upload to your private storage. then you'd need custom image metadata [08:00] blahdeblah: not sure why the images have such a small size in the first place tho really [08:01] blahdeblah: possibly because with "premium storage", you pay by the disk size, rather than usage [08:01] axw_: When you say "old world", you mean "The only production-ready version currently supported"? :-) [08:01] blahdeblah: yes. [08:01] fwereade: well, there's one thing... [08:01] axw_: So can we specify instance type or something like that to get around this? [08:01] Or are we just stuck with 31.5 GB? [08:02] blahdeblah: nope, it's down to the images I'm afraid. all instance types have the same max limit for OS disk sizes [08:02] fwereade: to test e.g. you can't add a parent device to a container device without using the parent's globalkey as LLDArgs.ParentName [08:02] axw_: OK - thanks for your time. [08:03] fwereade: you have to use the gk, which is then verified to exist on the same host machine as the container and it's a bridge [08:03] dimitern, don't we have some other way of identifying the parent? [08:04] fwereade: well, we only need its name and its host machine, which is conveniently part of the LLD's gk [08:05] fwereade: and in other cases (e.g. adding a new child to existing parent device on the same machine) we just use the plain parent device name in LLDArgs.ParentName [08:06] dimitern, sorry, when is a new child *not* added to an existing parent on the same machine? [08:06] fwereade: but hey! fortunately, that logic with LLDArgs.ParentName being a gk is only really used in one place usable externally [08:07] fwereade: when it's on a different machine (e.g. in a container, while the parent is on the host machine of that same container) [08:08] fwereade: SetContainerLinkLayerDevices is the only place currently where we use a gk-based ParentName for adding devices [08:12] fwereade: and if LLD.ParentName() is unexported, and ParentDevice() used instead (as it is IIRC), we won't leak gks outside of state [08:13] dimitern, I remain a bit suspicious of the still-extant ability to specify gk-ParentName from outside [08:16] dimitern, could we, e.g., have always-simple-ParentDeviceName, and ParentMachineID, in LLDArgs? [08:16] fwereade: that can be rectified, assuming SetLinkLayerDevices() rejects ParentName set to a gk, and SetContainerLLDs can bypass that check internally [08:17] fwereade: that's perhaps better - future-proof and more explicit [08:18] dimitern, cheers [08:19] fwereade: btw, now that you've had a chance to look at the approach I used for setting LLDs and LLDAddresses in bulk [08:19] fwereade: how do you like it? [08:20] dimitern, only just looking at that side of it now... [08:22] dimitern, just reading setDevicesAddresses... doesn't the providerID stuff have that index gotcha? 
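Sketching the 08:16 suggestion: a plain ParentDeviceName plus an explicit ParentMachineID, instead of a ParentName that is sometimes a device name and sometimes a state-internal global key. The struct and the validation helper below are hypothetical, not juju's real state.LinkLayerDeviceArgs; the rule encoded is the one described above, that a container device's parent must be a bridge on the container's host machine.

    package sketch

    import "errors"

    // LinkLayerDeviceArgs is a hypothetical variant of the args struct.
    type LinkLayerDeviceArgs struct {
        Name string
        Type string // e.g. "ethernet", "bridge", "vlan"

        // Instead of an overloaded ParentName:
        ParentDeviceName string // always a plain device name, e.g. "br-eth0"
        ParentMachineID  string // set only when the parent lives on another machine (the container case)
    }

    // validateContainerParent captures the restriction discussed above: when a
    // container device names a parent on another machine, that machine must be
    // the container's host, and the parent must be a bridge. parentType is what
    // a lookup of the parent device would report.
    func validateContainerParent(hostMachineID, parentType string, args LinkLayerDeviceArgs) error {
        if args.ParentMachineID == "" {
            return nil // parent (if any) is on the same machine; nothing extra to check
        }
        if args.ParentMachineID != hostMachineID {
            return errors.New("parent device must be on the container's host machine")
        }
        if parentType != "bridge" {
            return errors.New("parent device on the host machine must be a bridge")
        }
        return nil
    }

This leaves SetLinkLayerDevices free to reject global keys outright while SetContainerLinkLayerDevices passes the host machine explicitly, which is the "future-proof and more explicit" outcome dimitern settles on.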
I see some checking but it's not immediately clear that it's enough [08:22] fwereade: it was a noble attempt at first :) but it got complicated as parent/child refs were added.. and my inability to construct a single set of []txn.Op that can insert and verify new parents and new children of those in the same set of ops [08:23] fwereade: now providerIDsC is used for subnets, spaces, LLDs, and LLDAs, and the indexing issue is handled there [08:23] dimitern, yeah, the "you don't need to assert what you're doing in this txn" thing honestly only makes life harder [08:24] dimitern, ok... but the providerID memory-checks aren't guaranteed to run at all here, are they? and I don't think they're asserted at apply-time either [08:24] fwereade: yeah txn.DocMissing being the only possible option for inserts can really force you to re-think.. and I do understand why it's the only option [08:26] * dimitern takes a look [08:27] dimitern, no, wait, setDevicesAddressesFromDocsOps seems to cover that [08:27] fwereade: so the ProviderID on LLDAddrDoc is just for convenience, it's not enforcing integrity like it used to; the asserts on PrIDsC do [08:29] dimitern, sorry, I seem to be slow thing morning [08:30] dimitern, we have some logic around :592 that checks for dupe provider ids -- what does it do that sDAFDO doesn't? [08:31] fwereade: it validates whether ErrAborted occured due to the assert added in sDAFDO [08:31] dimitern, won't reinvoking sDAFDO catch those anyway? [08:33] Bug #1590671 opened: Azure provider does not honour root-disk constraint [08:33] fwereade: well, it looks like sDAFDO doesn't validate, just asserts docMissing [08:34] fwereade: sorry, I need to talk to some contractors that just arrived - back in ~15m [08:34] dimitern, np [08:35] jam: happy birthday [08:36] dimitern, should networkEntityGlobalKeyOp theoretically be doing that check? if it's just a FindId().Count() I'd not too worried, even if we do ask it a bunch of times... [08:41] fwereade: the problem is the goroutine does, w.Stop() and then returns, marking the waitgroup as done [08:41] Stop doens't actaully stop, it just asks to stop so the worker can still be in the process of shutting down when the mongo connection is torn down [08:59] thanks voidspace [09:02] davecheney, right; but that's why the wg.Done happened after the Wait in the first goroutine -- how does the wg complete with a running worker? the only race I see is the possibility of a late Kill in the second goroutine [09:04] davecheney, and I accept *that's* wrong, because it's too trusting of Kill not to do anything untoward [09:05] davecheney, but if I understand correctly, you're saying that workers live too long, not that late Kills of already-dead workers are the problem? [09:06] davecheney, and I don't see how that was happening, because we always Wait()ed for the worker before we call wg.Done [09:08] davecheney, (I am assuming you're talking about Kill, which doesn't wait, rather than Stop, which does wait but wasn't used in the original) [09:16] fwereade: I suppose so [09:16] fwereade: but since networkEntityGlobalKeyOp is used for subnets, and spaces as well, it should be done carefully [09:17] dimitern, ack [09:17] dimitern, (also, I'm pretty sure machineProxy is evil vs extracting ops-composers and using them in both places) [09:20] fwereade: yeah, machineProxy was added mostly for convenience, is it evil because it assumes the LLD and the machine are always on the same session? 
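On the providerIDsC point: with mgo/txn, the only assert you can pair with an Insert is txn.DocMissing, so uniqueness of provider IDs across subnets, spaces, devices and addresses falls out of inserting a reservation document keyed by the provider ID and letting the transaction abort on a duplicate. A minimal sketch follows; the collection name and id format are assumptions, not the exact juju schema.

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2/txn"
    )

    type providerIDDoc struct {
        ID string `bson:"_id"`
    }

    // insertProviderIDOps returns the op that reserves a provider ID: if some
    // other entity already claimed it, the DocMissing assert fails and the
    // whole transaction aborts (the ErrAborted case being validated above).
    func insertProviderIDOps(docID string) []txn.Op {
        return []txn.Op{{
            C:      "providerIDs",  // shared by subnets, spaces, devices and addresses
            Id:     docID,          // e.g. "<model-uuid>:linklayerdevice:<provider-id>" (assumed format)
            Assert: txn.DocMissing, // the only assert allowed alongside Insert
            Insert: providerIDDoc{ID: docID},
        }}
    }

    func main() {
        ops := insertProviderIDOps("deadbeef:linklayerdevice:prov-123")
        fmt.Println(len(ops), "op(s); the insert is asserted with txn.DocMissing")
    }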
[09:21] dimitern, more that it's pretending to be a valid machine and it's really not [09:22] dimitern, it's trusting to luck that an empty machine doc won't trigger a bad-memory-state failure [09:23] fwereade: fair point [09:23] dimitern, the garbage data is *kinda* ok, in that the asserts *should* trap any problems, but you've still got an invalid *Machine lying around and it makes me nervous ;p [09:24] Bug #1590689 opened: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:24] dimitern, anyway, I think I have to consider most of this a derail, I'm adding a bug [09:25] fwereade: FWIW it's never lying around - only used to call LLD(name) on it in 2 places [09:25] fwereade: but still [09:26] dimitern, yeah, it's not how it's used right that bothers me so much as how someone will one day use it wrong ;) [09:26] (considering all of that was designed and implemented in a week..) [09:27] fwereade: indeed; "don't be clever, be obvious" [09:27] :) [09:27] dimitern, I apologise for the length of this letter, I did not have time to make it shorter ;) [09:28] fwereade: which one? [09:29] dimitern, it's a quote from someone-or-other that seemed tangentially relevant [09:30] dimitern, "shorter" *sort of* maps to "more obvious", so long as it's the right sort of shortness [09:30] fwereade: ah :) I was thinking of "clean code" [09:31] dimitern, that is rather more directly relevant than pascal, indeed ;) [09:33] Bug #1590689 changed: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:36] Someone needs to come up with a way of hedging Pascal's wager against Roko's Basilisk. [09:36] babbageclunk, ssh! [09:37] ;p [09:37] Sorry you guys. [09:42] Bug #1590689 opened: MAAS 1.9.3 + Juju 1.25.5 - on the Juju controller node eth0 and juju-br0 interfaces have the same IP address at the same time [09:42] Bug #1590699 opened: LinkLayerDeviceArgs exposes globalKeys outside state [09:46] hey dimitern, got a moment for more stupid questions? [09:46] babbageclunk: always :) [09:48] dimitern: So I've added my vlans to the eni on the maas controller and restarted. The routes *seem* ok, but I still can't ping any of the vlan addresses on the hosts. [09:48] babbageclunk: what have you tried pinging? [09:49] dimitern: 192.168.10.2 [09:49] babbageclunk: that's a node's address and you pinged from maas? [09:50] dimitern: yup [09:51] dimitern: on one node I can see that in ip route, and on the maas node I still get destination host unreachable trying to ping that address. [09:51] babbageclunk: do you have ip forwarding enabled on maas? [09:51] dimitern: Do I need to set the vlan up in the VMM virtual network config? [09:52] dimitern: Where do I check for ip forwarding? [09:52] babbageclunk: sudo sysctl -a | grep ip_forward [09:52] dimitern: oh, ok - nothing in the ui? [09:53] dimitern: net.ipv4.ip_forward = 1 [09:53] babbageclunk: the kvm node has 1 or more NICs, each of which connected to a libvirt bridge; the bridge is layer-2, so it will pass both tagged and untagged traffic [09:53] dimitern: ok] [09:54] babbageclunk: on the node page in maas though, you need to have a physical NIC on the untagged vlan and 1 or more vlan NICs [09:55] dimitern: for the rack controller? [09:55] dimitern: yes, I've got those. [09:55] babbageclunk: can you paste e/n/i from your rack controller? [09:56] babbageclunk: where are the nodes and rack ctrl located? 
[09:57] babbageclunk: if the kvm nodes are on the rack ctrl machine, you need bridges there as well; if the rack ctrl is itself a kvm and it's sitting on your machine, along with all the nodes [09:57] dimitern: Ooh, looking back at your eni example I think I see the problem [09:57] dimitern: (well, a problem) [09:57] dimitern: http://pastebin.ubuntu.com/17139880/ [09:58] babbageclunk: ...then you need to enable forwarding on your machine as well, and the bridges will be on your machine [09:58] dimitern: I've left off the /24 part of the address [09:58] dimitern: The rack controller and the nodes are all kvms on my machine/ [09:58] babbageclunk: yeah! :) it's either e.g. /24 or netmask is needed [09:59] babbageclunk: omitting both means /32 IIRC [09:59] dimitern: ok, fixing and bouncing [10:00] dimitern: oops, meeting [10:00] dimitern: thanks again! I owe you about an infinitude of beers by this poing. [10:00] gah, point [10:01] babbageclunk: heh, keep 'em coming :D [10:03] anastasiamac: meeting? [10:12] anastasiamac: can you try again pushing? [10:15] dimitern: fatal: unable to access 'https://github.com/juju/juju/': The requested URL returned error: 403 [10:15] dimitern: ;( [10:16] anastasiamac: so you shouldn't be able to push to juju/juju - so why is it trying? [10:16] dimitern: ofco - i have not created a branch \o/ [10:17] thank you!! like i said - one of these days :D [10:17] anastasiamac: :) [10:57] dimitern, that didn't help. :( [10:58] babbageclunk: adding the /24 ? [10:58] dimitern: yup [10:59] dimitern: What were you saying about bridging above? If the controller and the nodes are all sibling kvms do I need bridges in the controller eni? [11:00] dimitern: There are bridges in the node ENIs, but I don't know whether they're needed or working. [11:00] babbageclunk: juju created those [11:00] dimitern: Ah, right. [11:01] babbageclunk: but yeah, the libvirt bridges where both maas and the nodes connect to are on your machine I guess then [11:01] babbageclunk: how are those bridges (networks) configured? [11:02] babbageclunk: in vmm [11:02] babbageclunk: hmm also - is there anything in /e/n/interfaces.d/ on the rack machine? [11:04] dimitern: my host eni is basically empty - presumably handled by something else? [11:04] dimitern: no, nothing in the rack controller interfaces.d [11:04] babbageclunk: yeah, usually network manager handles that [11:07] babbageclunk: re rack's e/n/i.d/ - ok; please paste the output of `virsh net-list` and `virsh net-dumpxml ` for each of those? [11:09] (or only those relevant to that maas rack) [11:11] dimitern: http://pastebin.ubuntu.com/17141198/ [11:13] babbageclunk: ok, so your maas2 network on ens3 on the rack, but what's 10.10.0.0/24 then? [11:13] s/network on/network is on/ [11:14] babbageclunk: that looks like the issue - eth1.10 on the rack [11:14] babbageclunk: is this the pxe subnet for the nodes? [11:18] dimitern: That's a holdover from a previous space experiment (I think it was for reproducing a bug maybe?) - should I just get rid of it? [11:18] dimitern: Don't understand your last question. [11:18] dimitern: Is what the pxe subnet? [11:18] Bug #1590732 opened: unnecessary internal state tests for networking [11:19] babbageclunk: the pxe subnet is the one the nodes boot from, it cannot be on a tagged vlan; paste the output of `maas interfaces read `, for the node you where unable to ping? 
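For reference, the /etc/network/interfaces detail that bit here: a static stanza needs either CIDR notation on the address or a separate netmask line (the 09:59 comment notes that omitting both means /32). A made-up example stanza for a tagged VLAN interface; names and addresses are illustrative only.

    auto eth1.10
    iface eth1.10 inet static
        # either CIDR notation on the address ...
        address 192.168.10.1/24
        # ... or a plain address with an explicit netmask:
        # address 192.168.10.1
        # netmask 255.255.255.0
        # requires the vlan package; often implied by the eth1.10 name anyway
        vlan-raw-device eth1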
[11:20] babbageclunk: it the node interfaces table in the ui the pxe interface is marked with a selected radio button [11:20] s/it/in/ [11:20] dimitern: hang on, I rebooted the controller after the eni change. [11:27] dimitern: http://pastebin.ubuntu.com/17141467/ [11:27] dimitern: pxe's always on the physical interface. [11:28] babbageclunk: ok, thanks, I think I see a few things that might be the cause [11:28] dimitern: Or do you mean there can't be any tagged vlans hanging off the pxe interface? [11:30] babbageclunk: I think the gateway_ip is wrong for the 192.168.150.0/24 subnet is wrong [11:32] dimitern: Really? Wouldn't that have stopped them from getting to the internet? [11:32] dimitern: The 150 subnet was set up automatically at MAAS install time. [11:33] babbageclunk: I believe the maas rack's ip should be the gateway for the nodes (i.e. gateway ip on the 192.168.150.0/24 subnet in maas), as the .1 address is the libvirt bridge [11:33] babbageclunk: the rack itself needs to use the .1 ip on ens3 as gateway [11:35] babbageclunk: i.e. try updating the 192.168.150.0/24 subnet to use 192.168.150.2 as gateway ip, and then deploy a node directly from maas (no juju) and see how it comes up and whether you can ping its IP from maas and vs [11:35] vv. [11:35] dimitern: ok. (vv?) [11:37] babbageclunk: sorry; vice verse (?) [11:37] dimitern: Ah, right. [11:37] I was sure it's written as versa but erc syntax check underlines it.. [11:38] dimitern: I can't change the subnet while things are using it, I'll have to kill my juju nodes. [11:38] dimitern: Oh, maybe that's just a limitation in the ui? [11:39] babbageclunk: ah, I suppose it's like that for safety [11:39] Do you think I should blow away my maas install and start again entirely clean? [11:40] babbageclunk: no need to go that far I think - just release the nodes, update the subnet's gw and try deploying I guess should fix it, hopefully [11:43] dimitern: ok, doing that now. [11:46] babbageclunk: I need to go out btw.. but should be back in 1h or so [11:47] dimitern: ok - I'm going to go for a run soon too. Thanks. [11:47] babbageclunk: np, good luck :) and I'll ping you when I'm back [11:52] dimitern: :) [12:13] dimitern: you're out, but yay! I deployed a machine and can ping it from the controller on the vlan IP and vice versa. [12:13] dimitern: rerunning my bootstrap now and going for a run. === akhavr1 is now known as akhavr [13:47] jam: you around? [14:01] anyone wanna review this, which deprecates some fields that wanted to go a long time ago in charm.Meta? https://github.com/juju/charm/pull/212 [14:10] rogpeppe1: LGTM with a couple doc tweaks [14:10] natefinch: thanks [14:22] babbageclunk: nice! [14:55] Bug #1590821 opened: juju deploy output is misleading [15:10] dimitern: Hmm - now juju run is hanging on me. [15:12] dimitern: huh - correction, juju run --unit hangs. juju run --machine is working fine. [15:17] babbageclunk: try with --debug ? [15:20] dimitern: http://pastebin.ubuntu.com/17145175/ [15:21] dimitern: Ooh, I saw something in the juju run docs about show-action-status [15:21] babbageclunk: try `juju run --unit haproxy/0 -- 'echo hi' [15:22] dimitern: that shows the unit-level jobs I've been running as pending, while the machine ones are completed. 
[15:22] babbageclunk: the extra args are confusing 'juju run' I think [15:22] dimitern: no difference [15:23] babbageclunk: ok, so if the unit agent hasn't set the status to started and it's still pending, something didn't work [15:23] babbageclunk: try looking for issues in the unit-haproxy-0.log ? [15:23] dimitern: I did have some work at the start! I can see them completed in the show-action-status output. [15:23] dimitern: Ok, checking the log [15:24] babbageclunk: since juju run is actually handled by the uniter, it needs to run in order to do anything [15:25] dimitern: At the end of the log I see a lot of "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped [15:26] babbageclunk: hmm [15:26] babbageclunk: this sounds bad, but it might be orthogonal.. any other errors? [15:28] dimitern: no, nothing that sounds exciting. Blocks of http://pastebin.ubuntu.com/17145415/ [15:29] dimitern: periodically [15:29] babbageclunk: ok, how about the output of juju show-machine? [15:31] dimitern: http://pastebin.ubuntu.com/17145473/ [15:31] babbageclunk: hmm [15:32] babbageclunk: well, the only thing left to check is `juju status --format=yaml` I guess [15:32] dimitern: I'll look in the controller logs too. [15:33] babbageclunk: yeah, it might help if e.g. the spaces discovery wasn't completed by the time the unit started to get deployed.. [15:34] dimitern: nothing interesting in juju status --format=yaml [15:34] babbageclunk: wait a sec... why machine 0's address is 192.168.10.2 ? I'd have expected to see 192.168.150.x ? [15:35] babbageclunk: wasn't the .10. subnet on a tagged vlan 10 ? [15:36] babbageclunk: or maybe it still is, but just the .10. address sorted before the .150. one and was picked as "preferred private address".. [15:36] dimitern: yeah, it is on the vlan [15:36] dimitern: I had to run sshuttle for juju ssh to work. [15:37] dimitern: .10 is the internal vlan [15:38] babbageclunk: do you mind a quick HO + screen share - it will be quicker to diagnose [15:38] ok [15:38] dimitern: juju-sapphire? [15:38] babbageclunk: joining yesterday's standup [15:38] yeah [15:39] babbageclunk: you appear frozen.. [15:40] dimitern: yeah, my whole machine hung [15:40] ooh :/ [15:42] babbageclunk: same thing.. you might be having the same issues as I had before downgrading the nvidia driver from proposed to stable [15:43] dimitern: yay, back again! [15:43] Ok, maybe not screen sharing - what about tmate? [15:44] I haven't used it [16:12] dooferlad: ping [16:19] frobware: two minutes [16:20] dooferlad: can you jump into sapphire HO when ready - thx [16:36] dooferlad: (reverse-i-search)`check': git checkout f0b4d55bd98e5d1a9089399dc7ecee2c75ecc6a8 add-juju-bridge.py [17:02] ahh state tests.... my old nemesis [17:06] heh, I had time for lunch with dessert and possibly a coffee and the suite is still running [17:09] man, my tests do not like to run in parallel... apiserver and state tests both barf when I run all tests, but run fine if I run them by themselves. === frankban is now known as frankban|afk [17:53] I wish we had a "please test my branch on CI because I don't trust the tests running on my own machine" button [17:56] natefinch: I use the other machine [18:00] hey cmars - were you able to access that windows system from sinzui for bug 1581157? 
[18:00] Bug #1581157: github.com/juju/juju/cmd/jujud test timeout on windows [18:00] cherylj, haven't tried it yet [18:06] cherylj, ok, i can rdp into it [18:06] cherylj, how would i reproduce the hang there? not familiar with the CI setup [18:07] cmars: what I've done in the past is looked at what the GOPATH is [18:07] to see where the src might be [18:07] is GOPATH set? [18:07] cherylj, it is, but there's no $env:GOPATH/src [18:08] cmars: okay, then you'll need to scp a tar of the src [18:08] I've used the ones generated by CI, one sec [18:08] cherylj, hmm.. i could just use go get to grab the source [18:09] cmars, it's easier to get the tarball :) [18:09] in my experience anyway [18:09] http://data.vapour.ws/juju-ci/products/version-4043/build-revision/build-4043/juju-core_2.0-beta9.tar.gz [18:09] get that then scp it to the windows machine [18:10] I wish juju status just had a -v to alias --format=yaml [18:10] cmars: I had to use a path like this in the scp: $ scp file_windows.go Administrator@:/cygdrive/c/Users/Administrator [18:12] ha: https://bugs.launchpad.net/juju-core/+bug/1575310 [18:13] Bug #1575310: Add "juju status --verbose". [18:15] Anyone have an inordinate amount of free time? I need a review: http://reviews.vapour.ws/r/5027/ Files changed: 89 +547 -5744 [18:17] ericsnow, ^^^ [18:18] natefinch: looking [18:18] it's all pretty mechanical, honestly [18:28] cherylj, ok. i don't know how to untar on windows, but go get worked fine. i've got a loop of `go test ./cmd/jujud/agent` running in a loop, 30 times, on that machine. it's tee'ing output to a log file, will check it after lunch [18:29] it's running against latest master [18:29] cmars: yeah, you have to use that '7z' utility that sinzui mentioned in the email [18:34] Bug #1590909 opened: github.com/juju/juju/state: undefined: cloudSettingsC [18:45] hmm.... I don't think juju deploy mysql --to lxd is working [18:45] ericsnow: is lxd as a container type supposed to work on trusty machnies? [18:46] natefinch: I'd expect it to (given LXD is set up correctly) [18:46] natefinch: and Go is 1.3+ [18:47] ericsnow: I mean, like, deploy mysql --to lxd in AWS, ... should spin up a new machine and deploy a lxd container to it [18:47] and deploy mysql to that container [18:47] natefinch: not sure [18:47] natefinch: it should work, but only if the container is also trusty [18:47] erm, maybe [18:48] ericsnow: so far, looks like no... juju add-machine lxd works [18:48] that's how it was with lxc anyway [18:48] n/m me [18:48] heh [18:48] lemme try something that uses xeniel [18:48] xenial [18:50] ...that's a no. [18:50] it never creates the base machine [18:50] damn [18:50] brb [19:02] sigh nil vs non-nil mismatch; obtained ; expected [19:06] natefinch: fix-it-then-ship-it === ericsnow is now known as ericsnow-afk === redir is now known as redir_afk [19:19] sinzui: do we test deploying to containers in CI? [19:20] mgz: ^ [19:20] natefinch: all the time [19:21] natefinch: lxd network still fails [19:21] sinzui: ahh... umm.. shouldn't that be like a blocking bug or something? [19:22] natefinch: we bring it up several times a week. We are told it isn't as hot as the other bugs...but I am sure when lxc is removed. it will be hot [19:22] sinzui: lol, yeah... I'm hesitant to remove lxc if we have no replacement (other than kvm) [19:23] natefinch: kvm isn't a replacement because public clouds don't support it [19:24] sinzui: ahh, well doubly so then [19:29] natefinch: We have bundles that deploy to lxd. the workloads work. lxd mostly works. 
its networking is broken though. juju cannot ssh into it as it can with lxc. [19:29] sinzui: ahh, ok... my current branch doesn't work, so that's interesting. I'll retry with master to make sure I know what it expect [19:30] sinzui: by doesn't work, I mean that if you deploy, no base machine is ever created, so the fcontainer is never created, so the service is never deployed. But it sounds like that's probably my own fault on my branch [19:32] natefinch: specific things are still an issue with lxd, but the common stuff works. you may either have borked something in your branch or have hit a specific sequence of steps that don't [19:32] most of lxd tests just throw a bundle at juju, only one or two use add-machine/--to [19:32] mgz: since I changed 89 files around container stuff, it's probably my fault. I'll double check [19:54] hmmm ERROR failed to bootstrap model: cannot start bootstrap instance: cannot run instances: Request limit exceeded. (RequestLimitExceeded) [19:55] wonder if my previous aws deployment was retrying the machine deployment too much [19:57] from previous controller: machine-0: 2016-06-09 18:05:00 WARNING juju.apiserver.provisioner provisioninginfo.go:526 encountered cannot read index data, cannot read data for source "default cloud images" at URL https://streams.canonical.com/juju/images/releases/streams/v1/index.sjson: openpgp: signatur [19:57] e made by unknown entity while getting published images metadata from default cloud images [19:58] unit mysql/0 cannot get assigned machine: unit "mysql/0" is not assigned to a machine [19:59] that is uh, not the most useful error stack [20:00] reconfiguring logging from "=DEBUG" to "=WARNING;unit=DEBUG" ... is this our default logging level now? [20:00] bbl [20:00] that seems... odd === thumper is now known as thumper-afk [20:29] Bug #1590947 opened: TestCertificateUpdateWorkerUpdatesCertificate failures on windows [20:34] natefinch: afaik, it's always been at that logging level [20:34] cherylj: weird... maybe I've had a custom logging level set in my environments.yaml for so long that I didn't realize. It seems like a crazy log level [20:35] yes [20:35] I agree [20:35] cherylj: I mean... not showing info drops a lot of useful context on the floor... and unit=debug? What? I'll file a bug. [20:37] does unit even work? would't it need to be juju.unit? [20:41] mgz, sinzui: FWIW, juju deploy ubuntu --to lxd does not create base machines using master, at least for me (just tried on GCE since AWS was mad at me) [20:41] * natefinch files another bug [20:42] natefinch: mad at all of us. none of us could launch an instance in us-east-1 for the last two hours [20:44] sinzui: ahh, ok, I thought it was something my code had triggered accidentally [20:50] Bug #1590958 opened: Juju's default logging level is bizarre [20:57] btw, looks like add-machine lxd works, and I can then deploy --to 0/lxd/0 ... it's just the straight deploy foo --to lxd that isn't working [20:58] natefinch: yeah, that sounds possible, we may just not do that in functional tests [20:59] mgz: yep [20:59] gotta run, will bbl === natefinch is now known as natefinch-afk [21:11] Bug #1590960 opened: juju deploy --to lxd does not create base machine [21:11] Bug #1590961 opened: Need consistent resolvable name for units === ericsnow-afk is now known as ericsnow [21:58] katco: looks like you have a plan for capturing CLI interactions? [21:58] wallyworld: CLI interactions? 
no, API requests [21:58] ah sorry, yeah [21:58] that's what i meant [21:59] wallyworld: it looks like the RPC stuff already has a concept of listening to what goes on, but unfortunately it's constrained to a single type [21:59] wallyworld: so i'm expanding on that a little [21:59] katco: could you outline your idea and email the tech board just to ensure they are in the loop [22:00] wallyworld: ...seriously? i don't think this is a radical change... [22:00] ok. i was just a little cautious messing with the rpc stuff but i guess it's ok [22:01] wallyworld: i can email them. i don't think it's a breaking change. it's probably easier to just point them at a diff though === thumper-afk is now known as thumper [22:02] katco: so the plan is to use export the RequestNotifier and use that? [22:03] wallyworld: no, the plan is to support multiple observers, one of which will remain the RequestNotifier [22:03] wallyworld: the other one will be the audit observer [22:04] ok, and you need to export RequestNotifier so you can manage it as an observer [22:04] wallyworld: well it's in an internal package, so it's scope isn't any different [22:04] wallyworld: but i wanted to encapsulate all the observers so they're not cluttering apiserver [22:05] ok, thanks for clarifying, just trying to get up to speed [22:06] Bug #1590909 changed: github.com/juju/juju/state: undefined: cloudSettingsC [22:06] wallyworld: np, lmk if you have any more qs [22:07] will do ty [22:07] gave your pr +1 [22:09] wallyworld: ta === redir_afk is now known as redir [22:26] rick_h_: you around? [22:28] wallyworld, rick_h_is out today [22:28] ah, ok [22:28] wallyworld, send him mail [22:28] ywp, will do [22:28] he has been responding occasionally [22:29] wallyworld: for status I am using ModelName() and ControllerName() to get the local names, but how can I get cloud? [22:29] thumper: get ControllerDetails() [22:29] that should have cloud in it IIRC [22:29] k [22:30] yep, and it has region [22:30] gah [22:30] JujuConnSuite is a PITA [22:30] where are the tools it can use defined? [22:30] jeez, you just worked that out [22:30] I need a defined version number [22:30] a known, static version number [22:30] thumper: there's UploadFakeTools helpers maybe [22:31] i think that's what stuff uses [22:31] ugh... [22:31] too much hastle [22:31] it's ugly [22:31] just let me choose a known version [22:31] do you recall where they are? [22:31] patch current.Version [22:31] i can look [22:34] great, my ISP cannot properly route me to argentinian ubuntu mirrors but can do to us ones [23:33] yay the upgrade to xenial sort of worked [23:34] "sort of" :) [23:34] I had to finish it the old way [23:35] \o/ there is wisdom in old ways :)
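To flesh out the observer plan described at 21:58-22:05: the RPC layer reports events to a single observer, and one implementation simply fans out to many, so the existing RequestNotifier and a new audit recorder can both listen without the apiserver knowing about either. A hypothetical sketch follows; the interface, method names and package are made up, not juju's actual API.

    package observer

    import "time"

    // RequestHeader is a stand-in for whatever identifies an API request.
    type RequestHeader struct {
        RequestID uint64
        Facade    string
        Method    string
    }

    // Observer sees apiserver request/response traffic. RequestNotifier and an
    // audit observer would both implement it.
    type Observer interface {
        ServerRequest(hdr RequestHeader, body interface{})
        ServerReply(hdr RequestHeader, body interface{}, took time.Duration)
    }

    // Multiplexer fans one stream of events out to any number of observers.
    type Multiplexer struct {
        observers []Observer
    }

    func NewMultiplexer(observers ...Observer) *Multiplexer {
        return &Multiplexer{observers: observers}
    }

    func (m *Multiplexer) ServerRequest(hdr RequestHeader, body interface{}) {
        for _, o := range m.observers {
            o.ServerRequest(hdr, body)
        }
    }

    func (m *Multiplexer) ServerReply(hdr RequestHeader, body interface{}, took time.Duration) {
        for _, o := range m.observers {
            o.ServerReply(hdr, body, took)
        }
    }

Since the Multiplexer itself satisfies Observer, the RPC code keeps dealing with exactly one observer, which fits the "not a breaking change" framing above.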