[00:00] <davecheney> thumper: i've been staring at this one all afternoon, https://bugs.launchpad.net/juju-core/+bug/1475056
[00:00] <mup> Bug #1475056: worker/uniter/relation: HookQueueSuite.TestAliveHookQueue failure <juju-core:New> <https://launchpad.net/bugs/1475056>
[00:00] <davecheney> it happens super reliably for me
[00:00] <thumper> davecheney: with you shortly, writing big email
[00:00] <davecheney> thumper: that's ok
[00:00] <davecheney> no action needed
[00:01] <davecheney> just letting you know 'cos I missed standup
[00:06] <thumper> davecheney: ok
[00:08] <davecheney> i'm a bit worried
[00:08] <davecheney> i cannot see anything in the logic that the test actually guarantees
[00:08] <davecheney> ie, it's adding then removing a relation
[00:08] <davecheney> and hoping that happens fast enough that no events are generated
[00:08] <davecheney> this is, at best, a coincidence
[00:12] <thumper> haha
[00:12] <thumper> that's terrible
[00:12] <perrito666> davecheney: uff, that relies on the uniter being busy in a different path on the loop :|
[00:13] <perrito666> or not yet in it
[00:27] <mup> Bug #1475056 opened: worker/uniter/relation: HookQueueSuite.TestAliveHookQueue failure <juju-core:New> <https://launchpad.net/bugs/1475056>
[00:32] <wallyworld> menn0: is there a chance bug 1469077 is caused by the mgo/txn issue you are working on?
[00:32] <mup> Bug #1469077: Leadership claims, document larger than capped size <landscape> <leadership> <juju-core:Triaged> <juju-core 1.24:Triaged> <https://launchpad.net/bugs/1469077>
[00:33] <wallyworld> it has been raised again as an issue
[00:33] <wallyworld> for a 1.24 deployment
[00:33] <menn0> wallyworld: it's possible, not sure of the likelihood
[00:33] <menn0> wallyworld: do you know which collection has the out of control txn-queue fields?
[00:34] <wallyworld> not yet
[00:34] <menn0> wallyworld: also note that I'm not working on that one yet
[00:34] <menn0> wallyworld: i'm currently dealing with bug 1474195
[00:34] <mup> Bug #1474195: juju 1.24 memory leakage <cpec> <deployer> <performance> <regression> <juju-core:Triaged> <juju-core 1.24:In Progress by menno.smits> <https://launchpad.net/bugs/1474195>
[00:34] <menn0> wallyworld: that's going well
[00:34] <wallyworld> yay
[00:34] <wallyworld> cannot resume transactions: document is larger than capped size 1326012 > 1048576
[00:34] <wallyworld> is the only error i can see so far
[00:35] <wallyworld> doesn't say what collection
[00:35] <menn0> wallyworld: yeah, you need to look at the DB to see where the problem is
[00:35] <wallyworld> ok
[00:35] <wallyworld> happens writing the lease token, so could be leadership related
[00:36]  * perrito666 yells at bot
[00:39] <davecheney> thumper: which makes me wonder
[00:39] <davecheney> should I just delete the test ?
[00:40] <davecheney> there cannot be code relying on this behavior
[00:40] <davecheney> 'cos
[00:40] <davecheney> well
[00:40] <davecheney> the behaviour only exists in tests
[00:40] <davecheney> in real life
[00:40] <davecheney> there is no way this timing could exist
[00:41] <menn0> wallyworld: didn't we fix that problem already ... when we saw this before it was lease/leadership related too
[00:42] <thumper> davecheney: what is it testing exactly?
[00:43] <wallyworld> menn0: a fix was made to error if any concurrent change was made to leadership document. not sure though how the previous implementation or the current one would impact txn queue
[00:43] <wallyworld> s/leadership/lease
[00:43] <menn0> ok
[00:44] <menn0> I have no idea what's going on then
[00:44] <wallyworld> by error, i mean return with error rather than trying again
[00:44] <wallyworld> exit txn loop early
[00:44] <wallyworld> if anything, that should have helped the situation
[00:44] <wallyworld> so the bug was marked as incomplete
[00:45] <wallyworld> but was recently reported as the issue still occurs :-(
[00:46] <davecheney> test 0: Nothing happens if a unit departs before its joined is run
[00:49] <menn0> wallyworld: I think we need to point jam and fwereade at this one
[00:49] <wallyworld> yeah
[00:49] <wallyworld> i'll ping them later
[01:05] <axw> wallyworld: would you PTAL at http://reviews.vapour.ws/r/2154/ ?
[01:05] <wallyworld> sure
[01:09] <wallyworld> axw: looks ok, just a quibble
[01:09] <axw> wallyworld: ta
[01:16] <menn0> wallyworld, axw: is it important for there to be an assertion that the env is alive around createStorageOps?
[01:16] <wallyworld> yes
[01:16] <menn0> wallyworld, axw: I ask b/c it gets called as part of unit creation,  and we're trying to avoid that assertion when units are created
[01:16] <wallyworld> because storage costs $$
[01:16] <axw> wallyworld: yes, for persistent storage anyway. we don't want to destroy an environment while there's persistent storage around
[01:16] <wallyworld> well some
[01:16] <axw> err
[01:16] <axw> menn0: :)
[01:16] <wallyworld> yes, just persistent
[01:17] <wallyworld> axw: blonde moment - how could line 28 in this pastebin result in a nil pointer given that "ch" is used just above
[01:17] <wallyworld> http://pastebin.ubuntu.com/11885503/
[01:17]  * axw looking
[01:18] <menn0> wallyworld: so yes if there's persistent storage involved?
[01:18] <wallyworld> menn0: yeah
[01:18] <axw> wallyworld: ch.URL() dereferences the charmDoc.URL field
[01:18] <menn0> wallyworld: well that sucks b/c we can't fully remove this bottleneck then
[01:18] <axw> wallyworld: so if it's nil...
[01:18] <wallyworld> menn0: we want to avoid provisioning machines / volumes etc that could cost the user
[01:18] <menn0> wallyworld: yeah I understand
[01:18] <menn0> wallyworld: what actually provisions the storage/
[01:19] <menn0> wallyworld: maybe we can block it there
[01:19] <wallyworld> axw: sure, so why isn't the line number where the charm doc is then?
[01:19] <wallyworld> inside URL()
[01:19] <axw> wallyworld: show me the panic?
[01:19] <wallyworld> menn0: there's a storage provisioner
[01:20] <wallyworld> similar to machine provisioner
[01:20] <wallyworld> axw: http://data.vapour.ws/juju-ci/products/version-2882/aws-upgrade-trusty-amd64/build-2233/machine-0.log.gz
[01:21] <axw> wallyworld: I feel like I'm missing something, that panic points to the MigrateCharmStorage function
[01:21] <wallyworld> axw: yeah, it's in 1.22
[01:22] <axw> and not the state code
[01:22] <wallyworld> i had to move it
[01:22] <axw> I see
[01:22] <wallyworld> because we needed to use the raw collection
[01:23] <axw> wallyworld: not entirely sure, possibly inlining?
[01:23] <wallyworld> yeah could be, weird though
[01:24] <wallyworld> here's the new code https://github.com/juju/juju/blob/1.22/state/upgrades.go#L964
[01:24] <wallyworld> i'll do some digging
[01:24] <axw> yeah I found it, thanks
[01:24] <wallyworld> menn0: did you find it?
[01:25] <menn0> wallyworld: yep I found the storage provisioner
[01:25] <menn0> wallyworld: it'll be a bit of work to add watching of env life in there
[01:26] <wallyworld> menn0: it calls into state methods - may be able to modify one of those
[01:26] <menn0> wallyworld: I'll go for adding the assertion only for persistent storage
[01:27] <wallyworld> axw: ^^^ so if there's an EBS volume involved, that is bound to the machine, the above approach will be ok i think?
[01:28] <wallyworld> maybe it should assert the storage binding instead
[01:29] <wallyworld> or i mean do the assert if binding = env
[01:29] <wallyworld> but wait, this is 1.24
[01:29] <wallyworld> so will be different
[01:29] <axw> yes I think that'll work. machines will prevent env death, so machine-bound storage will be fine
[01:30] <wallyworld> so do that for 1.25
[01:30] <axw> menn0: why would you add env life watching?
[01:30] <menn0> axw: it's automatically added everywhere by the multi-env txn layer
[01:30] <menn0> axw: but that's created a massive perf bottleneck
[01:30] <menn0> so that's being ripped out
[01:30] <menn0> in favour of selectively adding it in a few key places
[01:31] <menn0> storage is one of those places
[01:31] <axw> menn0: understood, by why does that mean adding a watcher?
[01:31] <menn0> b/c we don't want someone to be able to add storage to an env just as it's dying
[01:32] <axw> menn0: the way things work atm with storage, we use cleanups to trigger death of storage when the bound-to entity dies
[01:32] <axw> menn0: so you destroy a machine with storage, then a cleanup is queued that destroys the attached storage
[01:33] <menn0> axw: but if storage is added as the env is dying and that txn takes a while to run the cleanup could miss it
[01:33] <wallyworld> axw: that cleanup is only in master though from memory
[01:33] <menn0> axw: but I guess the machine or unit will be dead so the txn will probably still fail
[01:34] <axw> menn0: the env can't die while there's still machines right?
[01:34] <axw> hrm
[01:34]  * axw ponders
[01:34] <wallyworld> we need a 1.24 solution too
[01:35] <anastasiamac> clear
[01:35] <anastasiamac> oops
[01:37] <axw> menn0: if storage is added, then its life will be set to Dying by the cleanup regardless of whether it's been provisioned
[01:37] <menn0> axw: yes it can
[01:37] <menn0> axw: the first thing that happens is the env is set to Dying and then machines and everything else get killed off
[01:38] <axw> but you're saying that the txn that adds the storage may happen after the cleanup...
[01:38] <perrito666> ah wonderful a test that only breaks when run non isolated....
[01:38] <menn0> axw: there's a slim chance that it could
[01:39] <menn0> axw: right now that's not possible because we have an automatically added env life assertion on almost all txns
[01:40] <menn0> axw: but that's going away
[01:40] <menn0> axw: seems like adding an extra check in the storage provisioner before it does anything might be sensible?
[01:40] <axw> menn0: sorry I mistyped before: the env can't be *removed* until there's no machines? i.e. it can go to Dying, but can't be Removed until the dependents  are gone?
[01:41] <axw> hrmph still doesn't really help
[01:41] <menn0> axw: yes that's right
[01:41] <axw> menn0: we're going to have this problem with the machine provisioner too right?
[01:42] <menn0> no because the machine addition ops now include an explicit env life assertion
[01:42] <menn0> (but only for top level machines, not containers)
[01:43] <axw> menn0: so why can't we do that in storage? they're no more plentiful than machine addition ops
[01:43] <menn0> we don't want that for units though because units are often added in huge bulk (this is where users are seeing the current bottleneck)
[01:43] <menn0> axw: b/c storage ops get added as part of unit addition
[01:43] <axw> menn0: I think we could do it for machine storage (volumes, filesystems), but not storage instances
[01:43] <menn0> my unfamiliarity with storage is probably not helping here :)
[01:43] <axw> so do machines, except if you're using --to
[01:44] <menn0> axw: b/c I'm slow can you please summarise :)
[01:45] <axw> menn0: if a charm requires storage, then adding a unit will add a "storage instance". that will cause the creation of either a volume or filesystem when the unit is assigned to a machine
[01:46] <axw> menn0: a volume can be e.g. a loop device, or an EBS volume
[01:46] <menn0> ok
[01:47] <axw> menn0: actually we never create storage without an accompanying machine, so if the machine is prevented due to env being Dying, then we're fine
[01:47] <axw> menn0: the storage provisioner won't create a volume or filesystem until the due-to-be-attached machine is provisioned
[01:47] <menn0> axw: ok that sounds promising then
[01:48] <menn0> axw: I think you were hinting at this before, but what about when a unit is added to an already provisioned machine
[01:48] <axw> menn0: so I think we can drop the env life checks in storage
[01:48] <axw> ah yeah
[01:49] <axw> :|
[01:49] <menn0> axw: I guess the machine will be dying or about to die if the env is going down
[01:49] <menn0> axw: and that should clean up the storage?
[01:50] <axw> menn0: it will... but only if the storage is bound to the machine. there's a concept of lifecycle binding, where storage is bound to either a unit/service, a machine, or the environment
[01:50] <menn0> axw: also, won't the storage provisioner itself die if the env goes to dying
[01:50] <axw> menn0: currently we're fine because we always bind to either the unit, service or machine
[01:51] <axw> menn0: there was an intention of binding storage to env initially if marked persistent though
[01:52] <axw> menn0: I hope the worker would continue to run until the env is removed, not just Dying
[01:52]  * menn0 checks 
[01:53] <axw> menn0: otherwise the provisioner won't clean up any remaining things
[01:55] <menn0> axw: so it looks like we're ok because there isn't a storage provisioner per env
[01:55] <menn0> axw: it's not run under the envWorkerManager
[01:56] <axw> menn0: ok, cool
[01:56] <menn0> axw: the worker is up until the machine agent dies
[01:56] <menn0> the storage provisioner worker I mean
[01:57]  * axw nods
[01:57] <menn0> axw: ok so it looks like we don't need env life assertions for the state stuff in storage then
[01:57] <axw> menn0: so... I think we're ok unless/until we allow storage to be created that is bound to an env
[01:58] <axw> menn0: currently not the case, so we're fine atm
[01:58] <menn0> axw: we can do the assert only for the case where storage is bound to the env
[01:58] <axw> yep, that should be fine
[01:58] <menn0> axw: which will be a fairly low frequency event I imagine so not a performance issue
[01:59] <axw> yes I think so
[01:59] <menn0> axw: thanks for your help
[02:00] <axw> menn0: nps, thank you for fixing. sounds messy :)
[02:00] <menn0> axw: it is
[02:00] <wallyworld> axw: found out why charm url is nil - serialisation changed between 1.20 and 1.22. which also means charm migration is broken in general and we didn't notice because migration function was never called
[02:01] <axw> wallyworld: :(
[02:01] <wallyworld> fixing now :-)
[02:12] <thumper> menn0: based on axw's points above, we should at least get together to talk about environment destruction
[02:12] <thumper> menn0, axw: because I feel that we hav some bad interactions
[02:12] <thumper> and I'd like to check
[02:47] <menn0> thumper: sure.
[02:47] <menn0> thumper: now?
[02:47] <thumper> not just now, Rachel is arriving home shortly and I'll be stopping for coffee
[02:47] <thumper> but perhaps in 30-40 minutes?
[02:48] <menn0> thumper: sure just let me know
[02:48] <menn0> thumper: with axw too?
[03:16] <thumper> axw: have you got some time?
[03:36] <thumper> waigani: how goes environment destroy?
[03:37] <waigani> thumper: merging cli command to jes-cli branch now.
[03:37] <waigani> thumper: and writing an email to Will to review the environ.Destroy branch
[03:37] <thumper> kk
[03:37] <thumper> coolio
[03:39] <waigani> thumper: Will usually starts around 8, so I'll check in with him this evening and hopefully finish off / land tonight.
[03:40] <thumper> cool
[03:40] <thumper> wallyworld: any idea if master is capable of being blessed at the moment? or are there known failures?
[03:41] <wallyworld> thumper: not sure, i'd have to look at build logs
[03:41] <wallyworld> i don't know of any failures
[03:41] <thumper> there was a windows issue at some stage
[03:42] <thumper> has that all been fixed now?
[03:42] <waigani> there's an open critical bug on 1.25
[03:42] <waigani> #1468815
[03:42] <mup> Bug #1468815: Upgrade fails moving syslog config files "invalid argument" <ci> <regression> <upgrade-juju> <juju-core:Triaged> <juju-core 1.24:Fix Released by ericsnowcurrently> <https://launchpad.net/bugs/1468815>
[03:46]  * thumper sighs
[03:46] <thumper> why has it not been forward ported?
[04:03] <menn0> thumper, wallyworld: I have a likely fix to bug 1474195 ready... although I need to talk env destruction with thumper
[04:04] <mup> Bug #1474195: juju 1.24 memory leakage <cpec> <deployer> <performance> <regression> <juju-core:Triaged> <juju-core 1.24:In Progress by menno.smits> <https://launchpad.net/bugs/1474195>
[04:04] <wallyworld> great
[04:04] <thumper> I'm waiting for axw before we talk destruction
[04:04] <thumper> menn0: I can look at the fix if you like
[04:05] <menn0> thumper: pushing now
[04:11] <menn0> thumper: https://github.com/juju/juju/pull/2801
[04:11]  * thumper looks
[04:16] <thumper> menn0: for the machine insertion
[04:16] <thumper> menn0: does that method also do the containers
[04:16] <thumper> ?
[04:16] <thumper> or is there a different one to add containers
[04:17] <thumper> as I thought we were going to skip the alive assertion for containers
[04:17] <menn0> a different one does containers
[04:17] <thumper> kk
[04:17] <menn0> see the docstring at the top of the method I added the assert to
[04:18] <menn0> wallyworld, thumper: any tips for debugging an lxc container that is stuck in "pending"?
[04:18] <menn0> I can't ssh to it
[04:18] <menn0> and lxc-console gives me nothing
[04:18] <wallyworld> menn0: the logs are available locally
[04:18] <thumper> menn0: look here: /var/lib/juju/containers/...
[04:18] <wallyworld> /var/lib/lxc/blah/rootfs
[04:19] <wallyworld> then look at cloud init logs
[04:19] <thumper> and also where wallyworld said
[04:19] <menn0> wallyworld: thanks, i'll look there
[04:19] <thumper> the cloud init logs are in the /var/lib/juju/containers dir
[04:19] <thumper> menn0: shipit
[04:20] <menn0> thumper: what about your concerns?
[04:20] <thumper> this branch doesn't touch the concerns I have
[04:20] <menn0> ok great
[04:20] <thumper> any bad thing we are doing, we are already doing
[04:21] <thumper> which is why I think we need to talk to axw about environment destruction of hosted environments
[04:21] <thumper> because we are going "bullet to the head" on all the machines, then removing all the docs
[04:21] <thumper> what impact is this going to have for any attached storage
[04:21] <menn0> thumper: I want to do some manual performance comparisons and if it looks like things are faster then I'll merge
[04:24] <menn0> thumper, wallyworld: this appears to be why that container didn't start: http://paste.ubuntu.com/11886010/
[04:24] <menn0> any clues?
[04:24] <thumper> I'm guessing this line: WARN     lxc_start - start.c:signal_handler:307 - invalid pid for SIGCHLD
[04:24] <thumper> NFI why though
[04:25]  * menn0 is googling
[04:29] <wallyworld> menn0: yeah, NFI sorry
[04:30] <menn0> this looks like the bug (a race) but it was fixed in lxc 1.0.0-alpha2: https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
[04:30] <mup> Bug #1168526: race condition causing lxc to not detect container init process exit <bot-stop-nagging> <linux (Ubuntu):Confirmed> <lxc (Ubuntu):Fix Released> <https://launchpad.net/bugs/1168526>
[04:37] <thumper> menn0: what version of lxc do you have?
[04:38] <menn0> thumper: 1.1.2 (stock vivid)
[04:38] <menn0> (as far as I know)
[04:38] <thumper> so not fix released then...
[04:38] <menn0> thumper: ?
[04:38] <thumper> menn0: try #lxcontainers
[04:39] <menn0> thumper: I will
[04:39] <thumper> menn0: because it is happening to you...
[04:39] <menn0> out of 10 containers 1 failed
[04:39] <menn0> but this happened earlier today and yesterday as well
[04:39] <thumper> yeah, but we create a lot of containers
[04:48] <thumper> wallyworld: master curse seems to be : bad record MAC, mongo not coming up, and intermittent failure collecting metrics in the uniter suite
[04:48] <wallyworld> sigh
[04:49] <wallyworld> those would all be intermittent right
[04:49] <thumper> yup
[04:50] <wallyworld> i'll look at the logs when i can
[05:16]  * thumper heading off until meeting later tongith
[06:43] <wallyworld> axw: could you look at http://reviews.vapour.ws/r/2181/ when you get a chance? it looks larger than it is because i reverted the move done previously
[06:51] <axw> wallyworld: ok
[06:52] <wallyworld> ty
[06:55] <axw> wallyworld: LGTM
[06:56] <wallyworld> ty
[07:13] <mup> Bug #1475163 opened: when the uniter fails to run an operation due to an error, the agent state is not set to "failed" <juju-core:Triaged by wallyworld> <juju-core 1.24:In Progress by wallyworld> <https://launchpad.net/bugs/1475163>
[07:22] <wallyworld> axw: and one more sorry http://reviews.vapour.ws/r/2184/
[07:42] <axw> wallyworld: reviewed
[08:02] <axw> wallyworld: machine provisioning and hook errors are a bit different: they're coming from the IaaS provider and the hook execution respectively. Maybe I misunderstood, but it sounded like these errors might include, say, errors talking to the API server
[08:03] <wallyworld> axw: yeah, could be those. i think your idea not to include is good
[08:03] <wallyworld> fixing patching will be more work, but i have soccer now so will do later
[08:04] <axw> wallyworld: ok. I have to go out soon anyway, so will check later
[09:03] <jam> fwereade: dimitern: standup ?
[09:11] <mup> Bug #1475212 opened: Environment destroy can miss manual machines and 	persistent volumes <juju-core:New> <https://launchpad.net/bugs/1475212>
[10:01] <jam> fwereade: so I'm supposed to be in a call now, but he's not arrived yet. So on the concept of Token being reusable...
[10:01] <dimitern> dooferlad, TheMue, fwereade, jam, sorry guys for missing standup - I had to renew my car insurance in the morning, but it took more time than expected :/
[10:02] <jam> dimitern: no worries
[10:02] <dooferlad> dimitern: jam just said my standard response, so ^^
[10:02] <fwereade> jam, listening
[10:03] <dimitern> I discovered yesterday, after wasting almost a full day, that running go test with both -race and -cover (or -coverprofile=) *itself* leads to races!
[10:04] <TheMue> dooferlad: hehe, maybe the number of calls will get negative when passing a black hole. but you're right, a bool flag would be enough
[10:04] <dooferlad> dimitern: well, that sucks
[10:04] <dimitern> supposedly fixed in go 1.3+; it can be worked around by also adding -covermode=atomic (which is the default behaviour in 1.3+)
[10:04] <dooferlad> TheMue: I was more thinking about uint
[10:05] <dooferlad> TheMue: but yes, types matter and sometimes we live with inappropriate choices
[10:05] <perrito666> Morning
[10:05] <dimitern> i'll send this to juju-dev as well, just in case I can save somebody else from the same experience
[10:06] <jam> fwereade: he showed, sorry. I did want to overview of how I felt tokens should work.
[10:07] <fwereade> jam, just braindump whenever you get the chance :)
[10:37] <dimitern> TheMue, hey
[10:38] <TheMue> dimitern: heya
[10:38] <dimitern> TheMue, didn't we discuss using bulk client-side api calls for the addresser?
[10:38] <dimitern> TheMue, like RemoveIPAddresses taking params.Entites and returning params.ErrorResults, error, rather than forcing the worker to remove them one by one?
[10:39] <TheMue> dimitern: have to take look in my notes
[10:40] <TheMue> dimitern: we talked about where the "work" has to be done when I suggested that e'thing could be done via one call on server-side
[10:41] <dimitern> TheMue, I don't insist on doing it now (just the addresser using api instead of state is already a big improvement, esp. around the entity watcher), but it seems to me it will be slightly better
[10:41] <mup> Bug #1455628 changed: TestPingTimeout fails <ci> <intermittent-failure> <lxc> <test-failure> <unit-tests> <vivid> <juju-core:Triaged> <https://launchpad.net/bugs/1455628>
[10:41] <mup> Bug #1456726 opened: UniterCollectMetrics fails <ci> <tech-debt> <juju-core:Triaged> <juju-core 1.22:Triaged> <https://launchpad.net/bugs/1456726>
[10:41] <TheMue> dimitern: when I asked why we need an API usable only for the worker and providing its calls
[10:42] <dimitern> TheMue, that's what *all* our apis are doing anyway :)
[10:42] <anastasiamac> dimitern: tyvm for being adventurous and running tests with 2 flags not one :D
[10:43] <dimitern> TheMue, however I see your point - we should (re)use better defined api interfaces across multiple workers/etc.
[10:43] <TheMue> dimitern: and I oriented at your instancepoller, which is acting on one machine each too
[10:43] <dimitern> anastasiamac, I'm even using -check.v :D
[10:43] <anastasiamac> dimitern: \o/
[10:43] <TheMue> dimitern: that's why I implemented the IPAddress(Proxy) as type
[10:44] <TheMue> dimitern: but n.p., I simply can change it, one tiny missed gofmt dislikes my try to merge, hehe
[10:45] <dimitern> TheMue, yes, as it was easiest to do - gradual improvement over using state directly, but from a design perspective we can do better for such workers where it makes more sense to batch multiple ops in a single api call
[10:45] <TheMue> and I thought I ran my pre-commit check *grmfplx*
[10:46] <dooferlad> TheMue: your git client doesn't auto-run the pre-commit hook?
[10:46] <dimitern> TheMue, so I suggest you go ahead and still land this (if you can perhaps add a TODO somewhere in the code we can improve the behavior by using bulk calls)
[10:47] <TheMue> dooferlad: different environment here, as you know. script integration didn't work, so I integrated it into my jdt (juju development tool)
[10:47] <TheMue> dimitern: ok, will do so
[10:48] <dimitern> TheMue, cheers
[10:48] <dooferlad> TheMue: Clearly you need to switch clients :p
[10:49] <TheMue> dooferlad: it's not the client, it's more complex. will show you when having our next meeting.
[10:51] <dooferlad> dimitern: is the logic behind addSubnetsCache just to speed things up? Isn't state fast enough and the canonical source of information?
[10:53] <dimitern> dooferlad, the main reason for its existence is to improve the case when multiple subnets are added in the same API call
[10:55] <dimitern> dooferlad, so I guess it might be actually moot if we don't allow users to add multiple subnets with the CLI (unless we add an "import these subnets definitions as a batch" thing, which was discussed at some point)
[10:56] <dooferlad> dimitern: if we have the ability at some point to dump the output of juju status to a file, then load that back, then yes we will benefit.
[10:57] <dimitern> dooferlad, ewww.. yeah, I got your point :) but we'll have state deltas before that happens most likely
[10:59] <dooferlad> dimitern: I mostly don't like caches because if somebody does something unexpected to what they are caching you can have "fun" finding bugs. In this case though, I was looking at it in terms of what I needed to do for space create.
[10:59] <dimitern> (just imagined having to parse a moving target like the status yaml output)
[10:59] <dooferlad> dimitern: which seems to be, not caching.
[11:00] <dimitern> dooferlad, for space create I don't think you need to do it the same way
[11:00] <dooferlad> dimitern: +1
[11:00] <dimitern> dooferlad, I've realized addSubnetsCache now looks totally over-engineered to me :/
[11:01] <dooferlad> dimitern: well, I am sure it was fun engineering, so I am not worrying!
[11:02] <dimitern> dooferlad, you bet :)
[11:05] <alexisb> fwereade, jam leads call
[11:20] <mattyw> TheMue, not tried lfe yet, but it's on my list of things to try
[11:21] <TheMue> mattyw: it has a nice approach for lisplers, but it never will get a larger community *sigh*
[11:22] <TheMue> dooferlad: btw, just found why my pre-commit failed. only one missing line
[11:22] <mattyw> TheMue, I'd love to have sessions at sprints where we can just hack on stuff
[11:22] <mattyw> TheMue, maybe we should make the time this sprint
[11:24] <TheMue> mattyw: definitely would broaden the experience with different approaches, avoiding getting routine-blinded
[12:14] <perrito666> morning all
[12:14] <thumper> crap, perrito666 is back, time to go
[12:17] <mup> Bug #1475271 opened: Intermittent test failure UniterSuite.TestUniterCollectMetrics <intermittent-failure> <test-failure> <juju-core:Triaged by cmars> <https://launchpad.net/bugs/1475271>
[12:20] <perrito666> I see thumper does the same as I do to figure out EOD
[12:28] <perrito666> has anyone noticed we are getting curses for no space left on device? mgz sinzui ?
[12:29] <mgz> yeah, I see the vivid build failing
[12:35] <mgz> we have tests running in the current still though, so I was not in a rush to retest
[12:57] <jam> fwereade: ok. pie in the sky: how tokens work feels like you would get the token at Auth checks, and then apply that token to each operation you do. I feel like token failures are the sort of thing that wouldn't need to be retried if we knew they were the cause of the failure.
[12:58] <fwereade> jam, right
[12:58] <jam> For example, if I was leader, and I said X, then I failed to be the leader for a while, then I was leader *again*, my original X should actually be invalid.
[12:58] <fwereade> jam, we could implement it like that but I'm not sure I think it's good
[12:59] <fwereade> jam, tokens have to be reusable anyway
[12:59] <fwereade> jam, other ops will cause ErrAborted
[12:59] <fwereade> jam, next time through the buildTxn func we need to check again
[12:59] <jam> fwereade: everything causes ErrAborted right? So we can't distinguish the why
[12:59] <fwereade> jam, but we have to distinguish why
[12:59] <fwereade> jam, hence the form of Runner.Run()
[13:00] <fwereade> jam, refusing to check again once a token's failed might be an interesting optimisation
[13:01] <fwereade> jam, but not relevant for my purposes because I'll be returning the error as soon as I get one
[13:01] <jam> fwereade: so I don't quite see how Token.Read() isn't reusable.
[13:01] <sinzui> perrito666: I just woke up and yes I am disappointed. The machine only had to live for 4 more days
[13:01] <fwereade> jam, it's a single snapshot of past state
[13:02] <fwereade> jam, unless it's able to get fresh state and return an error, it will push everything into ErrExcessiveContention
[13:02] <fwereade> jam, by returning the same (failing) txn ops
[13:02] <fwereade> jam, always corresponding to the reality-check that's now several cycles in the past
[13:02] <jam> fwereade: so is the use case that my leadership cert expired and I renewed it?
[13:03] <fwereade> jam, it is to catch the situation when the leadership lease expires and is removed while some other component is running a txn that depends on it
[13:05] <fwereade> jam, that other component (should!) have the looping form, in which it starts off using recent state from db or memory, interrogates that state for reasons to fail, then packages it up as asserts and sends it on to execute
[13:05] <fwereade> jam, the txn fails
[13:05] <fwereade> jam, what went wrong?
[13:05] <fwereade> jam, we need to read current leadership state to be able to pin it on that
[13:05] <fwereade> jam, sane?
[13:06] <jam> fwereade: so I agree that we want to be able to read the current state at some point, but I worry that we'll read the current state and apply it as the new "its ok to do this as long as this holds true"
[13:06] <jam> fwereade: so you want *a* token that says "the person who is making this request is the current leader"
[13:07] <fwereade> jam, no
[13:07] <fwereade> jam, I want a token that will, on request, tell me whether a unit is leader
[13:07] <fwereade> jam, existence of a token implies nothing
[13:08] <fwereade> jam, Check()ing a token implies that the fact the token is attesting to was recently true
[13:08] <fwereade> jam, passing an out ptr into check gives you a very specific tool that allows you to check whether it still holds true in the future
[13:08] <jam> fwereade: so your Token interface only has Read()
[13:08] <fwereade> jam, sorry, I renamed it Check
[13:08] <fwereade> jam, otherwise the same
[13:10] <fwereade> jam, and those still-hold-in-the-future things are critically important; but yes, I don't know how best to encourage people to use mgo/txn correctly :(
[13:10] <jam> fwereade: so I think you're saying that Auth wants to return a Checker (and possibly calls it one time), but that the Checker is part of the inner loop
[13:10] <fwereade> jam, yeah
[13:11] <fwereade> jam, the initial call is technically redundant, am undecided, leaning towards not having it
[13:11] <fwereade> jam, most/all the actual use of the Token will be inside state
[13:11] <jam> fwereade: from an Auth func it is nice to fail early
[13:12] <jam> SetStatus failing immediately with "you're not the leader" rather than waiting until it goes to update the DB with an actual change?
[13:12] <fwereade> jam, agreed, there are forces pushing both ways :)
[13:12] <fwereade> jam, it won't try to run a txn...
[13:12] <fwereade> jam, I contend that constructing a txn is much cheaper than running one
[13:13] <jam> fwereade: I certainly agree that stuff in memory vs once you've written it to the DB
[13:13] <fwereade> jam, so what it will do is one up-to-date leadership check, and then hand over the ops representing it
[13:13] <jam> it seems a little funny to have something like GetAuth not actually have checked your auth on the assumption that once you've actually processed the request you'll have finally checked they're allowed.
[13:15] <fwereade> jam, point taken, but I think it follows from the mgo/txn dependency
[13:16] <fwereade> jam, technically, any auth that isn't checked *at txn time* is leaky
[13:17] <fwereade> jam, when working in state we just have to ...embrace the madness, and use the techniques that are reliable in this context :)
[13:23] <fwereade> perrito666, http://reviews.vapour.ws/r/2185/ ?
[13:23] <fwereade> perrito666, and whatever the other branch is
[13:23] <fwereade> perrito666, does statusDoc have txn-revno or txn-queue fields?
[13:25] <perrito666> fwereade: arent those added by txn?
[13:25] <fwereade> perrito666, yes
[13:26] <fwereade> perrito666, unless you have those fields specified in your doc, [$set, doc] is fine
[13:26] <perrito666> fwereade: sorry I got distracted by watching a singer called ladybeard... oddly hypnotizing
[13:26] <fwereade> heh
[13:26] <fwereade> good name :)
[13:27] <perrito666> bearded man in japanese 5yo girl costume singing metal version of jpop songs, amazing
[13:28] <perrito666> fwereade: this attacks the immediate issue with envuuid for this particular collection while a better fix is being worked on for envuuid auto-adding on Updates
[13:29] <fwereade> perrito666, what makes you believe it changes anything?
[13:29] <fwereade> perrito666, you have inserted a comment that is a straight-up lie
[13:29] <perrito666> oh?
[13:30] <fwereade> perrito666, https://bugs.launchpad.net/juju-core/+bug/1474606/comments/1
[13:30] <mup> Bug #1474606: Document replacements using $set are problematic <juju-core:Triaged by menno.smits> <juju-core 1.24:Triaged by menno.smits> <https://launchpad.net/bugs/1474606>
[13:30] <perrito666> it is a partial lie, if I insert that doc as is it wipes envuuid
[13:31] <fwereade> perrito666, ok, so you're saving a doc with an empty env-uuid field
[13:32] <fwereade> perrito666, why do you not know the env-uuid?
[13:32] <fwereade> perrito666, ohhh, right
[13:32] <perrito666> fwereade: I might need to change the var name so the comment is not confusing
[13:32] <perrito666> do not insert That doc
[13:32] <perrito666> :)
[13:32] <fwereade> perrito666, this just makes me more adamant that it's the leaky multiEnv stuff that is the problem
[13:33] <fwereade> perrito666, ok, so
[13:33] <fwereade> perrito666, that comment is certainly not accurate re txn
[13:35] <fwereade> perrito666, and re env-uuid
[13:35] <fwereade> perrito666, can we not just drop the dependency on the env-uuid field and take them off all the doc structs?
[13:38] <perrito666> fwereade: I honestly do not know, I wouldn't think so
[13:38] <fwereade> perrito666, well, we definitely can
[13:39] <fwereade> perrito666, it's more "should we"?
[13:40] <fwereade> perrito666, and the more I think the more I think "yes of course we should, it would take a day at the outside"
[13:40] <fwereade> perrito666, counterpoint?
[13:41] <fwereade> perrito666, which might mean 3 days in practice
[13:41] <fwereade> perrito666, but how much dev time have these sorts of issues cost us already?
[13:44]  * perrito666 sits like a rubber duck
[13:45] <fwereade> perrito666, haha
[13:45] <fwereade> perrito666, so looking through state for EnvUUID it really doesn't seem like it's even used most of the time
[13:46] <fwereade> perrito666, it exists only for the convenience of the multi-env layer
[13:46] <fwereade> perrito666, but it also breaks the multi-env layer because you have to pay attention to that field all the time
[13:47] <fwereade> perrito666, so
[13:48] <fwereade> perrito666, if the multi-env layer just converted *everything* into bson.D *before* rewriting
[13:48] <fwereade> perrito666, no more need for the fields
[13:48] <fwereade> perrito666, right?
[13:49] <fwereade> perrito666, there may be a couple of relevant fields we should keep
[13:49] <fwereade> perrito666, but they're very much the minority
[13:49] <fwereade> perrito666, quack. quack quack?
[13:51] <fwereade> perrito666, and then we'd be able to insert docs that weren't pointers
[13:51] <fwereade> perrito666, and we wouldn't have that scary surprising leakage out to the original docs either
[13:52]  * perrito666 re-reads
[13:52] <fwereade> perrito666, (and my lease stuff would Just Work without having to know it's in a multi-env collection, too)
[13:52] <perrito666> fwereade: ok a couple of things
[13:52] <perrito666> 1st are you sure no one is working on anything whatsoever heavily dependent on this?
[13:54] <perrito666> 2nd, even though I believe in the empirical proof you showed me, in the original discussion about txn a link arose http://stackoverflow.com/questions/24455478/simulating-an-upsert-with-mgo-txn/24458293#24458293 which has gustavo saying it shouldn't
[13:54] <fwereade> perrito666, that's my reading of it; I see 21 uses of .EnvUUID in state, and most of them are irrelevant
[13:55] <perrito666> I was rather wondering about work in process
[13:58] <fwereade> perrito666, in that link, where does gustavo suggest you shouldn't $set a struct?
[13:59] <wwitzel3> axw: I'll handle the forward porting of that issue
[14:00] <wwitzel3> axw: well, the patch to master that is
[14:00] <perrito666> fwereade: the final paragraph seems to be implying it
[14:02] <fwereade> perrito666, (1) "you can set every field in a value by offering the value itself to $set"
[14:03] <fwereade> perrito666, (2) "If you replace the whole document with some custom content, these fields will go away"
[14:03] <fwereade> perrito666, they are talking about different situations
[14:03] <perrito666> fwereade: I see
[14:03] <perrito666> that might have caused the misunderstanding
[14:05] <fwereade> perrito666, yeah, it could be clearer
[14:06] <fwereade> perrito666, in particular it *is* dangerous to do a $set with any of our doc types that include a txn-revno
[14:06] <fwereade> perrito666, so we do need to keep an eye out for that
[14:07] <fwereade> perrito666, but that's more a matter of watching the doc definitions, and only allowing TxnRevno when it's *really* necessary, and commenting it clearly
[14:34] <fwereade> katco, do you have any time to review http://reviews.vapour.ws/r/2186/ ?
[14:34] <katco> fwereade: today is my meeting day :(
[14:35] <fwereade> katco, ah bother, not to worry
[14:41] <jam> fwereade: I've been reading through https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html and it doesn't feel like a great fit, as when you subscribe to a topic you pass an HTTP callback URL. We could do that internally but it does feel a bit odd. Certainly I don't really expect to have general routing back to a client outside of the current connection.
[14:43] <katco> jam: get in touch with https://github.com/go-kit/kit. they are actively soliciting feedback on features like this
[14:44] <fwereade> jam, agreed
[14:44] <katco> jam: doh, nm: in the "Non-goals" Supporting messaging patterns other than RPC (in the initial release) — pub/sub, CQRS, etc
[14:44] <jam> :)
[14:44] <jam> fwiw, I rather like https://github.com/grpc/grpc
[14:44] <jam> but it feels like we're rewriting our communication infrastructure a bit too much at that point.
[14:44] <davecheney> jam i agree
[14:45] <jam> there is https://godoc.org/google.golang.org/cloud/pubsub which is less about the HTTP aspects
[14:45] <jam> though IIRC it is strictly a client for Google's cloud pub/sub and not a server implementation.
[14:50] <fwereade> jam, btw, 2172 has been superseded by reviews.vapour.ws/r/2186/ which has new-style Token
[14:51] <fwereade> jam, so, yeah, doesn't sound like very rich pickings
[14:55] <fwereade> perrito666, LGTM
[14:56] <mup> Bug #1475341 opened: juju set always includes value when warning that already set <juju-core:New> <https://launchpad.net/bugs/1475341>
[14:57] <perrito666> fwereade: it sounds like a more sincere comment :)
[14:57]  * perrito666 is tempted to have a happy meal for lunch just to get a new minion toy
[14:58] <fwereade> perrito666, can I hit you up for a review on reviews.vapour.ws/r/2186/ please?
[14:58]  * perrito666 looks
[14:59] <fwereade> perrito666, it's just a rework of the leadership interfaces such that my stuff and katco's has matching interfaces (well, at least they both implement Claimer)
[14:59] <fwereade> perrito666, cheers
[14:59]  * perrito666 sees the length of the review and realizes it was quite literal :p
[15:02] <alexisb> davecheney, what part of the world are you in right now?
[15:03]  * perrito666 tries to acquire a second monitor of the same model as the one he has and notices the price is exactly double what he paid less than a year ago :p inflationary countries are fun
[15:04] <katco> wwitzel3: 1:1
[15:04] <davecheney> alexisb: san fran
[15:05] <davecheney> damnit, i missed the opportunity to say i was omnipresent
[15:06] <alexisb> heh
[15:06]  * perrito666 looks over his shoulder just to make sure davecheney isnt
[15:06] <davecheney> i'm watching, always watching
[15:06] <fwereade> perrito666, it's almost all renames
[15:06] <perrito666> fwereade: ?
[15:07] <perrito666> ah the review
[15:07] <fwereade> perrito666, the big review
[15:07] <fwereade> perrito666, probably start with leadership/interface.go
[15:07] <perrito666> I would kill for threaded conversations on irc
[15:07] <fwereade> perrito666, sorry, I should have said that in the blurb
[15:14] <perrito666> fwereade: for starters I would like the pr description to say more why than what
[15:15] <perrito666> by reading the code I can assert that you did exactly what that list of changes says, but I am not sure I'll be able to say what the end result of it is.
[15:22] <fwereade> perrito666, heh, good point
[15:29] <sinzui> perrito666: is bug 1474606 fix committed in 1.24?
[15:29] <mup> Bug #1474606: Document replacements using $set are problematic <juju-core:Triaged by menno.smits> <juju-core 1.24:Triaged by menno.smits> <https://launchpad.net/bugs/1474606>
[15:32] <perrito666> sinzui: no, just a partial for 1.24 and master
[15:32] <sinzui> thank you perrito666
[15:32] <perrito666> sinzui: that is why I did not change anything on it
[16:16] <TheMue> so /me says goodbye, daughter has graduation ball today *proud-daddy-mode*
[16:17] <dooferlad> TheMue: congratulations to you both!
[16:17] <perrito666> TheMue: congrats man :) have fun
[16:20] <wwitzel3> ericsnow: ping
[16:20] <ericsnow> wwitzel3: hey
[16:20] <wwitzel3> ericsnow: hey, is there anything you think you can break off of what you are doing or should I look in to destroy?
[16:21] <ericsnow> wwitzel3: halfway through this yak :/
[16:21] <ericsnow> wwitzel3: so maybe you had better
[16:21] <ericsnow> wwitzel3: it will depend on my state patch
[16:46] <bdx> hello everyone
[16:47] <bdx> core: anyone familiar with this error showing up in the cloud-init-output.log on bootstrap node?
[16:47] <bdx> core: 2015-07-16 16:43:47 ERROR juju.cmd supercommand.go:430 relative path in ExecStart ($MULTI_NODE/usr/lib/juju/bin/mongod) not valid
[16:48] <bdx> then 2015-07-16 16:43:47 ERROR juju.cmd supercommand.go:430 failed to bootstrap environment: subprocess encountered error code 1
[16:48] <bdx> and bootstrapping fails after
[16:48] <bdx> grr
[17:01] <perrito666> ericsnow: ping?
[17:01] <ericsnow> perrito666: hi
[17:01] <perrito666> hi :D
[17:02] <perrito666> hey, are you still the reviewboardmonger?
[17:02] <ericsnow> perrito666: depends on what you need :)
[17:03] <perrito666> I was wondering if I could see the logs for rb, I find the javascript for the comment/response textbox failing too often and the browser console says it's an API call failing to respond
[17:03] <perrito666> also the js might need to be uncompressed
[17:04] <perrito666> some paths fail with a syntax error
[17:05] <ericsnow> perrito666: its the reviewboard service in the juju-ci4 env
[17:05] <ericsnow> perrito666: I can take a look but not quite yet
[17:05] <perrito666> no hurry just had the issue while we are in the same TZ so didnt want to let it pass
[17:12] <mup> Bug #1475386 opened: unit not dying after failed hook + destroy-service <juju-core:New> <https://launchpad.net/bugs/1475386>
[17:14] <rick_h_> NOTICE: jujucharms.com is having a webui outage due to a failed redis. Charm deploys should work as normal and the API is available.
[17:21] <mup> Bug #1475386 changed: unit not dying after failed hook + destroy-service <juju-core:New> <https://launchpad.net/bugs/1475386>
[17:21] <natefinch> bdx: I think the problem is that $MULTI_NODE is not getting expanded
[17:22] <natefinch> bdx: or not set or set weirdly
[17:23] <natefinch> bdx: kind of a terrible error message, sorry about that
[17:24] <mup> Bug #1475386 opened: unit not dying after failed hook + destroy-service <juju-core:New> <https://launchpad.net/bugs/1475386>
[17:37] <rick_h_> NOTICE: jujucharms.com webui is back up
[17:48] <davechen1y> rick_h_: \o/
[17:51] <natefinch> sinzui: I'm trying to reproduce https://bugs.launchpad.net/juju-core/+bug/1471657   but when I try to get juju's code on stilson-07  I get this error:
[17:51] <natefinch> fatal: unable to access 'https://code.googlesource.com/google-api-go-client/': Received HTTP code 403 from proxy after CONNECT
[17:51] <natefinch> seems like it must be a proxy/firewall issue?
[17:51] <mup> Bug #1471657: linker error in procsPersistenceSuite unit test on ppc64 <ci> <ppc64el> <test-failure> <unit-tests> <juju-core:Triaged> <juju-core feature-proc-mgmt:Triaged> <https://launchpad.net/bugs/1471657>
[17:53] <sinzui> natefinch: those machines are on a private network. They cannot access google or aws, or hp or joyent. They can access canonistack. I think you need to move to another machine
[17:54] <natefinch> sinzui: I'll take whatever PPC machine is available, I just knew how to connect to those.  Is there a different PPC machine I can use that has connection to the public internet?
[17:54] <sinzui> natefinch: those are the only ones, and they have special access . all others are more restricted
[17:55] <davecheney> natefinch, yes, you'll have to raise an RT to get that firewall exception
[17:55] <davecheney> or you could just scp in the code from your machine
[17:55] <davecheney> that's what I do
[17:56] <sinzui> yep, I do that all the time
[17:56] <natefinch> davecheney: yeah, that was going to be my next thought - scp.  I just figured, since so much of the rest of it worked, the fact that one random url didn't work seemed like more of a bug than intentional
[17:57] <natefinch> davecheney: or maybe none of it worked and that's just the first leaf package to try to download.  I didn't actually check
[17:57] <davecheney> just part of life behind the firewall
[17:57] <davecheney> this is a new dep for google gae
[17:57] <natefinch> davecheney:  I see
[17:58] <natefinch> davecheney: are you still in the US?  I presume you're not awake back home at this time of night
[17:58] <sinzui> natefinch: I had a day last week spent tarring, scping, untarring, go testing :( this situation is also true for our one machine that can run maas
[17:59] <davecheney> sadness
[18:00] <sinzui> davecheney: natefinch There is a plan to add ppc64el to canonistack. That might fix this situation
[18:19] <katco> wwitzel3: how's that doc coming?
[18:20] <wwitzel3> katco: good, I think we have a couple ideas
[18:20] <katco> wwitzel3: mind if i tal?
[18:23] <wwitzel3> katco: shared the doc with you; it's just irc logs pasted in, I haven't distilled anything yet, so haven't given any structure to the doc
[18:26] <katco> wwitzel3: hrm. worried that this might be too complicated for a demo
[18:29] <wwitzel3> katco: ok
[18:29] <katco> wwitzel3: to give you some kind of idea. wallyworld's storage demo was bringing up postgres with external storage and then showing the contents of the external storage (i think)
[18:30] <katco> wwitzel3: a cool idea would be cool, but i don't want it to be anything so elaborate that i mess it up and don't know enough about the charms to fix it
[18:32] <katco> ericsnow: did you create a bug to track the OVA images card?
[18:33] <ericsnow> katco: #1468383
[18:33] <katco> ericsnow: ty... and is there an email i can piggy-back off to email ben?
[18:33] <ericsnow> katco: not really
[18:36] <katco> ericsnow: the remaining wpm cards are created?
[18:37] <ericsnow> katco: not yet
[18:39] <katco> ericsnow: wwitzel3: we need to be ready to go over the demo and how to get there by tomorrow
[18:39] <ericsnow> katco: k
[18:48] <wwitzel3> katco: ok, in that case, updated the doc
[18:48] <katco> wwitzel3: simple, love it :p
[18:49] <katco> wwitzel3: not that i'm not *very* interested in what whit et al. are working on (i.e. real-world use-cases)
[18:49] <katco> wwitzel3: but for demo, just need proof that it works
[18:50] <katco> wwitzel3: it would be cool to have a 2nd demo in case i'm feeling ambitious, if they have something ready to go
[19:05] <katco> natefinch: 1:1
[19:08] <natefinch> katco: oops, sorry, coming
[19:19] <katco> ericsnow: can you take a look at requirements section here: https://docs.google.com/document/d/1etgWYADQHVSY_yT5rd-_DqPXBNUIWYBj-z8-Cpxc2-U/edit#heading=h.u3tics2c141k
[19:19] <katco> ericsnow: and update with what else needs to be done?
[19:19] <ericsnow> katco: sure
[19:20] <katco> wwitzel3: also, do we need a mysql component there as well to prove they can talk to each other?
[19:21] <katco> wwitzel3: whoop nm looks like that's there isn't it
[19:54] <cmars> natefinch, can you please take a look at http://reviews.vapour.ws/r/2188/ ?
[19:55] <cmars> natefinch, it's passing on hyperv
[19:55] <cmars> and linux of course ;)
[20:03] <natefinch> cmars: np
[20:03] <cmars> natefinch, ty
[20:03] <natefinch> cmars: gah... whoever wrote ReplaceFile did it backwards :/
[20:04] <natefinch> cmars: Go standard is foo(dest, src)
[20:04] <natefinch> to mimic a = b
[20:04] <cmars> natefinch, i noticed that
[20:04] <natefinch> cmars: well, there's no fixing it now, I guess.
[20:04] <cmars> natefinch, that'd be a heavy lift
[20:05] <cmars> natefinch, os.Rename is kind of the same way though, http://golang.org/pkg/os/#Rename
[20:05] <natefinch> cmars: huh, weird, yeah
[20:06] <natefinch> cmars: probably written before they settled on the other scheme.  Oh well. Better to be consistent.
[20:10] <cmars> natefinch, i should return proper os.LinkErrors.. i'll fix that
[20:10] <natefinch> cmars:  reviewed
[20:10] <cmars> natefinch, thanks!
[20:11] <natefinch> cmars: welcome.  Anything to avoid working on this ppc bug ;)
[20:42] <mup> Bug #1475425 opened: There's no way to query the provider's instance type by constaint <juju-core:New> <https://launchpad.net/bugs/1475425>
[20:53] <davecheney> thumper: sorry i'm on another call
[21:06] <perrito666> wallyworld: you are a bit frozen
[21:07]  * perrito666 hums let it go to wallyworld 
[21:12] <mup> Bug #1475056 changed: worker/uniter/relation: HookQueueSuite.TestAliveHookQueue failure <juju-core:New> <https://launchpad.net/bugs/1475056>
[21:17] <perrito666> wallyworld: time to get a new modem?
[21:17] <wallyworld> perrito666: maybe, trying to join again now
[21:18] <wallyworld> perrito666: except now chrome hates me
[21:49] <katco> cherylj: still there?
[21:55] <davecheney> thumper: sorry i missed the standup
[21:55] <davecheney> was on another call
[21:55] <davecheney> wrt the arm issue
[21:56] <davecheney> is there a maas install that I can use to reproduce it
[21:56] <davecheney> wallyworld: you were trying to get access to the system ?
[21:56] <davecheney> did you succeed ?
[21:57] <wallyworld> davecheney: i didn't succeed, but maybe that's just me. there's access instructions in the bug
[21:58] <thumper> davecheney: if you aren't able to get access through the instructions in the bug, try bugging the hyperscale team, Andrew Cloke or Sean
[21:59] <alexisb> davecheney, Sean specifically said he would provide any access needed
[21:59] <alexisb> so we should hold them to that
[21:59] <davecheney> are we talking about the same bug ?
[21:59] <davecheney> there is nothing in the issue
[21:59] <davecheney> https://bugs.launchpad.net/juju-core/+bug/1415517
[21:59] <mup> Bug #1415517: juju bootstrap on armhf/keystone hangs <armhf> <bootstrap> <hs-armhf> <juju-core:Confirmed> <https://launchpad.net/bugs/1415517>
[21:59] <alexisb> davecheney, that is the one i am thinking of
[22:00] <davecheney> are the instructions like hidden or something ?
[22:02] <thumper> cmars: what are the two return values of utils.MoveFile ?
[22:03] <thumper> cmars: or more specifically, why are you checking the ok value if err != nil?
[22:03] <thumper> cmars: isn't it more idiomatic go to not expect any other value to have meaning if err is not nil?
[22:09] <thumper> cmars: nm, went and read the source
[22:12] <wallyworld> davecheney: damn connection problems today, not sure if you saw last messages
[22:16] <davecheney> nope
[22:17] <davecheney> i kept saying "I'm not sure what access details you are seeing in that issue -- i cannot see them "
[22:17] <wallyworld> [08:09:58] <wallyworld> davecheney: the issue is that state server jujud process dies on arm
[22:17] <wallyworld> [08:10:16] <wallyworld> they can run workloads, but not state servers
[22:17] <davecheney> ok
[22:17] <wallyworld> the jujud process just disappears
[22:17] <davecheney> dmesg ?
[22:17] <wallyworld> i've asked for stuff like that
[22:18] <wallyworld> i think they want us to ssh in
[22:18] <davecheney> ok
[22:18] <wallyworld> and see for ourselves
[22:18] <wallyworld> there's a whole maas cluster
[22:18] <davecheney> ok
[22:18] <wallyworld> you need to use the vpn
[22:18] <davecheney> fuk
[22:18] <davecheney> that won't work from where I am
[22:19] <wallyworld> i can get http access to maas, but maas rejects my ssh attempts
[22:19] <wallyworld> and i known nothing about arm
[22:19] <davecheney> this is linux
[22:19] <davecheney> this is user space
[22:19] <davecheney> it won't be arm specific
[22:20] <wallyworld> true, also not my specialty :-(
[22:20] <wallyworld> low level system stuff
[22:20] <davecheney> i'm not sure what the next step is
[22:21] <davecheney> x wants us to do y
[22:21] <davecheney> we've tried y
[22:21] <davecheney> it didn't work
[22:21] <davecheney> how can we break the stalemate
[22:21] <wallyworld> didn't work for me. i've asked them to attach any post mortem and relevant info to bug
[22:21] <davecheney> +1
[21:21] <davecheney> i'm subscribed to the bug
[22:22] <wallyworld> i may need to poke them again
[22:27] <mup> Bug #1466087 changed: kvmBrokerSuite TestAllInstances fails <ci> <test-failure> <juju-core:Incomplete> <juju-core devices-api-maas:Triaged> <https://launchpad.net/bugs/1466087>
[22:52] <mup> Bug #1474291 changed: juju called unexpected config-change hooks after read tcp 127.0.0.1:37017: i/o timeout <hooks> <openstack> <sts> <uosci> <juju-core:Invalid> <ceilometer (Juju Charms Collection):New> <https://launchpad.net/bugs/1474291>
[22:52] <mup> Bug #1475386 changed: unit not dying after failed hook + destroy-service <destroy-service> <juju-core:New> <https://launchpad.net/bugs/1475386>
[22:52] <thumper> fark...
[22:52] <thumper> davecheney: still here?
[22:52] <davecheney> thumper: ack
[22:52]  * thumper is looking at bug 1474946
[22:52] <mup> Bug #1474946: kvmBrokerSuite worker/provisioner: tests are poorly isolated <blocker> <ci> <regression> <test-failure> <juju-core:In Progress by thumper> <https://launchpad.net/bugs/1474946>
[22:52] <thumper> I moved my /var/lib/lxc dir out of the way
[22:52] <davecheney> it's a shitstorm
[22:52] <thumper> and confirmed that my user can't create a dir there
[22:52] <thumper> but when I run the tests, they pass
[22:53] <davecheney> you have lxc installed
[22:53] <thumper> WT actual F
[22:53] <thumper> yes
[22:53] <thumper> but the dir /var/lib/lxc doesn't exist
[22:53] <davecheney> mkdir -p will always pass if the directory exists
[22:53] <thumper> because I moved it
[22:53] <davecheney> what is the ownership of /var/lib ?
[22:53] <thumper> doesn't allow my user to create dirs
[22:53] <davecheney> possibly installing lxc changes group ownerships
[22:53] <thumper> that was the first thing I tested
[22:53] <davecheney> puts you in wheel
[22:54]  * thumper digs more
[22:57] <thumper> FFS
[22:57] <thumper> this test is bullshit
[22:57] <thumper> it is a kvm test
[22:57] <thumper> that checks the lxc dir for networking setup
[23:02] <cmars> thumper, thanks for the review. i described the return bool here: https://github.com/juju/utils/blob/master/file_unix.go#L30
[23:02] <cmars> thumper, did you want a comment in juju as well describing the use of it?
[23:03] <davecheney> da fuq
[23:03] <thumper> just in that use of it, yes
[23:03] <cmars> thumper, ok, np
[23:03] <thumper> code should be obviously correct when you read it
[23:03] <thumper> davecheney: also, my version passes because for some reason, the lxc data dir is /home/tim/.local/share
[23:03] <thumper> more modern lxc I guess
[23:04]  * davecheney reaches for emoji
[23:04] <davecheney> possibly, i'm on 14.04.2
[23:06] <thumper> oh fuck
[23:06]  * thumper head desks
[23:06]  * thumper head desks
[23:06]  * thumper head desks
[23:06] <davecheney> always a good sign ...
[23:06]  * thumper head desks
[23:06] <thumper> in order to be a good citizen...
[23:06] <thumper> we do this:
[23:07] <thumper> LxcContainerDir  = golxc.GetDefaultLXCContainerDir()
[23:07] <thumper> which does this:
[23:07] <thumper> run("lxc-config", nil, "lxc.lxcpath")
[23:07] <thumper> for root, it is probably the right thing
[23:07] <thumper> for a user with modern lxc
[23:07] <thumper> it isn't
[23:07]  * thumper thinks
[23:07] <thumper> ugh
[23:07] <thumper> since the local provider jujud runs as root
[23:07] <thumper> I think we are ok
[23:07] <davecheney> lxc-config won't exist if lxc isn't installed
[23:08] <thumper> but this is why the tests passes
[23:08] <thumper> ack
[23:08] <thumper> if there is an error
[23:08] <thumper> it returns /var/lib/lxc
[23:08] <thumper> which then doesn't exist
[23:08] <thumper> however
[23:08] <davecheney> if lxc-config ... fails, we fall back to /var/lib/lxc ?
[23:08] <thumper> the bigger problem
[23:08] <thumper> is that the test is bullshit
[23:08] <davecheney> derp-tastic!
[23:08] <thumper> we shouldn't be adding network config in lxc dir for kvm tests
[23:10]  * thumper renames the function so it is obviously wrong
[23:10] <thumper> and removes it
[23:10] <davecheney> phase 1. delete test
[23:10] <davecheney> phase 2. ??
[23:10] <davecheney> phase 3. build is green
[23:11] <thumper> phase 1: rename function to include LXC
[23:11] <thumper> phase 2: make it so the local dir can't be created
[23:11] <thumper> phase 3: run all tests
[23:12] <thumper> phase 4: remove lxc function from kvm test
[23:12] <thumper> phase 5: ensure no other failures
[23:12] <thumper> phase 6: send email to network folks to see what  should be there
[23:12] <thumper> phase 7: profit
[23:12] <davecheney> 7 steps ?
[23:12] <davecheney> that's too enterprise
[23:22] <thumper> http://reviews.vapour.ws/r/2190/diff/#
[23:26] <menn0_> wallyworld: the env life assert PR has now failed twice due to test timeouts in cmd/jujud/agent
[23:26] <menn0_> wallyworld: makes me think it's an effect of the change
[23:26] <menn0_> wallyworld: but of course it always works on my machine
[23:27] <thumper> menn0_: did you want me to try here?
[23:27] <menn0_> thumper: if you have the time, yes please
[23:27] <thumper> menn0_: if you want to review said branch above
[23:28] <menn0_> thumper, wallyworld: not sure if it's related but the race detector finds 11 races in that package
[23:28] <thumper> really?
[23:28] <thumper> menn0_: what did you do?
[23:28] <menn0_> davecheney: have you been backporting your data race fixes to 1.24?
[23:29] <menn0_> thumper: I haven't changed a thing in that package
[23:29] <thumper> menn0_: no, he hasn't AFAIK
[23:29] <davecheney> no
[23:29] <davecheney> i have not
[23:29] <menn0_> thumper: that could be why the races are still there then
[23:29] <thumper> :)
[23:29] <davecheney> yup
[23:29] <menn0_> thumper: this PR only touches state
[23:29] <thumper> this is on 1.24 is it?
[23:29] <menn0_> thumper: yep
[23:29] <menn0_> the races might be nothing to do with the test hangs
[23:30] <menn0_> but it could be to do with txns being decoupled, changing the timings of things
[23:30] <thumper> we should back port the apiserver wait group change
[23:30] <thumper> because that could be it
[23:30] <menn0_> thumper: I already did that I think
[23:30]  * menn0_ checks
[23:31] <menn0_> thumper: yep that's there
[23:31] <thumper> hmm
[23:32] <wallyworld> menn0: sorry was in meeting, but normally if agent tests timeout more than once there's an issue
[23:35] <menn0> wallyworld: it's been 2 different tests in that pkg that have gotten stuck but they're both upgrade related
[23:36] <menn0> wallyworld: i'm going to have peek at them in case something obvious jumps out
[23:41] <perrito666> wallyworld: mm, that is the test that failed merging the patch yesterday, I think that it can only be reproduced by running the whole suite, I have been able to do it only once and no more so I could not get to it and thought it was one of the long standing flaky tests
[23:43] <wallyworld> could be a flaky test but i think work has been done recently to fix a lot of the agent related tests
[23:43] <perrito666> wallyworld: mm, could definitely be something in the change that fixes the issue with status, but that would mean that the test is waiting for the wrong assumption
[23:44] <perrito666> wallyworld: did you ever re-merge the code for agent status?
[23:44] <wallyworld> perrito666: which code?
[23:46] <perrito666> wallyworld: updateAgentStatus
[23:47] <axw> wallyworld: on master, AFAIK, bootstrap will put image metadata directly into gridfs without using swift
[23:47] <wallyworld> perrito666: that code being missing would only have failed to report a failed state
[23:47] <wallyworld> the fix is merging now
[23:47] <wallyworld> the refactoring reported non-error status elsewhere
[23:48] <axw> wallyworld: it's just that we weren't searching it (the fix for that landed already I think?)
[23:48] <perrito666> wallyworld: it is odd that fixing the code would break the test :(
[23:48] <wallyworld> perrito666: the code hasn't merged yet
[23:49] <wallyworld> axw: i didn't think it did put it into gridfs, or i don't recall if it did
[23:49] <axw> wallyworld: I'll find the code. I'm 99% sure it does
[23:50] <wallyworld> axw: the search issue - that was cloud storage not being searched
[23:50] <wallyworld> ie swift
[23:50] <wallyworld> i didn't think 1.24 and master were different in that respect
[23:50] <axw> wallyworld: oh... we're meant to be looking in gridfs as well
[23:51] <wallyworld> axw: i didn't realise at all that the simplestreams data had been added to gridfs
[23:51] <axw> wallyworld: https://github.com/juju/juju/blob/master/cmd/jujud/bootstrap.go#L237
[23:51] <menn0> thumper: so these tests are hanging because the machine agent Stop call is not returning
[23:52] <axw> wallyworld: we're writing the image metadata into "state storage", which is gridfs
[23:52] <thumper> heh
[23:52] <menn0> thumper: but it's not the apiserver... I can see that does stop
[23:52] <thumper> yeah...
[23:52] <thumper> oh?
[23:52] <thumper> interesting
[23:52] <thumper> which one is it?
[23:52] <menn0> thumper: still digging through the logs to figure out which workers are not stopping
[23:52]  * menn0 is very grateful to thumper for adding the extra logging in the runner
[23:52] <wallyworld> axw: i see, i had thought that the stor used was EnvironStorage
[23:53] <thumper> axw: need to talk to you about environment destruction
[23:53] <axw> thumper: mkay
[23:55] <axw> wallyworld: hm, so looking back over anastasiamac's change, I don't think that's actually what we should be doing. we're meant to be looking in gridfs, and the individual providers can add additional search paths if they want to (e.g. look in keystone)
[23:55] <wallyworld> axw: i'd prefer not to have a simplestreams blob
[23:55] <wallyworld> structured data is much better
[23:55] <axw> wallyworld: I understand, and that's being fixed, but atm we're talking about *where* the blob is
[23:55] <wallyworld> simplestreams should not be in env
[23:56] <axw> in provider storage vs. gridfs
[23:56] <axw> we should not be perpetuating provider storage
[23:56] <wallyworld> agreed, and we're not
[23:56] <wallyworld> i didn't realise we weren't writing to provider storage
[23:56] <axw> wallyworld: the latest change reintroduces searching metadata in provider storage...
[23:57] <thumper> axw: when do you have some time?
[23:57] <wallyworld> because i thought we were writing metadata there based on the information i had
[23:57] <axw> thumper: can chat now
[23:58] <menn0> thumper: looks like it might be the certupdater
[23:58] <thumper> haha
[23:58] <menn0> thumper: it's blocked on a channel send
[23:58] <thumper> bwa haha
[23:58] <thumper> naked send?
[23:58] <thumper> oh...
[23:58] <thumper> I remember that...
[23:58] <thumper> it is buffered
[23:58] <thumper> with one value
[23:58] <thumper> but sends twice
[23:58] <menn0> wasn't that fixed?
[23:58] <axw> thumper: where aboots?
[23:58] <thumper> I thought so...
[23:59] <thumper> perhaps not
[23:59]  * menn0 keeps digging
[23:59] <thumper> axw: https://plus.google.com/hangouts/_/canonical.com/env-destruction
[23:59]  * menn0 loves tracebacks + decent logs