[00:32] veebers: ping
[00:44] hml: pong o/ how's things?
[00:44] veebers: good - getting cold
[00:45] veebers: is there a good way to force a model migration failure? - i'm working on your bug for show-model
[00:46] hml: babbageclunk would have a good idea, I think attempting to migrate before units are idle would do it (i.e. do the commands really quickly)
[00:46] veebers: at least with the tests, i'm not sure any error status will be more helpful. :-/ e.g. "machine sanity check failed, 2 errors found"
[00:47] hml: it's starting to get warmer here, although we've had really nice warm days then it descends into cold days
[00:47] veebers: of course, he's the one who knew how to fix this.
[00:47] veebers: the fun days - hot, cold, hot
[00:47] aye ^_^
[00:49] hml: I love that Christmas (and thus Christmas break) is over the middle of summer here :-)
[00:49] veebers: that's just plain weird
[00:49] ^_^
[00:49] veebers: that said, the family heads south for christmas
[00:50] hml: hah, flying in arrow formation I hope :-)
[00:50] veebers: ha
[01:22] Hmm, with lxd, now when publishing an image from an existing container it takes ages and creates tmp files > 100GB. I think something is wrong :-\
[01:22] hml: sorry, was grabbing lunch - unfortunately the units not being ready won't fail in the right way now.
[01:31] babbageclunk: i'm thinking this is what you had in mind? https://github.com/hmlanigan/juju/commit/06db34cb637594cbe6e80db4ecb1f22778b2988f
[01:32] babbageclunk: is there a good way to force a model migration abort?
[01:32] not having any luck at it
[01:36] hml: not that I know of - it generally indicates a failure of prechecking. You could simulate it with a wrench?
[01:36] babbageclunk: wrench?
[01:37] hml: we've got a mechanism for throwing errors if a wrench file is present...
[01:37] hml: for testing hard-to-reach scenarios...
[01:37] hml: hang on, finding it
[01:37] babbageclunk: haven't run into that yet.
[01:38] babbageclunk: the wrench part :-)
[01:38] hml: oh, there's one in the migrationmaster already...
[01:39] hml: see the call to wrench.IsActive?
[01:39] babbageclunk: yes
[01:41] basically you could throw an error from transferModel if wrench.IsActive("migrationmaster", "die-in-export")
[01:42] babbageclunk: okay - and wrench active would be a line in /var/lib/juju/wrench/machine-agent?
[01:44] I think it would be wrench/migrationmaster - add die-in-export to that file
[01:44] babbageclunk: ah, ty
[01:46] hml: no worries
[01:46] babbageclunk: the change was what you had in mind?
[01:46] hml: yup
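(Editor's aside: a minimal, runnable sketch of the wrench idea described above. The import path is assumed to be github.com/juju/juju/wrench, and the stand-in transferModel and its error text are illustrative only, not the real migrationmaster code.)

    package main

    import (
        "errors"
        "fmt"

        "github.com/juju/juju/wrench" // assumed import path for juju's wrench helper
    )

    // transferModel stands in for the real migrationmaster export step; the
    // wrench check lets a test force a failure there without touching real
    // logic. The wrench is "active" when die-in-export appears in the
    // migrationmaster file under the agent's wrench directory
    // (e.g. /var/lib/juju/wrench/migrationmaster).
    func transferModel() error {
        if wrench.IsActive("migrationmaster", "die-in-export") {
            return errors.New("wrench in the works: die-in-export")
        }
        // ... the real export/transfer work would happen here ...
        return nil
    }

    func main() {
        fmt.Println(transferModel())
    }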
[02:00] * thumper sighs
[02:01] we have found two new buts that we really need to fix for 2.3
[02:01] heh
[02:01] buts
[02:01] gugs
[02:01] bugs
[02:01] those things
[02:01] babbageclunk: got 10 minutes?
[02:01] babbageclunk: I'd like to talk a few things through
[02:05] * thumper steps away
[02:15] thumper: sorry, looking again!
[02:15] thumper: grab me when you come back
[02:21] babbageclunk: veebers: at least the last error won't be overwritten by the abort msg with model migration. we'll see how useful the errors are.
:-)
[02:23] wallyworld: looking at your PR shortly, I've just put up https://github.com/juju/juju/pull/8087 if you have time to look; moves state pool to worker/state
[02:24] axw: will do, just getting a quick bite after finishing interview
[02:24] sure, no great rush
[02:25] hml: excellent :-) It should be very helpful when migrations fail (especially in tests so we can categorise them)
[02:30] axw: so we would still try a specific zone, but doesn't your patch mean we'll at least fall back to another one?
[02:31] I suppose if the issue is that pods will give you one when they otherwise would want to fall back to explicit hardware
[02:31] but apparently the underlying bug was actually that MAAS would ignore a 'tag' constraint which is a better way to target real hardware
[02:32] thumper: Did you get a note from the meeting about checking whether juju create-backup excludes previous backups?
[02:32] thumper: Also, has any thought been given to leaving the backups in the filesystem instead of mongodb?
[02:36] wallyworld: babbageclunk: i have a quick review if you're around: https://github.com/juju/juju/pull/8088
[02:36] yup
[02:36] ok
[02:41] hml: the migration message won't actually say that the migration is aborted though will it? it will show some arbitrary error text but the user won't know what the final outcome is. maybe it's a recoverable, temporary error, who knows
[02:41] wallyworld: i left it as info, since it was calling setInfoStatus…
[02:42] you mean the log message?
[02:42] wallyworld: yes
[02:42] what's written to logs is a warning though since something failed
[02:42] the setInfoStatus() is just an api name
[02:43] jam: in the cases where it would *fail* because of going to the zone, then sure. for this bug, I was only thinking about the case where we can get a machine in either zone, but we really want the one in the default zone
[02:43] is there a way to track the last migration error in the worker and when aborting, prepend the error with "aborted: " or something
[02:43] wallyworld: in this case, goes to info
[02:43] wallyworld: i'll have to look
[02:43] setStatusInfo() goes to the model status right? not the logs
[02:43] it goes to both
[02:44] ok, but the final logged message should be a warning - sys admins grep for warnings and knowing something has failed is important
[02:45] okay, can be changed
[02:46] there doesn't appear to be a last failure.
[02:47] would have to change for every abort, instead of in abort
[02:47] and some places not easily known -
[02:50] i haven't looked at code - can we record last failure as an attribute on the migration op. there's only one place where the final abort status is set? or no?
[02:51] there is only one place abort status is set -
[02:52] that's good - we just need to check for last error and prepend that with "migration aborted: " or whatever
[02:52] that way it is 100% clear that the thing has stopped and why
[02:52] where is the last error?
[02:53] you need to record it!
[02:53] create a variable to hold it
[02:54] have a helper method somewhere used to update status with an error, and overwrite a lastError each time. or something
[02:54] sorry i'm being dense here.
[02:54] there's one i can add on to
[02:54] could be me oversimplifying
[02:55] i only have a little brain
[02:55] i was thinking the worker in this case was at a high level
[02:55] but it's in migrationmaster
[02:58] yeah, i think the migrationmaster entity should be able to track errors encountered doing the job
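(Editor's aside: a self-contained sketch of the idea agreed above - record the last error the worker reports, and have the single place that sets abort status prepend "migration aborted: " to it. All names here are hypothetical, not the real juju code.)

    package main

    import (
        "errors"
        "fmt"
    )

    // statusTracker is a hypothetical stand-in for the migrationmaster worker's
    // status handling: every error status it reports is remembered, so the
    // final abort message can say what actually went wrong.
    type statusTracker struct {
        lastErr error
    }

    // setErrorStatus wraps the usual status-update call and records the error.
    func (t *statusTracker) setErrorStatus(err error) {
        t.lastErr = err
        fmt.Println("status:", err)
    }

    // setAbortedStatus is the single place abort status is set; it prepends the
    // "migration aborted" marker to the last recorded failure.
    func (t *statusTracker) setAbortedStatus() {
        msg := "migration aborted"
        if t.lastErr != nil {
            msg = fmt.Sprintf("migration aborted: %v", t.lastErr)
        }
        fmt.Println("status:", msg)
    }

    func main() {
        t := &statusTracker{}
        t.setErrorStatus(errors.New("machine sanity check failed, 2 errors found"))
        t.setAbortedStatus() // status: migration aborted: machine sanity check failed, 2 errors found
    }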
[03:07] babbageclunk: you loving simple streams?
:-D
[03:09] axw: just started looking - why was srv.run() put in a goroutine? maybe it will be clear when i read more of the PR?
[03:10] just curious if it was due to a drive-by or part of the PR
[03:10] wallyworld: apiserver.Server is a worker, I'm just making it conform to our usual patterns
[03:10] ok, ta
[03:10] wallyworld: it's not a big deal, just tidying up
[03:10] no worries, sgtm
[03:13] axw: and that means you were able to pull expireLocalLoginInteractions() out of a separate goroutine
[03:15] wallyworld: changes done
[03:15] wallyworld: I am not
[03:15] hml: awesome tyvm, looking
[03:16] hml: much nicer, thank you!
[03:17] babbageclunk: did we need a HO?
[03:17] wallyworld: probably couldn't hurt
[03:17] righto
[03:20] * hml headed to the airport to pick up my dad
[03:20] hml: see you later
[03:25] wallyworld: yeah, the idea was to stop using the waitgroup for two different things (tracking sub-workers, and outstanding connections/requests)
[03:30] thumper: did you come back?
[04:03] axw: sorry about delay, had meeting, lgtm
[04:15] wallyworld: no worries. I've left a bunch of comments on your PR. I'd like it to be split up a bit (see comments), and worker responsibilities separated
[04:15] ok looking
[04:17] wallyworld: re renaming StateTracker to StatePoolTracker or whatever, I'd really rather not. I'm considering StatePool just an implementation detail, as the entry point to "state". I want to replace it with a different type which manages all the state workers, as we've talked about recently
[04:17] ok
[04:24] wallyworld: if we find the same binary version in multiple streams it doesn't matter, does it? They'll be the same.
[04:24] they are today but not guaranteed
[04:25] although they will always be the same in practice
[04:25] We won't find 2.3-rc1 in both released and proposed.
[04:25] correct
[04:25] rc1 should only be in proposed
[04:25] sorry, i think i misunderstood the first time
[04:25] No, that was a slightly different question.
[04:47] wallyworld: https://github.com/juju/juju/pull/8089
[04:47] ok
[04:47] * thumper heads out for more kid duty
[04:47] bbl
[05:06] wallyworld: I'm not sure I agree with you on the select bits
[05:06] thumper: why have 2 select blocks which creates bloat when we just need one?
[05:06] wallyworld: because it is explicit and clear
[05:07] you aren't allowed to send to a nil channel
[05:07] not even allowed to try
[05:07] sending to a nil channel as a no-op is fairly standard
[05:07] no
[05:07] oh wait
[05:07] receiving
[05:07] sorry
[05:07] pulling off a nil channel
[05:07] yeah, ignore me
[05:07] ok
[05:07] I'll look at the Timeout bit
[05:07] ok
[05:08] I think in order to keep the patch small, we shouldn't rename Timeout in this branch
[05:08] in case we need to back-port it
[05:09] I'm ok in principle with renaming Timeout
[05:09] although Timeout is fairly self-explanatory
[05:11] except that there's 2 and it's a bit ambiguous, but the doc helps and agree about minimising change
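(Editor's aside, since the point got muddled above: in Go, a send or receive on a nil channel blocks forever, so inside a select a case on a nil channel is simply never chosen, which is why a single select can cover both the "channel present" and "channel absent" situations. A small runnable illustration, not juju code:)

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        var maybe chan string // nil channel: its case below can never fire
        timeout := time.After(100 * time.Millisecond)

        select {
        case msg := <-maybe:
            fmt.Println("never reached:", msg) // a nil channel never becomes ready
        case <-timeout:
            fmt.Println("timed out; the nil-channel case was effectively disabled")
        }
    }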
[06:02] axw: why do you want separate NewIAASModel() and NewCAASModel() when the code for both is 95% the same? there's just a couple of different checks which should disappear at some point, eg the IAAS restriction on a new model being in the same cloud as the controller, which we have talked about removing
[06:03] given model type is essentially just a model attribute, i'm not sure what such a far-reaching change buys us
[06:04] there are > 30 usages of NewModel()
[06:08] wallyworld: because it keeps the callers simpler, not having to pass in irrelevant things like storage providers and constraints. so each time you add something IAAS specific you don't have to worry about CAAS code, and vice versa
[06:08] wallyworld: all the existing ones are for IAAS models, right?
[06:08] we already support adding a caas model
[06:09] most of the test cases are iaas
[06:09] this NewModel() code with caas type has been there for a while
[06:09] i think
[06:10] i can change, will just add some noise to the pr
[06:10] i'm starting a new pr
[06:10] for the below the water line stuff
[06:10] wallyworld: leave that one then
[06:10] it can be refactored later
[06:11] axw: i can easily do in a followup
[06:12] wallyworld: my biggest beef is with responsibility of making/maintaining connections, and head was getting too full to dive into the state innardy bits
[06:12] the rest is small stuff
[06:12] yup fair point
[06:12] except ModelActive
[06:12] i've changed but still dislike having to make 2 separate mgo queries
[06:13] you don't have a model object at the call site - just a uuid
[06:13] and the only thing that calls ModelActive() is there
[06:13] wallyworld: but you can change that, by using a state pool - no?
[06:13] no, this is before the agent is started
[06:13] when it is figuring out what manifolds to use
[06:14] wallyworld: it's in the model worker manager? that's not before any agents have started...
[06:15] it's run outside of the machine manifold, which is part of the problem
[06:15] it needn't be though
[06:15] ok, i'll look closer
[06:15] wallyworld: I can move that into the machine manifold if you like, while you're doing other things?
[06:16] chasing down a test failure atm, which I'll need to fix first
[06:16] ok, that would be grand. i can rebase on top of that. i've got about 3 branches on the go
[06:16] got the skeleton operator agent going
[06:17] with a bit of common stuff extracted from uniter
[06:17] small steps
[06:22] wallyworld: I'm going to merge develop into the feature branch, OK?
[06:22] \o/
[06:22] there's some agent fixes I don't want to work around
[06:23] yup
[06:23] i wish git had pipelines like bzr
[06:23] would make working on several dependent branches so much easier
[07:02] axw: you still working on the state pool tracker pr or can that land now the feature branch has been updated from develop?
[07:03] wallyworld: I'm trying to figure out why a test failed
[07:03] righto
[07:03] wallyworld: I've been unable to reproduce
[07:03] joy
[07:04] might just land and see if it pops up again, might be a blue moon type of thing
=== frankban|afk is now known as frankban
[08:57] wallyworld: can you take a look at another WIP and let me know whether you're happier with that approach?
[08:58] wallyworld: took a bit longer than expected, here's my PR to move modelworkermanager to the dependency engine: https://github.com/juju/juju/pull/8093
[08:58] wallyworld: oops https://github.com/juju/juju/pull/8092
[08:58] pipped to the post!
[09:01] babbageclunk: looking
[09:08] axw: I don't know, you got the link there first
[09:08] :)
[09:09] babbageclunk: looks pretty good
[09:10] wallyworld: cool cool, just testing
[09:10] babbageclunk: i'm not sure there's more to do to handle searching across datasources? as the current behaviour might be sufficient?
[09:11] axw: ty, i'll look after dinner
[09:11] wallyworld: maybe? It'll still stop on the first datasource that has an index though
[09:11] yeah
[09:12] if datasource A only has devel and B has released
[09:12] it needs to consider all
[09:12] right
[09:12] i think the branch as is can land though
[09:12] and another done to do the other bit
[09:12] I need to fix unit tests
[09:13] ok, will do that
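(Editor's aside: a hypothetical sketch of the behaviour being discussed - collect matches from every data source for every requested stream, rather than stopping at the first source that has an index, so that a source carrying only devel and another carrying released are both considered. The types and names are invented for illustration and are not the juju simplestreams API.)

    package main

    import "fmt"

    // dataSource is a toy stand-in for a simplestreams data source: a name plus
    // the agent versions it advertises per stream.
    type dataSource struct {
        name    string
        streams map[string][]string // stream name -> agent versions
    }

    // findAgents walks every data source for every requested stream, instead of
    // returning after the first source that yields any results.
    func findAgents(sources []dataSource, streams []string) []string {
        var found []string
        for _, src := range sources {
            for _, stream := range streams {
                for _, v := range src.streams[stream] {
                    found = append(found, fmt.Sprintf("%s (%s stream, from source %s)", v, stream, src.name))
                }
            }
        }
        return found
    }

    func main() {
        sources := []dataSource{
            {name: "A", streams: map[string][]string{"devel": {"2.3-beta3"}}},
            {name: "B", streams: map[string][]string{"released": {"2.2.6"}}},
        }
        for _, agent := range findAgents(sources, []string{"released", "devel"}) {
            fmt.Println(agent)
        }
    }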
[09:13] jeez, it's late for you
[09:13] had to break to feed and bathe kids
[09:14] see how you go, i might have to pick it up tomorrow
[09:14] i need to go afk for a short while for dinner
[09:14] ok
[09:41] axw: awesome, ty. i will have some conflicts but I'll deal :-)
[09:41] wallyworld: cool
[09:42] axw: i'll have 2 PRs tomorrow. one for the state/infra stuff, one for the worker and manifold stuff. and then I'll weave in the facade stuff. and after that I'll propose the operator skeleton. so I guess that's 4 :-)
[09:43] wallyworld: sounds good. I'll try and move the remaining machine workers to manifolds in the morning, and we can figure out what CAAS things I can do in 1:!
[09:43] 1:1 even
[09:44] yay, more caas stuff
[09:47] git question because I'm feeling stupid - if I want to see all the diff-stats for the accumulated commits in my branch, how do I do it?
[09:48] I've tried git diff --stat upstream/develop..my-branch but it includes changes on develop that I haven't rebased into my branch yet.
[09:49] bah, just rebased so that I didn't need to worry about it.
=== salmankhan1 is now known as salmankhan
[10:25] balloons: looks like jenkins has died in the ass again - https://github.com/juju/juju/pull/8093
[13:18] wallyworld: ping? unlikely
[13:25] babbageclunk: go to bed
[13:25] :)
[13:25] babbageclunk: but if there is something I can help with
[13:25] axw: balloons said it was out of disk the other day
[13:26] jam: I'm thiiiis close to stopping! Thanks, I think I worked it out.
[13:32] jam: it's not even 3am, leave him alone!
[13:45] wpk :)
[13:48] is mergebot down?
[13:50] balloons: is there a problem with mergebot? This PR isn't getting built: https://github.com/juju/juju/pull/8092
[13:50] * babbageclunk goes to bed
[13:58] I can look. Mean
[13:58] Every morning this week
[13:59] I need to share the doc with you all on how to troubleshoot
[14:02] Axw, so perhaps we need to clean and reboot the vsphere daily to keep things going?
[14:10] looks like cloud instance died
[14:18] restored, jobs running again
[16:14] wpk, does this look better? https://github.com/juju/juju/pull/8086
=== frankban is now known as frankban|afk
[18:18] axino: ping
[18:53] https://bugs.launchpad.net/juju/+bug/1732764
[18:53] Bug #1732764: series + spaces + artful = fail
[19:22] balloons: is that doc for troubleshooting the merge bot around? :-) github-check-merge-juju-pipline thinks it has a config error.
[19:23] bdx, xenial works and artful doesn't?
[19:26] hml, needs updating and it was shared with you at one point :-) https://docs.google.com/document/d/1TN4SG8QXNbXpFn_9QTpPgI7XxD85uzdlttGB0QRlm2I/edit#
[19:27] balloons: i have to remember that far back? ;-)
[19:28] balloons: yeah
[19:28] bdx, ty.
We've actually been testing this over the last couple weeks
[19:28] the only combo that breaks it is "series + spaces + artful + beta3"
[19:28] balloons: np
[19:29] hml, ohh the old job was restored
[19:29] just needs to be removed from jenkins again
[19:29] I'm making an official redis snap right now, so it won't matter to me anymore - I was chasing the 4.x in artful lol
[19:30] good to get these things fixed anyway though
[19:30] bdx, ack. And indeed
[19:32] hml, anyways notice the actual bot ran fine on your pr
[19:33] balloons: i did, wasn't sure what the pipeline thing was
[21:58] wallyworld: i looked at the bug for a panic in the storageprovisioner, it's easy to see the cause, but i'm wondering why the pointer is nil at that point. who's been in that area?
[21:58] wallyworld: just to make sure there isn't a bad root case
[21:58] or cause even
[22:08] wallyworld, babbageclunk, please test sigfile in the edge snap
[22:10] and let me know; since we're going to ship that in rc1 if it's good
[22:16] or if you don't say anything :p
[23:31] balloons: I guess we could try, but it didn't take long for it to start misbehaving after I restarted it last time