[00:02] mgz: forgot to ask you about the failure in https://github.com/go-goose/goose/pull/33. how does one update deps for that package's CI? it needs github.com/juju/loggo
[00:45] are there any plans for 1.25.9? 1.25.8 seems suuuuper memory hungry
[00:59] bradm: hey, yeah, I'm looking into a leak in 1.25.8 now
[00:59] bradm: do you have any data points for me?
[01:01] thumper: how does 12G res jujud on node 0 sound?
[01:01] it sounds entirely unreasonable
[01:01] how big a model?
[01:02] 23G virtual and "0.012t" res, which is an interesting way to put it
[01:02] thumper: 10 physical machines, a bunch of containers
[01:02] how many units?
[01:03] it's an HA openstack deployment, so lots
[01:03] 40?
[01:03] 60?
[01:03] including subordinates, about 290 units
[01:03] 100?
[01:03] huh
[01:03] ok
[01:04] on only 10 physical machines?
[01:04] wow
[01:04] yup
[01:04] fairly standard deployment
[01:04] that includes landscape client, nrpe, ksplice, things like that
[01:05] hmm...
[01:05] looks like nearly 50 lxcs on there
[01:06] so ~5 units per machine
[01:06] about that
[01:06] I suppose that isn't terrible
[01:06] what version did you upgrade from?
[01:06] fresh 1.25.8 install
[01:06] and do you have any indication of memory it used before?
[01:06] ah
[01:06] ok
[01:06] * thumper taps fingers
[01:07] we were hitting the tomb dying error last week, and ended up having to go with a fresh 1.25.8
[01:07] tomb dying error?
[01:07] what is that?
[01:07] is it related to this? https://bugs.launchpad.net/juju-core/+bug/1645729
[01:07] Bug #1645729: environment unstable after 1.25.8 upgrade
[01:08] ok, I think I'm going to have to work out how to read the go heap profile dumps
[01:08] https://bugs.launchpad.net/juju-core/1.25/+bug/1613992 <- that one
[01:08] Bug #1613992: 1.25.6 "ERROR juju.worker.uniter.filter filter.go:137 tomb: dying"
[01:09] bradm: in the 1.25 agents, there is a point where we can get the agent to dump us a heap profile
[01:09] maybe this could point to the leak
[01:10] axw: do you have some familiarity with reading the go heap profiles?
[01:12] thumper: well, anything we can do to help out, let me know.
[01:12] bradm: will do
[01:12] we'd just handed the stack over to the customer last week, but they're only doing testing now
[01:17] thumper: interestingly the other state servers aren't leaking as much, something like 13G virt, 10G res on one, 11G virt, 8G res on the other
[01:17] HA right?
[01:17] yeah
[01:17] multi-model?
[01:17] probably not
[01:18] was still behind a feature flag
[01:18] nope, just a simple openstack deploy
[01:57] menn0, thumper: review plz? https://github.com/juju/juju/pull/6633
[02:08] menn0: So I was thinking that tracking the latest log time seen every 2 minutes of log messages would probably be an ok balance between DB activity and getting annoyed by double-ups. Sound alright?
[02:08] bugger...
[02:08] menn0: Works out to 864 extra writes over 3 days' worth of logs.
[02:08] crawled through all the code changes from 1.25.6 to 1.25.8
[02:08] nothing obvious
[02:08] thumper: stink
[02:09] babbageclunk: that seems ok, especially given that in most cases it won't be interrupted
[02:10] thumper: how sure are we that the problem actually started since 1.25.6?
[02:10] menn0: I'm not entirely sure
[02:10] could well be before
[02:10] menn0: oops, that was for 5 mins - 2 mins is 2160 writes.
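
As background for the heap-profile thread above: a Go process can write out its own heap profile with the standard runtime/pprof package, which is broadly what an agent-side dump facility like the one thumper describes would wrap. The sketch below is illustrative only, not juju's actual code; the helper name and the output path are made up.

    // Minimal sketch, not juju's pprof facility: dump the current heap
    // profile of a running Go process to a file for later analysis.
    package main

    import (
        "os"
        "runtime"
        "runtime/pprof"
    )

    // dumpHeapProfile is a hypothetical helper; the name and the path
    // passed to it in main are illustrative only.
    func dumpHeapProfile(path string) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()

        // Run a GC first so the profile reflects live objects rather than
        // garbage that simply has not been collected yet.
        runtime.GC()
        return pprof.WriteHeapProfile(f)
    }

    func main() {
        if err := dumpHeapProfile("/tmp/jujud-heap.pprof"); err != nil {
            panic(err)
        }
    }

The resulting file is read offline with go tool pprof, e.g. "go tool pprof /path/to/jujud /tmp/jujud-heap.pprof", where the top and list commands show which allocation sites are retaining memory.
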
[02:11] menn0: I have logs of 1.25.6 and prior where the controller was running for weeks or months without restarting
[02:11] but 1.25.8 OOMs very quickly
[02:11] so I was using that as a basis
[02:11] one unit has ~ 37 watchers
[02:12] with 60 units
[02:12] that is ~2100 watchers
[02:12] each server-side watcher has more than one goroutine
[02:17] thumper: well the fact that you see 1.25.6 lasting for long periods is a pretty strong indicator
[02:17] thumper: it might be something that isn't obvious from the commit logs
[02:18] 1.25.6 up for 50 days
[02:18] thumper: can you reproduce the issue yourself by spinning up a reasonably sized model?
[02:18] ~4-12 hours up time since upgrade
[02:18] you need many machines and many units
[02:19] I wonder if there were charm deployments that used newish features that were updated at a similar time
[02:19] code that may have been in the older version but not touched
[02:31] babbageclunk: ship it
[02:32] thumper: dunno about the code diff btw 1.25.6 and 1.25.8 but just looking at the bugs that went in, including 1.25.7, it's plausible...
[02:32] I've done a git diff between the tag juju-1.25.6 and tip of the 1.25 branch
[02:32] only ~3k lines
[02:32] and nothing obvious
[02:33] thumper: was anything changed in dependent libraries? diff versions?
[02:33] only three
[02:33] juju/utils, goamz and one other
[02:33] juju/charms
[02:33] riiiight
[02:33] * thumper goes to look at them
[02:35] thumper: i *think* we also still patch our own mgo at release time... would b great to know if 1.25.8 was patched as well..
[02:35] i'd like to know the magic involved...
[02:35] there was no change between 1.25.6 and 1.25.8 in that
[02:35] k
[02:36] thumper: alexisb has been trying migrations and is getting lots of precheck failures regarding machines not being running
[02:36] thumper: and this is with your fix
[02:36] is she sure?
[02:36] thumper, alexisb: unless --build-agent didn't work?
[02:37] alexisb: to be really sure you're running the code you think you are: tear down the controllers, go install ./..., rebootstrap, try migrate again
[02:37] only change in utils is different TLS ciphers
[02:38] thumper: :(
[02:38] I don't use --build-agent
[02:38] thumper: you should, b/c otherwise when a release comes out and you haven't rebased/merged your work you end up bootstrapping with the released version
[02:39] I'm careful to watch whether it uploads or not
[02:42] anyone know openstack? I seem to be getting different json back from it than goose expects
[02:42] menn0: Ta!
[02:46] nevermind, figured it out
[02:46] natefinch: \o/
[02:47] thumper: did u see axw's last comment on https://bugs.launchpad.net/bugs/1587644
[02:47] Bug #1587644: jujud and mongo cpu/ram usage spike
[02:48] thumper: there is another bug in mgo that could b potentially biting us on both 1.25.x and 2.x, spiking cpu, etc...
[02:50] axw: can I grab you for 10 minutes before the tech board?
[02:51] thumper: axw is on school rotation
[02:51] he wasn't going to tech board
[02:51] he has to go to school?
[02:51] :)
[02:51] we all have to at some stage
[02:51] menn0: can you read the go heap profile?
[02:52] thumper: no sorry, never tried
[02:54] menn0: r u still planning to discuss the topic m interested in at tech board?
[02:54] menn0: nm. i see minutes
[02:54] anastasiamac: recovering from mgo/txn corruption?
[02:55] yes
[02:55] menn0: k \o/ i might join later on in the meeting then! thnx
[02:56] anastasiamac: cool. do you want me to let you know when the topic comes up?
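
The watcher arithmetic above (roughly 37 watchers per unit across 60 units, so the ~2100 watchers thumper quotes, each backed by more than one goroutine) is the kind of claim the goroutine profile can confirm directly. Another minimal sketch, again not juju's actual introspection code, with a made-up output path:

    // Minimal sketch: report the total goroutine count and write the full
    // goroutine profile, which groups goroutines by identical stacks.
    package main

    import (
        "fmt"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        fmt.Printf("goroutines: %d\n", runtime.NumGoroutine())

        f, err := os.Create("/tmp/jujud-goroutines.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // debug=1 produces a human-readable dump with a count per unique
        // stack trace.
        if err := pprof.Lookup("goroutine").WriteTo(f, 1); err != nil {
            panic(err)
        }
    }

In that format a leaked watcher shows up as a single stack trace repeated thousands of times, which makes it easy to tell the expected ~2100 watcher goroutines apart from an unbounded leak.
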
[02:57] ok I will try tearing down
[02:57] thumper, ^^^
[02:57] if that doesn't work I will open a bug as it is not urgent
[02:57] what logs do you guys need if I need to open a bug?
[03:00] menn0: sure :) if u keen...
[03:00] u r*
[03:00] alexisb: the controller machine-0 logs at DEBUG level should do it
[03:00] k
[03:08] bradm: I don't suppose I could get you to grab me a heap profile, could you?
[03:10] thumper: we can certainly make it happen, just fighting some stuff elsewhere - how do I do it?
[03:10] bradm: let me find you the instructions
[03:12] bradm: this is mostly accurate for the 1.25 code https://github.com/juju/juju/wiki/pprof-facility
[03:12] see the heading for heap profile
[03:12] I'm also interested in the goroutines
[03:25] thumper: so which bits do you need? it's a 56M file
[03:50] bradm: unfortunately the whole thing
[03:50] either private filestore or support files
[03:50] gzipped probably a little smaller
[03:51] yeah, definitely going to gzip
[03:51] the goroutines dump is only 59k or something
[03:52] babbageclunk: if you are adding start time to debug-log, can you add end time too?
[03:52] babbageclunk: that is something I have wanted for quite some time
[03:53] been meaning to get around to it
[03:53] thumper: not adding anything to the command at the moment. Also it's a bit more work - start time was already in LogTailer, but end time isn't.
[03:54] thumper: also, I've already done it! Maybe I'll cycle back and put end time in once I've done the rest of the restartable logtransfer stuff.
[03:55] thumper: I've picked up the task to get you the info you asked jacekn for in lp:1645729; I'm just pulling down his debug logs now - let me know if there's anything else you want other than unit counts.
[04:06] blahdeblah: I think we're good for now, but it's my EOD
[04:06] will continue tomorrow
[04:06] OK - will update the ticket in a sec
[04:09] cheers
[04:11] lol @ openstack provider rejecting a 200 ok response with valid json
[04:12] because it requires a 300 Multiple Choices
[04:12] really? 300? geez people
[04:15] the best is the error message: "request (http://127.0.0.1:40020/) returned unexpected status: 200 error info:
[04:42] jam: i've linked the PR in the bug :)
[06:18] wallyworld anastasiamac: I've added 2 new commits to https://github.com/juju/juju/pull/6623. main thing is adding a functional test for the statemetrics worker in the agent
[06:19] wallyworld anastasiamac: would appreciate your eyes on that bit in particular, in case you have an idea of how I can make it more of a unit test
[06:20] ok
[06:22] axw: the existing tests for other workers do nothing more than patch worker.New and check that the worker is started, rather than anything functional
[06:22] wallyworld: yeah, that feels pretty dirty to me
[06:22] it does
[06:22] trying to avoid patching
[06:24] axw: there's maybe not a lot else you can do - i'd almost be inclined to move the test to featuretests
[06:25] wallyworld: I'll have a look at doing that. there is an introspection suite there already, could piggyback on that
[06:25] could do yeah, since it really is testing the moving parts all working together
[06:29] axw: i'll look a bit later but m happy to delegate if wallyworld is happy \o/
[06:30] anastasiamac: one set of eyes is probably enough, thanks
[06:33] axw: :D it's also the quality of that set that gives me comfort :)
=== frankban|afk is now known as frankban
[08:34] ermahgerd, out of ec2 instances again
[08:43] axw: sorry, missed you last night. goose deps are hard coded in the merge job still, we probably need to make a dependencies.tsv at some point
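
For readers unfamiliar with the dependencies.tsv that mgz mentions: at the time juju pinned its Go dependencies in a godeps-style tab-separated file, one dependency per line, giving the package path, the VCS, the pinned revision, and the revision timestamp. The entry below only illustrates the shape; the revision hash and date are placeholders, not a real pin, and the columns are tab-separated in the real file.

    github.com/juju/loggo    git    0123456789abcdef0123456789abcdef01234567    2016-11-30T00:00:00Z
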
[08:43] for now, updated and tried merge again
[08:43] mgz: okey dokey. thank you
[08:43] mgz: are you able to delete an instance or two so http://juju-ci.vapour.ws:8080/job/github-merge-juju/9739/ can be retried too?
[08:43] and we're out of instances? I did manual cleanup on monday...
[08:44] mgz: seems so :(
[08:44] having a look
[08:46] okay, terminating about 50 in us-east-1
[08:47] mgz: :o
[08:47] mgz: thanks
[08:47] mostly ha-recovery, a couple of other things
[09:04] axw: goose change merged
[09:04] mgz: yup, thanks. the juju one to use it has now arrived :)
[09:22] frobware: ping
[09:23] macgreagoir: hi
[09:23] HO?
[09:54] mgz: poke
[10:07] jam: heya
[10:19] can I get a stamp on https://github.com/juju/juju/pull/6602 please?
[10:19] did the cherrypick of changes required for the utils bump, so should be good to go now
[10:23] anybody running MAAS 2.1.1 and seeing DHCP occasionally failing? Answers some of my wtf moments today.
=== iatrou_ is now known as iatrou
[10:28] frobware: CI is still on 2.1.0
[10:28] mgz: ack
[12:05] mgz, http://paste.ubuntu.com/23557786/
[12:05] thanks!
[12:35] mgz: standup?
=== freyes__ is now known as freyes
[13:39] * frobware lunches
[14:12] of course, the bug must be in the most complex patch :p
[14:29] perrito666: can I bug you to be a second pair of eyes on a small good review?
[14:30] sure
[14:30] https://github.com/go-goose/goose/pull/37
[14:30] s/good/goose/
[14:31] good as well hopefully.
[14:32] mgz: goose is not good, but that's not your fault ;)
[14:32] mgz: lgtm
[14:33] perrito666: thanks!
=== cmars` is now known as cmars
[14:53] bbl, errand
=== icey is now known as Guest28168
[16:19] so... we're supposed to branch off staging and then PR onto develop, right?
[16:21] natefinch: ideally, but that's not realistic at present
[16:21] well, even ideally it doesn't actually work
[16:22] like, if there's any kind of conflict, you just have to rebase onto develop and fix the merge conflict
[16:23] so, might as well just branch off develop anyway
[16:24] mgz, fwiw I've put up pull requests for juju 1.25 and 2.0 to update deps. The Jenkins job has failed due to the http://paste.ubuntu.com/23557786/ compat bugs
[16:25] gnuoy: right, we're going to need to bundle those changes into the juju code along with the dep update
[16:25] mgz I'm happy to update my pull requests
[16:25] but I'd generally start with the (off develop) dep bump for 2.1
[16:26] either way around is fine
[16:26] I've not sent email yet about compat breakage, but just fixing for 1.25 seems okay
[16:28] mgz: that will never work, git does not work the way whoever wrote that thinks it works
[16:29] perrito666: which bit in particular?
[16:30] the branch from develop, merge to staging?
[16:30] branches must return to their source
[16:30] yes
[16:30] it technically can work, but requires a bunch of discipline
[16:30] mgz: not really, there is no amount of discipline that can make a branch that has diverged enough merge cleanly
[16:31] perrito666: the point is divergence should really be only a day or two's worth of commits
[16:31] and if you get a bad set you roll the lot back
[16:31] you could make it a bit better by forcing everyone to squash their commits, and even then the conflicts you solve are useless if all of the commits don't make it to staging
[16:32] I bet there is no actual practical reason for that (there actually is no gain in the process as it is suggested)
[16:34] if we all squashed our commits (that would give you roughly 2 commits per feature) you can remove the offending commit only without altering much of the rest
[16:45] I thought we had agreed to squash commits? if we also had the bot do a squash & merge, it could be exactly one commit per feature.
[16:50] that would be glorious, things like git bisect would work properly for instance
=== icey_ is now known as icey
[17:33] brb reboot
=== frankban is now known as frankban|afk
[18:04] rick_h: bonding - I wonder if an up-front limitation that we have is... if you're using bonds then you need to B-A-T (bridge-ahead-of-time) via MAAS. Otherwise all I can guarantee is that at some point the machine will wedge with ifdown/up
[18:04] macgreagoir: ^^
[18:05] rick_h: I just spent all afternoon watching it fail in subtle ways. macgreagoir is my witness. :)
[18:05] rick_h: generally, bridging vlans, aliases, and non-bonded interfaces seems OK
[18:06] jam: ^^
[18:06] frobware: +1
[18:07] rick_h: it seems I could spend the rest of my days trying to make this work. it seems fundamentally racy.
[18:13] lol, of course, I added checks to ensure that endpoints represent real clouds, and now all my unit tests fail because - tada - they weren't adding real clouds.
[18:14] (where all == 4, but still)
[18:15] (and where unit == full stack, obv)
[18:44] frobware: full support of that being maas driven
[18:45] rick_h: I need to take a step back and ensure what we have in 2.0.2 actually works on the node I'm using. Having said that, we did see the new stuff working today, but the ratio of good:bad is like 1:50.
[18:45] rick_h: and when you get it wrong systemd graciously spends 5 minutes trying to bring up the interfaces (which fails) before you get to a login prompt. Grrr.
[18:46] rick_h: ifupdown is not happy in the modern world.
[18:47] rick_h: and I haven't tested at all on trusty. different kernel, different ... fun.
[18:48] * frobware heads to the pub.
[19:02] oh, external tests, you are the worst
[19:56] need some assistance figuring out what has failed: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9743/
[19:57] i see some things which might be an issue in lxd-err.log? but that's about it?
[19:57] trusty-out.log is impossible to scan now
[20:08] katco: console output says lxd failed
[20:08] katco: the output in lxd-err.log is pretty hard to read, but I see "error: controller merge-juju-lxd not found"
[20:09] oh wait, that's the just-in-case cleanup, it should fail, that's ok
[20:09] natefinch: i am a bit stumped
[20:10] katco: I guess the exception at the end there... seems like printing out the stack trace is extraneous
[20:10] Command '('juju', '--debug', 'bootstrap', '--constraints', 'mem=2G', 'lxd/localhost', 'merge-juju-lxd', '--config', '/tmp/tmpsoEIYE.yaml', '--default-model', 'merge-juju-lxd', '--agent-version', '2.1-beta2', '--bootstrap-series', 'xenial')' returned non-zero exit status 1
[20:11] ahh here we go:
[20:11] 19:33:59 ERROR cmd supercommand.go:458 failed to bootstrap model: cannot start bootstrap instance: unable to get LXD image for ubuntu-xenial: Error adding alias ubuntu-xenial: already exists
[20:11] natefinch: sounds spurious?
[20:12] balloons: ^^ ?
[20:12] sounds like we don't have code in the test to handle this codepath where the image already exists.
[20:13] or rather, I guess that's a Juju message
[20:13] natefinch: not sure why this commit is triggering this though
[20:13] katco: no clue
[20:15] sinzui: balloons: mgz: any idea if the CI environment is to blame here?
[20:15] katco: that is lxd
[20:16] katco: I have seen it from time to time over the years
[20:16] sinzui: should i just requeue?
[20:16] katco: yes
[20:16] sinzui: ta
[20:17] ty sinzui
[20:17] and that's annoying :-(
[20:35] blahdeblah: you around?
[20:36] rick_h, alexisb: I think that SSL issue might be the openssl version (yay dependencies)
[20:36] rick_h, alexisb: http://stackoverflow.com/questions/38489767/ssl-error-on-python-request
[20:37] natefinch: rgr
[20:45] thumper: I shouldn't be, but...
[20:46] blahdeblah: if you shouldn't be, then don't be
[20:46] thumper: Well, now that I'm here, what's up? :-)
[20:46] blahdeblah: mup tells me that it is very early for you, is that right?
[20:47] nah - not a big deal
[20:47] been up for about 4 hrs already :-\
[20:47] blahdeblah: yesterday bradm got a heap profile from a misbehaving apiserver process for me, but hasn't passed on the details...
[20:47] wat?
[20:47] seriously?
[20:48] Long story
[20:48] blahdeblah: was wondering if I could get a heap profile from the apiserver (hopefully not too soon after restarting)
[20:48] to see if we can work out where the leak is
[20:48] details of getting the heap profile from 1.25 are documented here https://github.com/juju/juju/wiki/pprof-facility
[20:48] The one bradm was working on was an OpenStack, IIRC
[20:49] Different from the env I gathered data for yesterday
[20:49] yeah, a different environment, but showing similar problems
[20:54] thumper: I can gather that from our environment later today.
[20:54] blahdeblah: thanks
[21:03] rick_h, ping
[21:21] is it just me, or is calling strings.TrimSpace on someone's password a bad idea?
[21:21] natefinch: seems like a problem
[21:21] natefinch: I mean, it doesn't seem like a *good* idea.
[21:22] reminds me of a website, I forget which, that just truncated your password if it was too long
[21:24] nice
[21:38] wallyworld: are you on yet?
[21:38] somewhat
[21:39] wallyworld: can you explain what this comment means? https://github.com/juju/juju/blob/staging/cmd/juju/cloud/addcredential.go#L330
[21:40] for now, we don't support allowing the user to type in a multi-line attribute - they only have the option of specifying a filepath to a file which contains the attribute
[21:41] the concrete case for that is the GCE credential info, from memory
[21:43] wallyworld: but what does that have to do with the line below it?
[21:44] wallyworld: also, it looks like if that if statement is false, then we do validation against value which hasn't been set?
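
Back on the strings.TrimSpace question from a little earlier: trimming a password silently changes what the user typed, so anyone whose password legitimately begins or ends with whitespace can never authenticate with it again. A tiny self-contained illustration, not juju's code:

    // Shows why trimming a password is a problem: the value compared or
    // stored no longer matches what the user actually typed.
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        typed := " hunter2 " // the user deliberately chose surrounding spaces
        trimmed := strings.TrimSpace(typed)

        fmt.Printf("typed:   %q\n", typed)   // " hunter2 "
        fmt.Printf("trimmed: %q\n", trimmed) // "hunter2"
        fmt.Println("match:", typed == trimmed) // false: login would fail
    }
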
[21:44] give me a minute to read the code
[21:46] that comment block looks like it's a general statement about what's supported for credential attr entry in the whole loop below, rather than specifically the line of code just below it
[21:47] so the location of the comment is a bit crap
[21:47] oh ok, that makes a lot more sense :)
[21:47] sorry
[21:47] I'm in that code, so I can add a blank line to make it more obvious
[21:47] ty
[21:48] or move it outside the loop or something
[21:48] yeah
[21:48] sad when you comment code and then need to explain the comment
[21:48] heh
[21:49] man, I really don't like the code font they've started using on golang.org.
[21:49] heh... I use it in my editor
[21:51] It did take a little getting used to, but I stopped noticing it after the first half hour
[21:56] bbl
=== natefinch is now known as natefinch-afk
[22:10] menn0, thumper, anyone else: review please? https://github.com/juju/juju/pull/6641
[22:19] babbageclunk: looking
[22:19] redir: thanks!
[23:19] axw: ping
[23:21] perrito666, I have him occupied atm
[23:21] * perrito666 imagines axw cutting the lawn at alexisb's house
[23:22] perrito666, that takes a tractor and several days
[23:24] thumper: Fix for migration of charms with ~user component: https://github.com/juju/juju/pull/6642
[23:24] * thumper looks while being in a call
[23:39] menn0, thumper: can one of you look at https://github.com/juju/juju/pull/6641
[23:40] menn0, thumper: redir likes it but it could do with some migrationy eyes too
[23:41] babbageclunk: will look after standup
[23:50] perrito666: cutting lawn?? (I barely cut my own, it's a mess)
[23:50] axw: I pay someone to do it because I don't own a big enough machine :(
[23:51] such things can be purchased ;) but here the service-to-hardware ratio is probably higher
[23:53] yep, over 500USD for the machine and under 15 for the cut
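
On the comment-placement discussion that wraps up above: the fix wallyworld describes is simply to keep a comment that documents a whole loop separate from the first statement inside it, either with a blank line or by hoisting it above the loop. A stripped-down illustration in that spirit; this is not the actual addcredential.go code, and the attribute names are invented:

    // Illustrative only: a comment describing a whole loop reads as if it
    // documents the first statement unless it is kept separate from it.
    package main

    import "fmt"

    func main() {
        attrs := []string{"client-id", "client-email", "private-key"}

        // For now we don't support typing in a multi-line attribute value
        // interactively; the only option is a path to a file containing it.
        // This comment describes the whole loop, so it sits above the loop
        // rather than against any single line inside it.
        for _, name := range attrs {
            fmt.Println("prompting for", name)
        }
    }
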