/srv/irclogs.ubuntu.com/2016/11/30/#juju-dev.txt

axwmgz: forgot to ask you about the failure in https://github.com/go-goose/goose/pull/33. how does one update deps for that package's CI? it needs github.com/juju/loggo00:02
bradmis there any plans for 1.25.9?  1.25.8 seems suuuuper memory hungry00:45
thumperbradm: hey, yeah, I'm looking into a leak in 1.25.8 now00:59
thumperbradm: do you have any data points for me?00:59
bradmthumper: how does 12G res jujud on node 0 sound?01:01
thumperit sounds entirely unreasonable01:01
thumperhow big a model?01:01
bradm23G virtual and "0.012t" res, which is an interesting way to put it01:02
bradmthumper: 10 physical machines, a bunch of containers01:02
thumperhow many units?01:02
bradmit's an HA openstack deployment, so lots01:03
thumper40?01:03
thumper60?01:03
bradmincluding subordinates, about 290 units01:03
thumper100?01:03
thumperhuh01:03
thumperok01:03
thumperon only 10 physical machines?01:04
thumperwow01:04
bradmyup01:04
bradmfairly standard deployment01:04
bradmthat includes landscape client, nrpe, ksplice, things like that01:04
thumperhmm...01:05
bradmlooks like nearly 50 lxcs on there01:05
thumperso ~5 units per machine01:06
bradmabout that01:06
thumperI suppose that isn't terrible01:06
thumperwhat version did you upgrade from?01:06
bradmfresh 1.25.8 install01:06
thumperand do you have any indication of memory it used before?01:06
thumperah01:06
thumperok01:06
* thumper taps fingers01:06
bradmwe were hitting the tomb dying error last week, and ended up having to go with fresh 1.25.801:07
thumpertomb dying error?01:07
thumperwhat is that?01:07
thumperis it related to this? https://bugs.launchpad.net/juju-core/+bug/164572901:07
mupBug #1645729: environment unstable after 1.25.8 upgrade <juju-core:Triaged> <https://launchpad.net/bugs/1645729>01:07
thumperok, I think I'm going to have to work out how to read the go heap profile dumps01:08
bradmhttps://bugs.launchpad.net/juju-core/1.25/+bug/1613992 <- that one01:08
mupBug #1613992: 1.25.6 "ERROR juju.worker.uniter.filter filter.go:137 tomb: dying" <canonical-is> <cdo-qa-blocker> <landscape> <juju-core:Fix Committed> <juju-core 1.25:Fix Committed> <https://launchpad.net/bugs/1613992>01:08
thumperbradm: in the 1.25 agents, there is a point where we can get the agent to dump us a heap profile01:09
thumpermaybe this could point to the leak01:09
thumperaxw: do you have some familiarity in reading the go heap profiles?01:10
bradmthumper: well, anything we can do to help out, let me know.01:12
thumperbradm: will do01:12
bradmwe'd just handed the stack over to the customer last week, but they're only doing testing now01:12
bradmthumper: interestingly the other state servers aren't leaking as much, something like 13G virt, 10G res on one, 11G virt, 8G res on the other01:17
thumperHA right?01:17
bradmyeah01:17
thumpermulti-model?01:17
thumperprobably not01:17
thumperwas still behind a feature flag01:18
bradmnope, just a simple openstack deploy01:18
babbageclunkmenn0, thumper: review plz? https://github.com/juju/juju/pull/663301:57
babbageclunkmenn0: So I was thinking that tracking the latest log time seen every 2 minutes of log messages would probably be an ok balance between DB activity and getting annoyed by double-ups. Sound alright?02:08
thumperbugger...02:08
babbageclunkmenn0: Works out to 864 extra writes over 3 days worth of logs.02:08
thumpercrawled through all the code changes from 1.25.6 to 1.25.802:08
thumpernothing obvious02:08
babbageclunkthumper: stink02:08
menn0babbageclunk: that seems ok, especially given that in most cases it won't be interrupted02:09
menn0thumper: how sure are we that the problem actually started since 1.25.6?02:10
thumpermenn0: I'm not entirely02:10
thumpercould well be before02:10
babbageclunkmenn0: oops, that was 5 mins - 2 mins is 2160 writes.02:10
thumpermenn0: I have logs of 1.25.6 and prior where controller was running for weeks or months without restarting02:11
thumperbut 1.25.8 OOMs very quickly02:11
thumperso I was using that as a basis02:11
thumperone unit has ~ 37 watchers02:11
thumperwith 60 units02:12
thumperthat is ~2100 watchers02:12
thumpereach server side watcher has more than one goroutine02:12
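
A rough back-of-the-envelope version of that estimate, as a sketch. The 37 watchers per unit and the 60 unit case come from the chat, and 290 is the unit count bradm reported earlier. The goroutines-per-watcher value of 2 is purely an assumption standing in for "more than one", and the function names are hypothetical.

```go
package main

import "fmt"

// estimateWatchers multiplies the unit count by a watchers-per-unit figure.
func estimateWatchers(units, watchersPerUnit int) int {
	return units * watchersPerUnit
}

// estimateGoroutines scales the watcher count by an assumed number of
// goroutines per server side watcher.
func estimateGoroutines(watchers, goroutinesPerWatcher int) int {
	return watchers * goroutinesPerWatcher
}

func main() {
	const watchersPerUnit = 37     // rough figure quoted in the chat
	const goroutinesPerWatcher = 2 // assumption: the chat only says "more than one"

	for _, units := range []int{60, 290} {
		w := estimateWatchers(units, watchersPerUnit)
		g := estimateGoroutines(w, goroutinesPerWatcher)
		fmt.Printf("%d units: about %d watchers, at least %d goroutines\n", units, w, g)
	}
}
```

For 60 units this works out to 2220 watchers, in the same ballpark as the ~2100 quoted; for a 290 unit deployment the same arithmetic gives 10730 watchers.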
menn0thumper: well the fact that you see 1.25.6 lasting for long periods is a pretty strong indicator02:17
menn0thumper: it might be something that isn't obvious from the commit logs02:17
thumper1.25.6 up for 50 days02:18
menn0thumper: can you reproduce the issue yourself by spinning up a reasonably sized model?02:18
thumper~4-12 hours up time since upgrade02:18
thumperyou need many machines and many units02:18
thumperI wonder if there were charm deployments that used newish features that were updated at a similar time02:19
thumpercode that may have been in the older version but not touched02:19
menn0babbageclunk: ship it02:31
anastasiamacthumper: dunno about code diff btw 1.25.6 and 1.25.8 but just looking at the bugs that went in, including 1.25.7, it's plausible...02:32
thumperI've done a git diff between the tag juju-1.25.6 and tip of 1.25 branch02:32
thumperonly ~3k lines02:32
thumperand nothing obvious02:32
anastasiamacthumper: was anything changed in dependent libraries? diff versions?02:33
thumperonly three02:33
thumperjuju/utils goamz and one other02:33
thumperjuju/charms02:33
anastasiamacriiiight02:33
* thumper goes to look at them02:33
anastasiamacthumper: i *think* we also still patch our own mgo at release time... would b great to know if 1.25.8 was patched as well..02:35
anastasiamaci'd like to know the magic involved...02:35
thumperthere was no change between 1.25.6 and 1.25.8 in that02:35
anastasiamack02:35
menn0thumper: alexisb has been trying migrations and is getting lots of precheck failures regarding machines not being running02:36
menn0thumper: and this is with your fix02:36
thumperis she sure?02:36
menn0thumper, alexisb: unless --build-agent didn't work?02:36
menn0alexisb: to be really sure you're running the code you think you are: tear down the controllers, go install ./..., rebootstrap, try migrate again02:37
thumperonly change in utils is different TLS cyphers02:37
anastasiamacthumper: :(02:38
thumperI don't use --build-agent02:38
menn0thumper: you should, b/c otherwise when a release comes out and you haven't rebased/merged your work you end up bootstrapping with the released version02:38
thumperI'm careful to watch whether it uploads or not02:39
natefinchanyone know openstack?  I seem to be getting different json back from it than goose expects02:42
babbageclunkmenn0: Ta!02:42
natefinchnevermind, figured it out02:46
anastasiamacnatefinch: \o/02:46
anastasiamacthumper: did u see axw's last comment on https://bugs.launchpad.net/bugs/158764402:47
mupBug #1587644: jujud and mongo cpu/ram usage spike <canonical-bootstack> <canonical-is> <eda> <performance> <juju:Triaged> <juju-core:Triaged> <juju-core 1.25:In Progress by axwalk> <https://launchpad.net/bugs/1587644>02:47
anastasiamacthumper: there is another bug in mgo that could b potentially biting us on both 1.25.x and 2.x, spiking cpu, etc...02:48
thumperaxw: can I grab you for 10 minutes before the tech board?02:50
anastasiamacthumper: axw is on school rotation02:51
anastasiamache wasn't going to tech board02:51
thumperhe has to go to school?02:51
anastasiamac:)02:51
anastasiamacwe all have to at some stage02:51
thumpermenn0: can you read the go heap profile?02:51
menn0thumper: no sorry, never tried02:52
anastasiamacmenn0: r u still planning to discuss the topic m interested in at tech board?02:54
anastasiamacmenn0: nm. i see minutes02:54
menn0anastasiamac: recovering from mgo/txn corruption?02:54
menn0yes02:55
anastasiamacmenn0: k \o/ i might join later on in the meeting then! thnx02:55
menn0anastasiamac: cool. do you want me to let you know when the topic comes up?02:56
alexisbok I will try tearing down02:57
alexisbthumper, ^^^02:57
alexisbif that doesnt work I will open a bug as it is not urgent02:57
alexisbwhat logs do you guys need if I need to open a bug?02:57
anastasiamacmenn0: sure :) if u keen...03:00
anastasiamacu r*03:00
menn0alexisb: the controller machine-0 logs at DEBUG level should do it03:00
alexisbk03:00
thumperbradm: I don't suppose I could get you to grab me a heap profile could you?03:08
bradmthumper: we can certainly make it happen, just fighting some stuff elsewhere - how do I do it?03:10
thumperbradm: let me find you the instructions03:10
thumperbradm: this is mostly accurate for the 1.25 code https://github.com/juju/juju/wiki/pprof-facility03:12
thumpersee the heading of heap profile03:12
thumperI'm also interested in the goroutines03:12
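
For readers unfamiliar with the two profiles being asked for, here is a generic sketch of how any Go program can dump its heap and goroutine profiles with the standard runtime/pprof package. It is not the juju pprof facility described on the wiki page (that page is the reference for 1.25 agents); the helper name dumpProfile is an assumption for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// dumpProfile writes the named runtime profile ("heap", "goroutine", ...)
// into memory. debug=1 asks for the human-readable text form.
// (Hypothetical helper, for illustration only.)
func dumpProfile(name string, debug int) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, debug); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	for _, name := range []string{"heap", "goroutine"} {
		data, err := dumpProfile(name, 1)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s profile: %d bytes\n", name, len(data))
	}
}
```

The heap profile shows where live memory was allocated, while the goroutine profile lists the stacks of all running goroutines, which is why both are useful when hunting a leak.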
bradmthumper: so which bits do you need?  its a 56M file03:25
thumperbradm: unfortunately the whole thing03:50
thumpereither private filestore or support files03:50
thumpergzipped probably a little smaller03:50
bradmyeah, definitely going to gzip03:51
bradmthe goroutines is only 59k or something03:51
thumperbabbageclunk: if you are adding start time to debug-log, can you add end time too?03:52
thumperbabbageclunk: that is something I have wanted for quite some time03:52
thumperbeen meaning to get around to it03:53
babbageclunkthumper: not adding anything to the command at the moment. Also it's a bit more work - start time was already in LogTailer, but end time isn't.03:53
babbageclunkthumper: also, I've already done it! Maybe I'll cycle back and put end time in once I've done the rest of the restartable logtransfer stuff.03:54
blahdeblahthumper: I've picked up the task to get you the info you asked jacekn for in lp:1645729; I'm just pulling down his debug logs now - let me know if there's anything else you want other than unit counts.03:55
thumperblahdeblah: I think we're good for now, but my EOD04:06
thumperwill continue tomorrow04:06
blahdeblahOK - will update the ticket in a sec04:06
thumpercheers04:09
natefinchlol @ openstack provider rejecting a 200 ok response with valid json04:11
natefinchbecause it requires a 300 Multiple Choices04:12
natefinchreally? 300? geez people04:12
natefinchthe best is the error message: "request (http://127.0.0.1:40020/) returned unexpected status: 200 error info: <valid json>"04:15
anastasiamacjam: i've linked the PR in the bug :)04:42
axwwallyworld anastasiamac: I've added 2 new commits to https://github.com/juju/juju/pull/6623. main thing is adding a functional test for the statemetrics worker in the agent06:18
axwwallyworld anastasiamac: would appreciate your eyes on that bit in particular, in case you have an idea of how I can make it more of a unit test06:19
wallyworldok06:20
wallyworldaxw: the existing tests for other workers do nothing more than patch the worker.New and check that the worker is started, rather than anything functional06:22
axwwallyworld: yeah, that feels pretty dirty to me06:22
wallyworldit does06:22
axwtrying to avoid patching06:22
wallyworldaxw: there's maybe not a lot else you can do - i'd almost be inclined to move the test to featuretests06:24
axwwallyworld: I'll have a look at doing that. there is an introspection suite there already, could piggy back on that06:25
wallyworldcould do yeah, since it really is testing the moving parts all working together06:25
anastasiamacaxw: i'll look a bit later but m happy to delegate if wallyworld is happy \o/06:29
axwanastasiamac: one set of eyes is probably enough, thanks06:30
anastasiamacaxw: :D it's also the quality of that set that gives me comfort :)06:33
=== frankban|afk is now known as frankban
axwermahgerd, out of ec2 instances again08:34
mgzaxw: sorry, missed you last night. goose deps are hard coded in the merge job still, we probably need to make a dependencies.tsv at some point08:43
mgzfor now, updated and tried merge again08:43
axwmgz: okey dokey. thank you08:43
axwmgz: are you able to delete an instance or two so http://juju-ci.vapour.ws:8080/job/github-merge-juju/9739/ can be retried too?08:43
mgzand we're out of instances? I did manual cleanup on monday...08:43
axwmgz: seems so :(08:44
mgzhaving a look08:44
mgzokay, terminating about 50 in us-east-108:46
axwmgz: :o08:47
axwmgz: thanks08:47
mgzmostly ha-recovery, a couple of other things08:47
mgzaxw: goose change merged09:04
axwmgz: yup, thanks. juju one to use it has now arrived :)09:04
macgreagoirfrobware: ping09:22
frobwaremacgreagoir: hi09:23
macgreagoirHO?09:23
jammgz: poke09:54
mgzjam: heya10:07
mgzcan I get a stamp on https://github.com/juju/juju/pull/6602 please?10:19
mgzdid the cherrypick of changes required for the utils bump, so should be good to go now10:19
frobwareanybody running MAAS 2.1.1 and seeing DHCP occasionally failing? Answers some of my wtf moments today.10:23
=== iatrou_ is now known as iatrou
mgzfrobware: CI is still in 2.1.010:28
frobwaremgz: ack10:28
gnuoymgz, http://paste.ubuntu.com/23557786/12:05
mgzthanks!12:05
jammgz: standup?12:35
=== freyes__ is now known as freyes
* frobware lunches13:39
perrito666of course, the bug must be in the most complex patch :p14:12
mgzperrito666: can I bug you to be a second pair of eyes on a small good review?14:29
perrito666sure14:30
mgzhttps://github.com/go-goose/goose/pull/3714:30
mgzs/good/goose/14:30
mgzgood as well hopefully.14:31
natefinchmgz: goose is not good, but that's not your fault ;)14:32
perrito666mgz: lgtm14:32
mgzperrito666: thanks!14:33
=== cmars` is now known as cmars
perrito666bbl, errand14:53
=== icey is now known as Guest28168
natefinchso... we're supposed to branch off staging and then PR onto develop, right?16:19
mgznatefinch: ideally, but that's not realistic at present16:21
natefinchwell, even ideally it doesn't actually work16:21
natefinchlike, if there's any kind of conflict, you just have to rebase onto develop and fix the merge conflict16:22
natefinchso, might as well just branch off develop anyway16:23
gnuoymgz, fwiw I've put up pull requests for juju 1.25 and 2.0 to update deps. The Jenkins job has failed due to the  http://paste.ubuntu.com/23557786/ compat bugs16:24
mgzgnuoy: right, we're going to need to bundle those changes into the juju code along with the dep update16:25
gnuoymgz I'm happy to update my pull requests16:25
mgzbut I'd generally start with the (off develop) dep bump for 2.116:25
mgzeither way around is fine16:26
mgzI've not sent email yet about compat breakage, but just fixing for 1.25 seems okay16:26
perrito666mgz: that will never work, git does not work the way whoever wrote that thinks it works16:28
mgzperrito666: which bit in particular?16:29
mgzthe branch from develop, merge to staging?16:30
perrito666branches must return to their source16:30
perrito666yes16:30
mgzit technically can work, but requires a bunch of discipline16:30
perrito666mgz: not really, there is no amount of discipline that can make a branch diverged enough merge cleanly16:30
mgzperrito666: the point is divergence should really be only a day or two's worth of commits16:31
mgzand if you get a bad set you roll the lot back16:31
perrito666you could make it a bit better by forcing anyone to squash their commits and even then the conflicts you solve are useless if all of the commits dont do it to staging16:31
perrito666I bet there is no actual practical reason for that (there actually is no gain in the process as it is suggested)16:32
perrito666if we all squashed our commits (that would give you roughly 2 commits per feature) you can remove the offending commit only without altering much the rest16:34
natefinchI thought we had agreed to squash commits?  if we also had the bot do a squash & merge, it could be exactly one commit per feature.16:45
perrito666that would be glorious, things like git bisect would work properly for instance16:50
=== icey_ is now known as icey
redirbrb reboot17:33
=== frankban is now known as frankban|afk
frobwarerick_h: bonding - I wonder if an up-front limitation we have is... if you're using bonds then you need to B-A-T (bridge-ahead-of-time) via MAAS. Otherwise all I can guarantee is that at some point the machine will wedge with ifdown/up18:04
frobwaremacgreagoir: ^^18:04
frobwarerick_h: I just spent all the afternoon watching it fail in subtle ways. macgreagoir is my witness. :)18:05
frobwarerick_h: generally, bridging vlans, aliases, and non-bonded interfaces seems OK18:05
frobwarejam: ^^18:06
rick_hfrobware: +118:06
frobwarerick_h: it seems I could spend the rest of my days trying to make this work. it seems fundamentally racy.18:07
natefinchlol, of course, I added checks to ensure that endpoints represent real clouds, and now all my unit tests fail because - tada - they weren't adding real clouds.18:13
natefinch(where all == 4, but still)18:14
natefinch(and where unit == full stack, obv)18:15
rick_hfrobware: full support of that being maas driven18:44
frobwarerick_h: I need to take a step back and ensure what we have in 2.0.2 actually works on the node I'm using. Having said that, the new stuff we did see working today but the ratio of good:bad is like 1:50.18:45
frobwarerick_h: and when you get it wrong systemd graciously spends 5 minutes trying to bring up the interfaces (which fails) before you get to a login prompt. Grrr.18:45
frobwarerick_h: ifupdown is not happy in the modern world.18:46
frobwarerick_h: and I haven't tested at all on trusty. different kernel, different ... fun.18:47
* frobware heads to the pub.18:48
natefinchoh, external tests, you are the worst19:02
katconeed some assistance figuring out what has failed: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9743/19:56
katcoi see some things which might be an issue in lxd-err.log? but that's about it?19:57
katcotrusty-out.log is impossible to scan now19:57
natefinchkatco: console output says lxd failed20:08
natefinchkatco: the output in lxd-err.log is pretty hard to read, but I see "error: controller merge-juju-lxd not found"20:08
natefinchoh wait, that's the just in case cleanup, it should fail, that's ok20:09
katconatefinch: i am a bit stumped20:09
natefinchkatco: I guess the exception at the end there... seems like printing out the stack trace is extraneous20:10
natefinchCommand '('juju', '--debug', 'bootstrap', '--constraints', 'mem=2G', 'lxd/localhost', 'merge-juju-lxd', '--config', '/tmp/tmpsoEIYE.yaml', '--default-model', 'merge-juju-lxd', '--agent-version', '2.1-beta2', '--bootstrap-series', 'xenial')' returned non-zero exit status 120:10
natefinchahh here  we go:20:11
natefinch19:33:59 ERROR cmd supercommand.go:458 failed to bootstrap model: cannot start bootstrap instance: unable to get LXD image for ubuntu-xenial: Error adding alias ubuntu-xenial: already exists20:11
katconatefinch: sounds spurious?20:11
katcoballoons: ^^ ?20:11
natefinchsounds like we don't have code in the test to handle this codepath where the image already exists.20:12
natefinchor rather, I guess that's a Juju message20:13
katconatefinch: not sure why this commit is triggering this though20:13
natefinchkatco: no clue20:13
katcosinzui: balloons: mgz: any idea if the CI environment is to blame here?20:15
sinzuikatco: that is lxd20:15
sinzuikatco: I have seen it from time to time over the year20:16
katcosinzui: should i just requeue?20:16
sinzuikatco: yes20:16
katcosinzui: ta20:16
balloonsty sinzui20:17
balloonsand that's annoying :-(20:17
thumperblahdeblah: you around?20:35
natefinchrick_h, alexisb: I think that SSL issue might be the openssl version (yay dependencies)20:36
natefinchrick_h, alexisb: http://stackoverflow.com/questions/38489767/ssl-error-on-python-request20:36
rick_hnatefinch: rgr20:37
blahdeblahthumper: I shouldn't be, but...20:45
thumperblahdeblah: if you shouldn't be, then don't be20:46
blahdeblahthumper: Well, now that I'm here, what's up? :-)20:46
thumperblahdeblah: mup tells me that it is very early for you, is that right?20:46
blahdeblahnah - not a big deal20:47
blahdeblahbeen up for about 4 hrs already :-\20:47
thumperblahdeblah: yesterday bradm got a heapprofile from a misbehaving apiserver process for me, but hasn't passed on details...20:47
thumperwat?20:47
thumperseriously?20:47
blahdeblahLong story20:48
thumperblahdeblah: was wondering if I could get a heapprofile from the apiserver (hopefully not too soon after restarting)20:48
thumperto see if we can work out where the leak is20:48
thumperdetails of getting the heap profile from 1.25 are documented here https://github.com/juju/juju/wiki/pprof-facility20:48
blahdeblahThe one bradm was working on was an OpenStack, IIRC20:48
blahdeblahDifferent from the env I gathered data for yesterday20:49
thumperyeah, a different environment, but showing similar problems20:49
blahdeblahthumper: I can gather that from our environment later today.20:54
thumperblahdeblah: thanks20:54
alexisbrick_h, ping21:03
natefinchis it just me, or is calling strings.TrimSpace on someone's password a bad idea?21:21
menn0natefinch: seems like a problem21:21
babbageclunknatefinch: I mean, it doesn't seem like a *good* idea.21:21
natefinchreminds me of a website, I forget which, that just truncated your password if it was too long21:22
babbageclunknice21:24
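
A small sketch of why trimming is risky: strings.TrimSpace removes all leading and trailing whitespace, so a password that deliberately starts or ends with a space is silently changed. The helper names below are hypothetical, and trimLineEnding is just one possible alternative for stripping only the terminator left by reading a line of input, not something taken from the juju code.

```go
package main

import (
	"fmt"
	"strings"
)

// trimChangesPassword reports whether strings.TrimSpace would alter the
// password as the user typed it. (Hypothetical helper.)
func trimChangesPassword(pw string) bool {
	return strings.TrimSpace(pw) != pw
}

// trimLineEnding strips only trailing CR/LF characters, leaving any spaces
// that are part of the password intact. (Hypothetical helper.)
func trimLineEnding(line string) string {
	return strings.TrimRight(line, "\r\n")
}

func main() {
	typed := " hunter2 \n" // as read from input, with the line ending still attached

	fmt.Printf("TrimSpace:      %q\n", strings.TrimSpace(typed))
	fmt.Printf("trimLineEnding: %q\n", trimLineEnding(typed))
	fmt.Println("TrimSpace changes \" hunter2 \":", trimChangesPassword(" hunter2 "))
}
```

The difference matters because the trimmed and untrimmed values would no longer match wherever the password is checked without the same trimming.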
natefinchwallyworld: are you on yet?21:38
wallyworldsomewhat21:38
natefinchwallyworld: can you explain what this comment means? https://github.com/juju/juju/blob/staging/cmd/juju/cloud/addcredential.go#L33021:39
wallyworldfor now, we don't support allowing the user to type in a multi-line attribute - they only have the option of specifying a filepath to a file which contains the attribute21:40
wallyworldthe concrete case for that is the GCE credential info from memory21:41
natefinchwallyworld: but what does that have to do with the line below it?21:43
natefinchwallyworld: also, it looks like if that if statement is false, then we do validation against value which hasn't been set?21:44
wallyworldgive me a minute to read the code21:44
wallyworldthat comment block looks like it's a general statement about what's supported for credential attr entry in the code block below in the entire loop, rather than specifically the line of code just below21:46
wallyworldso the location of the comment is a bit crap21:47
natefinchoh ok, that makes a lot more sense :)21:47
wallyworldsorry21:47
natefinchI'm in that code, so I can add a blank line to make it more obvious21:47
wallyworldty21:47
wallyworldor move it outside the loop or something21:48
natefinchyeah21:48
wallyworldsad when you comment code and then need to explain the comment21:48
natefinchheh21:48
babbageclunkman, I really don't like the code font they've started using on golang.org.21:49
natefinchheh... I use it in my editor21:49
natefinchIt did take a little getting used to, but I stopped noticing it after the first half hour21:51
natefinchbbl21:56
=== natefinch is now known as natefinch-afk
babbageclunkmenn0, thumper, anyone else: review please? https://github.com/juju/juju/pull/664122:10
redirbabbageclunk: looking22:19
babbageclunkredir: thanks!22:19
perrito666axw: ping23:19
alexisbperrito666, I have him occupied atm23:21
* perrito666 imagines axw cutting the lawn on alexisb house23:21
alexisbperrito666, that takes a tractor and several days23:22
menn0thumper: Fix for migration of charms with ~user component: https://github.com/juju/juju/pull/664223:24
* thumper looks while being in a call23:24
babbageclunkmenn0, thumper: can one of you look at https://github.com/juju/juju/pull/664123:39
babbageclunkmenn0, thumper: redir likes it but it could do with some migrationy eyes too23:40
menn0babbageclunk: will look after standup23:41
axwperrito666: cutting lawn?? (I barely cut my own, it's a mess)23:50
perrito666axw: I pay someone to do it because I dont own a big enough machine :(23:50
axwsuch things can be purchased ;)   but here service-to-hardware ratio is probably higher23:51
perrito666yep, over 500USD for the machine and under 15 for the cut23:53
