/srv/irclogs.ubuntu.com/2017/05/31/#juju-dev.txt

blahdeblahthumper: You around?  Looking for some advice on how to tackle a problem.00:19
blahdeblahI've got a production 1.25.10 environment in a rather bad way, showing the traditional symptoms of 1587644, plus the broken status updates of 1666396, but with no missing txn-revnos.00:19
blahdeblahdebug-log is full of https://pastebin.canonical.com/189580/00:19
blahdeblahrestarting jujud-machine-0, juju-db, and rsyslog has had no effect; it's just constant leadership-tracker spam per above00:19
blahdeblah^ Or anyone else, for that matter :-)00:21
babbageclunkblahdeblah: can you see any message about why the leadership-tracker isn't running?00:22
babbageclunkOr has that fallen off the bottom of the log?00:22
blahdeblahLet me run the debug-log for a while and grep out the noise00:22
blahdeblahI got one of these: machine-0: 2017-05-31 00:22:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("5912fa035290d208a719a8cc")00:23
blahdeblahIs that a smoking gun for an mgopurge?00:23
babbageclunkblahdeblah: I think so (although I'm not an expert).00:25
blahdeblahSeems like it from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge00:25
blahdeblahI'll try that00:25
thumperblahdeblah: yeah00:30
thumperblahdeblah: oh, seems like you've sorted it00:31
blahdeblahthumper: so mgopurge is the right step?00:31
thumperblahdeblah: yes00:31
blahdeblahcool - thanks, babbageclunk & thumper00:32
blahdeblahHmmm. This is a rather persistently missing transaction.  Same one keeps repeating in the debug log.00:40
blahdeblahmgopurge output: https://pastebin.canonical.com/189582/00:41
blahdeblahAnything look awry there? ^00:41
blahdeblahthumper: ^ When you have a sec01:04
blahdeblah^ Full mgopurge (a.o.t. just pruning per https://github.com/juju/juju/wiki/MgoPurgeTool#pruning) did the trick01:27
thumpercool01:35
babbageclunkwallyworld_ or thumper: take a look at https://github.com/juju/juju/pull/7418 please?01:54
wallyworld_righto01:54
babbageclunkta01:55
wallyworld_babbageclunk: a small test quibble02:03
babbageclunkwallyworld_: cool, thanks02:06
axwwallyworld_: I thought you were talking about http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 (obtained 3, expected 4) in the standup ... but now I see a card for TestAgentConnectionsShutDownWhenAPIServerDies. I've fixed the former, did you want me to look at the latter?02:06
babbageclunkwallyworld_: oh yeah, good point02:07
wallyworld_axw: that card is from a day or 2 ago;  i think the test run is in the details of the card. but yeah, any race failure is good to fix02:07
axwokey dokey02:08
=== natefrinch is now known as natefinch
natefinchcan't even spell my own damn name02:09
mupBug #1494661 changed: Rotated logs are not compressed <canonical-bootstack> <uosci> <juju:In Progress by jsing> <juju-core:Won't Fix> <juju-core 1.25:Won't Fix> <https://launchpad.net/bugs/1494661>02:15
natefinch@thumper you around?02:15
meetingologynatefinch: Error: "thumper" is not a valid command.02:15
thumpernatefinch: yep02:15
thumperdamn bots02:15
natefinch@thumper has this code been reviewed by someone on Juju?02:15
meetingologynatefinch: Error: "thumper" is not a valid command.02:15
thumpernatefinch: you need to stop using @02:16
natefinchright sorry, slack02:16
natefinchcreeping in my brain02:16
natefinchheh02:16
thumperI *could*, or perhaps axw might be better suited02:16
thumperslack, pah02:16
natefinchhaha02:16
thumpernatefinch: if you like, I could quickly test as well...02:16
natefinchmostly wanted to know if I was rubber stamping or if I need to actually be really thorough on this02:17
thumpernatefinch: is it verification you wanted?02:17
natefinchsounds like the latter02:17
thumperI did look through it when it was initially proposed02:17
thumperand it looked fine to me02:17
natefinch*nod*02:17
thumperbut I didn't feel like I should comment on the PR02:17
natefinchit's open source, you can do whatever the hell you like ;)02:21
natefinchwhat I'd really like is if most of the new logic were in separate functions that could be unit tested02:22
natefinchpart of that is my own fault for not doing the same when I was writing it02:22
* babbageclunk goes for a run02:36
thumperaxw: is there a bug for the status history pruner failures?02:58
axwthumper: not that I know of02:58
* axw searches02:58
axwthumper: no not that I can see. why do you ask?02:59
thumperaxw: where did you see that it periodically fails?02:59
thumperjust thinking there should be a bug02:59
axwthumper: http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/281103:00
thumperah ha03:00
thumperok03:00
natefinchthumper:  I'll approve with the mode/chown code added.03:56
axwwallyworld_: coming to tech board?04:01
wallyworld_axw: i have to rush off to the football, i was looking for a second set of eyes. live testing seems to work, but unit test fails - the err is nil rather than not found. the txn ops to remove the constraints are getting queued in the slice, so it's just something dumb i've missed. https://github.com/juju/juju/pull/742107:04
blahdeblahwallyworld_: Wave for the camera - I'll be watching for you. :-)07:06
wallyworld_axw: also, i just noticed deploying the second time results in a storage attached hook error, but i can't dig in now, i'll have to look later tonight when i get back07:06
wallyworld_will do :-) hope we win07:06
wallyworld_axw: maybe send an email if you see something and i'll pick up from there. ttyl07:07
axwwallyworld_: ok np. enjoy07:08
rogpeppejam: hiya07:41
wpkrogpeppe: https://github.com/go-retry/retry/pull/108:01
jamhi rogpeppe, currently digging into some hairy code, I'm guessing you'd like to chat about DNS Cache stuff?08:02
rogpeppejam: yeah08:02
rogpeppewpk: reviewed08:04
wpkrogpeppe: updated08:18
rogpeppewpk: thanks!08:18
rogpeppewpk: i'm not sure you've pushed your changes08:19
wpkI didn't, sec08:20
wpkready08:20
jamrogpeppe: so, DNSCache stuff. a few thoughts. To start with, realize that my understanding of the MAAS issue is not code that I worked on, but stuff that I heard around the area so my memory may be inaccurate.08:21
jamrogpeppe: to start with, one major caveat is that code was much more about caching DNS *misses* than about hits08:21
jamwhich is not what you're focused on08:21
jamI believe the problem worked out as08:21
jam1) We only ever connected to a single 'address' when doing things like 'juju ssh'08:21
rogpeppejam: yeah, that's my issue too - i'm not entirely sure what the issues were, and i think we've probably lost all the code reviews from then :-\08:21
jam2) We thought 'hey, hostnames, that should be better', so we started preferring hostnames to IP addresses08:21
jam3) Then MAAS started giving us hostnames that we can't resolve because they aren't in most users' laptops08:22
jamand we didn't want to continually try to lookup hostnames that weren't resolvable, and we *definitely* didn't want to use them as the preferred address for 'ssh' if we were only going to try 108:22
jamwe now do attempt multiple targets08:25
jamI'm not sure why we would want to internally add yet another DNS cache08:25
jam(IIRC Linux defaults to using a local DNS cache anyway)08:25
jamif we're doing something internally where we're connecting repeatedly and DNS lookups are a significant problem, I'm not opposed to them08:25
jambut as always 'cache invalidation' is one of the core problems in programming08:26
jamso avoiding cache when you don't actually need it is often a good plan08:26
blahdeblahjam: +100008:26
jamand balloons would be the person to talk to about the archived review site, I'm fairly sure the content was not deleted, just the site taken down08:26
jamas maintaining those machines (keeping security updates, etc) has a nonzero cost to us08:27
rogpeppejam: yeah, i'm aware of that, but it really is a useful resource08:27
rogpeppejam: hmm, so the ssh issue is one i hadn't thought about08:30
jamrogpeppe: your DNSCache isn't caching negative results, so it doesn't really touch that problem, but AIUI that is why we had the "unvalidated addresses"08:31
jamunresolved08:31
rogpeppejam: so my problem with the unresolved addresses is that it makes it sound like the other ones are resolved, but they're not. the two fields sit in uneasy tension - their responsibilities are unclear08:32
rogpeppejam: the direction i'm trying to head is that one field has the addresses as returned by the controller, and that other fields provide meta-info that records stuff related to the addresses (e.g. their resolved IP address or whether we could resolve the address)08:34
rogpeppejam: the meta fields don't impact on correctness and can always be deleted without a problem (except potentially some extra connection time)08:35
rogpeppejam: so... you think that there's not really a problem with slow DNS lookups?08:35
rogpeppejam: if so, why did the original code bother to record the resolved IP addresses at all? it could just have moved addresses that resolve OK to the front of the list.08:36
jamrogpeppe: I think it was a case of "when we can find IP addresses prefer them, because for everything that isn't JAAS they're actually more 'real'" and available everywhere08:37
jamregardless of my personal configuration, etc.08:38
jamI think JAAS throws a wrench into that08:38
jam that comes after that code was landed08:38
rogpeppejam: but IP addresses can change08:38
rogpeppejam: and i still don't really understand. "when we can find IP addresses"... that's the responsibility of DNS, right? why are we doing it ourselves?08:39
rogpeppejam: that is, why is it a good idea to store the resolved addresses in controllers.yaml?08:40
jamso for things like shared (old) environments.yaml files, what DNS servers you could see was often disjoint from what IP addresses you could see08:40
jamso if I had one machine that *was* configured to see MAAS, putting the IP addresses in there meant that I could share it with another machine that *couldn't* see MAAS's DNS08:41
jambut could route to MAAS08:41
rogpeppejam: ha, so the MAAS DNS addresses were only resolvable locally, but the resolved IP addresses worked globally?08:42
jamrogpeppe: so MAAS runs its own DNS server that tracks all of the machines that it is managing08:43
jamyou can certainly have a *route* to the MAAS network08:43
rogpeppejam: if that's the case, that's a reasonable argument for maintaining a DNS cache08:43
jamwithout changing your local DNS to point to MAAS's bind08:43
jam(its not bind, but whatever it is)08:43
wpk(it is bind ;) )08:44
rogpeppejam: so if that's the case, how is the code much more about caching DNS misses?08:44
jamwpk: I thought it was dnsmasq or something like that08:45
jammaybe I'm thinking the DHCP one changed, not DNS08:45
jamI know they rewrote one of the backends08:45
jamwell, switched backends08:45
wpkjam: btw, when you're done with this you could take a look at https://github.com/juju/juju/pull/7383 ?08:47
jamwpk: looks like I started, but just didn't finish, will refresh08:49
axwjam: sorry, didn't see you had reviewed ian's branch already... I don't understand why his change would fix anything. maybe you can answer my questions?08:50
jamaxw: I'm going off the comments sections around where he had touched, but did not try to completely validate the logic myself. It sounded like one of those cases where a TXN can't chain its actions08:53
jam(op2 doesn't see the result of op1, IIRC)08:53
jamat least in terms of all-assertions are triggered before all ops08:53
jam'are checked'08:53
jamit sounded like if there were multiple ways that we might decref the reference counters during teardown, it wouldn't always go to 0.08:54
jamthough there is a "$inc" , -1, (or if there is a $dec) those operations shouldn't be trying to check the value and setting it to one less than it currently is08:54
jamaxw: what *I* got out of it, was that if you always did the finalization, then you actually end up with 2 finalization calls sometimes and the second would fail08:55
jamso instead he changed it to be "always call it at the end, but avoid calling it early"08:55
jamaxw: at least, that was my understanding and why it 'seemed like it would be ok', but I'll admit to not really digging deep into everything.08:56
axwjam: ok. I'm not 100% sure, but I didn't think it would do that because there's no asserts on the ops08:56
jamaxw: so double finalize sounds like it could fail08:57
jamcause the doc you're removing doesn't exist08:57
=== marlinc_ is now known as marlinc
axwjam: Remove will succeed even if the doc doesn't exist, unless you assert txn.DocExists (just tested by duplicating the ops). pretty sure the issue is that "isFinal" is not triggering, but I don't know why09:06
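The two claims in the exchange above (all assertions are checked before any op is applied, and a Remove without an existence assertion succeeds on a missing document) can be pictured with a toy in-memory model. This is only a sketch of the behaviour as described in the chat; `op`, `store`, `run` and `errAborted` are hypothetical names and none of this is the mgo/txn API.

```go
package main

import (
	"errors"
	"fmt"
)

// errAborted is a hypothetical stand-in for a failed assertion.
var errAborted = errors.New("transaction aborted")

// op is a simplified, hypothetical operation: optionally assert that the
// document exists, and optionally remove it.
type op struct {
	ID           string
	AssertExists bool
	Remove       bool
}

// store is a toy in-memory collection keyed by document id.
type store map[string]bool

// run checks every assertion against the state before the transaction and
// only then applies the changes, so no assertion sees an earlier op's effect.
func (s store) run(ops []op) error {
	for _, o := range ops {
		if o.AssertExists && !s[o.ID] {
			return errAborted
		}
	}
	for _, o := range ops {
		if o.Remove {
			delete(s, o.ID) // deleting a missing key is a no-op
		}
	}
	return nil
}

func main() {
	s := store{"constraints#1": true}
	dup := []op{
		{ID: "constraints#1", Remove: true},
		{ID: "constraints#1", Remove: true},
	}
	// Duplicated removes with no assertion.
	fmt.Println(s.run(dup))
	// The document is gone now, so an asserted remove aborts.
	fmt.Println(s.run([]op{{ID: "constraints#1", AssertExists: true, Remove: true}}))
}
```

In this model a double finalization only fails when the second set of ops asserts that the document exists, which matches what axw reports from duplicating the ops.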
axwI started extracting a "transaction builder" for a limited set of State, but the yak hair was growing faster than I could cut it09:08
rogpeppeaxw: lol09:09
rogpeppewpk: i just merged your retry PR, thanks!09:12
wpkrogpeppe: great, I'll update juju PR with just new dependencies09:14
wpkrogpeppe: ok, juju PR updated09:22
rogpeppewpk: thanks09:23
rogpeppejam: (sorry, was busy trying to debug juju-run issue...)10:21
rogpeppejam: so, what's the upshot of our discussion?10:21
rogpeppejam: if we still care about copying controllers.yaml files and retaining previously resolved IP addresses, then ISTM that we'll still need some sort of DNS cache10:23
rogpeppejam: but i'm not quite sure whether we need to record DNS failures too10:24
rogpeppejam: currently i can't quite see that it's necessary.10:24
=== MmikeM is now known as Mmike
axw_rogpeppe: just saw a test failure in CI for TestWithUnresolvableAddrAfterCacheFallback (http://juju-ci.vapour.ws:8080/job/github-merge-juju/11036/artifact/artifacts/xenial.log/*view*/)10:32
axw_I'm logging off shortly, will look tomorrow if you don't get to it10:32
rogpeppeaxw_: thanks for the heads up10:33
rogpeppeaxw_: i'll take a look10:33
axw_cheers10:33
jamrogpeppe: well the current way to share controllers is things like 'register' and we're looking to have some other way to share with yourself11:21
jamcause we don't *want* to copy controllers.yaml around manually11:21
rogpeppejam: ok, so perhaps we can lose all the DNS caching stuff. all we really need to do is put the dialed host name at the start of the address list11:25
jamrogpeppe: the only other thing to sanity check is things like 'git blame' to see what commit messages say about things.11:26
rogpeppejam: my current approach would mean that if there's a controller with a host name that resolves to several IP addresses and one of them is down, that the second time it would always try that IP address first11:26
rogpeppejam: unfortunately our commit messages are often pretty crap11:26
rogpeppejam: i really miss having the review history11:26
jamrogpeppe: so with git blame and a small amount of walking, you can find the rev that actually merged the code, which gives you at least the review message11:27
rogpeppejam: --ancestry-path is very useful for that11:27
jamrogpeppe: why would the IP that didn't resolve get chosen first the next time?11:34
jamI also thought we always sort and then move the one we successfully connected to, to the front11:34
rogpeppejam: the IP that *did* resolve would be chosen first next time, sorry11:34
rogpeppejam: we do currently. my plan was to remove the unresolved-api-endpoints field and add a dns-cache field mapping host names to ip addresses11:35
rogpeppejam: when you successfully dial an address, you move that hostname to the start of api-endpoints and the dialed ip address to the start of the dns-cache entry11:36
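The reordering rogpeppe describes could look roughly like the sketch below. It assumes a controllers.yaml entry with an api-endpoints list and a dns-cache map from host name to IP addresses; `controllerInfo`, `recordDial` and `moveToFront` are hypothetical names for illustration, not the juju implementation.

```go
package main

import "fmt"

// moveToFront returns a copy of items with target first and the relative
// order of the other entries preserved. An absent target is added.
func moveToFront(items []string, target string) []string {
	out := make([]string, 0, len(items)+1)
	out = append(out, target)
	for _, it := range items {
		if it != target {
			out = append(out, it)
		}
	}
	return out
}

// controllerInfo is a hypothetical stand-in for a controllers.yaml entry.
type controllerInfo struct {
	APIEndpoints []string            // addresses as returned by the controller
	DNSCache     map[string][]string // host name -> resolved IPs, advisory only
}

// recordDial notes that endpoint (host:port) was dialed successfully via ip.
func (c *controllerInfo) recordDial(endpoint, host, ip string) {
	c.APIEndpoints = moveToFront(c.APIEndpoints, endpoint)
	if c.DNSCache == nil {
		c.DNSCache = make(map[string][]string)
	}
	c.DNSCache[host] = moveToFront(c.DNSCache[host], ip)
}

func main() {
	c := &controllerInfo{
		APIEndpoints: []string{"a.example.com:17070", "b.example.com:17070"},
		DNSCache:     map[string][]string{"b.example.com": {"10.0.0.1", "10.0.0.2"}},
	}
	c.recordDial("b.example.com:17070", "b.example.com", "10.0.0.2")
	fmt.Println(c.APIEndpoints)
	fmt.Println(c.DNSCache["b.example.com"])
}
```

Because the cache only changes the order in which addresses are tried, deleting it would cost some connection time but not correctness, which is the property rogpeppe asks of the meta fields earlier in the discussion.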
wpkrogpeppe: could you check https://github.com/juju/juju/pull/7417 ?12:36
rogpeppewpk: reviewed13:29
natefinchbackup compression has landed in lumberjack FYI.16:09
natefinchnot sure who is on during US work hours anymore16:09
natefinchhi rick_h marcoceppi rogpeppe alexisb16:11
rogpeppenatefinch: yo!16:11
alexisbheya natefinch16:11
natefinchhowdy :)16:12
natefinchrogpeppe: how's things in juju land?16:13
rick_hHowdy natefinch!16:13
rogpeppenatefinch: scrumptious as always :)16:13
* natefinch waves at everyone16:13
natefinchhaha16:13
rick_hnatefinch: how's the weather up in the Northeast treating ya?16:14
natefinchrick_h: pretty good.  mild most days, barely need heat or A/C.16:15
rick_hnatefinch: awesome, great time of the year16:16
natefinchthumper wanted backup compression done by today, so it's in.  Updating to master of gopkg.in/natefinch/lumberjack.v2 will bring it in.  Also tagged it as v2.1 for anyone who might be using something that cares about semantic versioning.16:17
rick_hnatefinch: that's awesome ty much!16:19
marcoceppio/ natefinch17:49
natefinchhi marcoceppi17:50
thumperveebers: so... why does the assess_log_rotation acceptance test require a JUJU_HOME/environments.yaml?22:10
veebersthumper: due to how the tests currently setup the environment to bootstrap, we have a source for credentials and settings etc. which are named (hence 'env' argument). ci-tests take that and prepare a JUJU_DATA (known as JUJU_HOME for historic reasons for the test arg)22:12
thumperI'm not sure what I need to pass it to get it running locally22:13
veebersthumper: if you have cloud-city you need: JUJU_HOME=<path to cloud city> ./<script name> <env name> where env name is parallel-lxd22:25
thumperveebers: ok it is running now...22:33
veebersthumper: cool22:35
thumperbabbageclunk: https://bugs.launchpad.net/bugs/169455922:47
mupBug #1694559: Log forwarding + debug log level = infinite messages <juju:New> <https://launchpad.net/bugs/1694559>22:47
thumperbabbageclunk: is there any way to easily enforce a larger batch size?22:48
thumperlarger minimum that is22:48
babbageclunkthumper: you'd need to change the structure of the code a bit - at the moment it just sends batches as it's handed them.22:49
babbageclunkBut I don't think it'd be especially hard.22:49
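One way to picture the restructuring babbageclunk mentions is a small buffer that holds incoming batches until a minimum number of records has accumulated. This is a sketch under assumptions: `minBatcher`, `add`, `flush` and the `send` callback are hypothetical and do not reflect the actual log forwarding code.

```go
package main

import "fmt"

// minBatcher buffers records and hands them to send only once at least
// min records are waiting. All names here are hypothetical.
type minBatcher struct {
	min  int
	buf  []string
	send func([]string)
}

// add buffers an incoming batch and flushes once the minimum is reached.
func (b *minBatcher) add(batch []string) {
	b.buf = append(b.buf, batch...)
	if len(b.buf) >= b.min {
		b.flush()
	}
}

// flush sends whatever is buffered. A real forwarder would probably also
// call this on a timer so quiet periods do not strand records.
func (b *minBatcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	out := b.buf
	b.buf = nil
	b.send(out)
}

func main() {
	b := &minBatcher{min: 3, send: func(recs []string) {
		fmt.Println(len(recs), recs)
	}}
	b.add([]string{"a"})
	b.add([]string{"b"})
	b.add([]string{"c", "d"}) // reaches the minimum, so all four are sent
	b.flush()                 // nothing left to send
}
```

Sending fewer, larger batches would reduce how many forwarding messages are themselves logged at debug level, though whether that alone addresses the bug is not settled in the discussion.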
thumperwallyworld, babbageclunk: is this bug still accurate? https://bugs.launchpad.net/juju/+bug/164690723:29
mupBug #1646907: gce open-port does not create firewall rules <gce-provider> <network> <open-port> <juju:Triaged> <https://launchpad.net/bugs/1646907>23:29
babbageclunkthumper: don't know, would need to try it out sorry23:30
thumperbabbageclunk: that's ok, I thought it might have been covered by work you did there23:30
thumperwith the firewaller23:30
wallyworldthumper: don't *think* so. there was a lot of cleanup and improvement to that code that i did in the past couple of months, and the bug was from dec23:30
thumperif you don't know, I'll just drop priority and we can address later23:31
wallyworld+123:31
thumperwallyworld: it seems to me that if https://bugs.launchpad.net/juju/+bug/1613823 was still a problem, we'd see many more CI failures for gce23:37
mupBug #1613823: Google Compute Engine IP is ephemeral by default <gce-provider> <juju:Triaged> <https://launchpad.net/bugs/1613823>23:37
thumperthoughts?23:37
anastasiamacthumper: i think u'd see it to be a problem on a longer-running juju... how many CI tests are long-running?23:37
thumperanastasiamac: but this is talking about controller dialing from a client23:38
wallyworldthumper: the IP does change on reboot of a machine, but i didn't think it changed arbitrarily during use23:38
thumperso if the controller reboots... nothing can talk to it?23:39
thumperthat seems terrible23:39
wallyworldyeah, i think that may be the case23:39
wallyworldi haven't tested fully myself23:39
wallyworldbut it does seem an issue23:40
wallyworldwe should look at for 2.323:40
