[00:19] <blahdeblah> thumper: You around?  Looking for some advice on how to tackle a problem.
[00:19] <blahdeblah> I've got a production 1.25.10 environment in a rather bad way, showing the traditional symptoms of 1587644, plus the broken status updates of 1666396, but with no missing txn-revnos.
[00:19] <blahdeblah> debug-log is full of https://pastebin.canonical.com/189580/
[00:19] <blahdeblah> restarting jujud-machine-0, juju-db, and rsyslog has had no effect; it's just constant leadership-tracker spam per above
[00:21] <blahdeblah> ^ Or anyone else, for that matter :-)
[00:22] <babbageclunk> blahdeblah: can you see any message about why the leadership-tracker isn't running?
[00:22] <babbageclunk> Or has that fallen off the bottom of the log?
[00:22] <blahdeblah> Let me run the debug-log for a while and grep out the noise
[00:23] <blahdeblah> I got one of these: machine-0: 2017-05-31 00:22:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("5912fa035290d208a719a8cc")
[00:23] <blahdeblah> Is that a smoking gun for an mgopurge?
[00:25] <babbageclunk> blahdeblah: I think so (although I'm not an expert).
[00:25] <blahdeblah> Seems like it from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge
[00:25] <blahdeblah> I'll try that
[00:30] <thumper> blahdeblah: yeah
[00:31] <thumper> blahdeblah: oh, seems like you've sorted it
[00:31] <blahdeblah> thumper: so mgopurge is the right step?
[00:31] <thumper> blahdeblah: yes
[00:32] <blahdeblah> cool - thanks, babbageclunk & thumper
[00:40] <blahdeblah> Hmmm. This is a rather persistently missing transaction.  Same one keeps repeating in the debug log.
[00:41] <blahdeblah> mgopurge output: https://pastebin.canonical.com/189582/
[00:41] <blahdeblah> Anything look awry there? ^
[01:04] <blahdeblah> thumper: ^ When you have a sec
[01:27] <blahdeblah> ^ Full mgopurge (a.o.t. just pruning per https://github.com/juju/juju/wiki/MgoPurgeTool#pruning) did the trick
[01:35] <thumper> cool
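For context, what mgopurge repairs here is a document whose txn-queue still references a transaction id that no longer exists in the txns collection, which is what the resumer error above is complaining about. A minimal sketch of checking for that by hand with mgo follows; the dial address is illustrative only, and a real controller's juju-db also needs the TLS and auth settings that mgopurge handles for you.

    package main

    import (
    	"fmt"
    	"log"

    	"gopkg.in/mgo.v2"
    	"gopkg.in/mgo.v2/bson"
    )

    func main() {
    	// Illustrative only: a real juju-db requires TLS and admin credentials.
    	session, err := mgo.Dial("localhost:37017")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer session.Close()

    	// The transaction id from the resumer error above.
    	id := bson.ObjectIdHex("5912fa035290d208a719a8cc")

    	n, err := session.DB("juju").C("txns").FindId(id).Count()
    	if err != nil {
    		log.Fatal(err)
    	}
    	if n == 0 {
    		fmt.Println("txn referenced but missing from txns - a candidate for mgopurge")
    	}
    }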
[01:54] <babbageclunk> wallyworld_ or thumper: take a look at https://github.com/juju/juju/pull/7418 please?
[01:54] <wallyworld_> righto
[01:55] <babbageclunk> ta
[02:03] <wallyworld_> babbageclunk: a small test quibble
[02:06] <babbageclunk> wallyworld_: cool, thanks
[02:06] <axw> wallyworld_: I thought you were talking about http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 (obtained 3, expected 4) in the standup ... but now I see a card for TestAgentConnectionsShutDownWhenAPIServerDies. I've fixed the former, did you want me to look at the latter?
[02:07] <babbageclunk> wallyworld_: oh yeah, good point
[02:07] <wallyworld_> axw: that card is from a day or 2 ago;  i think the test run is in the details of the card. but yeah, any race failure is good to fix
[02:08] <axw> okey dokey
[02:09] <natefinch> can't even spell my own damn name
[02:15] <mup> Bug #1494661 changed: Rotated logs are not compressed <canonical-bootstack> <uosci> <juju:In Progress by jsing> <juju-core:Won't Fix> <juju-core 1.25:Won't Fix> <https://launchpad.net/bugs/1494661>
[02:15] <natefinch> @thumper you around?
[02:15] <meetingology> natefinch: Error: "thumper" is not a valid command.
[02:15] <thumper> natefinch: yep
[02:15] <thumper> damn bots
[02:15] <natefinch> @thumper has this code been reviewed by someone on Juju?
[02:15] <meetingology> natefinch: Error: "thumper" is not a valid command.
[02:16] <thumper> natefinch: you need to stop using @
[02:16] <natefinch> right sorry, slack
[02:16] <natefinch> creeping in my brain
[02:16] <natefinch> heh
[02:16] <thumper> I *could*, or perhaps axw might be better suited
[02:16] <thumper> slack, pah
[02:16] <natefinch> haha
[02:16] <thumper> natefinch: if you like, I could quickly test as well...
[02:17] <natefinch> mostly wanted to know if I was rubber stamping or if I need to actually be really thorough on this
[02:17] <thumper> natefinch: is it verification you wanted?
[02:17] <natefinch> sounds like the latter
[02:17] <thumper> I did look through it when it was initially proposed
[02:17] <thumper> and it looked fine to me
[02:17] <natefinch> *nod*
[02:17] <thumper> but I didn't feel like I should comment on the PR
[02:21] <natefinch> it's open source, you can do whatever the hell you like ;)
[02:22] <natefinch> what I'd really like is if most of the new logic were in separate functions that could be unit tested
[02:22] <natefinch> part of that is my own fault for not doing the same when I was writing it
[02:36]  * babbageclunk goes for a run
[02:58] <thumper> axw: is there a bug for the status history pruner failures?
[02:58] <axw> thumper: not that I know of
[02:58]  * axw searches
[02:59] <axw> thumper: no not that I can see. why do you ask?
[02:59] <thumper> axw: where did you see that it periodically fails?
[02:59] <thumper> just thinking there should be a bug
[03:00] <axw> thumper: http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811
[03:00] <thumper> ah ha
[03:00] <thumper> ok
[03:56] <natefinch> thumper:  I'll approve with the mode/chown code added.
[04:01] <axw> wallyworld_: coming to tech board?
[07:04] <wallyworld_> axw: i have to rush off to the football, i was looking for a second set of eyes. live testing seems to work, but unit test fails - the err is nil rather than not found. the txn ops to remove the constraints are getting queued in the slice, so it's just something dumb i've missed. https://github.com/juju/juju/pull/7421
[07:06] <blahdeblah> wallyworld_: Wave for the camera - I'll be watching for you. :-)
[07:06] <wallyworld_> axw: also, i just noticed deploying the second time results in a storage attached hook error, but i can't dig in now, i'll have to look later tonight when i get back
[07:06] <wallyworld_> will do :-) hope we win
[07:07] <wallyworld_> axw: maybe send an email if you see something and i'll pick up from there. ttyl
[07:08] <axw> wallyworld_: ok np. enjoy
[07:41] <rogpeppe> jam: hiya
[08:01] <wpk> rogpeppe: https://github.com/go-retry/retry/pull/1
[08:02] <jam> hi rogpeppe, currently digging into some hairy code, I'm guessing you'd like to chat about DNS Cache stuff?
[08:02] <rogpeppe> jam: yeah
[08:04] <rogpeppe> wpk: reviewed
[08:18] <wpk> rogpeppe: updated
[08:18] <rogpeppe> wpk: thanks!
[08:19] <rogpeppe> wpk: i'm not sure you've pushed your changes
[08:20] <wpk> I didn't, sec
[08:20] <wpk> ready
[08:21] <jam> rogpeppe: so, DNSCache stuff. a few thoughts. To start with, realize that my understanding of the MAAS issue is not code that I worked on, but stuff that I heard around the area so my memory may be inaccurate.
[08:21] <jam> rogpeppe: to start with, one major caveat is that code was much more about caching DNS *misses* than about hits
[08:21] <jam> which is not what you're focused on
[08:21] <jam> I believe the problem worked out as
[08:21] <jam> 1) We only ever connected to a single 'address' when doing things like 'juju ssh'
[08:21] <rogpeppe> jam: yeah, that's my issue too - i'm not entirely sure what the issues were, and i think we've probably lost all the code reviews from then :-\
[08:21] <jam> 2) We thought 'hey, hostnames, that should be better', so we started preferring hostnames to IP addresses
[08:22] <jam> 3) Then MAAS started giving us hostnames that we can't resolve because they aren't in most users laptops
[08:22] <jam> and we didn't want to continually try to lookup hostnames that weren't resolvable, and we *definitely* didn't want to use them as the preferred address for 'ssh' if we were only going to try 1
[08:25] <jam> we now do attempt multiple targets
[08:25] <jam> I'm not sure why we would want to internally add yet another DNS cache
[08:25] <jam> (IIRC Linux defaults to using a local DNS cache anyway)
[08:25] <jam> if we're doing something internally where we're connecting repeatedly and DNS lookups are a significant problem, I'm not opposed to them
[08:26] <jam> but as always 'cache invalidation' is one of the core problems in programming
[08:26] <jam> so avoiding cache when you don't actually need it is often a good plan
[08:26] <blahdeblah> jam: +1000
[08:26] <jam> and balloons would be the person to talk to about the archived review site, I'm fairly sure the content was not deleted, just the site taken down
[08:27] <jam> as maintaining those machines (keeping security updates, etc) has a nonzero cost to us
[08:27] <rogpeppe> jam: yeah, i'm aware of that, but it really is a useful resource
[08:30] <rogpeppe> jam: hmm, so the ssh issue is one i hadn't thought about
[08:31] <jam> rogpeppe: your DNSCache isn't caching negative results, so it doesn't really touch that problem, but AIUI that is why we had the "unvalidated addresses"
[08:31] <jam> unresolved
[08:32] <rogpeppe> jam: so my problem with the unresolved addresses is that it makes it sound like the other ones are resolved, but they're not. the two fields sit in uneasy tension - their responsibilities are unclear
[08:34] <rogpeppe> jam: the direction i'm trying to head is that one field has the addresses as returned by the controller, and that other fields provide meta-info that records stuff related to the addresses (e.g. their resolved IP address or whether we could resolve the address)
[08:35] <rogpeppe> jam: the meta fields don't impact on correctness and can always be deleted without a problem (except potentially some extra connection time)
[08:35] <rogpeppe> jam: so... you think that there's not really a problem with slow DNS lookups?
[08:36] <rogpeppe> jam: if so, why did the original code bother to record the resolved IP addresses at all? it could just have moved addresses that resolve OK to the front of the list.
[08:37] <jam> rogpeppe: I think it was a case of "when we can find IP addresses prefer them, because for everything that isn't JAAS they're actually more 'real'" and available everywhere
[08:38] <jam> regardless of my personal configuration, etc.
[08:38] <jam> I think JAAS throws a wrench into that
[08:38] <jam>  that comes after that code was landed
[08:38] <rogpeppe> jam: but IP addresses can change
[08:39] <rogpeppe> jam: and i still don't really understand. "when we can find IP addresses"... that's the responsibility of DNS, right? why are we doing it ourselves?
[08:40] <rogpeppe> jam: that is, why is it a good idea to store the resolved addresses in controllers.yaml?
[08:40] <jam> so for things like shared (old) environments.yaml files, what DNS servers you could see was often disjoint from what IP addresses you could see
[08:41] <jam> so if I had one machine that *was* configured to see MAAS, putting the IP addresses in there meant that I could share it with another machine that *couldn't* see MAAS's DNS
[08:41] <jam> but could route to MAAS
[08:42] <rogpeppe> jam: ha, so the MAAS DNS addresses were only resolvable locally, but the resolved IP addresses worked globally?
[08:43] <jam> rogpeppe: so MAAS runs its own DNS server that tracks all of the machines that it is managing
[08:43] <jam> you can certainly have a *route* to the MAAS network
[08:43] <rogpeppe> jam: if that's the case, that's a reasonable argument for maintaining a DNS cache
[08:43] <jam> without changing your local DNS to point to MAAS's bind
[08:43] <jam> (it's not bind, but whatever it is)
[08:44] <wpk> (it is bind ;) )
[08:44] <rogpeppe> jam: so if that's the case, how is the code much more about caching DNS misses?
[08:45] <jam> wpk: I thought it was dnsmasq or something like that
[08:45] <jam> maybe I'm thinking the DHCP one changed, not DNS
[08:45] <jam> I know they rewrote one of the backends
[08:45] <jam> well, switched backends
[08:47] <wpk> jam: btw, when you're done with this you could take a look at https://github.com/juju/juju/pull/7383 ?
[08:49] <jam> wpk: looks like I started, but just didn't finish, will refresh
[08:50] <axw> jam: sorry, didn't see you had reviewed ian's branch already... I don't understand why his change would fix anything. maybe you can answer my questions?
[08:53] <jam> axw: I'm going off the comments sections around where he had touched, but did not try to completely validate the logic myself. It sounded like one of those cases where a TXN can't chain its actions
[08:53] <jam> (op2 doesn't see the result of op1, IIRC)
[08:53] <jam> at least in terms of all-assertions are triggered before all ops
[08:53] <jam> 'are checked'
[08:54] <jam> it sounded like if there were multiple ways that we might decref the reference counters during teardown, it wouldn't always go to 0.
[08:54] <jam> though there is a "$inc", -1 (or a "$dec" if there is one); those operations shouldn't be trying to check the value and set it to one less than it currently is
[08:55] <jam> axw: what *I* got out of it, was that if you always did the finalization, then you actually end up with 2 finalization calls sometimes and the second would fail
[08:55] <jam> so instead he changed it to be "always call it at the end, but avoid calling it early"
[08:56] <jam> axw: at least, that was my understanding and why it 'seemed like it would be ok', but I'll admit to not really digging deep into everything.
[08:56] <axw> jam: ok. I'm not 100% sure, but I didn't think it would do that because there's no asserts on the ops
[08:57] <jam> axw: so double finalize sounds like it could fail
[08:57] <jam> cause the doc you're removing doesn't exist
[09:06] <axw> jam: Remove will succeed even if the doc doesn't exist, unless you assert txn.DocExists (just tested by duplicating the ops). pretty sure the issue is that "isFinal" is not triggering, but I don't know why
[09:08] <axw> I started extracting a "transaction builder" for a limited set of State, but the yak hair was growing faster than I could cut it
[09:09] <rogpeppe> axw: lol
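For reference, the two mgo/txn behaviours being discussed look roughly like this. The collection, field, and function names are invented for illustration and are not the actual code from the PR:

    package example

    import (
    	"gopkg.in/mgo.v2"
    	"gopkg.in/mgo.v2/bson"
    	"gopkg.in/mgo.v2/txn"
    )

    // decRefAndMaybeRemove decrements a hypothetical refcount doc and removes the
    // application doc. All Asserts are checked against the state *before* any op
    // applies, so op2 never sees op1's result within the same transaction.
    func decRefAndMaybeRemove(db *mgo.Database, app string) error {
    	ops := []txn.Op{{
    		C:      "refcounts",
    		Id:     app + "#refs",
    		Assert: bson.M{"refcount": bson.M{"$gt": 0}},
    		// Mongo has no "$dec"; you decrement with "$inc" of -1, and the op
    		// doesn't need to read the value and set it to one less itself.
    		Update: bson.M{"$inc": bson.M{"refcount": -1}},
    	}, {
    		C:  "applications",
    		Id: app,
    		// No Assert (e.g. txn.DocExists) here, so the Remove succeeds even if
    		// another cleanup path already removed the doc, as axw notes above.
    		Remove: true,
    	}}
    	return txn.NewRunner(db.C("txns")).Run(ops, "", nil)
    }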
[09:12] <rogpeppe> wpk: i just merged your retry PR, thanks!
[09:14] <wpk> rogpeppe: great, I'll update juju PR with just new dependencies
[09:22] <wpk> rogpeppe: ok, juju PR updated
[09:23] <rogpeppe> wpk: thanks
[10:21] <rogpeppe> jam: (sorry, was busy trying to debug juju-run issue...)
[10:21] <rogpeppe> jam: so, what's the upshot of our discussion?
[10:23] <rogpeppe> jam: if we still care about copying controllers.yaml files and retaining previously resolved IP addresses, then ISTM that we'll still need some sort of DNS cache
[10:24] <rogpeppe> jam: but i'm not quite sure whether we need to record DNS failures too
[10:24] <rogpeppe> jam: currently i can't quite see that it's necessary.
[10:32] <axw_> rogpeppe: just saw a test failure in CI for TestWithUnresolvableAddrAfterCacheFallback (http://juju-ci.vapour.ws:8080/job/github-merge-juju/11036/artifact/artifacts/xenial.log/*view*/)
[10:32] <axw_> I'm logging off shortly, will look tomorrow if you don't get to it
[10:33] <rogpeppe> axw_: thanks for the heads up
[10:33] <rogpeppe> axw_: i'll take a look
[10:33] <axw_> cheers
[11:21] <jam> rogpeppe: well the current way to share controllers is things like 'register' and we're looking to have some other way to share with yourself
[11:21] <jam> cause we don't *want* to copy controllers.yaml around manually
[11:25] <rogpeppe> jam: ok, so perhaps we can lose all the DNS caching stuff. all we really need to do is put the dialed host name at the start of the address list
[11:26] <jam> rogpeppe: the only other thing to sanity check is things like 'git blame' to see what commit messages say about things.
[11:26] <rogpeppe> jam: my current approach would mean that if there's a controller with a host name that resolves to several IP addresses and one of them is down, that the second time it would always try that IP address first
[11:26] <rogpeppe> jam: unfortunately our commit messages are often pretty crap
[11:26] <rogpeppe> jam: i really miss having the review history
[11:27] <jam> rogpeppe: so with git blame and a small amount of walking, you can find the rev that actually merged the code, which gives you at least the review message
[11:27] <rogpeppe> jam: --ancestry-path is very useful for that
[11:34] <jam> rogpeppe: why would the IP that didn't resolve get chosen first the next time?
[11:34] <jam> I also thought we always sort and then move the one we successfully connected to, to the front
[11:34] <rogpeppe> jam: the IP that *did* resolve would be chosen first next time, sorry
[11:35] <rogpeppe> jam: we do currently. my plan was to remove the unresolved-api-endpoints field and add a dns-cache field mapping host names to ip addresses
[11:36] <rogpeppe> jam: when you successfully dial an address, you move that hostname to the start of api-endpoints and the dialed ip address to the start of the dns-cache entry
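A hypothetical sketch of that bookkeeping; the field and function names are invented for illustration and are not juju's real jujuclient types:

    package example

    // controllerDetails mirrors the proposed controllers.yaml layout: the
    // addresses as returned by the controller, plus a dns-cache of meta-info
    // that can always be deleted without affecting correctness.
    type controllerDetails struct {
    	APIEndpoints []string            `yaml:"api-endpoints"`
    	DNSCache     map[string][]string `yaml:"dns-cache,omitempty"`
    }

    // noteSuccessfulDial moves the dialed host name to the start of
    // api-endpoints and the dialed IP to the start of that host's dns-cache
    // entry, so the next connection attempt tries the known-good pair first.
    func noteSuccessfulDial(c *controllerDetails, hostName, ip string) {
    	c.APIEndpoints = moveToFront(c.APIEndpoints, hostName)
    	if c.DNSCache == nil {
    		c.DNSCache = make(map[string][]string)
    	}
    	c.DNSCache[hostName] = moveToFront(c.DNSCache[hostName], ip)
    }

    func moveToFront(list []string, v string) []string {
    	out := []string{v}
    	for _, s := range list {
    		if s != v {
    			out = append(out, s)
    		}
    	}
    	return out
    }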
[12:36] <wpk> rogpeppe: could you check https://github.com/juju/juju/pull/7417 ?
[13:29] <rogpeppe> wpk: reviewed
[16:09] <natefinch> backup compression has landed in lumberjack FYI.
[16:09] <natefinch> not sure who is on during US work hours anymore
[16:11] <natefinch> hi rick_h marcoceppi rogpeppe alexisb
[16:11] <rogpeppe> natefinch: yo!
[16:11] <alexisb> heya natefinch
[16:12] <natefinch> howdy :)
[16:13] <natefinch> rogpeppe: how's things in juju land?
[16:13] <rick_h> Howdy natefinch!
[16:13] <rogpeppe> natefinch: scrumptious as always :)
[16:13]  * natefinch waves at everyone
[16:13] <natefinch> haha
[16:14] <rick_h> natefinch: how's the weather up in the Northeast treating ya?
[16:15] <natefinch> rick_h: pretty good.  mild most days, barely need heat or A/C.
[16:16] <rick_h> natefinch: awesome, great time of the year
[16:17] <natefinch> thumper wanted backup compression done by today, so it's in.  Updating to master of gopkg.in/natefinch/lumberjack.v2 will bring it in.  Also tagged it as v2.1 for anyone who might be using something that cares about semantic versioning.
[16:19] <rick_h> natefinch: that's awesome ty much!
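Roughly what picking up the new compression looks like once a caller is on lumberjack.v2 master / v2.1; the filename and sizes are illustrative:

    package main

    import (
    	"log"

    	"gopkg.in/natefinch/lumberjack.v2"
    )

    func main() {
    	log.SetOutput(&lumberjack.Logger{
    		Filename:   "/var/log/juju/machine-0.log", // illustrative path
    		MaxSize:    300, // megabytes before rotation
    		MaxBackups: 2,
    		Compress:   true, // gzip rotated backups (the new v2.1 behaviour)
    	})
    	log.Println("logging with compressed rotation")
    }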
[17:49] <marcoceppi> o/ natefinch
[17:50] <natefinch> hi marcoceppi
[22:10] <thumper> veebers: so... why does the assess_log_rotation acceptance test require a JUJU_HOME/environments.yaml?
[22:12] <veebers> thumper: due to how the tests currently set up the environment to bootstrap, we have a source for credentials and settings etc. which are named (hence 'env' argument). ci-tests take that and prepare a JUJU_DATA (known as JUJU_HOME for historic reasons for the test arg)
[22:13] <thumper> I'm not sure what I need to pass it to get it running locally
[22:25] <veebers> thumper: if you have cloud-city you need: JUJU_HOME=<path to cloud city> ./<script name> <env name> where env name is parallel-lxd
[22:33] <thumper> veebers: ok it is running now...
[22:35] <veebers> thumper: cool
[22:47] <thumper> babbageclunk: https://bugs.launchpad.net/bugs/1694559
[22:47] <mup> Bug #1694559: Log forwarding + debug log level = infinite messages <juju:New> <https://launchpad.net/bugs/1694559>
[22:48] <thumper> babbageclunk: is there any way to easily enforce a larger batch size?
[22:48] <thumper> larger minimum that is
[22:49] <babbageclunk> thumper: you'd need to change the structure of the code a bit - at the moment it just sends batches as it's handed them.
[22:49] <babbageclunk> But I don't think it'd be especially hard.
[23:29] <thumper> wallyworld, babbageclunk: is this bug still accurate? https://bugs.launchpad.net/juju/+bug/1646907
[23:29] <mup> Bug #1646907: gce open-port does not create firewall rules <gce-provider> <network> <open-port> <juju:Triaged> <https://launchpad.net/bugs/1646907>
[23:30] <babbageclunk> thumper: don't know, would need to try it out sorry
[23:30] <thumper> babbageclunk: that's ok, I thought it might have been covered by work you did there
[23:30] <thumper> with the firewaller
[23:30] <wallyworld> thumper: don't *think* so. there was a lot of cleanup and improvement to that code that i did in the past couple of months, and the bug was from dec
[23:31] <thumper> if you don't know, I'll just drop priority and we can address later
[23:31] <wallyworld> +1
[23:37] <thumper> wallyworld: it seems to me that if https://bugs.launchpad.net/juju/+bug/1613823 was still a problem, we'd see many more CI failures for gce
[23:37] <mup> Bug #1613823: Google Compute Engine IP is ephemeral by default <gce-provider> <juju:Triaged> <https://launchpad.net/bugs/1613823>
[23:37] <thumper> thoughts?
[23:37] <anastasiamac> thumper: i think u'd see it to be a problem on a longer-running juju... how many CI tests are long-running?
[23:38] <thumper> anastasiamac: but this is talking about controller dialing from a client
[23:38] <wallyworld> thumper: the IP does change on reboot of a machine, but i didn't think it changed arbitrarily during use
[23:39] <thumper> so if the controller reboots... nothing can talk to it?
[23:39] <thumper> that seems terrible
[23:39] <wallyworld> yeah, i think that may be the case
[23:39] <wallyworld> i haven't tested fully myself
[23:40] <wallyworld> but it does seem an issue
[23:40] <wallyworld> we should look at for 2.3