[00:19] thumper: You around? Looking for some advice on how to tackle a problem.
[00:19] I've got a production 1.25.10 environment in a rather bad way, showing the traditional symptoms of 1587644, plus the broken status updates of 1666396, but with no missing txn-revnos.
[00:19] debug-log is full of https://pastebin.canonical.com/189580/
[00:19] restarting jujud-machine-0, juju-db, and rsyslog has had no effect; it's just constant leadership-tracker spam per above
[00:21] ^ Or anyone else, for that matter :-)
[00:22] blahdeblah: can you see any message about why the leadership-tracker isn't running?
[00:22] Or has that fallen off the bottom of the log?
[00:22] Let me run the debug-log for a while and grep out the noise
[00:23] I got one of these: machine-0: 2017-05-31 00:22:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("5912fa035290d208a719a8cc")
[00:23] Is that a smoking gun for an mgopurge?
[00:25] blahdeblah: I think so (although I'm not an expert).
[00:25] Seems like it from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge
[00:25] I'll try that
[00:30] blahdeblah: yeah
[00:31] blahdeblah: oh, seems like you've sorted it
[00:31] thumper: so mgopurge is the right step?
[00:31] blahdeblah: yes
[00:32] cool - thanks, babbageclunk & thumper
[00:40] Hmmm. This is a rather persistently missing transaction. Same one keeps repeating in the debug log.
[00:41] mgopurge output: https://pastebin.canonical.com/189582/
[00:41] Anything look awry there? ^
[01:04] thumper: ^ When you have a sec
[01:27] ^ A full mgopurge (as opposed to just pruning, per https://github.com/juju/juju/wiki/MgoPurgeTool#pruning) did the trick
[01:35] cool
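For context, the "cannot resume transactions" error quoted above comes from juju's mgo/txn-based transaction runner: it scans the txns collection for half-applied transactions and completes them, and it fails if a document still references a transaction whose txn doc is gone - which is what mgopurge repairs. A minimal sketch of that resume step, assuming direct access to a juju controller's MongoDB (the "juju" database, "txns" collection, and port 37017 are juju's conventional names; real controllers also require authentication, elided here):

    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // ResumeAll walks the txns collection and pushes every pending
        // transaction to completion. A document that references a missing
        // transaction id produces the "cannot find transaction
        // ObjectIdHex(...)" error seen in the debug log, and no amount of
        // restarting jujud will clear it - hence mgopurge.
        runner := txn.NewRunner(session.DB("juju").C("txns"))
        if err := runner.ResumeAll(); err != nil {
            log.Fatal(err)
        }
    }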
[01:54] wallyworld_ or thumper: take a look at https://github.com/juju/juju/pull/7418 please?
[01:54] righto
[01:55] ta
[02:03] babbageclunk: a small test quibble
[02:06] wallyworld_: cool, thanks
[02:06] wallyworld_: I thought you were talking about http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 (obtained 3, expected 4) in the standup ... but now I see a card for TestAgentConnectionsShutDownWhenAPIServerDies. I've fixed the former, did you want me to look at the latter?
[02:07] wallyworld_: oh yeah, good point
[02:07] axw: that card is from a day or 2 ago; i think the test run is in the details of the card. but yeah, any race failure is good to fix
[02:08] okey dokey
=== natefrinch is now known as natefinch
[02:09] can't even spell my own damn name
[02:15] Bug #1494661 changed: Rotated logs are not compressed
[02:15] @thumper you around?
[02:15] natefinch: Error: "thumper" is not a valid command.
[02:15] natefinch: yep
[02:15] damn bots
[02:15] @thumper has this code been reviewed by someone on Juju?
[02:15] natefinch: Error: "thumper" is not a valid command.
[02:16] natefinch: you need to stop using @
[02:16] right sorry, slack
[02:16] creeping in my brain
[02:16] heh
[02:16] I *could*, or perhaps axw might be better suited
[02:16] slack, pah
[02:16] haha
[02:16] natefinch: if you like, I could quickly test as well...
[02:17] mostly wanted to know if I was rubber stamping or if I need to actually be really thorough on this
[02:17] natefinch: is it verification you wanted?
[02:17] sounds like the latter
[02:17] I did look through it when it was initially proposed
[02:17] and it looked fine to me
[02:17] *nod*
[02:17] but I didn't feel like I should comment on the PR
[02:21] it's open source, you can do whatever the hell you like ;)
[02:22] what I'd really like is if most of the new logic were in separate functions that could be unit tested
[02:22] part of that is my own fault for not doing the same when I was writing it
[02:36] * babbageclunk goes for a run
[02:58] axw: is there a bug for the status history pruner failures?
[02:58] thumper: not that I know of
[02:58] * axw searches
[02:59] thumper: no, not that I can see. why do you ask?
[02:59] axw: where did you see that it periodically fails?
[02:59] just thinking there should be a bug
[03:00] thumper: http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811
[03:00] ah ha
[03:00] ok
[03:56] thumper: I'll approve with the mode/chown code added.
[04:01] wallyworld_: coming to tech board?
[07:04] axw: i have to rush off to the football, i was looking for a second set of eyes. live testing seems to work, but a unit test fails - the err is nil rather than not found. the txn ops to remove the constraints are getting queued in the slice, so it's just something dumb i've missed. https://github.com/juju/juju/pull/7421
[07:06] wallyworld_: Wave for the camera - I'll be watching for you. :-)
[07:06] axw: also, i just noticed deploying the second time results in a storage attached hook error, but i can't dig in now, i'll have to look later tonight when i get back
[07:06] will do :-) hope we win
[07:07] axw: maybe send an email if you see something and i'll pick up from there. ttyl
[07:08] wallyworld_: ok np. enjoy
[07:41] jam: hiya
[08:01] rogpeppe: https://github.com/go-retry/retry/pull/1
[08:02] hi rogpeppe, currently digging into some hairy code, I'm guessing you'd like to chat about DNS Cache stuff?
[08:02] jam: yeah
[08:04] wpk: reviewed
[08:18] rogpeppe: updated
[08:18] wpk: thanks!
[08:19] wpk: i'm not sure you've pushed your changes
[08:20] I didn't, sec
[08:20] ready
[08:21] rogpeppe: so, DNSCache stuff. a few thoughts. To start with, realize that my understanding of the MAAS issue is not code that I worked on, but stuff that I heard around the area, so my memory may be inaccurate.
[08:21] rogpeppe: to start with, one major caveat is that code was much more about caching DNS *misses* than about hits
[08:21] which is not what you're focused on
[08:21] I believe the problem worked out as
[08:21] 1) We only ever connected to a single 'address' when doing things like 'juju ssh'
[08:21] jam: yeah, that's my issue too - i'm not entirely sure what the issues were, and i think we've probably lost all the code reviews from then :-\
[08:21] 2) We thought 'hey, hostnames, that should be better', so we started preferring hostnames to IP addresses
[08:22] 3) Then MAAS started giving us hostnames that we can't resolve because they aren't in most users' laptops
[08:22] and we didn't want to continually try to look up hostnames that weren't resolvable, and we *definitely* didn't want to use them as the preferred address for 'ssh' if we were only going to try 1
[08:25] we now do attempt multiple targets
[08:25] I'm not sure why we would want to internally add yet another DNS cache
[08:25] (IIRC Linux defaults to using a local DNS cache anyway)
[08:25] if we're doing something internally where we're connecting repeatedly and DNS lookups are a significant problem, I'm not opposed to them
[08:26] but as always 'cache invalidation' is one of the core problems in programming
[08:26] so avoiding cache when you don't actually need it is often a good plan
[08:26] jam: +1000
[08:26] and balloons would be the person to talk to about the archived review site, I'm fairly sure the content was not deleted, just the site taken down
[08:27] as maintaining those machines (keeping security updates, etc) has a nonzero cost to us
[08:27] jam: yeah, i'm aware of that, but it really is a useful resource
[08:30] jam: hmm, so the ssh issue is one i hadn't thought about
[08:31] rogpeppe: your DNSCache isn't caching negative results, so it doesn't really touch that problem, but AIUI that is why we had the "unvalidated addresses"
[08:31] unresolved
[08:32] jam: so my problem with the unresolved addresses is that it makes it sound like the other ones are resolved, but they're not. the two fields sit in uneasy tension - their responsibilities are unclear
[08:34] jam: the direction i'm trying to head is that one field has the addresses as returned by the controller, and that other fields provide meta-info that records stuff related to the addresses (e.g. their resolved IP address or whether we could resolve the address)
[08:35] jam: the meta fields don't impact correctness and can always be deleted without a problem (except potentially some extra connection time)
[08:35] jam: so... you think that there's not really a problem with slow DNS lookups?
[08:36] jam: if so, why did the original code bother to record the resolved IP addresses at all? it could just have moved addresses that resolve OK to the front of the list.
[08:37] rogpeppe: I think it was a case of "when we can find IP addresses prefer them, because for everything that isn't JAAS they're actually more 'real'" and available everywhere
[08:38] regardless of my personal configuration, etc.
[08:38] I think JAAS throws a wrench into that
[08:38] that comes after that code was landed
[08:38] jam: but IP addresses can change
[08:39] jam: and i still don't really understand. "when we can find IP addresses"... that's the responsibility of DNS, right? why are we doing it ourselves?
[08:40] jam: that is, why is it a good idea to store the resolved addresses in controllers.yaml?
[08:40] so for things like shared (old) environments.yaml files, what DNS servers you could see was often disjoint from what IP addresses you could see
[08:41] so if I had one machine that *was* configured to see MAAS, putting the IP addresses in there meant that I could share it with another machine that *couldn't* see MAAS's DNS
[08:41] but could route to MAAS
[08:42] jam: ha, so the MAAS DNS addresses were only resolvable locally, but the resolved IP addresses worked globally?
[08:43] rogpeppe: so MAAS runs its own DNS server that tracks all of the machines that it is managing
[08:43] you can certainly have a *route* to the MAAS network
[08:43] jam: if that's the case, that's a reasonable argument for maintaining a DNS cache
[08:43] without changing your local DNS to point to MAAS's bind
[08:43] (it's not bind, but whatever it is)
[08:44] (it is bind ;) )
[08:44] jam: so if that's the case, how is the code much more about caching DNS misses?
[08:45] wpk: I thought it was dnsmasq or something like that
[08:45] maybe I'm thinking the DHCP one changed, not DNS
[08:45] I know they rewrote one of the backends
[08:45] well, switched backends
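As an aside, the ordering jam describes - prefer endpoints that actually resolve from the client, and keep but deprioritise ones that don't (e.g. MAAS-internal names) rather than retrying them as the preferred 'juju ssh' target - amounts to something like this hypothetical sketch; the function and names are illustrative, not juju's actual code:

    package main

    import (
        "fmt"
        "net"
    )

    // orderEndpoints splits controller host names into those that resolve
    // from this client and those that don't, so dial attempts can try the
    // resolvable ones first instead of hanging on unreachable names.
    func orderEndpoints(hosts []string) (resolved, unresolved []string) {
        for _, h := range hosts {
            if _, err := net.LookupHost(h); err != nil {
                unresolved = append(unresolved, h)
                continue
            }
            resolved = append(resolved, h)
        }
        return resolved, unresolved
    }

    func main() {
        ok, bad := orderEndpoints([]string{"example.com", "node-3.maas"})
        fmt.Println(append(ok, bad...)) // resolvable endpoints first
    }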
[08:47] jam: btw, when you're done with this could you take a look at https://github.com/juju/juju/pull/7383?
[08:49] wpk: looks like I started, but just didn't finish, will refresh
[08:50] jam: sorry, didn't see you had reviewed ian's branch already... I don't understand why his change would fix anything. maybe you can answer my questions?
[08:53] axw: I'm going off the comments in the sections he touched, but did not try to completely validate the logic myself. It sounded like one of those cases where a TXN can't chain its actions
[08:53] (op2 doesn't see the result of op1, IIRC)
[08:53] at least in terms of all assertions are triggered before all ops
[08:54] 'are checked'
[08:54] it sounded like if there were multiple ways that we might decref the reference counters during teardown, it wouldn't always go to 0.
[08:54] though with a "$inc" of -1 (or a "$dec", if there is one), those operations shouldn't be trying to check the value and set it to one less than it currently is
[08:55] axw: what *I* got out of it was that if you always did the finalization, then you actually end up with 2 finalization calls sometimes and the second would fail
[08:55] so instead he changed it to be "always call it at the end, but avoid calling it early"
[08:56] axw: at least, that was my understanding and why it 'seemed like it would be ok', but I'll admit to not really digging deep into everything.
[08:56] jam: ok. I'm not 100% sure, but I didn't think it would do that because there's no asserts on the ops
[08:57] axw: so double finalize sounds like it could fail
[08:57] cause the doc you're removing doesn't exist
=== marlinc_ is now known as marlinc
[09:06] jam: Remove will succeed even if the doc doesn't exist, unless you assert txn.DocExists (just tested by duplicating the ops). pretty sure the issue is that "isFinal" is not triggering, but I don't know why
[09:08] I started extracting a "transaction builder" for a limited set of State, but the yak hair was growing faster than I could cut it
[09:09] axw: lol
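To make axw's point concrete: with gopkg.in/mgo.v2/txn, a Remove op on an already-missing document still "succeeds" unless the op asserts txn.DocExists, and a refcount decrement is a mongo "$inc" of -1 that never reads the current value - all asserts are checked before any ops apply. A sketch with invented collection and field names, not juju's actual schema:

    package refcount

    import (
        "gopkg.in/mgo.v2/bson"
        "gopkg.in/mgo.v2/txn"
    )

    // removeOps queues a guarded remove plus a refcount decrement.
    func removeOps(app string) []txn.Op {
        return []txn.Op{{
            C:      "applications",
            Id:     app,
            Assert: txn.DocExists, // without this, a duplicate Remove also succeeds
            Remove: true,
        }, {
            C:      "refcounts",
            Id:     app + "#units",
            // "$inc" of -1 doesn't check-and-set; it decrements atomically.
            Update: bson.M{"$inc": bson.M{"refcount": -1}},
        }}
    }

If the txn.DocExists assertion fails, the whole transaction aborts with txn.ErrAborted - which is why a double finalize would error rather than silently pass.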
[09:12] wpk: i just merged your retry PR, thanks!
[09:14] rogpeppe: great, I'll update the juju PR with just the new dependencies
[09:22] rogpeppe: ok, juju PR updated
[09:23] wpk: thanks
[10:21] jam: (sorry, was busy trying to debug juju-run issue...)
[10:21] jam: so, what's the upshot of our discussion?
[10:23] jam: if we still care about copying controllers.yaml files and retaining previously resolved IP addresses, then ISTM that we'll still need some sort of DNS cache
[10:24] jam: but i'm not quite sure whether we need to record DNS failures too
[10:24] jam: currently i can't quite see that it's necessary.
=== MmikeM is now known as Mmike
[10:32] rogpeppe: just saw a test failure in CI for TestWithUnresolvableAddrAfterCacheFallback (http://juju-ci.vapour.ws:8080/job/github-merge-juju/11036/artifact/artifacts/xenial.log/*view*/)
[10:32] I'm logging off shortly, will look tomorrow if you don't get to it
[10:33] axw_: thanks for the heads up
[10:33] axw_: i'll take a look
[10:33] cheers
[11:21] rogpeppe: well, the current way to share controllers is things like 'register', and we're looking to have some other way to share with yourself
[11:21] cause we don't *want* to copy controllers.yaml around manually
[11:25] jam: ok, so perhaps we can lose all the DNS caching stuff. all we really need to do is put the dialed host name at the start of the address list
[11:26] rogpeppe: the only other thing to sanity check is things like 'git blame' to see what commit messages say about things.
[11:26] jam: my current approach would mean that if there's a controller with a host name that resolves to several IP addresses and one of them is down, that the second time it would always try that IP address first
[11:26] jam: unfortunately our commit messages are often pretty crap
[11:26] jam: i really miss having the review history
[11:27] rogpeppe: so with git blame and a small amount of walking, you can find the rev that actually merged the code, which gives you at least the review message
[11:27] jam: --ancestry-path is very useful for that
[11:34] rogpeppe: why would the IP that didn't resolve get chosen first the next time?
[11:34] I also thought we always sort and then move the one we successfully connected to, to the front
[11:34] jam: the IP that *did* resolve would be chosen first next time, sorry
[11:35] jam: we do currently. my plan was to remove the unresolved-api-endpoints field and add a dns-cache field mapping host names to ip addresses
[11:36] jam: when you successfully dial an address, you move that hostname to the start of api-endpoints and the dialed ip address to the start of the dns-cache entry
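rogpeppe's plan above boils down to two move-to-front operations on a successful dial. A hypothetical sketch - the type and field names are illustrative, not the real controllers.yaml schema:

    package main

    import "fmt"

    type controller struct {
        APIEndpoints []string            // host names, as returned by the controller
        DNSCache     map[string][]string // host name -> IPs: meta-info only, safe to delete
    }

    // moveToFront returns xs with x at index 0, preserving the rest.
    func moveToFront(x string, xs []string) []string {
        for i, v := range xs {
            if v == x {
                return append([]string{x}, append(xs[:i:i], xs[i+1:]...)...)
            }
        }
        return append([]string{x}, xs...)
    }

    // dialed records a successful connection to ip for host, so the next
    // attempt tries that hostname, and that IP for it, first.
    func (c *controller) dialed(host, ip string) {
        c.APIEndpoints = moveToFront(host, c.APIEndpoints)
        c.DNSCache[host] = moveToFront(ip, c.DNSCache[host])
    }

    func main() {
        c := &controller{
            APIEndpoints: []string{"a.example.com", "b.example.com"},
            DNSCache:     map[string][]string{"b.example.com": {"10.0.0.1", "10.0.0.2"}},
        }
        c.dialed("b.example.com", "10.0.0.2")
        fmt.Println(c.APIEndpoints, c.DNSCache["b.example.com"])
        // [b.example.com a.example.com] [10.0.0.2 10.0.0.1]
    }

Because the cache is pure meta-info, deleting the dns-cache field costs at most a slower first connection - the property rogpeppe wants from the design.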
[12:36] rogpeppe: could you check https://github.com/juju/juju/pull/7417?
[13:29] wpk: reviewed
[16:09] backup compression has landed in lumberjack FYI.
[16:09] not sure who is on during US work hours anymore
[16:11] hi rick_h marcoceppi rogpeppe alexisb
[16:11] natefinch: yo!
[16:11] heya natefinch
[16:12] howdy :)
[16:13] rogpeppe: how's things in juju land?
[16:13] Howdy natefinch!
[16:13] natefinch: scrumptious as always :)
[16:13] * natefinch waves at everyone
[16:13] haha
[16:14] natefinch: how's the weather up in the Northeast treating ya?
[16:15] rick_h: pretty good. mild most days, barely need heat or A/C.
[16:16] natefinch: awesome, great time of the year
[16:17] thumper wanted backup compression done by today, so it's in. Updating to master of gopkg.in/natefinch/lumberjack.v2 will bring it in. Also tagged it as v2.1 for anyone who might be using something that cares about semantic versioning.
[16:19] natefinch: that's awesome ty much!
[17:49] o/ natefinch
[17:50] hi marcoceppi
[22:10] veebers: so... why does the assess_log_rotation acceptance test require a JUJU_HOME/environments.yaml?
[22:12] thumper: due to how the tests currently set up the environment to bootstrap, we have a named source for credentials, settings, etc. (hence the 'env' argument). the CI tests take that and prepare a JUJU_DATA (the test arg is called JUJU_HOME for historic reasons)
[22:13] I'm not sure what I need to pass it to get it running locally
[22:25] thumper: if you have cloud-city you need: JUJU_HOME= ./
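For completeness, using the backup compression natefinch announced at 16:17 is a one-field change for lumberjack users. A minimal sketch against gopkg.in/natefinch/lumberjack.v2 at the v2.1 tag; the path and sizes here are arbitrary:

    package main

    import (
        "log"

        "gopkg.in/natefinch/lumberjack.v2"
    )

    func main() {
        log.SetOutput(&lumberjack.Logger{
            Filename:   "/var/log/myapp/app.log",
            MaxSize:    100, // megabytes before rotation
            MaxBackups: 3,
            Compress:   true, // gzip rotated backups (new in v2.1)
        })
        log.Println("logging with compressed backups")
    }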