blahdeblah | thumper: You around? Looking for some advice on how to tackle a problem. | 00:19 |
---|---|---|
blahdeblah | I've got a production 1.25.10 environment in a rather bad way, showing the traditional symptoms of 1587644, plus the broken status updates of 1666396, but with no missing txn-revnos. | 00:19 |
blahdeblah | debug-log is full of https://pastebin.canonical.com/189580/ | 00:19 |
blahdeblah | restarting jujud-machine-0, juju-db, and rsyslog has had no effect; it's just constant leadership-tracker spam per above | 00:19 |
blahdeblah | ^ Or anyone else, for that matter :-) | 00:21 |
babbageclunk | blahdeblah: can you see any message about why the leadership-tracker isn't running? | 00:22 |
babbageclunk | Or has that fallen off the bottom of the log? | 00:22 |
blahdeblah | Let me run the debug-log for a while and grep out the noise | 00:22 |
blahdeblah | I got one of these: machine-0: 2017-05-31 00:22:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("5912fa035290d208a719a8cc") | 00:23 |
blahdeblah | Is that a smoking gun for an mgopurge? | 00:23 |
babbageclunk | blahdeblah: I think so (although I'm not an expert). | 00:25 |
blahdeblah | Seems like it from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge | 00:25 |
blahdeblah | I'll try that | 00:25 |
thumper | blahdeblah: yeah | 00:30 |
thumper | blahdeblah: oh, seems like you've sorted it | 00:31 |
blahdeblah | thumper: so mgopurge is the right step? | 00:31 |
thumper | blahdeblah: yes | 00:31 |
blahdeblah | cool - thanks, babbageclunk & thumper | 00:32 |
blahdeblah | Hmmm. This is a rather persistently missing transaction. Same one keeps repeating in the debug log. | 00:40 |
blahdeblah | mgopurge output: https://pastebin.canonical.com/189582/ | 00:41 |
blahdeblah | Anything look awry there? ^ | 00:41 |
blahdeblah | thumper: ^ When you have a sec | 01:04 |
blahdeblah | ^ Full mgopurge (a.o.t. just pruning per https://github.com/juju/juju/wiki/MgoPurgeTool#pruning) did the trick | 01:27 |
thumper | cool | 01:35 |
babbageclunk | wallyworld_ or thumper: take a look at https://github.com/juju/juju/pull/7418 please? | 01:54 |
wallyworld_ | righto | 01:54 |
babbageclunk | ta | 01:55 |
wallyworld_ | babbageclunk: a small test quibble | 02:03 |
babbageclunk | wallyworld_: cool, thanks | 02:06 |
axw | wallyworld_: I thought you were talking about http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 (obtained 3, expected 4) in the standup ... but now I see a card for TestAgentConnectionsShutDownWhenAPIServerDies. I've fixed the former, did you want me to look at the latter? | 02:06 |
babbageclunk | wallyworld_: oh yeah, good point | 02:07 |
wallyworld_ | axw: that card is from a day or 2 ago; i think the test run is in the details of the card. but yeah, any race failure is good to fix | 02:07 |
axw | okey dokey | 02:08 |
=== natefrinch is now known as natefinch | ||
natefinch | can't even spell my own damn name | 02:09 |
mup | Bug #1494661 changed: Rotated logs are not compressed <canonical-bootstack> <uosci> <juju:In Progress by jsing> <juju-core:Won't Fix> <juju-core 1.25:Won't Fix> <https://launchpad.net/bugs/1494661> | 02:15 |
natefinch | @thumper you around? | 02:15 |
meetingology | natefinch: Error: "thumper" is not a valid command. | 02:15 |
thumper | natefinch: yep | 02:15 |
thumper | damn bots | 02:15 |
natefinch | @thumper has this code been reviewed by someone on Juju? | 02:15 |
meetingology | natefinch: Error: "thumper" is not a valid command. | 02:15 |
thumper | natefinch: you need to stop using @ | 02:16 |
natefinch | right sorry, slack | 02:16 |
natefinch | creeping in my brain | 02:16 |
natefinch | heh | 02:16 |
thumper | I *could*, or perhaps axw might be better suited | 02:16 |
thumper | slack, pah | 02:16 |
natefinch | haha | 02:16 |
thumper | natefinch: if you like, I could quickly test as well... | 02:16 |
natefinch | mostly wanted to know if I was rubber stamping or if I need to actually be really thorough on this | 02:17 |
thumper | natefinch: is it verification you wanted? | 02:17 |
natefinch | sounds like the latter | 02:17 |
thumper | I did look through it when it was initially proposed | 02:17 |
thumper | and it looked fine to me | 02:17 |
natefinch | *nod* | 02:17 |
thumper | but I didn't feel like I should comment on the PR | 02:17 |
natefinch | it's open source, you can do whatever the hell you like ;) | 02:21 |
natefinch | what I'd really like is if most of the new logic were in separate functions that could be unit tested | 02:22 |
natefinch | part of that is my own fault for not doing the same when I was writing it | 02:22 |
* babbageclunk goes for a run | 02:36 | |
thumper | axw: is there a bug for the status history pruner failures? | 02:58 |
axw | thumper: not that I know of | 02:58 |
* axw searches | 02:58 | |
axw | thumper: no not that I can see. why do you ask? | 02:59 |
thumper | axw: where did you see that it periodically fails? | 02:59 |
thumper | just thinking there should be a bug | 02:59 |
axw | thumper: http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 | 03:00 |
thumper | ah ha | 03:00 |
thumper | ok | 03:00 |
natefinch | thumper: I'll approve with the mode/chown code added. | 03:56 |
axw | wallyworld_: coming to tech board? | 04:01 |
wallyworld_ | axw: i have to rush off to the football, i was looking for a second set of eyes. live testing seems to work, but unit test fails - the err is nil rather than not found. the txn ops to remove the constraints are getting queued in the slice, so it's just something dumb i've missed. https://github.com/juju/juju/pull/7421 | 07:04 |
blahdeblah | wallyworld_: Wave for the camera - I'll be watching for you. :-) | 07:06 |
wallyworld_ | axw: also, i just noticed deploying the second time results in a storage attached hook error, but i can't dig in now, i'll have to look later tonight when i get back | 07:06 |
wallyworld_ | will do :-) hope we win | 07:06 |
wallyworld_ | axw: maybe send an email if you see something and i'll pick up from there. ttyl | 07:07 |
axw | wallyworld_: ok np. enjoy | 07:08 |
rogpeppe | jam: hiya | 07:41 |
wpk | rogpeppe: https://github.com/go-retry/retry/pull/1 | 08:01 |
jam | hi rogpeppe, currently digging into some hairy code, I'm guessing you'd like to chat about DNS Cache stuff? | 08:02 |
rogpeppe | jam: yeah | 08:02 |
rogpeppe | wpk: reviewed | 08:04 |
wpk | rogpeppe: updated | 08:18 |
rogpeppe | wpk: thanks! | 08:18 |
rogpeppe | wpk: i'm not sure you've pushed your changes | 08:19 |
wpk | I didn't, sec | 08:20 |
wpk | ready | 08:20 |
jam | rogpeppe: so, DNSCache stuff. a few thoughts. To start with, realize that my understanding of the MAAS issue is not code that I worked on, but stuff that I heard around the area so my memory may be inaccurate. | 08:21 |
jam | rogpeppe: to start with, one major caveat is that code was much more about caching DNS *misses* than about hits | 08:21 |
jam | which is not what you're focused on | 08:21 |
jam | I believe the problem worked out as | 08:21 |
jam | 1) We only ever connected to a single 'address' when doing things like 'juju ssh' | 08:21 |
rogpeppe | jam: yeah, that's my issue too - i'm not entirely sure what the issues were, and i think we've probably lost all the code reviews from then :-\ | 08:21 |
jam | 2) We thought 'hey, hostnames, that should be better', so we started preferring hostnames to IP addresses | 08:21 |
jam | 3) Then MAAS started giving us hostnames that we can't resolve because they aren't in most users laptops | 08:22 |
jam | and we didn't want to continually try to look up hostnames that weren't resolvable, and we *definitely* didn't want to use them as the preferred address for 'ssh' if we were only going to try 1 | 08:22 |
jam | we now do attempt multiple targets | 08:25 |
jam | I'm not sure why we would want to internally add yet another DNS cache | 08:25 |
jam | (IIRC Linux defaults to using a local DNS cache anyway) | 08:25 |
jam | if we're doing something internally where we're connecting repeatedly and DNS lookups are a significant problem, I'm not opposed to them | 08:25 |
jam | but as always 'cache invalidation' is one of the core problems in programming | 08:26 |
jam | so avoiding cache when you don't actually need it is often a good plan | 08:26 |
blahdeblah | jam: +1000 | 08:26 |
jam | and balloons would be the person to talk to about the archived review site, I'm fairly sure the content was not deleted, just the site taken down | 08:26 |
jam | as maintaining those machines (keeping security updates, etc) has a nonzero cost to us | 08:27 |
rogpeppe | jam: yeah, i'm aware of that, but it really is a useful resource | 08:27 |
rogpeppe | jam: hmm, so the ssh issue is one i hadn't thought about | 08:30 |
jam | rogpeppe: your DNSCache isn't caching negative results, so it doesn't really touch that problem, but AIUI that is why we had the "unvalidated addresses" | 08:31 |
jam | unresolved | 08:31 |
rogpeppe | jam: so my problem with the unresolved addresses is that it makes it sound like the other ones are resolved, but they're not. the two fields sit in uneasy tension - their responsibilities are unclear | 08:32 |
rogpeppe | jam: the direction i'm trying to head is that one field has the addresses as returned by the controller, and that other fields provide meta-info that records stuff related to the addresses (e.g. their resolved IP address or whether we could resolve the address) | 08:34 |
rogpeppe | jam: the meta fields don't impact on correctness and can always be deleted without a problem (except potentially some extra connection time) | 08:35 |
rogpeppe | jam: so... you think that there's not really a problem with slow DNS lookups? | 08:35 |
rogpeppe | jam: if so, why did the original code bother to record the resolved IP addresses at all? it could just have moved addresses that resolve OK to the front of the list. | 08:36 |
jam | rogpeppe: I think it was a case of "when we can find IP addresses prefer them, because for everything that isn't JAAS they're actually more 'real'" and available everywhere | 08:37 |
jam | regardless of my personal configuration, etc. | 08:38 |
jam | I think JAAS throws a wrench into that | 08:38 |
jam | that comes after that code was landed | 08:38 |
rogpeppe | jam: but IP addresses can change | 08:38 |
rogpeppe | jam: and i still don't really understand. "when we can find IP addresses"... that's the responsibility of DNS, right? why are we doing it ourselves? | 08:39 |
rogpeppe | jam: that is, why is it a good idea to store the resolved addresses in controllers.yaml? | 08:40 |
jam | so for things like shared (old) environments.yaml files, what DNS servers you could see was often disjoint from what IP addresses you could see | 08:40 |
jam | so if I had one machine that *was* configured to see MAAS, putting the IP addresses in there meant that I could share it with another machine that *couldn't* see MAAS's DNS | 08:41 |
jam | but could route to MAAS | 08:41 |
rogpeppe | jam: ha, so the MAAS DNS addresses were only resolvable locally, but the resolved IP addresses worked globally? | 08:42 |
jam | rogpeppe: so MAAS runs its own DNS server that tracks all of the machines that it is managing | 08:43 |
jam | you can certainly have a *route* to the MAAS network | 08:43 |
rogpeppe | jam: if that's the case, that's a reasonable argument for maintaining a DNS cache | 08:43 |
jam | without changing your local DNS to point to MAAS's bind | 08:43 |
jam | (its not bind, but whatever it is) | 08:43 |
wpk | (it is bind ;) ) | 08:44 |
rogpeppe | jam: so if that's the case, how is the code much more about caching DNS misses? | 08:44 |
jam | wpk: I thought it was dnsmasq or something like that | 08:45 |
jam | maybe I'm thinking the DHCP one changed, not DNS | 08:45 |
jam | I know they rewrote one of the backends | 08:45 |
jam | well, switched backends | 08:45 |
wpk | jam: btw, when you're done with this you could take a look at https://github.com/juju/juju/pull/7383 ? | 08:47 |
jam | wpk: looks like I started, but just didn't finish, will refresh | 08:49 |
axw | jam: sorry, didn't see you had reviewed ian's branch already... I don't understand why his change would fix anything. maybe you can answer my questions? | 08:50 |
jam | axw: I'm going off the comments sections around where he had touched, but did not try to completely validate the logic myself. It sounded like one of those cases where a TXN can't chain its actions | 08:53 |
jam | (op2 doesn't see the result of op1, IIRC) | 08:53 |
jam | at least in terms of all-assertions are triggered before all ops | 08:53 |
jam | 'are checked' | 08:53 |
jam | it sounded like if there were multiple ways that we might decref the reference counters during teardown, it wouldn't always go to 0. | 08:54 |
jam | though with a "$inc" of -1 (or a "$dec" if there is one), those operations shouldn't be trying to check the value and set it to one less than it currently is | 08:54 |
jam | axw: what *I* got out of it, was that if you always did the finalization, then you actually end up with 2 finalization calls sometimes and the second would fail | 08:55 |
jam | so instead he changed it to be "always call it at the end, but avoid calling it early" | 08:55 |
jam | axw: at least, that was my understanding and why it 'seemed like it would be ok', but I'll admit to not really digging deep into everything. | 08:56 |
axw | jam: ok. I'm not 100% sure, but I didn't think it would do that because there's no asserts on the ops | 08:56 |
jam | axw: so double finalize sounds like it could fail | 08:57 |
jam | cause the doc you're removing doesn't exist | 08:57 |
=== marlinc_ is now known as marlinc | ||
axw | jam: Remove will succeed even if the doc doesn't exist, unless you assert txn.DocExists (just tested by duplicating the ops). pretty sure the issue is that "isFinal" is not triggering, but I don't know why | 09:06 |
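axw's observation above (a Remove succeeds even when the doc is gone, unless the op asserts txn.DocExists) comes from mgo/txn's two-phase behaviour: all assertions are checked before any operation is applied. Here is a toy, stdlib-only Go model of that shape — the `Op` struct and `run` function are illustrative stand-ins, not the real gopkg.in/mgo.v2/txn API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAborted mirrors txn.ErrAborted: returned when any assertion fails.
var ErrAborted = errors.New("transaction aborted")

// Op is a toy stand-in for mgo/txn's txn.Op: optionally assert that the
// doc exists, and optionally remove it.
type Op struct {
	Id           string
	AssertExists bool // like Assert: txn.DocExists
	Remove       bool
}

// run checks every assertion first, then applies every remove -- the same
// assert-then-apply shape mgo/txn uses, which is why removing a missing
// doc succeeds unless the op carries an existence assertion.
func run(docs map[string]bool, ops []Op) error {
	for _, op := range ops {
		if op.AssertExists && !docs[op.Id] {
			return ErrAborted
		}
	}
	for _, op := range ops {
		if op.Remove {
			delete(docs, op.Id)
		}
	}
	return nil
}

func main() {
	docs := map[string]bool{"unit-0": true}

	// Removing the same doc twice with no assertion: both succeed.
	err := run(docs, []Op{{Id: "unit-0", Remove: true}, {Id: "unit-0", Remove: true}})
	fmt.Println(err) // <nil>

	// Removing the now-missing doc with an existence assertion: aborts.
	err = run(docs, []Op{{Id: "unit-0", AssertExists: true, Remove: true}})
	fmt.Println(errors.Is(err, ErrAborted)) // true
}
```

This also illustrates why a double finalize that only removes docs would not fail on its own; per axw, the suspect is the "isFinal" condition not triggering rather than the ops aborting.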
axw | I started extracting a "transaction builder" for a limited set of State, but the yak hair was growing faster than I could cut it | 09:08 |
rogpeppe | axw: lol | 09:09 |
rogpeppe | wpk: i just merged your retry PR, thanks! | 09:12 |
wpk | rogpeppe: great, I'll update juju PR with just new dependencies | 09:14 |
wpk | rogpeppe: ok, juju PR updated | 09:22 |
rogpeppe | wpk: thanks | 09:23 |
rogpeppe | jam: (sorry, was busy trying to debug juju-run issue...) | 10:21 |
rogpeppe | jam: so, what's the upshot of our discussion? | 10:21 |
rogpeppe | jam: if we still care about copying controllers.yaml files and retaining previously resolved IP addresses, then ISTM that we'll still need some sort of DNS cache | 10:23 |
rogpeppe | jam: but i'm not quite sure whether we need to record DNS failures too | 10:24 |
rogpeppe | jam: currently i can't quite see that it's necessary. | 10:24 |
=== MmikeM is now known as Mmike | ||
axw_ | rogpeppe: just saw a test failure in CI for TestWithUnresolvableAddrAfterCacheFallback (http://juju-ci.vapour.ws:8080/job/github-merge-juju/11036/artifact/artifacts/xenial.log/*view*/) | 10:32 |
axw_ | I'm logging off shortly, will look tomorrow if you don't get to it | 10:32 |
rogpeppe | axw_: thanks for the heads up | 10:33 |
rogpeppe | axw_: i'll take a look | 10:33 |
axw_ | cheers | 10:33 |
jam | rogpeppe: well the current way to share controllers is things like 'register' and we're looking to have some other way to share with yourself | 11:21 |
jam | cause we don't *want* to copy controllers.yaml around manually | 11:21 |
rogpeppe | jam: ok, so perhaps we can lose all the DNS caching stuff. all we really need to do is put the dialed host name at the start of the address list | 11:25 |
jam | rogpeppe: the only other thing to sanity check is things like 'git blame' to see what commit messages say about things. | 11:26 |
rogpeppe | jam: my current approach would mean that if there's a controller with a host name that resolves to several IP addresses and one of them is down, that the second time it would always try that IP address first | 11:26 |
rogpeppe | jam: unfortunately our commit messages are often pretty crap | 11:26 |
rogpeppe | jam: i really miss having the review history | 11:26 |
jam | rogpeppe: so with git blame and a small amount of walking, you can find the rev that actually merged the code, which gives you at least the review message | 11:27 |
rogpeppe | jam: --ancestry-path is very useful for that | 11:27 |
jam | rogpeppe: why would the IP that didn't resolve get chosen first the next time? | 11:34 |
jam | I also thought we always sort and then move the one we successfully connected to, to the front | 11:34 |
rogpeppe | jam: the IP that *did* resolve would be chosen first next time, sorry | 11:34 |
rogpeppe | jam: we do currently. my plan was to remove the unresolved-api-endpoints field and add a dns-cache field mapping host names to ip addresses | 11:35 |
rogpeppe | jam: when you successfully dial an address, you move that hostname to the start of api-endpoints and the dialed ip address to the start of the dns-cache entry | 11:36 |
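The bookkeeping rogpeppe describes (on a successful dial, move the hostname to the front of api-endpoints and the dialed IP to the front of that host's dns-cache entry) can be sketched in a few lines of Go. The function names here are hypothetical, not juju's actual code:

```go
package main

import "fmt"

// moveToFront returns addrs with addr at position 0, preserving the
// relative order of everything else.
func moveToFront(addr string, addrs []string) []string {
	out := []string{addr}
	for _, a := range addrs {
		if a != addr {
			out = append(out, a)
		}
	}
	return out
}

// recordDial updates the bookkeeping after a successful dial, per the
// scheme described above: the dialed host goes to the front of the
// endpoint list, and the dialed IP to the front of that host's
// dns-cache entry, so the next connection tries the known-good pair first.
func recordDial(host, ip string, endpoints []string, dnsCache map[string][]string) []string {
	dnsCache[host] = moveToFront(ip, dnsCache[host])
	return moveToFront(host, endpoints)
}

func main() {
	endpoints := []string{"a.example.com", "b.example.com"}
	cache := map[string][]string{"b.example.com": {"10.0.0.1", "10.0.0.2"}}

	endpoints = recordDial("b.example.com", "10.0.0.2", endpoints, cache)
	fmt.Println(endpoints)              // [b.example.com a.example.com]
	fmt.Println(cache["b.example.com"]) // [10.0.0.2 10.0.0.1]
}
```

This keeps the controller-supplied address list intact as the source of truth, with the cache acting purely as deletable meta-info, as rogpeppe proposes earlier.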
wpk | rogpeppe: could you check https://github.com/juju/juju/pull/7417 ? | 12:36 |
rogpeppe | wpk: reviewed | 13:29 |
natefinch | backup compression has landed in lumberjack FYI. | 16:09 |
natefinch | not sure who is on during US work hours anymore | 16:09 |
natefinch | hi rick_h marcoceppi rogpeppe alexisb | 16:11 |
rogpeppe | natefinch: yo! | 16:11 |
alexisb | heya natefinch | 16:11 |
natefinch | howdy :) | 16:12 |
natefinch | rogpeppe: how's things in juju land? | 16:13 |
rick_h | Howdy natefinch! | 16:13 |
rogpeppe | natefinch: scrumptious as always :) | 16:13 |
* natefinch waves at everyone | 16:13 | |
natefinch | haha | 16:13 |
rick_h | natefinch: how's the weather up in the Northeast treating ya? | 16:14 |
natefinch | rick_h: pretty good. mild most days, barely need heat or A/C. | 16:15 |
rick_h | natefinch: awesome, great time of the year | 16:16 |
natefinch | thumper wanted backup compression done by today, so it's in. Updating to master of gopkg.in/natefinch/lumberjack.v2 will bring it in. Also tagged it as v2.1 for anyone who might be using something that cares about semantic versioning. | 16:17 |
rick_h | natefinch: that's awesome ty much! | 16:19 |
marcoceppi | o/ natefinch | 17:49 |
natefinch | hi marcoceppi | 17:50 |
thumper | veebers: so... why does the assess_log_rotation acceptance test require a JUJU_HOME/environments.yaml? | 22:10 |
veebers | thumper: due to how the tests currently set up the environment to bootstrap, we have a source for credentials and settings etc. which are named (hence 'env' argument). ci-tests take that and prepare a JUJU_DATA (known as JUJU_HOME for historic reasons for the test arg) | 22:12 |
thumper | I'm not sure what I need to pass it to get it running locally | 22:13 |
veebers | thumper: if you have cloud-city you need: JUJU_HOME=<path to cloud city> ./<script name> <env name> where env name is parallel-lxd | 22:25 |
thumper | veebers: ok it is running now... | 22:33 |
veebers | thumper: cool | 22:35 |
thumper | babbageclunk: https://bugs.launchpad.net/bugs/1694559 | 22:47 |
mup | Bug #1694559: Log forwarding + debug log level = infinite messages <juju:New> <https://launchpad.net/bugs/1694559> | 22:47 |
thumper | babbageclunk: is there any way to easily enforce a larger batch size? | 22:48 |
thumper | larger minimum that is | 22:48 |
babbageclunk | thumper: you'd need to change the structure of the code a bit - at the moment it just sends batches as it's handed them. | 22:49 |
babbageclunk | But I don't think it'd be especially hard. | 22:49 |
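babbageclunk's suggested restructure — buffer records until a minimum batch size is reached instead of forwarding each batch as it arrives — could look roughly like this. The `batcher` type and its fields are a hypothetical sketch, not the log-forwarding worker's real code:

```go
package main

import "fmt"

// batcher accumulates log records and only hands them on once at least
// minSize have been collected, rather than sending each batch as it is
// handed to us.
type batcher struct {
	minSize int
	pending []string
	send    func([]string)
}

// Add buffers a record and flushes once the minimum batch size is reached.
// A real worker would also flush on a timer or at shutdown so a trickle of
// records is not held forever.
func (b *batcher) Add(rec string) {
	b.pending = append(b.pending, rec)
	if len(b.pending) >= b.minSize {
		b.send(b.pending)
		b.pending = nil
	}
}

func main() {
	var sent [][]string
	b := &batcher{minSize: 3, send: func(recs []string) {
		sent = append(sent, append([]string(nil), recs...))
	}}
	for i := 0; i < 7; i++ {
		b.Add(fmt.Sprintf("line-%d", i))
	}
	fmt.Println(len(sent))      // 2 full batches sent
	fmt.Println(len(b.pending)) // 1 record still buffered
}
```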
thumper | wallyworld, babbageclunk: is this bug still accurate? https://bugs.launchpad.net/juju/+bug/1646907 | 23:29 |
mup | Bug #1646907: gce open-port does not create firewall rules <gce-provider> <network> <open-port> <juju:Triaged> <https://launchpad.net/bugs/1646907> | 23:29 |
babbageclunk | thumper: don't know, would need to try it out sorry | 23:30 |
thumper | babbageclunk: that's ok, I thought it might have been covered by work you did there | 23:30 |
thumper | with the firewaller | 23:30 |
wallyworld | thumper: don't *think* so. there was a lot of cleanup and improvement to that code that i did in the past couple of months, and the bug was from dec | 23:30 |
thumper | if you don't know, I'll just drop priority and we can address later | 23:31 |
wallyworld | +1 | 23:31 |
thumper | wallyworld: it seems to me that if https://bugs.launchpad.net/juju/+bug/1613823 was still a problem, we'd see many more CI failures for gce | 23:37 |
mup | Bug #1613823: Google Compute Engine IP is ephemeral by default <gce-provider> <juju:Triaged> <https://launchpad.net/bugs/1613823> | 23:37 |
thumper | thoughts? | 23:37 |
anastasiamac | thumper: i think u'd see it to be a problem on a longer-running juju... how many CI tests are long-running? | 23:37 |
thumper | anastasiamac: but this is talking about controller dialing from a client | 23:38 |
wallyworld | thumper: the IP does change on reboot of a machine, but i didn't think it changed arbitrarily during use | 23:38 |
thumper | so if the controller reboots... nothing can talk to it? | 23:39 |
thumper | that seems terrible | 23:39 |
wallyworld | yeah, i think that may be the case | 23:39 |
wallyworld | i haven't tested fully myself | 23:39 |
wallyworld | but it does seem an issue | 23:40 |
wallyworld | we should look at for 2.3 | 23:40 |