blahdeblah | thumper: You around? Looking for some advice on how to tackle a problem. | 00:19 |
---|---|---|
blahdeblah | I've got a production 1.25.10 environment in a rather bad way, showing the traditional symptoms of 1587644, plus the broken status updates of 1666396, but with no missing txn-revnos. | 00:19 |
blahdeblah | debug-log is full of https://pastebin.canonical.com/189580/ | 00:19 |
blahdeblah | restarting jujud-machine-0, juju-db, and rsyslog has had no effect; it's just constant leadership-tracker spam per above | 00:19 |
blahdeblah | ^ Or anyone else, for that matter :-) | 00:21 |
babbageclunk | blahdeblah: can you see any message about why the leadership-tracker isn't running? | 00:22 |
babbageclunk | Or has that fallen off the bottom of the log? | 00:22 |
blahdeblah | Let me run the debug-log for a while and grep out the noise | 00:22 |
blahdeblah | I got one of these: machine-0: 2017-05-31 00:22:31 ERROR juju.worker.resumer resumer.go:69 cannot resume transactions: cannot find transaction ObjectIdHex("5912fa035290d208a719a8cc") | 00:23 |
blahdeblah | Is that a smoking gun for an mgopurge? | 00:23 |
babbageclunk | blahdeblah: I think so (although I'm not an expert). | 00:25 |
blahdeblah | Seems like it from https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge | 00:25 |
blahdeblah | I'll try that | 00:25 |
thumper | blahdeblah: yeah | 00:30 |
thumper | blahdeblah: oh, seems like you've sorted it | 00:31 |
blahdeblah | thumper: so mgopurge is the right step? | 00:31 |
thumper | blahdeblah: yes | 00:31 |
blahdeblah | cool - thanks, babbageclunk & thumper | 00:32 |
blahdeblah | Hmmm. This is a rather persistently missing transaction. Same one keeps repeating in the debug log. | 00:40 |
blahdeblah | mgopurge output: https://pastebin.canonical.com/189582/ | 00:41 |
blahdeblah | Anything look awry there? ^ | 00:41 |
blahdeblah | thumper: ^ When you have a sec | 01:04 |
blahdeblah | ^ Full mgopurge (a.o.t. just pruning per https://github.com/juju/juju/wiki/MgoPurgeTool#pruning) did the trick | 01:27 |
thumper | cool | 01:35 |
babbageclunk | wallyworld_ or thumper: take a look at https://github.com/juju/juju/pull/7418 please? | 01:54 |
wallyworld_ | righto | 01:54 |
babbageclunk | ta | 01:55 |
wallyworld_ | babbageclunk: a small test quibble | 02:03 |
babbageclunk | wallyworld_: cool, thanks | 02:06 |
axw | wallyworld_: I thought you were talking about http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 (obtained 3, expected 4) in the standup ... but now I see a card for TestAgentConnectionsShutDownWhenAPIServerDies. I've fixed the former, did you want me to look at the latter? | 02:06 |
babbageclunk | wallyworld_: oh yeah, good point | 02:07 |
wallyworld_ | axw: that card is from a day or 2 ago; i think the test run is in the details of the card. but yeah, any race failure is good to fix | 02:07 |
axw | okey dokey | 02:08 |
=== natefrinch is now known as natefinch | ||
natefinch | can't even spell my own damn name | 02:09 |
mup | Bug #1494661 changed: Rotated logs are not compressed <canonical-bootstack> <uosci> <juju:In Progress by jsing> <juju-core:Won't Fix> <juju-core 1.25:Won't Fix> <https://launchpad.net/bugs/1494661> | 02:15 |
natefinch | @thumper you around? | 02:15 |
meetingology | natefinch: Error: "thumper" is not a valid command. | 02:15 |
thumper | natefinch: yep | 02:15 |
thumper | damn bots | 02:15 |
natefinch | @thumper has this code been reviewed by someone on Juju? | 02:15 |
meetingology | natefinch: Error: "thumper" is not a valid command. | 02:15 |
thumper | natefinch: you need to stop using @ | 02:16 |
natefinch | right sorry, slack | 02:16 |
natefinch | creeping in my brain | 02:16 |
natefinch | heh | 02:16 |
thumper | I *could*, or perhaps axw might be better suited | 02:16 |
thumper | slack, pah | 02:16 |
natefinch | haha | 02:16 |
thumper | natefinch: if you like, I could quickly test as well... | 02:16 |
natefinch | mostly wanted to know if I was rubber stamping or if I need to actually be really thorough on this | 02:17 |
thumper | natefinch: is it verification you wanted? | 02:17 |
natefinch | sounds like the latter | 02:17 |
thumper | I did look through it when it was initially proposed | 02:17 |
thumper | and it looked fine to me | 02:17 |
natefinch | *nod* | 02:17 |
thumper | but I didn't feel like I should comment on the PR | 02:17 |
natefinch | it's open source, you can do whatever the hell you like ;) | 02:21 |
natefinch | what I'd really like is if most of the new logic were in separate functions that could be unit tested | 02:22 |
natefinch | part of that is my own fault for not doing the same when I was writing it | 02:22 |
* babbageclunk goes for a run | 02:36 | |
thumper | axw: is there a bug for the status history pruner failures? | 02:58 |
axw | thumper: not that I know of | 02:58 |
* axw searches | 02:58 | |
axw | thumper: no not that I can see. why do you ask? | 02:59 |
thumper | axw: where did you see that it periodically fails? | 02:59 |
thumper | just thinking there should be a bug | 02:59 |
axw | thumper: http://qa.jujucharms.com/releases/5314/job/run-unit-tests-race/attempt/2811 | 03:00 |
thumper | ah ha | 03:00 |
thumper | ok | 03:00 |
natefinch | thumper: I'll approve with the mode/chown code added. | 03:56 |
axw | wallyworld_: coming to tech board? | 04:01 |
wallyworld_ | axw: i have to rush off to the football, i was looking for a second set of eyes. live testing seems to work, but unit test fails - the err is nil rather than not found. the txn ops to remove the constraints are getting queued in the slice, so it's just something dumb i've missed. https://github.com/juju/juju/pull/7421 | 07:04 |
blahdeblah | wallyworld_: Wave for the camera - I'll be watching for you. :-) | 07:06 |
wallyworld_ | axw: also, i just noticed deploying the second time results in a storage attached hook error, but i can't dig in now, i'll have to look later tonight when i get back | 07:06 |
wallyworld_ | will do :-) hope we win | 07:06 |
wallyworld_ | axw: maybe send an email if you see something and i'll pick up from there. ttyl | 07:07 |
axw | wallyworld_: ok np. enjoy | 07:08 |
rogpeppe | jam: hiya | 07:41 |
wpk | rogpeppe: https://github.com/go-retry/retry/pull/1 | 08:01 |
jam | hi rogpeppe, currently digging into some hairy code, I'm guessing you'd like to chat about DNS Cache stuff? | 08:02 |
rogpeppe | jam: yeah | 08:02 |
rogpeppe | wpk: reviewed | 08:04 |
wpk | rogpeppe: updated | 08:18 |
rogpeppe | wpk: thanks! | 08:18 |
rogpeppe | wpk: i'm not sure you've pushed your changes | 08:19 |
wpk | I didn't, sec | 08:20 |
wpk | ready | 08:20 |
jam | rogpeppe: so, DNSCache stuff. a few thoughts. To start with, realize that my understanding of the MAAS issue is not code that I worked on, but stuff that I heard around the area so my memory may be inaccurate. | 08:21 |
jam | rogpeppe: to start with, one major caveat is that code was much more about caching DNS *misses* than about hits | 08:21 |
jam | which is not what you're focused on | 08:21 |
jam | I believe the problem worked out as | 08:21 |
jam | 1) We only ever connected to a single 'address' when doing things like 'juju ssh' | 08:21 |
rogpeppe | jam: yeah, that's my issue too - i'm not entirely sure what the issues were, and i think we've probably lost all the code reviews from then :-\ | 08:21 |
jam | 2) We thought 'hey, hostnames, that should be better', so we started preferring hostnames to IP addresses | 08:21 |
jam | 3) Then MAAS started giving us hostnames that we can't resolve because they aren't in most users laptops | 08:22 |
jam | and we didn't want to continually try to look up hostnames that weren't resolvable, and we *definitely* didn't want to use them as the preferred address for 'ssh' if we were only going to try 1 | 08:22 |
jam | we now do attempt multiple targets | 08:25 |
jam | I'm not sure why we would want to internally add yet another DNS cache | 08:25 |
jam | (IIRC Linux defaults to using a local DNS cache anyway) | 08:25 |
jam | if we're doing something internally where we're connecting repeatedly and DNS lookups are a significant problem, I'm not opposed to them | 08:25 |
jam | but as always 'cache invalidation' is one of the core problems in programming | 08:26 |
jam | so avoiding cache when you don't actually need it is often a good plan | 08:26 |
blahdeblah | jam: +1000 | 08:26 |
jam | and balloons would be the person to talk to about the archived review site, I'm fairly sure the content was not deleted, just the site taken down | 08:26 |
jam | as maintaining those machines (keeping security updates, etc) has a nonzero cost to us | 08:27 |
rogpeppe | jam: yeah, i'm aware of that, but it really is a useful resource | 08:27 |
rogpeppe | jam: hmm, so the ssh issue is one i hadn't thought about | 08:30 |
jam | rogpeppe: your DNSCache isn't caching negative results, so it doesn't really touch that problem, but AIUI that is why we had the "unvalidated addresses" | 08:31 |
jam | unresolved | 08:31 |
rogpeppe | jam: so my problem with the unresolved addresses is that it makes it sound like the other ones are resolved, but they're not. the two fields sit in uneasy tension - their responsibilities are unclear | 08:32 |
rogpeppe | jam: the direction i'm trying to head is that one field has the addresses as returned by the controller, and that other fields provide meta-info that records stuff related to the addresses (e.g. their resolved IP address or whether we could resolve the address) | 08:34 |
rogpeppe | jam: the meta fields don't impact on correctness and can always be deleted without a problem (except potentially some extra connection time) | 08:35 |
rogpeppe | jam: so... you think that there's not really a problem with slow DNS lookups? | 08:35 |
rogpeppe | jam: if so, why did the original code bother to record the resolved IP addresses at all? it could just have moved addresses that resolve OK to the front of the list. | 08:36 |
jam | rogpeppe: I think it was a case of "when we can find IP addresses prefer them, because for everything that isn't JAAS they're actually more 'real'" and available everywhere | 08:37 |
jam | regardless of my personal configuration, etc. | 08:38 |
jam | I think JAAS throws a wrench into that | 08:38 |
jam | that comes after that code was landed | 08:38 |
rogpeppe | jam: but IP addresses can change | 08:38 |
rogpeppe | jam: and i still don't really understand. "when we can find IP addresses"... that's the responsibility of DNS, right? why are we doing it ourselves? | 08:39 |
rogpeppe | jam: that is, why is it a good idea to store the resolved addresses in controllers.yaml? | 08:40 |
jam | so for things like shared (old) environments.yaml files, what DNS servers you could see was often disjoint from what IP addresses you could see | 08:40 |
jam | so if I had one machine that *was* configured to see MAAS, putting the IP addresses in there meant that I could share it with another machine that *couldn't* see MAAS's DNS | 08:41 |
jam | but could route to MAAS | 08:41 |
rogpeppe | jam: ha, so the MAAS DNS addresses were only resolvable locally, but the resolved IP addresses worked globally? | 08:42 |
jam | rogpeppe: so MAAS runs its own DNS server that tracks all of the machines that it is managing | 08:43 |
jam | you can certainly have a *route* to the MAAS network | 08:43 |
rogpeppe | jam: if that's the case, that's a reasonable argument for maintaining a DNS cache | 08:43 |
jam | without changing your local DNS to point to MAAS's bind | 08:43 |
jam | (its not bind, but whatever it is) | 08:43 |
wpk | (it is bind ;) ) | 08:44 |
rogpeppe | jam: so if that's the case, how is the code much more about caching DNS misses? | 08:44 |
jam | wpk: I thought it was dnsmasq or something like that | 08:45 |
jam | maybe I'm thinking the DHCP one changed, not DNS | 08:45 |
jam | I know they rewrote one of the backends | 08:45 |
jam | well, switched backends | 08:45 |
wpk | jam: btw, when you're done with this you could take a look at https://github.com/juju/juju/pull/7383 ? | 08:47 |
jam | wpk: looks like I started, but just didn't finish, will refresh | 08:49 |
axw | jam: sorry, didn't see you had reviewed ian's branch already... I don't understand why his change would fix anything. maybe you can answer my questions? | 08:50 |
jam | axw: I'm going off the comments sections around where he had touched, but did not try to completely validate the logic myself. It sounded like one of those cases where a TXN can't chain its actions | 08:53 |
jam | (op2 doesn't see the result of op1, IIRC) | 08:53 |
jam | at least in terms of all-assertions are triggered before all ops | 08:53 |
jam | 'are checked' | 08:53 |
jam | it sounded like if there were multiple ways that we might decref the reference counters during teardown, it wouldn't always go to 0. | 08:54 |
jam | though with a "$inc" of -1 (or a "$dec" if there is one), those operations shouldn't be trying to check the value and set it to one less than it currently is | 08:54 |
jam | axw: what *I* got out of it, was that if you always did the finalization, then you actually end up with 2 finalization calls sometimes and the second would fail | 08:55 |
jam | so instead he changed it to be "always call it at the end, but avoid calling it early" | 08:55 |
jam | axw: at least, that was my understanding and why it 'seemed like it would be ok', but I'll admit to not really digging deep into everything. | 08:56 |
axw | jam: ok. I'm not 100% sure, but I didn't think it would do that because there's no asserts on the ops | 08:56 |
jam | axw: so double finalize sounds like it could fail | 08:57 |
jam | cause the doc you're removing doesn't exist | 08:57 |
=== marlinc_ is now known as marlinc | ||
axw | jam: Remove will succeed even if the doc doesn't exist, unless you assert txn.DocExists (just tested by duplicating the ops). pretty sure the issue is that "isFinal" is not triggering, but I don't know why | 09:06 |
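axw's observation above (a Remove succeeds even when the doc is gone, unless the op asserts txn.DocExists) comes from mgo/txn's two-phase behaviour: all assertions are checked before any operation is applied. Here is a toy, stdlib-only Go model of that shape — the `Op` struct and `run` function are illustrative stand-ins, not the real gopkg.in/mgo.v2/txn API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAborted mirrors txn.ErrAborted: returned when any assertion fails.
var ErrAborted = errors.New("transaction aborted")

// Op is a toy stand-in for mgo/txn's txn.Op: optionally assert that the
// doc exists, and optionally remove it.
type Op struct {
	Id           string
	AssertExists bool // like Assert: txn.DocExists
	Remove       bool
}

// run checks every assertion first, then applies every remove -- the same
// assert-then-apply shape mgo/txn uses, which is why removing a missing
// doc succeeds unless the op carries an existence assertion.
func run(docs map[string]bool, ops []Op) error {
	for _, op := range ops {
		if op.AssertExists && !docs[op.Id] {
			return ErrAborted
		}
	}
	for _, op := range ops {
		if op.Remove {
			delete(docs, op.Id)
		}
	}
	return nil
}

func main() {
	docs := map[string]bool{"unit-0": true}

	// Removing the same doc twice with no assertion: both succeed.
	err := run(docs, []Op{{Id: "unit-0", Remove: true}, {Id: "unit-0", Remove: true}})
	fmt.Println(err) // <nil>

	// Removing the now-missing doc with an existence assertion: aborts.
	err = run(docs, []Op{{Id: "unit-0", AssertExists: true, Remove: true}})
	fmt.Println(errors.Is(err, ErrAborted)) // true
}
```

This also illustrates why a double finalize that only removes docs would not fail on its own; per axw, the suspect is the "isFinal" condition not triggering rather than the ops aborting.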
axw | I started extracting a "transaction builder" for a limited set of State, but the yak hair was growing faster than I could cut it | 09:08 |
rogpeppe | axw: lol | 09:09 |
rogpeppe | wpk: i just merged your retry PR, thanks! | 09:12 |
wpk | rogpeppe: great, I'll update juju PR with just new dependencies | 09:14 |
wpk | rogpeppe: ok, juju PR updated | 09:22 |
rogpeppe | wpk: thanks | 09:23 |
rogpeppe | jam: (sorry, was busy trying to debug juju-run issue...) | 10:21 |
rogpeppe | jam: so, what's the upshot of our discussion? | 10:21 |
rogpeppe | jam: if we still care about copying controllers.yaml files and retaining previously resolved IP addresses, then ISTM that we'll still need some sort of DNS cache | 10:23 |
rogpeppe | jam: but i'm not quite sure whether we need to record DNS failures too | 10:24 |
rogpeppe | jam: currently i can't quite see that it's necessary. | 10:24 |
=== MmikeM is now known as Mmike | ||
axw_ | rogpeppe: just saw a test failure in CI for TestWithUnresolvableAddrAfterCacheFallback (http://juju-ci.vapour.ws:8080/job/github-merge-juju/11036/artifact/artifacts/xenial.log/*view*/) | 10:32 |
axw_ | I'm logging off shortly, will look tomorrow if you don't get to it | 10:32 |
rogpeppe | axw_: thanks for the heads up | 10:33 |
rogpeppe | axw_: i'll take a look | 10:33 |
axw_ | cheers | 10:33 |
jam | rogpeppe: well the current way to share controllers is things like 'register' and we're looking to have some other way to share with yourself | 11:21 |
jam | cause we don't *want* to copy controllers.yaml around manually | 11:21 |
rogpeppe | jam: ok, so perhaps we can lose all the DNS caching stuff. all we really need to do is put the dialed host name at the start of the address list | 11:25 |
jam | rogpeppe: the only other thing to sanity check is things like 'git blame' to see what commit messages say about things. | 11:26 |
rogpeppe | jam: my current approach would mean that if there's a controller with a host name that resolves to several IP addresses and one of them is down, that the second time it would always try that IP address first | 11:26 |
rogpeppe | jam: unfortunately our commit messages are often pretty crap | 11:26 |
rogpeppe | jam: i really miss having the review history | 11:26 |
jam | rogpeppe: so with git blame and a small amount of walking, you can find the rev that actually merged the code, which gives you at least the review message | 11:27 |
rogpeppe | jam: --ancestry-path is very useful for that | 11:27 |
jam | rogpeppe: why would the IP that didn't resolve get chosen first the next time? | 11:34 |
jam | I also thought we always sort and then move the one we successfully connected to, to the front | 11:34 |
rogpeppe | jam: the IP that *did* resolve would be chosen first next time, sorry | 11:34 |
rogpeppe | jam: we do currently. my plan was to remove the unresolved-api-endpoints field and add a dns-cache field mapping host names to ip addresses | 11:35 |
rogpeppe | jam: when you successfully dial an address, you move that hostname to the start of api-endpoints and the dialed ip address to the start of the dns-cache entry | 11:36 |
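The bookkeeping rogpeppe describes (on a successful dial, move the hostname to the front of api-endpoints and the dialed IP to the front of that host's dns-cache entry) can be sketched in a few lines of Go. The function names here are hypothetical, not juju's actual code:

```go
package main

import "fmt"

// moveToFront returns addrs with addr at position 0, preserving the
// relative order of everything else.
func moveToFront(addr string, addrs []string) []string {
	out := []string{addr}
	for _, a := range addrs {
		if a != addr {
			out = append(out, a)
		}
	}
	return out
}

// recordDial updates the bookkeeping after a successful dial, per the
// scheme described above: the dialed host goes to the front of the
// endpoint list, and the dialed IP to the front of that host's
// dns-cache entry, so the next connection tries the known-good pair first.
func recordDial(host, ip string, endpoints []string, dnsCache map[string][]string) []string {
	dnsCache[host] = moveToFront(ip, dnsCache[host])
	return moveToFront(host, endpoints)
}

func main() {
	endpoints := []string{"a.example.com", "b.example.com"}
	cache := map[string][]string{"b.example.com": {"10.0.0.1", "10.0.0.2"}}

	endpoints = recordDial("b.example.com", "10.0.0.2", endpoints, cache)
	fmt.Println(endpoints)              // [b.example.com a.example.com]
	fmt.Println(cache["b.example.com"]) // [10.0.0.2 10.0.0.1]
}
```

This keeps the controller-supplied address list intact as the source of truth, with the cache acting purely as deletable meta-info, as rogpeppe proposes earlier.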
wpk | rogpeppe: could you check https://github.com/juju/juju/pull/7417 ? | 12:36 |
rogpeppe | wpk: reviewed | 13:29 |
natefinch | backup compression has landed in lumberjack FYI. | 16:09 |
natefinch | not sure who is on during US work hours anymore | 16:09 |
natefinch | hi rick_h marcoceppi rogpeppe alexisb | 16:11 |
rogpeppe | natefinch: yo! | 16:11 |
alexisb | heya natefinch | 16:11 |
natefinch | howdy :) | 16:12 |
natefinch | rogpeppe: how's things in juju land? | 16:13 |
rick_h | Howdy natefinch! | 16:13 |
rogpeppe | natefinch: scrumptious as always :) | 16:13 |
* natefinch waves at everyone | 16:13 | |
natefinch | haha | 16:13 |
rick_h | natefinch: how's the weather up in the Northeast treating ya? | 16:14 |
natefinch | rick_h: pretty good. mild most days, barely need heat or A/C. | 16:15 |
rick_h | natefinch: awesome, great time of the year | 16:16 |
natefinch | thumper wanted backup compression done by today, so it's in. Updating to master of gopkg.in/natefinch/lumberjack.v2 will bring it in. Also tagged it as v2.1 for anyone who might be using something that cares about semantic versioning. | 16:17 |
rick_h | natefinch: that's awesome ty much! | 16:19 |
marcoceppi | o/ natefinch | 17:49 |
natefinch | hi marcoceppi | 17:50 |
thumper | veebers: so... why does the assess_log_rotation acceptance test require a JUJU_HOME/environments.yaml? | 22:10 |
veebers | thumper: due to how the tests currently set up the environment to bootstrap, we have a source for credentials and settings etc. which are named (hence 'env' argument). ci-tests take that and prepare a JUJU_DATA (known as JUJU_HOME for historic reasons for the test arg) | 22:12 |
thumper | I'm not sure what I need to pass it to get it running locally | 22:13 |
veebers | thumper: if you have cloud-city you need: JUJU_HOME=<path to cloud city> ./<script name> <env name> where env name is parallel-lxd | 22:25 |
thumper | veebers: ok it is running now... | 22:33 |
veebers | thumper: cool | 22:35 |
thumper | babbageclunk: https://bugs.launchpad.net/bugs/1694559 | 22:47 |
mup | Bug #1694559: Log forwarding + debug log level = infinite messages <juju:New> <https://launchpad.net/bugs/1694559> | 22:47 |
thumper | babbageclunk: is there any way to easily enforce a larger batch size? | 22:48 |
thumper | larger minimum that is | 22:48 |
babbageclunk | thumper: you'd need to change the structure of the code a bit - at the moment it just sends batches as it's handed them. | 22:49 |
babbageclunk | But I don't think it'd be especially hard. | 22:49 |
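babbageclunk's suggested restructure — buffer records until a minimum batch size is reached instead of forwarding each batch as it arrives — could look roughly like this. The `batcher` type and its fields are a hypothetical sketch, not the log-forwarding worker's real code:

```go
package main

import "fmt"

// batcher accumulates log records and only hands them on once at least
// minSize have been collected, rather than sending each batch as it is
// handed to us.
type batcher struct {
	minSize int
	pending []string
	send    func([]string)
}

// Add buffers a record and flushes once the minimum batch size is reached.
// A real worker would also flush on a timer or at shutdown so a trickle of
// records is not held forever.
func (b *batcher) Add(rec string) {
	b.pending = append(b.pending, rec)
	if len(b.pending) >= b.minSize {
		b.send(b.pending)
		b.pending = nil
	}
}

func main() {
	var sent [][]string
	b := &batcher{minSize: 3, send: func(recs []string) {
		sent = append(sent, append([]string(nil), recs...))
	}}
	for i := 0; i < 7; i++ {
		b.Add(fmt.Sprintf("line-%d", i))
	}
	fmt.Println(len(sent))      // 2 full batches sent
	fmt.Println(len(b.pending)) // 1 record still buffered
}
```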
thumper | wallyworld, babbageclunk: is this bug still accurate? https://bugs.launchpad.net/juju/+bug/1646907 | 23:29 |
mup | Bug #1646907: gce open-port does not create firewall rules <gce-provider> <network> <open-port> <juju:Triaged> <https://launchpad.net/bugs/1646907> | 23:29 |
babbageclunk | thumper: don't know, would need to try it out sorry | 23:30 |
thumper | babbageclunk: that's ok, I thought it might have been covered by work you did there | 23:30 |
thumper | with the firewaller | 23:30 |
wallyworld | thumper: don't *think* so. there was a lot of cleanup and improvement to that code that i did in the past couple of months, and the bug was from dec | 23:30 |
thumper | if you don't know, I'll just drop priority and we can address later | 23:31 |
wallyworld | +1 | 23:31 |
thumper | wallyworld: it seems to me that if https://bugs.launchpad.net/juju/+bug/1613823 was still a problem, we'd see many more CI failures for gce | 23:37 |
mup | Bug #1613823: Google Compute Engine IP is ephemeral by default <gce-provider> <juju:Triaged> <https://launchpad.net/bugs/1613823> | 23:37 |
thumper | thoughts? | 23:37 |
anastasiamac | thumper: i think u'd see it to be a problem on a longer-running juju... how many CI tests are long-running? | 23:37 |
thumper | anastasiamac: but this is talking about controller dialing from a client | 23:38 |
wallyworld | thumper: the IP does change on reboot of a machine, but i didn't think it changed arbitrarily during use | 23:38 |
thumper | so if the controller reboots... nothing can talk to it? | 23:39 |
thumper | that seems terrible | 23:39 |
wallyworld | yeah, i think that may be the case | 23:39 |
wallyworld | i haven't tested fully myself | 23:39 |
wallyworld | but it does seem an issue | 23:40 |
wallyworld | we should look at for 2.3 | 23:40 |