axw | thumper: hey, FYI I'm online today - taking tomorrow off instead | 00:44 |
---|---|---|
babbageclunk | axw: standup then? ;) | 01:02 |
axw | babbageclunk: oh sorry! I didn't know you were on today | 01:10 |
axw | babbageclunk: I'm here if you really do want to standup, but for me it's just: working on upgrades | 01:11 |
babbageclunk | axw: :) externalreality and I had a brief chat, but no need - I'm chasing a bug with my sub-sub relation thing and he's working on getting details of run actions earlier to determine whether to take the machine lock. | 01:13 |
axw | babbageclunk: okey dokey, thanks | 01:13 |
thumper | menn0: is there a way to force a mongo election between peers? | 01:31 |
menn0 | thumper: you can fiddle with the member priorities | 01:32 |
thumper | https://bugs.launchpad.net/juju/+bug/1701275 | 01:32 |
mup | Bug #1701275: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect <cpe> <cpe-sa> <juju:Triaged> <https://launchpad.net/bugs/1701275> | 01:32 |
thumper | looks like the agents disconnect during an election | 01:32 |
thumper | and don't reconnect | 01:32 |
menn0 | thumper: we've seen that occasionally. it seems like a dep engine bug to me. | 01:33 |
thumper | yeah... | 01:33 |
thumper | we saw that on the customer site too | 01:33 |
menn0 | thumper: the dep engine stops trying after a while | 01:33 |
thumper | where agents had disconnected | 01:33 |
thumper | and didn't reconnect | 01:33 |
menn0 | thumper: perhaps due to a manifold returning the wrong thing | 01:33 |
thumper | not sure... | 01:33 |
thumper | I may read through the dep engine code and see if anything leaps out | 01:33 |
menn0 | thumper: I have some old logs from jam for a system he had where it happened | 01:34 |
menn0 | on his machine | 01:34 |
thumper | hmm... | 01:34 |
menn0 | thumper: do you have the tail end of the logs for an agent that got "stuck"? | 01:39 |
thumper | not off hand | 01:40 |
thumper | they are all in warning+ mode | 01:40 |
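For context, the failure mode menn0 is describing looks like this in miniature: a restart loop that either caps its attempts or backs off without bound will leave a worker dead once the election window outlasts it. Everything below is invented for illustration; it is not the dependency engine's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// startWorker stands in for starting a manifold's worker; here it fails while
// the election is in progress and would succeed afterwards. Hypothetical.
func startWorker(attempt int) error {
	if attempt < 10 {
		return errors.New("connection refused: election in progress")
	}
	return nil
}

func main() {
	const maxAttempts = 5 // a give-up cap like this would explain "stops trying"
	delay := time.Second
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := startWorker(attempt); err != nil {
			fmt.Printf("attempt %d: %v; retrying in %v\n", attempt, err, delay)
			time.Sleep(delay)
			delay *= 2 // unbounded backoff looks the same from the outside as a cap
			continue
		}
		fmt.Println("worker running")
		return
	}
	fmt.Println("gave up; the agent stays disconnected until restarted")
}
```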
axw | thumper: 1:1? | 02:01 |
menn0 | babbageclunk: I'm seeing failures in TestSplitLogsCollection. Looks like the bit that looks for the "ns not found" error isn't working on my machine for some reason | 03:47 |
menn0 | babbageclunk: is this a known thing? | 03:47 |
babbageclunk | menn0: not by me! | 03:57 |
menn0 | babbageclunk: the QueryError Code is coming back as 0 despite the message being the expected "ns not found" | 03:58 |
babbageclunk | menn0: maybe a mongo version difference? | 03:59 |
menn0 | babbageclunk: must be, although I can see that the tests are using our officially packaged mongodb 3.2 | 03:59 |
menn0 | babbageclunk: should we be worried that this is in a released version? | 04:00 |
menn0 | (I don't know if it is) | 04:00 |
babbageclunk | It definitely is - it was done for 2.2.0, I think. | 04:01 |
babbageclunk | menn0: is it failing consistently for you? | 04:02 |
menn0 | babbageclunk: yes | 04:02 |
menn0 | babbageclunk: I have a fix which uses the error code /or/ the message | 04:02 |
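A sketch of the shape such a fix could take, assuming mgo's *mgo.QueryError type; this is not menn0's actual patch, and the helper name and the numeric code 26 (NamespaceNotFound on 3.x servers) are illustrative:

```go
package state

import (
	mgo "gopkg.in/mgo.v2"
)

// isNsNotFound reports whether err means the collection is already gone.
// Mongo 2.x returns the bare "ns not found" message with Code left at 0,
// while 3.x also sets a numeric code, so accept either.
func isNsNotFound(err error) bool {
	qerr, ok := err.(*mgo.QueryError)
	if !ok {
		return false
	}
	// 26 is NamespaceNotFound on modern servers; 2.x leaves Code at 0.
	return qerr.Code == 26 || qerr.Message == "ns not found"
}
```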
menn0 | babbageclunk: looks like this is in 2.2 | 04:03 |
babbageclunk | menn0: So is my machine using the wrong mongo version for tests? | 04:05 |
menn0 | babbageclunk: possibly. there's a number of fallbacks | 04:06 |
menn0 | babbageclunk: run the state tests and do: ps ax | grep mongo | 04:06 |
babbageclunk | It's using the juju-mongo32 mongod - that reports that it's 3.2.12 | 04:08 |
menn0 | babbageclunk: hmmm same for me | 04:09 |
menn0 | how strange | 04:09 |
babbageclunk | menn0: hmm | 04:09 |
babbageclunk | git version ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14 (just in case) | 04:10 |
babbageclunk | menn0: so, that's weird. Different versions of mgo? | 04:11 |
menn0 | babbageclunk: i'm at f2b6f6c | 04:11 |
babbageclunk | menn0: I'm working on a branch off 2.2 at the moment | 04:11 |
babbageclunk | menn0: same git hash for mgo | 04:12 |
babbageclunk | curiouser and curiouser | 04:12 |
menn0 | babbageclunk: hang on... I may have been running the tests remotely on a different machine | 04:13 |
* menn0 | checks | 04:13 |
menn0 | babbageclunk: i've got it | 04:14 |
menn0 | babbageclunk: it's my "workhorse" machine | 04:14 |
babbageclunk | menn0: And what's that running? | 04:15 |
menn0 | babbageclunk: the official mongo 3.2 it would appear :/ | 04:16 |
menn0 | babbageclunk: but that's where it's failing | 04:16 |
menn0 | i'll dig some more | 04:16 |
babbageclunk | ok | 04:16 |
menn0 | babbageclunk: this machine has mongodb 3.2, 2.6 and 2.4 installed | 04:19 |
babbageclunk | menn0: Which one's the test picking? | 04:20 |
menn0 | babbageclunk: still trying to figure that out | 04:20 |
menn0 | babbageclunk: it's 2.6.10 | 04:21 |
menn0 | babbageclunk: which is odd given the test code is supposed to prefer the 3.2 installation | 04:22 |
menn0 | babbageclunk: the problem with this upgrade step will be that it could fail on trusty controllers | 04:26 |
menn0 | which still use 2.4 IIRC | 04:26 |
babbageclunk | menn0: But that's only a problem if we were upgrading controllers to 2 from 1.25 in-place, isn't it? | 04:27 |
menn0 | babbageclunk: no | 04:27 |
menn0 | babbageclunk: if you bootstrap a 2.x controller on trusty I believe it still uses mongo 2.4 | 04:27 |
babbageclunk | menn0: Ah, right | 04:28 |
babbageclunk | menn0: Wouldn't that be a pretty odd thing to do? | 04:29 |
babbageclunk | menn0: I guess we support it though. | 04:29 |
menn0 | babbageclunk: yes, unless you wanted to run some trusty-only charms on the controller | 04:29 |
thumper | menn0: so ideas on forcing a leader election in mongo without restarting juju? | 04:30 |
babbageclunk | menn0: It sounds like your fix will correct it anyway. | 04:30 |
menn0 | babbageclunk: yes it will but it's already in 2.2.1 | 04:30 |
menn0 | babbageclunk: I guess it only fails the second time the upgrade step is attempted. is that right? | 04:31 |
babbageclunk | menn0: oh, right - because the first time the logs collection is there? | 04:32 |
thumper | axw: did you say that the metrics gathering now all uses the state pool? | 04:32 |
thumper | axw: in 2.2? | 04:32 |
axw | thumper: on the 2.2 branch, yes. the second fix I was talking about landed last week | 04:32 |
thumper | axw: cool. I'll see if that fixes my issue here | 04:32 |
menn0 | thumper: connect using the mongo shell and give the current leader a priority of 0 | 04:32 |
menn0 | thumper: rs.conf() shows the current configuration | 04:33 |
thumper | I'm thinking of writing a 'juju-fetch' plugin that just does 'git fetch' | 04:34 |
menn0 | thumper: this is helpful: https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/ | 04:34 |
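Roughly, per that tutorial and menn0's suggestion, in the mongo shell (the member index 0 is a placeholder; check rs.status() to find the actual primary first):

```js
// connected to the replica set via the mongo shell
cfg = rs.conf()
cfg.members[0].priority = 0  // a priority-0 member can never be primary
rs.reconfig(cfg)             // forces an election among the remaining members
```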
thumper | menn0: thanks | 04:34 |
menn0 | babbageclunk: so the chances of hitting the problem are fairly slim but not 0. a re-run of the upgrade step could happen when any upgrade step fails and the upgrade is retried. | 04:36 |
menn0 | babbageclunk: i'll file a bug and get the fix in | 04:37 |
menn0 | babbageclunk: i've confirmed it works for 2.4 | 04:37 |
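The retried step then only has to treat "already gone" as success. A hypothetical sketch reusing the isNsNotFound helper above (the function and collection name are invented):

```go
// dropOldLogs makes the drop idempotent: a retried upgrade run must
// tolerate the collection having been dropped by the first attempt.
func dropOldLogs(db *mgo.Database) error {
	if err := db.C("logs").DropCollection(); err != nil && !isNsNotFound(err) {
		return err
	}
	return nil
}
```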
menn0 | babbageclunk: I didn't even know you had done this work to split out log collections :) | 04:38 |
babbageclunk | menn0: You mean if another upgrade step fails after this one succeeds, right? I did the state work, thumper did the upgrade step. | 04:39 |
menn0 | babbageclunk: I should have known :) | 04:41 |
menn0 | babbageclunk: just to be sure, I just confirmed that we still use 2.4 on trusty | 04:45 |
babbageclunk | menn0: nice one. | 04:47 |
menn0 | babbageclunk or thumper: https://github.com/juju/juju/pull/7588 | 04:59 |
babbageclunk | menn0: looking | 04:59 |
thumper | menn0: so... I got that logs deserialization error | 05:00 |
thumper | went and looked in the DB | 05:00 |
thumper | there is no entry with a missing TAG | 05:00 |
thumper | so... | 05:00 |
thumper | it looks like the oplog tailer is returning an empty value sometimes | 05:01 |
thumper | when it shouldn't | 05:01 |
menn0 | thumper: were you looking for docs with an empty tag, or a missing tag? | 05:01 |
thumper | ah... empty | 05:01 |
thumper | how do I say missing? | 05:01 |
thumper | nm | 05:02 |
menn0 | thumper: https://stackoverflow.com/questions/8567469/mongodb-find-a-document-by-non-existence-of-a-field | 05:02 |
thumper | db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$exists:false}}).count() | 05:03 |
thumper | 0 | 05:03 |
thumper | neither empty nor missing | 05:03 |
thumper | menn0: the error is being emitted every five minutes | 05:06 |
thumper | seems very exact | 05:06 |
menn0 | thumper: it could also be that the field is the wrong type, so that it gets deserialised to "" because the db type doesn't match the struct type | 05:06 |
menn0 | thumper: where is the error being emitted? | 05:07 |
menn0 | thumper: when dealing with broken txn metadata I've seen fields with nil values that are confusingly deserialised | 05:07 |
thumper | menn0: db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$not: {$type: "string"}}}).count() | 05:11 |
thumper | 0 | 05:11 |
thumper | not not a string either | 05:11 |
thumper | menn0: the error is emitted in the controller model, which happens to be the one I'm tailing | 05:12 |
menn0 | thumper: nice query | 05:12 |
thumper | I recall the oplog query code has a reconnect on timeout | 05:12 |
thumper | I'm wondering if that is emitting something bad by mistake | 05:13 |
thumper | it goes back to look for values of the current time slice | 05:13 |
thumper | but what if the logs are quiet | 05:13 |
thumper | and there isn't one? | 05:13 |
thumper | perhaps that's the use case | 05:13 |
menn0 | thumper: that seems likely | 05:13 |
* menn0 | looks at the code | 05:13 |
* thumper | goes back to finding his other bug | 05:14 |
menn0 | thumper: seems the default cursor timeout is 10mins | 05:14 |
thumper | this is happening every 5 minutes on the nose | 05:15 |
menn0 | thumper: i'm seeing other sources which say 5 mins | 05:16 |
menn0 | digging | 05:16 |
* thumper | is getting a bunch of unit agents connected before he forces a mongo election | 05:18 |
menn0 | thumper: so it turns out the oplog tailer uses a timeout of 1s so that it can be interrupted | 05:31 |
menn0 | thumper: but there's also the concept of cursor invalidation and we might be running into that | 05:32 |
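A simplified Go sketch of the suspected failure mode, assuming mgo's tailable-cursor API (Tail/Next/Timeout); this is not Juju's actual oplog tailer. When Next returns false, doc is left at its zero value, so forwarding it without checking Timeout/Err emits an empty document, which downstream parses as an entity with tag "":

```go
package main

import (
	"time"

	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// tailLogs tails a capped collection and forwards each new document.
func tailLogs(coll *mgo.Collection, out chan<- bson.M) {
	iter := coll.Find(nil).Tail(time.Second)
	for {
		var doc bson.M
		if iter.Next(&doc) {
			out <- doc
			continue
		}
		if iter.Timeout() {
			continue // no new data yet; emitting doc here would send an empty value
		}
		// cursor invalidated or errored: recreate it rather than emitting doc
		iter.Close()
		iter = coll.Find(nil).Tail(time.Second)
	}
}
```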
thumper | hmm.. | 05:32 |
menn0 | thumper: i've added logging and have left debug-log running | 05:32 |
thumper | I'm curious to see if you see the same error message every 5m | 05:32 |
menn0 | what does the message look like? | 05:32 |
menn0 | thumper: ^^ | 05:34 |
thumper | machine-1: 17:29:11 WARNING juju.state deserialization failed (possible DB corruption), while parsing entity tag: "" is not a valid tag | 05:49 |
thumper | I'm trying option two | 05:51 |
thumper | stopping all api server agents | 05:52 |
thumper | while leaving all unit agents running | 05:52 |
thumper | will go away for a few hours | 05:52 |
thumper | and bring them up again | 05:52 |
thumper | that seems to mirror some of the failure cases | 05:52 |
thumper | back later folks | 05:52 |
babbageclunk | thumper: 1:1? (Sorry, was slow to see the reminder) | 22:32 |
thumper | coming | 22:33 |