[00:44] <axw> thumper: hey, FYI I'm online today - taking tomorrow off instead
[01:02] <babbageclunk> axw: standup then? ;)
[01:10] <axw> babbageclunk: oh sorry! I didn't know you were on today
[01:11] <axw> babbageclunk: I'm here if you really do want to standup, but for me it's just: working on upgrades
[01:13] <babbageclunk> axw: :) externalreality and I had a brief chat, but no need - I'm chasing a bug with my sub-sub relation thing and he's working on getting details of run actions earlier to determine whether to take the machine lock.
[01:13] <axw> babbageclunk: okey dokey, thanks
[01:31] <thumper> menn0: is there a way to force a mongo election between peers?
[01:32] <menn0> thumper: you can fiddle with the member priorities
[01:32] <thumper>  https://bugs.launchpad.net/juju/+bug/1701275
[01:32] <mup> Bug #1701275: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect <cpe> <cpe-sa> <juju:Triaged> <https://launchpad.net/bugs/1701275>
[01:32] <thumper> looks like the agents disconnect during an election
[01:32] <thumper> and don't reconnect
[01:33] <menn0> thumper: we've seen that occasionally. it seems like a dep engine bug to me.
[01:33] <thumper> yeah...
[01:33] <thumper> we saw that on the customer site too
[01:33] <menn0> thumper: the dep engine stops trying after a while
[01:33] <thumper> where agents had disconnected
[01:33] <thumper> and didn't reconnect
[01:33] <menn0> thumper: perhaps due to a manifold returning the wrong thing
[01:33] <thumper> not sure...
[01:33] <thumper> I may read through the dep engine code and see if anything leaps out
[01:34] <menn0> thumper: I have some old logs from jam for a system he had where it happened
[01:34] <menn0> on his machine
[01:34] <thumper> hmm...
[01:39] <menn0> thumper: do you have the tail end of the logs for an agent that got "stuck"
[01:39] <menn0> ?
[01:40] <thumper> not off hand
[01:40] <thumper> they are all in warning+ mode
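(An aside on menn0's manifold theory above: in juju's dependency engine, whether a failed worker gets retried depends on what its manifold's StartFunc returns. A speculative sketch only, with API names from the 2.x worker/dependency package recalled from memory and newWorker purely hypothetical:)

```go
// Speculative sketch: the dependency engine restarts a worker whose
// StartFunc returns an ordinary error, but treats
// dependency.ErrUninstall as "never run this again" -- one way an
// agent could end up disconnected with no retry.
func connectionManifold() dependency.Manifold {
	return dependency.Manifold{
		Inputs: []string{"api-caller"},
		Start: func(context dependency.Context) (worker.Worker, error) {
			var apiCaller base.APICaller
			if err := context.Get("api-caller", &apiCaller); err != nil {
				return nil, err // e.g. dependency.ErrMissing: engine retries later
			}
			// Returning dependency.ErrUninstall here instead would
			// stop the engine from ever starting this worker again.
			return newWorker(apiCaller) // hypothetical constructor
		},
	}
}
```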
[02:01] <axw> thumper: 1:1?
[03:47] <menn0> babbageclunk: I'm seeing failures in TestSplitLogsCollection. Looks like the bit that looks for the "ns not found" error isn't working on my machine for some reason
[03:47] <menn0> babbageclunk: is this a known thing?
[03:57] <babbageclunk> menn0: not by me!
[03:58] <menn0> babbageclunk: the QueryError Code is coming back as 0 despite the message being the expected "ns not found"
[03:59] <babbageclunk> menn0: maybe a mongo version difference?
[03:59] <menn0> babbageclunk: must be, although I can see that the tests are using our officially packaged mongodb 3.2
[04:00] <menn0> babbageclunk: should we be worried that this is in a released version?
[04:00] <menn0> (I don't know if it is)
[04:01] <babbageclunk> It definitely is - it was done for 2.2.0, I think.
[04:02] <babbageclunk> menn0: is it failing consistently for you?
[04:02] <menn0> babbageclunk: yes
[04:02] <menn0> babbageclunk: I have a fix which uses the error code /or/ the message
[04:03] <menn0> babbageclunk: looks like this is in 2.2
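(For context, a minimal sketch of the kind of check menn0 describes, assuming the error surfaces as an *mgo.QueryError from gopkg.in/mgo.v2; this is not necessarily the code in his fix. MongoDB 3.x reports code 26, NamespaceNotFound, for a missing collection, while the 2.4/2.6 servers discussed here leave the code at 0 and only set the message:)

```go
// isNsNotFound reports whether err means "that collection doesn't
// exist". Older mongods (2.4/2.6) return code 0 for this error, so
// the message is checked as a fallback.
func isNsNotFound(err error) bool {
	if qerr, ok := err.(*mgo.QueryError); ok {
		return qerr.Code == 26 || qerr.Message == "ns not found"
	}
	return false
}
```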
[04:05] <babbageclunk> menn0: So is my machine using the wrong mongo version for tests?
[04:06] <menn0> babbageclunk: possibly. there are a number of fallbacks
[04:06] <menn0> babbageclunk: run the state tests and do: ps ax | grep mongo
[04:08] <babbageclunk> It's using the juju-mongo32 mongod - that reports that it's 3.2.12
[04:09] <menn0> babbageclunk: hmmm same for me
[04:09] <menn0> how strange
[04:09] <babbageclunk> menn0: hmm
[04:10] <babbageclunk> git version ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14 (just in case)
[04:11] <babbageclunk> menn0: so, that's weird. Different versions of mgo?
[04:11] <menn0> babbageclunk: i'm at f2b6f6c
[04:11] <babbageclunk> menn0: I'm working on a branch off 2.2 at the moment
[04:12] <babbageclunk> menn0: same git hash for mgo
[04:12] <babbageclunk> curiouser and curiouser
[04:13] <menn0> babbageclunk: hang on... I may have been running the tests remotely on a different machine
[04:13]  * menn0 checks
[04:14] <menn0> babbageclunk: i've got it
[04:14] <menn0> babbageclunk: it's my "workhorse" machine
[04:15] <babbageclunk> menn0: And what's that running?
[04:16] <menn0> babbageclunk: the official mongo 3.2 it would appear :/
[04:16] <menn0> babbageclunk: but that's where it's failing
[04:16] <menn0> i'll dig some more
[04:16] <babbageclunk> ok
[04:19] <menn0> babbageclunk: this machine has mongodb 3.2, 2.6 and 2.4 installed
[04:20] <babbageclunk> menn0: Which one's the test picking?
[04:20] <menn0> babbageclunk: still trying to figure that out
[04:21] <menn0> babbageclunk: it's 2.6.10
[04:22] <menn0> babbageclunk: which is odd given the test code is supposed to prefer the 3.2 installation
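(For reference, the selection the test helpers do is roughly the following; the real logic lives in github.com/juju/testing's mgo helpers, and the exact paths here are assumptions:)

```go
package mongotest

import (
	"os"
	"os/exec"
)

// findMongod sketches the assumed fallback order for picking a
// mongod: an explicit override, then the juju-packaged 3.2 install,
// then the older juju package, then whatever is on $PATH.
func findMongod() (string, error) {
	if path := os.Getenv("JUJU_MONGOD"); path != "" {
		return path, nil
	}
	for _, path := range []string{
		"/usr/lib/juju/mongo3.2/bin/mongod", // juju-mongodb3.2 package
		"/usr/lib/juju/bin/mongod",          // older juju-mongodb package
	} {
		if _, err := os.Stat(path); err == nil {
			return path, nil
		}
	}
	return exec.LookPath("mongod") // system mongod as a last resort
}
```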
[04:26] <menn0> babbageclunk: the problem with this upgrade step will be that it could fail on trusty controllers
[04:26] <menn0> which still use 2.4 IIRC
[04:27] <babbageclunk> menn0: But that's only a problem if we were upgrading controllers to 2 from 1.25 in-place, isn't it?
[04:27] <menn0> babbageclunk: no
[04:27] <menn0> babbageclunk: if you bootstrap a 2.x controller on trusty I believe it still uses mongo 2.4
[04:28] <babbageclunk> menn0: Ah, right
[04:29] <babbageclunk> menn0: Wouldn't that be a pretty odd thing to do?
[04:29] <babbageclunk> menn0: I guess we support it though.
[04:29] <menn0> babbageclunk: yes, unless you wanted to run some trusty-only charms on the controller
[04:30] <thumper> menn0: so ideas on forcing a leader election in mongo without restarting juju?
[04:30] <babbageclunk> menn0: It sounds like your fix will correct it anyway.
[04:30] <menn0> babbageclunk: yes it will but it's already in 2.2.1
[04:31] <menn0> babbageclunk: I guess it only fails the second time the upgrade step is attempted. is that right?
[04:32] <babbageclunk> menn0: oh, right - because the first time the logs collection is there?
[04:32] <thumper> axw: did you say that the metrics gathering now all uses the state pool?
[04:32] <thumper> axw: in 2.2?
[04:32] <axw> thumper: on the 2.2 branch, yes. the second fix I was talking about landed last week
[04:32] <thumper> axw: cool. I'll see if that fixes my issue here
[04:32] <menn0> thumper: connect using the mongo shell and give the current leader a priority of 0
[04:33] <menn0> thumper: rs.conf() shows the current configuration
[04:34] <thumper> I'm thinking of writing a 'juju-fetch' plugin that just does 'git fetch'
[04:34] <menn0> thumper: this is helpful: https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/
[04:34] <thumper> menn0: thanks
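(A rough Go translation of those steps, using gopkg.in/mgo.v2 since that's what juju uses. This mirrors menn0's demote-the-primary suggestion; it assumes MongoDB 3.0+ for replSetGetConfig, and the address is a placeholder:)

```go
package main

import (
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	session, err := mgo.Dial("localhost:37017") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Find the current primary (state 1 == PRIMARY).
	var status struct {
		Members []struct {
			Name  string `bson:"name"`
			State int    `bson:"state"`
		} `bson:"members"`
	}
	if err := session.Run("replSetGetStatus", &status); err != nil {
		log.Fatal(err)
	}
	var primary string
	for _, m := range status.Members {
		if m.State == 1 {
			primary = m.Name
		}
	}

	// Zero the primary's priority, bump the config version, and
	// reconfigure; the replica set then elects a new primary.
	var res struct {
		Config bson.M `bson:"config"`
	}
	if err := session.Run("replSetGetConfig", &res); err != nil {
		log.Fatal(err)
	}
	cfg := res.Config
	cfg["version"] = cfg["version"].(int) + 1
	for _, m := range cfg["members"].([]interface{}) {
		member := m.(bson.M)
		if member["host"] == primary {
			member["priority"] = 0
		}
	}
	if err := session.Run(bson.D{{Name: "replSetReconfig", Value: cfg}}, nil); err != nil {
		log.Fatal(err)
	}
}
```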
[04:36] <menn0> babbageclunk: so the chances of hitting the problem are fairly slim but not 0. a re-run of the upgrade step could happen when any upgrade step fails and the upgrade is retried.
[04:37] <menn0> babbageclunk: i'll file a bug and get the fix in
[04:37] <menn0> babbageclunk: i've confirmed it works for 2.4
[04:38] <menn0> babbageclunk: I didn't even know you had done this work to split out log collections :)
[04:39] <babbageclunk> menn0: You mean if another upgrade step fails after this one succeeds, right? I did the state work, thumper did the upgrade step.
[04:41] <menn0> babbageclunk: I should have known :)
[04:45] <menn0> babbageclunk: just to be sure, I just confirmed that we still use 2.4 on trusty
[04:47] <babbageclunk> menn0: nice one.
[04:59] <menn0> babbageclunk or thumper: https://github.com/juju/juju/pull/7588
[04:59] <babbageclunk> menn0: looking
[05:00] <thumper> menn0: so... I got that logs deserialization error
[05:00] <thumper> went and looked in the DB
[05:00] <thumper> there is no entry with a missing TAG
[05:00] <thumper> so...
[05:01] <thumper> it looks like the oplog tailer is returning an empty value sometimes
[05:01] <thumper> when it shouldn't
[05:01] <menn0> thumper: were you looking for docs with an empty tag, or a missing tag?
[05:01] <thumper> ah... empty
[05:01] <thumper> how do I say missing?
[05:02] <thumper> nm
[05:02] <menn0> thumper: https://stackoverflow.com/questions/8567469/mongodb-find-a-document-by-non-existence-of-a-field
[05:03] <thumper> db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$exists:false}}).count()
[05:03] <thumper> 0
[05:03] <thumper> neither empty nor missing
[05:06] <thumper> menn0: the error is being emitted every five minutes
[05:06] <thumper> seems very exact
[05:06] <menn0> thumper: it could also be that the field is the wrong type, so that it gets deserialised to "" because the db type doesn't match the struct type
[05:07] <menn0> thumper: where is the error being emitted?
[05:07] <menn0> thumper: when dealing with broken txn metadata I've seen fields with nil values that are confusingly deserialised
[05:11] <thumper> menn0: db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$not: {$type: "string"}}}).count()
[05:11] <thumper> 0
[05:11] <thumper> not not a string either
[05:12] <thumper> menn0: the error is emitted in the controller model, which happens to be the one I'm tailing
[05:12] <menn0> thumper: nice query
[05:12] <thumper> I recall the oplog query code has a reconnect on timeout
[05:13] <thumper> I'm wondering if that is emitting something bad by mistake
[05:13] <thumper> it goes back to look for values of the current time slice
[05:13] <thumper> but what if the logs are quiet
[05:13] <thumper> and there isn't one?
[05:13] <thumper> perhaps that's the use case
[05:13] <menn0> thumper: that seems likely
[05:13]  * menn0 looks at the code
[05:14]  * thumper goes back to finding his other bug
[05:14] <menn0> thumper: seems the default cursor timeout is 10mins
[05:15] <thumper> this is happening every 5 minutes on the nose
[05:16] <menn0> thumper: i'm seeing other sources which say 5 mins
[05:16] <menn0> digging
[05:18]  * thumper is getting a bunch of unit agents connected before he forces a mongo election
[05:31] <menn0> thumper: so it turns out the oplog tailer uses a timeout of 1s so that it can be interrupted
[05:32] <menn0> thumper: but there's also the concept of cursor invalidation and we might be running into that
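(Roughly the tailing pattern under discussion, as a hedged sketch rather than juju's actual tailer. In mgo, Tail takes the timeout and Iter.Timeout() distinguishes "no data yet" from a dead or invalidated cursor:)

```go
// tailOplog sketches the pattern menn0 describes: a tailable cursor
// over the oplog with a 1s timeout so the loop stays interruptible,
// rebuilt from the last seen timestamp when the cursor dies.
func tailOplog(oplog *mgo.Collection, out chan<- bson.M, stop <-chan struct{}) {
	var lastTS bson.MongoTimestamp
	newIter := func() *mgo.Iter {
		return oplog.Find(bson.M{"ts": bson.M{"$gt": lastTS}}).
			LogReplay().Tail(time.Second)
	}
	iter := newIter()
	for {
		for {
			var doc bson.M // fresh value each time, so nothing stale leaks through
			if !iter.Next(&doc) {
				break
			}
			lastTS = doc["ts"].(bson.MongoTimestamp)
			out <- doc
		}
		select {
		case <-stop:
			iter.Close()
			return
		default:
		}
		if !iter.Timeout() {
			// Not a plain timeout: the cursor was invalidated or
			// errored, so rebuild it from lastTS.
			iter.Close()
			iter = newIter()
		}
	}
}
```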
[05:32] <thumper> hmm..
[05:32] <menn0> thumper: i've added logging and have left debug-log running
[05:32] <thumper> I'm curious to see if you see the same error message every 5m
[05:32] <menn0> what does the message look like?
[05:34] <menn0> thumper: ^^
[05:49] <thumper> machine-1: 17:29:11 WARNING juju.state deserialization failed (possible DB corruption), while parsing entity tag: "" is not a valid tag
[05:51] <thumper> I'm trying option two
[05:52] <thumper> stopping all api server agents
[05:52] <thumper> while leaving all unit agents running
[05:52] <thumper> will go away for a few hours
[05:52] <thumper> and bring them up again
[05:52] <thumper> that seems to mirror some of the failure cases
[05:52] <thumper> back later folks
[22:32] <babbageclunk> thumper: 1:1? (Sorry, was slow to see the reminder)
[22:33] <thumper> coming