[00:44] thumper: hey, FYI I'm online today - taking tomorrow off instead
[01:02] axw: standup then? ;)
[01:10] babbageclunk: oh sorry! I didn't know you were on today
[01:11] babbageclunk: I'm here if you really do want to standup, but for me it's just: working on upgrades
[01:13] axw: :) externalreality and I had a brief chat, but no need - I'm chasing a bug with my sub-sub relation thing and he's working on getting details of run actions earlier to determine whether to take the machine lock.
[01:13] babbageclunk: okey dokey, thanks
[01:31] menn0: is there a way to force a mongo election between peers?
[01:32] thumper: you can fiddle with the member priorities
[01:32] https://bugs.launchpad.net/juju/+bug/1701275
[01:32] Bug #1701275: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect
[01:32] looks like the agents disconnect during an election
[01:32] and don't reconnect
[01:33] thumper: we've seen that occasionally. it seems like a dep engine bug to me.
[01:33] yeah...
[01:33] we saw that on the customer site too
[01:33] thumper: the dep engine stops trying after a while
[01:33] where agents had disconnected
[01:33] and didn't reconnect
[01:33] thumper: perhaps due to a manifold returning the wrong thing
[01:33] not sure...
[01:33] I may read through the dep engine code and see if anything leaps out
[01:34] thumper: I have some old logs from jam for a system he had where it happened
[01:34] on his machine
[01:34] hmm...
[01:39] thumper: do you have the tail end of the logs for an agent that got "stuck"?
[01:40] not off hand
[01:40] they are all in warning+ mode
[02:01] thumper: 1:1?
[03:47] babbageclunk: I'm seeing failures in TestSplitLogsCollection. Looks like the bit that looks for the "ns not found" error isn't working on my machine for some reason
[03:47] babbageclunk: is this a known thing?
[03:57] menn0: not by me!
[03:58] babbageclunk: the QueryError Code is coming back as 0 despite the message being the expected "ns not found"
[03:59] menn0: maybe a mongo version difference?
[03:59] babbageclunk: must be, although I can see that the tests are using our officially packaged mongodb 3.2 package
[04:00] babbageclunk: should we be worried that this is in a released version?
[04:00] (I don't know if it is)
[04:01] It definitely is - it was done for 2.2.0, I think.
[04:02] menn0: is it failing consistently for you?
[04:02] babbageclunk: yes
[04:02] babbageclunk: I have a fix which uses the error code /or/ the message
[04:03] babbageclunk: looks like this is in 2.2
[04:05] menn0: So is my machine using the wrong mongo version for tests?
[04:06] babbageclunk: possibly. there are a number of fallbacks
[04:06] babbageclunk: run the state tests and do: ps ax | grep mongo
[04:08] It's using the juju-mongo32 mongod - that reports that it's 3.2.12
[04:09] babbageclunk: hmmm same for me
[04:09] how strange
[04:09] menn0: hmm
[04:10] git version ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14 (just in case)
[04:11] menn0: so, that's weird. Different versions of mgo?
[04:11] babbageclunk: i'm at f2b6f6c
[04:11] menn0: I'm working on a branch off 2.2 at the moment
[04:12] menn0: same git hash for mgo
[04:12] curiouser and curiouser
[04:13] babbageclunk: hang on... I may have been running the tests remotely on a different machine
[04:13] * menn0 checks
[04:14] babbageclunk: i've got it
[04:14] babbageclunk: it's my "workhorse" machine
[04:15] menn0: And what's that running?
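The fix described at [04:02] (accept the "ns not found" error by code /or/ by message, since older mongod versions report the code as 0) could look roughly like the sketch below. This is a minimal illustration against the mgo driver, not the actual Juju patch; the helper names are hypothetical and the specific code value (26, NamespaceNotFound) is an assumption.

    package logsupgrade // hypothetical package name, for illustration only

    import (
        "strings"

        "gopkg.in/mgo.v2"
    )

    // isNamespaceNotFound reports whether err means "the collection does
    // not exist". Newer servers set a non-zero QueryError code (assumed
    // here to be 26, NamespaceNotFound), but the 2.4/2.6 servers discussed
    // above leave the code as 0, so we also match the message text.
    func isNamespaceNotFound(err error) bool {
        if err == nil {
            return false
        }
        if qerr, ok := err.(*mgo.QueryError); ok && qerr.Code == 26 {
            return true
        }
        return strings.Contains(err.Error(), "ns not found")
    }

    // dropIfExists drops coll, tolerating the case where the collection
    // was never created (e.g. the first time the upgrade step runs).
    func dropIfExists(coll *mgo.Collection) error {
        if err := coll.DropCollection(); err != nil && !isNamespaceNotFound(err) {
            return err
        }
        return nil
    }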
[04:16] babbageclunk: the official mongo 3.2 it would appear :/
[04:16] babbageclunk: but that's where it's failing
[04:16] i'll dig some more
[04:16] ok
[04:19] babbageclunk: this machine has mongodb 3.2, 2.6 and 2.4 installed
[04:20] menn0: Which one's the test picking?
[04:20] babbageclunk: still trying to figure that out
[04:21] babbageclunk: it's 2.6.10
[04:22] babbageclunk: which is odd given the test code is supposed to prefer the 3.2 installation
[04:26] babbageclunk: the problem with this upgrade step will be that it could fail on trusty controllers
[04:26] which still use 2.4 IIRC
[04:27] menn0: But that's only a problem if we were upgrading controllers to 2 from 1.25 in-place, isn't it?
[04:27] babbageclunk: no
[04:27] babbageclunk: if you bootstrap a 2.x controller on trusty I believe it still uses mongo 2.4
[04:28] menn0: Ah, right
[04:29] menn0: Wouldn't that be a pretty odd thing to do?
[04:29] menn0: I guess we support it though.
[04:29] babbageclunk: yes, unless you wanted to run some trusty-only charms on the controller
[04:30] menn0: so, any ideas on forcing a leader election in mongo without restarting juju?
[04:30] menn0: It sounds like your fix will correct it anyway.
[04:30] babbageclunk: yes it will, but it's already in 2.2.1
[04:31] babbageclunk: I guess it only fails the second time the upgrade step is attempted. is that right?
[04:32] menn0: oh, right - because the first time the logs collection is there?
[04:32] axw: did you say that the metrics gathering now all uses the state pool?
[04:32] axw: in 2.2?
[04:32] thumper: on the 2.2 branch, yes. the second fix I was talking about landed last week
[04:32] axw: cool. I'll see if that fixes my issue here
[04:32] thumper: connect using the mongo shell and give the current leader a priority of 0
[04:33] thumper: rs.conf() shows the current configuration
[04:34] I'm thinking of writing a 'juju-fetch' plugin that just does 'git fetch'
[04:34] thumper: this is helpful: https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/
[04:34] menn0: thanks
[04:36] babbageclunk: so the chances of hitting the problem are fairly slim but not 0. a re-run of the upgrade step could happen when any upgrade step fails and the upgrade is retried.
[04:37] babbageclunk: i'll file a bug and get the fix in
[04:37] babbageclunk: i've confirmed it works for 2.4
[04:38] babbageclunk: I didn't even know you had done this work to split out log collections :)
[04:39] menn0: You mean if another upgrade step fails after this one succeeds, right? I did the state work, thumper did the upgrade step.
[04:41] babbageclunk: I should have known :)
[04:45] babbageclunk: just to be sure, I just confirmed that we still use 2.4 on trusty
[04:47] menn0: nice one.
[04:59] babbageclunk or thumper: https://github.com/juju/juju/pull/7588
[04:59] menn0: looking
[05:00] menn0: so... I got that logs deserialization error
[05:00] went and looked in the DB
[05:00] there is no entry with a missing tag
[05:00] so...
[05:01] it looks like the oplog tailer is returning an empty value sometimes
[05:01] when it shouldn't
[05:01] thumper: were you looking for docs with an empty tag, or a missing tag?
[05:01] ah... empty
[05:01] how do I say missing?
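The approach suggested at [04:32]-[04:34] is to lower the current primary's priority from the mongo shell via rs.conf()/rs.reconfig() (the linked tutorial walks through that). A quicker alternative that also triggers an election, sketched below in Go against mgo, is the replSetStepDown admin command. The dial string is a placeholder: juju's mongod listens on 37017 and needs TLS plus admin credentials, which are not shown here.

    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Placeholder address; adjust for juju's mongod (port 37017,
        // TLS, admin credentials).
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Ask the current primary to step down for 60 seconds so the
        // remaining members hold an election.
        var res bson.M
        err = session.DB("admin").Run(bson.D{{Name: "replSetStepDown", Value: 60}}, &res)
        if err != nil {
            // The primary drops client connections while stepping down,
            // so an i/o error here usually means the command took effect.
            log.Printf("replSetStepDown returned: %v", err)
        }
    }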
[05:02] nm
[05:02] thumper: https://stackoverflow.com/questions/8567469/mongodb-find-a-document-by-non-existence-of-a-field
[05:03] db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$exists:false}}).count()
[05:03] 0
[05:03] neither empty nor missing
[05:06] menn0: the error is being emitted every five minutes
[05:06] seems very exact
[05:06] thumper: it could also be that the field is the wrong type, so that it gets deserialised to "" because the db type doesn't match the struct type
[05:07] thumper: where is the error being emitted?
[05:07] thumper: when dealing with broken txn metadata I've seen fields with nil values that are confusingly deserialised
[05:11] menn0: db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$not: {$type: "string"}}}).count()
[05:11] 0
[05:11] not not a string either
[05:12] menn0: the error is emitted in the controller model, which happens to be the one I'm tailing
[05:12] thumper: nice query
[05:12] I recall the oplog query code has a reconnect on timeout
[05:13] I'm wondering if that is emitting something bad by mistake
[05:13] it goes back to look for values of the current time slice
[05:13] but what if the logs are quiet
[05:13] and there isn't one?
[05:13] perhaps that's the use case
[05:13] thumper: that seems likely
[05:13] * menn0 looks at the code
[05:14] * thumper goes back to finding his other bug
[05:14] thumper: seems the default cursor timeout is 10 mins
[05:15] this is happening every 5 minutes on the nose
[05:16] thumper: i'm seeing other sources which say 5 mins
[05:16] digging
[05:18] * thumper is getting a bunch of unit agents connected before he forces a mongo election
[05:31] thumper: so it turns out the oplog tailer uses a timeout of 1s so that it can be interrupted
[05:32] thumper: but there's also the concept of cursor invalidation and we might be running into that
[05:32] hmm..
[05:32] thumper: i've added logging and have left debug-log running
[05:32] I'm curious to see if you see the same error message every 5m
[05:32] what does the message look like?
[05:34] thumper: ^^
[05:49] machine-1: 17:29:11 WARNING juju.state deserialization failed (possible DB corruption), while parsing entity tag: "" is not a valid tag
[05:51] I'm trying option two
[05:52] stopping all api server agents
[05:52] while leaving all unit agents running
[05:52] will go away for a few hours
[05:52] and bring them up again
[05:52] that seems to mirror some of the failure cases
[05:52] back later folks
=== frankban|afk is now known as frankban
=== frankban is now known as frankban|afk
[22:32] thumper: 1:1? (Sorry, was slow to see the reminder)
[22:33] coming
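The oplog tailer behaviour being reasoned about above (a 1s timeout so the tailer can be interrupted, plus handling cursor invalidation) follows the usual mgo tailable-cursor pattern. The sketch below is an illustration of that pattern under assumed names and a simplified restart policy, not Juju's actual tailer; the point relevant to the bug is that a timed-out poll should yield nothing rather than an empty document.

    package oplog // hypothetical package, for illustration only

    import (
        "log"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // tail follows a capped collection, sending each new document to out.
    // A short (1s) tail timeout keeps the loop responsive to stop; when
    // the timeout fires with no data we simply poll again, and if the
    // cursor is invalidated we open a fresh one.
    func tail(coll *mgo.Collection, stop <-chan struct{}, out chan<- bson.M) {
        iter := coll.Find(nil).Tail(time.Second)
        defer func() { iter.Close() }()
        for {
            var doc bson.M
            if iter.Next(&doc) {
                out <- doc
                continue
            }
            select {
            case <-stop:
                return
            default:
            }
            if iter.Timeout() {
                // Nothing arrived within the 1s window; go around again
                // rather than passing an empty document downstream.
                continue
            }
            if err := iter.Err(); err != nil {
                // Cursor invalidated (or some other error): restart the
                // tail. A real tailer would resume from the last
                // timestamp it saw instead of from "now".
                log.Printf("restarting oplog tail: %v", err)
                iter.Close()
                iter = coll.Find(nil).Tail(time.Second)
            }
        }
    }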