axw | thumper: hey, FYI I'm online today - taking tomorrow off instead | 00:44 |
---|---|---|
babbageclunk | axw: standup then? ;) | 01:02 |
axw | babbageclunk: oh sorry! I didn't know you were on today | 01:10 |
axw | babbageclunk: I'm here if you really do want to standup, but for me it's just: working on upgrades | 01:11 |
babbageclunk | axw: :) externalreality and I had a brief chat, but no need - I'm chasing a bug with my sub-sub relation thing and he's working on getting details of run actions earlier to determine whether to take the machine lock. | 01:13 |
axw | babbageclunk: okey dokey, thanks | 01:13 |
thumper | menn0: is there a way to force a mongo election between peers? | 01:31 |
menn0 | thumper: you can fiddle with the member priorities | 01:32 |
thumper | https://bugs.launchpad.net/juju/+bug/1701275 | 01:32 |
mup | Bug #1701275: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect <cpe> <cpe-sa> <juju:Triaged> <https://launchpad.net/bugs/1701275> | 01:32 |
thumper | looks like the agents disconnect during an election | 01:32 |
thumper | and don't reconnect | 01:32 |
menn0 | thumper: we've seen that occasionally. it seems like a dep engine bug to me. | 01:33 |
thumper | yeah... | 01:33 |
thumper | we saw that on the customer site too | 01:33 |
menn0 | thumper: the dep engine stops trying after a while | 01:33 |
thumper | where agents had disconnected | 01:33 |
thumper | and didn't reconnect | 01:33 |
menn0 | thumper: perhaps due to a manifold returning the wrong thing | 01:33 |
thumper | not sure... | 01:33 |
thumper | I may read through the dep engine code and see if anything leaps out | 01:33 |
menn0 | thumper: I have some old logs from jam for a system he had where it happened | 01:34 |
menn0 | on his machine | 01:34 |
thumper | hmm... | 01:34 |
menn0 | thumper: do you have the tail end of the logs for an agent that got "stuck"? | 01:39 |
thumper | not off hand | 01:40 |
thumper | they are all in warning+ mode | 01:40 |
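For context, the failure mode menn0 is describing looks like this in miniature: a restart loop that either caps its attempts or backs off without bound will leave a worker dead once the election window outlasts it. Everything below is invented for illustration; it is not the dependency engine's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// startWorker stands in for starting a manifold's worker; here it fails while
// the election is in progress and would succeed afterwards. Hypothetical.
func startWorker(attempt int) error {
	if attempt < 10 {
		return errors.New("connection refused: election in progress")
	}
	return nil
}

func main() {
	const maxAttempts = 5 // a give-up cap like this would explain "stops trying"
	delay := time.Second
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := startWorker(attempt); err != nil {
			fmt.Printf("attempt %d: %v; retrying in %v\n", attempt, err, delay)
			time.Sleep(delay)
			delay *= 2 // unbounded backoff looks the same from the outside as a cap
			continue
		}
		fmt.Println("worker running")
		return
	}
	fmt.Println("gave up; the agent stays disconnected until restarted")
}
```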
axw | thumper: 1:1? | 02:01 |
menn0 | babbageclunk: I'm seeing failures in TestSplitLogsCollection. Looks like the bit that looks for the "ns not found" error isn't working on my machine for some reason | 03:47 |
menn0 | babbageclunk: is this a known thing? | 03:47 |
babbageclunk | menn0: not by me! | 03:57 |
menn0 | babbageclunk: the QueryError Code is coming back as 0 despite the message being the expected "ns not found" | 03:58 |
babbageclunk | menn0: maybe a mongo version difference? | 03:59 |
menn0 | babbageclunk: must be, although I can see that the tests are using our officially packaged mongodb 3.2 | 03:59 |
menn0 | babbageclunk: should we be worried that this is in a released version? | 04:00 |
menn0 | (I don't know if it is) | 04:00 |
babbageclunk | It definitely is - it was done for 2.2.0, I think. | 04:01 |
babbageclunk | menn0: is it failing consistently for you? | 04:02 |
menn0 | babbageclunk: yes | 04:02 |
menn0 | babbageclunk: I have a fix which uses the error code /or/ the message | 04:02 |
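A sketch of the shape such a fix could take, assuming mgo's *mgo.QueryError type; this is not menn0's actual patch, and the helper name and the numeric code 26 (NamespaceNotFound on 3.x servers) are illustrative:

```go
package state

import (
	mgo "gopkg.in/mgo.v2"
)

// isNsNotFound reports whether err means the collection is already gone.
// Mongo 2.x returns the bare "ns not found" message with Code left at 0,
// while 3.x also sets a numeric code, so accept either.
func isNsNotFound(err error) bool {
	qerr, ok := err.(*mgo.QueryError)
	if !ok {
		return false
	}
	// 26 is NamespaceNotFound on modern servers; 2.x leaves Code at 0.
	return qerr.Code == 26 || qerr.Message == "ns not found"
}
```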
menn0 | babbageclunk: looks like this is in 2.2 | 04:03 |
babbageclunk | menn0: So is my machine using the wrong mongo version for tests? | 04:05 |
menn0 | babbageclunk: possibly. there's a number of fallbacks | 04:06 |
menn0 | babbageclunk: run the state tests and do: ps ax | grep mongo | 04:06 |
babbageclunk | It's using the juju-mongo32 mongod - that reports that it's 3.2.12 | 04:08 |
menn0 | babbageclunk: hmmm same for me | 04:09 |
menn0 | how strange | 04:09 |
babbageclunk | menn0: hmm | 04:09 |
babbageclunk | git version ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14 (just in case) | 04:10 |
babbageclunk | menn0: so, that's weird. Different versions of mgo? | 04:11 |
menn0 | babbageclunk: i'm at f2b6f6c | 04:11 |
babbageclunk | menn0: I'm working on a branch off 2.2 at the moment | 04:11 |
babbageclunk | menn0: same git hash for mgo | 04:12 |
babbageclunk | curiouser and curiouser | 04:12 |
menn0 | babbageclunk: hang on... I may have been running the tests remotely on a different machine | 04:13 |
* menn0 | checks | 04:13 |
menn0 | babbageclunk: i've got it | 04:14 |
menn0 | babbageclunk: it's my "workhorse" machine | 04:14 |
babbageclunk | menn0: And what's that running? | 04:15 |
menn0 | babbageclunk: the official mongo 3.2 it would appear :/ | 04:16 |
menn0 | babbageclunk: but that's where it's failing | 04:16 |
menn0 | i'll dig some more | 04:16 |
babbageclunk | ok | 04:16 |
menn0 | babbageclunk: this machine has mongodb 3.2, 2.6 and 2.4 installed | 04:19 |
babbageclunk | menn0: Which one's the test picking? | 04:20 |
menn0 | babbageclunk: still trying to figure that out | 04:20 |
menn0 | babbageclunk: it's 2.6.10 | 04:21 |
menn0 | babbageclunk: which is odd given the test code is supposed to prefer the 3.2 installation | 04:22 |
menn0 | babbageclunk: the problem with this upgrade step will be that it could fail on trusty controllers | 04:26 |
menn0 | which still use 2.4 IIRC | 04:26 |
babbageclunk | menn0: But that's only a problem if we were upgrading controllers to 2 from 1.25 in-place, isn't it? | 04:27 |
menn0 | babbageclunk: no | 04:27 |
menn0 | babbageclunk: if you bootstrap a 2.x controller on trusty I believe it still uses mongo 2.4 | 04:27 |
babbageclunk | menn0: Ah, right | 04:28 |
babbageclunk | menn0: Wouldn't that be a pretty odd thing to do? | 04:29 |
babbageclunk | menn0: I guess we support it though. | 04:29 |
menn0 | babbageclunk: yes, unless you wanted to run some trusty-only charms on the controller | 04:29 |
thumper | menn0: so ideas on forcing a leader election in mongo without restarting juju? | 04:30 |
babbageclunk | menn0: It sounds like your fix will correct it anyway. | 04:30 |
menn0 | babbageclunk: yes it will but it's already in 2.2.1 | 04:30 |
menn0 | babbageclunk: I guess it only fails the second time the upgrade step is attempted. is that right? | 04:31 |
babbageclunk | menn0: oh, right - because the first time the logs collection is there? | 04:32 |
thumper | axw: did you say that the metrics gathering now all uses the state pool? | 04:32 |
thumper | axw: in 2.2? | 04:32 |
axw | thumper: on the 2.2 branch, yes. the second fix I was talking about landed last week | 04:32 |
thumper | axw: cool. I'll see if that fixes my issue here | 04:32 |
menn0 | thumper: connect using the mongo shell and give the current leader a priority of 0 | 04:32 |
menn0 | thumper: rs.conf() shows the current configuration | 04:33 |
thumper | I'm thinking of writing a 'juju-fetch' plugin that just does 'git fetch' | 04:34 |
menn0 | thumper: this is helpful: https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/ | 04:34 |
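Roughly, per that tutorial and menn0's suggestion, in the mongo shell (the member index 0 is a placeholder; check rs.status() to find the actual primary first):

```js
// connected to the replica set via the mongo shell
cfg = rs.conf()
cfg.members[0].priority = 0  // a priority-0 member can never be primary
rs.reconfig(cfg)             // forces an election among the remaining members
```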
thumper | menn0: thanks | 04:34 |
menn0 | babbageclunk: so the chances of hitting the problem are fairly slim but not 0. a re-run of the upgrade step could happen when any upgrade step fails and the upgrade is retried. | 04:36 |
menn0 | babbageclunk: i'll file a bug and get the fix in | 04:37 |
menn0 | babbageclunk: i've confirmed it works for 2.4 | 04:37 |
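The retried step then only has to treat "already gone" as success. A hypothetical sketch reusing the isNsNotFound helper above (the function and collection name are invented):

```go
// dropOldLogs makes the drop idempotent: a retried upgrade run must
// tolerate the collection having been dropped by the first attempt.
func dropOldLogs(db *mgo.Database) error {
	if err := db.C("logs").DropCollection(); err != nil && !isNsNotFound(err) {
		return err
	}
	return nil
}
```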
menn0 | babbageclunk: I didn't even know you had done this work to split out log collections :) | 04:38 |
babbageclunk | menn0: You mean if another upgrade step fails after this one succeeds, right? I did the state work, thumper did the upgrade step. | 04:39 |
menn0 | babbageclunk: I should have known :) | 04:41 |
menn0 | babbageclunk: just to be sure, I just confirmed that we still use 2.4 on trusty | 04:45 |
babbageclunk | menn0: nice one. | 04:47 |
menn0 | babbageclunk or thumper: https://github.com/juju/juju/pull/7588 | 04:59 |
babbageclunk | menn0: looking | 04:59 |
thumper | menn0: so... I got that logs deserialization error | 05:00 |
thumper | went and looked in the DB | 05:00 |
thumper | there is no entry with a missing TAG | 05:00 |
thumper | so... | 05:00 |
thumper | it looks like the oplog tailer is returning an empty value sometimes | 05:01 |
thumper | when it shouldn't | 05:01 |
menn0 | thumper: were you looking for docs with an empty tag, or a missing tag? | 05:01 |
thumper | ah... empty | 05:01 |
thumper | how do I say missing? | 05:01 |
thumper | nm | 05:02 |
menn0 | thumper: https://stackoverflow.com/questions/8567469/mongodb-find-a-document-by-non-existence-of-a-field | 05:02 |
thumper | db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$exists:false}}).count() | 05:03 |
thumper | 0 | 05:03 |
thumper | neither empty nor missing | 05:03 |
thumper | menn0: the error is being emitted every five minutes | 05:06 |
thumper | seems very exact | 05:06 |
menn0 | thumper: it could also be that the field is the wrong type, so that it gets deserialised to "" because the db type doesn't match the struct type | 05:06 |
menn0 | thumper: where is the error being emitted? | 05:07 |
menn0 | thumper: when dealing with broken txn metadata I've seen fields with nil values that are confusingly deserialised | 05:07 |
thumper | menn0: db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$not: {$type: "string"}}}).count() | 05:11 |
thumper | 0 | 05:11 |
thumper | not not a string either | 05:11 |
thumper | menn0: the error is emitted in the controller model, which happens to be the one I'm tailing | 05:12 |
menn0 | thumper: nice query | 05:12 |
thumper | I recall the oplog query code has a reconnect on timeout | 05:12 |
thumper | I'm wondering if that is emitting something bad by mistake | 05:13 |
thumper | it goes back to look for values of the current time slice | 05:13 |
thumper | but what if the logs are quiet | 05:13 |
thumper | and there isn't one? | 05:13 |
thumper | perhaps that's the use case | 05:13 |
menn0 | thumper: that seems likely | 05:13 |
* menn0 | looks at the code | 05:13 |
* thumper | goes back to finding his other bug | 05:14 |
menn0 | thumper: seems the default cursor timeout is 10mins | 05:14 |
thumper | this is happening every 5 minutes on the nose | 05:15 |
menn0 | thumper: i'm seeing other sources which say 5 mins | 05:16 |
menn0 | digging | 05:16 |
* thumper | is getting a bunch of unit agents connected before he forces a mongo election | 05:18 |
menn0 | thumper: so it turns out the oplog tailer uses a timeout of 1s so that it can be interrupted | 05:31 |
menn0 | thumper: but there's also the concept of cursor invalidation and we might be running into that | 05:32 |
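A simplified Go sketch of the suspected failure mode, assuming mgo's tailable-cursor API (Tail/Next/Timeout); this is not Juju's actual oplog tailer. When Next returns false, doc is left at its zero value, so forwarding it without checking Timeout/Err emits an empty document, which downstream parses as an entity with tag "":

```go
package main

import (
	"time"

	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// tailLogs tails a capped collection and forwards each new document.
func tailLogs(coll *mgo.Collection, out chan<- bson.M) {
	iter := coll.Find(nil).Tail(time.Second)
	for {
		var doc bson.M
		if iter.Next(&doc) {
			out <- doc
			continue
		}
		if iter.Timeout() {
			continue // no new data yet; emitting doc here would send an empty value
		}
		// cursor invalidated or errored: recreate it rather than emitting doc
		iter.Close()
		iter = coll.Find(nil).Tail(time.Second)
	}
}
```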
thumper | hmm.. | 05:32 |
menn0 | thumper: i've added logging and have left debug-log running | 05:32 |
thumper | I'm curious to see if you see the same error message every 5m | 05:32 |
menn0 | what does the message look like? | 05:32 |
menn0 | thumper: ^^ | 05:34 |
thumper | machine-1: 17:29:11 WARNING juju.state deserialization failed (possible DB corruption), while parsing entity tag: "" is not a valid tag | 05:49 |
thumper | I'm trying option two | 05:51 |
thumper | stopping all api server agents | 05:52 |
thumper | while leaving all unit agents running | 05:52 |
thumper | will go away for a few hours | 05:52 |
thumper | and bring them up again | 05:52 |
thumper | that seems to mirror some of the failure cases | 05:52 |
thumper | back later folks | 05:52 |
babbageclunk | thumper: 1:1? (Sorry, was slow to see the reminder) | 22:32 |
thumper | coming | 22:33 |