/srv/irclogs.ubuntu.com/2017/07/03/#juju-dev.txt

axwthumper: hey, FYI I'm online today - taking tomorrow off instead00:44
babbageclunkaxw: standup then? ;)01:02
axwbabbageclunk: oh sorry! I didn't know you were on today01:10
axwbabbageclunk: I'm here if you really do want to standup, but for me it's just: working on upgrades01:11
babbageclunkaxw: :) externalreality and I had a brief chat, but no need - I'm chasing a bug with my sub-sub relation thing and he's working on getting details of run actions earlier to determine whether to take the machine lock.01:13
axwbabbageclunk: okey dokey, thanks01:13
thumpermenn0: is there a way to force a mongo election between peers?01:31
menn0thumper: you can, by fiddling with the member priorities01:32
thumper https://bugs.launchpad.net/juju/+bug/170127501:32
mupBug #1701275: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect <cpe> <cpe-sa> <juju:Triaged> <https://launchpad.net/bugs/1701275>01:32
thumperlooks like the agents disconnect during an election01:32
thumperand don't reconnect01:32
menn0thumper: we've seen that occasionally. it seems like a dep engine bug to me.01:33
thumperyeah...01:33
thumperwe saw that on the customer site too01:33
menn0thumper: the dep engine stops trying after a while01:33
thumperwhere agents had disconnected01:33
thumperand didn't reconnect01:33
menn0thumper: perhaps due to manifold returning the wrong thing01:33
thumpernot sure...01:33
thumperI may read through the dep engine code and see if anything leaps out01:33
menn0thumper: I have some old logs from jam for a system he had where it happened01:34
menn0on his machine01:34
thumperhmm...01:34
menn0thumper: do you have the tail end of the logs for an agent that got "stuck"01:39
menn0?01:39
thumpernot off hand01:40
thumperthey are all in warning+ mode01:40
axwthumper: 1:1?02:01
menn0babbageclunk: I'm seeing failures in TestSplitLogsCollection. Looks like the bit that looks for the "ns not found" error isn't working on my machine for some reason03:47
menn0babbageclunk: is this a known thing?03:47
babbageclunkmenn0: not by me!03:57
menn0babbageclunk: the QueryError Code is coming back as 0 despite the message being the expected "ns not found"03:58
babbageclunkmenn0: maybe a mongo version difference?03:59
menn0babbageclunk: must be, although I can see that the tests are using our officially packaged mongodb 3.2 package03:59
menn0babbageclunk: should we be worried that this is in a released version?04:00
menn0(I don't know if it is)04:00
babbageclunkIt definitely is - it was done for 2.2.0, I think.04:01
babbageclunkmenn0: is it failing consistently for you?04:02
menn0babbageclunk: yes04:02
menn0babbageclunk: I have a fix which uses the error code /or/ the message04:02
menn0babbageclunk: looks like this is in 2.204:03
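[Editor's note] A minimal sketch of the kind of check menn0 describes, accepting either the error code or the message. It is not the actual Juju fix. The `queryError` type is a hypothetical stand-in carrying only the `Code` and `Message` fields mentioned in the chat, and `isNamespaceNotFound` is an illustrative name. The value 26 is MongoDB's NamespaceNotFound error code; per the chat, some servers return the message with a code of 0.

```go
package main

import (
	"fmt"
	"strings"
)

// queryError is a hypothetical stand-in for the driver's query error type,
// reduced to the two fields discussed in the chat.
type queryError struct {
	Code    int
	Message string
}

func (e *queryError) Error() string { return e.Message }

// codeNamespaceNotFound is MongoDB's NamespaceNotFound error code.
const codeNamespaceNotFound = 26

// isNamespaceNotFound reports whether err means the collection does not
// exist. It accepts the error code OR the message text, because the code
// can come back as 0 even when the message is "ns not found".
func isNamespaceNotFound(err error) bool {
	qerr, ok := err.(*queryError)
	if !ok {
		return false
	}
	return qerr.Code == codeNamespaceNotFound ||
		strings.Contains(qerr.Message, "ns not found")
}

func main() {
	withoutCode := &queryError{Code: 0, Message: "ns not found"}
	withCode := &queryError{Code: codeNamespaceNotFound, Message: "ns not found"}
	unrelated := &queryError{Code: 11000, Message: "duplicate key"}

	fmt.Println(isNamespaceNotFound(withoutCode))
	fmt.Println(isNamespaceNotFound(withCode))
	fmt.Println(isNamespaceNotFound(unrelated))
}
```

Treating "collection already gone" as success is what would let a step that drops or renames a collection be re-run safely when an upgrade is retried.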
babbageclunkmenn0: So is my machine using the wrong mongo version for tests?04:05
menn0babbageclunk: possibly. there's a number of fallbacks04:06
menn0babbageclunk: run the state tests and do: ps ax | grep mongo04:06
babbageclunkIt's using the juju-mongo32 mongod - that reports that it's 3.2.1204:08
menn0babbageclunk: hmmm same for me04:09
menn0how strange04:09
babbageclunkmenn0: hmm04:09
babbageclunkgit version ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14 (just in case)04:10
babbageclunkmenn0: so, that's weird. Different versions of mgo?04:11
menn0babbageclunk: i'm at f2b6f6c04:11
babbageclunkmenn0: I'm working on a branch off 2.2 at the moment04:11
babbageclunkmenn0: same git hash for mgo04:12
babbageclunkcuriouser and curiouser04:12
menn0babbageclunk: hang on... I may have been running the tests remotely on a different machine04:13
* menn0 checks04:13
menn0babbageclunk: i've got it04:14
menn0babbageclunk: it's my "workhorse" machine04:14
babbageclunkmenn0: And what's that running?04:15
menn0babbageclunk: the official mongo 3.2 it would appear :/04:16
menn0babbageclunk: but that's where it's failing04:16
menn0i'll dig some more04:16
babbageclunkok04:16
menn0babbageclunk: this machine has mongodb 3.2, 2.6 and 2.4 installed04:19
babbageclunkmenn0: Which one's the test picking?04:20
menn0babbageclunk: still trying to figure that out04:20
menn0babbageclunk: it's 2.6.1004:21
menn0babbageclunk: which is odd given the test code is supposed to prefer the 3.2 installation04:22
menn0babbageclunk: the problem with this upgrade step will be that it could fail on trusty controllers04:26
menn0which still use 2.4 IIRC04:26
babbageclunkmenn0: But that's only a problem if we were upgrading controllers to 2 from 1.25 in-place, isn't it?04:27
menn0babbageclunk: no04:27
menn0babbageclunk: if you bootstrap a 2.x controller on trusty I believe it still uses mongo 2.404:27
babbageclunkmenn0: Ah, right04:28
babbageclunkmenn0: Wouldn't that be a pretty odd thing to do?04:29
babbageclunkmenn0: I guess we support it though.04:29
menn0babbageclunk: yes, unless you wanted to run some trusty only charms on the controller04:29
thumpermenn0: so ideas on forcing a leader election in mongo without restarting juju?04:30
babbageclunkmenn0: It sounds like your fix will correct it anyway.04:30
menn0babbageclunk: yes it will but it's already in 2.2.104:30
menn0babbageclunk: I guess it only fails the second time the upgrade step is attempted. is that right?04:31
babbageclunkmenn0: oh, right - because the first time the logs collection is there?04:32
thumperaxw: did you say that the metrics gathering now all uses the state pool?04:32
thumperaxw: in 2.2?04:32
axwthumper: on the 2.2 branch, yes. the second fix I was talking about landed last week04:32
thumperaxw: cool. I'll see if that fixes my issue here04:32
menn0thumper: connect using the mongo shell and give the current leader a priority of 004:32
menn0thumper: rs.conf() shows the current configuration04:33
thumperI'm thinking of writing a 'juju-fetch' plugin that just does 'git fetch'04:34
menn0thumper: this is helpful: https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/04:34
thumpermenn0: thanks04:34
menn0babbageclunk: so the chances of hitting the problem are fairly slim but not 0. a re-run of the upgrade step could happen when any upgrade step fails and the upgrade is retried.04:36
menn0babbageclunk: i'll file a bug and get the fix in04:37
menn0babbageclunk: i've confirmed it works for 2.404:37
menn0babbageclunk: I didn't even know you had done this work to split out log collections :)04:38
babbageclunkmenn0: You mean if another upgrade step fails after this one succeeds, right? I did the state work, thumper did the upgrade step.04:39
menn0babbageclunk: I should have known :)04:41
menn0babbageclunk: just to be sure, I just confirmed that we still use 2.4 on trusty04:45
babbageclunkmenn0: nice one.04:47
menn0babbageclunk or thumper: https://github.com/juju/juju/pull/758804:59
babbageclunkmenn0: looking04:59
thumpermenn0: so... I got that logs deserialization error05:00
thumperwent and looked in the DB05:00
thumperthere is no entry with a missing TAG05:00
thumperso...05:00
thumperit looks like the oplog tailer is returning an empty value sometimes05:01
thumperwhen it shouldn't05:01
menn0thumper: were you looking for docs with an empty tag, or a missing tag?05:01
thumperah... empty05:01
thumperhow do I say missing?05:01
thumpernm05:02
menn0thumper: https://stackoverflow.com/questions/8567469/mongodb-find-a-document-by-non-existence-of-a-field05:02
thumperdb.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$exists:false}}).count()05:03
thumper005:03
thumperneither empty nor missing05:03
thumpermenn0: the error is being emitted every five minutes05:06
thumperseems very exact05:06
menn0thumper: it could also be that the field is the wrong type, so that it gets deserialised to "" because the db type doesn't match the struct type05:06
menn0thumper: where is the error being emitted?05:07
menn0thumper: when dealing with broken txn metadata i've seen fields with nil values that are confusingly deserialised05:07
thumpermenn0: db.logs["ac97e810-63f3-49b1-89dd-ea7f32e6e5d4"].find({n:{$not: {$type: "string"}}}).count()05:11
thumper005:11
thumpernot not a string either05:11
thumpermenn0: the error is emitted in the controller model, which happens to be the one I'm tailing05:12
menn0thumper: nice query05:12
thumperI recall the oplog query code has a reconnect on timeout05:12
thumperI'm wondering if that is emitting something bad by mistake05:13
thumperit goes back to look for values of the current time slice05:13
thumperbut what if the logs are quiet05:13
thumperand there isn't one?05:13
thumperperhaps that's the use case05:13
menn0thumper: that seems likely05:13
* menn0 looks at the code05:13
* thumper goes back to finding his other bug05:14
menn0thumper: seems the default cursor timeout is 10mins05:14
thumperthis is happening every 5 minutes on the nose05:15
menn0thumper: i'm seeing other sources which say 5 mins05:16
menn0digging05:16
* thumper is getting a bunch of unit agents connected before he forces a mongo election05:18
menn0thumper: so it turns out the oplog tailer uses a timeout of 1s so that it can be interrupted05:31
menn0thumper: but there's also the concept of cursor invalidation and we might be running into that05:32
thumperhmm..05:32
menn0thumper: i've added logging and have left debug-log running05:32
thumperI'm curious to see if you see the same error message every 5m05:32
menn0what does the message look like?05:32
menn0thumper: ^^05:34
thumpermachine-1: 17:29:11 WARNING juju.state deserialization failed (possible DB corruption), while parsing entity tag: "" is not a valid tag05:49
thumperI'm trying option two05:51
thumperstopping all api server agents05:52
thumperwhile leaving all unit agents running05:52
thumperwill go away for a few hours05:52
thumperand bring them up again05:52
thumperthat seems to mirror some of the failure cases05:52
thumperback later folks05:52
=== frankban|afk is now known as frankban
=== frankban is now known as frankban|afk
babbageclunkthumper: 1:1? (Sorry, was slow to see the reminder)22:32
thumpercoming22:33
