[08:49] <oscarf> Does anyone know how to troubleshoot this error: "cannot create relation state tracker: cannot remove persisted state, relation X has members"?
[08:51] <oscarf> The complete error is like this: "ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X-1": cannot create relation state tracker: cannot remove persisted state, relation X has members"
[08:57] <achilleasa> oscarf: which version of juju are you using?
[08:59] <oscarf> achilleasa: if I do "juju status" for the affected model it says 2.8.1
[09:03] <achilleasa> and the controller?
[09:04] <oscarf> achilleasa: also 2.8.1
[09:06] <oscarf> for the affected model, "Agent" says "failed" for the affected unit. It happened after I removed a unit it had a relation to.
[09:12] <achilleasa> oscarf: how did you remove the other unit? juju remove-unit?
[09:13] <oscarf> achilleasa: yes, remove-unit
[09:13] <achilleasa> without --force, right?
[09:13] <oscarf> achilleasa: yes, without --force
[09:15] <achilleasa> and when did you upgrade to 2.8.1?
[09:16] <oscarf> achilleasa: prior to doing remove-unit, I mean a long time ago
[09:17] <achilleasa> and was that relation already established prior to the upgrade? I believe you might be experiencing a variant of https://bugs.launchpad.net/juju/+bug/1890828 (got fixed in 2.8.2)
[09:17] <mup> Bug #1890828: relation data lost during upgrade to juju 2.8.1  <upgrade-juju> <juju:Fix Released by hmlanigan> <https://launchpad.net/bugs/1890828>
[09:20] <achilleasa> we might be able to get the unit agent to start if we edit its local state on disk to remove the gone member ID
[09:21] <achilleasa> is that a production unit?
[09:21] <oscarf> achilleasa: it's not on a production model yet, just on a test model
[09:22] <oscarf> achilleasa: I think the agent is actually running, but I'm not sure
[09:22] <oscarf> when I run "ps aux | grep juju" I see: /var/lib/juju/tools/unit-X-1/jujud unit --data-dir /var/lib/juju --unit-name X/1 --debug
[09:24] <achilleasa> but it still shows up in failed state in juju status?
[09:25] <oscarf> yes
[09:25] <oscarf> when I do "systemctl status X/1" it also shows it's up and running
[09:25] <achilleasa> the uniter is one of the workers that the agent starts. It looks like it keeps starting and crashing.
[09:26] <achilleasa> Can you `juju ssh` into the unit's machine?
[09:26] <oscarf> yes, sure. I'm already inside the machine
[09:26] <achilleasa> we can try to patch the local state, restart the agent and then it should work (you should also upgrade to at least 2.8.2 after)
[09:26] <oscarf> that would be great
[09:26] <achilleasa> ok, give me 1 min to get a local controller running
[09:28] <oscarf> ok
[09:39] <oscarf> achilleasa: do you know how I can do the "patch the local state"?
[09:40] <achilleasa> so, first you need to systemctl stop the agent
[09:41] <achilleasa> then you need to find the relation state data in /var/lib/juju/ (since 2.8 this is maintained by the controller so I am bootstrapping an older controller to find the exact path for you)
[09:41] <oscarf> achilleasa: thanks, I really appreciate your help
[09:44] <oscarf> I have executed "systemctl stop jujud-unit-X-1.service"
[09:49] <achilleasa> ok, can you go to /var/lib/juju/agents/unit-X-1/state/relations?
[09:51] <oscarf> I'm here now (on the node running the "failed" unit): /var/lib/juju/agents/unit-X-1/state/ but it only has two directories: "bundles" and "deployer"
[09:52] <oscarf> there is no "relations", neither file nor directory
[09:52] <achilleasa> oh I see. the state has been already migrated to the controller. one sec
[09:52] <oscarf> ok
[10:03] <achilleasa> oscarf: quick question, are there any more units in the other side of the relation?
[10:04] <achilleasa> so basically, is juju status showing the application for the other side without any units?
[10:05] <oscarf> achilleasa: yes, we do have other units with relations to the unit showing a "failed" state
[10:07] <oscarf> so basically, X/1 has a relation to Y, and we have multiple units Y/1, Y/2, Y/3, Y/4. I removed Y/1 and after that "X/1" ended up in a "failed" agent state
[10:10] <achilleasa> ok, looks like we are going to have to do a bit of DB surgery :-(
[10:10] <achilleasa> can you try running the script from https://discourse.juju.is/t/login-into-mongodb/309?
[10:10] <oscarf> achilleasa: that was my initial feeling for this problem actually
[10:11] <oscarf> sure, I will try it now
[10:11] <achilleasa> then you need to find the doc for the broken unit by running 'db.unitstates.find().pretty()'
[10:12] <achilleasa> or even better: 'db.unitstates.find({},{"relation-state":1}).pretty()'
[10:19] <oscarf> achilleasa: hmm, but it only shows components for the controller
[10:23] <achilleasa> oscarf: that's odd... so no entries for your model at all? (mongo paginates so you may have to type 'it' to see more)
[10:25] <achilleasa> or db.unitstates.find({"model-uuid":"$your-model"},{"relation-state":1}).pretty()
[10:26] <achilleasa> because the lack of the folders above from the unit's machine means that their contents have been migrated to the controller
[10:27] <oscarf> maybe I'm not on the right node, let me check
[10:35] <oscarf> working on confirming the node
[10:42] <oscarf> achilleasa: sorry, this might take some time. I will try to communicate a bit with my coworker, but it's also lunch break here right now :)
[10:44] <achilleasa> oscarf: no worries. This is basically what I wanted to try https://paste.ubuntu.com/p/wrspV4y4bp/
[10:45] <achilleasa> I will be around for the next 6h so feel free to ping me when you get access to mongo
[10:46] <oscarf> achilleasa: thanks, I will try it as soon as I have figured out the nodes
[11:03] <stickupkid> achilleasa, https://github.com/juju/juju/pull/12138 CR
[14:11] <oscarf> achilleasa: I have identified the right controller and I'm inside the mongodb shell. can I ask, what does the number key represent inside the "relation-state" data structure?
[14:16] <achilleasa> oscarf: the keys in that map are relation IDs
[14:16] <hml> stickupkid_: i had to push a test fix to the refresh polish pr, only test code changed.
[14:17] <achilleasa> oscarf: make sure to copy the original value somewhere before making the change just in case you need to revert the change
[14:21] <oscarf> achilleasa: I can see that the unit I removed is still listed in the yaml code for the key. so I think that is the problem
[14:34] <achilleasa> oscarf: yes, I think if you remove that entry as I suggested in the pastebin and then restart the agent that should do the trick
[14:34] <oscarf> achilleasa: in the code that you provided, should I replace $set with something?
[14:34] <achilleasa> oscarf: the part after the colon should be the replacement value with the gone unit removed
[14:35] <achilleasa> (in the example I removed 'wordpress/0:  1\n')
[14:36] <achilleasa> don't forget to replace the "_id" bit with the appropriate ID for the document you are editing
[14:36] <oscarf> achilleasa: okay, but I don't need to change "$set"?
[14:36] <achilleasa> ah, no. that's a mongo command
[14:38] <oscarf> achilleasa: aha okay
[14:41] <oscarf> achilleasa: what is the logic behind the relation-state.1?
[14:41] <oscarf> I mean, why the .1?
[14:42] <achilleasa> this is because relation-state is a nested document and the query uses the dot notation to target it for the update
[14:43] <achilleasa> more precisely, relation-state is a map of nested documents
[14:45] <oscarf> achilleasa: aha, I see
[14:46] <achilleasa> you will need to replace the '.1' with the key that you got when you displayed the original doc
[14:49] <oscarf> oh
[14:49] <oscarf> the relationship id?
[14:50] <achilleasa> when you do the find query you need to find the entry in relation-state and use its key in the following $set query (in my example the key was "1")
[14:51] <achilleasa> (you will also see it inside the yaml blob as the first entry 'id: X')
[14:51] <oscarf> ah, right
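The dot-notation update described above can be sketched roughly as follows. The pastebin itself is not reproduced in this log, so the YAML layout of the relation-state value, the unit names, and the helper names `dropMember` and `buildSetUpdate` are all assumptions made for illustration. The sketch only builds the update document; it does not talk to MongoDB.

```javascript
// Sketch only: builds the update document for a relation-state fix.
// The YAML layout of the relation-state value is an assumption.

// Remove one unit's line from the members section of a YAML blob.
function dropMember(yamlBlob, unitName) {
  return yamlBlob
    .split("\n")
    .filter((line) => !line.trim().startsWith(unitName + ":"))
    .join("\n");
}

// Build { $set: { "relation-state.<id>": <new blob> } } using dot notation,
// so only the entry for that relation ID is replaced.
function buildSetUpdate(relationId, yamlBlob, goneUnit) {
  const key = "relation-state." + relationId;
  return { $set: { [key]: dropMember(yamlBlob, goneUnit) } };
}

// Hypothetical example data; real values come from db.unitstates.find().
const original = "id: 1\nmembers:\n  wordpress/0: 1\n  wordpress/1: 1\n";
const update = buildSetUpdate("1", original, "wordpress/0");
console.log(JSON.stringify(update, null, 2));
```

In the mongo shell the resulting document would be passed as the second argument of an update on `db.unitstates`, with the first argument selecting the document by its `_id`, and with the unit agent stopped and the original value saved beforehand.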
[14:54] <oscarf> I managed to update the right key
[14:55] <achilleasa> cool, try to restart the unit agent
[15:02] <oscarf> it still doesn't want to play :/ still saying the agent is "failed" in "juju status". but looking at the error message it complains about relation 26
[15:03] <achilleasa> same unit?
[15:03] <achilleasa> can you share the full error message?
[15:03] <oscarf> sure wait
[15:04] <oscarf> 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X": cannot create relation state tracker: cannot remove persisted state, relation 26 has members
[15:04] <oscarf> 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
[15:04] <achilleasa> ok, so this is a different relation than the one that failed before right?
[15:05] <oscarf> no, this is the same one all the time
[15:05] <oscarf> but the key I changed in mongodb had a different number, so maybe it is not related then
[15:05] <achilleasa> can you double-check the rest of the entries in the relation-state map to see if the deleted unit shows up elsewhere?
[15:05] <oscarf> (but that key still listed the unit I removed as a value)
[15:07] <oscarf> and the relation-state map is shown using "db.unitstates.find", right?
[15:07] <achilleasa> yes, it's the first query in the pastebin link
[15:07] <oscarf> going over it now carefully
[15:08] <achilleasa> make sure to shut down the agent before changing the db though
[15:19] <oscarf> I don't see the units I deleted, but I do see a relationship with ID 26 (same as in the error message) that I'm not sure should be there
[15:21] <oscarf> maybe I should just delete relation 26?
[15:24] <achilleasa> hml: are you around to lend a hand with this problem? ^^^
[15:25] <oscarf> I think I just need some advice on how to design a safe deletion query
[15:25] <hml> achilleasa: otp right now… will check back after
[15:26] <oscarf> I found the $unset operator..
[15:38] <oscarf> I think it worked
[15:38] <oscarf> I deleted relation 26 and restarted the agent.. no more "failed" state
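The exact deletion query that was run is not shown in this log. A minimal sketch of what a `$unset` on a single relation-state entry might look like, assuming the same document shape as in the earlier discussion, is given below. The helper names `buildUnsetUpdate` and `backupEntry` and the example document are hypothetical.

```javascript
// Sketch only: builds the update document that drops one relation entry.

// Build { $unset: { "relation-state.<id>": "" } }; $unset ignores the value.
function buildUnsetUpdate(relationId) {
  return { $unset: { ["relation-state." + relationId]: "" } };
}

// Keep a copy of the entry first so the change can be reverted with $set.
function backupEntry(doc, relationId) {
  return {
    key: "relation-state." + relationId,
    value: doc["relation-state"][relationId],
  };
}

// Hypothetical document shape; real data comes from db.unitstates.find().
const doc = {
  _id: "example-id",
  "relation-state": { "26": "id: 26\nmembers: {}\n" },
};
const saved = backupEntry(doc, "26");
const unsetUpdate = buildUnsetUpdate("26");
console.log(JSON.stringify({ saved, unsetUpdate }, null, 2));
```

As with the `$set` case, the update document would be applied to the one document selected by `_id`, with the agent stopped first. Keeping `saved` around follows the earlier advice to copy the original value before changing anything.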
[15:38] <achilleasa> so everything is green in the juju status?
[15:38] <oscarf> yes
[15:39] <achilleasa> awesome!
[15:39] <oscarf> thanks for your help
[15:40] <oscarf> this would have been harder without it
[15:55] <hml> achilleasa:  all good here yes?
[15:57] <achilleasa> hml: seems so :-) a variant of the race when migrating 2.7.x uniter state to the controller
[15:57] <achilleasa> there was a phantom unit in the member list for a relation
[15:57] <achilleasa> and the uniter refused to start
[15:58] <hml> achilleasa: huh, wonder how that happened… was something else going on at the same time as the upgrade?
[16:00] <achilleasa> looks like it got triggered after an upgrade and remove-unit (there were still other units in the relation)
[16:00] <achilleasa> (that is on a 2.8.1)
[16:00] <hml> interesting
[23:51] <kelvinliu> tlm: https://github.com/juju/juju/pull/12141 got this pr to fix the primary SA issue, could u take a look? ty
[23:56] <kelvinliu> or wallyworld hpidcock anyone free, +1 plz ty
[23:56] <hpidcock> sure
[23:57] <wallyworld> looking
[23:58] <hpidcock> kelvinliu: LGTM
[23:58] <wallyworld> kelvinliu: so it was a regression from 2.7?