[08:49] Does anyone know how to troubleshoot this error: "cannot create relation state tracker: cannot remove persisted state, relation X has members"?
[08:51] The complete error is like this: "ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X-1": cannot create relation state tracker: cannot remove persisted state, relation X has members"
[08:57] oscarf: which version of juju are you using?
[08:59] achilleasa: if I do "juju status" for the affected model it says 2.8.1
[09:03] and the controller?
[09:04] achilleasa: also 2.8.1
[09:06] for the affected model, "Agent" says "failed" for the affected unit. It happened after I removed a unit it had a relation to.
[09:12] oscarf: how did you remove the other unit? juju remove-unit?
[09:13] achilleasa: yes, remove-unit
[09:13] without --force, right?
[09:13] achilleasa: yes, without --force
[09:15] and when did you upgrade to 2.8.1?
[09:16] achilleasa: prior to doing remove-unit, I mean long time ago
[09:17] and was that relation already established prior to the upgrade? I believe you might be experiencing a variant of https://bugs.launchpad.net/juju/+bug/1890828 (got fixed in 2.8.2)
[09:17] Bug #1890828: relation data lost during upgrade to juju 2.8.1
[09:20] we might be able to get the unit agent to start if we edit its local state on disk to remove the gone member ID
[09:21] is that a production unit?
[09:21] achilleasa: it's not on a production model yet, just on a test model
[09:22] achilleasa: I think the agent is actually running, but I'm not sure
[09:22] when I run "ps aux | grep juju" I see: /var/lib/juju/tools/unit-X-1/jujud unit --data-dir /var/lib/juju --unit-name X/1 --debug
[09:24] but it still shows up in failed state in juju status?
[09:25] yes
[09:25] when I do "systemcel status X/1" it also shows it's up and running
[09:25] systemctl
[09:25] the uniter is one of the workers that the agent starts.
It looks like it keeps starting and crashing.
[09:26] Can you `juju ssh` into the unit's machine?
[09:26] yes, sure. I'm already inside the machine
[09:26] we can try to patch the local state, restart the agent and then it should work (you should also upgrade to at least 2.8.2 after)
[09:26] that would be great
[09:26] ok, give me 1 min to get a local controller running
[09:28] ok
[09:39] achilleasa: do you know how I can do the "patch the local state"?
[09:40] so, first you need to systemctl stop the agent
[09:41] then you need to find the relation state data in /var/lib/juju/ (since 2.8 this is maintained by the controller so I am bootstrapping an older controller to find the exact path for you)
[09:41] achilleasa: thanks, I really appreciate your help
[09:44] I have executed "systemctl stop jujud-unit-X-1.service"
[09:49] ok, can you go to /var/lib/juju/agents/unit-X-1/state/relations?
[09:51] I'm here now (on the node running the "failed" unit): /var/lib/juju/agents/unit-X-1/state/ but it only has two directories: "bundles" and "deployer"
[09:52] there is no "relations", neither file nor directory
[09:52] oh I see. the state has been already migrated to the controller. one sec
[09:52] ok
[10:03] oscarf: quick question, are there any more units in the other side of the relation?
[10:04] so basically, is juju status showing the application for the other side without any units?
[10:05] achilleasa: yes, we do have other units with relations to the unit showing a "failed" state
[10:07] so basically, X/1 has relation to Y, and we have multiple Y/1, Y/2, Y/3, Y/4. I removed Y/1 and after that "X/1" ended up in a "failed" agent state
[10:10] ok, looks like we are going to have to do a bit of DB surgery :-(
[10:10] can you try running the script from https://discourse.juju.is/t/login-into-mongodb/309?
[10:10] achilleasa: that was my initial feeling for this problem actually
[10:11] sure, I will try it now
[10:11] then you need to find the doc for the broken unit by running 'db.unitstates.find().pretty()'
[10:12] or even better: 'db.unitstates.find({},{"relation-state":1}).pretty()'
[10:19] achilleasa: hmm, but it only shows components for the controller
[10:23] oscarf: that's odd... so no entries for your model at all? (mongo paginates so you may have to type 'it' to see more)
[10:25] or db.unitstates.find({"model-uuid":"$your-model"},{"relation-state":1}).pretty()
[10:26] because the lack of the folders above from the unit's machine means that their contents have been migrated to the controller
[10:27] maybe I'm not on the right node, let me check
[10:35] working on confirming the node
[10:42] achilleasa: sorry, this might take some time. I will try to communicate a bit with my coworker, but it's also lunch break here right now :)
[10:44] oscarf: no worries. This is basically what I wanted to try https://paste.ubuntu.com/p/wrspV4y4bp/
[10:45] I will be around for the next 6h so feel free to ping me when you get access to mongo
[10:46] achilleasa: thanks, I will try it as soon as I have figured out the nodes
[11:03] achilleasa, https://github.com/juju/juju/pull/12138 CR
[14:11] achilleasa: I have identified the right controller and I'm inside the mongodb shell. can I ask, what does the number key represent inside the "relation-state" data structure?
[14:16] oscarf: the keys in that map are relation IDs
[14:16] stickupkid_: i had to push a test fix to the refresh polish pr, only test code changed.
[14:17] oscarf: make sure to copy the original value somewhere before making the change just in case you need to revert the change
[14:21] achilleasa: I can see that the unit I removed is still listed in the yaml code for the key.
so I think that is the problem
[14:34] oscarf: yes, I think if you remove that entry as I suggested in the pastebin and then restart the uniter that should do the trick
[14:34] s/uniter/agent
[14:34] achilleasa: in the code that you provided, should I replace $set with something?
[14:34] oscarf: the part after the colon should be the replacement value with the gone unit removed
[14:35] (in the example I removed 'wordpress/0: 1\n')
[14:36] don't forget to replace the "_id" bit with the appropriate ID for the document you are editing
[14:36] achilleasa: okay, but I don't need to change "$set"?
[14:36] ah, no. that's a mongo command
[14:38] achilleasa: aha okay
[14:41] achilleasa: what is the logic behind the relation-state.1?
[14:41] I mean, why the .1?
[14:42] this is because relation-state is a nested document and the query uses the dot notation to target it for the update
[14:43] more precisely, relation-state is a map of nested documents
[14:45] achilleasa: aha, I see
[14:46] you will need to replace the '.1' with the key that you got when you displayed the original doc
[14:49] oh
[14:49] the relationship id?
[14:50] when you do the find query you need to find the entry in relation-state and use its key in the following $set query (in my example the key was "1")
[14:51] (you will also see it inside the yaml blob as the first entry 'id: X')
[14:51] ah, right
[14:54] I managed to update the right key
[14:55] cool, try to restart the unit agent
[15:02] it still doesn't want to play :/ still saying the agent is "failed" in "juju status". but looking at the error message it complains about relation 26
[15:03] same unit?
[15:03] can you share the full error message?
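The $set step discussed above can be sketched roughly as follows. This is only an illustration, not the contents of the pastebin: the document ID "<doc-id>", the YAML layout of the blob, and the helper names dropMember and buildSetUpdate are all hypothetical. It shows the two pieces achilleasa describes: producing the replacement value with the gone unit removed, and using dot notation to target one key of the relation-state map.

```javascript
// Sketch only: builds the arguments for a mongo shell call such as
//   db.unitstates.update(filter, update)
// The document ID and the YAML layout below are hypothetical placeholders.

// Remove one "unit: N" member line from a relation-state YAML blob.
function dropMember(yamlBlob, unitName) {
  return yamlBlob
    .split("\n")
    .filter((line) => !line.trim().startsWith(unitName + ":"))
    .join("\n");
}

// Target a single key of the nested relation-state map with dot notation.
function buildSetUpdate(docId, relationId, newValue) {
  return {
    filter: { _id: docId },
    update: { $set: { ["relation-state." + relationId]: newValue } },
  };
}

// Assumed blob shape: an 'id: X' entry followed by a member list.
const original = "id: 1\nmembers:\n  wordpress/0: 1\n  wordpress/1: 1\n";
const patched = dropMember(original, "wordpress/0");
const { filter, update } = buildSetUpdate("<doc-id>", 1, patched);

console.log(JSON.stringify(filter));
console.log(JSON.stringify(update));
```

Only the "relation-state.1" key is rewritten, so the other relations in the same document stay untouched. As noted in the conversation, the original value should be copied somewhere first and the agent stopped before the change is applied.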
[15:03] sure wait
[15:04] 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X": cannot create relation state tracker: cannot remove persisted state, relation 26 has members
[15:04] 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
[15:04] ok, so this is a different relation than the one that failed before right?
[15:05] no, this is the same one all the time
[15:05] but the key I changed in mongodb had a different number, so maybe it is not related then
[15:05] can you double-check the rest of the entries in the relation-state map to see if the deleted unit shows up elsewhere?
[15:05] (but that key still listed the unit I removed as a value)
[15:07] and the relation-state map is shown using "db.unitstates.find", right?
[15:07] yes, it's the first query in the pastebin link
[15:07] going over it now carefully
[15:08] make sure to shut down the agent before changing the db though
[15:19] I don't see the units I deleted, but I do see a relationship with ID 26 (same as in the error message) that I'm not sure should be there
[15:21] maybe I should just delete relation 26?
[15:24] hml: are you around to lend a hand with this problem? ^^^
[15:25] I think I just need some advice on how to design a safe deletion query
[15:25] achilleasa: otp right now… will check back after
[15:26] I found the $unset operator..
[15:38] I think it worked
[15:38] I deleted relation 26 and restarted the agent.. no more "failed" state
[15:38] so everything is green in the juju status?
[15:38] yes
[15:39] awesome!
[15:39] thanks for your help
[15:40] this would have been harder without it
[15:55] achilleasa: all good here yes?
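The exact deletion query oscarf ran is not shown in the log. A minimal sketch of what a $unset on a single relation key could look like is given below, assuming the same document shape as before; the document ID "<doc-id>", the blob contents, and the helper names buildUnsetUpdate and applyUnset are hypothetical. The in-memory helper only illustrates the effect on one document, it does not talk to MongoDB.

```javascript
// Sketch only: the document shape and ID below are hypothetical placeholders.

// Build the arguments for a mongo shell call such as
//   db.unitstates.update(filter, update)
function buildUnsetUpdate(docId, relationId) {
  return {
    filter: { _id: docId },
    // $unset ignores the value; "" is the conventional placeholder.
    update: { $unset: { ["relation-state." + relationId]: "" } },
  };
}

// In-memory illustration of what the $unset does to one document.
// Returns a new object and leaves the input untouched.
function applyUnset(doc, relationId) {
  const relationState = { ...doc["relation-state"] };
  delete relationState[String(relationId)];
  return { ...doc, "relation-state": relationState };
}

const doc = {
  _id: "<doc-id>",
  "relation-state": { "7": "id: 7\n", "26": "id: 26\n" },
};

// Keep a copy of the original so the change can be reverted if needed.
const backup = JSON.stringify(doc);

const { filter, update } = buildUnsetUpdate(doc._id, 26);
const after = applyUnset(doc, 26);

console.log(JSON.stringify(update));
console.log(Object.keys(after["relation-state"]));
```

Because the key is addressed as "relation-state.26", only that one entry is removed and the rest of the map is preserved. Removing a whole relation entry is more drastic than editing its member list, so keeping the backup copy is worthwhile.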
[15:57] hml: seems so :-) a variant of the race when migrating 2.7.x uniter state to the controller
[15:57] there was a phantom unit in the member list for a relation
[15:57] and the uniter refused to start
[15:58] achilleasa: huh, wonder how that happened… was something else going on at the same time as the upgrade?
[16:00] looks like it got triggered after an upgrade and remove-unit (there were still other units in the relation)
[16:00] (that is on a 2.8.1)
[16:00] interesting
[23:51] tlm: https://github.com/juju/juju/pull/12141 got this pr to fix the primary SA issue, could u take a look? ty
[23:56] or wallyworld hpidcock anyone free, +1 plz ty
[23:56] sure
[23:57] looking
[23:58] kelvinliu: LGTM
[23:58] kelvinliu: so it was a regression from 2.7?