oscarf | Does anyone know how to troubleshoot this error: "cannot create relation state tracker: cannot remove persisted state, relation X has members"? | 08:49 |
---|---|---|
oscarf | The complete error is like this: "ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X-1": cannot create relation state tracker: cannot remove persisted state, relation X has members" | 08:51 |
achilleasa | oscarf: which version of juju are you using? | 08:57 |
oscarf | achilleasa: if I do "juju status" for the affected model it says 2.8.1 | 08:59 |
achilleasa | and the controller? | 09:03 |
oscarf | achilleasa: also 2.8.1 | 09:04 |
oscarf | for the affected model, "Agent" says "failed" for the affected unit. It happened after I removed a unit it had a relation to. | 09:06 |
achilleasa | oscarf: how did you remove the other unit? juju remove-unit? | 09:12 |
oscarf | achilleasa: yes, remove-unit | 09:13 |
achilleasa | without --force, right? | 09:13 |
oscarf | achilleasa: yes, without --force | 09:13 |
achilleasa | and when did you upgrade to 2.8.1? | 09:15 |
oscarf | achilleasa: prior to doing remove-unit, I mean long time ago | 09:16 |
achilleasa | and was that relation already established prior to the upgrade? I believe you might be experiencing a variant of https://bugs.launchpad.net/juju/+bug/1890828 (got fixed in 2.8.2) | 09:17 |
mup | Bug #1890828: relation data lost during upgrade to juju 2.8.1 <upgrade-juju> <juju:Fix Released by hmlanigan> <https://launchpad.net/bugs/1890828> | 09:17 |
achilleasa | we might be able to get the unit agent to start if we edit its local state on disk to remove the gone member ID | 09:20 |
achilleasa | is that a production unit? | 09:21 |
oscarf | achilleasa: it's not on a production model yet, just on a test model | 09:21 |
oscarf | achilleasa: I think the agent is actually running, but I'm not sure | 09:22 |
oscarf | when I run "ps aux | grep juju" I see: /var/lib/juju/tools/unit-X-1/jujud unit --data-dir /var/lib/juju --unit-name X/1 --debug | 09:22 |
achilleasa | but it still shows up in failed state in juju status? | 09:24 |
oscarf | yes | 09:25 |
oscarf | when I do "systemcel status X/1" it also shows it's up and running | 09:25 |
oscarf | systemctl | 09:25 |
achilleasa | the uniter is one of the workers that the agent starts. It looks like it keeps starting and crashing. | 09:25 |
achilleasa | Can you `juju ssh` into the unit's machine? | 09:26 |
oscarf | yes, sure. I'm already inside the machine | 09:26 |
achilleasa | we can try to patch the local state, restart the agent and then it should work (you should also upgrade to at least 2.8.2 after) | 09:26 |
oscarf | that would be great | 09:26 |
achilleasa | ok, give me 1 min to get a local controller running | 09:26 |
oscarf | ok | 09:28 |
oscarf | achilleasa: do you know how I can do the "patch the local state"? | 09:39 |
achilleasa | so, first you need to systemctl stop the agent | 09:40 |
achilleasa | then you need to find the relation state data in /var/lib/juju/ (since 2.8 this is maintained by the controller so I am bootstrapping an older controller to find the exact path for you) | 09:41 |
oscarf | achilleasa: thanks, I really appreciate your help | 09:41 |
oscarf | I have executed "systemctl stop jujud-unit-X-1.service" | 09:44 |
achilleasa | ok, can you go to /var/lib/juju/agents/unit-X-1/state/relations? | 09:49 |
oscarf | I'm here now (on the node running the "failed" unit): /var/lib/juju/agents/unit-X-1/state/ but it only has two directories: "bundles" and "deployer" | 09:51 |
oscarf | there is no "relations", neither file nor directory | 09:52 |
achilleasa | oh I see. the state has been already migrated to the controller. one sec | 09:52 |
oscarf | ok | 09:52 |
achilleasa | oscarf: quick question, are there any more units in the other side of the relation? | 10:03 |
achilleasa | so basically, is juju status showing the application for the other side without any units? | 10:04 |
oscarf | achilleasa: yes, we do have other units with relations to the unit showing a "failed" state | 10:05 |
oscarf | so basically, X/1 has relation to Y, and we have multiple Y/1, Y/2, Y/3, Y/4. I removed Y/1 and after that "X/1" ended up in a "failed" agent state | 10:07 |
achilleasa | ok, looks like we are going to have to do a bit of DB surgery :-( | 10:10 |
achilleasa | can you try running the script from https://discourse.juju.is/t/login-into-mongodb/309? | 10:10 |
oscarf | achilleasa: that was my initial feeling for this problem actually | 10:10 |
oscarf | sure, I will try it now | 10:11 |
achilleasa | then you need to find the doc for the broken unit by running 'db.unitstates.find().pretty()' | 10:11 |
achilleasa | or even better: 'db.unitstates.find({},{"relation-state":1}).pretty()' | 10:12 |
oscarf | achilleasa: hmm, but it only shows components for the controller | 10:19 |
achilleasa | oscarf: that's odd... so no entries for your model at all? (mongo paginates so you may have to type 'it' to see more) | 10:23 |
achilleasa | or db.unitstates.find({"model-uuid":"$your-model"},{"relation-state":1}).pretty() | 10:25 |
achilleasa | because the lack of the folders above from the unit's machine means that their contents have been migrated to the controller | 10:26 |
oscarf | maybe I'm not on the right node, let me check | 10:27 |
oscarf | working on to confirm the node | 10:35 |
oscarf | achilleasa: sorry, this might take some time. I will try to communicate a bit with my coworker, but it's also lunch break here right now :) | 10:42 |
achilleasa | oscarf: no worries. This is basically what I wanted to try https://paste.ubuntu.com/p/wrspV4y4bp/ | 10:44 |
achilleasa | I will be around for the next 6h so feel free to ping me when you get access to mongo | 10:45 |
oscarf | achilleasa: thanks, I will try it as soon as I have figured out the nodes | 10:46 |
stickupkid | achilleasa, https://github.com/juju/juju/pull/12138 CR | 11:03 |
oscarf | achilleasa: I have identified the right controller and I'm inside the mongodb shell. can I ask, what does the number key represent inside the "relation-state" data structure? | 14:11 |
achilleasa | oscarf: the keys in that map are relation IDs | 14:16 |
hml | stickupkid_: i had to push a test fix to the refresh polish pr, only test code changed. | 14:16 |
achilleasa | oscarf: make sure to copy the original value somewhere before making the change just in case you need to revert the change | 14:17 |
oscarf | achilleasa: I can see that the unit I removed is still listed in the yaml code for the key. so I think that is the problem | 14:21 |
achilleasa | oscarf: yes, I think if you remove that entry as I suggested in the pastebin and then restart the uniter that should do the trick | 14:34 |
achilleasa | s/uniter/agent | 14:34 |
oscarf | achilleasa: in the code that you provided, should I replace $set with something? | 14:34 |
achilleasa | oscarf: the part after the colon should be the replacement value with the gone unit removed | 14:34 |
achilleasa | (in the example I removed 'wordrpess/0: 1\n') | 14:35 |
achilleasa | don't forget to replace the "_id" bit with the appropriate ID for the document you are editing | 14:36 |
oscarf | achilleasa: okay, but I don't need to change "$set"? | 14:36 |
achilleasa | ah, no. that's a mongo command | 14:36 |
oscarf | achilleasa: aha okay | 14:38 |
oscarf | achilleasa: what is the logic behind the relation-state.1? | 14:41 |
oscarf | I mean, why the .1? | 14:41 |
achilleasa | this is because relation-state is a nested document and the query uses the dot notation to target it for the update | 14:42 |
achilleasa | more precisely, relation-state is a map of nested documents | 14:43 |
oscarf | achilleasa: aha, I see | 14:45 |
achilleasa | you will need to replace the '.1' with the key that you got when you displayed the original doc | 14:46 |
oscarf | oh | 14:49 |
oscarf | the relationship id? | 14:49 |
achilleasa | when you do the find query you need to find the entry in relation-state and use its key in the following $set query (in my example the key was "1") | 14:50 |
achilleasa | (you will also see it inside the yaml blob as the first entry 'id: X') | 14:51 |
oscarf | ah, right | 14:51 |
oscarf | I managed to update the right key | 14:54 |
achilleasa | cool, try to restart the unit agent | 14:55 |
oscarf | it still doesn't want to play :/ still saying the agent is "failed" in "juju status". but looking at the error message it complains about relation 26 | 15:02 |
achilleasa | same unit? | 15:03 |
achilleasa | can you share the full error message? | 15:03 |
oscarf | sure wait | 15:03 |
oscarf | 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X": cannot create relation state tracker: cannot remove persisted state, relation 26 has members | 15:04 |
oscarf | 2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent | 15:04 |
achilleasa | ok, so this is a different relation than the one that failed before right? | 15:04 |
oscarf | no, this is the same one all the time | 15:05 |
oscarf | but the key I changed in mongodb had a different number, so maybe it is not related then | 15:05 |
achilleasa | can you double-check the rest of the entries in the relation-state map to see if the deleted unit shows up elsewhere? | 15:05 |
oscarf | (but that key still listed the unit I removed as a value) | 15:05 |
oscarf | and the relation-state map is shown using "db.unitstates.find", right? | 15:07 |
achilleasa | yes, it's the first query in the pastebin link | 15:07 |
oscarf | going over it now carefully | 15:07 |
achilleasa | make sure to shutdown the agent before changing the db though | 15:08 |
oscarf | I don't see the units I deleted, but I do see a relationship with ID 26 (same as in the error message) that I'm not sure should be there | 15:19 |
oscarf | maybe I should just delete relation 26? | 15:21 |
achilleasa | hml: are you around to lend a hand with this problem? ^^^ | 15:24 |
oscarf | I think I just need some advice how to design a safe deletion query | 15:25 |
hml | achilleasa: otp right now… will check back after | 15:25 |
oscarf | I found the $unset operator.. | 15:26 |
oscarf | I think it worked | 15:38 |
oscarf | I deleted relation 26 and restarted the agent.. no more "failed" state | 15:38 |
achilleasa | so everything is green in the juju status? | 15:38 |
oscarf | yes | 15:38 |
achilleasa | awesome! | 15:39 |
oscarf | thanks for your help | 15:39 |
oscarf | this would have been harder without it | 15:40 |
hml | achilleasa: all good here yes? | 15:55 |
achilleasa | hml: seems so :-) a variant of the race when migrating 2.7.x uniter state to the controller | 15:57 |
achilleasa | there was a phantom unit in the member list for a relation | 15:57 |
achilleasa | and the uniter refused to start | 15:57 |
hml | achilleasa: huh, wonder how that happened… was something else going on at the same time as the upgrade? | 15:58 |
achilleasa | looks like it got triggered after an upgrade and remove-unit (there were still other units in the relation) | 16:00 |
achilleasa | (that is on a 2.8.1) | 16:00 |
hml | interesting | 16:00 |
kelvinliu | tlm: https://github.com/juju/juju/pull/12141 got this pr to fix the primary SA issue, could u take a look? ty | 23:51 |
kelvinliu | or wallyworld hpidcock anyone free, +1 plz ty | 23:56 |
hpidcock | sure | 23:56 |
wallyworld | looking | 23:57 |
hpidcock | kelvinliu: LGTM | 23:58 |
wallyworld | kelvinliu: so it was a regression from 2.7? | 23:58 |
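
The fix discussed in the log came down to two kinds of update against a document in `db.unitstates`: a `$set` that replaces one entry of the `relation-state` map (addressed with dot notation, as in `relation-state.1`) with the gone unit removed, and an `$unset` that drops an entry such as relation 26 entirely. The pastebin is not reproduced in the log, so the following is only a rough sketch of how those update documents could be built. The helper names, the sample unit names, and the layout of the YAML blob are assumptions made for illustration, not the actual Juju schema.

```javascript
// Hypothetical helpers for illustration only. The blob layout below is an
// assumption, not the real Juju format.

// Drop the line for one departed unit (e.g. "Y/1: 1") from a YAML-like blob.
function removeMember(yamlBlob, unitName) {
  return yamlBlob
    .split("\n")
    .filter((line) => !line.trim().startsWith(unitName + ":"))
    .join("\n");
}

// Build a $set update that replaces a single relation's entry.
// "relation-state.<id>" uses dot notation to target one key of the nested map.
function buildSetUpdate(relationId, newBlob) {
  return { $set: { ["relation-state." + relationId]: newBlob } };
}

// Build an $unset update that removes a single relation's entry entirely.
function buildUnsetUpdate(relationId) {
  return { $unset: { ["relation-state." + relationId]: "" } };
}

// Made-up sample data: relation 1 with two members, one of which is gone.
const original = "id: 1\nmembers:\n  Y/1: 1\n  Y/2: 1\n";
const patched = removeMember(original, "Y/1");

console.log(JSON.stringify(buildSetUpdate(1, patched)));
console.log(JSON.stringify(buildUnsetUpdate(26)));

// In the mongo shell, such a document would be passed as the second argument
// of an update, with the filter matching the "_id" of the unit's document:
//   db.unitstates.update({ _id: "<document id>" }, buildSetUpdate(1, patched))
```

As advised in the log, the unit agent should be stopped and the original value copied somewhere before running any such update, so the change can be reverted.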