/srv/irclogs.ubuntu.com/2020/10/15/#juju.txt

oscarfDoes anyone know how to troubleshoot this error: "cannot create relation state tracker: cannot remove persisted state, relation X has members"?08:49
oscarfThe complete error is like this: "ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X-1": cannot create relation state tracker: cannot remove persisted state, relation X has members"08:51
achilleasaoscarf: which version of juju are you using?08:57
oscarfachilleasa: if I do "juju status" for the affected model it says 2.8.108:59
achilleasaand the controller?09:03
oscarfachilleasa: also 2.8.109:04
oscarffor the affected model, "Agent" says "failed" for the affected unit. It happened after I removed a unit it had a relation to.09:06
achilleasaoscarf: how did you remove the other unit? juju remove-unit?09:12
oscarfachilleasa: yes, remove-unit09:13
achilleasawithout --force, right?09:13
oscarfachilleasa: yes, without --force09:13
achilleasaand when did you upgrade to 2.8.1?09:15
oscarfachilleasa: prior to doing remove-unit, I mean a long time ago09:16
achilleasaand was that relation already established prior to the upgrade? I believe you might be experiencing a variant of https://bugs.launchpad.net/juju/+bug/1890828 (got fixed in 2.8.2)09:17
mupBug #1890828: relation data lost during upgrade to juju 2.8.1  <upgrade-juju> <juju:Fix Released by hmlanigan> <https://launchpad.net/bugs/1890828>09:17
achilleasawe might be able to get the unit agent to start if we edit its local state on disk to remove the gone member ID09:20
achilleasais that a production unit?09:21
oscarfachilleasa: it's not on a production model yet, just on a test model09:21
oscarfachilleasa: I think the agent is actually running, but I'm not sure09:22
oscarfwhen I run "ps aux | grep juju" I see: /var/lib/juju/tools/unit-X-1/jujud unit --data-dir /var/lib/juju --unit-name X/1 --debug09:22
achilleasabut it still shows up in failed state in juju status?09:24
oscarfyes09:25
oscarfwhen I do "systemcel status X/1" it also shows it's up and running09:25
oscarfsystemctl09:25
achilleasathe uniter is one of the workers that the agent starts. It looks like it keeps starting and crashing.09:25
achilleasaCan you `juju ssh` into the unit's machine?09:26
oscarfyes, sure. I'm already inside the machine09:26
achilleasawe can try to patch the local state, restart the agent and then it should work (you should also upgrade to at least 2.8.2 after)09:26
oscarfthat would be great09:26
achilleasaok, give me 1 min to get a local controller running09:26
oscarfok09:28
oscarfachilleasa: do you know how I can do the "patch the local state"?09:39
achilleasaso, first you need to systemctl stop the agent09:40
achilleasathen you need to find the relation state data in /var/lib/juju/ (since 2.8 this is maintained by the controller so I am bootstrapping an older controller to find the exact path for you)09:41
oscarfachilleasa: thanks, I really appreciate your help09:41
oscarfI have executed "systemctl stop jujud-unit-X-1.service"09:44
achilleasaok, can you go to /var/lib/juju/agents/unit-X-1/state/relations?09:49
oscarfI'm here now (on the node running the "failed" unit): /var/lib/juju/agents/unit-X-1/state/ but it only has two directories: "bundles" and "deployer"09:51
oscarfthere is no "relations", neither file nor directory09:52
achilleasaoh I see. the state has been already migrated to the controller. one sec09:52
oscarfok09:52
achilleasaoscarf: quick question, are there any more units in the other side of the relation?10:03
achilleasaso basically, is juju status showing the application for the other side without any units?10:04
oscarfachilleasa: yes, we do have other units with relations to the unit showing a "failed" state10:05
oscarfso basically, X/1 has relation to Y, and we have multiple Y/1, Y/2, Y/3, Y/4. I removed Y/1 and after that "X/1" ended up in a "failed" agent state10:07
achilleasaok, looks like we are going to have to do a bit of DB surgery :-(10:10
achilleasacan you try running the script from https://discourse.juju.is/t/login-into-mongodb/309?10:10
oscarfachilleasa: that was my initial feeling for this problem actually10:10
oscarfsure, I will try it now10:11
achilleasathen you need to find the doc for the broken unit by running 'db.unitstates.find().pretty()'10:11
achilleasaor even better: 'db.unitstates.find({},{"relation-state":1}).pretty()'10:12
oscarfachilleasa: hmm, but it only shows components for the controller10:19
achilleasaoscarf: that's odd... so no entries for your model at all? (mongo paginates so you may have to type 'it' to see more)10:23
achilleasaor db.unitstates.find({"model-uuid":"$your-model"},{"relation-state":1}).pretty()10:25
achilleasabecause the lack of the folders above from the unit's machine means that their contents have been migrated to the controller10:26
oscarfmaybe I'm not on the right node, let me check10:27
oscarfworking on confirming the node10:35
oscarfachilleasa: sorry, this might take some time. I will try to communicate a bit with my coworker, but it's also lunch break here right now :)10:42
achilleasaoscarf: no worries. This is basically what I wanted to try https://paste.ubuntu.com/p/wrspV4y4bp/10:44
achilleasaI will be around for the next 6h so feel free to ping me when you get access to mongo10:45
oscarfachilleasa: thanks, I will try it as soon as I have figured out the nodes10:46
stickupkidachilleasa, https://github.com/juju/juju/pull/12138 CR11:03
oscarfachilleasa: I have identified the right controller and I'm inside the mongodb shell. can I ask, what does the number key represent inside the "relation-state" data structure?14:11
achilleasaoscarf: the keys in that map are relation IDs14:16
hmlstickupkid_: i had to push a test fix to the refresh polish pr, only test code changed.14:16
achilleasaoscarf: make sure to copy the original value somewhere before making the change just in case you need to revert the change14:17
oscarfachilleasa: I can see that the unit I removed is still listed in the yaml code for the key. so I think that is the problem14:21
achilleasaoscarf: yes, I think if you remove that entry as I suggested in the pastebin and then restart the uniter that should do the trick14:34
achilleasas/uniter/agent14:34
oscarfachilleasa: in the code that you provided, should I replace $set with something?14:34
achilleasaoscarf: the part after the colon should be the replacement value with the gone unit removed14:34
achilleasa(in the example I removed 'wordpress/0:  1\n')14:35
achilleasadon't forget to replace the "_id" bit with the appropriate ID for the document you are editing14:36
oscarfachilleasa: okay, but I don't need to change "$set"?14:36
achilleasaah, no. that's a mongo command14:36
oscarfachilleasa: aha okay14:38
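The replacement value mentioned above (the stored YAML blob with the gone unit's line taken out) could be produced along these lines. This is only a sketch: the helper name `removeMember` and the sample blob are hypothetical, and the blob layout is assumed from the 'id: X' and 'wordpress/0:  1' fragments quoted in the conversation.

```javascript
// Drop the line for one unit from a relation-state YAML blob.
// Assumes each member sits on its own line as "<unit-name>: <value>".
function removeMember(yamlText, unitName) {
  return yamlText
    .split("\n")
    .filter((line) => !line.trim().startsWith(unitName + ":"))
    .join("\n");
}

// Hypothetical blob, for illustration only.
const original = "id: 1\nmembers:\n  wordpress/0: 1\n  wordpress/1: 1\n";
console.log(removeMember(original, "wordpress/0"));
```

Keeping the colon in the prefix check avoids dropping a unit such as wordpress/10 when wordpress/1 is the one being removed.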
oscarfachilleasa: what is the logic behind the relation-state.1?14:41
oscarfI mean, why the .1?14:41
achilleasathis is because relation-state is a nested document and the query uses the dot notation to target it for the update14:42
achilleasamore precisely, relation-state is a map of nested documents14:43
oscarfachilleasa: aha, I see14:45
achilleasayou will need to replace the '.1' with the key that you got when you displayed the original doc14:46
oscarfoh14:49
oscarfthe relationship id?14:49
achilleasawhen you do the find query you need to find the entry in relation-state and use its key in the following $set query (in my example the key was "1")14:50
achilleasa(you will also see it inside the yaml blob as the first entry 'id: X')14:51
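The dot-notation update described here can be sketched in JavaScript as it might be typed in the mongo shell. The helper name `buildSetUpdate`, the document ID placeholder, and the YAML contents are hypothetical; only `db.unitstates`, `relation-state`, and `$set` come from the conversation.

```javascript
// Build a $set update that replaces ONE entry of the relation-state map.
// relationId is the key shown in the find() output (e.g. "1").
function buildSetUpdate(relationId, newYaml) {
  // "relation-state.<id>" targets a single nested key via dot notation.
  return { $set: { ["relation-state." + relationId]: newYaml } };
}

// Hypothetical values, for illustration only.
const setUpdate = buildSetUpdate("1", "id: 1\nmembers:\n  wordpress/1: 1\n");
console.log(JSON.stringify(setUpdate));

// In the mongo shell this would be passed to something like:
//   db.unitstates.update({ "_id": "<document id>" }, setUpdate)
```

Because the update names only one key, the other relations stored in the same document are left alone.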
oscarfah, right14:51
oscarfI managed to update the right key14:54
achilleasacool, try to restart the unit agent14:55
oscarfit still doesn't want to play :/ still saying the agent is "failed" in "juju status". but looking at the error message it complains about relation 2615:02
achilleasasame unit?15:03
achilleasacan you share the full error message?15:03
oscarfsure wait15:03
oscarf2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: failed to initialize uniter for "unit-X": cannot create relation state tracker: cannot remove persisted state, relation 26 has members15:04
oscarf2020-10-15 15:03:22 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent15:04
achilleasaok, so this is a different relation than the one that failed before right?15:04
oscarfno, this is the same one all the time15:05
oscarfbut the key I changed in mongodb had a different number, so maybe it is not related then15:05
achilleasacan you double-check the rest of the entries in the relation-state map to see if the deleted unit shows up elsewhere?15:05
oscarf(but that key still listed the unit I removed as a value)15:05
oscarfand the relation-state map is shown using "db.unitstates.find", right?15:07
achilleasayes, it's the first query in the pastebin link15:07
oscarfgoing over it now carefully15:07
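Checking every entry of the relation-state map by eye is error-prone, so a small helper along these lines might help. It is a sketch only: `relationsMentioning` is a hypothetical name, and it assumes each map value is a YAML string in which members appear as "<unit-name>:".

```javascript
// Return the relation IDs whose stored YAML still mentions a given unit.
// doc is one document as returned by db.unitstates.find().
function relationsMentioning(doc, unitName) {
  const state = doc["relation-state"] || {};
  return Object.keys(state).filter((id) => state[id].includes(unitName + ":"));
}

// Hypothetical document, for illustration only.
const sampleDoc = {
  "relation-state": {
    "3": "id: 3\nmembers:\n  Y/2: 1\n",
    "26": "id: 26\nmembers:\n  Y/1: 1\n",
  },
};
console.log(relationsMentioning(sampleDoc, "Y/1"));
```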
achilleasamake sure to shutdown the agent before changing the db though15:08
oscarfI don't see the units I deleted, but I do see a relationship with ID 26 (same as in the error message) that I'm not sure should be there15:19
oscarfmaybe I should just delete relation 26?15:21
achilleasahml: are you around to lend a hand with this problem? ^^^15:24
oscarfI think I just need some advice how to design a safe deletion query15:25
hmlachilleasa: otp right now… will check back after15:25
oscarfI found the $unset operator..15:26
oscarfI think it worked15:38
oscarfI deleted relation 26 and restarted the agent.. no more "failed" state15:38
achilleasaso everything is green in the juju status?15:38
oscarfyes15:38
achilleasaawesome!15:39
oscarfthanks for your help15:39
oscarfthis would have been harder without it15:40
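The deletion that resolved the problem (removing relation 26 with `$unset`) could look roughly like this. The exact query that was run is not shown in the log, so this is a sketch: `buildUnsetUpdate` is a hypothetical helper and the document ID is a placeholder.

```javascript
// Build a $unset update that drops ONE relation entry from relation-state.
function buildUnsetUpdate(relationId) {
  // MongoDB ignores the value given to $unset; "" is the usual placeholder.
  return { $unset: { ["relation-state." + relationId]: "" } };
}

const unsetUpdate = buildUnsetUpdate(26);
console.log(JSON.stringify(unsetUpdate));

// In the mongo shell, with the unit agent stopped and the original
// document saved somewhere first:
//   db.unitstates.update({ "_id": "<document id>" }, unsetUpdate)
```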
hmlachilleasa:  all good here yes?15:55
achilleasahml: seems so :-) a variant of the race when migrating 2.7.x uniter state to the controller15:57
achilleasathere was a phantom unit in the member list for a relation15:57
achilleasaand the uniter refused to start15:57
hmlachilleasa: huh, wonder how that happened… was something else going on at the same time as the upgrade?15:58
achilleasalooks like it got triggered after an upgrade and remove-unit (there were still other units in the relation)16:00
achilleasa(that is on a 2.8.1)16:00
hmlinteresting16:00
kelvinliutlm: https://github.com/juju/juju/pull/12141 got this pr to fix the primary SA issue, could u take a look? ty23:51
kelvinliuor wallyworld hpidcock anyone free, +1 plz ty23:56
hpidcocksure23:56
wallyworldlooking23:57
hpidcockkelvinliu: LGTM23:58
wallyworldkelvinliu: so it was a regression from 2.7?23:58
