[00:12] <teward> bryceh: around?
[00:13] <teward> oh nvm i can't read
[00:13] <teward> carry on
[05:12] <cpaelzer> good morning
[06:22] <jamespage> coreycb: hey - would you have time to complete the submitter information for https://bugs.launchpad.net/ubuntu/+source/jaraco.context/+bug/1975600
[06:22] <jamespage> then I can complete the MIR team review for you
[12:39] <lvoytek> Good morning
[13:03] <ahasenack> kanashiro: I started this discussion: https://lists.clusterlabs.org/pipermail/users/2022-May/030296.html
[13:04] <ahasenack> there is a node remove command that works, but I'm kind of leaning towards a full cluster removal when making changes. Depending on what you change, you may get away with it, or you may end up with phantom data
[13:05] <ahasenack> `pcs cluster destroy` does a lot of things, it goes over /var/lib/pcsd, /var/lib/pacemaker and removes many files
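(Editor's note: a rough sketch of the kind of cleanup `pcs cluster destroy` performs on a node. The exact steps and file list depend on the pcs version, so the commands and paths below are illustrative only, not the tool's actual implementation.)

```sh
# Approximation of what `pcs cluster destroy` cleans up on one node
# (illustrative; the real tool handles more cases and more files).
systemctl stop pacemaker corosync pcsd      # stop the cluster stack first
rm -f /etc/corosync/corosync.conf           # cluster membership definition
rm -rf /var/lib/pcsd/*                      # pcsd tokens and settings
rm -rf /var/lib/pacemaker/cib/*             # saved CIB states (cib-N.raw, etc.)
```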
[14:20] <kanashiro> ahasenack, maybe the charms would be better using the high-level cluster management tools like pcs and crmsh instead of doing all of this manually (?)
[14:20] <ahasenack> maybe
[14:20] <ahasenack> but a bigger change
[14:20] <ahasenack> I think the key thing is changing nodeid, not just the name
[14:20] <ahasenack> if you keep the nodeid the same, and then change the name, all is fine (testing that now)
[14:22] <kanashiro> right, it makes sense, but from one of the answers in the thread I think that if you restart corosync first and then pacemaker, all should be fine
[14:22] <kanashiro> did you test that?
[14:22] <ahasenack> the phantom node is all about pacemaker, yes
[14:23] <ahasenack> I've been doing "systemctl restart pacemaker corosync", unsure if the order in that command line affects things
[14:23] <ahasenack> but after the package is installed, both are running; nothing can be done about that easily, other than policy-rc.d
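(Editor's note: the policy-rc.d mechanism mentioned above is the standard Debian/Ubuntu way to stop invoke-rc.d from starting daemons during package installation. The sketch below shows the general pattern; whether it fits the charm's install flow is an assumption.)

```sh
# Block service auto-start while installing, then start once the config is final.
cat > /usr/sbin/policy-rc.d <<'EOF'
#!/bin/sh
exit 101   # 101 = "action forbidden": invoke-rc.d will not start services
EOF
chmod +x /usr/sbin/policy-rc.d

apt-get install -y corosync pacemaker
# ... write the final corosync.conf here ...

rm /usr/sbin/policy-rc.d
systemctl start corosync pacemaker
```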
[14:24] <ahasenack> so the "contamination" with node1 happens right after install
[14:24] <kanashiro> so a possible minimum change to fix this would be to create a dependency between the pacemaker and corosync systemd services(?)
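(Editor's note: a sketch of what such a dependency could look like as a systemd drop-in. Upstream pacemaker.service may already ship a similar After=/Requires= on corosync, so check `systemctl cat pacemaker` before adding this; the file name is made up for the example.)

```sh
# Order pacemaker strictly after corosync via a drop-in (illustrative).
mkdir -p /etc/systemd/system/pacemaker.service.d
cat > /etc/systemd/system/pacemaker.service.d/corosync-order.conf <<'EOF'
[Unit]
After=corosync.service
Requires=corosync.service
EOF
systemctl daemon-reload
```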
[14:24] <ahasenack> I have vague recollections of nish doing that in the past, and suffering a lot
[14:25] <ahasenack> it involved creating a file in one maintainer script and checking for that file in another maintainer script
[14:25] <ahasenack> inter-package RPC :)
[14:26] <kanashiro> if we think this is too much we can at least document this in the server guide, so once we see this happening we can point users to it
[14:26] <ahasenack> even in the case where you keep the nodeid the same, and crm status is clean, the "node1" node is still referenced in old cib files
[14:26] <ahasenack> which seems right, if I understand it correctly
[14:27] <ahasenack> what I don't get yet is, let's say I deploy 3 nodes
[14:27] <ahasenack> all 3 get node1, nodeid=1 (default pkg install)
[14:27] <ahasenack> then in node1 I change name to be hostname, keep nodeid=1, adjust ring0_addr
[14:27] <ahasenack> and add the other 2 nodes to the config, with ids 2 and 3
[14:27] <ahasenack> and send the config to them via scp, and restart everything
[14:28] <ahasenack> I don't get why changing nodeid from 1 to 2 and 3 in the *other* nodes doesn't introduce the same problem
[14:28] <ahasenack> maybe because nodeid 1 is still around, it just has another name, and is no longer myself
[14:29] <ahasenack> I go from node1/id1, node1/id1, node1/id1 to f1/id1, f2/id2, f3/id3
[14:29] <ahasenack> (fN being the new names)
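(Editor's note: an illustrative corosync.conf nodelist fragment matching the scenario described above: the original node keeps nodeid 1 but takes its real hostname, and two more nodes are added with ids 2 and 3. Hostnames and addresses are made up for the example.)

```
nodelist {
    node {
        name: f1
        nodeid: 1
        ring0_addr: 10.0.0.11
    }
    node {
        name: f2
        nodeid: 2
        ring0_addr: 10.0.0.12
    }
    node {
        name: f3
        nodeid: 3
        ring0_addr: 10.0.0.13
    }
}
```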
[14:30] <kanashiro> I *think* that in this case the cluster has quorum and they vote to make sure that node does not exist. In a single-node cluster I am not sure when to consider it quorate
[14:30] <ahasenack> if I change node1/id1 to f1/id101, then node1 is still in the list, but offline, even with the 3 nodes
[14:30] <ahasenack> f1/id101 does not replace node1/id1
[14:31] <ahasenack> and in reality, id1 really disappeared from the cluster in that case, no other node assumed id1
[14:31] <ahasenack> hence it shows offline
[14:31] <ahasenack> by "disappeared" I mean there is no host anymore responding to pings on id1
[14:31] <ahasenack> ok, I may be starting to get this
[14:31] <ahasenack> the charm does change the node ids too
[14:32] <ahasenack> from 1 to 1001 or something like that
[14:32] <ahasenack> 2 to 1002, and so on
[14:32] <ahasenack> the approach they took to fix it might be the simplest one after all. Pre-seed a config file
[14:32] <ahasenack> it's like one of the responses in the thread, don't start pacemaker until the config is final
[14:32] <ahasenack> achieves the same
[14:35] <kanashiro> I think that's the main takeaway here: do not start pacemaker until everything in corosync is set
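(Editor's note: a sketch of the "config first, pacemaker last" ordering discussed above, assuming the services were prevented from auto-starting at install time, e.g. via the policy-rc.d trick earlier. File names are illustrative.)

```sh
# Finalize corosync membership before pacemaker ever records any node.
install -m 0644 corosync.conf.final /etc/corosync/corosync.conf
systemctl restart corosync      # bring membership up with the final names/ids
corosync-cfgtool -s             # optional sanity check of ring status
systemctl start pacemaker       # only now let pacemaker see the cluster
```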
[14:45] <ahasenack> each cib-N.raw file in /var/lib/pacemaker/cib/ is like a state, right. I can diff between them to see what changed
[14:45] <ahasenack> there is probably a corosync/pacemaker (or crmsh/pcs?) command to show that, I've seen some "diff" commands in some help output
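(Editor's note: pacemaker ships crm_diff for comparing two CIB XML files, which should work on the saved cib-N.raw states; option names are per crm_diff(8), but verify locally. The file numbers below are just an example.)

```sh
# Show what changed between two saved CIB states.
cd /var/lib/pacemaker/cib
crm_diff --original cib-4.raw --new cib-5.raw
```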
[14:55] <ahasenack> messing with these attributes in a live cluster is dangerous
[14:56] <ahasenack> May 25 14:54:59 f3 pacemaker-controld[6239]:  warning: Node 'node1' and 'f1' share the same cluster nodeid: 1 f1
[14:56] <ahasenack> May 25 14:54:59 f3 pacemaker-controld[6239]:  error: crm_find_peer: Forked child 6391 to record non-fatal assert at membership.c:590 : member weirdness
[19:43] <sergiodj> kanashiro: hey, is https://bugs.launchpad.net/ubuntu/+source/openvpn/+bug/1975574 the bug you mentioned you were going to take a look during our housekeeping call today?
[19:43] <ahasenack> sounds like it
[19:44] <sergiodj> I will mark it as server-todo and bump its priority to high, just in case
[19:44] <sergiodj> ah, sorry
[19:44] <sergiodj> Lucas already did that, but I had opened the URL before his update
[19:44] <sergiodj> kanashiro: nevermind :)
[19:57] <kanashiro> :)
[21:55] <giu--> hi to all