Skip to content

recipes: Leader election#60

Merged
maltesander merged 24 commits into
stackabletech:mainfrom
rodio:leader-election
May 29, 2026
Merged

recipes: Leader election#60
maltesander merged 24 commits into
stackabletech:mainfrom
rodio:leader-election

Conversation

@rodio
Copy link
Copy Markdown
Contributor

@rodio rodio commented Apr 23, 2026

Implement leader elections recipe from #10

rodio added 4 commits April 21, 2026 18:58
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
@stackable-cla
Copy link
Copy Markdown

stackable-cla Bot commented Apr 23, 2026

CLA assistant check
All committers have signed the CLA.

@rodio rodio marked this pull request as ready for review April 25, 2026 10:42
@rodio
Copy link
Copy Markdown
Contributor Author

rodio commented Apr 25, 2026

I've marked it as ready for review because apparently no one receives notifications about drafts?

@maltesander
Copy link
Copy Markdown
Member

Hi @rodio ,

thank you for the contribution. I had a look, i think the general approach looks good.

I see two problems (not sure if Draft related):

  • The oneshot is wrong and cannot model all cases from https://zookeeper.apache.org/doc/current/recipes.html. Leadership is an ongoing state, not a one-shot event. Needed for session-loss semantics and the acknowledgment-ZNode pattern. I would go with some sort of watch::<LeaderState> for that.
  • Furthermore i would split the LeaderElection like:
use tokio::sync::watch;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LeaderState {
	// volunteered, watching predecessor
    Pending,
    // at index 0
    Leader,
    // session/candidacy ended; terminal
    Resigned,
}

/// Configuration for participating in a leader election.
#[derive(Debug, Clone)]
pub struct LeaderElection {
    election_node: String,
    zk: ZooKeeper,
}

/// An active candidacy. Drop to stop observing (the ephemeral znode
/// is still only removed when the underlying session ends).
#[derive(Debug)]
pub struct Candidate {
    my_path: String,
    state: watch::<LeaderState>,
    ...
}

impl Candidate {
    pub fn path(&self) -> &str { &self.my_path }
    pub fn state(&self) -> LeaderState { self.state.borrow().clone() }

    /// Resolves once we become leader, or returns `Err` if we lose
    /// the session before that happens.
    pub async fn wait_for_leadership(&mut self) -> Result<(), ElectionError> {
        loop {
            match *self.state.borrow_and_update() {
                LeaderState::Leader => return Ok(()),
                // Should probably even be split into something like `SessionLost` and `Withdrawn`
                LeaderState::Resigned   => return Err(ElectionError::SessionLost),
                LeaderState::Pending => {}
            }
            self.state.changed().await.map_err(|_| ElectionError::SessionLost)?;
        }
    }
}

With a volunteer impl like:

pub async fn volunteer(&self) -> Result<Candidate, ElectionError> {...}

So the candidate only exists after you volunteered and my_path is never an Option.

This will help with a couple of your TODOs:

  • The TOCTOU fix is just if stat.is_none() { continue; } (would block otherwise)
  • TODO could be ActiveLeaderElection?

Some other things:

  • proper GUID handling
  • sort children by sequence number suffix, not full string (If GUID differ, largest j < i)
  • "cant find myself" -> probe own path, retry on lag and transition to terminal if ephermeral is gone.

Malte

@maltesander maltesander self-requested a review April 26, 2026 16:32
@rodio
Copy link
Copy Markdown
Contributor Author

rodio commented Apr 27, 2026

Hi, @maltesander! Thank you for the thorough review!

I'd like to argue a bit here and defend the oneshot approach.

The "Resigned" state that you mentioned in your code snippet is not a valid state at all and should not be exposed to users. You have the reason in your comment there: Drop to stop observing (the ephemeral znode is still only removed when the underlying session ends). So resigning from the leader election process without dropping the connection would leave other nodes with an "impression" that you're still participating. When your node becomes the node with the lowest ID everyone would just be sitting there assuming you're the leader, but you don't care. You've resigned but but no one knows about it. So it puts the whole system into an invalid state where it is permanently with a leader that does not perform its "duties".

In fact, if users really want to stop participating they should drop their connection like I did in the test. The connection and the participation should be always be coupled and it should not be possible to drop one without dropping the other. This way there'll be no "orphan" ephemeral nodes. And the way to do it is the one-shot approach: you either are the leader, or wait for the leadership or drop the whole connection.

@maltesander
Copy link
Copy Markdown
Member

Hi @rodio,

yeah, i agree that "Resigned" is wrong and we shouldnt expose "Withdrawal".

Id still argue about involuntary withdrawal? E.g. Session loss (which the candidate did not choose)?

Consider: A candidate becomes leader, fires Leader on the oneshot. Application starts doing leader work. A little later, the ZK session expires (network partition, GC pause, ZK server failover, whatever). The ephemeral znode is gone. Some other node is now the leader. The original application is still doing leader work, because nobody told it otherwise. The oneshot already fired and can't fire again.

Without withdrawal, if we go with the oneshot we should document explicitly that we have to watch connect() for session expiry. Then it should treat itself as no longer Leader, regardless of the oneshot?

Im fine with both approaches.

Malte

@rodio
Copy link
Copy Markdown
Contributor Author

rodio commented Apr 27, 2026

I think you're right that we need to monitor session loss after and I think it is better to hide it inside the implementation of the recipe because it probably is something that almost everyone needs. I'll think more about it later and we'll see what I can come up with. Thanks!

rodio added 10 commits May 1, 2026 12:55
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
@rodio
Copy link
Copy Markdown
Contributor Author

rodio commented May 4, 2026

@maltesander I've pushed what I've come up whith. volunteer() now returns the reciever part of tokio::watch and an abort handle to abort leader election. Unfortunately I was not able to guarantee that the ephemeral nodes are dropped if it is aborted, but this is mentioned in docs.

Copy link
Copy Markdown
Member

@maltesander maltesander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rodio, thank you for the fixes. Some proposals and questions.

Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
rodio added 6 commits May 5, 2026 18:45
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
@rodio
Copy link
Copy Markdown
Contributor Author

rodio commented May 11, 2026

@maltesander I've applied your suggestions, please have a look whenever you have time. Thanks!

Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Copy link
Copy Markdown
Member

@maltesander maltesander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rodio, sorry for the late answer.

Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs Outdated
Comment thread src/recipes/leader/mod.rs
rodio and others added 2 commits May 21, 2026 17:40
Signed-off-by: Rodion Borovyk <rodion.borovyk@gmail.com>
Co-authored-by: maltesander <malte.sander.it@gmail.com>
@rodio rodio requested a review from maltesander May 21, 2026 15:43
Copy link
Copy Markdown
Member

@maltesander maltesander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rodio, LGTM!.

@maltesander maltesander added this pull request to the merge queue May 29, 2026
Merged via the queue into stackabletech:main with commit 084e0c5 May 29, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants