Zookeeper learning series: Fourth, the relationship between Paxos algorithm and zookeeper

Zookeeper learning series: 4. the relationship between Paxos algorithm and zookeeper

1. The origin of the problem

Taobao search blog  http://www.searchtb.com/2011/01/zookeeper-research.html   mentions that Paxos is the soul of zookeeper

There is an article titled " Zookeeper full analysis-Paxos as the soul " as the title, which is considered to be the basis of zookeeper:


Google's Chubby and Apache's Zookeeper are all based on its theory. Paxos is considered to be the only distributed consensus algorithm so far, and other algorithms are improvements or simplifications of Paxos. There is a question to mention. Paxos has a premise: there is no Byzantine general problem. That is to say, Paxos can only be established in a trusted computing environment, and this environment will not be destroyed by intrusion.


The Paxos  algorithm is a message-passing consensus algorithm proposed by Leslie Lamport in 1990. The Paxos algorithm solves the problem of how a distributed system can agree on a certain value (resolution).

The part-time parliament  Paxos Made Simple describes the execution process of the Paxos algorithm as follows:

  1. prepare phase:
    1. The proposer selects a proposal number n and sends the prepare request to a majority of acceptors;
    2. After the acceptor receives the prepare message, if the number of the proposal is greater than all the prepare messages it has already responded to, the acceptor will reply to the proposer with the proposal it accepted last time, and promise not to reply to proposals less than n;
  2. Approval stage:
    1. When a proposer receives the reply from most acceptors to prepare, it enters the approval stage. It must send an accept request to acceptors who reply to the prepare request, including the number n and the value determined by P2c (if there is no value accepted by P2c, then it can freely determine the value).
    2. Under the premise of not violating its commitment to other proposers, the acceptor will accept the request after receiving the accept request.

There are two stages on the wiki, but three stages in the paper, and by default there is a proposer equivalent to the leader. The heroes listed the third stage (http://www.wuzesheng.com/?p=2724):

3. Learn stage:

  • When each Acceptor is in agreement, it is necessary to notify all Learners of the consistent result.

Zookeeper uses org.apache.zookeeper.server.quorum.FastLeaderElection as its default election algorithm

This article http://csrd.aliapp.com/?p=162&replytocom=782 directly uses the paxos implementation as the title, mentioning that zookeeper uses the paxos algorithm when electing the leader (mainly fast paxos)

I happened to see someone rebutting below:

Wei Jiangwen: "

FastLeaderElection is not Paxos at all, nor is it an implementation of Fast Paxos. The FastLeaderElection source code is far from Paxos's paper.

There is also a leader election problem in Paxos and FastPaxos algorithms.

For zookeeper, FastLeaderElection is just equivalent to leader election in Paxos.


 2. Data verification

Okay, check the information, analyze the source code and start the investigation

The first is Wei Jiangwen's rebuttal :

There is a very common misunderstanding that the leader election algorithm in zookeeper is paxos or fast paxos. The leader election algorithm is not paxos or fast paxos, please consider the following facts:

  1. There is no the concept of proposal number in the leader election in zookeeper. the proposal number is a key concept to paxos. Some one think the epoch is the proposal number, but different followers may produce proposal with the same epoch which is not allowed in paxos.
  2. Fast paxos requires at least 3t + 1 acceptors, where t is the number of servers which are allowed to fail [3]. This is conflict with the fact that a zookeeper cluster with 3 servers works well even if one server failed.
  3. The leader election algorithm must make sure P1. However paxos does provide such guarantee.
  4. In paxos, a leader is also required to achieve progress. There are some similarities between the leader in paxos and the leader in zookeeper. Even if more than one servers believe they are the leader, the consistency is preserved both in zookeeper and in paxos. this is very clearly discussed in [1] and [2]. 

Then the author's three comparisons

1) Mailing list

Our protocol instead, has only two phases, just like a two-phase 
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in 
which we have a single proposer as we can run phase 1 for multiple instances at 
a time, which is what Ben called previously Multi-Paxos, I believe. The trick 
with skipping phase 1 is to deal with leader switching. 

2) Book interview

We made a few interesting observations about Paxos when contrasting it to Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us . Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.

Zk's distributed consensus algorithm has a name called Zab

3) Thesis

We use an algorithm that shares some of the character- istics of Paxos, but that combines transaction logging needed for consensus with write-ahead logging needed for data tree recovery to enable an efficient implementa- tion. 

 3. leader election analysis

In my understanding, first of all, the paxos algorithm cannot be used in elections. Whether the president is selected in paxos or the leader is selected by zk, there is a difference between making a proposal for most people to agree. Only by selecting the master can there be no concurrent proposal conflicts with multiple proposers

Who is going to make a proposal as the proposer? It is the premise for the paxos algorithm to proceed. When a proposal is made to allow most followers to agree, an algorithm similar to paxos can be used to achieve consistency. Zookeeper uses the Zab algorithm to achieve consistency.

Zk's master selection strategy:

there can be at most one leader (proposer) at any time, and we guarantee this by making sure 
that a quorum of replicas recognize the leader as a leader by committing to an 
epoch change. This change in epoch also allows us to get unique zxids since the 
epoch forms part of the zxid. 

Each server has an id, and when a submitted transaction is received, it has a zxid. As the updated data changes, the transaction number increases, and the server id is different. First choose the leader with the largest zxid. If the zxid cannot be compared, choose the leader with the largest server id.

The zxid contains an epoch number. The epoch indicates the period when a server is the leader, and it increases with the birth of a new leader.

Look at the code again:

4. analysis of zookeeper data update principle

After understanding the method of master selection, let's understand the method of synchronizing data. The synchronization data uses the Zab protocol: Zookeeper Atomic broadcast protocol, which is a protocol similar to two-phase submission:

  1. The leader sends a PROPOSAL message, p, to all followers.
  2. Upon receiving p, a follower responds to the leader with an ACK, informing the leader that it has accepted the proposal.
  3. Uponreceivingacknowledgmentsfromaquorum(thequorumincludestheleader itself), the leader sends a message informing the followers to COMMIT it. 

The difference with paxos is that leaer is sent to all followers, not most, and all followers must confirm and notify the leader, not most.

Guarantee mechanism: Two transactions broadcast in sequence, T and Tʹ, T must be submitted before Tʹ takes effect. If one server submits T and Tʹ, all other servers must also submit T before Tʹ.

5. the leader's exploration

In order to solve the problem of leader crash and avoid multiple leaders causing transaction chaos, the Zab algorithm guarantees:

1. When a new transaction is opened, the leader must submit all transactions submitted during the last epoch

2. At no time will two leaders gain enough supporters at the same time.

The starting state of a new leader requires the agreement of most servers

6. observer

The third role in zk, the difference between observer and follower is that they have no right to vote. It is mainly 1. Existence for the read request scalability of the system 2. To meet the deployment requirements of multiple computer rooms, the central computer room deploys leader and follower, and other computer rooms deploy observers, and the read configuration is first read locally.

7. summary

I think zookeeper can only be said to be affected by the paxos algorithm, the role division is similar, the proposal is similar in the way of passing, and the implementation is simpler and more intuitive. The selection of the leader is based on voteid (server-id) and zxid for size priority sorting, and information synchronization uses two-phase commit. The leader submits the transaction and updates the status after obtaining all the consent of the follower. The observer role is to increase system throughput and satisfy cross-machine room deployment.


[1] Reed, B., & Junqueira, FP (2008). A simple totally ordered broadcast protocol. 2.Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2008). Yorktown Heights, NY: ACM. ISBN: 978-1 -60558-296-2. [2] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 1825. [3] F. Junqueira, Y. Mao, and K. Marzullo. Classic paxos vs. fast paxos: caveat emptor. In Proceedings of the 3rd USENIX/IEEE/IFIP Workshop on Hot Topics in System Dependability (HotDep.07). Citeseer, 2007.

[4]O'Reilly.ZooKeeper.Distributed process coordination.2013

[5] http://agapple.iteye.com/blog/1184023   zookeeper project use some summary

Reference: https://cloud.tencent.com/developer/article/1067416 zookeeper learning series: 4. The relationship between Paxos algorithm and zookeeper-Cloud + Community-Tencent Cloud