Demystifying ZooKeeper — Group Membership Demo

Yasoob Haider
6 min readFeb 11, 2023
A pair of glasses focusing on a far away object

Do you keep hearing ZooKeeper in context of various distributed systems, but are not quite sure how it functions? After reading this article, that would no longer be the case. We will go over a simple example of coordination — group membership — using ZooKeeper. And I promise to keep it super simple.

This example will give you a feel of ZooKeeper internals, and it will build the intuition required for you to grasp more complex coordination primitives like leader election (which I will explain in a later post).

ZooKeeper is a high performance coordination service for distributed applications. Meaning, applications running more than one instance of themselves can use ZooKeeper to coordinate with each other. One such coordination problem is identifying who are the participants in a distributed system.

Group Membership is a classical problem which can arise in any distributed system. For example, Apache Solr maintains the list of available replicas using ZooKeeper.

We will break this article into five small parts (none longer than a couple of minutes read):

Sample Application

Let‘s say we have a simple application which is receiving logs from other applications, and storing it on disk for later retrieval. Let’s take one leader and two replicas in the initial setup.

Little green boxes are the log records

The obvious problem for the leader here is that when the application starts up on some server, it has no idea where the replicas are. So it doesn’t know where to replicate the logs at. And even if we put the initial replicas in some config or database for the leader to read from, we will have to keep updating it as replicas join and leave the group.

Now we could build this. It is not too complex. We would need a storage where to put the current members, and upon any membership changes we would need to notify the leader that something has changed, and it should re-read the members config.

But that’s exactly what ZooKeeper provides us out of the box. So without reinventing the wheel, let’s see how we can leverage ZooKeeper to solve our problem.

ZooKeeper — high level description

We can think of Apache ZooKeeper as a special, highly consistent file system where:

  • Each file (called znode, affectionately) can store a small amount of data.
  • Each znode can have child znodes.

For this demo, all we need to understand are the basic ZooKeeper commands, given in figure below. But if you’re interested, the official documentation is a good place to dive deeper into specific topics.

The commands given above should be self-explanatory.

But there are two more teeny-tiny ZooKeeper concepts that we need to understand for implementing group membership, and they’re quite straightforward.

Ephemeral Znode: Ephemeral is just a fancy word for temporary. When we create a znode, we can specify that it should be “ephemeral”, or temporary. The behaviour is that as soon as the session which created this znode ends, the znode is deleted. Poof!

Watch: Watch-ing a znode lets us know, unsurprisingly, if the znode data has been modified, or any its children have been created or deleted. (Note that watching the parent does not tell us if the data of a child has been modified).

You may have noticed that there is no command for setting up watches in the table above. That’s because it is passed as a flag in create, exists, getChildren, getData calls.

Group Membership Logic

Now, as promised, the simple implementation of membership:

[Left] R1, R2, R3 are replicas; [Right] ZooKeeper state
  1. At startup, all replicas create an entry at /membership/replica<x>. In this znode, they put their details —the host and port where they can be contacted.
  2. Then, in a single call, they getChildren of the parent znode at /membership, and setup a watch on the parent znode.
  3. All the replicas now know who the other replicas are, because either the replica executed getChildren last, or if it didn’t, it received a watch notification, when another replica created its entry.
  4. Then they chill for a while, doing their own work, until …
  5. A replica dies, and the ephemeral znode which the replica created is deleted, and all replicas get notified that the children of the parent znode have changed. (Something similar happens when a replica joins).
  6. All replicas now go back and execute Step 2.

That’s it. You have group membership implemented!

The leader knows all the current replicas, and is able to send log records to them. The replicas also know the leader, and each other, but that’s not being used in our application.

Group Membership Implemented

Code

(Not interested in the code? Skip to the last section for some extra tidbits!)

And, here is the Golang code which implements this logic. The code should make sense even if you don’t know Golang. Just make sure to read the comments. I’m using go-zookeeper/zk’s implementation of ZooKeeper client.

Logic for becoming a member
Logic for keeping the members updated

The above code will not run as is, as I’ve taken out pieces for clarity. I’ll update this post with the full code base soon.

That was easy, right? Now if you understood all this, you have a functional understanding of how to write a simple application using ZooKeeper. And you can stop here.

Or continue to the next section for a few more tidbits!

Zookeeper Guarantees

What we have not covered till now are the guarantees that ZooKeeper provides about its APIs which can be used to argue why this setup is fool-proof.

Without those guarantees, the system will not function properly.

For example, this would be bad scenario:

[r1-r3] indicates the replica that sent the command.

1. [r1] create replica1;
2. [r1] get members;
3. [r2] create replica2; <-- missed by [r1]
4. [r3] create replica3; <-- missed by [r1]
5. [r1] watch parent by replica1;
6. [r2] get members; watch parent by replica2;
7. [r3] get members; watch parent by replica3;

If this was possible, r1 would miss the creation of r2 and r3.

But ZooKeeper guarantees that the get parent; watch parent by replica<x>; will be atomic. That is, the client will get the watch notification for parent znode, before any of its children are modified.

Given below is a random order in which the commands given by the three replicas may execute correctly.

[r1-r3] indicates the replica that sent the command.

1. [r1] create replica1;
2. [r1] get members; watch parent by replica1; <-- is atomic
3. [r2] create replica2;
4. [r3] create replica3;
5. [r2] get members; watch parent by replica2;
6. [r3] get members; watch parent by replica3;

ZooKeeper guarantees are pretty well documented in their official documentation — here and here. Do give it a read.

If you have any comments or question, please let me know in comments.

That’s all folks! If this helped you, do show your appreciation and long press that Clap button :).

--

--