Demystifying ZooKeeper — Group Membership Demo
Do you keep hearing about ZooKeeper in the context of various distributed systems, but aren't quite sure how it works? After reading this article, that will no longer be the case. We will go over a simple example of coordination, group membership, using ZooKeeper. And I promise to keep it super simple.
This example will give you a feel for how ZooKeeper works, and it will build the intuition you need to grasp more complex coordination primitives like leader election (which I will explain in a later post).
ZooKeeper is a high-performance coordination service for distributed applications. Meaning, applications running more than one instance of themselves can use ZooKeeper to coordinate with each other. One such coordination problem is identifying who the participants in a distributed system are.
Group membership is a classic problem that arises in any distributed system. For example, Apache Solr maintains its list of available replicas using ZooKeeper.
We will break this article into five small parts (none longer than a couple of minutes' read):
- Set up a sample application to lay out the problem.
- Describe ZooKeeper at a high level.
- Walk through the algorithm for implementing group membership.
- Implement it in code.
- Extra fun stuff!
Sample Application
Let's say we have a simple application which receives logs from other applications and stores them on disk for later retrieval. Let's take one leader and two replicas in the initial setup.
The obvious problem for the leader here is that when the application starts up on some server, it has no idea where the replicas are, so it doesn't know where to replicate the logs. And even if we put the initial replicas in some config or database for the leader to read from, we would have to keep updating it as replicas join and leave the group.
Now, we could build this ourselves; it is not too complex. We would need some storage to hold the current members, and upon any membership change we would need to notify the leader that something has changed so that it re-reads the member list.
But that’s exactly what ZooKeeper provides us out of the box. So without reinventing the wheel, let’s see how we can leverage ZooKeeper to solve our problem.
ZooKeeper — A High-Level Description
We can think of Apache ZooKeeper as a special, highly consistent file system where:
- Each file (affectionately called a znode) can store a small amount of data.
- Each znode can have child znodes.
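For example, once our demo application is up, its znode tree might look something like this (the names and data shown are illustrative):

```
/
└── membership
    ├── replica1   (data: host1:9001)
    └── replica2   (data: host2:9001)
```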
For this demo, all we need to understand are the basic ZooKeeper commands, summarized in simplified form below. But if you're interested, the official documentation is a good place to dive deeper into specific topics.
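- `create /path data`: create a znode at `/path`, optionally storing some data in it
- `delete /path`: delete the znode at `/path`
- `exists /path`: check whether a znode exists at `/path`
- `getData /path` and `setData /path data`: read and write a znode's data
- `getChildren /path`: list the children of the znode at `/path`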
The commands given above should be self-explanatory.
But there are two more teeny-tiny ZooKeeper concepts that we need to understand for implementing group membership, and they’re quite straightforward.
Ephemeral Znode: Ephemeral is just a fancy word for temporary. When we create a znode, we can specify that it should be “ephemeral”, or temporary. The behaviour is that as soon as the session which created this znode ends, the znode is deleted. Poof!
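As a quick illustration, here is a minimal sketch of creating an ephemeral znode with the go-zookeeper/zk Go client used later in this post. The server address, znode path, and data are placeholders, and the parent /membership znode is assumed to already exist:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	// Connect to a local ZooKeeper server with a 5-second session timeout.
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// zk.FlagEphemeral ties the znode's lifetime to this session: the
	// moment the session ends, ZooKeeper deletes the znode. Poof!
	path, err := conn.Create("/membership/replica1", []byte("host1:9001"),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err != nil {
		panic(err)
	}
	fmt.Println("created ephemeral znode:", path)
}
```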
Watch: Watching a znode lets us know, unsurprisingly, if the znode's data has been modified, or if any of its children have been created or deleted. (Note that watching the parent does not tell us if the data of a child has been modified.)
You may have noticed that there is no separate command for setting up watches in the list above. That's because a watch is requested via a flag on the `exists`, `getChildren`, and `getData` calls.
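In the go-zookeeper/zk client, that flag surfaces as separate W-suffixed variants of these calls (`ExistsW`, `GetW`, `ChildrenW`). Here is a minimal sketch using `ChildrenW`; it assumes `conn` is an already-connected `*zk.Conn`, and that `fmt` and `github.com/go-zookeeper/zk` are imported:

```go
// watchMembers reads the children of /membership and, in the same call,
// sets a watch on the parent znode. It then blocks until the watch fires.
func watchMembers(conn *zk.Conn) ([]string, error) {
	children, _, watchCh, err := conn.ChildrenW("/membership")
	if err != nil {
		return nil, err
	}
	fmt.Println("current members:", children)

	// ZooKeeper watches are one-shot: the channel delivers a single event
	// (e.g. zk.EventNodeChildrenChanged), after which the watch must be
	// re-registered by calling ChildrenW again.
	event := <-watchCh
	fmt.Println("watch fired:", event.Type)
	return children, nil
}
```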
Group Membership Logic
Now, as promised, the simple implementation of membership:
1. At startup, every replica creates an ephemeral entry at `/membership/replica<x>`. In this znode, it puts its details: the host and port where it can be contacted.
2. Then, in a single `getChildren` call, it fetches the children of the parent znode `/membership` and sets up a watch on that parent znode.
3. All the replicas now know who the other replicas are: either a replica executed `getChildren` last, or, if it didn't, it received a watch notification when another replica created its entry.
4. Then they chill for a while, doing their own work, until…
5. A replica dies. The ephemeral znode it created is deleted, and all replicas get notified that the children of the parent znode have changed. (Something similar happens when a replica joins.)
6. All replicas now go back and execute Step 2.
That’s it. You have group membership implemented!
The leader knows all the current replicas and is able to send log records to them. The replicas also know the leader, and each other, though our application doesn't use that.
Code
(Not interested in the code? Skip to the last section for some extra tidbits!)
And here is the Golang code which implements this logic. The code should make sense even if you don't know Golang; just make sure to read the comments. I'm using the go-zookeeper/zk implementation of the ZooKeeper client.
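Below is a minimal sketch of the replica side. The ZooKeeper address, the replica's name, and the host:port details it stores are placeholder assumptions, and the actual log-handling work is omitted:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

const parent = "/membership"

func main() {
	// Connect to ZooKeeper; the address is a placeholder for this demo.
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ensure the persistent parent znode exists (flags = 0 means persistent).
	_, err = conn.Create(parent, nil, 0, zk.WorldACL(zk.PermAll))
	if err != nil && err != zk.ErrNodeExists {
		panic(err)
	}

	// Step 1: announce ourselves with an ephemeral znode holding our
	// host:port details, so the entry disappears automatically if we die.
	_, err = conn.Create(parent+"/replica1", []byte("host1:9001"),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err != nil {
		panic(err)
	}

	for {
		// Step 2: get the current members and set a watch on the parent
		// znode, atomically, in a single ChildrenW call.
		members, _, watchCh, err := conn.ChildrenW(parent)
		if err != nil {
			panic(err)
		}
		fmt.Println("current members:", members)

		// Steps 3-5: do our own work until the watch fires because a
		// replica joined or left, then loop back to Step 2.
		event := <-watchCh
		fmt.Println("membership changed:", event.Type)
	}
}
```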
This is only a skeleton; I've left the application pieces out for clarity. I'll update this post with the full code base soon.
That was easy, right? Now if you understood all this, you have a functional understanding of how to write a simple application using ZooKeeper. And you can stop here.
Or continue to the next section for a few more tidbits!
ZooKeeper Guarantees
What we have not covered so far are the guarantees that ZooKeeper provides about its APIs, which can be used to argue why this setup is foolproof.
Without those guarantees, the system would not function properly. For example, this would be a bad scenario:
[r1-r3] indicates the replica that sent the command.
1. [r1] create replica1;
2. [r1] get members;
3. [r2] create replica2; <-- missed by [r1]
4. [r3] create replica3; <-- missed by [r1]
5. [r1] watch parent by replica1;
6. [r2] get members; watch parent by replica2;
7. [r3] get members; watch parent by replica3;
If this were possible, r1 would miss the creation of r2 and r3.
But ZooKeeper guarantees that the `get members; watch parent by replica<x>;` pair is atomic: the watch is registered in the same operation that reads the members, so no change to the parent znode's children can slip in between the read and the watch.
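In the Go client used above, this atomic pair is literally a single call: `ChildrenW` returns the current children and registers the watch as part of the same request, so there is no window between reading the members and setting the watch.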
Given below is one possible order in which the commands sent by the three replicas execute correctly.
[r1-r3] indicates the replica that sent the command.
1. [r1] create replica1;
2. [r1] get members; watch parent by replica1; <-- is atomic
3. [r2] create replica2;
4. [r3] create replica3;
5. [r2] get members; watch parent by replica2;
6. [r3] get members; watch parent by replica3;
ZooKeeper's guarantees are well documented in the official documentation (here and here). Do give it a read.
If you have any comments or questions, please let me know in the comments.
That’s all folks! If this helped you, do show your appreciation and long press that Clap button :).