Setting up an etcd cluster with DNS discovery may be challenging. There are several building blocks:

  • Etcd – a distributed key value store
  • Amazon EC2 – cloud computing provider
  • Cloudflare – DNS provider
  • Chef – for configuring individual nodes

Each of them has its pitfalls; we will guide you through the whole process.

DNS discovery

Any clustered system needs a way to maintain the list of nodes in the cluster. Usually you need to specify all cluster members when starting a node. This is the way ZooKeeper and Consul work. Effectively you have redundancy in configuration: the list of nodes is stored on every node. The list must be consistent, and it is difficult to keep it that way, especially if the cluster lives a long life. Old nodes break and get removed, new nodes get added, and the cluster size may grow or shrink over time. All of that makes cluster configuration maintenance cumbersome and error-prone.

DNS discovery is a killer feature of Etcd. Effectively it means you keep the list of cluster nodes in one place. All interested parties – cluster nodes, clients, monitoring agents – get it from DNS. There is a single copy of the list, so there is no chance of getting an inconsistent cluster view. The Etcd team should advertise this feature in capital letters on their website.

Process overview

Three nodes will form the cluster. This is the minimum cluster size that tolerates one node failure, whether expected or not: the two remaining nodes still form a majority.
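
The arithmetic behind that choice is short. Here is a minimal sketch, assuming the usual majority-quorum rule that Raft-based stores such as Etcd rely on; the function names are purely illustrative:

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def failure_tolerance(cluster_size):
    """Number of nodes that can be lost while a majority survives."""
    return cluster_size - quorum(cluster_size)


# Print size, quorum and tolerated failures for small clusters.
for size in (1, 2, 3, 4, 5):
    print(size, quorum(size), failure_tolerance(size))
```

Clusters of one or two nodes tolerate no failures, three nodes tolerate one, and a fourth node adds nothing over three.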

For each node we start an Amazon EC2 instance and create its DNS records.

When all instances are ready and DNS is prepared, we start Etcd on all three nodes simultaneously.

Etcd cluster setup algorithm

Starting Amazon EC2 instance

We will illustrate each step with excerpts from real code. We will skip the unimportant parts of the code, so copy&paste won’t work. 🙂

The function setup_new_cluster() starts a cluster of a given size.
It calls launch_ec2_instance() in a loop; that function, in turn, starts an EC2 instance and waits until it is reachable via SSH.
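
The shape of that loop can be sketched as follows. This is not the original code: wait_for_ssh() is a hypothetical helper, and setup_new_cluster() here takes a launch_instance callable as a stand-in for launch_ec2_instance(), because the actual EC2 call is not shown.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=300.0, interval=5.0,
                 connect=socket.create_connection):
    """Poll until a TCP connection to the SSH port succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            connect((host, port), timeout=interval).close()
            return True
        except OSError:
            # Instance is still booting or sshd is not listening yet.
            time.sleep(interval)
    return False


def setup_new_cluster(size, launch_instance):
    """Launch `size` nodes one by one.

    launch_instance(index) is a hypothetical callable that starts one
    instance, waits for SSH, and returns a description of the node.
    """
    return [launch_instance(index) for index in range(size)]
```

An open SSH port only shows that the daemon accepts connections; it says nothing about whether logins already work.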

Creating DNS records

Three DNS records must be created for each node:

  1. An A record that resolves the host name to an IP address
  2. An SRV record that tells which TCP port serves other cluster nodes
  3. An SRV record that tells which TCP port serves client requests

Eventually we should end up with three groups of DNS records.

The _etcd-server._tcp SRV records are what a cluster node requests when it wants to know who its peers are and which ports they listen on.

The _etcd-client._tcp SRV records are what a client requests when it wants to talk to the cluster and needs to know which host names and ports to connect to.

And finally, the A records resolve those host names.
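
To make the three record groups concrete, here is a small sketch that prints them in zone-file notation. The domain example.com, the node names, the 192.0.2.x addresses and the TTL are placeholders, not the original records; ports 2380 and 2379 are Etcd's default peer and client ports.

```python
def discovery_records(domain, nodes, peer_port=2380, client_port=2379, ttl=300):
    """Return zone-file style lines for Etcd DNS discovery.

    `nodes` maps a short host name to its IP address.
    """
    lines = []
    # SRV records that peers look up to find each other.
    for host in nodes:
        lines.append(
            f"_etcd-server._tcp.{domain}. {ttl} IN SRV 0 0 {peer_port} {host}.{domain}."
        )
    # SRV records that clients look up to find the cluster.
    for host in nodes:
        lines.append(
            f"_etcd-client._tcp.{domain}. {ttl} IN SRV 0 0 {client_port} {host}.{domain}."
        )
    # A records that resolve the SRV targets.
    for host, ip in nodes.items():
        lines.append(f"{host}.{domain}. {ttl} IN A {ip}")
    return lines


if __name__ == "__main__":
    placeholder_nodes = {
        "node1": "192.0.2.11",
        "node2": "192.0.2.12",
        "node3": "192.0.2.13",
    }
    print("\n".join(discovery_records("example.com", placeholder_nodes)))
```

For three nodes this produces nine lines: three per group.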

At TwinDB we use CloudFlare to host our DNS zone. CloudFlare provides an API, which we are going to use.

For reference, this is the code that works with the CloudFlare API:
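
What follows is only a minimal sketch, not the original listing. It assumes CloudFlare's v4 REST API with API-token (Bearer) authentication and covers the A record only; the function name, zone ID and token are hypothetical, and SRV records need a different payload shape that is best checked against the API reference.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_a_record_request(zone_id, api_token, name, ip, ttl=120):
    """Build (but do not send) the HTTP request that creates an A record."""
    payload = {
        "type": "A",
        "name": name,
        "content": ip,
        "ttl": ttl,
        "proxied": False,  # cluster nodes must be reachable directly
    }
    return urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/dns_records",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is a matter of passing the returned object to urllib.request.urlopen() and checking the JSON response for success.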

It takes time before the DNS changes we made propagate and become visible on a node. We should wait until DNS is ready:
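
One way to wait is to poll until every host name resolves. The sketch below is an assumption about how such a wait could look, not the original code; wait_for_dns() is a hypothetical name, and it checks A records only, because SRV lookups need a DNS library outside the standard library.

```python
import socket
import time


def wait_for_dns(hostnames, timeout=600.0, interval=10.0,
                 resolve=socket.gethostbyname):
    """Return True once every name resolves, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    pending = set(hostnames)
    while True:
        for name in sorted(pending):
            try:
                resolve(name)
            except OSError:
                # socket.gaierror is a subclass of OSError: not resolvable yet.
                continue
            pending.discard(name)
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Running the check on the node itself matters, since that is the resolver Etcd will use when it starts.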

At this point we should have three Amazon EC2 instances up and running and the DNS records ready.

Bootstrapping Etcd node

We use the Chef recipe for an etcd cluster. There are two gotchas with the recipe:

  1. By default it installs the ancient Etcd version 2.2.5, which is buggy.
  2. The recipe installs an init script that will fail when you start the first node (see Bug#63 for details). By the way, I had received no feedback from the Chef team as of the time of writing, but they didn’t forget to send me a bunch of cold sales calls and spam. They could take the Etcd team as an example: they’re extremely responsive, even on weekends.

Etcd recipe attributes

We need to specify only one attribute: the domain name.

Etcd recipe

When the recipe is ready (we use our own Chef server), we can bootstrap the three cluster nodes. Remember, we need to start them simultaneously.

Code to bootstrap one node:
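
The sketch below is a stand-in, not the original code. It assumes nodes are bootstrapped with knife bootstrap and that running the commands from parallel threads is close enough to "simultaneously"; the function names, node names and the run list "recipe[etcd]" are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def knife_bootstrap_command(fqdn, node_name, run_list):
    """Build the knife command line that bootstraps one node."""
    return ["knife", "bootstrap", fqdn,
            "--node-name", node_name,
            "--run-list", run_list]


def bootstrap_cluster(nodes, run_list, run=subprocess.check_call):
    """Bootstrap all nodes in parallel.

    `nodes` maps a Chef node name to the host's FQDN. `run` executes one
    command; it defaults to subprocess.check_call and raises on failure.
    """
    commands = [knife_bootstrap_command(fqdn, name, run_list)
                for name, fqdn in nodes.items()]
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(run, commands))
    return commands
```

A call might look like bootstrap_cluster({"node1": "node1.example.com"}, "recipe[etcd]"), with one entry per cluster node.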

Checking health of Etcd cluster

Now we can communicate with the Etcd cluster from any host that has etcdctl installed:
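
As a hedged illustration rather than the original command, the check can be wrapped like this. It assumes the etcdctl v2 command set, where the --discovery-srv option and the cluster-health command are available (etcdctl v3 uses endpoint health instead); example.com is a placeholder domain, and treating a zero exit status as "healthy" is an assumption.

```python
import subprocess


def cluster_health(domain, run=subprocess.run):
    """Run `etcdctl cluster-health`, locating the cluster through DNS SRV records.

    Returns (healthy, output), where healthy is True if etcdctl exited with 0.
    """
    result = run(
        ["etcdctl", "--discovery-srv", domain, "cluster-health"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    healthy, output = cluster_health("example.com")
    print(output)
    print("healthy" if healthy else "NOT healthy")
```

Note that the client needs nothing but the domain name: the member list comes from the same DNS records the nodes use.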

Have a good service discovery!