Confident deployments in Docker Swarm

Like many companies, Issuu has embraced containerization in our development and deployment process. This area is dominated by Docker, so it is no surprise that we use it as well. There is plenty of discussion on the pros and cons of Docker already, so today I want to cover a different aspect of using Docker containers: Docker Swarm.

Docker Swarm

Docker Swarm can be thought of as the Docker Engine you’re using on your local machine, scaled to a cluster of machines. This makes Swarm very convenient: the containers you already know how to build need little to no change to run in the cloud.

The terminology is a bit confusing at first so here’s a short rundown of the terms and what they mean:

Swarm is the entirety of your cloud deployment: the workers, the manager nodes, etc. You deploy to a swarm and the swarm decides on which machines your containers end up running.
Service is the Swarm equivalent of a container. It represents a single container definition, running as one or more replicas on one or more machines.
Stack is the Swarm equivalent of a Docker Compose project. It represents a number of services, linked to each other and deployed using a single configuration.

Where we started

When we started using Docker Swarm we didn’t yet have the experience to know how best to use the system, so we started with the simplest solution. Since a service is like a container running on multiple hosts, let’s just start with this:

docker -H swarmhost service create \
  --name my-service \
  --detach=false \
  image-name:build_number

This was quite easy, right? The detach option is very useful, as it allows your calling process to wait until Docker considers the service to be done deploying. After it has finished you can call a command to make sure your deployment worked and your service is running.

Unfortunately, reality is a bit messier than that, so you usually end up having to configure a few more things.

docker -H swarmhost service create \
  --name my-service \
  --mount type=bind,source=/dev/log,target=/dev/log \
  --detach=false \
  --replicas 8 \
  --update-delay 10s \
  --update-parallelism 2 \
  --update-failure-action rollback \
  --restart-max-attempts 1000 \
  --restart-window 1h \
  --limit-cpu 1 \
  --reserve-memory 256m \
  --limit-memory 1g \
  --log-driver syslog \
  --log-opt syslog-facility=local5 \
  --log-opt tag=my-service \
  image-name:build_number \
  my_command

Okay, that required a few more options, but these are set once, right? Right?

Unfortunately not. This is only the create command, so it only works if the service you’re attempting to create does not already exist. A simple workaround would be to delete the service and recreate it, but then you throw away zero-downtime deployments.

Okay, so let’s also write the update statement:

docker -H swarmhost service update \
  --detach=false \
  --replicas 8 \
  --update-delay 10s \
  --update-parallelism 2 \
  --update-failure-action rollback \
  --restart-max-attempts 1000 \
  --restart-window 1h \
  --limit-cpu 1 \
  --reserve-memory 256m \
  --limit-memory 1g \
  --log-driver syslog \
  --log-opt syslog-facility=local5 \
  --log-opt tag=my-service \
  --image image-name:build_number \
  --args my_command \
  my-service

Duplication everywhere

What you’ll notice here is that the options are mostly the same, but not quite: you need to specify --args and --image now, and don’t use --name anymore. Another annoying issue is that --mount cannot be set on an update. It has to be altered with --mount-add and --mount-rm instead, but you’d need to know the current state of the mounts to know whether to add or remove one. Your deployment step would therefore need to be quite smart about what is currently deployed and what isn’t.
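
To make the duplication concrete, here is a minimal Python sketch of what a deployment script would have to do to derive both command lines from a single description of the service. The function name service_command and its parameters are hypothetical and not part of any tool mentioned in this post. Mounts are left out on purpose, since handling them would additionally require knowing what is currently deployed.

```python
from typing import Dict, List


def service_command(exists: bool, name: str, image: str,
                    args: List[str], options: Dict[str, str]) -> List[str]:
    """Build the docker CLI argument list to create or update one service."""
    # Options such as replicas or limit-memory are spelled the same in both cases.
    shared = [f"--{flag}={value}" for flag, value in options.items()]
    if exists:
        # update: image and command become flags, the service name goes last
        return ["docker", "service", "update", *shared,
                f"--image={image}", f"--args={' '.join(args)}", name]
    # create: the name is a flag, image and command are positional
    return ["docker", "service", "create", f"--name={name}", *shared, image, *args]
```

Even this small helper has to know whether the service already exists, which is exactly the kind of state tracking the deployment script would have to take on.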

Also, what happens if you want to deploy multiple services? Adding them is reasonably easy, but for removing services that you don’t use anymore you can’t just remove the docker service incantations from your deployment script and expect them to be gone. You have to either go in manually and remove services or make your deployment process even smarter to track the set of services that should be deployed versus the set of services that are deployed.

If only there were a way to tell Docker exactly what you want your services to look like. As it turns out, there is!

A potential cure

Imagine our excitement when we saw that all this complexity can just go away because of Docker Stack! With Docker Stack, services can be declared in a docker-compose.yml file: which containers to deploy, which images, which mount points, how many instances, and how much memory and CPU to assign. It can also track which services you’ve removed from your docker-compose.yml and automatically remove them.

So our example config looks roughly like this:

version: '3.3'

services:
  service:
    image: 'image-name:build_number'
    command: ['my_command']
    logging:
      driver: syslog
      options:
        syslog-facility: local5
        tag: my-service
    volumes:
      - type: bind
        source: /dev/log
        target: /dev/log
    deploy:
      replicas: 8
      resources:
        limits:
          cpus: "1"
          memory: 1gb
        reservations:
          memory: 256mb
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: rollback
      restart_policy:
        max_attempts: 1000
        window: 1h

From here the deployment is pretty simple:

docker -H swarmhost \
  stack deploy \
  --prune \
  --compose-file docker-compose.yml \
  my-stack

This defines a stack called my-stack which contains one service. Stacks aren’t magical in Docker: this stack creates a service named my-stack_service (the format is <stack-name>_<service-name>), which can then be inspected with the regular tools that are available for services.

The biggest advantage of the stack is that stack deploy declares what the stack is supposed to look like, and the stack is then created or modified to match the docker-compose.yml. There is no need to tell Docker what the changes to the services should be, only how the services should be configured.

A nice side effect of declaring it this way is that the format is commonly known among Docker users. New hires will be able to understand and modify docker-compose.yml files far more easily than a complicated in-house deployment system with dozens of docker calls.

Confident?

You might have noticed that there is no --detach=false option anymore, so the docker command exits immediately (there is a bug report asking for it to be implemented). This poses a problem: the deployment may not have finished when the command returns, so it is not possible to check right afterwards whether it was successful. Finding out whether your deployment finished successfully suddenly becomes quite hard. You don’t want your stack deploy to be rolled back without being notified about it. In the worst case your code might not have been deployed in ages because every deploy was rolled back without you noticing.

Obviously this is not a state that we were happy about, so we decided to bite the bullet and write some tooling to help us with it.

Designing a solution

One of the lessons we have learned at Issuu is that if you don’t specifically need to write a custom wrapper, you should not write one. This is what made us decide against writing a very smart build system (docker stack can do that for us instead), and it is why we don’t want to implement our own stack deploy: we value using standard components in the ways they were meant to be used.

Therefore we decided to write a tool that addresses only the issues we had with stack deploy: waiting for a stack to finish deploying (be it successfully or with a rollback), and determining whether what is currently deployed matches what we expect to be deployed, signaling failure otherwise.

Enter sure-deploy, a tool we released as free software for everybody to use!

Confident deployments

The way sure-deploy works is very simple overall: it polls the UpdateStatus field of every service in a stack (a field in the JSON that Docker outputs when inspecting a service) and waits for all services to converge to either a finished or a rolled-back status. Thus it emulates the --detach=false option of docker service, but for docker stack.
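
The polling idea can be sketched in a few lines of Python. This is an illustration only and not how sure-deploy itself is implemented (that is written in OCaml). The function names are hypothetical, fetch_services stands in for whatever returns the inspected services of the stack as parsed JSON, and treating only completed and rollback_completed as finished is an assumption of this sketch.

```python
import time
from typing import Callable, Dict, List

# UpdateStatus states that this sketch treats as "no longer changing".
FINISHED_STATES = {"completed", "rollback_completed"}


def update_state(service: Dict) -> str:
    """Return the UpdateStatus state of one inspected service, or 'missing'."""
    return (service.get("UpdateStatus") or {}).get("State", "missing")


def has_converged(services: List[Dict]) -> bool:
    """True once every service reports a finished update or rollback."""
    return all(update_state(s) in FINISHED_STATES for s in services)


def wait_until_converged(fetch_services: Callable[[], List[Dict]],
                         poll_interval: float = 1.0,
                         timeout: float = 300.0) -> bool:
    """Poll until all services have converged; False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_converged(fetch_services()):
            return True
        time.sleep(poll_interval)
    return False
```

Note that a service without an UpdateStatus field never counts as finished here, so the wait ends in a timeout rather than reporting success.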

This brings us to the point where we know that the deployment is finished, but we still don’t know whether it was successful or has been rolled back. This is why sure-deploy has a second command that verifies whether the state of the stack matches the docker-compose.yml, thus detecting whether the preceding deployment was successful or rolled back.
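
As a rough illustration of such a verification step, the following Python sketch compares the images declared in an already parsed docker-compose.yml with the images of the inspected services. It is not sure-deploy’s actual code. The function names are hypothetical, the input shapes follow the JSON of docker service inspect, and stripping a trailing @sha256 digest from the deployed image is an assumption about how the deployed value may look.

```python
from typing import Dict, List


def deployed_image(service: Dict) -> str:
    """Image of an inspected service, without a pinned @sha256 digest."""
    image = service["Spec"]["TaskTemplate"]["ContainerSpec"]["Image"]
    return image.split("@", 1)[0]


def mismatched_services(stack: str, compose: Dict, deployed: List[Dict]) -> List[str]:
    """Names of services that are missing or run a different image than declared."""
    by_name = {s["Spec"]["Name"]: s for s in deployed}
    mismatches = []
    for short_name, definition in compose["services"].items():
        # Stack services are named <stack-name>_<service-name>.
        full_name = f"{stack}_{short_name}"
        service = by_name.get(full_name)
        if service is None or deployed_image(service) != definition["image"]:
            mismatches.append(full_name)
    return mismatches
```

An empty result means every declared service runs the declared image; anything else points to a rollback or a deployment that never happened.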

Since the tool needs to run in multiple environments, be it local development machines or build environments, we wanted something that is easy to get working and has as few moving parts as possible, preferably a single binary. We also wanted it to start up quickly, so users wouldn’t be frustrated by long waits to launch VMs or download dependencies.

Fortunately one of the languages we’re using in our team, OCaml, ticks these boxes: native binaries and fast startup. We also get a great environment to write parsers in (useful for implementing the Docker-specific YAML template variables) as well as a powerful static type system that helped us make sure we’re covering all kinds of failure cases.

To avoid having to install the binaries on all build hosts and make it simpler to use for other teams, we also created a small Docker image (11 MB at the moment of writing) containing pre-built binaries of sure-deploy which can be used like so:

docker run \
  --rm \
  --mount type=bind,src=$(pwd)/docker-compose.yml,dst=/home/opam/docker-compose.yml,readonly \
  issuu/sure-deploy:54 --help

The only complicated part of this command is the boilerplate required to pass the docker-compose.yml file into the container.

Caveats

In general the tool is very helpful: if a deployment is “green”, it is in fact correctly deployed. There are, however, some caveats to be aware of that we do not have much control over:

The initial creation of a stack creates services which do not have an UpdateStatus field, so we have to choose between treating this as successfully created or as failed. We decided to consider it a failure, since we prefer to err on the side of caution rather than inspire false confidence. This is by far the most frustrating issue, but there is no good solution unless Docker starts including UpdateStatus in initial deployments.
The check functionality only compares the image fields of docker-compose.yml with the deployed images. We assume that the update of the rest of the service configuration always works correctly. Since each of our deployments also increases the tag of the image, a successful update of the image also means that the configuration has been applied successfully, so this is less of an issue in practice.
docker stack supports less of the templating syntax than docker-compose: default values are not supported. sure-deploy supports the complete set that docker-compose does. This is more humblebrag than caveat, but we put a lot of effort into doing the right thing.
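
To illustrate what template variables with default values involve, here is a small Python sketch that substitutes the braced forms ${VAR}, ${VAR:-default} and ${VAR-default} as well as the $$ escape. It is a simplified stand-in and not the parser used in sure-deploy; the function name substitute is hypothetical, and unbraced $VAR as well as the other docker-compose forms are deliberately not covered.

```python
import re
from typing import Mapping

# Matches "$$" or "${NAME}", optionally with a ":-default" or "-default" part.
_PATTERN = re.compile(
    r"\$(?:(?P<escaped>\$)"
    r"|\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?:(?P<op>:?-)(?P<default>[^}]*))?\})"
)


def substitute(template: str, env: Mapping[str, str]) -> str:
    """Expand braced variables in template using the values in env."""
    def replace(match: "re.Match[str]") -> str:
        if match.group("escaped"):
            return "$"  # "$$" is a literal dollar sign
        value = env.get(match.group("name"))
        op = match.group("op")
        if op == ":-" and not value:
            return match.group("default")  # used when unset or empty
        if op == "-" and value is None:
            return match.group("default")  # used only when unset
        return value if value is not None else ""
    return _PATTERN.sub(replace, template)
```

With this sketch, an image declared as image-name:${BUILD:-latest} expands to image-name:latest when BUILD is unset and to image-name:54 when BUILD is set to 54.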

Conclusions

This was a long post describing how we went from starting off with new tech to a solution that we’re happy with, so here’s what we recommend:

Write a docker-compose.yml with the configuration of the services you want to deploy. This is perfect to be stored in your source repository.
Deploy your containers with docker stack.
Use sure-deploy converge to wait for the deployment to finish.
Use sure-deploy verify to make sure that your deployment was successful and matches what your docker-compose.yml describes.