Everything you need to know about Docker for data science and ML
Just like human beings require a specific environment to survive and thrive, so too does software.
To function in areas outside of the environment our bodies are designed for, such as the deepest depths of the sea or high up in the sky where the atmosphere is thin, we require specialist "containers" like submarines and spacesuits. Without them, we would simply die.
Similarly, to function in environments that a piece of software isn't designed for, it too needs a container that can isolate it from everything else on the same system. This is exactly what Docker was designed for, and in this article we're going to cover everything you need to know about this highly useful containerization platform, including how you can use it to benefit your own workflows.
What is Docker?
Before we get started, however, we need to make sure that we're all on the same page.
Docker is a software tool for creating and deploying isolated environments (similar in spirit to lightweight virtual machines) for running applications together with their dependencies. There are a few terms you need to be familiar with before we dive into the fundamentals:
- Docker Container: a single instance of the live, running application.
- Dockerfile: a text file with a list of commands to run when creating a Docker image.
- Docker Image: a blueprint for creating containers. All containers created from the same image are exactly alike.
There are many advantages to using Docker in data science and machine learning projects. These include:
- Standardization: The main advantage of using Docker is standardization. You define the parameters of a container once and run it wherever Docker is installed. This in turn provides two more benefits: reproducibility and portability.
- Reproducibility: With Docker, everyone has the same operating system and the same versions of every tool. This avoids the classic problem of an application working on one machine but not another: if it works on one machine, it will work on them all.
- Portability: Portability makes it easy to move from local development to a cluster. In addition, if you're working on open-source data projects, portability lets collaborators bypass environment setup.
- Deployment: Docker makes it easier to deploy ML models on the fly. Need to show external stakeholders where things stand? That's fine: put your model into an API container and deploy it via Kubernetes; job done. (OK, we've simplified this somewhat, but the point is that Docker makes it relatively straightforward to go from iterating in a workflow to deployment in a container.)
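As a rough sketch of that deployment idea, a Dockerfile for a model-serving API container might look like the following. Note that the file names, port, and `serve.py` entry point are illustrative assumptions, not details from the article:

```dockerfile
# Illustrative base image with Python preinstalled
FROM python:3.11-slim

WORKDIR /app

# Install the API's Python dependencies first so this layer caches well
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model artifact and a hypothetical serving script
COPY model.pkl serve.py ./

# Expose the port the API listens on and start the server
EXPOSE 8080
CMD ["python", "serve.py"]
```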
By this point, you might be thinking, "Why should I care?" Well, keep in mind that more and more systems rely on Docker as the trend of ML containerization continues to grow. Getting to grips with it now will make you a better ML engineer in the long term, as it helps you turn your ML projects into applications and deploy models into production.
How can I create a Docker container?
Now that you have an idea of what Docker is, let's go through the process of creating a Docker container. It follows a three-step flow:
- Dockerfile: instructions for compiling an image.
- Docker Image: the compiled artifact.
- Docker Container: an executed instance of the image.
1. Dockerfile
First things first, we need a set of instructions. That's because Docker is instruction-based, not requirement-based: you need to describe the how rather than the what. To do this, we create a text file and name it "Dockerfile".
The FROM command describes a base environment, eliminating the need to start from scratch. If you don't have a base image, you can find a wide selection on Docker Hub or in other public container registries. The RUN command is an instruction that changes the environment.
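For example, a minimal Dockerfile using these two commands might look like the following. The specific base image and libraries are illustrative assumptions:

```dockerfile
# Start from an illustrative Python base image
FROM python:3.11-slim

# Change the environment by installing Python libraries one by one
RUN pip install numpy
RUN pip install pandas
RUN pip install scikit-learn
```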
Although the example we're sharing installs Python libraries one by one, this isn't how it should be done. Best practice is to use a requirements.txt file, which defines the Python dependencies. You can learn more about this in our previous blog post, What ML teams should know about Python dependencies.
The COPY command copies a file from your local disk, e.g., the requirements.txt file, into the image. The RUN command then installs all the Python dependencies defined in requirements.txt.
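Put together, the requirements.txt-based version of the Dockerfile might look like this (again, the base image is an illustrative choice):

```dockerfile
FROM python:3.11-slim

# Copy the dependency list from the local disk into the image
COPY requirements.txt .

# Install every dependency defined in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
```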
2. Docker Image
Now that we've got our Dockerfile, we can compile it into a binary artifact known as a Docker image. The reason for compiling the Dockerfile is simple: it makes builds faster and reproducible. If we didn't compile the Dockerfile, standardization and reproducibility would be compromised, leading to the problems we touched on earlier.
To compile your Dockerfile, use the build command:
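With the image name and tag used in this article, the build command looks like this (run it from the directory containing the Dockerfile):

```shell
# Build an image from the Dockerfile in the current directory
docker build -t myimage:1.0 .
```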
This command builds an image on your machine. The -t parameter defines the image name, in this case "myimage", and gives it a tag, "1.0". You can list all the images by running the command:
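```shell
# List all images on this machine
docker image ls
```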
These images, also known as "snapshots" in other virtual machine settings, capture a Docker environment at a certain point in time. The key thing about Docker images is that they're immutable: they cannot be changed, only deleted.
This is critical in the Docker world, because once you've set up your environment and created an image, you can be certain that the image will always function the same way, making it simple to experiment with new features.
3. Docker Container
As we explained earlier, containers are what protect your application from other applications that exist on the same machine. The instructions can either be embedded into the image or provided before starting the container. To do the latter, run the command:
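Assuming the myimage:1.0 image built above, a run command with instructions provided at start time might look like this:

```shell
# Start a container from the image, run a command inside it, and exit
docker run myimage:1.0 echo "Hello, Docker!"
```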
This command starts the container, runs an echo command, and then shuts the container down. We now have a reproducible method for executing our code in any environment that supports Docker: no matter what machine is being used, as long as Docker is available, the code will run. This level of standardization and reproducibility is critical in data science, where every project has several dependencies.
Containers shut themselves down once they've executed their instructions. That said, they can also run for a long time, which you can control by starting a long-running background command, for example:
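```shell
# Run the container in the background (-d) with a long-lived command
docker run -d myimage:1.0 sleep infinity
```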
By running the command docker container list, you'll be able to see whether it's running. To stop a container, take the container ID from the table and call docker stop <ID>. This stops the container but keeps its state. Alternatively, to terminate the container completely, call docker rm -f <ID>.
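The full sequence looks like this (the container ID shown is illustrative; use the one from your own listing):

```shell
# Show running containers; note the CONTAINER ID column
docker container list

# Stop the container but keep its state
docker stop 3f2a9c1d4b5e

# Or terminate the container completely
docker rm -f 3f2a9c1d4b5e
```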
Docker vs Python virtual environments
In our recent blog post, What ML teams should know about Python dependencies, we talked about Python virtual environments and how they can form a protective "bubble" between different Python projects in the same local development environment.
Although it may sound like Docker solves the same problem, it doesn't; it solves a similar problem on a different layer. While a Python virtual environment isolates only the Python-related parts of a project, Docker isolates the full software stack. The use cases for Python virtual environments and Docker are therefore different.
As a general rule, and this is something to keep in mind, virtual environments are ideal for developing applications on your local machine, whereas Docker containers are built for collaborative production workloads in the cloud.
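To make the contrast concrete, here is a sketch of the two workflows side by side (the project name, image tag, and `train.py` script are illustrative assumptions):

```shell
# Python virtual environment: isolates Python packages only
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Docker: isolates the entire stack, including OS and system libraries
docker build -t myproject:1.0 .
docker run myproject:1.0 python train.py
```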
Containerized ML just makes sense
The machine learning space moves fast. New research is constantly being implemented in APIs and open-source frameworks. When things evolve this rapidly, keeping up with the latest developments while maintaining quality, consistency, and reliability can seem an insurmountable challenge.
As you have hopefully learned from this article, one way to address this challenge is to move to containerized ML development by leveraging tools like Docker. Given that this enables ML teams to increase portability, achieve greater efficiency, operate more consistently, and develop better applications, it just makes sense, especially when multiple engineers are working on a single project.
Speaking of engineers, Docker enables them to track the different versions of a container image, check who built a version and with what, and roll back to previous versions if necessary. Furthermore, an ML application can continue running even if one of its services is updating, being repaired, or down.