Three things I learned while containerizing a Python API

I’ve been containerizing applications in all kinds of languages and frameworks for a couple of years now. Whilst I’m certainly not a guru, I hadn’t picked up many new things for a while. That changed when I had to create a Docker container definition for a simple Python API, built with FastAPI and backed by a Postgres database. Along the way, I learned a couple of things that I’d like to share.
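To make the examples below concrete, here is roughly the kind of application being containerized: a minimal FastAPI app served with uvicorn. This is an illustrative sketch only (the endpoint and the uvicorn setup are assumptions, not the actual code); the file name app.py matches what the container definitions below expect.

```python
# app.py -- minimal FastAPI application (illustrative sketch only)
import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health() -> dict:
    # Simple liveness endpoint; the real API would also talk to Postgres.
    return {"status": "ok"}


if __name__ == "__main__":
    # Bind to 0.0.0.0 so the API is reachable from outside the container.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```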

Using a virtual environment inside a container

Using a virtual environment in Python is a widespread best practice. Using one inside a container definition? That turns out to be tricky. Imagine the following container definition:
```dockerfile
FROM python:3.9-slim

# Create a virtualenv
RUN python3 -m venv /opt/venv

# Activate the virtualenv
RUN . /opt/venv/bin/activate

# Install dependencies:
COPY requirements.txt .
RUN pip install -r requirements.txt

# Run the application:
COPY app.py .
CMD ["python", "app.py"]
```
In this example, neither installing the dependencies nor running the actual application will use the defined virtual environment, because each RUN command in Docker executes in its own shell and produces its own layer. By the time pip install or your application runs, the virtualenv context set up by activate is gone. Fortunately, there is an elegant solution to this problem. As investigated by Itamar Turner-Trauring in this excellent blog post, the activate script in a virtual environment does little more than tweak your shell prompt to display the environment name and set a couple of environment variables. That last bit is what we actually need!
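Stripped to its essentials, activating a virtual environment boils down to something like this (a simplified sketch; the real activate script also adjusts your prompt and defines a deactivate function):

```bash
# Roughly what "source /opt/venv/bin/activate" does, minus the cosmetics
export VIRTUAL_ENV=/opt/venv
export PATH="$VIRTUAL_ENV/bin:$PATH"  # the venv's python and pip now win the PATH lookup
```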
By manually setting those environment variables, we effectively simulate running the activate script. And, what’s even better, these environment variables remain active across the different RUN commands in our container definition. This assures us that both the pip install and python commands will use our virtual environment.
```dockerfile
FROM python:3.9-slim

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install dependencies:
COPY requirements.txt .
RUN pip install -r requirements.txt

# Run the application:
COPY app.py .
CMD ["python", "app.py"]
```

Combining multi-stage builds and virtual environments

Multi-stage builds are a mechanism that allows us to build very lightweight containers with a minimal size by copying files and artifacts from one container into another. A commonly used pattern is to first define a “build” container, complete with all the libraries and tools required to compile your application. Once compilation is done, you define a “runtime” container, limited to only the essentials needed at runtime, and copy the compiled application from the “build” container into it. None of the actual build dependencies end up in the container you will eventually run in production, which minimizes both the footprint of the container and its attack surface.
```dockerfile
## Both "build" and "runtime" containers will be based on this image
FROM python:3.9-slim as base

## Define the "build" container
FROM base as builder

## Define the "runtime" container
FROM base as runtime

## Copy compiled dependencies from the "build" to "runtime"
COPY --from=builder /opt/venv /opt/venv
```
This technique ties in nicely with the concept of the virtual environment: an isolated workspace where all of your binaries and dependencies are collected. Applying the multi-stage technique means using a “build” container to create a virtual environment and install all required packages into it. This step might also include adding some OS-specific development libraries in case package compilation is required (e.g. for psycopg2). Once the package installation is complete, we copy the entire virtual environment over to the “runtime” container, followed by the application code.
```dockerfile
FROM python:3.9-slim as base

FROM base as builder

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .
RUN pip install wheel && \
    pip install -r requirements.txt

FROM base as runtime

# Create user to run as
RUN adduser --disabled-password dumdum

COPY --from=builder /opt/venv /opt/venv
COPY . /app
RUN chown -R dumdum:dumdum /app

ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
USER dumdum

CMD ["python", "src/app.py"]
```
NOTE: an extra step here could be to install your own application code as a package into the virtual environment during the “build” phase (see solution #2 here). This adheres even more closely to the multi-stage idea, but I had some difficulties making it work, so I’m sticking to copying source code into the runtime container for now.
```dockerfile
FROM python:3.9-slim as base

FROM base as builder

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .
RUN pip install wheel && \
    pip install -r requirements.txt

# Install the application itself as a package into the virtualenv
COPY setup.py .
COPY myapp/ ./myapp/
RUN pip install .

FROM base as runtime

# Create user to run as
RUN adduser --disabled-password dumdum

COPY --from=builder /opt/venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"
USER dumdum

CMD ["app"]
```

Accelerate your builds by optimizing cache usage

During the container build process, Docker stores the results of individual instructions locally on disk; these intermediate results are called layers. When running the build again, Docker can reuse those layers and skip executing an instruction entirely. This is known as Docker layer caching, and you can see it in action in the build output below:
```
Step 1/17: FROM python:3.9-slim as base
 --> 609da079b03a
Step 2/17: FROM base as builder
 --> 609da079b03a
Step 3/17: ENV VIRTUAL_ENV=/opt/venv
 --> Using cache
 --> 5a329e4c794d
Step 4/17: RUN python3 -m venv $VIRTUAL_ENV
 --> Using cache
 --> 6e51426bb86d
Step 5/17: ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 --> Using cache
 --> f9b94a548d28
Step 6/17: RUN apt-get update &&     apt-get install -y build-essential
 --> Using cache
 --> 2d581eb3faf0
Step 7/17: COPY requirements.txt .
 --> Using cache
 --> ad30722902f6
Step 8/17: RUN pip install wheel &&     pip install -r requirements.txt
 --> Using cache
 --> f4c1f3486479
```
An example of this is installing packages using pip. We first copy the requirements.txt file to the container and then execute the pip install command. If no changes were made to the requirements.txt file, subsequent runs will be able to reuse the cached Docker layer and skip package installation altogether.
```dockerfile
RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .
RUN pip install wheel && \
    pip install -r requirements.txt
```
If you’re running this inside a CI tool such as CircleCI, Jenkins or GitLab, however, you’ll most likely start with an empty Docker layer cache. This means that every build will perform all build instructions from scratch (including some long-running package installation or compilation steps), even if nothing changed!
The best way to work around this is to perform a docker pull of the latest version of your image first; this is the quickest way to fill up the Docker layer cache. Running your build will then reuse the cached layers where possible and should run significantly faster. A fun little trick that might save you lots of precious build minutes. Do note that when you’re using multi-stage builds, you should push and pull the “build” container to and from your registry during your builds, too.
The following snippet, retrieved from one of our container build definitions, illustrates how to combine the techniques of multi-stage builds and cache optimization.
```yaml
script:
  - docker pull $BUILDER_IMAGE:latest || true
  - docker pull $CI_REGISTRY_IMAGE:latest || true
  - docker build --cache-from $BUILDER_IMAGE:latest --target builder --tag $BUILDER_IMAGE:latest .
  - docker build --cache-from $CI_REGISTRY_IMAGE:latest --cache-from $BUILDER_IMAGE:latest --target runtime --tag $CI_REGISTRY_IMAGE:$TAG --tag $CI_REGISTRY_IMAGE:latest .
  - docker push $BUILDER_IMAGE:latest
  - docker push $CI_REGISTRY_IMAGE:$TAG
  - docker push $CI_REGISTRY_IMAGE:latest
```
We first pull the latest versions of both the “build” and “runtime” images to warm up our Docker cache. We then run the docker build command for both the “build” and “runtime” definitions (notice the usage of the --cache-from flags), and finish up by pushing both images to the registry with their associated tags.
NOTE: there is another way of doing this that might fit some setups better. If you are using the same set of packages across different applications, it might make sense to use the inheritance features that Docker offers: you create your own base image containing the shared dependencies and build your application images on top of it. When running a Docker build, pulling that base image achieves the same result as described earlier.
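A rough sketch of that approach (the registry path, file names and tags are made up for illustration): the shared base image is built and pushed once, and every application image starts from it.

```dockerfile
# Dockerfile.base -- shared base image, built and pushed once
FROM python:3.9-slim
COPY shared-requirements.txt .
RUN pip install -r shared-requirements.txt
```

Each application then builds on top of it; pulling the base image before the build gives you a warm cache for the shared layers.

```dockerfile
# Dockerfile -- per-application image
FROM registry.example.com/team/python-base:latest
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
```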

References

  1. https://betterprogramming.pub/6-things-to-know-when-dockerizing-python-apps-in-production-f4701b50ca46
  2. https://pythonspeed.com/articles/multi-stage-docker-python/
  3. https://pythonspeed.com/articles/faster-multi-stage-builds/
  4. https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
  5. https://pythonspeed.com/articles/smaller-python-docker-images
  6. https://medium.com/swlh/dramatically-improve-your-docker-build-time-in-gitlab-ci-db0259f1bb08

Written by

Niels Nuyttens

Lead Engineer at NannyML