Aug 23, 2021

Three things I learned whilst containerizing a Python API

I’ve been containerizing applications in all kinds of languages and frameworks for a couple of years now. Whilst I’m certainly not a guru, I hadn’t picked up many new tricks for a while. Until now. I had to create a container definition in Docker for a simple Python API, built with FastAPI and backed by a Postgres database. Along the way, I learned a few things that I’d like to share.

Using a virtual environment inside of a container

Using a virtual environment in Python is a widespread best practice. Using one inside a container definition? That turns out to be tricky. Imagine the following container definition:

FROM python:3.9-slim

# Create a virtualenv
RUN python3 -m venv /opt/venv

# Activate the virtualenv
RUN . /opt/venv/bin/activate

# Install dependencies:
COPY requirements.txt .
RUN pip install -r requirements.txt

# Run the application:
COPY app.py .
CMD ["python", "app.py"]

In this example, neither installing your dependencies nor running your actual application will use the defined virtual environment, because every RUN command in Docker executes in a separate shell process. By the time pip install or your application runs, the virtualenv context is no longer present. However, it turns out there is an elegant solution to this problem. As investigated by Itamar Turner-Trauring in this excellent blog post, the activate script in a virtual environment does little more than tweak your prompt to display the environment name and set a couple of environment variables. That last bit is what we actually need!
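You can reproduce the underlying issue outside Docker: a variable exported in one shell process is invisible to the next, just like state from one RUN instruction does not carry over to the following one. A minimal sketch (the path is illustrative):

```shell
# Each RUN in a Dockerfile executes in its own shell process, so exported
# variables do not survive into the next instruction. Reproduced locally
# with two subshells standing in for two RUN lines:

# "RUN . /opt/venv/bin/activate" -- the export only exists in shell #1:
sh -c 'export VIRTUAL_ENV=/opt/venv; echo "RUN 1 sees: $VIRTUAL_ENV"'

# "RUN pip install ..." -- shell #2 starts fresh, the export is gone:
sh -c 'echo "RUN 2 sees: ${VIRTUAL_ENV:-<unset>}"'
# prints "RUN 2 sees: <unset>" (assuming no venv is active in your own shell)
```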

By manually setting those environment variables, we effectively simulate running the activate script. What’s even better, these environment variables remain active across the different RUN commands in our container definition. This ensures that both the pip install and python commands will use our virtual environment.

FROM python:3.9-slim

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install dependencies:
COPY requirements.txt .
RUN pip install -r requirements.txt

# Run the application:
COPY app.py .
CMD ["python", "app.py"]

Combining multi-stage builds and virtual environments

Multi-stage builds are a mechanism that allows us to build very lightweight containers with a minimal size by copying files and artifacts from one container into another. A commonly used pattern is to first define a “build” container, complete with all the libraries and tools required to compile your application. After compilation is done, you define a “runtime” container, limited to only the essentials you need at runtime, and copy the compiled application over from the “build” container into the “runtime” container. This way, none of the actual build dependencies are present in the container you will eventually run in production, which minimizes both the footprint of the container and any possible attack surface.

## Both "build" and "runtime" containers will be based on this image
FROM python:3.9-slim as base

## Define the "build" container
FROM base as builder

## Define the "runtime" container
FROM base as runtime

## Copy compiled dependencies from the "build" to "runtime"
COPY --from=builder /opt/venv /opt/venv

This technique ties in nicely with the concept of the virtual environment: an isolated workspace where all of your binaries and dependencies are collected. Applying the multi-stage technique means using a “build” container to create a virtual environment and install all required packages into it. This step might also include installing some OS-specific development libraries in case package compilation is required (e.g. for psycopg2). Once package installation is complete, we copy the entire virtual environment over to the “runtime” container, followed by the application code.

FROM python:3.9-slim as base

FROM base as builder

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .

RUN pip install wheel && \
    pip install -r requirements.txt

FROM base as runtime

# Create user to run as
RUN adduser --disabled-password --gecos "" dumdum

COPY --from=builder /opt/venv /opt/venv
COPY . /app

RUN chown -R dumdum:dumdum /app

ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
USER dumdum

CMD ["python", "src/app.py"]

We now have an active virtual environment with installed dependencies and application code in our runtime container, ready to go!

NOTE: an extra step here could be to install your own application code as a package into the virtual environment during the “build” phase (see solution #2 here). This adheres to the multi-stage idea even more, but I had some difficulties making it work, so I’m sticking to copying source code into the runtime container for now.
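For reference, here is a minimal setup.py that would match this approach. The package name and module path are hypothetical, not from the original post; the console_scripts entry is what makes a bare CMD ["app"] work, since pip install drops an app executable into the virtualenv’s bin directory:

```python
# Hypothetical setup.py; `myapp.main:run` is an assumed module path.
from setuptools import setup, find_packages

setup(
    name="myapp",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # `pip install .` creates an `app` script in the venv's bin/,
        # which is what the runtime stage's CMD ["app"] invokes.
        "console_scripts": ["app = myapp.main:run"],
    },
)
```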

FROM python:3.9-slim as base

FROM base as builder

## virtualenv setup
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .

RUN pip install wheel && \
    pip install -r requirements.txt

COPY setup.py .
COPY myapp/ ./myapp/
RUN pip install .

FROM base as runtime

# Create user to run as
RUN adduser --disabled-password --gecos "" dumdum

COPY --from=builder /opt/venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"
USER dumdum

CMD ["app"]

Accelerate your builds by optimizing cache usage

During the container build process, Docker stores the results of some instructions locally on disk; these are called layers. When running the build again, Docker can re-use those layers and skip executing certain instructions completely. This is known as Docker layer caching.

Step 1/17 : FROM python:3.9-slim as base
---> 609da079b03a
Step 2/17 : FROM base as builder
---> 609da079b03a
Step 3/17 : ENV VIRTUAL_ENV=/opt/venv
---> Using cache
---> 5a329e4c794d
Step 4/17 : RUN python3 -m venv $VIRTUAL_ENV
---> Using cache
---> 6e51426bb86d
Step 5/17 : ENV PATH="$VIRTUAL_ENV/bin:$PATH"
---> Using cache
---> f9b94a548d28
Step 6/17 : RUN apt-get update &&     apt-get install -y build-essential
---> Using cache
---> 2d581eb3faf0
Step 7/17 : COPY requirements.txt .
---> Using cache
---> ad30722902f6
Step 8/17 : RUN pip install wheel &&     pip install -r requirements.txt
---> Using cache
---> f4c1f3486479

An example of this is installing packages using pip. We first copy the requirements.txt file into the container and then execute the pip install command. If no changes were made to the requirements.txt file, subsequent builds can reuse the cached Docker layer and skip package installation altogether.

RUN apt-get update && \
    apt-get install -y build-essential

COPY requirements.txt .

RUN pip install wheel && \
    pip install -r requirements.txt

If you’re running this inside a CI tool such as CircleCI, Jenkins or GitLab, however, you’ll most likely start with an empty Docker layer cache. This means that every build will actually perform all build instructions from scratch (including some long-running package installation or compilation steps), even if nothing changed!

The best way to work around this is by performing a docker pull of the latest version of your image first. This is the quickest way to fill up the Docker layer cache. Running your build will now make optimal use of the cached layers and should run significantly faster. A fun little trick that might save you lots of precious build minutes. Do note that when you’re using multi-stage builds, you should push and pull the “build” container to and from your registry too during your builds.

The following snippet, retrieved from one of our container build definitions, illustrates how to combine multi-stage builds with cache optimization.

script:
   - docker pull $BUILDER_IMAGE:latest || true
   - docker pull $CI_REGISTRY_IMAGE:latest || true

   - docker build --cache-from $BUILDER_IMAGE:latest --target builder --tag $BUILDER_IMAGE:latest .
   - docker build --cache-from $CI_REGISTRY_IMAGE:latest --cache-from $BUILDER_IMAGE:latest --target runtime --tag $CI_REGISTRY_IMAGE:$TAG --tag $CI_REGISTRY_IMAGE:latest .

   - docker push $BUILDER_IMAGE:latest
   - docker push $CI_REGISTRY_IMAGE:$TAG
   - docker push $CI_REGISTRY_IMAGE:latest

We first pull the latest versions of both the “build” and “runtime” images to warm up our Docker cache. We then run the docker build command for both the “build” and “runtime” definitions (notice the usage of the --cache-from flags). We finish up by pushing both images to the registry with their associated tags.

NOTE: there is another way of doing this that might fit some setups better. If you are using the same set of packages across different applications, it might make sense to make use of the inheritance features that Docker offers. You can create your own base image and then add to this. In doing so, you will pull this base image when running a Docker build, achieving the same result as described earlier.
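As a sketch, such a shared base image could look like the following; the image and file names are made up. Each application’s Dockerfile then starts FROM this image instead of FROM python:3.9-slim, so the common package layers come down with a single pull:

```dockerfile
# Hypothetical shared base image, built and pushed once, e.g.:
#   docker build -t registry.example.com/python-base:3.9 .
FROM python:3.9-slim

# Packages shared by all applications
COPY requirements-common.txt .
RUN pip install -r requirements-common.txt

# An application image would then begin with:
#   FROM registry.example.com/python-base:3.9
```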
