RUN <build code>
RUN <install test dependencies>
COPY <test data sets and fixtures>
RUN <unit tests>
FROM <base image>
RUN <install dependencies>
COPY <code>
RUN <build code>
CMD, EXPOSE ...
```

* The build fails as soon as an instruction fails

* If `RUN <unit tests>` fails, the build doesn't produce an image

* If it succeeds, it produces a clean image (without test libraries and data)

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/container-cranes.jpg)]

---

name: toc-dockerfile-examples
class: title

Dockerfile examples

.nav[
[Previous part](#toc-tips-for-efficient-dockerfiles)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-advanced-dockerfile-syntax)
]

.debug[(automatically generated title slide)]

---

# Dockerfile examples

There are a number of tips, tricks, and techniques that we can use in Dockerfiles.

But sometimes, we have to use different (and even opposed) practices depending on:

- the complexity of our project,

- the programming language or framework that we are using,

- the stage of our project (early MVP vs. super-stable production),

- whether we're building a final image or a base for further images,

- etc.

We are going to show a few examples using very different techniques.

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## When to optimize an image

When authoring official images, it is a good idea to reduce as much as possible:

- the number of layers,

- the size of the final image.

This is often done at the expense of build time and convenience for the image maintainer; but when an image is downloaded millions of times, saving even a few seconds of pull time can be worth it.

.small[
```dockerfile
RUN apt-get update && apt-get install -y libpng12-dev libjpeg-dev \
 && rm -rf /var/lib/apt/lists/* \
 && docker-php-ext-configure gd --with-png-dir=/usr --with-jpeg-dir=/usr \
 && docker-php-ext-install gd
...
RUN curl -o wordpress.tar.gz -SL https://wordpress.org/wordpress-${WORDPRESS_UPSTREAM_VERSION}.tar.gz \
 && echo "$WORDPRESS_SHA1 *wordpress.tar.gz" | sha1sum -c - \
 && tar -xzf wordpress.tar.gz -C /usr/src/ \
 && rm wordpress.tar.gz \
 && chown -R www-data:www-data /usr/src/wordpress
```
]

(Source: [Wordpress official image](https://github.com/docker-library/wordpress/blob/618490d4bdff6c5774b84b717979bfe3d6ba8ad1/apache/Dockerfile))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## When to *not* optimize an image

Sometimes, it is better to prioritize *maintainer convenience*.

In particular, if:

- the image changes a lot,

- the image has very few users (e.g. only 1, the maintainer!),

- the image is built and run on the same machine,

- the image is built and run on machines with a very fast link ...

In these cases, just keep things simple!

(Next slide: a Dockerfile that can be used to preview a Jekyll / github pages site.)

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

```dockerfile
FROM debian:sid

RUN apt-get update -q
RUN apt-get install -yq build-essential make
RUN apt-get install -yq zlib1g-dev
RUN apt-get install -yq ruby ruby-dev
RUN apt-get install -yq python-pygments
RUN apt-get install -yq nodejs
RUN apt-get install -yq cmake
RUN gem install --no-rdoc --no-ri github-pages

COPY . /blog
WORKDIR /blog

VOLUME /blog/_site

EXPOSE 4000

CMD ["jekyll", "serve", "--host", "0.0.0.0", "--incremental"]
```

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Multi-dimensional versioning systems

Images can have a tag, indicating the version of the image.

But sometimes, there are multiple important components, and we need to indicate the versions for all of them.

This can be done with environment variables:

```dockerfile
ENV PIP=9.0.3 \
    ZC_BUILDOUT=2.11.2 \
    SETUPTOOLS=38.7.0 \
    PLONE_MAJOR=5.1 \
    PLONE_VERSION=5.1.0 \
    PLONE_MD5=76dc6cfc1c749d763c32fff3a9870d8d
```

(Source: [Plone official image](https://github.com/plone/plone.docker/blob/master/5.1/5.1.0/alpine/Dockerfile))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Entrypoints and wrappers

It is very common to define a custom entrypoint.

That entrypoint will generally be a script, performing any combination of:

- pre-flight checks (if a required dependency is not available, display a nice error message early instead of an obscure one in a deep log file),

- generation or validation of configuration files,

- dropping privileges (with e.g. `su` or `gosu`, sometimes combined with `chown`),

- and more.

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## A typical entrypoint script

```bash
#!/bin/sh
set -e

# first arg is '-f' or '--some-option'
# or first arg is 'something.conf'
if [ "${1#-}" != "$1" ] || [ "${1%.conf}" != "$1" ]; then
  set -- redis-server "$@"
fi

# allow the container to be started with '--user'
if [ "$1" = 'redis-server' -a "$(id -u)" = '0' ]; then
  chown -R redis .
  exec su-exec redis "$0" "$@"
fi

exec "$@"
```

(Source: [Redis official image](https://github.com/docker-library/redis/blob/d24f2be82673ccef6957210cc985e392ebdc65e4/4.0/alpine/docker-entrypoint.sh))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Factoring information

To facilitate maintenance (and avoid human errors), avoid repeating information such as:

- version numbers,

- remote asset URLs (e.g. source tarballs) ...

Instead, use environment variables.

.small[
```dockerfile
ENV NODE_VERSION 10.2.1
...
RUN ...
 && curl -fsSLO --compressed "https://nodejs.org/dist/v$NODE_VERSION/node-v$NODE_VERSION.tar.xz" \
 && curl -fsSLO --compressed "https://nodejs.org/dist/v$NODE_VERSION/SHASUMS256.txt.asc" \
 && gpg --batch --decrypt --output SHASUMS256.txt SHASUMS256.txt.asc \
 && grep " node-v$NODE_VERSION.tar.xz\$" SHASUMS256.txt | sha256sum -c - \
 && tar -xf "node-v$NODE_VERSION.tar.xz" \
 && cd "node-v$NODE_VERSION" \
...
```
]

(Source: [Nodejs official image](https://github.com/nodejs/docker-node/blob/master/10/alpine/Dockerfile))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Overrides

In theory, development and production images should be the same.

In practice, we often need to enable specific behaviors in development (e.g. debug statements).

One way to reconcile both needs is to use Compose to enable these behaviors.

Let's look at the [trainingwheels](https://github.com/jpetazzo/trainingwheels) demo app for an example.
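Note that Compose can also layer an override file on top of the base file: a `docker-compose.override.yml`, if present, is merged automatically by `docker-compose up`. A minimal sketch (the service name matches the example on the next slides):

```yaml
# docker-compose.override.yml (hypothetical): merged automatically
# on top of docker-compose.yml, so production settings stay in the
# base file while development tweaks live here.
services:
  www:
    environment:
      DEBUG: 1
    command: python counter.py
```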
.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Production image

This Dockerfile builds an image leveraging gunicorn:

```dockerfile
FROM python
RUN pip install flask
RUN pip install gunicorn
RUN pip install redis
COPY . /src
WORKDIR /src
CMD gunicorn --bind 0.0.0.0:5000 --workers 10 counter:app
EXPOSE 5000
```

(Source: [trainingwheels Dockerfile](https://github.com/jpetazzo/trainingwheels/blob/master/www/Dockerfile))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## Development Compose file

This Compose file uses the same image, but with a few overrides for development:

- the Flask development server is used (overriding `CMD`),

- the `DEBUG` environment variable is set,

- a volume is used to provide a faster local development workflow.

.small[
```yaml
services:
  www:
    build: www
    ports:
      - 8000:5000
    user: nobody
    environment:
      DEBUG: 1
    command: python counter.py
    volumes:
      - ./www:/src
```
]

(Source: [trainingwheels Compose file](https://github.com/jpetazzo/trainingwheels/blob/master/docker-compose.yml))

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

## How do we know which practices are best?

- The main goal of containers is to make our lives easier.

- In this chapter, we showed many ways to write Dockerfiles.

- These Dockerfiles sometimes use diametrically opposed techniques.

- Yet, they were the "right" ones *for a specific situation.*

- It's OK (and even encouraged) to start simple and evolve as needed.

- Feel free to review this chapter later (after writing a few Dockerfiles) for inspiration!

???

:EN:Optimizing images
:EN:- Dockerfile tips, tricks, and best practices
:EN:- Reducing build time
:EN:- Reducing image size

:FR:Optimiser ses images
:FR:- Bonnes pratiques, trucs et astuces
:FR:- Réduire le temps de build
:FR:- Réduire la taille des images

.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/container-housing.jpg)]

---

name: toc-advanced-dockerfile-syntax
class: title

Advanced Dockerfile Syntax

.nav[
[Previous part](#toc-dockerfile-examples)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-reducing-image-size)
]

.debug[(automatically generated title slide)]

---

class: title

# Advanced Dockerfile Syntax

![construction](images/title-advanced-dockerfiles.jpg)

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## Objectives

We have seen simple Dockerfiles to illustrate how Docker builds container images.

In this section, we will give a recap of the Dockerfile syntax, and introduce advanced Dockerfile commands that we might come across or want to use in some specific scenarios.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## `Dockerfile` usage summary

* `Dockerfile` instructions are executed in order.

* Each instruction creates a new layer in the image.

* Docker maintains a cache with the layers of previous builds.
* When there are no changes in the instructions and files making a layer, the builder re-uses the cached layer, without executing the instruction for that layer. * The `FROM` instruction MUST be the first non-comment instruction. * Lines starting with `#` are treated as comments. * Some instructions (like `CMD` or `ENTRYPOINT`) update a piece of metadata. (As a result, each call to these instructions makes the previous one useless.) .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `RUN` instruction The `RUN` instruction can be specified in two ways. With shell wrapping, which runs the specified command inside a shell, with `/bin/sh -c`: ```dockerfile RUN apt-get update ``` Or using the `exec` method, which avoids shell string expansion, and allows execution in images that don't have `/bin/sh`: ```dockerfile RUN [ "apt-get", "update" ] ``` .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## More about the `RUN` instruction `RUN` will do the following: * Execute a command. * Record changes made to the filesystem. * Work great to install libraries, packages, and various files. `RUN` will NOT do the following: * Record state of *processes*. * Automatically start daemons. If you want to start something automatically when the container runs, you should use `CMD` and/or `ENTRYPOINT`. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Collapsing layers It is possible to execute multiple commands in a single step: ```dockerfile RUN apt-get update && apt-get install -y wget && apt-get clean ``` It is also possible to break a command onto multiple lines: ```dockerfile RUN apt-get update \ && apt-get install -y wget \ && apt-get clean ``` .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `EXPOSE` instruction The `EXPOSE` instruction tells Docker what ports are to be published in this image. ```dockerfile EXPOSE 8080 EXPOSE 80 443 EXPOSE 53/tcp 53/udp ``` * All ports are private by default. * Declaring a port with `EXPOSE` is not enough to make it public. * The `Dockerfile` doesn't control on which port a service gets exposed. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Exposing ports * When you `docker run -p ...`, that port becomes public. (Even if it was not declared with `EXPOSE`.) * When you `docker run -P ...` (without port number), all ports declared with `EXPOSE` become public. A *public port* is reachable from other containers and from outside the host. A *private port* is not reachable from outside. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `COPY` instruction The `COPY` instruction adds files and content from your host into the image. ```dockerfile COPY . /src ``` This will add the contents of the *build context* (the directory passed as an argument to `docker build`) to the directory `/src` in the container. 
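Because `COPY` interacts with the build cache (more on this in a few slides), a common pattern is to copy dependency manifests before the rest of the code. A minimal sketch, assuming a hypothetical Node.js project:

```dockerfile
# Copy only the dependency manifests first: the expensive install
# step below stays cached as long as these files don't change.
COPY package.json package-lock.json /src/
WORKDIR /src
RUN npm install

# Copying the rest of the code only invalidates the layers below.
COPY . /src
```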
.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## Build context isolation

Note: you can only reference files and directories *inside* the build context.

Absolute paths are taken as being anchored to the build context, so the two following lines are equivalent:

```dockerfile
COPY . /src
COPY / /src
```

Attempts to use `..` to get out of the build context will be detected and blocked by Docker, and the build will fail.

Otherwise, a `Dockerfile` could succeed on host A, but fail on host B.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## `ADD`

`ADD` works almost like `COPY`, but has a few extra features.

`ADD` can get remote files:

```dockerfile
ADD http://www.example.com/webapp.jar /opt/
```

This would download the `webapp.jar` file and place it in the `/opt` directory.

`ADD` will automatically unpack tar archives (including compressed ones):

```dockerfile
ADD ./assets.tar.gz /var/www/htdocs/assets/
```

This would unpack `assets.tar.gz` into `/var/www/htdocs/assets`.

(Note: zip files are *not* unpacked automatically.)

*However,* `ADD` will not automatically unpack remote archives.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## `ADD`, `COPY`, and the build cache

* Before creating a new layer, Docker checks its build cache.

* For most Dockerfile instructions, Docker only looks at the `Dockerfile` content to do the cache lookup.

* For `ADD` and `COPY` instructions, Docker also checks if the files to be added to the container have been changed.

* `ADD` always needs to download the remote file before it can check if it has been changed.

  (It cannot use, e.g., ETags or If-Modified-Since headers.)

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## `VOLUME`

The `VOLUME` instruction tells Docker that a specific directory should be a *volume*.

```dockerfile
VOLUME /var/lib/mysql
```

Filesystem access in volumes bypasses the copy-on-write layer, offering native performance for I/O done in those directories.

Volumes can be attached to multiple containers, allowing data to be "ported" from one container to another, e.g. to upgrade a database to a newer version.

It is possible to start a container in "read-only" mode. The container filesystem will be made read-only, but volumes can still have read/write access if necessary.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## The `WORKDIR` instruction

The `WORKDIR` instruction sets the working directory for subsequent instructions.

It also affects `CMD` and `ENTRYPOINT`, since it sets the working directory used when starting the container.

```dockerfile
WORKDIR /src
```

You can specify `WORKDIR` again to change the working directory for further operations.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## The `ENV` instruction

The `ENV` instruction specifies environment variables that should be set in any container launched from the image.
```dockerfile
ENV WEBAPP_PORT 8080
```

This will result in the following environment variable being set in any container created from this image:

```bash
WEBAPP_PORT=8080
```

You can also specify environment variables when you use `docker run`.

```bash
$ docker run -e WEBAPP_PORT=8000 -e WEBAPP_HOST=www.example.com ...
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## The `USER` instruction

The `USER` instruction sets the user name or UID to use when running the image.

It can be used multiple times to change back to root or to another user.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## The `CMD` instruction

The `CMD` instruction defines the default command to run when a container is launched from the image.

```dockerfile
CMD [ "nginx", "-g", "daemon off;" ]
```

This means we don't need to specify `nginx -g "daemon off;"` when running the container.

Instead of:

```bash
$ docker run <username>/web_image nginx -g "daemon off;"
```

We can just do:

```bash
$ docker run <username>/web_image
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## More about the `CMD` instruction

Just like `RUN`, the `CMD` instruction comes in two forms. The first executes in a shell:

```dockerfile
CMD nginx -g "daemon off;"
```

The second executes directly, without shell processing:

```dockerfile
CMD [ "nginx", "-g", "daemon off;" ]
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

class: extra-details

## Overriding the `CMD` instruction

The `CMD` can be overridden when you run a container.

```bash
$ docker run -it <username>/web_image bash
```

Will run `bash` instead of `nginx -g "daemon off;"`.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## The `ENTRYPOINT` instruction

The `ENTRYPOINT` instruction is like the `CMD` instruction, but arguments given on the command line are *appended* to the entry point.

Note: you have to use the "exec" syntax (`[ "..." ]`).

```dockerfile
ENTRYPOINT [ "/bin/ls" ]
```

If we were to run:

```bash
$ docker run training/ls -l
```

Instead of trying to run `-l`, the container will run `/bin/ls -l`.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

class: extra-details

## Overriding the `ENTRYPOINT` instruction

The entry point can be overridden as well.

```bash
$ docker run -it training/ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
$ docker run -it --entrypoint bash training/ls
root@d902fb7b1fc7:/#
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## How `CMD` and `ENTRYPOINT` interact

The `CMD` and `ENTRYPOINT` instructions work best when used together.

```dockerfile
ENTRYPOINT [ "nginx" ]
CMD [ "-g", "daemon off;" ]
```

The `ENTRYPOINT` specifies the command to be run and the `CMD` specifies its options. On the command line we can then potentially override the options when needed.
```bash
$ docker run -d <username>/web_image -t
```

This will override the options provided by `CMD` with the new flag.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

## Advanced Dockerfile instructions

* `ONBUILD` lets you stash instructions that will be executed when this image is used as a base for another one.

* `LABEL` adds arbitrary metadata to the image.

* `ARG` defines build-time variables (optional or mandatory).

* `STOPSIGNAL` sets the signal for `docker stop` (`TERM` by default).

* `HEALTHCHECK` defines a command assessing the status of the container.

* `SHELL` sets the default program to use for string-syntax `RUN`, `CMD`, etc.

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

class: extra-details

## The `ONBUILD` instruction

The `ONBUILD` instruction is a trigger. It sets instructions that will be executed when another image is built from the image being built.

This is useful for building images which will be used as a base to build other images.

```dockerfile
ONBUILD COPY . /src
```

* You can't chain `ONBUILD` instructions with `ONBUILD`.

* `ONBUILD` can't be used to trigger `FROM` instructions.

???

:EN:- Advanced Dockerfile syntax
:FR:- Dockerfile niveau expert

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/containers-by-the-water.jpg)]

---

name: toc-reducing-image-size
class: title

Reducing image size

.nav[
[Previous part](#toc-advanced-dockerfile-syntax)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-multi-stage-builds)
]

.debug[(automatically generated title slide)]

---

# Reducing image size

* In the previous example, our final image contained:

  * our `hello` program

  * its source code

  * the compiler

* Only the first one is strictly necessary.

* We are going to see how to obtain an image without the superfluous components.

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

## Can't we remove superfluous files with `RUN`?

What happens if we do one of the following commands?

- `RUN rm -rf ...`

- `RUN apt-get remove ...`

- `RUN make clean ...`

--

This adds a layer which removes a bunch of files.

But the previous layers (which added the files) still exist.

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

## Removing files with an extra layer

When downloading an image, all the layers must be downloaded.

| Dockerfile instruction | Layer size | Image size |
| ---------------------- | ---------- | ---------- |
| `FROM ubuntu` | Size of base image | Size of base image |
| `...` | ... | Sum of this layer + all previous ones |
| `RUN apt-get install somepackage` | Size of files added (e.g. a few MB) | Sum of this layer + all previous ones |
| `...` | ... | Sum of this layer + all previous ones |
| `RUN apt-get remove somepackage` | Almost zero (just metadata) | Same as previous one |

Therefore, `RUN rm` does not reduce the size of the image or free up disk space.
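We can see this with `docker history`, which lists each layer and its size. A quick sketch (the image name is hypothetical):

```bash
docker build -t bloat-demo .
docker history bloat-demo
# The layer created by a `RUN rm -rf ...` step shows up with a size
# close to 0 B, while the earlier layer that added the files keeps
# its full size.
```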
.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Removing unnecessary files Various techniques are available to obtain smaller images: - collapsing layers, - adding binaries that are built outside of the Dockerfile, - squashing the final image, - multi-stage builds. Let's review them quickly. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Collapsing layers You will frequently see Dockerfiles like this: ```dockerfile FROM ubuntu RUN apt-get update && apt-get install xxx && ... && apt-get remove xxx && ... ``` Or the (more readable) variant: ```dockerfile FROM ubuntu RUN apt-get update \ && apt-get install xxx \ && ... \ && apt-get remove xxx \ && ... ``` This `RUN` command gives us a single layer. The files that are added, then removed in the same layer, do not grow the layer size. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Collapsing layers: pros and cons Pros: - works on all versions of Docker - doesn't require extra tools Cons: - not very readable - some unnecessary files might still remain if the cleanup is not thorough - that layer is expensive (slow to build) .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Building binaries outside of the Dockerfile This results in a Dockerfile looking like this: ```dockerfile FROM ubuntu COPY xxx /usr/local/bin ``` Of course, this implies that the file `xxx` exists in the build context. That file has to exist before you can run `docker build`. For instance, it can: - exist in the code repository, - be created by another tool (script, Makefile...), - be created by another container image and extracted from the image. See for instance the [busybox official image](https://github.com/docker-library/busybox/blob/fe634680e32659aaf0ee0594805f74f332619a90/musl/Dockerfile) or this [older busybox image](https://github.com/jpetazzo/docker-busybox). .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Building binaries outside: pros and cons Pros: - final image can be very small Cons: - requires an extra build tool - we're back in dependency hell and "works on my machine" Cons, if binary is added to code repository: - breaks portability across different platforms - grows repository size a lot if the binary is updated frequently .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Squashing the final image The idea is to transform the final image into a single-layer image. This can be done in (at least) two ways. - Activate experimental features and squash the final image: ```bash docker image build --squash ... ``` - Export/import the final image. ```bash docker build -t temp-image . 
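# Create a stopped container from the image, export its flattened
# filesystem, and re-import it as a single-layer image: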
docker run --entrypoint true --name temp-container temp-image docker export temp-container | docker import - final-image docker rm temp-container docker rmi temp-image ``` .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Squashing the image: pros and cons Pros: - single-layer images are smaller and faster to download - removed files no longer take up storage and network resources Cons: - we still need to actively remove unnecessary files - squash operation can take a lot of time (on big images) - squash operation does not benefit from cache (even if we change just a tiny file, the whole image needs to be re-squashed) .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds Multi-stage builds allow us to have multiple *stages*. Each stage is a separate image, and can copy files from previous stages. We're going to see how they work in more detail. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/distillery-containers.jpg)] --- name: toc-multi-stage-builds class: title Multi-stage builds .nav[ [Previous part](#toc-reducing-image-size) | [Back to table of contents](#toc-part-3) | [Next part](#toc-publishing-images-to-the-docker-hub) ] .debug[(automatically generated title slide)] --- # Multi-stage builds * At any point in our `Dockerfile`, we can add a new `FROM` line. * This line starts a new stage of our build. * Each stage can access the files of the previous stages with `COPY --from=...`. * When a build is tagged (with `docker build -t ...`), the last stage is tagged. * Previous stages are not discarded: they will be used for caching, and can be referenced. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds in practice * Each stage is numbered, starting at `0` * We can copy a file from a previous stage by indicating its number, e.g.: ```dockerfile COPY --from=0 /file/from/first/stage /location/in/current/stage ``` * We can also name stages, and reference these names: ```dockerfile FROM golang AS builder RUN ... FROM alpine COPY --from=builder /go/bin/mylittlebinary /usr/local/bin/ ``` .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds for our C program We will change our Dockerfile to: * give a nickname to the first stage: `compiler` * add a second stage using the same `ubuntu` base image * add the `hello` binary to the second stage * make sure that `CMD` is in the second stage The resulting Dockerfile is on the next slide. 
.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

## Multi-stage build `Dockerfile`

Here is the final Dockerfile:

```dockerfile
FROM ubuntu AS compiler
RUN apt-get update
RUN apt-get install -y build-essential
COPY hello.c /
RUN make hello

FROM ubuntu
COPY --from=compiler /hello /hello
CMD /hello
```

Let's build it, and check that it works correctly:

```bash
docker build -t hellomultistage .
docker run hellomultistage
```

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

## Comparing single/multi-stage build image sizes

List our images with `docker images`, and check the size of:

- the `ubuntu` base image,

- the single-stage `hello` image,

- the multi-stage `hellomultistage` image.

We can achieve even smaller images if we use smaller base images.

However, if we use common base images (e.g. if we standardize on `ubuntu`), these common images will be pulled only once per node, so they are virtually "free."

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

## Build targets

* We can also tag an intermediary stage with the following command:

  ```bash
  docker build --target STAGE --tag NAME
  ```

* This will create an image (named `NAME`) corresponding to stage `STAGE`

* This can be used to easily access an intermediary stage for inspection

  (instead of parsing the output of `docker build` to find out the image ID)

* This can also be used to describe multiple images from a single Dockerfile

  (instead of using multiple Dockerfiles, which could go out of sync)

???

:EN:Optimizing our images and their build process
:EN:- Leveraging multi-stage builds

:FR:Optimiser les images et leur construction
:FR:- Utilisation d'un *multi-stage build*

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/lots-of-containers.jpg)]

---

name: toc-publishing-images-to-the-docker-hub
class: title

Publishing images to the Docker Hub

.nav[
[Previous part](#toc-multi-stage-builds)
|
[Back to table of contents](#toc-part-3)
|
[Next part](#toc-exercise--writing-better-dockerfiles)
]

.debug[(automatically generated title slide)]

---

# Publishing images to the Docker Hub

We have built our first images.

We can now publish them to the Docker Hub!

*You don't have to do the exercises in this section, because they require an account on the Docker Hub, and we don't want to force anyone to create one.*

*Note, however, that creating an account on the Docker Hub is free (and doesn't require a credit card), and hosting public images is free as well.*

.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

## Logging into our Docker Hub account

* This can be done from the Docker CLI:

  ```bash
  docker login
  ```

.warning[When running Docker for Mac/Windows, or Docker on a Linux workstation, it can (and will when possible) integrate with your system's keyring to store your credentials securely. However, on most Linux servers, it will store your credentials in `~/.docker/config.json`.]
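For scripted use (e.g. in CI), the password or token can be supplied on standard input instead of interactively; the user name and variable below are hypothetical:

```bash
echo "$DOCKERHUB_TOKEN" | docker login --username jdoe --password-stdin
```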
.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

## Image tags and registry addresses

* Docker image tags are like Git tags and branches.

* They are like *bookmarks* pointing at a specific image ID.

* Tagging an image doesn't *rename* an image: it adds another tag.

* When pushing an image to a registry, the registry address is in the tag.

  Example: `registry.example.net:5000/image`

* What about Docker Hub images?

--

* `jpetazzo/clock` is, in fact, `index.docker.io/jpetazzo/clock`

* `ubuntu` is, in fact, `library/ubuntu`, i.e. `index.docker.io/library/ubuntu`

.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

## Tagging an image to push it on the Hub

* Let's tag our `figlet` image (or any other to our liking):

  ```bash
  docker tag figlet jpetazzo/figlet
  ```

* And push it to the Hub:

  ```bash
  docker push jpetazzo/figlet
  ```

* That's it!

--

* Anybody can now `docker run jpetazzo/figlet` anywhere.

.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

## The goodness of automated builds

* You can link a Docker Hub repository with a GitHub or BitBucket repository

* Each push to GitHub or BitBucket will trigger a build on Docker Hub

* If the build succeeds, the new image is available on Docker Hub

* You can map tags and branches between source and container images

* If you work with public repositories, this is free

.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

class: extra-details

## Setting up an automated build

* We need a Dockerized repository!

* Let's go to https://github.com/jpetazzo/trainingwheels and fork it.

* Go to the Docker Hub (https://hub.docker.com/) and sign in. Select "Repositories" in the blue navigation menu.

* Select "Create" in the top-right bar, and select "Create Repository+".

* Connect your Docker Hub account to your GitHub account.

* Click the "Create" button.

* Then go to the "Builds" tab.

* Click on the GitHub icon and select your user and the repository that we just forked.

* In the "Build rules" block near the bottom of the page, put `/www` in the "Build Context" column (or whichever directory the Dockerfile is in).

* Click "Save and Build" to build the repository immediately (without waiting for a git push).

* Subsequent builds will happen automatically, thanks to GitHub hooks.

.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)]

---

## Building on the fly

- Some services can build images on the fly from a repository

- Example: [ctr.run](https://ctr.run/)

.lab[

- Use ctr.run to automatically build a container image and run it:

  ```bash
  docker run ctr.run/github.com/undefinedlabs/hello-world
  ```

]

There might be a long pause before the first layer is pulled, because the API behind `docker pull` doesn't allow streaming build logs, and there is no feedback during the build.

It is possible to view the build logs by setting up an account on [ctr.run](https://ctr.run/).

???
:EN:- Publishing images to the Docker Hub :FR:- Publier des images sur le Docker Hub .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/plastic-containers.JPG)] --- name: toc-exercise--writing-better-dockerfiles class: title Exercise — writing better Dockerfiles .nav[ [Previous part](#toc-publishing-images-to-the-docker-hub) | [Back to table of contents](#toc-part-3) | [Next part](#toc-buildkit) ] .debug[(automatically generated title slide)] --- # Exercise — writing better Dockerfiles Let's update our Dockerfiles to leverage multi-stage builds! The code is at: https://github.com/jpetazzo/wordsmith Use a different tag for these images, so that we can compare their sizes. What's the size difference between single-stage and multi-stage builds? .debug[[containers/Exercise_Dockerfile_Advanced.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Exercise_Dockerfile_Advanced.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/train-of-containers-1.jpg)] --- name: toc-buildkit class: title Buildkit .nav[ [Previous part](#toc-exercise--writing-better-dockerfiles) | [Back to table of contents](#toc-part-4) | [Next part](#toc-container-network-drivers) ] .debug[(automatically generated title slide)] --- # Buildkit - "New" backend for Docker builds - announced in 2017 - ships with Docker Engine 18.09 - enabled by default on Docker Desktop in 2021 - Huge improvements in build efficiency - 100% compatible with existing Dockerfiles - New features for multi-arch - Not just for building container images .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Old vs New - Classic `docker build`: - copy whole build context - linear execution - `docker run` + `docker commit` + `docker run` + `docker commit`... - Buildkit: - copy files only when they are needed; cache them - compute dependency graph (dependencies are expressed by `COPY`) - parallel execution - doesn't rely on Docker, but on internal runner/snapshotter - can run in "normal" containers (including in Kubernetes pods) .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Parallel execution - In multi-stage builds, all stages can be built in parallel (example: https://github.com/jpetazzo/shpod; [before] and [after]) - Stages are built only when they are necessary (i.e. 
if their output is tagged or used in another necessary stage)

- Files are copied from context only when needed

- Files are cached in the builder

[before]: https://github.com/jpetazzo/shpod/blob/c6efedad6d6c3dc3120dbc0ae0a6915f85862474/Dockerfile
[after]: https://github.com/jpetazzo/shpod/blob/d20887bbd56b5fcae2d5d9b0ce06cae8887caabf/Dockerfile

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Turning it on and off

- On recent versions of Docker Desktop (since 2021):

  *enabled by default*

- On older versions, or on Docker CE (Linux):

  `export DOCKER_BUILDKIT=1`

- Turning it off:

  `export DOCKER_BUILDKIT=0`

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Multi-arch support

- Historically, Docker only ran on x86_64 / amd64

  (Intel/AMD 64 bits architecture)

- Folks have been running it on 32-bit ARM for ages

  (e.g. Raspberry Pi)

- This required a Go compiler and appropriate base images

  (which means changing/adapting Dockerfiles to use these base images)

- Docker [image manifest v2 schema 2][manifest] introduces multi-arch images

  (`FROM alpine` automatically gets the right image for your architecture)

[manifest]: https://docs.docker.com/registry/spec/manifest-v2-2/

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Why?

- Raspberry Pi (32-bit and 64-bit ARM)

- Other ARM-based embedded systems (ODROID, NVIDIA Jetson...)

- Apple M1

- AWS Graviton

- Ampere Altra (e.g. on Oracle Cloud)

- ...

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Multi-arch builds in a nutshell

Use the `docker buildx build` command:

```bash
docker buildx build … \
    --platform linux/amd64,linux/arm64,linux/arm/v7,linux/386 \
    [--tag jpetazzo/hello --push]
```

- Requires all base images to be available for these platforms

- Must not use binary downloads with hard-coded architectures!
(streamlining a Dockerfile for multi-arch: [before], [after])

[before]: https://github.com/jpetazzo/shpod/blob/d20887bbd56b5fcae2d5d9b0ce06cae8887caabf/Dockerfile
[after]: https://github.com/jpetazzo/shpod/blob/c50789e662417b34fea6f5e1d893721d66d265b7/Dockerfile

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Native vs emulated vs cross

- Native builds:

  *aarch64 machine running aarch64 programs building aarch64 images/binaries*

- Emulated builds:

  *x86_64 machine running aarch64 programs building aarch64 images/binaries*

- Cross builds:

  *x86_64 machine running x86_64 programs building aarch64 images/binaries*

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Native

- Dockerfiles are (relatively) simple to write

  (nothing special to do to handle multi-arch; just avoid hard-coded archs)

- Best performance

- Requires "exotic" machines

- Requires setting up a build farm

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Emulated

- Dockerfiles are (relatively) simple to write

- Emulation performance can vary

  (from "OK" to "ouch this is slow")

- Emulation isn't always perfect

  (weird bugs/crashes are rare but can happen)

- Doesn't require special machines

- Supports arbitrary architectures thanks to QEMU

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Cross

- Dockerfiles are more complicated to write

- Requires cross-compilation toolchains

- Performance is good

- Doesn't require special machines

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Native builds

- Requires base images to be available

- To view available architectures for an image:

  ```bash
  regctl manifest get --list <image>
  docker manifest inspect <image>
  ```

- Nothing special to do, *except* when downloading binaries!

  ```
  https://releases.hashicorp.com/terraform/1.1.5/terraform_1.1.5_linux_`amd64`.zip
  ```

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Finding the right architecture

`uname -m` → armv7l, aarch64, i686, x86_64

`GOARCH` (from `go env`) → arm, arm64, 386, amd64

In Dockerfile, add `ARG TARGETARCH` (or `ARG TARGETPLATFORM`)

- `TARGETARCH` matches `GOARCH`

- `TARGETPLATFORM` → linux/arm/v7, linux/arm64, linux/386, linux/amd64

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

class: extra-details

## Welp

Sometimes, binary releases be like:

```
Linux_arm64.tar.gz
Linux_ppc64le.tar.gz
Linux_s390x.tar.gz
Linux_x86_64.tar.gz
```

This needs a bit of custom mapping.
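A minimal sketch of such a mapping, using `TARGETARCH` in a `case` statement (the project name and download URL are hypothetical):

```dockerfile
FROM alpine
ARG TARGETARCH
# Map Docker's architecture names (amd64, arm64...) to the names
# used by this hypothetical project's release tarballs.
RUN case "$TARGETARCH" in \
      amd64) ARCH=x86_64 ;; \
      arm64) ARCH=arm64 ;; \
      *) echo "unsupported architecture: $TARGETARCH" >&2; exit 1 ;; \
    esac \
 && wget -O /tmp/tool.tar.gz "https://example.com/tool/Linux_${ARCH}.tar.gz" \
 && tar -C /usr/local/bin -zxf /tmp/tool.tar.gz
```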
.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Emulation

- Leverages `binfmt_misc` and QEMU on Linux

- Enabling:

  ```bash
  docker run --rm --privileged aptman/qus -s -- -p
  ```

- Disabling:

  ```bash
  docker run --rm --privileged aptman/qus -- -r
  ```

- Checking status:

  ```bash
  ls -l /proc/sys/fs/binfmt_misc
  ```

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

class: extra-details

## How it works

- `binfmt_misc` lets us register _interpreters_ for binaries, e.g.:

  - [DOSBox][dosbox] for DOS programs

  - [Wine][wine] for Windows programs

  - [QEMU][qemu] for Linux programs for other architectures

- When we try to execute e.g. a SPARC binary on our x86_64 machine:

  - `binfmt_misc` detects the binary format and invokes `qemu-<arch> the-binary ...`

  - QEMU translates SPARC instructions to x86_64 instructions

  - system calls go straight to the kernel

[dosbox]: https://www.dosbox.com/
[QEMU]: https://www.qemu.org/
[wine]: https://www.winehq.org/

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

class: extra-details

## QEMU registration

- The `aptman/qus` image mentioned earlier contains static QEMU builds

- It registers all these interpreters with the kernel

- For more details, check:

  - https://github.com/dbhi/qus

  - https://dbhi.github.io/qus/

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Cross-compilation

- Cross-compilation is about 10x faster than emulation

  (non-scientific benchmarks!)

- In Dockerfile, add:

  `ARG BUILDARCH BUILDPLATFORM TARGETARCH TARGETPLATFORM`

- Can use `FROM --platform=$BUILDPLATFORM <image>`

- Then use `$TARGETARCH` or `$TARGETPLATFORM`

  (e.g. for Go, `export GOARCH=$TARGETARCH`)

- Check [tonistiigi/xx][xx] and [Toni's blog][toni] for some amazing cross tools!

[xx]: https://github.com/tonistiigi/xx
[toni]: https://medium.com/@tonistiigi/faster-multi-platform-builds-dockerfile-cross-compilation-guide-part-1-ec087c719eaf

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## Checking runtime capabilities

Build and run the following Dockerfile:

```dockerfile
FROM --platform=linux/amd64 busybox AS amd64
FROM --platform=linux/arm64 busybox AS arm64
FROM --platform=linux/arm/v7 busybox AS arm32
FROM --platform=linux/386 busybox AS ia32
FROM alpine
RUN apk add file
WORKDIR /root
COPY --from=amd64 /bin/busybox /root/amd64/busybox
COPY --from=arm64 /bin/busybox /root/arm64/busybox
COPY --from=arm32 /bin/busybox /root/arm32/busybox
COPY --from=ia32 /bin/busybox /root/ia32/busybox
CMD for A in *; do echo "$A => $($A/busybox uname -a)"; done
```

It will indicate which executables can be run on your engine.

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

## More than builds

- Buildkit is also used in other systems:

  - [Earthly] - generic repeatable build pipelines

  - [Dagger] - CI/CD pipelines that run anywhere

  - and more!
[Earthly]: https://earthly.dev/
[Dagger]: https://dagger.io/

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/train-of-containers-2.jpg)]

---

name: toc-container-network-drivers
class: title

Container network drivers

.nav[
[Previous part](#toc-buildkit)
|
[Back to table of contents](#toc-part-4)
|
[Next part](#toc-deep-dive-into-container-internals)
]

.debug[(automatically generated title slide)]

---

# Container network drivers

The Docker Engine supports different network drivers.

The built-in drivers include:

* `bridge` (default)

* `null` (for the special network called `none`)

* `host` (for the special network called `host`)

* `container` (that one is a bit magic!)

The network is selected with `docker run --net ...`.

Each network is managed by a driver.

The different drivers are explained in more detail on the following slides.

.debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)]

---

## The default bridge

* By default, the container gets a virtual `eth0` interface.

  (In addition to its own private `lo` loopback interface.)

* That interface is provided by a `veth` pair.

* It is connected to the Docker bridge.

  (Named `docker0` by default; configurable with `--bridge`.)

* Addresses are allocated on a private, internal subnet.

  (Docker uses 172.17.0.0/16 by default; configurable with `--bip`.)

* Outbound traffic goes through an iptables MASQUERADE rule.

* Inbound traffic goes through an iptables DNAT rule.

* The container can have its own routes, iptables rules, etc.

.debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)]

---

## The null driver

* Container is started with `docker run --net none ...`

* It only gets the `lo` loopback interface. No `eth0`.

* It can't send or receive network traffic.

* Useful for isolated/untrusted workloads.

.debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)]

---

## The host driver

* Container is started with `docker run --net host ...`

* It sees (and can access) the network interfaces of the host.

* It can bind any address, any port (for ill and for good).

* Network traffic doesn't have to go through NAT, bridge, or veth.

* Performance = native!

Use cases:

* Performance sensitive applications (VOIP, gaming, streaming...)

* Peer discovery (e.g. Erlang port mapper, Raft, Serf...)

.debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)]

---

## The container driver

* Container is started with `docker run --net container:id ...`

* It re-uses the network stack of another container.

* It shares with this other container the same interfaces, IP address(es), routes, iptables rules, etc.

* Those containers can communicate over their `lo` interface.

  (i.e. one can bind to 127.0.0.1 and the others can connect to it.)

???
:EN:Advanced container networking
:EN:- Transparent network access with the "host" driver
:EN:- Sharing is caring with the "container" driver

:FR:Paramétrage réseau avancé
:FR:- Accès transparent au réseau avec le mode "host"
:FR:- Partage de la pile réseau avec le mode "container"

.debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/two-containers-on-a-truck.jpg)]

---

name: toc-deep-dive-into-container-internals
class: title

Deep dive into container internals

.nav[
[Previous part](#toc-container-network-drivers)
|
[Back to table of contents](#toc-part-4)
|
[Next part](#toc-control-groups)
]

.debug[(automatically generated title slide)]

---

# Deep dive into container internals

In this chapter, we will explain some of the fundamental building blocks of containers.

This will give you a solid foundation so you can:

- understand "what's going on" in complex situations,

- anticipate the behavior of containers (performance, security...) in new scenarios,

- implement your own container engine.

The last item should be done for educational purposes only!

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## There is no container code in the Linux kernel

- If we search "container" in the Linux kernel code, we find:

  - generic code to manipulate data structures (like linked lists, etc.),

  - unrelated concepts like "ACPI containers",

  - *nothing* relevant to "our" containers!

- Containers are composed using multiple independent features.

- On Linux, containers rely on "namespaces, cgroups, and some filesystem magic."

- Security also requires features like capabilities, seccomp, LSMs...

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

class: pic

.interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/wall-of-containers.jpeg)]

---

name: toc-control-groups
class: title

Control groups

.nav[
[Previous part](#toc-deep-dive-into-container-internals)
|
[Back to table of contents](#toc-part-4)
|
[Next part](#toc-namespaces)
]

.debug[(automatically generated title slide)]

---

# Control groups

- Control groups provide resource *metering* and *limiting*.

- This covers a number of "usual suspects" like:

  - memory

  - CPU

  - block I/O

  - network (with cooperation from iptables/tc)

- And a few exotic ones:

  - huge pages (a special way to allocate memory)

  - RDMA (resources specific to InfiniBand / remote memory transfer)

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Crowd control

- Control groups also make it possible to group processes for special operations:

  - freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT")

  - perf_event (gather performance events on multiple processes)

  - cpuset (limit or pin processes to specific CPUs)

- There is a "pids" cgroup to limit the number of processes in a given group.

- There is also a "devices" cgroup to control access to device nodes.

  (i.e. everything in `/dev`.)
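Some of these controls are exposed directly by Docker; for instance, the "pids" cgroup is behind the `--pids-limit` flag:

```bash
# Cap the container at 100 processes (a cheap guard against fork bombs):
docker run --rm -it --pids-limit 100 alpine sh
```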
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Generalities

- Cgroups form a hierarchy (a tree).

- We can create nodes in that hierarchy.

- We can associate limits to a node.

- We can move a process (or multiple processes) to a node.

- The process (or processes) will then respect these limits.

- We can check the current usage of each node.

- In other words: limits are optional (if we only want accounting).

- When a process is created, it is placed in its parent's groups.

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Example

The numbers are PIDs.

The names are the names of our nodes (arbitrarily chosen).

.small[
```bash
cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
```
]

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

class: extra-details, deep-dive

## Cgroups v1 vs v2

- Cgroups v1 are available on all systems (and widely used).

- Cgroups v2 are a huge refactor.

  (Development started in Linux 3.10, released in 4.5.)

- Cgroups v2 have a number of differences:

  - single hierarchy (instead of one tree per controller),

  - processes can only be on leaf nodes (not inner nodes),

  - and of course many improvements / refactorings.

- Cgroups v2 enabled by default on Fedora 31 (2019), Ubuntu 21.10...

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Memory cgroup: accounting

- Keeps track of pages used by each group:

  - file (read/write/mmap from block devices),

  - anonymous (stack, heap, anonymous mmap),

  - active (recently accessed),

  - inactive (candidate for eviction).

- Each page is "charged" to a group.

- Pages can be shared across multiple groups.

  (Example: multiple processes reading from the same files.)

- To view all the counters kept by this cgroup:

  ```bash
  $ cat /sys/fs/cgroup/memory/memory.stat
  ```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Memory cgroup v1: limits

- Each group can have (optional) hard and soft limits.

- Limits can be set for different kinds of memory:

  - physical memory,

  - kernel memory,

  - total memory (including swap).

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)]

---

## Soft limits and hard limits

- Soft limits are not enforced.

  (But they influence reclaim under memory pressure.)

- Hard limits *cannot* be exceeded:

  - if a group of processes exceeds a hard limit,

  - and if the kernel cannot reclaim any memory,

  - then the OOM (out-of-memory) killer is triggered,

  - and processes are killed until memory gets below the limit again.
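With Docker, these limits are exposed through flags of `docker run`; for instance, to give a container a 100 MB hard limit (and no additional swap):

```bash
docker run --rm -it --memory 100m --memory-swap 100m alpine sh
```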
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Avoiding the OOM killer - For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable. - The "oom-notifier" mechanism helps with that. - When "oom-notifier" is enabled and a hard limit is exceeded: - all processes in the cgroup are frozen, - a notification is sent to user space (instead of killing processes), - user space can then raise limits, migrate containers, etc., - once the memory usage is below the hard limit, unfreeze the cgroup. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Overhead of the memory cgroup - Each time a process grabs or releases a page, the kernel updates counters. - This adds some overhead. - Unfortunately, this cannot be enabled/disabled per process. - It has to be done system-wide, at boot time. - Also, when multiple groups use the same page: - only the first group gets "charged", - but if it stops using it, the "charge" is moved to another group. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a limit with the memory cgroup Create a new memory cgroup: ```bash
$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG
``` Limit it to approximately 100MB of memory usage: ```bash
$ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000
``` Move the current process to that cgroup: ```bash
$ sudo tee $CG/tasks <<< $$
``` The current process *and all its future children* are now limited. (Confused about `<<<`? Look at the next slide!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## What's `<<<`? - This is a "here string". (It is a non-POSIX shell extension.) - The following commands are equivalent: ```bash
foo <<< hello
``` ```bash
echo hello | foo
``` ```bash
foo <<EOF
hello
EOF
``` The following commands, however, would be invalid: ```bash
sudo echo $$ > $CG/tasks
``` (the redirection is performed by our unprivileged shell, not by `sudo`, so it fails with "Permission denied") ```bash
sudo -i # (or su)
echo $$ > $CG/tasks
``` (here, `$$` is the PID of the new root shell, not of the shell that we wanted to limit) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Testing the memory limit Start the Python interpreter: ```bash
$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
``` Allocate 80 megabytes: ```python
>>> s = "!" * 1000000 * 80
``` Add 20 megabytes more: ```python
>>> t = "!" * 1000000 * 20
Killed
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup v2: limits - `memory.min` = hard reservation (guaranteed memory for this cgroup) - `memory.low` = soft reservation ("*try* not to reclaim memory if we're below this") - `memory.high` = soft limit (aggressively reclaim memory; don't trigger OOMK) - `memory.max` = hard limit (triggers OOMK) - `memory.swap.high` = aggressively reclaim memory when using that much swap - `memory.swap.max` = prevent using more swap than this .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## CPU cgroup - Keeps track of CPU time used by a group of processes. (This is easier and more accurate than `getrusage` and `/proc`.) - Keeps track of usage per CPU as well. (i.e., "this group of processes used X seconds of CPU0 and Y seconds of CPU1".) - Allows setting relative weights used by the scheduler. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Cpuset cgroup - Pin groups to specific CPU(s). - Use-case: reserve CPUs for specific apps. - Warning: make sure that "default" processes aren't using all CPUs! - CPU pinning can also avoid performance loss due to cache flushes. - This is also relevant for NUMA systems. - Provides extra dials and knobs. (Per zone memory pressure, process migration costs...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Blkio cgroup - Keeps track of I/Os for each group: - per block device - read vs write - sync vs async - Set throttle (limits) for each group: - per block device - read vs write - ops vs bytes - Set relative weights for each group. - Note: most writes go through the page cache. (So classic writes will appear to be unthrottled at first.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net_cls and net_prio cgroup - Only works for egress (outgoing) traffic. - Automatically set traffic class or priority for traffic generated by processes in the group. - Net_cls will assign traffic to a class. - Classes have to be matched with tc or iptables, otherwise traffic just flows normally. - Net_prio will assign traffic to a priority. - Priorities are used by queuing disciplines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Devices cgroup - Controls what the group can do on device nodes - Permissions include read/write/mknod - Typical use: - allow `/dev/{tty,zero,random,null}` ... - deny everything else - A few interesting nodes: - `/dev/net/tun` (network interface manipulation) - `/dev/fuse` (filesystems in user space) - `/dev/kvm` (VMs in containers, yay inception!)
- `/dev/dri` (GPU) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/Container-Ship-Freighter-Navigation-Elbe-Romance-1782991.jpg)] --- name: toc-namespaces class: title Namespaces .nav[ [Previous part](#toc-control-groups) | [Back to table of contents](#toc-part-4) | [Next part](#toc-security-features) ] .debug[(automatically generated title slide)] --- # Namespaces - Provide processes with their own view of the system. - Namespaces limit what you can see (and therefore, what you can use). - These namespaces are available in modern kernels: - pid - net - mnt - uts - ipc - user - time - cgroup (We are going to detail them individually.) - Each process belongs to one namespace of each type. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Namespaces are always active - Namespaces exist even when you don't use containers. - This is a bit similar to the UID field in UNIX processes: - all processes have the UID field, even if no user exists on the system - the field always has a value / the value is always defined (i.e. any process running on the system has some UID) - the value of the UID field is used when checking permissions (the UID field determines which resources the process can access) - You can replace "UID field" with "namespace" above and it still works! - In other words: even when you don't use containers, there is one namespace of each type, containing all the processes on the system. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Manipulating namespaces - Namespaces are created with two methods: - the `clone()` system call (used when creating new threads and processes), - the `unshare()` system call. - The Linux tool `unshare` allows doing that from a shell. - A new process can re-use none / all / some of the namespaces of its parent. - It is possible to "enter" a namespace with the `setns()` system call. - The Linux tool `nsenter` allows doing that from a shell. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces lifecycle - When the last process of a namespace exits, the namespace is destroyed. - All the associated resources are then removed. - Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`. ```bash
ls -l /proc/self/ns
``` - It is possible to compare namespaces by checking these files. (This helps to answer the question, "are these two processes in the same namespace?") - It is possible to preserve a namespace by bind-mounting its pseudo-file. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces can be used independently - As mentioned in the previous slides: *A new process can re-use none / all / some of the namespaces of its parent.* - We are going to use that property in the examples in the next slides. - We are going to present each type of namespace.
- For each type, we will provide an example using only that namespace. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## UTS namespace - gethostname / sethostname - Allows setting a custom hostname for a container. - That's (mostly) it! - Also allows setting the NIS domain. (If you don't know what a NIS domain is, you don't have to worry about it!) - If you're wondering: UTS = UNIX time sharing. - This namespace was named like this because of the `struct utsname`, which is commonly used to obtain the machine's hostname, architecture, etc. (The more you know!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Creating our first namespace Let's use `unshare` to create a new process that will have its own UTS namespace: ```bash
$ sudo unshare --uts
``` - We have to use `sudo` for most `unshare` operations. - We indicate that we want a new uts namespace, and nothing else. - If we don't specify a program to run, a `$SHELL` is started. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Demonstrating our uts namespace In our new "container", check the hostname, change it, and check it: ```bash
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
``` In another shell, check that the machine's hostname hasn't changed: ```bash
$ hostname
nodeX
``` Exit the "container" with `exit` or `Ctrl-D`. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace overview - Each network namespace has its own private network stack. - The network stack includes: - network interfaces (including `lo`), - routing table**s** (as in `ip rule` etc.), - iptables chains and rules, - sockets (as seen by `ss`, `netstat`). - You can move a network interface from a network namespace to another: ```bash
ip link set dev eth0 netns PID
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace typical use - Each container is given its own network namespace. - For each network namespace (i.e. each container), a `veth` pair is created. (Two `veth` interfaces act as if they were connected with a cross-over cable.) - One `veth` is moved to the container network namespace (and renamed `eth0`). - The other `veth` is moved to a bridge on the host (e.g. the `docker0` bridge).
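We can see this wiring on a live system by entering a container's network namespace with `nsenter`. A quick sketch (the container name `web` is hypothetical):

```bash
# Get the PID of the container's main process...
PID=$(docker inspect --format '{{.State.Pid}}' web)
# ...then run a command inside that container's network namespace:
sudo nsenter --target $PID --net ip addr show eth0
```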
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating a network namespace Start a new process with its own network namespace: ```bash
$ sudo unshare --net
``` See that this new network namespace is unconfigured: ```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating the `veth` interfaces In another shell (on the host), create a `veth` pair: ```bash
$ sudo ip link add name in_host type veth peer name in_netns
``` Configure the host side (`in_host`): ```bash
$ sudo ip link set in_host master docker0 up
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Moving the `veth` interface *In the process created by `unshare`,* check the PID of our "network container": ```bash
# echo $$
533
``` *On the host*, move the other side (`in_netns`) to the network namespace: ```bash
$ sudo ip link set in_netns netns 533
``` (Make sure to update "533" with the actual PID obtained above!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Basic network configuration Let's set up `lo` (the loopback interface): ```bash
# ip link set lo up
``` Activate the `veth` interface and rename it to `eth0`: ```bash
# ip link set in_netns name eth0 up
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Allocating IP address and default route *On the host*, check the address of the Docker bridge: ```bash
$ ip addr ls dev docker0
``` (It could be something like `172.17.0.1`.) Pick an IP address in the middle of the same subnet, e.g. `172.17.0.99`. *In the process created by `unshare`,* configure the interface: ```bash
# ip addr add 172.17.0.99/24 dev eth0
# ip route add default via 172.17.0.1
``` (Make sure to update the IP addresses if necessary.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Validating the setup Check that we now have connectivity: ```bash
# ping 1.1
``` Note: we were able to take a shortcut, because Docker is running, and provides us with a `docker0` bridge and a valid `iptables` setup. If Docker is not running, you will need to take care of this! .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Cleaning up network namespaces - Terminate the process created by `unshare` (with `exit` or `Ctrl-D`). - Since this was the only process in the network namespace, it is destroyed. - All the interfaces in the network namespace are destroyed. - When a `veth` interface is destroyed, it also destroys the other half of the pair. - So we don't have anything else to do to clean up!
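For reference, here is the whole sequence from the previous slides condensed in one place (same illustrative PID and addresses; `#` = shell inside the new namespace, `$` = shell on the host):

```bash
$ sudo unshare --net                 # start the "network container"
# echo $$                            # note its PID (533 in our example)
$ sudo ip link add name in_host type veth peer name in_netns
$ sudo ip link set in_host master docker0 up
$ sudo ip link set in_netns netns 533
# ip link set lo up
# ip link set in_netns name eth0 up
# ip addr add 172.17.0.99/24 dev eth0
# ip route add default via 172.17.0.1
# ping 1.1                           # success!
```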
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Other ways to use network namespaces - `--net none` gives an empty network namespace to a container. (Effectively isolating it completely from the network.) - `--net host` means "do not containerize the network". (No network namespace is created; the container uses the host network stack.) - `--net container` means "reuse the network namespace of another container". (As a result, both containers share the same interfaces, routes, etc.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Mnt namespace - Processes can have their own root fs (à la chroot). - Processes can also have "private" mounts. This allows: - isolating `/tmp` (per user, per service...) - masking `/proc`, `/sys` (for processes that don't need them) - mounting remote filesystems or sensitive data, but making them visible only to allowed processes - Mounts can be totally private, or shared. - At this point, there is no easy way to pass along a mount from one namespace to another. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a private `/tmp` Create a new mount namespace: ```bash
$ sudo unshare --mount
``` In that new namespace, mount a brand new `/tmp`: ```bash
# mount -t tmpfs none /tmp
``` Check the content of `/tmp` in the new namespace, and compare to the host. The mount is automatically cleaned up when you exit the process. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## PID namespace - Processes within a PID namespace only "see" processes in the same PID namespace. - Each PID namespace has its own numbering (starting at 1). - When PID 1 goes away, the whole namespace is killed. (When PID 1 goes away on a normal UNIX system, the kernel panics!) - Those namespaces can be nested. - A process ends up having multiple PIDs (one per namespace in which it is nested). .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespace in action Create a new PID namespace: ```bash
$ sudo unshare --pid --fork
``` (We need the `--fork` flag because the PID namespace is special.) Check the process tree in the new namespace: ```bash
# ps faux
``` -- class: extra-details, deep-dive 🤔 Why do we see all the processes?!? .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespaces and `/proc` - Tools like `ps` rely on the `/proc` pseudo-filesystem. - Our new namespace still has access to the original `/proc`. - Therefore, it still sees host processes. - But it cannot affect them. (Try to `kill` a process: you will get `No such process`.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespaces, take 2 - This can be solved by mounting `/proc` in the namespace.
- The `unshare` utility provides a convenience flag, `--mount-proc`. - This flag will mount `/proc` in the namespace. - It will also unshare the mount namespace, so that this mount is local. Try it: ```bash
$ sudo unshare --pid --fork --mount-proc
# ps faux
``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## OK, really, why do we need `--fork`? *It is not necessary to remember all these details. This is just an illustration of the complexity of namespaces!* The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary. A process calling `unshare` to create new namespaces is moved to the new namespaces... ... Except for the PID namespace. (Because this would change the current PID of the process from X to 1.) The processes created by the new binary are placed into the new PID namespace. The first one will be PID 1. If PID 1 exits, it is not possible to create additional processes in the namespace. (Attempting to do so will result in `ENOMEM`.) Without the `--fork` flag, the first command that we execute will be PID 1 ... ... And once it exits, we cannot create more processes in the namespace! Check `man 2 unshare` and `man pid_namespaces` if you want more details. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## IPC namespace -- - Does anybody know about IPC? -- - Does anybody *care* about IPC? -- - Allows a process (or group of processes) to have their own: - IPC semaphores - IPC message queues - IPC shared memory ... without risk of conflict with other instances. - Older versions of PostgreSQL cared about this. *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## User namespace - Allows mapping UID/GID; e.g.: - UID 0→1999 in container C1 is mapped to UID 10000→11999 on host - UID 0→1999 in container C2 is mapped to UID 12000→13999 on host - etc. - UID 0 in the container can still perform privileged operations in the container. (For instance: setting up network interfaces.) - But outside of the container, it is a non-privileged user. - It also means that the UID in containers becomes unimportant. (Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.) - Ultimately enables better privilege separation in container engines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## User namespace challenges - UID needs to be mapped when passed between processes or kernel subsystems. - Filesystem permissions and file ownership are more complicated. .small[(E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)] - With the Docker Engine: - some feature combinations are not allowed (e.g.
user namespace + host network namespace sharing) - user namespaces need to be enabled/disabled globally (when the daemon is started) - container images are stored separately (so the first time you toggle user namespaces, you need to re-pull images) *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Time namespace - Virtualize time - Expose a slower/faster clock to some processes (e.g. for simulation purposes) - Expose a clock offset to some processes (simulation, suspend/restore...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Cgroup namespace - Virtualize access to `/proc/<pid>/cgroup` - Lets containerized processes view their relative cgroup tree .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/ShippingContainerSFBay.jpg)] --- name: toc-security-features class: title Security features .nav[ [Previous part](#toc-namespaces) | [Back to table of contents](#toc-part-4) | [Next part](#toc-orchestration-an-overview) ] .debug[(automatically generated title slide)] --- # Security features - Namespaces and cgroups are not enough to ensure strong security. - We need extra mechanisms: capabilities, seccomp, LSMs. - These mechanisms were already used before containers to harden security. - They can be used together with containers. - Good container engines will automatically leverage these features. (So that you don't have to worry about it.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Capabilities - In traditional UNIX, many operations are possible if and only if UID=0 (root). - Some of these operations are very powerful: - changing file ownership, accessing all files ... - Some of these operations deal with system configuration, but can be abused: - setting up network interfaces, mounting filesystems ... - Some of these operations are not very dangerous but are needed by servers: - binding to a port below 1024. - Capabilities are per-process flags to allow these operations individually. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Some capabilities - `CAP_CHOWN`: arbitrarily change file ownership and permissions. - `CAP_DAC_OVERRIDE`: arbitrarily bypass file ownership and permissions. - `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc. - `CAP_NET_BIND_SERVICE`: bind a port below 1024. See `man capabilities` for the full list and details. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Using capabilities - Container engines will typically drop all "dangerous" capabilities. - You can then re-enable capabilities on a per-container basis, as needed. - With the Docker engine: `docker run --cap-add ...` - If you write your own code to manage capabilities: - make sure that you understand what each capability does, - read about *ambient* capabilities as well.
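A quick way to see capabilities in action with Docker (a sketch; `alpine` is just a convenient small image):

```bash
# CAP_CHOWN is part of Docker's default capability set, so this works:
docker run --rm alpine chown nobody /tmp
# Drop it, and the exact same command fails with "Operation not permitted":
docker run --rm --cap-drop CHOWN alpine chown nobody /tmp
# Symmetrically, --cap-add grants capabilities that are dropped by default.
```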
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Seccomp - Seccomp is secure computing. - Achieves a high level of security by drastically restricting the available syscalls. - Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`. - The seccomp-bpf extension allows specifying custom filters with BPF rules. - This allows filtering by syscall, and by parameter. - BPF code can perform arbitrarily complex checks, quickly, and safely. - Container engines take care of this so you don't have to. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Linux Security Modules - The most popular ones are SELinux and AppArmor. - Red Hat distros generally use SELinux. - Debian distros (in particular, Ubuntu) generally use AppArmor. - LSMs add a layer of access control to all process operations. - Container engines take care of this so you don't have to. ??? :EN:Containers internals :EN:- Control groups (cgroups) :EN:- Linux kernel namespaces :FR:Fonctionnement interne des conteneurs :FR:- Les "control groups" (cgroups) :FR:- Les namespaces du noyau Linux .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/aerial-view-of-containers.jpg)] --- name: toc-orchestration-an-overview class: title Orchestration, an overview .nav[ [Previous part](#toc-security-features) | [Back to table of contents](#toc-part-4) | [Next part](#toc-) ] .debug[(automatically generated title slide)] --- # Orchestration, an overview In this chapter, we will: * Explain what orchestration is and why we would need it. * Present (from a high-level perspective) some orchestrators. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## What's orchestration? ![Joana Carneiro (orchestra conductor)](images/conductor.jpg) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## What's orchestration? According to Wikipedia: *Orchestration describes the __automated__ arrangement, coordination, and management of complex computer systems, middleware, and services.* -- *[...] orchestration is often discussed in the context of __service-oriented architecture__, __virtualization__, provisioning, Converged Infrastructure and __dynamic datacenter__ topics.* -- What does that really mean? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances -- - Q: do we always use 100% of our servers? -- - A: obviously not! .center[![Daily variations of traffic](images/traffic-graph.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances - Every night, scale down (by shutting down extraneous replicated instances) - Every morning, scale up (by deploying new copies) - "Pay for what you use" (i.e.
save big $$$ here) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances How do we implement this? - Crontab - Autoscaling (save even bigger $$$) That's *relatively* easy. Now, how are things for our IAAS provider? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter - Q: what's the #1 cost in a datacenter? -- - A: electricity! -- - Q: what uses electricity? -- - A: servers, obviously - A: ... and associated cooling -- - Q: do we always use 100% of our servers? -- - A: obviously not! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter - If only we could turn off unused servers during the night... - Problem: we can only turn off a server if it's totally empty! (i.e. all VMs on it are stopped/moved) - Solution: *migrate* VMs and shutdown empty servers (e.g. combine two hypervisors with 40% load into 80%+0%, and shut down the one at 0%) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter How do we implement this? - Shut down empty hosts (but keep some spare capacity) - Start hosts again when capacity gets low - Ability to "live migrate" VMs (Xen already did this 10+ years ago) - Rebalance VMs on a regular basis - what if a VM is stopped while we move it? - should we allow provisioning on hosts involved in a migration? *Scheduling* becomes more complex. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## What is scheduling? According to Wikipedia (again): *In computing, scheduling is the method by which threads, processes or data flows are given access to system resources.* The scheduler is concerned mainly with: - throughput (total amount of work done per time unit); - turnaround time (between submission and completion); - response time (between submission and start); - waiting time (between job readiness and execution); - fairness (appropriate times according to priorities). In practice, these goals often conflict. **"Scheduling" = decide which resources to use.** .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 1 - You have: - 5 hypervisors (physical machines) - Each server has: - 16 GB RAM, 8 cores, 1 TB disk - Each week, your team requests: - one VM with X RAM, Y CPU, Z disk Scheduling = deciding which hypervisor to use for each VM. Difficulty: easy! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 2 - You have: - 1000+ hypervisors (and counting!) - Each server has different resources: - 8-500 GB of RAM, 4-64 cores, 1-100 TB disk - Multiple times a day, a different team asks for: - up to 50 VMs with different characteristics Scheduling = deciding which hypervisor to use for each VM. Difficulty: ??? 
.debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 2 - You have: - 1000+ hypervisors (and counting!) - Each server has different resources: - 8-500 GB of RAM, 4-64 cores, 1-100 TB disk - Multiple times a day, a different team asks for: - up to 50 VMs with different characteristics Scheduling = deciding which hypervisor to use for each VM. ![Troll face](images/trollface.png) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 3 - You have machines (physical and/or virtual) - You have containers - You are trying to put the containers on the machines - Sounds familiar? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with one resource .center[![Not-so-good bin packing](images/binpacking-1d-1.gif)] ## We can't fit a job of size 6 :( .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with one resource .center[![Better bin packing](images/binpacking-1d-2.gif)] ## ... Now we can! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with two resources .center[![2D bin packing](images/binpacking-2d.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with three resources .center[![3D bin packing](images/binpacking-3d.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## You need to be good at this .center[![Tangram](images/tangram.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## But also, you must be quick! .center[![Tetris](images/tetris-1.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## And be web scale! .center[![Big tetris](images/tetris-2.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## And think outside (?) of the box! .center[![3D tetris](images/tetris-3.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Good luck! .center[![FUUUUUU face](images/fu-face.jpg)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## TL,DR * Scheduling with multiple resources (dimensions) is hard. * Don't expect to solve the problem with a Tiny Shell Script. * There are literally tons of research papers written on this. 
.debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## But our orchestrator also needs to manage ... * Network connectivity (or filtering) between containers. * Load balancing (external and internal). * Failure recovery (if a node or a whole datacenter fails). * Rolling out new versions of our applications. (Canary deployments, blue/green deployments...) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Some orchestrators We are going to briefly present a few orchestrators. There is no "absolute best" orchestrator. It depends on: - your applications, - your requirements, - your pre-existing skills... .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Nomad - Open Source project by Hashicorp. - Arbitrary scheduler (not just for containers). - Great if you want to schedule mixed workloads. (VMs, containers, processes...) - Less integration with the rest of the container ecosystem. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Mesos - Open Source project in the Apache Foundation. - Arbitrary scheduler (not just for containers). - Two-level scheduler. - Top-level scheduler acts as a resource broker. - Second-level schedulers (aka "frameworks") obtain resources from top-level. - Frameworks implement various strategies. (Marathon = long running processes; Chronos = run at intervals; ...) - Commercial offering through DC/OS by Mesosphere. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Rancher - Rancher 1 offered a simple interface for Docker hosts. - Rancher 2 is a complete management platform for Docker and Kubernetes. - Technically not an orchestrator, but it's a popular option. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Swarm - Tightly integrated with the Docker Engine. - Extremely simple to deploy and set up, even in multi-manager (HA) mode. - Secure by default. - Strongly opinionated: - smaller set of features, - easier to operate. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Kubernetes - Open Source project initiated by Google. - Contributions from many other actors. - *De facto* standard for container orchestration. - Many deployment options; some of them very complex. - Reputation: steep learning curve. - Reality: - true, if we try to understand *everything*; - false, if we focus on what matters. ??? :EN:- Orchestration overview :FR:- Survol de techniques d'orchestration .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: title, self-paced Thank you! .debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/shared/thankyou.md)] --- class: title, in-person That's all, folks! Questions?
![end](images/end.jpg) .debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/shared/thankyou.md)]
.debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)] --- ## Production image This Dockerfile builds an image leveraging gunicorn: ```dockerfile
FROM python
RUN pip install flask
RUN pip install gunicorn
RUN pip install redis
COPY . /src
WORKDIR /src
CMD gunicorn --bind 0.0.0.0:5000 --workers 10 counter:app
EXPOSE 5000
``` (Source: [trainingwheels Dockerfile](https://github.com/jpetazzo/trainingwheels/blob/master/www/Dockerfile)) .debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)] --- ## Development Compose file This Compose file uses the same image, but with a few overrides for development: - the Flask development server is used (overriding `CMD`), - the `DEBUG` environment variable is set, - a volume is used to provide a faster local development workflow. .small[ ```yaml
services:
  www:
    build: www
    ports:
      - 8000:5000
    user: nobody
    environment:
      DEBUG: 1
    command: python counter.py
    volumes:
      - ./www:/src
``` ] (Source: [trainingwheels Compose file](https://github.com/jpetazzo/trainingwheels/blob/master/docker-compose.yml)) .debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)] --- ## How to know which best practices are better? - The main goal of containers is to make our lives easier. - In this chapter, we showed many ways to write Dockerfiles. - These Dockerfiles sometimes use diametrically opposed techniques. - Yet, they were the "right" ones *for a specific situation.* - It's OK (and even encouraged) to start simple and evolve as needed. - Feel free to review this chapter later (after writing a few Dockerfiles) for inspiration! ??? :EN:Optimizing images :EN:- Dockerfile tips, tricks, and best practices :EN:- Reducing build time :EN:- Reducing image size :FR:Optimiser ses images :FR:- Bonnes pratiques, trucs et astuces :FR:- Réduire le temps de build :FR:- Réduire la taille des images .debug[[containers/Dockerfile_Tips.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Dockerfile_Tips.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/container-housing.jpg)] --- name: toc-advanced-dockerfile-syntax class: title Advanced Dockerfile Syntax .nav[ [Previous part](#toc-dockerfile-examples) | [Back to table of contents](#toc-part-3) | [Next part](#toc-reducing-image-size) ] .debug[(automatically generated title slide)] --- class: title # Advanced Dockerfile Syntax ![construction](images/title-advanced-dockerfiles.jpg) .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Objectives We have seen simple Dockerfiles to illustrate how Docker builds container images. In this section, we will give a recap of the Dockerfile syntax, and introduce advanced Dockerfile commands that we might come across sometimes; or that we might want to use in some specific scenarios. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## `Dockerfile` usage summary * `Dockerfile` instructions are executed in order. * Each instruction creates a new layer in the image. * Docker maintains a cache with the layers of previous builds.
* When there are no changes in the instructions and files making a layer, the builder re-uses the cached layer, without executing the instruction for that layer. * The `FROM` instruction MUST be the first non-comment instruction. * Lines starting with `#` are treated as comments. * Some instructions (like `CMD` or `ENTRYPOINT`) update a piece of metadata. (As a result, each call to these instructions makes the previous one useless.) .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `RUN` instruction The `RUN` instruction can be specified in two ways. With shell wrapping, which runs the specified command inside a shell, with `/bin/sh -c`: ```dockerfile
RUN apt-get update
``` Or using the `exec` method, which avoids shell string expansion, and allows execution in images that don't have `/bin/sh`: ```dockerfile
RUN [ "apt-get", "update" ]
``` .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## More about the `RUN` instruction `RUN` will do the following: * Execute a command. * Record changes made to the filesystem. * Work great to install libraries, packages, and various files. `RUN` will NOT do the following: * Record state of *processes*. * Automatically start daemons. If you want to start something automatically when the container runs, you should use `CMD` and/or `ENTRYPOINT`. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Collapsing layers It is possible to execute multiple commands in a single step: ```dockerfile
RUN apt-get update && apt-get install -y wget && apt-get clean
``` It is also possible to break a command onto multiple lines: ```dockerfile
RUN apt-get update \
 && apt-get install -y wget \
 && apt-get clean
``` .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `EXPOSE` instruction The `EXPOSE` instruction tells Docker what ports are to be published in this image. ```dockerfile
EXPOSE 8080
EXPOSE 80 443
EXPOSE 53/tcp 53/udp
``` * All ports are private by default. * Declaring a port with `EXPOSE` is not enough to make it public. * The `Dockerfile` doesn't control on which port a service gets exposed. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Exposing ports * When you `docker run -p ...`, that port becomes public. (Even if it was not declared with `EXPOSE`.) * When you `docker run -P ...` (without port number), all ports declared with `EXPOSE` become public. A *public port* is reachable from other containers and from outside the host. A *private port* is not reachable from outside. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `COPY` instruction The `COPY` instruction adds files and content from your host into the image. ```dockerfile
COPY . /src
``` This will add the contents of the *build context* (the directory passed as an argument to `docker build`) to the directory `/src` in the container.
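To make the build-context relationship concrete, here is a minimal sketch (the file layout and image name are illustrative):

```bash
# Layout of the build context (the directory passed to docker build):
# .
# ├── Dockerfile      (contains: COPY . /src)
# └── app.py
docker build -t copy-demo .        # "." is the build context
docker run --rm copy-demo ls /src  # → Dockerfile  app.py
```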
.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Build context isolation Note: you can only reference files and directories *inside* the build context. Absolute paths are taken as being anchored to the build context, so the two following lines are equivalent: ```dockerfile
COPY . /src
COPY / /src
``` Attempts to use `..` to get out of the build context will be detected and blocked by Docker, and the build will fail. Otherwise, a `Dockerfile` could succeed on host A, but fail on host B. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## `ADD` `ADD` works almost like `COPY`, but has a few extra features. `ADD` can get remote files: ```dockerfile
ADD http://www.example.com/webapp.jar /opt/
``` This would download the `webapp.jar` file and place it in the `/opt` directory. `ADD` will automatically unpack zip files and tar archives: ```dockerfile
ADD ./assets.zip /var/www/htdocs/assets/
``` This would unpack `assets.zip` into `/var/www/htdocs/assets`. *However,* `ADD` will not automatically unpack remote archives. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## `ADD`, `COPY`, and the build cache * Before creating a new layer, Docker checks its build cache. * For most Dockerfile instructions, Docker only looks at the `Dockerfile` content to do the cache lookup. * For `ADD` and `COPY` instructions, Docker also checks if the files to be added to the container have been changed. * `ADD` always needs to download the remote file before it can check if it has been changed. (It cannot use, e.g., ETags or If-Modified-Since headers.) .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## `VOLUME` The `VOLUME` instruction tells Docker that a specific directory should be a *volume*. ```dockerfile
VOLUME /var/lib/mysql
``` Filesystem access in volumes bypasses the copy-on-write layer, offering native performance to I/O done in those directories. Volumes can be attached to multiple containers, allowing data to be "ported" over from one container to another, e.g. to upgrade a database to a newer version. It is possible to start a container in "read-only" mode. The container filesystem will be made read-only, but volumes can still have read/write access if necessary. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `WORKDIR` instruction The `WORKDIR` instruction sets the working directory for subsequent instructions. It also affects `CMD` and `ENTRYPOINT`, since it sets the working directory used when starting the container. ```dockerfile
WORKDIR /src
``` You can specify `WORKDIR` again to change the working directory for further operations. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `ENV` instruction The `ENV` instruction specifies environment variables that should be set in any container launched from the image.
```dockerfile
ENV WEBAPP_PORT 8080
```

This will result in the following environment variable being created in any container launched from this image:

```bash
WEBAPP_PORT=8080
```

You can also specify environment variables when you use `docker run`.

```bash
$ docker run -e WEBAPP_PORT=8000 -e WEBAPP_HOST=www.example.com ...
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `USER` instruction The `USER` instruction sets the user name or UID to use when running the image. It can be used multiple times to change back to root or to another user. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `CMD` instruction The `CMD` instruction is a default command run when a container is launched from the image.

```dockerfile
CMD [ "nginx", "-g", "daemon off;" ]
```

Means we don't need to specify `nginx -g "daemon off;"` when running the container. Instead of:

```bash
$ docker run <yourname>/web_image nginx -g "daemon off;"
```

We can just do:

```bash
$ docker run <yourname>/web_image
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## More about the `CMD` instruction Just like `RUN`, the `CMD` instruction comes in two forms. The first executes in a shell:

```dockerfile
CMD nginx -g "daemon off;"
```

The second executes directly, without shell processing:

```dockerfile
CMD [ "nginx", "-g", "daemon off;" ]
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- class: extra-details ## Overriding the `CMD` instruction The `CMD` can be overridden when you run a container.

```bash
$ docker run -it <yourname>/web_image bash
```

Will run `bash` instead of `nginx -g "daemon off;"`. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## The `ENTRYPOINT` instruction The `ENTRYPOINT` instruction is like the `CMD` instruction, but arguments given on the command line are *appended* to the entry point. Note: use the "exec" syntax (`[ "..." ]`); with the shell syntax, command-line arguments would be ignored.

```dockerfile
ENTRYPOINT [ "/bin/ls" ]
```

If we were to run:

```bash
$ docker run training/ls -l
```

Instead of trying to run `-l`, the container will run `/bin/ls -l`. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- class: extra-details ## Overriding the `ENTRYPOINT` instruction The entry point can be overridden as well.

```bash
$ docker run -it training/ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
$ docker run -it --entrypoint bash training/ls
root@d902fb7b1fc7:/#
```

.debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## How `CMD` and `ENTRYPOINT` interact The `CMD` and `ENTRYPOINT` instructions work best when used together.

```dockerfile
ENTRYPOINT [ "nginx" ]
CMD [ "-g", "daemon off;" ]
```

The `ENTRYPOINT` specifies the command to be run and the `CMD` specifies its options. On the command line we can then potentially override the options when needed.
```bash
$ docker run -d <yourname>/web_image -t
```

This will override the options provided by `CMD` with the new flag (`nginx` will run with `-t` instead of `-g "daemon off;"`). .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- ## Advanced Dockerfile instructions * `ONBUILD` lets you stash instructions that will be executed when this image is used as a base for another one. * `LABEL` adds arbitrary metadata to the image. * `ARG` defines build-time variables (optional or mandatory). * `STOPSIGNAL` sets the signal for `docker stop` (`TERM` by default). * `HEALTHCHECK` defines a command assessing the status of the container. * `SHELL` sets the default program to use for string-syntax RUN, CMD, etc. .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- class: extra-details ## The `ONBUILD` instruction The `ONBUILD` instruction is a trigger. It sets instructions that will be executed when another image is built from the image being built. This is useful for building images which will be used as a base to build other images.

```dockerfile
ONBUILD COPY . /src
```

* You can't chain `ONBUILD` instructions with `ONBUILD`. * `ONBUILD` can't be used to trigger `FROM` instructions. ??? :EN:- Advanced Dockerfile syntax :FR:- Dockerfile niveau expert .debug[[containers/Advanced_Dockerfiles.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Advanced_Dockerfiles.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/containers-by-the-water.jpg)] --- name: toc-reducing-image-size class: title Reducing image size .nav[ [Previous part](#toc-advanced-dockerfile-syntax) | [Back to table of contents](#toc-part-3) | [Next part](#toc-multi-stage-builds) ] .debug[(automatically generated title slide)] --- # Reducing image size * In the previous example, our final image contained: * our `hello` program * its source code * the compiler * Only the first one is strictly necessary. * We are going to see how to obtain an image without the superfluous components. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Can't we remove superfluous files with `RUN`? What happens if we do one of the following commands? - `RUN rm -rf ...` - `RUN apt-get remove ...` - `RUN make clean ...` -- This adds a layer which removes a bunch of files. But the previous layers (which added the files) still exist. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Removing files with an extra layer When downloading an image, all the layers must be downloaded.

| Dockerfile instruction | Layer size | Image size |
| ---------------------- | ---------- | ---------- |
| `FROM ubuntu` | Size of base image | Size of base image |
| `...` | ... | Sum of this layer + all previous ones |
| `RUN apt-get install somepackage` | Size of files added (e.g. a few MB) | Sum of this layer + all previous ones |
| `...` | ... | Sum of this layer + all previous ones |
| `RUN apt-get remove somepackage` | Almost zero (just metadata) | Same as previous one |

Therefore, `RUN rm` does not reduce the size of the image or free up disk space.
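We can see this with a minimal sketch (the image tag and file name are hypothetical):

```dockerfile
FROM ubuntu
# create a 100 MB file, then remove it in a separate layer
RUN dd if=/dev/zero of=/bigfile bs=1M count=100
RUN rm /bigfile
```

After building this (e.g. `docker build -t layer-demo .`), `docker history layer-demo` should show a layer of about 100 MB for the `dd` step, and a near-zero layer for the `rm` step: the image stays roughly 100 MB larger than the base image.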
.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Removing unnecessary files Various techniques are available to obtain smaller images: - collapsing layers, - adding binaries that are built outside of the Dockerfile, - squashing the final image, - multi-stage builds. Let's review them quickly. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Collapsing layers You will frequently see Dockerfiles like this:

```dockerfile
FROM ubuntu
RUN apt-get update && apt-get install xxx && ... && apt-get remove xxx && ...
```

Or the (more readable) variant:

```dockerfile
FROM ubuntu
RUN apt-get update \
 && apt-get install xxx \
 && ... \
 && apt-get remove xxx \
 && ...
```

This `RUN` command gives us a single layer. The files that are added, then removed in the same layer, do not grow the layer size. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Collapsing layers: pros and cons Pros: - works on all versions of Docker - doesn't require extra tools Cons: - not very readable - some unnecessary files might still remain if the cleanup is not thorough - that layer is expensive (slow to build) .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Building binaries outside of the Dockerfile This results in a Dockerfile looking like this:

```dockerfile
FROM ubuntu
COPY xxx /usr/local/bin
```

Of course, this implies that the file `xxx` exists in the build context. That file has to exist before you can run `docker build`. For instance, it can: - exist in the code repository, - be created by another tool (script, Makefile...), - be created by another container image and extracted from the image. See for instance the [busybox official image](https://github.com/docker-library/busybox/blob/fe634680e32659aaf0ee0594805f74f332619a90/musl/Dockerfile) or this [older busybox image](https://github.com/jpetazzo/docker-busybox). .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Building binaries outside: pros and cons Pros: - final image can be very small Cons: - requires an extra build tool - we're back in dependency hell and "works on my machine" Cons, if binary is added to code repository: - breaks portability across different platforms - grows repository size a lot if the binary is updated frequently .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Squashing the final image The idea is to transform the final image into a single-layer image. This can be done in (at least) two ways. - Activate experimental features and squash the final image:

```bash
docker image build --squash ...
```

- Export/import the final image.

```bash
docker build -t temp-image .
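# run a container from the image (with the no-op command `true`),
# export its filesystem, and re-import it as a single-layer image: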
docker run --entrypoint true --name temp-container temp-image
docker export temp-container | docker import - final-image
docker rm temp-container
docker rmi temp-image
```

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Squashing the image: pros and cons Pros: - single-layer images are smaller and faster to download - removed files no longer take up storage and network resources Cons: - we still need to actively remove unnecessary files - squash operation can take a lot of time (on big images) - squash operation does not benefit from cache (even if we change just a tiny file, the whole image needs to be re-squashed) .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds Multi-stage builds allow us to have multiple *stages*. Each stage is a separate image, and can copy files from previous stages. We're going to see how they work in more detail. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/distillery-containers.jpg)] --- name: toc-multi-stage-builds class: title Multi-stage builds .nav[ [Previous part](#toc-reducing-image-size) | [Back to table of contents](#toc-part-3) | [Next part](#toc-publishing-images-to-the-docker-hub) ] .debug[(automatically generated title slide)] --- # Multi-stage builds * At any point in our `Dockerfile`, we can add a new `FROM` line. * This line starts a new stage of our build. * Each stage can access the files of the previous stages with `COPY --from=...`. * When a build is tagged (with `docker build -t ...`), the last stage is tagged. * Previous stages are not discarded: they will be used for caching, and can be referenced. .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds in practice * Each stage is numbered, starting at `0` * We can copy a file from a previous stage by indicating its number, e.g.:

```dockerfile
COPY --from=0 /file/from/first/stage /location/in/current/stage
```

* We can also name stages, and reference these names:

```dockerfile
FROM golang AS builder
RUN ...

FROM alpine
COPY --from=builder /go/bin/mylittlebinary /usr/local/bin/
```

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage builds for our C program We will change our Dockerfile to: * give a nickname to the first stage: `compiler` * add a second stage using the same `ubuntu` base image * add the `hello` binary to the second stage * make sure that `CMD` is in the second stage The resulting Dockerfile is on the next slide.
.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Multi-stage build `Dockerfile` Here is the final Dockerfile:

```dockerfile
FROM ubuntu AS compiler
RUN apt-get update
RUN apt-get install -y build-essential
COPY hello.c /
RUN make hello

FROM ubuntu
COPY --from=compiler /hello /hello
CMD /hello
```

Let's build it, and check that it works correctly:

```bash
docker build -t hellomultistage .
docker run hellomultistage
```

.debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Comparing single/multi-stage build image sizes List our images with `docker images`, and check the size of: - the `ubuntu` base image, - the single-stage `hello` image, - the multi-stage `hellomultistage` image. We can achieve even smaller images if we use smaller base images. However, if we use common base images (e.g. if we standardize on `ubuntu`), these common images will be pulled only once per node, so they are virtually "free." .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- ## Build targets * We can also tag an intermediary stage with the following command:

```bash
docker build --target STAGE --tag NAME .
```

* This will create an image (named `NAME`) corresponding to stage `STAGE` * This can be used to easily access an intermediary stage for inspection (instead of parsing the output of `docker build` to find out the image ID) * This can also be used to describe multiple images from a single Dockerfile (instead of using multiple Dockerfiles, which could go out of sync) ??? :EN:Optimizing our images and their build process :EN:- Leveraging multi-stage builds :FR:Optimiser les images et leur construction :FR:- Utilisation d'un *multi-stage build* .debug[[containers/Multi_Stage_Builds.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Multi_Stage_Builds.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/lots-of-containers.jpg)] --- name: toc-publishing-images-to-the-docker-hub class: title Publishing images to the Docker Hub .nav[ [Previous part](#toc-multi-stage-builds) | [Back to table of contents](#toc-part-3) | [Next part](#toc-exercise--writing-better-dockerfiles) ] .debug[(automatically generated title slide)] --- # Publishing images to the Docker Hub We have built our first images. We can now publish them to the Docker Hub! *You don't have to do the exercises in this section, because they require an account on the Docker Hub, and we don't want to force anyone to create one.* *Note, however, that creating an account on the Docker Hub is free (and doesn't require a credit card), and hosting public images is free as well.* .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- ## Logging into our Docker Hub account * This can be done from the Docker CLI:

```bash
docker login
```

.warning[When running Docker for Mac/Windows, or Docker on a Linux workstation, it can (and will when possible) integrate with your system's keyring to store your credentials securely. However, on most Linux servers, it will store your credentials in `~/.docker/config.json`.]
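For the curious: when the keyring is not used, the stored credentials look roughly like this (a sketch; the exact contents depend on the Docker version and registry, and the `auth` value is merely `username:password` in base64):

```bash
$ cat ~/.docker/config.json
{
  "auths": {
    "https://index.docker.io/v1/": {
      "auth": "dXNlcm5hbWU6cGFzc3dvcmQ="
    }
  }
}
```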
.debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- ## Image tags and registry addresses * Docker image tags are like Git tags and branches. * They are like *bookmarks* pointing at a specific image ID. * Tagging an image doesn't *rename* an image: it adds another tag. * When pushing an image to a registry, the registry address is in the tag. Example: `registry.example.net:5000/image` * What about Docker Hub images? -- * `jpetazzo/clock` is, in fact, `index.docker.io/jpetazzo/clock` * `ubuntu` is, in fact, `library/ubuntu`, i.e. `index.docker.io/library/ubuntu` .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- ## Tagging an image to push it on the Hub * Let's tag our `figlet` image (or any other to our liking):

```bash
docker tag figlet jpetazzo/figlet
```

* And push it to the Hub:

```bash
docker push jpetazzo/figlet
```

* That's it! -- * Anybody can now `docker run jpetazzo/figlet` anywhere. .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- ## The goodness of automated builds * You can link a Docker Hub repository with a GitHub or BitBucket repository * Each push to GitHub or BitBucket will trigger a build on Docker Hub * If the build succeeds, the new image is available on Docker Hub * You can map tags and branches between source and container images * If you work with public repositories, this is free .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- class: extra-details ## Setting up an automated build * We need a Dockerized repository! * Let's go to https://github.com/jpetazzo/trainingwheels and fork it. * Go to the Docker Hub (https://hub.docker.com/) and sign in. Select "Repositories" in the blue navigation menu. * Select "Create" in the top-right bar, and select "Create Repository+". * Connect your Docker Hub account to your GitHub account. * Click the "Create" button. * Then go to the "Builds" section. * Click on the GitHub icon and select your user and the repository that we just forked. * In the "Build rules" block near the bottom of the page, put `/www` in the "Build Context" column (or whichever directory the Dockerfile is in). * Click "Save and Build" to build the repository immediately (without waiting for a git push). * Subsequent builds will happen automatically, thanks to GitHub hooks. .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- ## Building on the fly - Some services can build images on the fly from a repository - Example: [ctr.run](https://ctr.run/) .lab[ - Use ctr.run to automatically build a container image and run it:

```bash
docker run ctr.run/github.com/undefinedlabs/hello-world
```

] There might be a long pause before the first layer is pulled, because the API behind `docker pull` doesn't allow streaming build logs, and there is no feedback during the build. It is possible to view the build logs by setting up an account on [ctr.run](https://ctr.run/). ???
:EN:- Publishing images to the Docker Hub :FR:- Publier des images sur le Docker Hub .debug[[containers/Publishing_To_Docker_Hub.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Publishing_To_Docker_Hub.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/plastic-containers.JPG)] --- name: toc-exercise--writing-better-dockerfiles class: title Exercise — writing better Dockerfiles .nav[ [Previous part](#toc-publishing-images-to-the-docker-hub) | [Back to table of contents](#toc-part-3) | [Next part](#toc-buildkit) ] .debug[(automatically generated title slide)] --- # Exercise — writing better Dockerfiles Let's update our Dockerfiles to leverage multi-stage builds! The code is at: https://github.com/jpetazzo/wordsmith Use a different tag for these images, so that we can compare their sizes. What's the size difference between single-stage and multi-stage builds? .debug[[containers/Exercise_Dockerfile_Advanced.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Exercise_Dockerfile_Advanced.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/train-of-containers-1.jpg)] --- name: toc-buildkit class: title Buildkit .nav[ [Previous part](#toc-exercise--writing-better-dockerfiles) | [Back to table of contents](#toc-part-4) | [Next part](#toc-container-network-drivers) ] .debug[(automatically generated title slide)] --- # Buildkit - "New" backend for Docker builds - announced in 2017 - ships with Docker Engine 18.09 - enabled by default on Docker Desktop in 2021 - Huge improvements in build efficiency - 100% compatible with existing Dockerfiles - New features for multi-arch - Not just for building container images .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Old vs New - Classic `docker build`: - copy whole build context - linear execution - `docker run` + `docker commit` + `docker run` + `docker commit`... - Buildkit: - copy files only when they are needed; cache them - compute dependency graph (dependencies are expressed by `COPY`) - parallel execution - doesn't rely on Docker, but on internal runner/snapshotter - can run in "normal" containers (including in Kubernetes pods) .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Parallel execution - In multi-stage builds, all stages can be built in parallel (example: https://github.com/jpetazzo/shpod; [before] and [after]) - Stages are built only when they are necessary (i.e. 
if their output is tagged or used in another necessary stage) - Files are copied from context only when needed - Files are cached in the builder [before]: https://github.com/jpetazzo/shpod/blob/c6efedad6d6c3dc3120dbc0ae0a6915f85862474/Dockerfile [after]: https://github.com/jpetazzo/shpod/blob/d20887bbd56b5fcae2d5d9b0ce06cae8887caabf/Dockerfile .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Turning it on and off - On recent versions of Docker Desktop (since 2021): *enabled by default* - On older versions, or on Docker CE (Linux): `export DOCKER_BUILDKIT=1` - Turning it off: `export DOCKER_BUILDKIT=0` .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Multi-arch support - Historically, Docker only ran on x86_64 / amd64 (the Intel/AMD 64-bit architecture) - Folks have been running it on 32-bit ARM for ages (e.g. Raspberry Pi) - This required a Go compiler and appropriate base images (which meant changing/adapting Dockerfiles to use these base images) - Docker [image manifest v2 schema 2][manifest] introduces multi-arch images (`FROM alpine` automatically gets the right image for your architecture) [manifest]: https://docs.docker.com/registry/spec/manifest-v2-2/ .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Why? - Raspberry Pi (32-bit and 64-bit ARM) - Other ARM-based embedded systems (ODROID, NVIDIA Jetson...) - Apple M1 - AWS Graviton - Ampere Altra (e.g. on Oracle Cloud) - ... .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Multi-arch builds in a nutshell Use the `docker buildx build` command:

```bash
docker buildx build … \
  --platform linux/amd64,linux/arm64,linux/arm/v7,linux/386 \
  [--tag jpetazzo/hello --push]
```

- Requires all base images to be available for these platforms - Must not use binary downloads with hard-coded architectures!
(streamlining a Dockerfile for multi-arch: [before], [after]) [before]: https://github.com/jpetazzo/shpod/blob/d20887bbd56b5fcae2d5d9b0ce06cae8887caabf/Dockerfile [after]: https://github.com/jpetazzo/shpod/blob/c50789e662417b34fea6f5e1d893721d66d265b7/Dockerfile .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Native vs emulated vs cross - Native builds: *aarch64 machine running aarch64 programs building aarch64 images/binaries* - Emulated builds: *x86_64 machine running aarch64 programs building aarch64 images/binaries* - Cross builds: *x86_64 machine running x86_64 programs building aarch64 images/binaries* .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Native - Dockerfiles are (relatively) simple to write (nothing special to do to handle multi-arch; just avoid hard-coded archs) - Best performance - Requires "exotic" machines - Requires setting up a build farm .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Emulated - Dockerfiles are (relatively) simple to write - Emulation performance can vary (from "OK" to "ouch this is slow") - Emulation isn't always perfect (weird bugs/crashes are rare but can happen) - Doesn't require special machines - Supports arbitrary architectures thanks to QEMU .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Cross - Dockerfiles are more complicated to write - Requires cross-compilation toolchains - Performance is good - Doesn't require special machines .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Native builds - Requires base images to be available - To view available architectures for an image:

```bash
regctl manifest get --list <image>
docker manifest inspect <image>
```

- Nothing special to do, *except* when downloading binaries!

```
https://releases.hashicorp.com/terraform/1.1.5/terraform_1.1.5_linux_`amd64`.zip
```

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Finding the right architecture `uname -m` → armv7l, aarch64, i686, x86_64 `GOARCH` (from `go env`) → arm, arm64, 386, amd64 In Dockerfile, add `ARG TARGETARCH` (or `ARG TARGETPLATFORM`) - `TARGETARCH` matches `GOARCH` - `TARGETPLATFORM` → linux/arm/v7, linux/arm64, linux/386, linux/amd64 .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- class: extra-details ## Welp Sometimes, binary releases be like:

```
Linux_arm64.tar.gz
Linux_ppc64le.tar.gz
Linux_s390x.tar.gz
Linux_x86_64.tar.gz
```

This needs a bit of custom mapping.
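For instance, here is a sketch of such a mapping (the download URL and archive names are hypothetical; adapt them to the actual project):

```dockerfile
FROM alpine
ARG TARGETARCH
# map Docker's TARGETARCH values to this project's release names
RUN case "$TARGETARCH" in \
      amd64)   ARCH=x86_64  ;; \
      arm64)   ARCH=arm64   ;; \
      ppc64le) ARCH=ppc64le ;; \
      s390x)   ARCH=s390x   ;; \
      *) echo "unsupported architecture: $TARGETARCH" && exit 1 ;; \
    esac \
 && wget -O /tmp/release.tar.gz "https://example.com/releases/Linux_${ARCH}.tar.gz"
```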
.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Emulation - Leverages `binfmt_misc` and QEMU on Linux - Enabling:

```bash
docker run --rm --privileged aptman/qus -s -- -p
```

- Disabling:

```bash
docker run --rm --privileged aptman/qus -- -r
```

- Checking status:

```bash
ls -l /proc/sys/fs/binfmt_misc
```

.debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- class: extra-details ## How it works - `binfmt_misc` lets us register _interpreters_ for binaries, e.g.: - [DOSBox][dosbox] for DOS programs - [Wine][wine] for Windows programs - [QEMU][qemu] for Linux programs for other architectures - When we try to execute e.g. a SPARC binary on our x86_64 machine: - `binfmt_misc` detects the binary format and invokes `qemu-<arch> the-binary ...` - QEMU translates SPARC instructions to x86_64 instructions - system calls go straight to the kernel [dosbox]: https://www.dosbox.com/ [QEMU]: https://www.qemu.org/ [wine]: https://www.winehq.org/ .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- class: extra-details ## QEMU registration - The `aptman/qus` image mentioned earlier contains static QEMU builds - It registers all these interpreters with the kernel - For more details, check: - https://github.com/dbhi/qus - https://dbhi.github.io/qus/ .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Cross-compilation - Cross-compilation is about 10x faster than emulation (non-scientific benchmarks!) - In Dockerfile, add: `ARG BUILDARCH BUILDPLATFORM TARGETARCH TARGETPLATFORM` - Can use `FROM --platform=$BUILDPLATFORM <image>` - Then use `$TARGETARCH` or `$TARGETPLATFORM` (e.g. for Go, `export GOARCH=$TARGETARCH`) - Check [tonistiigi/xx][xx] and [Toni's blog][toni] for some amazing cross tools! [xx]: https://github.com/tonistiigi/xx [toni]: https://medium.com/@tonistiigi/faster-multi-platform-builds-dockerfile-cross-compilation-guide-part-1-ec087c719eaf .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## Checking runtime capabilities Build and run the following Dockerfile:

```dockerfile
FROM --platform=linux/amd64 busybox AS amd64
FROM --platform=linux/arm64 busybox AS arm64
FROM --platform=linux/arm/v7 busybox AS arm32
FROM --platform=linux/386 busybox AS ia32
FROM alpine
RUN apk add file
WORKDIR /root
COPY --from=amd64 /bin/busybox /root/amd64/busybox
COPY --from=arm64 /bin/busybox /root/arm64/busybox
COPY --from=arm32 /bin/busybox /root/arm32/busybox
COPY --from=ia32 /bin/busybox /root/ia32/busybox
CMD for A in *; do echo "$A => $($A/busybox uname -a)"; done
```

It will indicate which executables can be run on your engine. .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- ## More than builds - Buildkit is also used in other systems: - [Earthly] - generic repeatable build pipelines - [Dagger] - CICD pipelines that run anywhere - and more!
[Earthly]: https://earthly.dev/ [Dagger]: https://dagger.io/ .debug[[containers/Buildkit.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Buildkit.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/train-of-containers-2.jpg)] --- name: toc-container-network-drivers class: title Container network drivers .nav[ [Previous part](#toc-buildkit) | [Back to table of contents](#toc-part-4) | [Next part](#toc-deep-dive-into-container-internals) ] .debug[(automatically generated title slide)] --- # Container network drivers The Docker Engine supports different network drivers. The built-in drivers include: * `bridge` (default) * `null` (for the special network called `none`) * `host` (for the special network called `host`) * `container` (that one is a bit magic!) The network is selected with `docker run --net ...`. Each network is managed by a driver. The different drivers are explained in more detail on the following slides. .debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)] --- ## The default bridge * By default, the container gets a virtual `eth0` interface. (In addition to its own private `lo` loopback interface.) * That interface is provided by a `veth` pair. * It is connected to the Docker bridge. (Named `docker0` by default; configurable with `--bridge`.) * Addresses are allocated on a private, internal subnet. (Docker uses 172.17.0.0/16 by default; configurable with `--bip`.) * Outbound traffic goes through an iptables MASQUERADE rule. * Inbound traffic goes through an iptables DNAT rule. * The container can have its own routes, iptables rules, etc. .debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)] --- ## The null driver * Container is started with `docker run --net none ...` * It only gets the `lo` loopback interface. No `eth0`. * It can't send or receive network traffic. * Useful for isolated/untrusted workloads. .debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)] --- ## The host driver * Container is started with `docker run --net host ...` * It sees (and can access) the network interfaces of the host. * It can bind any address, any port (for ill and for good). * Network traffic doesn't have to go through NAT, bridge, or veth. * Performance = native! Use cases: * Performance-sensitive applications (VOIP, gaming, streaming...) * Peer discovery (e.g. Erlang port mapper, Raft, Serf...) .debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)] --- ## The container driver * Container is started with `docker run --net container:id ...` * It re-uses the network stack of another container. * It shares with this other container the same interfaces, IP address(es), routes, iptables rules, etc. * Those containers can communicate over their `lo` interface. (i.e. one can bind to 127.0.0.1 and the others can connect to it.) ???
:EN:Advanced container networking :EN:- Transparent network access with the "host" driver :EN:- Sharing is caring with the "container" driver :FR:Paramétrage réseau avancé :FR:- Accès transparent au réseau avec le mode "host" :FR:- Partage de la pile réseau avec le mode "container" .debug[[containers/Network_Drivers.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Network_Drivers.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/two-containers-on-a-truck.jpg)] --- name: toc-deep-dive-into-container-internals class: title Deep dive into container internals .nav[ [Previous part](#toc-container-network-drivers) | [Back to table of contents](#toc-part-4) | [Next part](#toc-control-groups) ] .debug[(automatically generated title slide)] --- # Deep dive into container internals In this chapter, we will explain some of the fundamental building blocks of containers. This will give you a solid foundation so you can: - understand "what's going on" in complex situations, - anticipate the behavior of containers (performance, security...) in new scenarios, - implement your own container engine. The last item should be done for educational purposes only! .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## There is no container code in the Linux kernel - If we search "container" in the Linux kernel code, we find: - generic code to manipulate data structures (like linked lists, etc.), - unrelated concepts like "ACPI containers", - *nothing* relevant to "our" containers! - Containers are composed using multiple independent features. - On Linux, containers rely on "namespaces, cgroups, and some filesystem magic." - Security also requires features like capabilities, seccomp, LSMs... .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/wall-of-containers.jpeg)] --- name: toc-control-groups class: title Control groups .nav[ [Previous part](#toc-deep-dive-into-container-internals) | [Back to table of contents](#toc-part-4) | [Next part](#toc-namespaces) ] .debug[(automatically generated title slide)] --- # Control groups - Control groups provide resource *metering* and *limiting*. - This covers a number of "usual suspects" like: - memory - CPU - block I/O - network (with cooperation from iptables/tc) - And a few exotic ones: - huge pages (a special way to allocate memory) - RDMA (resources specific to InfiniBand / remote memory transfer) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Crowd control - Control groups also make it possible to group processes for special operations: - freezer (conceptually similar to a "mass-SIGSTOP/SIGCONT") - perf_event (gather performance events on multiple processes) - cpuset (limit or pin processes to specific CPUs) - There is a "pids" cgroup to limit the number of processes in a given group. - There is also a "devices" cgroup to control access to device nodes. (i.e. everything in `/dev`.)
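For example, here is a sketch using the "pids" cgroup (cgroups v1; the `demo` group name is arbitrary):

```bash
# create a group in the pids cgroup, cap it at 10 processes,
# and move the current shell into it
sudo mkdir /sys/fs/cgroup/pids/demo
sudo tee /sys/fs/cgroup/pids/demo/pids.max <<< 10
sudo tee /sys/fs/cgroup/pids/demo/tasks <<< $$
```

Any attempt to exceed 10 processes in that group (e.g. a fork bomb) should then fail with `fork: Resource temporarily unavailable`.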
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Generalities - Cgroups form a hierarchy (a tree). - We can create nodes in that hierarchy. - We can associate limits to a node. - We can move a process (or multiple processes) to a node. - The process (or processes) will then respect these limits. - We can check the current usage of each node. - In other words: limits are optional (if we only want accounting). - When a process is created, it is placed in its parent's groups. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Example The numbers are PIDs. The names are the names of our nodes (arbitrarily chosen). .small[

```bash
cpu                      memory
├── batch                ├── stateless
│   ├── cryptoscam       │   ├── 25
│   │   └── 52           │   ├── 26
│   └── ffmpeg           │   ├── 27
│       ├── 109          │   ├── 52
│       └── 88           │   ├── 109
└── realtime             │   └── 88
    ├── nginx            └── databases
    │   ├── 25               ├── 1008
    │   ├── 26               └── 524
    │   └── 27
    ├── postgres
    │   └── 524
    └── redis
        └── 1008
```

] .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Cgroups v1 vs v2 - Cgroups v1 are available on all systems (and widely used). - Cgroups v2 are a huge refactor. (Development started in Linux 3.10, released in 4.5.) - Cgroups v2 have a number of differences: - single hierarchy (instead of one tree per controller), - processes can only be on leaf nodes (not inner nodes), - and of course many improvements / refactorings. - Cgroups v2 enabled by default on Fedora 31 (2019), Ubuntu 21.10... .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup: accounting - Keeps track of pages used by each group: - file (read/write/mmap from block devices), - anonymous (stack, heap, anonymous mmap), - active (recently accessed), - inactive (candidate for eviction). - Each page is "charged" to a group. - Pages can be shared across multiple groups. (Example: multiple processes reading from the same files.) - To view all the counters kept by this cgroup:

```bash
$ cat /sys/fs/cgroup/memory/memory.stat
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup v1: limits - Each group can have (optional) hard and soft limits. - Limits can be set for different kinds of memory: - physical memory, - kernel memory, - total memory (including swap). .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Soft limits and hard limits - Soft limits are not enforced. (But they influence reclaim under memory pressure.) - Hard limits *cannot* be exceeded: - if a group of processes exceeds a hard limit, - and if the kernel cannot reclaim any memory, - then the OOM (out-of-memory) killer is triggered, - and processes are killed until memory gets below the limit again.
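A sketch with cgroups v1 (the `demo` group is hypothetical; there is a full hands-on example a few slides down):

```bash
sudo mkdir /sys/fs/cgroup/memory/demo
# soft limit: influences page reclaim under memory pressure
sudo tee /sys/fs/cgroup/memory/demo/memory.soft_limit_in_bytes <<< 50000000
# hard limit: exceeding it (with no reclaimable memory) triggers the OOM killer
sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes <<< 100000000
```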
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Avoiding the OOM killer - For some workloads (databases and stateful systems), killing processes because we run out of memory is not acceptable. - The "oom-notifier" mechanism helps with that. - When "oom-notifier" is enabled and a hard limit is exceeded: - all processes in the cgroup are frozen, - a notification is sent to user space (instead of killing processes), - user space can then raise limits, migrate containers, etc., - once the memory usage is below the hard limit, unfreeze the cgroup. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Overhead of the memory cgroup - Each time a process grabs or releases a page, the kernel updates counters. - This adds some overhead. - Unfortunately, this cannot be enabled/disabled per process. - It has to be done system-wide, at boot time. - Also, when multiple groups use the same page: - only the first group gets "charged", - but if it stops using it, the "charge" is moved to another group. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a limit with the memory cgroup Create a new memory cgroup:

```bash
$ CG=/sys/fs/cgroup/memory/onehundredmegs
$ sudo mkdir $CG
```

Limit it to approximately 100MB of memory usage:

```bash
$ sudo tee $CG/memory.memsw.limit_in_bytes <<< 100000000
```

Move the current process to that cgroup:

```bash
$ sudo tee $CG/tasks <<< $$
```

The current process *and all its future children* are now limited. (Confused about `<<<`? Look at the next slide!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## What's `<<<`? - This is a "here string". (It is a non-POSIX shell extension.) - The following commands are equivalent:

```bash
foo <<< hello
```

```bash
echo hello | foo
```

- We used it (with `sudo tee`) because a plain `>` redirection would be performed by our regular, non-root shell. This, however, would be valid:

```bash
sudo sh -c "echo $$ > $CG/tasks"
```

The following commands, however, would be invalid:

```bash
sudo echo $$ > $CG/tasks
```

```bash
sudo -i # (or su)
echo $$ > $CG/tasks
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Testing the memory limit Start the Python interpreter:

```bash
$ python
Python 3.6.4 (default, Jan 5 2018, 02:35:40)
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Allocate 80 megabytes:

```python
>>> s = "!" * 1000000 * 80
```

Add 20 megabytes more:

```python
>>> t = "!" * 1000000 * 20
Killed
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Memory cgroup v2: limits - `memory.min` = hard reservation (guaranteed memory for this cgroup) - `memory.low` = soft reservation ("*try* not to reclaim memory if we're below this") - `memory.high` = soft limit (aggressively reclaim memory; don't trigger OOMK) - `memory.max` = hard limit (triggers OOMK) - `memory.swap.high` = aggressively reclaim memory when using that much swap - `memory.swap.max` = prevent using more swap than this .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## CPU cgroup - Keeps track of CPU time used by a group of processes. (This is easier and more accurate than `getrusage` and `/proc`.) - Keeps track of usage per CPU as well. (i.e., "this group of processes used X seconds of CPU0 and Y seconds of CPU1".) - Allows setting relative weights used by the scheduler. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Cpuset cgroup - Pin groups to specific CPU(s). - Use-case: reserve CPUs for specific apps. - Warning: make sure that "default" processes aren't using all CPUs! - CPU pinning can also avoid performance loss due to cache flushes. - This is also relevant for NUMA systems. - Provides extra dials and knobs. (Per zone memory pressure, process migration costs...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Blkio cgroup - Keeps track of I/Os for each group: - per block device - read vs write - sync vs async - Set throttle (limits) for each group: - per block device - read vs write - ops vs bytes - Set relative weights for each group. - Note: most writes go through the page cache. (So classic writes will appear to be unthrottled at first.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net_cls and net_prio cgroup - Only works for egress (outgoing) traffic. - Automatically set traffic class or priority for traffic generated by processes in the group. - Net_cls will assign traffic to a class. - Classes have to be matched with tc or iptables, otherwise traffic just flows normally. - Net_prio will assign traffic to a priority. - Priorities are used by queuing disciplines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Devices cgroup - Controls what the group can do on device nodes - Permissions include read/write/mknod - Typical use: - allow `/dev/{tty,zero,random,null}` ... - deny everything else - A few interesting nodes: - `/dev/net/tun` (network interface manipulation) - `/dev/fuse` (filesystems in user space) - `/dev/kvm` (VMs in containers, yay inception!)
- `/dev/dri` (GPU) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/Container-Ship-Freighter-Navigation-Elbe-Romance-1782991.jpg)] --- name: toc-namespaces class: title Namespaces .nav[ [Previous part](#toc-control-groups) | [Back to table of contents](#toc-part-4) | [Next part](#toc-security-features) ] .debug[(automatically generated title slide)] --- # Namespaces - Provide processes with their own view of the system. - Namespaces limit what you can see (and therefore, what you can use). - These namespaces are available in modern kernels: - pid - net - mnt - uts - ipc - user - time - cgroup (We are going to detail them individually.) - Each process belongs to one namespace of each type. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Namespaces are always active - Namespaces exist even when you don't use containers. - This is a bit similar to the UID field in UNIX processes: - all processes have the UID field, even if no user exists on the system - the field always has a value / the value is always defined (i.e. any process running on the system has some UID) - the value of the UID field is used when checking permissions (the UID field determines which resources the process can access) - You can replace "UID field" with "namespace" above and it still works! - In other words: even when you don't use containers, there is one namespace of each type, containing all the processes on the system. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Manipulating namespaces - Namespaces are created with two methods: - the `clone()` system call (used when creating new threads and processes), - the `unshare()` system call. - The Linux tool `unshare` allows doing that from a shell. - A new process can re-use none / all / some of the namespaces of its parent. - It is possible to "enter" a namespace with the `setns()` system call. - The Linux tool `nsenter` allows doing that from a shell. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces lifecycle - When the last process of a namespace exits, the namespace is destroyed. - All the associated resources are then removed. - Namespaces are materialized by pseudo-files in `/proc/<pid>/ns`.

```bash
ls -l /proc/self/ns
```

- It is possible to compare namespaces by checking these files. (This helps to answer the question, "are these two processes in the same namespace?") - It is possible to preserve a namespace by bind-mounting its pseudo-file. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Namespaces can be used independently - As mentioned in the previous slides: *A new process can re-use none / all / some of the namespaces of its parent.* - We are going to use that property in the examples in the next slides. - We are going to present each type of namespace.
- For each type, we will provide an example using only that namespace. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## UTS namespace - gethostname / sethostname - Allows setting a custom hostname for a container. - That's (mostly) it! - Also allows setting the NIS domain. (If you don't know what a NIS domain is, you don't have to worry about it!) - If you're wondering: UTS = UNIX time sharing. - This namespace was named like this because of the `struct utsname`, which is commonly used to obtain the machine's hostname, architecture, etc. (The more you know!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Creating our first namespace Let's use `unshare` to create a new process that will have its own UTS namespace:

```bash
$ sudo unshare --uts
```

- We have to use `sudo` for most `unshare` operations. - We indicate that we want a new uts namespace, and nothing else. - If we don't specify a program to run, a `$SHELL` is started. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Demonstrating our uts namespace In our new "container", check the hostname, change it, and check it:

```bash
# hostname
nodeX
# hostname tupperware
# hostname
tupperware
```

In another shell, check that the machine's hostname hasn't changed:

```bash
$ hostname
nodeX
```

Exit the "container" with `exit` or `Ctrl-D`. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace overview - Each network namespace has its own private network stack. - The network stack includes: - network interfaces (including `lo`), - routing table**s** (as in `ip rule` etc.), - iptables chains and rules, - sockets (as seen by `ss`, `netstat`). - You can move a network interface from one network namespace to another:

```bash
ip link set dev eth0 netns PID
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Net namespace typical use - Each container is given its own network namespace. - For each network namespace (i.e. each container), a `veth` pair is created. (Two `veth` interfaces act as if they were connected with a cross-over cable.) - One `veth` is moved to the container network namespace (and renamed `eth0`). - The other `veth` is moved to a bridge on the host (e.g. the `docker0` bridge).
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating a network namespace Start a new process with its own network namespace:

```bash
$ sudo unshare --net
```

See that this new network namespace is unconfigured:

```bash
# ping 1.1
connect: Network is unreachable
# ifconfig
# ip link ls
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Creating the `veth` interfaces In another shell (on the host), create a `veth` pair:

```bash
$ sudo ip link add name in_host type veth peer name in_netns
```

Configure the host side (`in_host`):

```bash
$ sudo ip link set in_host master docker0 up
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Moving the `veth` interface *In the process created by `unshare`,* check the PID of our "network container":

```bash
# echo $$
533
```

*On the host*, move the other side (`in_netns`) to the network namespace:

```bash
$ sudo ip link set in_netns netns 533
```

(Make sure to update "533" with the actual PID obtained above!) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Basic network configuration Let's set up `lo` (the loopback interface):

```bash
# ip link set lo up
```

Activate the `veth` interface and rename it to `eth0`:

```bash
# ip link set in_netns name eth0 up
```

.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Allocating IP address and default route *On the host*, check the address of the Docker bridge:

```bash
$ ip addr ls dev docker0
```

(It could be something like `172.17.0.1`.) Pick an IP address in the middle of the same subnet, e.g. `172.17.0.99`. *In the process created by `unshare`,* configure the interface:

```bash
# ip addr add 172.17.0.99/24 dev eth0
# ip route add default via 172.17.0.1
```

(Make sure to update the IP addresses if necessary.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Validating the setup Check that we now have connectivity:

```bash
# ping 1.1
```

Note: we were able to take a shortcut, because Docker is running, and provides us with a `docker0` bridge and a valid `iptables` setup. If Docker is not running, you will need to take care of this! .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## Cleaning up network namespaces - Terminate the process created by `unshare` (with `exit` or `Ctrl-D`). - Since this was the only process in the network namespace, it is destroyed. - All the interfaces in the network namespace are destroyed. - When a `veth` interface is destroyed, it also destroys the other half of the pair. - So we don't have anything else to do to clean up!
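We can verify this from the host (a sketch; `in_host` is the name we used earlier):

```bash
$ ip link show in_host
Device "in_host" does not exist.
```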
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Other ways to use network namespaces - `--net none` gives an empty network namespace to a container. (Effectively isolating it completely from the network.) - `--net host` means "do not containerize the network". (No network namespace is created; the container uses the host network stack.) - `--net container` means "reuse the network namespace of another container". (As a result, both containers share the same interfaces, routes, etc.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Mnt namespace - Processes can have their own root fs (à la chroot). - Processes can also have "private" mounts. This allows: - isolating `/tmp` (per user, per service...) - masking `/proc`, `/sys` (for processes that don't need them) - mounting remote filesystems or sensitive data, making them visible only to allowed processes - Mounts can be totally private, or shared. - At this point, there is no easy way to pass along a mount from one namespace to another. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## Setting up a private `/tmp` Create a new mount namespace: ```bash $ sudo unshare --mount ``` In that new namespace, mount a brand new `/tmp`: ```bash # mount -t tmpfs none /tmp ``` Check the content of `/tmp` in the new namespace, and compare it to the host's. The mount is automatically cleaned up when you exit the process. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## PID namespace - Processes within a PID namespace only "see" processes in the same PID namespace. - Each PID namespace has its own numbering (starting at 1). - When PID 1 goes away, the whole namespace is killed. (When PID 1 goes away on a normal UNIX system, the kernel panics!) - These namespaces can be nested. - A process ends up having multiple PIDs (one per namespace in which it is nested). .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespace in action Create a new PID namespace: ```bash $ sudo unshare --pid --fork ``` (We need the `--fork` flag because the PID namespace is special; more on this below.) Check the process tree in the new namespace: ```bash # ps faux ``` -- class: extra-details, deep-dive 🤔 Why do we see all the processes?!? .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespaces and `/proc` - Tools like `ps` rely on the `/proc` pseudo-filesystem. - Our new namespace still has access to the original `/proc`. - Therefore, it still sees host processes. - But it cannot affect them. (Try to `kill` a process: you will get `No such process`.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## PID namespaces, take 2 - This can be solved by mounting `/proc` in the namespace.
- The `unshare` utility provides a convenience flag, `--mount-proc`. - This flag will mount `/proc` in the namespace. - It will also unshare the mount namespace, so that this mount is local. Try it: ```bash $ sudo unshare --pid --fork --mount-proc # ps faux ``` .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details ## OK, really, why do we need `--fork`? *It is not necessary to remember all these details. This is just an illustration of the complexity of namespaces!* The `unshare` tool calls the `unshare` syscall, then `exec`s the new binary. A process calling `unshare` to create new namespaces is moved to the new namespaces... ... Except for the PID namespace. (Because this would change the current PID of the process from X to 1.) The processes created by the new binary are placed into the new PID namespace. The first one will be PID 1. If PID 1 exits, it is not possible to create additional processes in the namespace. (Attempting to do so will result in `ENOMEM`.) Without the `--fork` flag, the first command that we execute would be PID 1 ... ... And once it exits, we cannot create more processes in the namespace! Check `man 2 unshare` and `man pid_namespaces` if you want more details. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## IPC namespace -- - Does anybody know about IPC? -- - Does anybody *care* about IPC? -- - Allows a process (or group of processes) to have its own: - IPC semaphores - IPC message queues - IPC shared memory ... without risk of conflict with other instances. - Older versions of PostgreSQL cared about this. *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## User namespace - Allows mapping UIDs/GIDs; e.g.: - UID 0→1999 in container C1 is mapped to UID 10000→11999 on host - UID 0→1999 in container C2 is mapped to UID 12000→13999 on host - etc. - UID 0 in the container can still perform privileged operations in the container. (For instance: setting up network interfaces.) - But outside of the container, it is a non-privileged user. - It also means that the UID in containers becomes unimportant. (Just use UID 0 in the container, since it gets squashed to a non-privileged user outside.) - Ultimately enables better privilege separation in container engines. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: extra-details, deep-dive ## User namespace challenges - UIDs need to be mapped when passed between processes or kernel subsystems. - Filesystem permissions and file ownership are more complicated. .small[(E.g. when the same root filesystem is shared by multiple containers running with different UIDs.)] - With the Docker Engine: - some feature combinations are not allowed (e.g.
user namespace + host network namespace sharing) - user namespaces need to be enabled/disabled globally (when the daemon is started) - container images are stored separately (so the first time you toggle user namespaces, you need to re-pull images) *No demo for that one.* .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Time namespace - Virtualize time - Expose a slower/faster clock to some processes (e.g. for simulation purposes) - Expose a clock offset to some processes (simulation, suspend/restore...) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Cgroup namespace - Virtualize access to `/proc/<pid>/cgroup` - Lets containerized processes view their own (relative) cgroup tree .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/ShippingContainerSFBay.jpg)] --- name: toc-security-features class: title Security features .nav[ [Previous part](#toc-namespaces) | [Back to table of contents](#toc-part-4) | [Next part](#toc-orchestration-an-overview) ] .debug[(automatically generated title slide)] --- # Security features - Namespaces and cgroups are not enough to ensure strong security. - We need extra mechanisms: capabilities, seccomp, LSMs. - These mechanisms were already used before containers to harden security. - They can be used together with containers. - Good container engines will automatically leverage these features. (So that you don't have to worry about it.) .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Capabilities - In traditional UNIX, many operations are possible if and only if UID=0 (root). - Some of these operations are very powerful: - changing file ownership, accessing all files ... - Some of these operations deal with system configuration, but can be abused: - setting up network interfaces, mounting filesystems ... - Some of these operations are not very dangerous but are needed by servers: - binding to a port below 1024. - Capabilities are per-process flags to allow these operations individually. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Some capabilities - `CAP_CHOWN`: arbitrarily change file ownership. - `CAP_DAC_OVERRIDE`: arbitrarily bypass file read, write, and execute permission checks. - `CAP_NET_ADMIN`: configure network interfaces, iptables rules, etc. - `CAP_NET_BIND_SERVICE`: bind a port below 1024. See `man capabilities` for the full list and details. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Using capabilities - Container engines will typically drop all "dangerous" capabilities. - You can then re-enable capabilities on a per-container basis, as needed. - With the Docker engine: `docker run --cap-add ...` - If you write your own code to manage capabilities: - make sure that you understand what each capability does, - read about *ambient* capabilities as well.
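To make this concrete, here is a quick sketch (not from the original slides). The `--cap-drop` / `--cap-add` flags are real Docker flags; the exact capabilities a given image needs will vary, and the output shown is illustrative:

```bash
# Drop CAP_CHOWN: chown fails even though we are UID 0 in the container.
$ docker run --rm --cap-drop CHOWN alpine chown nobody /tmp
chown: /tmp: Operation not permitted

# Add CAP_NET_ADMIN: interface configuration (normally denied by the
# default capability set) now works.
$ docker run --rm --cap-add NET_ADMIN alpine ip link set lo mtu 16384
```

Dropping `ALL` and re-adding only what the application actually needs is a common hardening pattern.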
.debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Seccomp - Seccomp stands for "secure computing". - It achieves a high level of security by drastically restricting the available syscalls. - Original seccomp only allows `read()`, `write()`, `exit()`, `sigreturn()`. - The seccomp-bpf extension allows specifying custom filters with BPF rules. - This allows filtering by syscall, and by parameter. - BPF code can perform arbitrarily complex checks, quickly, and safely. - Container engines take care of this so you don't have to. .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- ## Linux Security Modules - The most popular ones are SELinux and AppArmor. - Red Hat distros generally use SELinux. - Debian distros (in particular, Ubuntu) generally use AppArmor. - LSMs add a layer of access control to all process operations. - Container engines take care of this so you don't have to. ??? :EN:Containers internals :EN:- Control groups (cgroups) :EN:- Linux kernel namespaces :FR:Fonctionnement interne des conteneurs :FR:- Les "control groups" (cgroups) :FR:- Les namespaces du noyau Linux .debug[[containers/Namespaces_Cgroups.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Namespaces_Cgroups.md)] --- class: pic .interstitial[![Image separating from the next part](https://gallant-turing-d0d520.netlify.com/containers/aerial-view-of-containers.jpg)] --- name: toc-orchestration-an-overview class: title Orchestration, an overview .nav[ [Previous part](#toc-security-features) | [Back to table of contents](#toc-part-4) | [Next part](#toc-) ] .debug[(automatically generated title slide)] --- # Orchestration, an overview In this chapter, we will: * Explain what orchestration is and why we would need it. * Present (from a high-level perspective) some orchestrators. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## What's orchestration? ![Joana Carneiro (orchestra conductor)](images/conductor.jpg) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## What's orchestration? According to Wikipedia: *Orchestration describes the __automated__ arrangement, coordination, and management of complex computer systems, middleware, and services.* -- *[...] orchestration is often discussed in the context of __service-oriented architecture__, __virtualization__, provisioning, Converged Infrastructure and __dynamic datacenter__ topics.* -- What does that really mean? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances -- - Q: do we always use 100% of our servers? -- - A: obviously not! .center[![Daily variations of traffic](images/traffic-graph.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances - Every night, scale down (by shutting down extraneous replicated instances) - Every morning, scale up (by deploying new copies) - "Pay for what you use" (i.e.
save big $$$ here) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 1: dynamic cloud instances How do we implement this? - Crontab - Autoscaling (save even bigger $$$) That's *relatively* easy. Now, how are things for our IaaS provider? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter - Q: what's the #1 cost in a datacenter? -- - A: electricity! -- - Q: what uses electricity? -- - A: servers, obviously - A: ... and associated cooling -- - Q: do we always use 100% of our servers? -- - A: obviously not! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter - If only we could turn off unused servers during the night... - Problem: we can only turn off a server if it's totally empty! (i.e. all VMs on it are stopped/moved) - Solution: *migrate* VMs and shut down empty servers (e.g. combine two hypervisors with 40% load into 80%+0%, and shut down the one at 0%) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Example 2: dynamic datacenter How do we implement this? - Shut down empty hosts (but keep some spare capacity) - Start hosts again when capacity gets low - Ability to "live migrate" VMs (Xen already did this 10+ years ago) - Rebalance VMs on a regular basis - what if a VM is stopped while we move it? - should we allow provisioning on hosts involved in a migration? *Scheduling* becomes more complex. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## What is scheduling? According to Wikipedia (again): *In computing, scheduling is the method by which threads, processes or data flows are given access to system resources.* The scheduler is concerned mainly with: - throughput (total amount of work done per time unit); - turnaround time (between submission and completion); - response time (between submission and start); - waiting time (between job readiness and execution); - fairness (appropriate times according to priorities). In practice, these goals often conflict. **"Scheduling" = decide which resources to use.** .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 1 - You have: - 5 hypervisors (physical machines) - Each server has: - 16 GB RAM, 8 cores, 1 TB disk - Each week, your team requests: - one VM with X RAM, Y CPU, Z disk Scheduling = deciding which hypervisor to use for each VM. Difficulty: easy! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 2 - You have: - 1000+ hypervisors (and counting!) - Each server has different resources: - 8-500 GB of RAM, 4-64 cores, 1-100 TB disk - Multiple times a day, a different team asks for: - up to 50 VMs with different characteristics Scheduling = deciding which hypervisor to use for each VM. Difficulty: ???
.debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 2 - You have: - 1000+ hypervisors (and counting!) - Each server has different resources: - 8-500 GB of RAM, 4-64 cores, 1-100 TB disk - Multiple times a day, a different team asks for: - up to 50 VMs with different characteristics Scheduling = deciding which hypervisor to use for each VM. ![Troll face](images/trollface.png) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Exercise 3 - You have machines (physical and/or virtual) - You have containers - You are trying to put the containers on the machines - Sounds familiar? .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with one resource .center[![Not-so-good bin packing](images/binpacking-1d-1.gif)] ## We can't fit a job of size 6 :( .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with one resource .center[![Better bin packing](images/binpacking-1d-2.gif)] ## ... Now we can! .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with two resources .center[![2D bin packing](images/binpacking-2d.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Scheduling with three resources .center[![3D bin packing](images/binpacking-3d.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## You need to be good at this .center[![Tangram](images/tangram.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## But also, you must be quick! .center[![Tetris](images/tetris-1.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## And be web scale! .center[![Big tetris](images/tetris-2.gif)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## And think outside (?) of the box! .center[![3D tetris](images/tetris-3.png)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: pic ## Good luck! .center[![FUUUUUU face](images/fu-face.jpg)] .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## TL,DR * Scheduling with multiple resources (dimensions) is hard. * Don't expect to solve the problem with a Tiny Shell Script. * There are literally tons of research papers written on this. 
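For intuition only, here is that Tiny Shell Script: a naive first-fit scheduler for a *single* resource (free RAM, in GB). The node names and sizes are made up for illustration, and it ignores every other dimension (CPU, disk, network, affinity...):

```bash
#!/bin/sh
# Naive first-fit: place each job on the first node with enough free RAM.
free_node1=16 free_node2=8 free_node3=4

place() { # usage: place <ram_gb>
  for node in node1 node2 node3; do
    eval "free=\$free_$node"
    if [ "$free" -ge "$1" ]; then
      eval "free_$node=$((free - $1))"
      echo "job ($1 GB) -> $node"
      return
    fi
  done
  echo "job ($1 GB) -> cannot be placed!"
}

for job in 10 4 8 6; do place "$job"; done
```

Trace it: 10 goes to node1 (6 left), 4 to node1 (2 left), 8 to node2 (0 left), and then the 6 GB job cannot be placed, even though 6 GB are still free in total (2 + 4). This is exactly the fragmentation problem from the bin-packing animations above.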
.debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## But our orchestrator also needs to manage ... * Network connectivity (or filtering) between containers. * Load balancing (external and internal). * Failure recovery (if a node or a whole datacenter fails). * Rolling out new versions of our applications. (Canary deployments, blue/green deployments...) .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Some orchestrators We are going to briefly present a few orchestrators. There is no "absolute best" orchestrator. It depends on: - your applications, - your requirements, - your pre-existing skills... .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Nomad - Open Source project by HashiCorp. - Arbitrary scheduler (not just for containers). - Great if you want to schedule mixed workloads. (VMs, containers, processes...) - Less integration with the rest of the container ecosystem. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Mesos - Open Source project in the Apache Foundation. - Arbitrary scheduler (not just for containers). - Two-level scheduler. - Top-level scheduler acts as a resource broker. - Second-level schedulers (aka "frameworks") obtain resources from the top-level one. - Frameworks implement various strategies. (Marathon = long-running processes; Chronos = run at intervals; ...) - Commercial offering through DC/OS by Mesosphere. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Rancher - Rancher 1 offered a simple interface for Docker hosts. - Rancher 2 is a complete management platform for Docker and Kubernetes. - Technically not an orchestrator, but it's a popular option. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Swarm - Tightly integrated with the Docker Engine. - Extremely simple to deploy and set up, even in multi-manager (HA) mode. - Secure by default. - Strongly opinionated: - smaller set of features, - easier to operate. .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- ## Kubernetes - Open Source project initiated by Google. - Contributions from many other actors. - *De facto* standard for container orchestration. - Many deployment options; some of them very complex. - Reputation: steep learning curve. - Reality: - true, if we try to understand *everything*; - false, if we focus on what matters. ??? :EN:- Orchestration overview :FR:- Survol de techniques d'orchestration .debug[[containers/Orchestration_Overview.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/containers/Orchestration_Overview.md)] --- class: title, self-paced Thank you! .debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/shared/thankyou.md)] --- class: title, in-person That's all, folks! Questions?
![end](images/end.jpg) .debug[[shared/thankyou.md](https://github.com/jpetazzo/container.training/tree/2022-02-enix/slides/shared/thankyou.md)]