On Pausing Containers - how we built, and why we deprecated, our container pause feature
Did you know that you can 'freeze' running containers? We found out about this feature while developing a particularly handy capability for mirrord. mirrord lets developers run local code in the context of a remote Kubernetes pod. This lets them test their code in cloud conditions without going through CI and deployment. In most cases, stealing or mirroring incoming traffic from the remote pod to the local process is good enough. However, when an application consumes tasks and does more than react to incoming requests, the user might want to pause the remote application completely, so that it doesn't compete with their local process for tasks.
How Do We Pause?
We found out that there is a Docker API for pausing a container, which can be used to suspend all processes in a container. Luckily, we already work with the CRI (Container Runtime Interface) API to collect information about the container, so using it for pausing was easy. The other main container runtimes, cri-o and containerd, expose equivalent functionality as well. So, that worked pretty well: we released an initial version of the pause feature and marked it unstable. Like any software, the naive idea works until you meet the edge cases as you try to stabilize it.
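For illustration, here's a minimal sketch of what pausing through the Docker Engine API can look like from Rust, using the bollard crate (the crate and the container name are our choices for this example; mirrord itself goes through the container runtimes' own APIs):

```rust
use bollard::Docker;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the local Docker daemon (e.g. via /var/run/docker.sock).
    let docker = Docker::connect_with_local_defaults()?;

    // Suspend all processes in the container...
    docker.pause_container("my-container").await?;

    // ...and later resume them.
    docker.unpause_container("my-container").await?;

    Ok(())
}
```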
Here Come the Roadblocks
The main use case for the pause feature was reading from queues: we wanted to pause the remote container so it doesn't read messages from the remote queue, and our local process can read them instead. The first issue we encountered when working with queues is that it takes time for the pause to take effect. The container is frozen immediately, but it takes time for the queue to notice that a consumer has disconnected and then re-allocate its messages to other consumers (the user's local app). This results in a wait time of about five minutes just to get started, which isn't fun.
The second issue we ran into was that if there were live incoming connections, those would also take time to drop, or sometimes never drop at all. We mostly saw this in gRPC services. The common case was users trying to use mirrord with pause on a service in their cluster that received gRPC connections from another microservice. The calling microservice would stay connected to the paused service rather than dropping the connection and connecting to our local process instead.
We tried finding ways to flush all connections when using pause. It seems that the Linux kernel really likes holding on to those connections: even when we flushed them with conntrack, it would just restore the entries on its own. The only solution we saw was a kernel module to kill these connections, but that would have been violent and ugly, plus we didn't want to maintain a kernel module.
The last and biggest issue we encountered was when we tried to migrate mirrord to using Ephemeral Containers for the mirrord agent. The mirrord agent is a component that runs in the cluster and helps mirrord proxy data back and forth between the local process and the remote pod. By default we use a Job to run the agent, but Ephemeral Containers, when available, are a better functional fit. However, using them means we lose access to the container runtime API, so we couldn't use it to pause the container. We did not give up easily though, and we soon realized we could implement the pause functionality ourselves, without using the container API.
What does the API do? Simple! Containers are abstractions over Linux namespaces. There's a namespace type called cgroup, which is responsible for controlling the compute and memory of processes running in that namespace. Just as each container usually belongs to a network namespace, a process namespace, a file system namespace, and other types, it also belongs to a cgroup namespace that the container runtime uses to limit it (see the Kubernetes limits feature, for example). cgroups also expose a functionality called 'freezing', which suspends all processes in the cgroup, effectively giving them zero CPU time. That's how the CRI freezes the container. We implemented pause using that mechanism, then realized there are two versions of cgroup, V1 and V2, and we'd need to support both. In order to do it, the pod running the mirrord agent needed to be privileged as well - but okay, a small price to pay.
cgroup V1 flow:
1. Open the cgroup of our target pod (since we're in ephemeral mode, it is `/proc/1/cgroup`).
2. Find the freezer cgroup entry (cgroup_type == freezer).
3. Return the freezer cgroup entry's path (it's a directory).
4. The path we found is something like `/kubepods/kubepods/besteffort/podb21003be-a5c1-424e-ab07-d6814f1464d0/e84d894e48a4191bc79bd27a172a3655e1b41c566be9527821639f1ad0e29f26`, which isn't accessible to us at this point.
5. We mount the cgroup file system under `/mirrord_cgroup` (a directory we create):

   ```rust
   use nix::mount::{mount, MsFlags};

   // Mount the freezer hierarchy of cgroup V1 under our own mount point,
   // so that the freezer path we resolved in step 4 becomes reachable.
   mount(
       Some("cgroup"),
       cgroup_path, // "/mirrord_cgroup"
       Some("cgroup"),
       MsFlags::MS_NOSUID | MsFlags::MS_NOEXEC | MsFlags::MS_NODEV,
       Some("freezer"),
   )?;
   ```

6. Now we can access the freezer cgroup using the path from step 4.
7. We write `FROZEN` to the `freezer.state` file under that path.
8. When we unpause, we just write `THAWED` to the same file.
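To make the freeze and thaw steps concrete, here's a minimal sketch of steps 7-8 (the helper name and the example path are ours, for illustration; the real agent code does more error handling):

```rust
use std::fs;

/// Freeze or thaw every process in the target's freezer cgroup by
/// writing to its `freezer.state` file (cgroup V1 only).
fn set_frozen(freezer_cgroup: &str, frozen: bool) -> std::io::Result<()> {
    // e.g. "/mirrord_cgroup/kubepods/besteffort/pod.../e84d894e.../freezer.state"
    let state_file = format!("{freezer_cgroup}/freezer.state");
    fs::write(&state_file, if frozen { "FROZEN" } else { "THAWED" })
}
```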
cgroup V2 flow:
- The implementation was initially similar, but then we found out about a cgroup V2 flag called `nsdelegate`, which means that in order to move a process between cgroups you also need permission to move it back: if you move a process from cgroup A to cgroup B, you need permission to move it back to cgroup A, which we didn't have because it wasn't available to our context.
- Instead, we mounted the current cgroup in a new directory, then created a new sub-cgroup under it. This made our sub-cgroup inherit from the original cgroup, apart from the changes we made (freezing).
- We move all the pids from the original cgroup to the sub-cgroup we created, by reading the original's cgroup.procs file and writing each pid into the sub-cgroup's.
- Then, we can freeze the sub-cgroup (see the sketch below).
- When unpausing, we unfreeze the sub-cgroup and move the pids back to the original cgroup.
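Here's a rough sketch of the pause half of the V2 flow (the mount point and sub-cgroup name are hypothetical; cgroup V2 exposes freezing through the `cgroup.freeze` file rather than V1's `freezer.state`):

```rust
use std::fs;

/// Move every process from the original cgroup into a sub-cgroup we own,
/// then freeze that sub-cgroup (cgroup V2).
fn pause_via_sub_cgroup(mounted_cgroup: &str) -> std::io::Result<()> {
    // Create the sub-cgroup; it inherits everything from its parent.
    let sub_cgroup = format!("{mounted_cgroup}/mirrord_freeze");
    fs::create_dir_all(&sub_cgroup)?;

    // Each line of cgroup.procs is a pid in the original cgroup; writing
    // a pid into the sub-cgroup's cgroup.procs migrates that process.
    let pids = fs::read_to_string(format!("{mounted_cgroup}/cgroup.procs"))?;
    for pid in pids.lines() {
        fs::write(format!("{sub_cgroup}/cgroup.procs"), pid)?;
    }

    // Writing "1" to cgroup.freeze suspends every process in the
    // sub-cgroup; writing "0" later thaws them again.
    fs::write(format!("{sub_cgroup}/cgroup.freeze"), "1")
}
```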
systemd doesn't like it
When running on cgroup V2 with some runtimes, we found out that systemd didn't like us messing with cgroups. The reason is that systemd is actually the system component responsible for creating and managing the cgroups, and what we were doing goes against both of the 'Two Key Design Rules' mentioned here. When we created our cgroup, systemd took over it, and we ended up with some sort of race condition leading to flaky behavior.
Getting the Hint
After spending around two months investigating, researching, and chasing bugs to make it work in a stable manner, we finally decided that it might be better to drop the pause API approach and find a better way to accomplish this use case. Not only was implementing the pause feature in an ephemeral container hard, but even when it did work, it still didn't provide a perfect user experience (see the timeout and queue issues above).
Going in a Different Direction
We had an issue in our backlog to implement the 'copy pod' feature, which creates a new replica of an existing pod. On top of that, we added a feature called scaledown. When targeting a deployment with this feature, mirrord creates a new exclusive pod for the user and then scales down the other pods of the same deployment.
The way it works:
The mirrord Operator creates a new pod that is part of the deployment, copying the configmaps, mounts, sidecars, and anything else that can be useful.
The Operator scales down the deployment, shutting down the rest of its pods for the duration of the mirrord session.
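The scaledown step itself boils down to a standard Kubernetes scale operation. A minimal sketch of what it might look like with the kube crate (the deployment name, namespace, and error handling are illustrative, not the Operator's actual code):

```rust
use k8s_openapi::api::apps::v1::Deployment;
use kube::{
    api::{Api, Patch, PatchParams},
    Client,
};
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let deployments: Api<Deployment> = Api::namespaced(client, "default");

    // Scale the original deployment down to zero replicas for the
    // duration of the session, then restore the replica count when
    // the session ends.
    deployments
        .patch_scale(
            "my-deployment",
            &PatchParams::default(),
            &Patch::Merge(json!({ "spec": { "replicas": 0 } })),
        )
        .await?;

    Ok(())
}
```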
Given the delicate state it can leave the cluster in, we decided to make this feature part of mirrord for Teams. The general mandate is that anything that is stateless and won't break the cluster can be in the OSS, while anything that manages state should be done by the Operator.
Pros and Cons of the New Solution
Pros:
- Immediate effect both for queues and incoming traffic.
- It uses a stable, high-level k8s API.
Cons:
- We don't use the original base image for the pod replica, so if there are files in the image base, those aren't available to the local process.
- Following the previous point, we don't run the original Docker entrypoint, which might be an issue if it contains meaningful logic (for example, we saw users who copy configuration files from a configmap to other paths on start). Note that both issues have workarounds, but they're still a bit of a hassle.
Final Notes
It's no fun removing a feature you've worked hard on, but it's even worse to leave broken functionality in your product, and so we'll be deprecating the pause feature in the coming months. Hopefully, at least the lessons we learned while working on it and documented in this blog post will be useful or interesting for some of you. And if you want to try out the copy/scaledown functionality, check out mirrord for Teams!