Out-of-memory (OOM) in Kubernetes – Part 1: Intro and topics discussed

Intro

You’ve seen it all before: that important application service running inside Kubernetes goes down at the most inconvenient time. There’s a tug of war between the DevOps team, which maintains that Kubernetes did nothing wrong and the service simply exited with an error message (“can’t they just fix their code?”), and the Development team, which maintains that all the logs show the application was alive and well, so Kubernetes must have done something that took the service down (“can’t they just make that Kubernetes thing stable?”).

Sometimes a tell-tale “OOMKilled” status is observed for one of the containers; other times some Kubernetes pods are seen in an “Evicted” state. There are times when there’s some strange 139 exit code, which gets mapped to something called “SIGSEGV”, but that doesn’t solve the mystery either. The investigation continues, and nodes are found that show weird numbers like 107% memory usage at times. Is that number normal? Is there some glitch in how Kubernetes reports memory on that very node? You start suspecting there might be a memory issue, and so you turn to the Kubernetes logs to get more details. But where are those to begin with? This continues, with more questions coming up as you go along and hours spent trying to figure out what went wrong. And at the end of the day there’s always management wanting an answer to “how can we prevent this from happening in the future?”…
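A side note on that 139 exit code: by the usual shell and container runtime convention, an exit code above 128 means the process was terminated by a signal, and the signal number is the exit code minus 128. The short Python sketch below decodes such codes. The describe_exit_code helper is made up for illustration and isn't part of any Kubernetes tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a container exit code into a readable description.

    Assumes the common convention that a code above 128 means
    "terminated by signal number (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform
            return f"killed by unknown signal {signum}"
    if code == 0:
        return "exited normally"
    return f"exited with error {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 139):
        print(code, "->", describe_exit_code(code))
```

On Linux, 139 decodes to SIGSEGV (signal 11), while 137 decodes to SIGKILL (signal 9), which is the exit code typically seen for containers that were OOM-killed.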

This 4-part article will look in detail at what happens when Kubernetes runs into out-of-memory (OOM) situations and how it responds to them. The discussion will include the concepts involved, details around how the various mechanisms work, the metrics that describe memory and the tools used to gather them, diagrams to show how things come together, and in-depth analysis of some of the situations that can be encountered.

Why would you want to read this article? You’re using Kubernetes,…

  • …and you saw evicted pods in the past, but don’t really know what caused them, since you haven’t had the time to investigate in detail yet
  • …but the figures for memory usage (involving pods, nodes, etc) don’t match up – e.g. the sum of the pods’ memory usage is not equal to the amount reported by the node
  • …but the metrics you’re seeing on some dashboards aren’t quite clear (e.g. what’s WSS?)
  • …and you want to understand how to troubleshoot out-of-memory issues. This would not only speed up finding the cause of the problem but also prevent finger-pointing between Development and DevOps teams
  • …and want to find out what happens when a Kubernetes node is out of memory, but (obviously) don’t want to do this on your production infrastructure
  • …and you’d like to simulate applications using memory on the nodes in a controlled manner
  • …and would like to see what the point is of setting memory limits for your containers
  • …and would like to understand why some of the mechanisms involving memory work the way they do

Specifically related to the last reason, note that there are several “rabbit holes” throughout the article, where seemingly simple concepts are discussed at length, sometimes going deep inside the Kubernetes code base. Feel free to skim around, and refer to the Table of contents to get to what you’re interested in.

If you’re not sure where to start, or just want to skip the theoretical stuff, head over to the Flows leading to out-of-memory situations section.

I tried to get a clear understanding of the concepts presented in this article in various ways: reading documentation, going through lots of GitHub issues, and testing extensively. When I felt it wasn’t enough, I went through the actual Kubernetes source code (using my limited Go knowledge). When the source code proved too complicated for me, I reached out to some of the folks that are working on the Kubernetes code base, and they were extremely kind to answer my questions. As such, most of what’s presented should be backed up by something. But I do admit there might be things I got wrong, either because I didn’t look closely enough, or the outcome of my tests led me to wrong conclusions, or simply because my thinking failed me. If that’s the case, and you do spot an error, I’d be grateful if you leave a comment or contact me.

Out of scope / Assumptions

A couple of things to set from the start in regards to expectations:

Kubernetes Test Cluster

To test the various scenarios and situations I’m using a small 2-node Azure Kubernetes Service (AKS) cluster throughout the article. This was created from the Azure portal with version 1.21.2, and the default Linux node pool consists of 2 Azure DS2_v2 nodes. Each node has 2 vCPUs and 7 GB of RAM.

The container runtime used is containerd – as this is what’s deployed out of the box ever since AKS version 1.19.

Prometheus, Grafana, and a set of basic dashboards are deployed inside the Kubernetes cluster using the community kube-prometheus-stack Helm chart. Instructions on how to deploy the Helm chart can be found in this article: https://www.infracloud.io/blogs/prometheus-operator-helm-guide/. To access Grafana and get the required screenshots, I just ran kubectl port-forward svc/prometheus-grafana 3000:80 and connected to http://localhost:3000 using the default credentials of admin / prom-operator.

Memory leak tool

One of the goals of this article is to analyze out-of-memory situations and how Kubernetes responds to them. Yet just filling the node’s memory isn’t the best way to do that. The mechanism that Kubernetes uses to respond to various stages of memory pressure is a little more nuanced. For example, gradually allocating a large amount of memory inside a pod will have different outcomes depending on the speed it’s done at. For this reason, a tool is required to allocate memory in a controlled manner.
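To give a rough idea of what “allocating memory in a controlled manner” means, below is a minimal Python sketch. It is not the actual memory leak tool (which is a .NET application), just an illustration under my own assumptions: it allocates memory in fixed-size blocks, writes to every page so the memory is really used rather than just reserved, and pauses between blocks to control the allocation speed. The leak_memory function and its parameters are made up for this example.

```python
import mmap
import time

MIB = 1024 * 1024
PAGE = mmap.PAGESIZE  # typically 4096 bytes on x86-64 Linux


def leak_memory(total_mib: int, block_mib: int, pause_s: float) -> list:
    """Allocate total_mib in blocks of at most block_mib, pausing between blocks.

    Returns the list of blocks. The caller has to keep this list alive,
    otherwise the memory gets freed once the list is garbage collected.
    """
    blocks = []
    allocated = 0
    while allocated < total_mib:
        size = min(block_mib, total_mib - allocated)
        block = bytearray(size * MIB)
        # Write one byte in every page, so the OS has to back the whole
        # block with real memory instead of leaving it merely reserved.
        for offset in range(0, len(block), PAGE):
            block[offset] = 1
        blocks.append(block)
        allocated += size
        if allocated < total_mib:
            time.sleep(pause_s)  # this pause controls the allocation speed
    return blocks


if __name__ == "__main__":
    held = leak_memory(total_mib=32, block_mib=8, pause_s=0.1)
    total = sum(len(b) for b in held) // MIB
    print(f"holding {total} MiB in {len(held)} blocks")
```

Lowering pause_s or raising block_mib makes the same total amount get allocated faster, which is the knob that matters when looking at how quickly memory pressure builds up.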

Such a tool has already been discussed in a previous post, and it will be perfect for the various scenarios we’ll have to tackle. So if you see references to the “memory leak tool” anywhere in this article, then this is what it’s about – as it essentially allows us to leak memory and control the process.

As the tool is a .NET application, it will use some memory for itself: around 30 MiB will be allocated right at the start when running on Linux (it’s less on Windows). Each allocated block of memory will incur a small amount of additional memory overhead, but as we’ll see throughout this article, that will be negligible for our use cases.

We’ll mostly use the memory leak tool inside a container, as it’s Kubernetes we’ll be focusing on. But there will be limited instances when it will be run as a regular process.

Once the tool completes allocating the specified amount of memory – provided it’s not run to exhaust all memory – it waits for Ctrl+C to be pressed. A reference to all the allocated memory is kept, so as to prevent the .NET runtime from garbage collecting it. Note that this doesn’t mean that the memory that has been touched can’t be swapped to disk should the OS decide to do so, but since Kubernetes is not currently using swap (the feature went alpha in 1.22) we won’t run into that.

But aren’t there already tools that can allocate memory? Why build one myself? The Linux stress tool is very nice, and you can actually see it referenced in the official Kubernetes docs under “Exceed a Container’s memory limit”. But I couldn’t find a way to use a lower touch rate for the allocated memory (for instance by setting --vm-stride way above 4096, so that some memory pages are left untouched) – maybe I just missed something. There is a very advanced stress-ng tool with lots of parameters, but at first glance this one doesn’t appear to support incremental allocations. Either way, writing my own tool on top of a runtime – in this case .NET – did allow insight into Runtime implications around OOM.
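To make the idea of a “touch rate” a bit more concrete, here is a small Python sketch. It is only an illustration of the concept, not how stress or the memory leak tool implement it: the touch helper is hypothetical, and it assumes that writing a byte is what counts as touching a page. With a stride equal to the page size every page gets written to, while a larger stride skips pages entirely.

```python
import mmap

PAGE = mmap.PAGESIZE  # typically 4096 bytes on x86-64 Linux


def touch(block: bytearray, stride: int) -> int:
    """Write one byte every `stride` bytes and return how many
    distinct pages of the block were written to."""
    touched_pages = set()
    for offset in range(0, len(block), stride):
        block[offset] = 1
        touched_pages.add(offset // PAGE)
    return len(touched_pages)


if __name__ == "__main__":
    total_pages = 256
    for stride in (PAGE, 4 * PAGE):
        block = bytearray(total_pages * PAGE)
        count = touch(block, stride)
        print(f"stride={stride}: touched {count} of {total_pages} pages")
```

Whether the skipped pages actually stay out of physical memory depends on the allocator and the OS, so the page count returned here should be read as “pages written to by this code”, not as a measurement of resident memory.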

Table of contents

The numbers in brackets below represent how many paragraphs a particular subsection contains.
