Intro
You’ve seen it all before: that important application service running inside Kubernetes goes down at the most inconvenient time. There’s a tug of war between the DevOps team – which maintains that Kubernetes did nothing wrong and the service simply exited with an error message (“can’t they just fix their code?”) – and the Development team – which maintains that all the logs show the application was alive and well, and there must be something Kubernetes did that resulted in the service being taken down (“can’t they just make that Kubernetes thing stable?”). Sometimes an “OOMKilled” tell-tale status is observed for one of the containers, other times some Kubernetes pods are seen in an “Evicted” state. There are times when there’s some strange 139 exit code, which gets mapped to something called “SIGSEGV”, but that doesn’t solve the mystery either. The investigation continues, and nodes are found that at times show weird numbers like 107% memory usage. Is that number normal? Is there some glitch in how Kubernetes reports memory on that very node? You start suspecting there might be a memory issue, and so you turn to the Kubernetes logs to get more details. But where are those to begin with? This continues, with more questions coming up as you go along and hours spent trying to figure out what went wrong. And at the end of the day there’s always management wanting an answer to “how can we prevent this from happening in the future?”…
This 4-part article will look in detail at what happens when Kubernetes runs into out-of-memory (OOM) situations and how it responds to them. The discussion will include the concepts involved, details around how the various mechanisms work, the metrics that describe memory and the tools used to gather them, diagrams to show how things come together, and in-depth analysis of some of the situations that can be encountered.
Why would you want to read this article? You’re using Kubernetes,…
- …you saw evicted pods in the past, but don’t really know what caused them since you haven’t had the time to investigate in detail yet
- …but the figures for memory usage (involving pods, nodes, etc) don’t match up – e.g. the sum of the pods’ memory usage is not equal to the amount reported by the node
- …but the metrics you’re seeing on some dashboards aren’t quite clear (e.g. what’s WSS?)
- …and you want to understand how to troubleshoot out-of-memory issues. This would not only speed up finding the cause of the problem but also prevent finger-pointing between the Development and DevOps teams
- …and want to find out what happens when a Kubernetes node is out of memory, but (obviously) don’t want to do this on your production infrastructure
- …and you’d like to simulate applications using memory on the nodes in a controlled manner
- …and would like to see the point of setting memory limits for your containers
- …and would like to understand why some of the mechanisms involving memory work the way they do
Specifically related to the last reason, note that there are several “rabbit holes” throughout the article, where seemingly simple concepts are discussed at length, sometimes going deep inside the Kubernetes code base. Feel free to skim around, and refer to the Table of contents to get to what you’re interested in.
If you’re not sure where to start, or just want to skip the theoretical stuff, head over to the Flows leading to out-of-memory situations section.
I tried to get a clear understanding of the concepts presented in this article in various ways: reading documentation, going through lots of GitHub issues, and testing extensively. When that wasn’t enough, I went through the actual Kubernetes source code (using my limited Go knowledge). When the source code proved too complicated for me, I reached out to some of the folks working on the Kubernetes code base, and they were extremely kind to answer my questions. As such, most of what’s presented should be backed up by something. But I do admit there might be things I got wrong, either because I didn’t look closely enough, or the outcome of my tests led me to wrong conclusions, or simply because my thinking failed me. If that’s the case, and you do spot an error, I’d be grateful if you’d leave a comment or contact me.
Out of scope / Assumptions
A couple of things to set straight from the start regarding expectations:
- At the time of this writing (Feb 2022) Kubernetes 1.23 is the latest official release. The Kubernetes Test Cluster used throughout the article is 1.21. Depending on when you read this, parts of the article might no longer apply, based on the changes subsequent versions bring.
- The article discusses Kubernetes Linux nodes exclusively; neither Windows nor any other OS is considered.
- There isn’t any information in this article about cost computations or savings from efficiently using memory on the Kubernetes nodes. The article is about what happens, and why, in regard to Kubernetes and its use of memory; even though one can indirectly deduce some efficient approaches involving the nodes’ memory, this isn’t addressed.
- Pod evictions due to reasons other than memory thresholds – such as those happening when draining a node – are not in the scope of this article. Neither are soft eviction thresholds, mainly because AKS doesn’t use them by default (and AKS is used for testing in this article)
- Hugepages aren’t discussed (for a description of them see https://docs.openshift.com/container-platform/4.8/scalability_and_performance/what-huge-pages-do-and-how-they-are-consumed-by-apps.html)
- The --kernel-memcg-notification flag described here https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#kubelet-may-not-observe-memory-pressure-right-away isn’t discussed either. This changes how often eviction thresholds are checked by the Kubelet, and as such some parts of the Pod evictions section would not apply as-is. I chose to leave it out as AKS doesn’t use it currently (as of Feb 2022 it’s not part of the parameters that can be used https://docs.microsoft.com/en-us/azure/aks/custom-node-configuration#kubelet-custom-configuration) and there are also issues with it in cgroups v2 (see https://github.com/kubernetes/kubernetes/issues/106331 and the associated PR)
- cgroups v2 aren’t considered in this article. An explanation about this is in the cgroups v2 section
Kubernetes Test Cluster
To test the various scenarios and situations I’m using a small 2-node Azure Kubernetes Service (AKS) cluster throughout the article. This was created from the Azure portal with version 1.21.2, and the default Linux node pool consists of 2 Azure DS2_v2 nodes. Each node has 2 vCPUs and 7 GB of RAM.
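If you’d rather create a similar cluster from the command line instead of the portal, something along the lines of the Azure CLI sketch below should do it. The resource group and cluster names are placeholders I made up; the VM size and Kubernetes version match what’s described above:

```
# roughly equivalent Azure CLI commands (resource group / cluster names are placeholders)
az group create --name k8s-oom-rg --location westeurope
az aks create \
  --resource-group k8s-oom-rg \
  --name k8s-oom-aks \
  --kubernetes-version 1.21.2 \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2
```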
The container runtime used is containerd – as this is what’s deployed out of the box ever since AKS version 1.19.
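If you want to double-check the runtime on your own cluster, the wide output of kubectl get nodes includes a CONTAINER-RUNTIME column:

```
# the CONTAINER-RUNTIME column shows the runtime and version for each node,
# e.g. containerd://... on AKS 1.19 and later
kubectl get nodes -o wide
```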
Prometheus, Grafana, and a set of basic dashboards are deployed inside the Kubernetes cluster using the community kube-prometheus-stack Helm chart. Instructions on how to deploy the Helm chart can be found in this article https://www.infracloud.io/blogs/prometheus-operator-helm-guide/. To access Grafana – and get the required screenshots – I just ran kubectl port-forward svc/prometheus-grafana 3000:80 and connected to http://localhost:3000 using the default credentials of admin / prom-operator.
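For completeness, the commands below are a minimal sketch of that deployment; refer to the linked article for the full walkthrough. The release name (“prometheus”) is an assumption on my part, chosen so that the resulting Grafana service is named prometheus-grafana, matching the port-forward command above:

```
# add the community Helm repo and install the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

# expose Grafana locally (default credentials: admin / prom-operator)
kubectl port-forward svc/prometheus-grafana 3000:80
```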
Memory leak tool
One of the goals of this article is to analyze out-of-memory situations and how Kubernetes responds to them. Yet just filling the node’s memory isn’t the best way to do that. The mechanism that Kubernetes uses to respond to various stages of memory pressure is a little more nuanced. For example, gradually allocating a large amount of memory inside a pod will have different outcomes depending on the speed it’s done at. For this reason, a tool is required to allocate memory in a controlled manner.
Such a tool has already been discussed in a previous post, and it will be perfect for the various scenarios we’ll have to tackle. So if you see references to the “memory leak tool” anywhere in this article, this is the tool in question – it essentially allows us to leak memory and control the process.
As the tool is a .NET application, it will use some memory for itself: around 30 MiB is allocated right at the start when running on Linux (less on Windows). Each allocated block of memory incurs a small amount of additional overhead, but as we’ll see throughout this article, that will be negligible for our use cases.
We’ll mostly use the memory leak tool inside a container, as it’s Kubernetes we’ll be focusing on. But there will be limited instances when it will be run as a regular process.
Once the tool completes allocating the specified amount of memory – provided it’s not run to exhaust all memory – it waits for Ctrl+C to be pressed. A reference to all the allocated memory is kept, so as to prevent the .NET runtime from garbage collecting it. Note that this doesn’t mean the memory that has been touched can’t be swapped to disk should the OS decide so, but since Kubernetes is not currently using swap (the feature went alpha in 1.22) we won’t run into that.
But aren’t there already tools that can allocate memory? Why build one myself? The Linux stress tool is very nice – and you can actually see it referenced in the Kubernetes official docs here: Exceed a Container’s memory limit. But I couldn’t find a way to use a lower touch rate for the allocated memory (setting --vm-stride way above 4096 leaves some memory pages untouched) – maybe I just missed something. There is the very advanced stress-ng tool with lots of parameters, but it doesn’t appear to support incremental allocations at first glance. Either way, writing my own tool on top of a runtime – in this case .NET – did allow insight into Runtime implications around OOM.
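For reference, the sketch below shows the sort of stress invocations I experimented with; the values are just examples, and the exact behavior depends on the stress version installed:

```
# one worker allocates 256 MiB and touches one byte in every 4 KiB page (the default --vm-stride)
stress --vm 1 --vm-bytes 256M --vm-hang 0

# a larger --vm-stride touches memory more sparsely (one byte every 64 KiB here), leaving
# pages untouched, but it doesn't give the gradual, controlled allocation we're after
stress --vm 1 --vm-bytes 256M --vm-stride 65536 --vm-hang 0
```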
Table of contents
The numbers in brackets below represent how many paragraphs a particular subsection contains.
- Part 1: Intro and topics discussed
- Intro (6p)
- Out of scope / Assumptions (6p)
- Kubernetes Test Cluster (3p)
- Memory leak tool
- reason for using one (1p)
- description of the tool that will be used (4p)
- other tools (1p)
- Table of contents
- Part 2: The OOM killer and application runtime implications
- Overcommit
- what it is
- an example of the mechanism (2p)
- analogies (1p)
- a test with overcommit always on (1p)
- describe overcommit modes and parallel to Windows (2p)
- test the “don’t overcommit” mode (3p)
- the “don’t overcommit” mode can actually be configured to overcommit as well (1p)
- summary (1p)
- reasons for overcommit (1p)
- philosophical “is it good” question (1p)
- OOM killer
- what it is and when it’s invoked (1p)
- a test with it in action (3p)
- discuss if and why the OOM killer wouldn’t be invoked for a previously analyzed situation (2p)
- decouple overcommit from the OOM killer and describe how it can be triggered even in that state (2p)
- disable the OOM killer and discussion around it (4p)
- system behavior when it cannot allocate memory but OOM killer isn’t triggered (3p)
- cgroups
- reference to other intro articles (1p)
- what are cgroups and why are they important (2p)
- the aspect of enforcing memory limits, but not visibility (1p)
- query the root cgroup to get node stats (4p)
- cgroups and Kubernetes
- mapping of containers to cgroups in Kubernetes (1p)
- pause container (1p)
- how to obtain the path of a container inside the root cgroup (1p)
- cgroups v2
- state of cgroups v2 in Kubernetes (1p)
- how to check if a host is using it (1p)
- the fact that it’s currently not in use by cloud providers (1p)
- why the current article talks only about cgroups v1 (2p)
- Cgroups and the OOM killer
- how the OOM killer applies to cgroups and when it’s triggered (3p)
- OOM Scenario #1: Container is OOMKilled when it exceeds its limit (3p)
- Unexpected twists
- 2 tests invoking multiple processes consuming memory inside the same cgroup (7p)
- conclusion around the unpredictability of the OOM killer in certain cases (1p)
- explanation of the OOM killer output in regards to tasks displayed (1p)
- Kubernetes and the OOM killer
- Kubernetes doesn’t control the OOM killer (1p)
- implications of OOM killer terminating containers’ processes (3p)
- reconciliation between Kubernetes and the OOM killer (1p)
- OOMKilled containers are restarted endlessly (1p)
- Kubernetes not using cgroups soft limits (1p)
- Runtime implications around OOM
- intro context (1p)
- types of languages from a runtime perspective (2p)
- an additional limiting filter as the reason for failed allocations (1p)
- default runtime limits as an additional reason to look further into this topic (1p)
- .NET
- description of the heap (1p)
- one setting controlling max heap and why it’s important (1p)
- test the default value (3p)
- overriding the default value (2p)
- difference between allocating and touching memory, why it matters and an example (2p)
- test when no limit is set for the container (3p)
- conclusion (1p)
- difference for the Go runtime (1p)
- Kubernetes resource requests and limits
- reason for having this section (1p)
- what they represent (1p)
- useful articles for the topic of requests and limits (1p)
- QoS classes for pods
- Q&A
- Q&A: Overcommit (4p)
- Q&A: OOM killer (4p)
- Q&A: cgroups (2p)
- Q&A: cgroups and the OOM killer (1p)
- Part 3: Memory metrics sources and tools to collect them
- Metrics components
- a brief description about what they are (2p)
- what’s not in scope (1p)
- why we’re discussing them (1p): depending on which tool we use to get metrics, there are specific components involved that return different metrics and have their own “rates” of producing the metric values
- Components (2p)
- Endpoints
- intro (2p)
- Metrics components diagram (7p)
- Metrics collection rate (2p)
- Tools for viewing metrics
- intro (1p)
- Grafana
- default charts that involve memory (2p)
- metrics used inside the default charts that show memory data (4p)
- Prometheus
- what does the tool do (1p)
- reasons for using Prometheus in the article (2p)
- define jobs and targets, show a sample and explain how to retrieve the full list (3p)
- targets as input to the metric components diagram (1p)
- the reason why the Kubelet Summary API endpoint and Resource endpoint aren’t scraped (1p)
- how to obtain the list of metrics returned and some sample queries (3p)
- timings (1p)
- metric lifecycle and labels applied (1p)
- kubectl get --raw
- how can it be used and 2 examples (2p)
- browser
- how can it be used and 2 examples (2p)
- kubectl top pod / top node (1p)
- kubectl describe node (1p)
- htop (1p)
- Metrics values
- why do we want to track down the source of the metrics (4p)
- cAdvisor endpoint
- cAdvisor endpoint source code reference (1p)
- cAdvisor metrics table (3p)
- explanation of the most important metrics (2p)
- Summary API endpoint
- source code reference (2p)
- name of the metrics (1p)
- Resource Metrics endpoint
- metrics it returns (1p)
- source code reference (1p)
- Resource Metrics API endpoint metrics table (1p)
- Prometheus node exporter (4p)
- Prometheus (2p)
- Grafana (2p)
- kubectl top pod (3p)
- htop (4p)
- Memory leak tool (5p)
- Adventures in code
- the purpose of this section (1p)
- Running unit tests and compiling Go code (3p)
- How does the Summary API endpoint get its metrics? (7p)
- How can I see that the Resource Metrics endpoint gets its data from the Summary API? (11p)
- How does cAdvisor get its memory metric data? (12p)
- How does cAdvisor publish the internal metrics it collects as Prometheus metrics? (4p)
- How come cAdvisor’s own endpoint doesn’t return any node data, but the Resource Metrics endpoint (that queries cAdvisor in turn) does? (4p)
- Are container runtime stats obtained via CRI overwritten by cAdvisor data for the Kubelet Summary API endpoint? (1p)
- Where can I see that the Metrics Server talks to the /metrics/resource endpoint on the Kubelet to retrieve memory stats? (3p)
- What decides the names of the metrics that the Summary API endpoint is emitting, considering that its data comes from cAdvisor and/or the container runtime? (2p)
- What is the memory metric that the Kubelet is using when making eviction decisions? (10p)
- Q&A
- Q&A: Metrics components
- Part 4: Pod evictions, OOM scenarios and flows leading to them
- Pod evictions
- what they are (1p)
- parallel with OOM killer for cgroups and integration (2p)
- ambiguity of “low memory” as a criterion for the Kubelet evicting pods (1p)
- definition of “allocatable” along with output showing the value in kubectl describe node (2p)
- movie showing a pod eviction caused by the pod’s own fault (5p)
- Allocatable
- reservation values and formula (3p)
- Kubelet command line and parameters tied to evictions (2p)
- --kube-reserved
- what does the flag control (3p)
- implication of kube-reserved value on the kubepods cgroup limit (4p)
- how does the flag value get enforced at the infrastructure level? (1p)
- Eviction mechanism at a glance
- summary of the mechanism (1p)
- interval when thresholds are checked by the Kubelet (1p)
- effects of pod evictions on the underlying node (2p)
- Node Allocatable, illustrated
- diagram and accompanying text (3p)
- eviction mechanism revisited on the diagram (2p)
- Metric watched for during pod evictions (5p)
- OOM Scenario #2: Pods’ memory usage exceeds node’s “allocatable” value (16p)
- --eviction-hard
- the asymmetric problem of --kube-reserved and how the OS can encroach on “allocatable” (1p)
- what it is (1p)
- OOM Scenario #3: Node available memory drops below the --eviction-hard flag value (25p)
- Changing the --eviction-hard memory threshold
- new value and how to change the flag value (1p)
- how “kubepods” and “allocatable” cgroup limits change when the flag’s value is changed (6p)
- Interactions between Kubelet’s pod eviction mechanism and the kernel’s OOM killer (2p)
- Is Kubelet killing containers due to OOM?
- documentation and misleading event name (3p)
- arguments against it (2p)
- Conclusions around pod evictions (9p)
- Is it a problem that kubectl top pod shows a memory usage >100%? (27p)
- Signals and exit codes (10p)
- Metrics testing
- purpose of this section (1p)
- scenario description (3p)
- tool’s console output and explanation of visible memory (2p)
- next steps and reference to metric components diagram (1p)
- why only working set size will be considered as metric (2p)
- Grafana (4p)
- kubectl top pod (2p)
- kubectl top node (5p)
- Resource Metrics endpoint (3p)
- Summary API endpoint (2p)
- cAdvisor (3p)
- cgroups pseudo-filesystem (4p)
- htop (4p)
- Why is there a difference between what htop shows for the container process and kubectl top pod? (6p)
- Flows leading to out-of-memory situations
- intro and diagram of flow (4p)
- description of what the diagram represents (2p)
- assumptions in creating the diagram (1p)
- effects not necessarily tied to the allocating container (1p)
- OOM scenarios
- intro and link to flows diagram (1p)
- info on pod manifests that can be used to replicate the scenarios (1p)
- OOM1: Container is OOMKilled when it exceeds its memory limit (3p)
- OOM2: Pods’ memory usage exceeds node’s “allocatable” value (4p)
- OOM3: Node available memory drops below the hard eviction threshold (3p)
- OOM4: Pods’ memory usage exceeds node’s “allocatable” value (fast allocation) (2p)
- OOM5: Container has a limit set, app inside allocates memory but the app’s runtime eventually fails the allocations way before the limit (4p)
- Q&A
- list of other articles exploring the topics present in this part 4 (2p)
- Q&A: Pod evictions (15p)