
I'm at the end of my patience with a Prometheus setup based on kube-prometheus-stack 44.3.0 (the latest being 45).

I have two environments, staging and prod. In staging, my prometheus runs smoothly. In prod it has started crashing with OOMKilled errors roughly every 4 minutes.

Things I already tried:

  • Increased the scrape interval from 30s to 300s
  • Identified heavy metrics and dropped them before ingestion (more on that below)
  • Enabled web.enable-admin-api so I could query the TSDB and clean up tombstones
  • Deleted PrometheusRules, having noticed that they tended to shorten the pod's life until the next crash
  • Upped the resources (limits and requests) to the maximum available given the nodes I'm using: the memory limit is currently at 6Gi, while staging works with under 1Gi (see the values sketch after this list)
  • Reduced the number of scrape targets (e.g. dropped etcd metrics)
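
For reference, the relevant part of my prod values looks roughly like this. Field names follow kube-prometheus-stack's prometheus.prometheusSpec passthrough; the CPU figures are placeholders, while the memory limit and scrape interval are the real values from the list above:

  prometheus:
    prometheusSpec:
      scrapeInterval: 300s      # raised from the default 30s
      enableAdminAPI: true      # exposes the TSDB admin endpoints (e.g. clean_tombstones)
      resources:
        requests:
          cpu: "1"              # placeholder value
          memory: 6Gi
        limits:
          cpu: "1"              # placeholder value
          memory: 6Gi           # the maximum the nodes can accommodate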

Comparing TSDB status across staging and prod (screenshots omitted): while prod is up, it doesn't show higher numbers than staging, right up until it crashes.

By looking at the TSDB statistics I noticed that kube_replicaset metrics used to swarm Prometheus. Another component in the cluster had created a high number of ReplicaSets due to a bug, inflating those metrics. I dropped them from ingestion completely:

  ...
  metricRelabelings:
  - regex: '(kube_replicaset_status_observed_generation|kube_replicaset_status_replicas|kube_replicaset_labels|kube_replicaset_created|kube_replicaset_annotations|kube_replicaset_status_ready_replicas|kube_replicaset_spec_replicas|kube_replicaset_owner|kube_replicaset_status_fully_labeled_replicas|kube_replicaset_metadata_generation)'
    action: drop
    sourceLabels: [__name__]

I verified that those ReplicaSet metrics are no longer present in the prod Prometheus.

TL;DR:

Prometheus in my K8s environment is OOMKilled continuously, making the tool nigh impossible to use. I need insight on how to find and isolate the cause of the issue. Right now the only reasonable culprit still seems to be kube-state-metrics (TODO: I need to disable it to verify the idea).
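
If kube-state-metrics does turn out to be the culprit, the plan is to switch it off from the chart values while testing. As far as I can tell from the kube-prometheus-stack values.yaml (worth double-checking against 44.3.0), that toggle is:

  # Temporarily disable the kube-state-metrics subchart to test the hypothesis;
  # key name as I read it from the chart's values.yaml, verify for your chart version.
  kubeStateMetrics:
    enabled: false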

  • Hi Liquid, welcome to S.F. While sniffing around I saw this comment where they had 30Gi, so your 6Gi may be table stakes at this point. However, that issue also made it seem as though it wasn't ongoing Prometheus operation eating all that memory so much as startup; is that your experience, too? Have you already examined the prometheus_tsdb_head_series metric?
    – mdaniel
    Commented Apr 15, 2023 at 0:39
  • @mdaniel, thanks for the tip. In my case it's quite a baby-sized cluster of around 6 nodes. I've checked how prometheus_tsdb_head_series travels and it's around 60k. I was able to isolate the broken component, which was related to metrics shipped by kube_state_metrics. I'll post an answer as an update in case it's useful for someone else.
    – Liquid
    Commented Apr 17, 2023 at 14:52
  • If you have an answer, it is always better to post it. And make it an answer, not an update to the question: that way it will be clearer for those who might stumble across a similar problem.
    – markalex
    Commented Apr 17, 2023 at 14:56

2 Answers


Here are the most likely reasons for Prometheus eating memory:

  1. Overwhelming number of time series. Given the background, this is the most plausible. In Prometheus, data points take comparatively little memory; unique time series are what costs. I couldn't find the link now, but AFAIR one data point takes around 4 bytes, while a time series without any data points takes around 1 KB. So having time series, even without any data points, takes space and can eat memory. You can rule this out by comparing the number of time series in prod and staging: count({__name__=~".+"}). If there are significantly more time series in prod, you'll have to figure out why and probably reduce the number further.
  2. PromQL queries that load too much data into memory. Queries requesting a long time range or a huge number of time series can also be the cause, since Prometheus loads the requested data into memory. Since your OOM reproduces constantly, you can test this assumption by blocking all queries to Prometheus and seeing whether it still gets killed. It may be worth looking at the query log too.
  3. Not enough memory on the node. It could simply be that other containers on the node consume the memory and Prometheus gets killed first because it has a lower QoS class. Make sure Prometheus falls into the Guaranteed QoS class (see the sketch after this list).
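
To illustrate point 3: Kubernetes assigns the Guaranteed class only when every container in the pod (sidecars included) declares requests equal to limits for both CPU and memory. A minimal sketch, with placeholder numbers, of what the Prometheus container's resources need to look like (in kube-prometheus-stack this is set through the prometheus.prometheusSpec.resources passthrough):

  resources:
    requests:
      cpu: "1"
      memory: 6Gi
    limits:
      cpu: "1"        # equal to the request...
      memory: 6Gi     # ...for every resource, on every container => Guaranteed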

The cause of my issue was a broken Keycloak deployment in the keycloak namespace. An old Keycloak setup was creating a high number of ReplicaSets (around 36,000), which caused the high cardinality of the ReplicaSet-related metrics in Prometheus.

The issue did not show up in staging because staging didn't mirror that configuration completely.

I had already tried the following relabeling on kube-state-metrics, dropping those metrics before ingestion:

   - regex: '(kube_replicaset_status_observed_generation|kube_replicaset_status_replicas|kube_replicaset_labels|kube_replicaset_created|kube_replicaset_annotations|kube_replicaset_status_ready_replicas|kube_replicaset_spec_replicas|kube_replicaset_owner|kube_replicaset_status_fully_labeled_replicas|kube_replicaset_metadata_generation)'
    action: drop
    sourceLabels: [__name__]

but it proved to be too conservative on its own. After also adding:

   - regex: 'keycloak'
     action: drop
     sourceLabels: [namespace]

my instance became stable again.
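
For completeness, the combined relabeling I ended up with, expressed as one metricRelabelings list on the kube-state-metrics scrape configuration (exact placement depends on how your chart wires up the ServiceMonitor):

   metricRelabelings:
   # drop the ReplicaSet metrics that exploded in cardinality
   - regex: '(kube_replicaset_status_observed_generation|kube_replicaset_status_replicas|kube_replicaset_labels|kube_replicaset_created|kube_replicaset_annotations|kube_replicaset_status_ready_replicas|kube_replicaset_spec_replicas|kube_replicaset_owner|kube_replicaset_status_fully_labeled_replicas|kube_replicaset_metadata_generation)'
     action: drop
     sourceLabels: [__name__]
   # drop everything kube-state-metrics reports about the broken keycloak namespace
   - regex: 'keycloak'
     action: drop
     sourceLabels: [namespace]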
