I need a little bit of help here. I have a Kubernetes cluster up and running, and I have a dedicated machine for monitoring with Prometheus running on it. I already have node exporters running, and Prometheus scrapes machine-level metrics from them (CPU, memory, file system, etc.).
But I am still confused about how to proceed. I just tried to integrate Kubernetes elements into Prometheus (I started with services). The (basic) scrape config and the service account (token) are not much of a problem. My issues are the following.
Most importantly: I am wondering why basically everyone's go-to strategy is to run Prometheus INSIDE the SAME cluster they want to monitor. To me, this feels like a really bad idea. What am I missing? The logic seems simple to me: if my Kubernetes cluster goes down (for whatever reason), the monitoring (including alerting) goes down with it, which I think should be avoided.
Currently I am trying to add monitoring for Kubernetes services (as a start). But they all show as down, because Prometheus can't reach addresses like http://argo-server.argo.svc:2746/metric from the monitoring machine. At least this proves that Prometheus can successfully talk to the Kubernetes API (so that's good), but it also implies that Prometheus itself is expected to run inside Kubernetes, because those addresses are only reachable from inside the cluster (which closes the loop to my first point).
I also stumbled across something called "agent mode". I'm not sure yet whether this can solve my problem. With "agent mode" in mind (and a blurry understanding of how it might work), I can think of a scenario where I deploy a Prometheus in agent mode within the Kubernetes cluster. Its only job would be to act as some kind of proxy for the real Prometheus, which runs in server mode outside the cluster. It also seems that this can be configured to basically push metrics into another Prometheus server, which kind of conflicts with the pull-only policy of Prometheus. So, is that design possible? And is it a good idea?
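To make the agent-mode idea more concrete, here is a rough, untested sketch of what I imagine the config of the in-cluster agent would look like. The cluster label and the hostname of the outside Prometheus are placeholders I made up, and I am assuming the agent is started with the agent-mode flag (`--enable-feature=agent` on Prometheus 2.x, as far as I can tell) and that the outside server is started with `--web.enable-remote-write-receiver` so that it accepts pushed samples.

```yaml
# prometheus-agent.yml (hypothetical file name), runs INSIDE the cluster in agent mode
global:
  scrape_interval: 30s
  external_labels:
    cluster: my-cluster        # placeholder, to tell clusters apart on the outside server

scrape_configs:
  - job_name: kubernetes-services
    kubernetes_sd_configs:
      - role: service          # no api_server set: uses the pod's own service account

# Forward everything that was scraped to the "real" Prometheus outside the cluster.
remote_write:
  - url: http://prometheus.example.internal:9090/api/v1/write   # placeholder host
```

If I understand it correctly, the agent would still pull from the targets inside the cluster, and only the hop from the agent to the outside server would be a push.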
So basically I need input on where to put Prometheus (inside vs. outside of the cluster), and why. I would also appreciate some pointers on good strategies for collecting Kubernetes metrics and bringing them into a Prometheus outside the cluster.