Setting up Grafana to monitor Kubernetes metrics and logs

Grafana is a tool beloved by its community for its versatile graphing and alerting capabilities. Combined with the right data sources, it can monitor just about anything a developer or operator may be interested in, from hardware usage to marketing goals. Including it in a monitoring stack for Kubernetes clusters is an easy decision, and can be done fairly quickly with tools like Helm.

Preparing for installation

Before beginning the installation process, you need Helm installed (at least v3), as well as the repositories for the charts we are about to install.
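
To quickly confirm the tooling is in place, you can check the client versions (any Helm v3 release will do):

helm version --short
kubectl version --client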

Add the repositories:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts

Then update them to ensure you are using the latest chart versions:

helm repo update
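
If you want to verify that the repositories were added correctly, helm search should now list the charts we are about to install:

helm search repo kube-prometheus-stack
helm search repo grafana/loki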

As a last step, we will create a namespace for all monitoring components, so they are neatly grouped together:

kubectl create namespace monitoring

With the prerequisites finished up, we are now ready for the installation.

Installing Grafana and Prometheus

The Prometheus community maintains a very comprehensive Helm chart for a full Prometheus/Grafana stack to monitor metrics within a Kubernetes cluster. The chart not only ships with all necessary dependencies preconfigured, but also includes a large set of default alerts, so you don't have to spend time figuring out which values need to be monitored within a Kubernetes cluster.

Before installing the stack, we will slightly customize it to include a preconfigured data source for the Loki instance installed later. The X-Scope-OrgID header tells Grafana which Loki tenant to query; its value has to match the tenant_id we will configure for Promtail further down.

kube-prometheus-stack-values.yaml

grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.monitoring.svc.cluster.local
      access: proxy
      jsonData:
        httpHeaderName1: "X-Scope-OrgID"
      secureJsonData:
        httpHeaderValue1: "1"
  sidecar:
    datasources:
      enabled: true

Install the chart using the custom values:

helm install kube-prometheus-stack -n monitoring --values kube-prometheus-stack-values.yaml prometheus-community/kube-prometheus-stack

The command may seem unresponsive while it is working, but it should complete after a few seconds. Once it finishes, the installation is done - but the services may not be fully running yet.

You can watch the pod status with:

kubectl --namespace monitoring get pods -l "release=kube-prometheus-stack" --watch
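
Alternatively, if you prefer a command that simply blocks until everything is ready, kubectl wait can serve the same purpose; the label selector matches the release name used above, and the wait assumes no leftover Completed job pods carry that label:

kubectl wait --namespace monitoring --for=condition=Ready pods -l "release=kube-prometheus-stack" --timeout=300s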

Once all pods are Running or Completed, you may proceed to the next step.

Configuring Grafana

When accessing the Grafana instance for the first time, the initial username will be admin, and the default password can be retrieved with:

kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Forward the Grafana service to your local machine:

kubectl port-forward -n monitoring service/kube-prometheus-stack-grafana 8080:80

Then access http://127.0.0.1:8080 and log in with the credentials you just retrieved. Before anything else, you should change the admin password, to ensure the instance is not vulnerable to default-credential attacks.
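
If you would rather script the password change than click through the UI, something along these lines should work - this assumes the chart's default deployment name and that grafana-cli is available in the grafana container:

# assumes the default deployment name from the chart; replace the password placeholder
kubectl exec -n monitoring deployment/kube-prometheus-stack-grafana -c grafana -- \
  grafana-cli admin reset-admin-password 'a-much-stronger-password'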

Alerts and dashboards are already preconfigured, so you can use them out of the box for most needs. All you need to do is set up contact points (Alerting > Contact points) to forward those alerts somewhere you will actually see them, like a Slack channel or work email.

Installing Loki

For the Loki installation, we will choose a middle ground in terms of complexity: storage will be durable through a deployed MinIO S3 instance, but the underlying volumes still need to be highly available, as the S3 storage itself is not redundant. When deploying Loki, you can do so in one of three modes:


  • SingleBinary: Deploys Loki as a single binary instance, without any clustering or HA capabilities. This is fine for smaller clusters with up to ~10 GB of logs per day.

  • SimpleScalable: A balanced mode, deploying Loki as a small highly available cluster together with an S3 storage provider. Suitable for most users, up to ~1 TB of logs per day.

  • Distributed: Deploys Loki as a set of highly available microservices. The most durable option, scaling to several TB of logs per day, but it also requires a large amount of resources to run.

For our sample, we are using SimpleScalable to strike a balance between availability and cost that works for most users. The default Helm chart for Loki is almost exactly what is needed for this mode; only the MinIO storage needs to be configured:

loki.yaml

loki:
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4

deploymentMode: SimpleScalable

# Enable MinIO for storage
minio:
  enabled: true
  rootUser: enterprise-logs
  rootPassword: supersecret

Make sure to change the rootUser and rootPassword values to something more secure, then install the chart:

helm install loki -n monitoring --values loki.yaml grafana/loki
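
If you would rather not keep credentials in the values file at all, they can also be overridden at install time via --set; the value paths below mirror the minio section of the file above, and the user name is just an example:

# loki-admin is an arbitrary example user; the password is generated on the fly
MINIO_PASSWORD="$(openssl rand -base64 24)"
helm install loki -n monitoring --values loki.yaml \
  --set minio.rootUser=loki-admin \
  --set minio.rootPassword="$MINIO_PASSWORD" \
  grafana/loki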

Once all pods are running, the Loki instance is ready to receive logs.

Deploying a Promtail DaemonSet

The last missing piece in the stack is Promtail, which delivers logs from pods to Loki. Since the Loki instance lives in the monitoring namespace, we have to change the client URL from the chart's default value:

promtail.yaml

config:
  # publish data to loki
  clients:
    - url: http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push
      tenant_id: 1

Then the daemonset can be installed:

helm install promtail -n monitoring --values promtail.yaml grafana/promtail

Once all Promtail pods are up and running, logs from every pod in the cluster should be delivered to the Loki instance.
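
To confirm Promtail is actually shipping logs, you can peek at the DaemonSet's own output; the resource name assumes the release name promtail used above:

kubectl logs -n monitoring daemonset/promtail --tail=20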

All that is left to do now is to configure dashboards and alerts for your logs - but since there is no generic way to do this, you will have to take this step manually. Take some time to go over LogQL and Grafana's alerting functionality, to ensure you stay informed about every event that needs your attention.
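
As a starting point, a LogQL query along these lines counts log lines containing "error" per namespace over the last five minutes; the label selector and filter expression are only examples and should be adjusted to your own workloads:

sum by (namespace) (count_over_time({namespace=~".+"} |= "error" [5m]))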
