Alerts

If your application is serving metrics, you can create alerts to notify your team when something happens.

You can define alerts by using Kubernetes resources (PrometheusRule), as well as directly in Grafana (GUI-based).

Kubernetes resources

Warning

The current implementation is an MVP; more convenience features will be added.

We use native Prometheus alert rules, and let Alertmanager handle the notifications.

A prerequisite for defining alerts is to specify where the team's alerts should be sent in that cluster. This is done by defining an AlertmanagerConfig as well as a Secret for your Slack webhook URL:

---
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: slack-webhook
stringData:
  apiUrl: "https://hooks.slack.com/services/..."
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: slack
spec:
  receivers:
    - name: myteam-slack
      slackConfigs:
        - apiURL:
            key: apiUrl
            name: slack-webhook
          sendResolved: true
          channel: team-env-alerts # E.g. foo-dev-alerts
          color: '{{ template "slack.color" . }}'
          text: '{{ template "slack.text" . }}'
          title: '{{ template "slack.title" . }}'

  route:
    receiver: myteam-slack
    groupBy:
      - alertname
    groupInterval: 5m # how long to wait before notifying about new alerts added to an already-notified group
    groupWait: 10s # how long to wait before sending the initial notification for a new group of alerts
    repeatInterval: 1h # how long to wait before repeating a notification for alerts that are still firing

Apply these resources to your team's namespace by creating a file containing the content above with your own values and running kubectl apply -f <path to file>.
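
For example, assuming the resources above are saved in a file named alerts-config.yaml and your team namespace is myteam (both names are placeholders, substitute your own):

kubectl apply -f alerts-config.yaml --namespace=myteam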

Now you are able to create alerts using PrometheusRule like so:

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myteam-alert
spec:
  groups:
  - name: myteam-alerts
    rules:
      - alert: InstanceDown
        expr: count(up) == 0
        for: 5m
        annotations:
          consequence: Application is unavailable
          action: "`kubectl describe pod <podname>` -> `kubectl logs <podname>`"
          summary: |-
            This is a multi-line summary with
            linebreaks and everything. Here you can give a more detailed
            summary of what this alert is about 
        labels:
          namespace: <team namespace>
          severity: critical 

Apply this resource to your team's namespace by creating a file containing the content above with your own values and running kubectl apply -f <path to file>.
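
To check that the resource was created, you can list the PrometheusRule resources in your namespace (here myteam is a placeholder for your own team namespace):

kubectl get prometheusrules --namespace=myteam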

How to write a good alert

Writing the expr

In order to minimize the feedback loop, we suggest experimenting on the Prometheus server to find the right metric for your alert and the notification threshold. The Prometheus server can be found in each cluster, at https://prometheus.{env}.{tenant-name}.cloud.nais.io (e.g. https://prometheus.dev.nav.cloud.nais.io).
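
The up metric without any label selectors matches every target the Prometheus server scrapes, so in practice you will usually want to scope the expression to your own application. A minimal sketch, assuming your application's metrics carry an app label with the value myapp (adjust to the labels your metrics actually have):

sum(up{app="myapp"}) == 0

Note that this only fires while the scrape targets still exist but fail to respond; if the targets disappear from Prometheus entirely, an expression like absent(up{app="myapp"}) covers that case.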

You can also visit the Alertmanager at https://alertmanager.{env}.{tenant-name}.cloud.nais.io (e.g. https://alertmanager.dev.nav.cloud.nais.io) to see which alerts are triggered now (you can also silence already triggered alerts).

for

How long the expr must evaluate to true before the alert fires.

When the expr first evaluates to true, the alert will be in a pending state for the specified duration.

Example values: 30s, 5m, 1h.

Severity

This affects the color of the notification. Possible values are critical (red), warning (yellow) and notice (green).

Consequence

Optionally describe, ahead of time, what happens when this alert fires, so the person receiving it understands the impact.

Action

Optionally describe, ahead of time, the best course of action for the person receiving the alert to resolve the issue.

Summary

Optional longer description of the alert.
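
For reference, severity is set as a label, while consequence, action and summary are annotations on the rule, as in the PrometheusRule example above (the values below are only illustrations):

labels:
  severity: warning
annotations:
  consequence: Users may experience slow responses
  action: Check the application logs for errors
  summary: Response times have been above the acceptable threshold for a while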

