# KOF Alerts

## Summary
At this point you have metrics collected and visualized. It is important to check them manually, but it is even better to automate detection of issues in the data and notification about them.

We believe the rules should be configured as YAML IaC (Infrastructure as Code), while temporary actions such as Silences can be managed in the UI.

Alerting rules and recording rules in KOF are based on the `PrometheusRules` from the `kube-prometheus-stack` chart, with per-cluster customization options.

KOF uses data-source-managed rules to store and execute recording rules in regional clusters, closer to the source data, and to reduce the load on Grafana; even alerting rules are data-source-managed, executed by Promxy in the management cluster.
Promxy is used as a data source and executor of alerting rules instead of VMAlert because:

- As the Promxy FAQ says, "for example, if you wanted to know that the global error rate was < 10%, this would be impossible on the individual prometheus hosts (without federation, or re-scraping) but trivial in promxy."
- It fixes the "See graph" button in the Grafana Alerting rules UI, as Grafana gets the metrics from all regional clusters via Promxy.
VMAlertManager aggregates alerts and sends them to various receivers like Slack, with advanced routing options.

Let's start with a demo of an alert being sent and received.
## Alertmanager Demo

1. Open https://webhook.site/ and save "Your unique URL" for the next step.

2. Add the following to the `mothership-values.yaml` file, replacing `$WEBHOOK_URL` with the URL from step 1:

    ```yaml
    victoriametrics:
      vmalert:
        vmalertmanager:
          config: |
            route:
              receiver: webhook
            receivers:
              - name: webhook
                webhook_configs:
                  - url: $WEBHOOK_URL
    ```

3. Apply the `mothership-values.yaml` file as described in the Management Cluster section.

4. Wait until https://webhook.site/ shows the `Watchdog` alert, as in:

    ```json
    {
      "receiver": "webhook",
      "status": "firing",
      "alerts": [
        {
          "status": "firing",
          "labels": {
            "alertgroup": "general.rules",
            "alertname": "Watchdog",
            "severity": "none",
            "source": "promxy"
          },
          "annotations": {
            "description": "This is an alert meant to ensure that the entire alerting pipeline is functional...",
            "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/general/watchdog",
            "summary": "An alert that should always be firing to certify that Alertmanager is working properly."
          },
          "startsAt": "2025-06-02T10:27:29.14Z",
          "endsAt": "0001-01-01T00:00:00Z",
          "generatorURL": "http://127.0.0.1:8082/...",
    ```
## Advanced Routing
The configuration of the Alertmanager Demo is very basic.
Please use these guides to apply advanced routing options:
- Prometheus Alertmanager configuration reference - all possible options.
- VMAlertManager Slack example - a multichannel notification system to ensure that critical alerts are promptly delivered to the responsible teams.
- Matchers - configurable routing rules that determine where and how alerts are directed (for example, email, Slack, PagerDuty) based on severity, source, or other attributes.
- Grouping and the related Prometheus example with `group_by: [cluster, alertname]` - you may want to use `group_by: [alertgroup, alertname]` instead to correlate alerts across clusters, identify systemic issues, and reduce noise when the same alert fires in multiple clusters. A combined sketch is shown after this list.
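For illustration only, a route combining grouping and matchers might look like the minimal sketch below. The `slack-critical` receiver, the `$SLACK_WEBHOOK_URL` placeholder, the channel name, and the `severity="critical"` matcher are assumptions for the example, not defaults shipped with KOF:

```yaml
victoriametrics:
  vmalert:
    vmalertmanager:
      config: |
        route:
          receiver: webhook                  # default receiver from the demo above
          group_by: [alertgroup, alertname]  # correlate the same alert across clusters
          routes:
            - receiver: slack-critical       # hypothetical receiver for critical alerts
              matchers:
                - severity="critical"
        receivers:
          - name: webhook
            webhook_configs:
              - url: $WEBHOOK_URL
          - name: slack-critical
            slack_configs:
              - api_url: $SLACK_WEBHOOK_URL  # hypothetical Slack incoming webhook URL
                channel: "#alerts-critical"
```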
## Alertmanager UI

To access the Alertmanager UI:

1. In the management cluster, forward the Alertmanager port:

    ```bash
    kubectl port-forward -n kof svc/vmalertmanager-cluster 9093:9093
    ```

2. Open http://127.0.0.1:9093/ and check tabs such as "Alerts" and "Silences".

See the demo in the Grafana Alerting UI section, where the Alertmanager UI shows the same data.
## Grafana Alerting UI

To access the Grafana Alerting UI:

1. Apply the Access to Grafana step.

2. Open: Grafana - Alerting - and then "Alert rules" or "Silences", like this:
## Prometheus UI

There are a few places where you can find the graph of a firing alert:

- Grafana - Alerting - Alert rules - rule - See graph.
  This shows the graph in the Grafana UI, as in the Grafana Alerting UI demo above.
- Grafana - Alerting - Groups - group - alert - See source - Graph.
  This shows the graph in the Prometheus UI.
- The same Prometheus UI link is sent to a receiver like Slack in the `generatorURL` field, as shown in the Alertmanager Demo.

The Prometheus UI looks like this:
To enable the Promxy Prometheus UI, please run this command in the management cluster:

```bash
kubectl port-forward -n kof svc/kof-mothership-promxy 8082:8082
```

If you expose the Prometheus UI with an external domain, please set `promxy.extraArgs."web.external-url"` in the `mothership-values.yaml` file and reapply it as described in the Management Cluster section. A sketch of this value follows.
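For example, assuming the hypothetical domain `promxy.example.com`, the value could look like this sketch:

```yaml
# Assumption: promxy.example.com is a placeholder; use the domain
# under which you actually expose the Promxy Prometheus UI.
promxy:
  extraArgs:
    "web.external-url": "https://promxy.example.com/"
```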
## Custom rules

You can update or create rules for all or specific clusters in a centralized way, passing values to the `kof-mothership` chart installed in the management cluster.

For example, let's update the `CPUThrottlingHigh` alert in the `kubernetes-resources` group:
1. Note that the original alert in the `PrometheusRule` has the threshold `> ( 25 / 100 )`.

2. Add this cluster-specific patch to the `mothership-values.yaml` file:

    ```yaml
    clusterAlertRules:
      cluster1:
        kubernetes-resources:
          CPUThrottlingHigh:
            expr: |-
              sum(increase(container_cpu_cfs_throttled_periods_total{cluster="cluster1", container!=""}[5m])) without (id, metrics_path, name, image, endpoint, job, node)
                / on (cluster, namespace, pod, container, instance) group_left
              sum(increase(container_cpu_cfs_periods_total{cluster="cluster1"}[5m])) without (id, metrics_path, name, image, endpoint, job, node)
                > ( 42 / 100 )
    ```

    Note the `cluster="cluster1"` filters and the `> ( 42 / 100 )` threshold.

3. Add a similar patch for `cluster10` to the same `clusterAlertRules`.

4. Now that we have special `CPUThrottlingHigh` alerts for `cluster1` and `cluster10`, we want to exclude these clusters from the default `CPUThrottlingHigh` alert, to avoid ambiguity about which threshold fires this alert in each cluster. Add this patch to the same file:

    ```yaml
    defaultAlertRules:
      kubernetes-resources:
        CPUThrottlingHigh:
          expr: |-
            sum(increase(container_cpu_cfs_throttled_periods_total{cluster!~"^cluster1$|^cluster10$", container!=""}[5m])) without (id, metrics_path, name, image, endpoint, job, node)
              / on (cluster, namespace, pod, container, instance) group_left
            sum(increase(container_cpu_cfs_periods_total{cluster!~"^cluster1$|^cluster10$"}[5m])) without (id, metrics_path, name, image, endpoint, job, node)
              > ( 25 / 100 )
    ```

    Note the `cluster!~"^cluster1$|^cluster10$"` filters and the default threshold.

5. You can also update or create recording rules in the same way, but the whole rule group should be redefined, because the `record` field is not unique.

6. You may update or create more rules, like the `ContainerHighMemoryUsage` alert that was added on demand from the awesome-prometheus-alerts collection. A hedged sketch of adding such an alert is shown after this list.

7. Apply the `mothership-values.yaml` file as described in the Management Cluster section.
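For illustration only, a patch defining such an alert could look like the following sketch. It assumes the `defaultAlertRules` patch accepts the standard PrometheusRule alert fields (`for`, `labels`, `annotations`) in addition to `expr`, and the expression shown is a simplified variant inspired by the awesome-prometheus-alerts `ContainerHighMemoryUsage` rule, not the exact rule shipped with KOF:

```yaml
# A sketch only: field support and the exact expression should be checked
# against the kof-mothership chart values and the upstream rule.
defaultAlertRules:
  kubernetes-resources:
    ContainerHighMemoryUsage:
      expr: |-
        sum(container_memory_working_set_bytes{container!=""}) by (cluster, namespace, pod, container)
          / sum(container_spec_memory_limit_bytes{container!=""} > 0) by (cluster, namespace, pod, container)
          * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Container high memory usage ({{ $labels.namespace }}/{{ $labels.pod }})"
        description: "Container memory usage is above 80% of its limit."
```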
## Generation of rules

The next steps are automated:

```mermaid
graph TB
  KPSRF[rules files and values<br>copied from kube-prometheus-stack<br>to kof-mothership] --> PR[PrometheusRules]
  ARV[kof-mothership values:<br>defaultAlertRules,<br>clusterAlertRules] --> ARCM[input ConfigMaps:<br>k-m-promxy-rules-default,<br>k-m-promxy-rules-cluster-*] --kof-operator:<br>configmap_controller<br>updates--> KMPR[output ConfigMap:<br>k-m-promxy-rules]
  PR --> KMPR ---> EPR[Alerting /etc/promxy/rules]
  RRV[kof-mothership values:<br>defaultRecordRules<br>clusterRecordRules] --> RRCM[input ConfigMaps:<br>kof-record-rules-default,<br>kof-record-rules-cluster-*]
  PR --> KRVM[output ConfigMap:<br>kof-record-vmrules-*]
  RRCM --> KRVM
  KRVM --"Management special case:<br>helm upgrade -i kof-storage<br>-f vmrules.yaml"--> KSV
  KRVM --Regional MCS/ClusterProfile<br>valuesFrom: ConfigMap--> KSV[kof-storage values:<br>vmrules: groups: ...] --> VMR[Recording VMRules]
  RCD[Regional ClusterDeployment] --"kof-operator:<br>clusterdeployment_controller<br>creates empty"--> KRVM
```
- Rules patches (empty by default) are rendered from `kof-mothership` values to the input `ConfigMaps`, which are merged with the upstream `PrometheusRules`, generating the output `ConfigMaps`.
    - If you want to protect some output `ConfigMap` from automatic changes, set its label `k0rdent.mirantis.com/kof-generated: "false"`, as in the sketch after this list.
- Alerting rules are mounted to Promxy in the management cluster as `/etc/promxy/rules`.
- Recording rules are passed via `MultiClusterService` (or `ClusterProfile` for the `istio` case) to each regional cluster, where the `kof-storage` chart renders them to `VMRules`.
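For example, the metadata of a protected `kof-record-vmrules-mothership` output `ConfigMap` (one of the output ConfigMaps mentioned on this page) would look like this minimal sketch; adjust the name and namespace to your setup:

```yaml
# A sketch only: the label value "false" opts this ConfigMap out of
# automatic regeneration by the kof-operator, as described above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kof-record-vmrules-mothership
  namespace: kof
  labels:
    k0rdent.mirantis.com/kof-generated: "false"
```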
## Mothership recording rules

If you have chosen to store KOF data of the management cluster in the same management cluster, then:

1. Copy the generated mothership recording rules from the output `ConfigMap` to a YAML file:

    ```bash
    kubectl get cm -n kof kof-record-vmrules-mothership -o yaml \
      | yq -r .data.values > vmrules.yaml
    ```

2. Add `-f vmrules.yaml` to the `helm upgrade ... kof-storage` command described in the From Management to Management section and apply it.
## Execution of rules

Details of where and how the recording and alerting rules are executed:

```mermaid
sequenceDiagram
  box rgba(0, 0, 255, 0.2) Regional kof-storage
    participant VMR as Recording VMRules
    participant VMA as VMAlert
    participant VMS as VMStorage
  end
  box rgba(255, 0, 0, 0.2) Management kof-mothership
    participant MP as Promxy
    participant MVMS as VMStorage
    participant VMAM as VMAlertManager
  end
  VMA->>VMR: execute
  VMA->>VMS: read "expr" metrics
  VMA->>VMS: write "record" metrics
  note over MP: execute<br>Alerting /etc/promxy/rules
  MP->>VMS: read "expr" metrics
  MP->>MVMS: write "ALERTS" metrics
  MP->>VMAM: Notify about alert
```
- Recording `VMRules` are executed by `VMAlert`, reading from and writing to `VMStorage` - all of this happens in `kof-storage` in each regional cluster.

    The From Management to Management case is special: `VMRules` are provided by the `kof-storage` chart in the management cluster, while `VMAlert` and `VMStorage` are provided by `kof-mothership` - to avoid having two VictoriaMetrics engines in the same cluster.

- Alerting rules are executed by Promxy in `kof-mothership` in the management cluster, reading metrics from all regional `VMStorages`, writing to the management `VMStorage`, and notifying `VMAlertManager` in the management cluster.