Prometheus Alertmanager Alert Routing
Introduction
Alertmanager's job is not simply to "send alerts out". It groups, deduplicates, routes, inhibits, silences, and escalates the alert events coming from Prometheus so that the on-call organization can actually digest the signal. A mature alerting system is not measured by how many alerts it produces, but by whether important alerts reach the right people quickly while noise does not drown the team.
The design goals of an alerting system can be summed up as four kinds of "correct":
- The correct people: alerts reach the team and individual responsible for the service
- The correct time: notify as soon as the incident starts, stay quiet during maintenance windows
- The correct content: include an incident summary, the impact, and an entry point for investigation
- The correct frequency: avoid alert storms without missing critical alerts
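The rest of this article builds toward a full configuration; as a preview, here is a minimal sketch of the moving parts described above (the receiver name and URL are illustrative):
# Minimal alertmanager.yml sketch: one route, one receiver, no inhibition yet
route:                      # routing: decides where alerts go
  receiver: default         # every alert lands here until child routes are added
  group_by: ['alertname']   # grouping: one notification per alert name instead of per instance
receivers:
  - name: default
    webhook_configs:
      - url: 'http://alert-router.internal/default'   # illustrative internal endpoint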
Characteristics
Alert Architecture Design
End-to-End Alert Pipeline
# Complete alert processing pipeline
[App / Middleware] --metrics--> [Prometheus] --rule evaluation--> [Alertmanager]
                                                                       |
                                                                       +-- grouping / deduplication
                                                                       +-- route matching
                                                                       +-- inhibition check
                                                                       +-- silence check
                                                                       |
                                                                       v
                                                                  [Receivers]
                                                                  /    |    \
                                                             [Email]  [IM]  [Pager]
                                                               /       |       \
                                                        [On-call] [Team chat] [On-call system]
Alert Severity Levels
# Recommended alert severity scheme
# Critical (P0): service completely unavailable, core business impacted
#   Response time: within 5 minutes
#   Notification: phone/Pager + IM + email
#   Examples: payment service down, primary database node failure, core API 5xx rate above 50%
# Warning (P1): partial degradation, non-core functionality impacted
#   Response time: within 30 minutes
#   Notification: IM + email
#   Examples: falling cache hit rate, disk usage above 80%, growing number of slow queries
# Info (P2): worth attention, no service impact
#   Response time: handled during working hours
#   Notification: email / ticket
#   Examples: certificate expiring soon, log collection lag, configuration change not taking effect
Implementation
Basic Routing, Grouping, and Receivers
# alertmanager.yml: complete configuration
# Global settings
global:
  # Default time after which an alert is treated as resolved when the sender does not set
  # an end time (alerts sent by Prometheus carry their own end time)
  resolve_timeout: 5m
  # SMTP settings (email notifications)
  smtp_smarthost: 'smtp.example.com:25'
  smtp_from: 'monitor@example.com'
  smtp_auth_username: 'monitor@example.com'
  # Note: Alertmanager does not expand environment variables in this file;
  # the placeholder below must be substituted when the file is rendered and deployed
  smtp_auth_password: '${SMTP_PASSWORD}'
  smtp_require_tls: true
# Notification template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Routing tree
route:
  # Default receiver (used when no child route matches)
  receiver: default-team
  # Grouping labels (alerts with the same values for these labels are merged into one notification)
  group_by: ['alertname', 'job', 'severity']
  # How long to wait after the first alert of a new group, so similar alerts can be batched
  group_wait: 30s
  # Minimum interval between successive notifications for the same group
  group_interval: 5m
  # How often to re-send the notification while the alerts keep firing unchanged
  repeat_interval: 4h
  # Child routes (matched top to bottom)
  routes:
    # Critical alerts: go to the SRE team
    - matchers:
        - severity = critical
      receiver: sre-critical
      group_wait: 10s        # notify critical alerts faster
      group_interval: 2m
      repeat_interval: 30m   # repeat notifications more frequently
      continue: true         # keep matching subsequent routes
    # Warning alerts
    - matchers:
        - severity = warning
      receiver: sre-warning
      group_wait: 30s
      group_interval: 10m
      repeat_interval: 4h
    # Route by team
    - matchers:
        - team = backend
      receiver: backend-team
    - matchers:
        - team = frontend
      receiver: frontend-team
    - matchers:
        - team = data
      receiver: data-team
    # Route by environment
    - matchers:
        - env = dev
      receiver: dev-null      # no notifications for the dev environment
    - matchers:
        - env = staging
      receiver: staging-team
# Inhibition rules
inhibit_rules:
  # When a node is down, suppress all warning alerts for that node
  - source_matchers:
      - alertname = NodeDown
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['instance']
  # When a cluster is unavailable, suppress node-level alerts
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - alertname = NodeDown
    equal: ['cluster']
  # When the primary database is down, suppress replica alerts
  - source_matchers:
      - alertname = DatabaseMasterDown
    target_matchers:
      - alertname = DatabaseReplicaDown
    equal: ['cluster', 'database']
# Receivers
receivers:
  - name: default-team
    webhook_configs:
      - url: 'http://alert-router.internal/default'
        send_resolved: true
  - name: sre-critical
    webhook_configs:
      # Pager notification
      - url: 'http://alert-router.internal/pagerduty'
        send_resolved: true
      # IM notification
      - url: 'http://alert-router.internal/slack-critical'
        send_resolved: true
    # Email notification
    email_configs:
      - to: 'sre-oncall@example.com'
        send_resolved: true
  - name: sre-warning
    webhook_configs:
      - url: 'http://alert-router.internal/slack-warning'
        send_resolved: true
    email_configs:
      - to: 'sre-team@example.com'
        send_resolved: true
  - name: backend-team
    email_configs:
      - to: 'backend-oncall@example.com'
        send_resolved: true
  - name: frontend-team
    email_configs:
      - to: 'frontend-oncall@example.com'
        send_resolved: true
  - name: data-team
    email_configs:
      - to: 'data-team@example.com'
        send_resolved: true
  - name: staging-team
    webhook_configs:
      - url: 'http://alert-router.internal/slack-staging'
        send_resolved: true
  - name: dev-null
    # Empty receiver = no notifications are sent
Prometheus Alerting Rules
# Example Prometheus alerting rules
groups:
  - name: app-rules
    rules:
      # API error rate too high
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, service)
            / sum(rate(http_requests_total[5m])) by (job, service)
          > 0.05
        for: 10m
        labels:
          severity: critical
          team: backend
          env: prod
        annotations:
          summary: "API 5xx error rate too high"
          description: "Service {{ $labels.job }} has exceeded a 5% error rate over the last 5 minutes; current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbook/high-error-rate"
          dashboard: "https://grafana.internal/d/api-errors"
      # Pod restarting too often
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning
          team: platform
          env: prod
        annotations:
          summary: "Pod restarting frequently"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting {{ $value | printf \"%.2f\" }} times every 5 minutes"
      # Disk usage too high
      - alert: DiskUsageHigh
        expr: |
          (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) < 0.2
        for: 30m
        labels:
          severity: warning
          team: platform
          env: prod
        annotations:
          summary: "Disk usage above 80%"
          description: "Less than 20% free space left on {{ $labels.mountpoint }} of node {{ $labels.instance }}"
      # Certificate expiring soon
      - alert: CertificateExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
          team: platform
          env: prod
        annotations:
          summary: "SSL certificate expires within 7 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
      # Memory usage too high
      - alert: MemoryUsageHigh
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: critical
          team: platform
          env: prod
        annotations:
          summary: "Memory usage above 90%"
          description: "Memory usage on node {{ $labels.instance }} is {{ $value | humanizePercentage }}"
# Database alerting rules
groups:
  - name: database-rules
    rules:
      - alert: MySQLReplicationLag
        expr: |
          mysql_slave_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
          team: data
          env: prod
        annotations:
          summary: "MySQL replica lagging behind primary"
          description: "MySQL instance {{ $labels.instance }} replication lag is {{ $value }} seconds"
      - alert: PostgreSQLConnectionPoolExhausted
        expr: |
          sum by (instance) (pg_stat_activity_count)
            / on (instance) pg_settings_max_connections
          > 0.9
        for: 5m
        labels:
          severity: critical
          team: data
          env: prod
        annotations:
          summary: "PostgreSQL connections nearly exhausted"
          description: "Connection usage is at {{ $value | humanizePercentage }}"
# Kubernetes alerting rules
groups:
  - name: k8s-rules
    rules:
      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
          team: platform
          env: prod
        annotations:
          summary: "Kubernetes node NotReady"
          description: "Node {{ $labels.node }} has been NotReady for 5 minutes"
      - alert: KubeDeploymentReplicasMismatch
        expr: |
          kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 15m
        labels:
          severity: warning
          team: platform
          env: prod
        annotations:
          summary: "Deployment replica count mismatch"
          description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched its desired replica count for 15 minutes"
      - alert: KubePodOOMKilled
        expr: |
          rate(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
          env: prod
        annotations:
          summary: "Pod killed by OOM"
          description: "Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} was killed due to out-of-memory"
Validation and Operations Commands
# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml
# Validate Prometheus alerting rules
promtool check rules /etc/prometheus/rules/*.yml
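# Recent amtool releases can also print the routing tree and test which receiver
# a given label set would be routed to, which is handy before rolling out routing changes:
amtool config routes show --config.file=/etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical team=backend env=prod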
# Check Alertmanager status
curl -s http://127.0.0.1:9093/api/v2/status | python3 -m json.tool
# List currently active alerts
curl -s http://127.0.0.1:9093/api/v2/alerts | python3 -m json.tool
# List Prometheus alerting rules
curl -s http://127.0.0.1:9090/api/v1/rules | python3 -m json.tool
# Fire a test alert manually
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning",
        "team": "backend",
        "env": "prod"
      },
      "annotations": {
        "summary": "This is a test alert",
        "description": "Please ignore; used to verify the alerting pipeline"
      }
    }
  ]'
# Reload the Alertmanager configuration
kill -HUP $(pidof alertmanager)
# or
curl -X POST http://127.0.0.1:9093/-/reload
Silence Management
# Create a silence (maintenance window)
amtool silence add \
  alertname=HighErrorRate env=prod team=backend \
  --author "ops" \
  --comment "Temporary silence for the release window" \
  --duration 2h
# Create a silence with multiple matchers (matchers are positional arguments)
amtool silence add \
  service=payment-api env=prod \
  --author "ops" \
  --comment "Payment service maintenance window" \
  --duration 4h
# List silences (active silences are shown by default)
amtool silence query
# Include expired silences as well
amtool silence query --expired
# Expire all silences
amtool silence expire $(amtool silence query -q)
# Expire a specific silence
amtool silence expire <silence-id>
# Create a silence via the API
curl -X POST http://127.0.0.1:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "env", "value": "prod", "isRegex": false}
    ],
    "startsAt": "2026-04-13T02:00:00Z",
    "endsAt": "2026-04-13T06:00:00Z",
    "createdBy": "ops",
    "comment": "Early-morning maintenance window"
  }'
Notification Templates
{{/* /etc/alertmanager/templates/custom.tmpl */}}
{{ define "custom.message" }}
[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
Environment: {{ .CommonLabels.env }}
Service: {{ .CommonLabels.job }}
Severity: {{ .CommonLabels.severity }}
Team: {{ .CommonLabels.team }}
Summary: {{ .CommonAnnotations.summary }}
Details: {{ .CommonAnnotations.description }}
Runbook: {{ .CommonAnnotations.runbook }}
Dashboard: {{ .CommonAnnotations.dashboard }}
Started at: {{ (index .Alerts 0).StartsAt }}
{{ end }}
{{ define "custom.slack" }}
{
"text": "{{ .CommonLabels.alertname }}",
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}"
}
},
{
"type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Severity:*\n{{ .CommonLabels.severity }}" },
{ "type": "mrkdwn", "text": "*Environment:*\n{{ .CommonLabels.env }}" },
{ "type": "mrkdwn", "text": "*Service:*\n{{ .CommonLabels.job }}" },
{ "type": "mrkdwn", "text": "*Team:*\n{{ .CommonLabels.team }}" }
]
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "{{ .CommonAnnotations.description }}"
}
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": { "type": "plain_text", "text": "Runbook" },
"url": "{{ .CommonAnnotations.runbook }}"
},
{
"type": "button",
"text": { "type": "plain_text", "text": "Dashboard" },
"url": "{{ .CommonAnnotations.dashboard }}"
}
]
}
]
}
{{ end }}
{{ define "custom.email.subject" }}
[{{ .Status | toUpper }}] [{{ .CommonLabels.env }}] {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.summary }}
{{ end }}
{{ define "custom.email.html" }}
<html>
<body>
<h2>{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}</h2>
<table border="1" cellpadding="5">
<tr><td><b>Severity</b></td><td>{{ .CommonLabels.severity }}</td></tr>
<tr><td><b>Environment</b></td><td>{{ .CommonLabels.env }}</td></tr>
<tr><td><b>Service</b></td><td>{{ .CommonLabels.job }}</td></tr>
<tr><td><b>Team</b></td><td>{{ .CommonLabels.team }}</td></tr>
<tr><td><b>Description</b></td><td>{{ .CommonAnnotations.description }}</td></tr>
<tr><td><b>Runbook</b></td><td><a href="{{ .CommonAnnotations.runbook }}">Open</a></td></tr>
<tr><td><b>Dashboard</b></td><td><a href="{{ .CommonAnnotations.dashboard }}">Open</a></td></tr>
</table>
<hr>
<p><b>Alerts in this group:</b></p>
<ul>
{{ range .Alerts }}
<li>{{ .Labels.instance }}: {{ .Annotations.description }} (started: {{ .StartsAt }})</li>
{{ end }}
</ul>
</body>
</html>
{{ end }}
Multi-Team Hierarchical Routing and Escalation
# Multi-team hierarchical routing
route:
  receiver: platform-default
  group_by: ['alertname', 'team', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Production critical alerts
    - matchers:
        - env = prod
        - severity = critical
      receiver: prod-critical
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 30m
      routes:
        # Backend services
        - matchers:
            - team = backend
          receiver: backend-critical
        # Frontend services
        - matchers:
            - team = frontend
          receiver: frontend-critical
        # Data services
        - matchers:
            - team = data
          receiver: data-critical
    # Production warning alerts
    - matchers:
        - env = prod
        - severity = warning
      receiver: prod-warning
      repeat_interval: 4h
    # Production info alerts
    - matchers:
        - env = prod
        - severity = info
      receiver: prod-info
      repeat_interval: 24h
    # Staging alerts
    - matchers:
        - env = staging
      receiver: staging-team
      repeat_interval: 8h
    # No notifications for the dev environment
    - matchers:
        - env = dev
      receiver: dev-null
# Receiver configuration
receivers:
  - name: prod-critical
    # PagerDuty notification
    webhook_configs:
      - url: 'http://alert-router.internal/pagerduty-critical'
        send_resolved: true
    # Email notification
    email_configs:
      - to: 'sre-lead@example.com'
        send_resolved: true
        html: '{{ template "custom.email.html" . }}'
        subject: '{{ template "custom.email.subject" . }}'
  - name: prod-warning
    webhook_configs:
      - url: 'http://alert-router.internal/slack-warning'
        send_resolved: true
    email_configs:
      - to: 'sre-team@example.com'
        send_resolved: true
  - name: prod-info
    email_configs:
      - to: 'ops-tickets@example.com'
        send_resolved: true
  - name: backend-critical
    webhook_configs:
      - url: 'http://alert-router.internal/pagerduty-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-oncall@example.com'
        send_resolved: true
  - name: frontend-critical
    webhook_configs:
      - url: 'http://alert-router.internal/pagerduty-frontend'
        send_resolved: true
    email_configs:
      - to: 'frontend-oncall@example.com'
        send_resolved: true
  - name: data-critical
    webhook_configs:
      - url: 'http://alert-router.internal/pagerduty-data'
        send_resolved: true
    email_configs:
      - to: 'data-oncall@example.com'
        send_resolved: true
  - name: staging-team
    webhook_configs:
      - url: 'http://alert-router.internal/slack-staging'
        send_resolved: true
  - name: platform-default
    email_configs:
      - to: 'platform-team@example.com'
        send_resolved: true
  - name: dev-null
Alertmanager High Availability
# Alertmanager cluster configuration
# Multiple Alertmanager instances synchronize state over the gossip protocol
# Example launch flags:
# alertmanager \
#   --cluster.listen-address=0.0.0.0:9094 \
#   --cluster.peer=alertmanager-2:9094 \
#   --cluster.peer=alertmanager-3:9094 \
#   --storage.path=/alertmanager-data \
#   --config.file=/etc/alertmanager/alertmanager.yml
# Point Prometheus at every Alertmanager instance (do not load-balance in front of them)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.0.0.11:9093
            - 10.0.0.12:9093
            - 10.0.0.13:9093
# Or use DNS service discovery (e.g. a Kubernetes headless service)
alerting:
  alertmanagers:
    - dns_sd_configs:
        - names:
            - alertmanager.monitoring.svc.cluster.local
          type: A
          port: 9093
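To verify that the cluster actually formed, each instance's status API should list the other peers, and the membership gauge should show the expected instance count on every node (the address below follows the example targets above):
# Peers as seen by one instance (repeat against each node)
curl -s http://10.0.0.11:9093/api/v2/status | python3 -m json.tool | grep -A 20 '"cluster"'
# Gossip membership gauge exposed on the metrics endpoint
curl -s http://10.0.0.11:9093/metrics | grep alertmanager_cluster_members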
Custom Webhook Receivers
# DingTalk robot webhook
receivers:
  - name: dingtalk-team
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
# Feishu robot webhook
receivers:
  - name: feishu-team
    webhook_configs:
      - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/YOUR_TOKEN'
        send_resolved: true
# WeCom (WeChat Work) robot webhook
receivers:
  - name: wecom-team
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY'
        send_resolved: true
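One caveat with the receivers above: Alertmanager's webhook integration posts its own JSON payload, which the DingTalk/Feishu/WeCom bot APIs do not accept directly, so in practice the URL usually points at a small adapter service that reformats the message and forwards it to the bot. A sketch assuming an adapter such as prometheus-webhook-dingtalk deployed alongside the monitoring stack (the hostname, port 8060, and the webhook1 target name are that project's defaults and are illustrative here):
receivers:
  - name: dingtalk-team
    webhook_configs:
      # The adapter receives Alertmanager's payload, renders it into DingTalk's message
      # format, and forwards it to the robot access_token configured on the adapter side
      - url: 'http://webhook-dingtalk.monitoring.svc:8060/dingtalk/webhook1/send'
        send_resolved: true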
Alert Governance and Review
# Alert governance with amtool
# List all alerts
amtool alert query
# Filter alerts by severity (matchers are positional arguments)
amtool alert query severity=critical
# Filter alerts by team
amtool alert query team=backend
# Show the Alertmanager configuration
amtool config show
# Alert rule file layout
# /etc/prometheus/rules/
# ├── app/
# │   ├── api.yml            # application-level alerts
# │   ├── database.yml       # database alerts
# │   └── cache.yml          # cache alerts
# ├── k8s/
# │   ├── nodes.yml          # node alerts
# │   ├── pods.yml           # Pod alerts
# │   └── deployments.yml    # Deployment alerts
# ├── infra/
# │   ├── network.yml        # network alerts
# │   ├── disk.yml           # disk alerts
# │   └── memory.yml         # memory alerts
# └── slo/
#     ├── latency.yml        # SLO latency alerts
#     └── availability.yml   # SLO availability alerts
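Alert reviews are easier with numbers. Prometheus's built-in ALERTS series can quantify which rules generate the most noise; two illustrative queries (the 7d window assumes your retention covers it):
# Currently firing alerts by name and severity: candidates for merging or demotion
count by (alertname, severity) (ALERTS{alertstate="firing"})
# Rough ranking of how long each alert spent firing over the last 7 days
sort_desc(sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))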
Summary
The value of Alertmanager lies in turning monitoring signals into actionable on-call work. When rolling it out, first establish a unified label scheme (env, team, service, severity), then design routing, grouping, and inhibition, and finally add templates, silences, and escalation; otherwise the alerting system quickly degenerates into a noise generator.
Key Takeaways
- The quality of alert label design directly determines whether Alertmanager routing stays maintainable over time
- group_wait, group_interval, and repeat_interval together set the notification rhythm (see the sketch after this list)
- Inhibition is for suppressing secondary alerts when the root cause is already known; it is not the same as silencing
- Production alerts must send resolved notifications so the on-call can confirm that the incident is over
- continue: true lets an alert keep matching subsequent routes (useful for multi-channel notification)
- The for clause in an alerting rule sets a duration threshold so transient jitter does not trigger alerts
- In a high-availability deployment, Alertmanager instances synchronize alert state via the gossip protocol
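A short sketch of how the three intervals interact, using the default-route values from the configuration earlier in this article:
route:
  group_by: ['alertname', 'job', 'severity']
  group_wait: 30s      # a brand-new group waits 30s so alerts arriving together go out as one notification
  group_interval: 5m   # updates to an already-notified group (new or resolved alerts) are sent no sooner than 5m after the previous notification
  repeat_interval: 4h  # a group that keeps firing with no changes is re-notified every 4h as a reminder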
Project Implementation Perspective
- Use team/env/service/severity as a unified set of routing labels
- Send Critical alerts to a pager, Warning alerts to IM channels and email
- Use silences to control noise during release windows so false alarms do not burden the on-call
- Use shared templates to include runbook links, Grafana links, and trace information in every notification
- Establish an alert review process and regularly prune noise and stale rules
- Keep alerting configuration in Git, with changes going through code review (a CI sketch follows this list)
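To back the Git + code review point with automation, a pre-merge check can reject configurations that do not parse; a minimal sketch (repository layout and file paths are illustrative):
#!/usr/bin/env bash
# ci-validate-alerts.sh: fail the pipeline when alerting configs are invalid
set -euo pipefail
amtool check-config alertmanager/alertmanager.yml
promtool check rules prometheus/rules/*.yml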
Common Pitfalls
- Pushing every alert into one chat group, with no ownership boundaries
- No inhibition rules, so a single node failure sets off dozens of downstream alerts
- Alert descriptions too vague to tell the on-call engineer where to start looking
- Configuring firing notifications only, without resolved notifications, so nobody knows whether the incident has recovered
- repeat_interval set too short, continuously bombarding the on-call
- Alerting rules without a for clause, letting transient fluctuations trigger floods of alerts
- No periodic cleanup of stale alerting rules, so the rule count keeps growing
Next Steps
- Integrate with PagerDuty / OpsGenie / an in-house on-call platform
- Design tiered alerting strategies around service levels (SLA/SLO)
- Enrich notification templates with runbook, dashboard, and log-search links
- Establish an alert review process and regularly prune noise and stale rules
- Study multi-cluster alert federation in distributed Prometheus setups such as Thanos or Cortex
- Investigate integrating Alertmanager configuration with GitOps
Applicable Scenarios
- Enterprise environments where multiple teams share one monitoring platform
- Unified alert governance across Kubernetes, microservices, databases, and middleware
- On-call setups that need to distinguish Critical / Warning / Info
- Mature operations practices that need silences, inhibition, escalation, and recovery notifications
- Environments that must fan out to multiple notification channels (IM, email, phone, tickets)
Implementation Recommendations
- Standardize Prometheus rule labels first, then design Alertmanager routing
- Give every key receiver an explicit owner and on-call responsibility
- Important alerts must carry a summary, the blast radius, and links for starting the investigation
- Put both alerting rules and the Alertmanager configuration through a Git review workflow
- Define metrics for "alert coverage" and "alert noise rate"
- Hold regular alert reviews (weekly or monthly) and remove ineffective alerts
Troubleshooting Checklist
- Confirm that Prometheus is actually delivering alerts to Alertmanager (see the metric checks after this list)
- Confirm that the route matches the intended receiver and is not affected by continue / child routes
- Check whether a silence or inhibition rule is swallowing alerts that should have been sent
- Check that template variables and receiver settings render and send correctly
- Check the gossip state between Alertmanager cluster nodes
- Check that the Alertmanager addresses in Prometheus's alerting configuration are correct
- Check that resolve_timeout is sensible relative to the for clauses in the alerting rules
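For the first two checklist items, both components expose self-monitoring metrics that show whether alerts are flowing (default local ports assumed):
# Prometheus side: alerts sent to, or dropped before reaching, Alertmanager
curl -s 'http://127.0.0.1:9090/api/v1/query?query=prometheus_notifications_total' | python3 -m json.tool
curl -s 'http://127.0.0.1:9090/api/v1/query?query=prometheus_notifications_dropped_total' | python3 -m json.tool
# Alertmanager side: alerts received, and notifications that failed per integration
curl -s http://127.0.0.1:9093/metrics | grep -E 'alertmanager_alerts_received_total|alertmanager_notifications_failed_total'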
Review Questions
- Right now, is your bigger pain missed alerts or alert noise?
- Is your alert label scheme rich enough to support per-team routing and escalation?
- When a critical alert arrives, does the on-call engineer know what to do next?
- Which warning alerts should be inhibited, merged, or demoted?
- Are the number of alerting rules and their rate of change under control?
- Is there a recurring process for removing stale alerts and ineffective rules?
