Grafana Dashboards and Alerting
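Introduction
Grafana is an open-source data visualization and monitoring platform. It connects to many data sources and offers a range of panel types (line charts, gauges, heatmaps, and more) for building monitoring dashboards. Its alerting feature sends notifications when a metric behaves abnormally. Paired with time-series databases such as Prometheus and InfluxDB, Grafana is a core component of an observability stack.
Features
Data source configuration
Prometheus data source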
# Grafana data source provisioning file
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus-uid   # matches the datasourceUid used by the alert rules
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    database: "app-logs-*"
    jsonData:
      timeField: "@timestamp"
      esVersion: "8.12.0"
      logMessageField: message
      logLevelField: level

Adding a data source via the API
# Add a Prometheus data source
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",
    "isDefault": true
  }'
# List existing data sources
curl http://admin:admin@localhost:3000/api/datasources
# Test the data source connection (the URL is quoted so the shell leaves "?" alone)
curl "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"

Panel configuration
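The threshold steps in a panel's fieldConfig assign a color to each value range. The Python sketch below mirrors that lookup for the steps used by the CPU usage panel in this section and builds a panel dictionary of the same shape. `threshold_color` and `cpu_panel` are hypothetical helper names, not part of any Grafana API, and the sketch assumes the steps are listed in ascending order with the base step first.

```python
from typing import Optional

# Threshold steps copied from the CPU usage panel. The first step has
# value None (JSON null) and serves as the base color.
STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 90},
]


def threshold_color(value: float, steps: list = STEPS) -> str:
    """Return the color of the highest step whose bound is <= value.

    Assumes steps are sorted by ascending value, base step first.
    """
    color = steps[0]["color"]
    for step in steps:
        bound: Optional[float] = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


def cpu_panel(title: str, expr: str) -> dict:
    """Build a time series panel dict shaped like the JSON in this section."""
    return {
        "title": title,
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"expr": expr, "legendFormat": "{{ instance }}"}],
        "fieldConfig": {
            "defaults": {
                "unit": "percent",
                "min": 0,
                "max": 100,
                "thresholds": {"steps": STEPS},
            }
        },
    }


if __name__ == "__main__":
    for sample in (35.0, 70.0, 95.5):
        print(sample, threshold_color(sample))
```

Generating panels from a function like this can keep thresholds consistent across dashboards, since every panel reads the same step list.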
CPU usage panel
Grafana dashboard JSON, CPU panel example:
{
  "dashboard": {
    "title": "Server Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 70 },
                { "color": "red", "value": 90 }
              ]
            }
          }
        }
      }
    ]
  }
}

Common PromQL queries
# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage (%): the used share of the root filesystem, not the available share
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network traffic (inbound / outbound), bytes per second
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# HTTP requests per second (QPS) by status
sum by(status) (rate(http_requests_total[5m]))
# P99 request latency
histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
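
The P99 latency query relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative `le` buckets. The Python sketch below reproduces that idea for classic histograms. It is a simplified illustration, not Prometheus's implementation: it assumes sorted buckets that end with the +Inf bucket, at least one finite bucket, and 0 < q < 1. The bucket values are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return buckets[i - 1][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan  # not reached: the +Inf bucket always satisfies count >= rank


# Hypothetical cumulative request counts per latency bucket (seconds).
SAMPLE = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.9, 0.99):
        print(q, histogram_quantile(q, SAMPLE))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.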
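# .NET application metrics (exact metric names depend on the metrics library in use)
# GC heap memory
dotnet_total_memory_bytes
# Thread pool worker threads
dotnet_threadpool_active_threads
# P95 request execution time
histogram_quantile(0.95, sum by(le) (rate(aspnetcore_request_duration_seconds_bucket[5m])))

Deploying the monitoring stack with Docker Compose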
# docker-compose.yml: Prometheus + Grafana
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:10.3.0
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring
volumes:
  prometheus-data:
  grafana-data:
networks:
  monitoring:
    driver: bridge

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['myapp:8080']
# .NET applications need the prometheus-net package installed to expose /metrics

Variables and templates
Dashboard variable configuration
Grafana variable definitions:
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "current": { "text": "Prometheus", "value": "Prometheus" }
      },
      {
        "name": "instance",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(node_cpu_seconds_total, instance)",
        "refresh": 2,
        "includeAll": true,
        "multi": true,
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,10m,30m,1h",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s"
      }
    ]
  }
}

Using variables in panels
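# Use the instance variable in a query
100 - (avg by(instance) (rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[$interval])) * 100)
# Use a variable in the panel title
# Title: CPU Usage - $instance
# Filter logs with variables (Loki data source; assumes app, env, and search variables are defined)
{app="$app", env="$env"} |= "$search"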
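
The `=~` matcher in the query above matters because a multi-value variable such as `$instance` is expanded into a regular expression alternation before the query is sent. The Python sketch below illustrates that expansion under simplifying assumptions: `interpolate`, `format_value`, and `escape_regex` are hypothetical names, only the `$name` syntax is handled, and Grafana's real escaping rules are more detailed (for example, additional backslash escaping inside PromQL string literals).

```python
import re

# Characters treated as regex metacharacters in this sketch.
_SPECIAL = re.compile(r"([\\.^$*+?()\[\]{}|])")


def escape_regex(value: str) -> str:
    """Backslash-escape regex metacharacters in a single value."""
    return _SPECIAL.sub(r"\\\1", value)


def format_value(value) -> str:
    """Format a variable value: plain strings pass through unchanged,
    lists become an escaped alternation such as (a|b)."""
    if isinstance(value, str):
        return value
    escaped = [escape_regex(v) for v in value]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def interpolate(query: str, variables: dict) -> str:
    """Replace $name placeholders; unknown names are left untouched."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)
        return format_value(variables[name])

    return re.sub(r"\$(\w+)", replace, query)


if __name__ == "__main__":
    query = 'rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[$interval])'
    selected = {"instance": ["web-1:9100", "web-2:9100"], "interval": "5m"}
    print(interpolate(query, selected))
```

With an equality matcher (`=`) the expanded alternation would be compared as a literal string and match nothing, which is why multi-value variables are paired with `=~`.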
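# Common variable types
| Variable type | Purpose | Example |
|---|---|---|
| Query | Values queried from a data source | `label_values(metric, label)` |
| Interval | Time interval selection | 1m, 5m, 10m, 1h |
| Data source | Switch between data sources | Prometheus, InfluxDB |
| Custom | User-defined options | dev, staging, production |
| Text box | Free-text input | Log search keyword |

Alert rules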
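
A Grafana-managed alert rule such as the cpu-alert rule in this section is evaluated as a small pipeline: a query (refId A), a reduce expression that keeps one number per series (B), and a threshold check (C), with the `for` duration delaying the switch to the firing state. The Python sketch below walks through that flow for a single series. It is a simplified model under stated assumptions: `RuleState` and `evaluate` are hypothetical names, and the NoData and Error states as well as other details of Grafana's state handling are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleState:
    state: str = "Normal"                  # Normal, Pending, or Alerting
    pending_since: Optional[float] = None  # time the condition first held


def evaluate(
    rule: RuleState,
    series: List[float],
    now: float,
    threshold: float = 90.0,
    for_seconds: float = 300.0,
) -> str:
    """Run one evaluation: reduce (last), threshold (gt), then the 'for' wait."""
    value = series[-1]           # B: reducer "last"
    breached = value > threshold # C: evaluator "gt"
    if not breached:
        rule.state = "Normal"
        rule.pending_since = None
        return rule.state
    if rule.pending_since is None:
        rule.pending_since = now
    # The rule fires only after the condition has held for the full duration.
    held_for = now - rule.pending_since
    rule.state = "Alerting" if held_for >= for_seconds else "Pending"
    return rule.state


if __name__ == "__main__":
    rule = RuleState()
    # One evaluation per minute (interval: 1m) with CPU usage stuck at 95%.
    for t in range(0, 361, 60):
        print(t, evaluate(rule, [88.0, 95.0], now=float(t)))
```

With `interval: 1m` and `for: 5m`, a sustained breach stays Pending for the first five evaluations and fires on the sixth, which filters out short spikes.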
Grafana alert rule configuration
# /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: server-alerts
    folder: Server Monitoring
    interval: 1m
    rules:
      - uid: cpu-alert
        title: High CPU usage
        condition: C
        data:
          # A: query the data source
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              instant: true
          # B: reduce each series to its last value
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          # C: alert condition, fires when B > 90
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    params:
                      - 90
                    type: gt
                  operator:
                    type: and
        noDataState: OK
        execErrState: Alerting
        for: 5m
        annotations:
          summary: "CPU usage above 90%"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $values.B.Value }}%"
        labels:
          severity: critical
          team: ops
      - uid: memory-alert
        title: High memory usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
              instant: true
          # B and C were missing in the original rule although condition C refers to them
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    params:
                      - 85
                    type: gt
        noDataState: OK
        execErrState: Alerting
        for: 5m
        annotations:
          summary: "Memory usage above 85%"
        labels:
          severity: warning

Configuring the notification policy via the API
# Set the notification policy tree (the endpoint replaces the whole tree, so it uses PUT)
# Both receivers must already exist as contact points
curl -X PUT http://admin:admin@localhost:3000/api/v1/provisioning/policies \
  -H "Content-Type: application/json" \
  -d '{
    "receiver": "ops-team",
    "group_by": ["alertname", "instance"],
    "group_wait": "30s",
    "group_interval": "5m",
    "repeat_interval": "4h",
    "routes": [
      {
        "object_matchers": [["severity", "=", "critical"]],
        "receiver": "ops-team-urgent",
        "group_wait": "10s",
        "repeat_interval": "1h"
      }
    ]
  }'

Notification channels
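
A notification policy decides which contact point receives an alert by comparing the alert's labels with the routes of the policy tree. The Python sketch below shows that matching for a tree with a default receiver and one route for critical alerts. It is a simplified model: `route` and `matcher_holds` are hypothetical names, routes are assumed to be written as `[label, operator, value]` matcher triples, the first matching child route wins, and grouping and timing options are ignored.

```python
import re

# Policy tree with a default receiver and one route for critical alerts.
POLICY = {
    "receiver": "ops-team",
    "routes": [
        {
            "object_matchers": [["severity", "=", "critical"]],
            "receiver": "ops-team-urgent",
        }
    ],
}


def matcher_holds(labels: dict, matcher: list) -> bool:
    """Check one [label, operator, value] matcher against the alert labels."""
    name, op, expected = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == expected
    if op == "!=":
        return actual != expected
    if op == "=~":
        return re.fullmatch(expected, actual) is not None
    if op == "!~":
        return re.fullmatch(expected, actual) is None
    raise ValueError(f"unsupported operator: {op}")


def route(labels: dict, node: dict, inherited: str = "") -> str:
    """Return the receiver for an alert, descending into the first child
    route whose matchers all hold; otherwise use the current receiver."""
    receiver = node.get("receiver", inherited)
    for child in node.get("routes", []):
        matchers = child.get("object_matchers", [])
        if all(matcher_holds(labels, m) for m in matchers):
            return route(labels, child, receiver)
    return receiver


if __name__ == "__main__":
    print(route({"alertname": "cpu", "severity": "critical"}, POLICY))
    print(route({"alertname": "memory", "severity": "warning"}, POLICY))
```

Alerts that match no route fall back to the root receiver, so the root policy works as a catch-all.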
Configuring notification channels
# /etc/grafana/provisioning/alerting/contactpoints.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: ops-team
    receivers:
      - uid: email-receiver
        type: email
        settings:
          addresses: ops-team@example.com
          singleEmail: false
      - uid: webhook-receiver
        type: webhook
        settings:
          url: http://webhook-server:8080/alert
          httpMethod: POST
      - uid: dingtalk-receiver
        type: dingding
        settings:
          url: https://oapi.dingtalk.com/robot/send?access_token=xxx
          msgType: link   # the DingDing integration accepts link or actionCard
      - uid: slack-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx
          recipient: "#alerts"
          username: Grafana
          icon_emoji: ":grafana:"

Grafana INI configuration
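
Settings in grafana.ini can also be overridden with environment variables named `GF_<SECTION>_<KEY>`, which is how the Docker Compose file in this document sets `GF_SECURITY_ADMIN_PASSWORD`. The Python sketch below derives the variable name from a section and key, following the documented rule of upper-casing both parts and replacing periods and dashes with underscores. `env_override_name` is a hypothetical helper, not a Grafana function.

```python
def env_override_name(section: str, key: str) -> str:
    """Return the environment variable that overrides [section] key."""
    def normalize(part: str) -> str:
        # Upper-case, then replace characters that are invalid in env names.
        return part.upper().replace(".", "_").replace("-", "_")

    return f"GF_{normalize(section)}_{normalize(key)}"


if __name__ == "__main__":
    for section, key in [
        ("security", "admin_password"),
        ("auth.anonymous", "enabled"),
        ("users", "allow_sign_up"),
    ]:
        print(f"[{section}] {key} -> {env_override_name(section, key)}")
```

Environment overrides are convenient in containers because secrets such as the admin password stay out of the configuration file.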
# /etc/grafana/grafana.ini
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
[security]
admin_user = admin
admin_password = ${GRAFANA_PASSWORD}
allow_embedding = true
[auth.anonymous]
enabled = false
[auth.basic]
enabled = true
[smtp]
enabled = true
host = smtp.example.com:587
user = grafana@example.com
password = ${SMTP_PASSWORD}
from_address = grafana@example.com
from_name = Grafana Alert
# The provisioned alert rules are Grafana-managed (unified) alerts, so they are
# configured under [unified_alerting]; the legacy [alerting] section must not be
# enabled at the same time.
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
max_attempts = 3
[dashboards]
default_home_dashboard_path = /var/lib/grafana/dashboards/home.json

Advantages
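Disadvantages
Summary
Grafana is the core visualization platform of an observability stack. By connecting to data sources such as Prometheus, using the variable system to switch panels dynamically, and combining alert rules with multiple notification channels, it closes the monitoring loop from metric collection through visualization to alerting. Manage data sources, dashboards, and alert rules as code through the provisioning mechanism, use Docker Compose to stand up the full monitoring stack quickly, and draw on community dashboard templates to speed up the build-out.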
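Key points
- The core of DevOps is making delivery faster, more stable, and more auditable.
- Automation is more than turning commands into scripts; failure handling, rollback, permissions, and observability have to be designed in together.
- A production pipeline must define clear boundaries for artifacts, environments, credentials, configuration, and ownership.
Project implementation perspective
- Split the pipeline into build, test, artifact, deploy, verify, and rollback stages.
- Add logs, metrics, notifications, and manual fallback points to the critical steps.
- Regularly rehearse scale-out, rollback, fault injection, and disaster-recovery failover.
Common pitfalls
- Focusing only on successful deployments while ignoring failure recovery and audit trails.
- Hiding environment differences in ad hoc scripts or manual steps.
- Raising release frequency without standardized artifacts and configuration management.
Next steps
- Continue with GitOps, observability, platform engineering, and cost governance.
- Connect this topic with application architecture, security, permissions, and backup and recovery.
- Build team-level platform capabilities instead of reinventing the wheel in every project.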
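When to use
- When you are ready to apply "Grafana Dashboards and Alerting" to a real project, first validate the critical path in an isolated module or a minimal example.
- It suits automated delivery, infrastructure governance, monitoring and alerting, and production release systems.
- As the team grows, releases become more frequent, or environments multiply, this topic has a significant effect on delivery efficiency.
Implementation recommendations
- Make every automated process idempotent, auditable, and reversible wherever possible.
- Manage artifacts, variables, credentials, and execution permissions in separate layers.
- Regularly rehearse scale-out, rollback, key rotation, and disaster recovery.
Troubleshooting checklist
- First determine whether the failure occurred at the code, build, artifact, environment, or permission layer.
- Check that pipeline variables, credentials, image tags, and target environment configuration are consistent.
- If the problem is intermittent, look at concurrent releases, resource contention, and jitter in external dependencies.
Review questions
- If you brought "Grafana Dashboards and Alerting" into your current project, which inputs, outputs, and failure paths would you verify first?
- At what scale and under which boundary conditions is "Grafana Dashboards and Alerting" most likely to expose problems, and which metrics or logs would you use to confirm them?
- Compared with the default implementation or alternative approaches, what are the biggest benefits and costs of adopting "Grafana Dashboards and Alerting"?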
