Grafana Dashboards and Alerting

SunnyFan

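Introduction

Grafana is an open-source data visualization and monitoring platform. It connects to many data sources and offers a wide range of panel types (line charts, gauges, heatmaps, and more) for building readable monitoring dashboards. Its alerting feature sends notifications when metrics become abnormal, and together with time-series databases such as Prometheus and InfluxDB it forms a core component of an observability stack.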

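Features

1. Multiple data sources: Prometheus, InfluxDB, MySQL, Elasticsearch, and others
2. Rich visualization panels: line charts, gauges, heatmaps, tables, and more
3. Flexible variable system: drop-down selectors for switching between servers, environments, and other dimensions
4. Unified alert management: alert rules for multiple data sources are managed in one place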

Data Source Configuration

Prometheus data source

# Grafana data source provisioning file
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
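  - name: Prometheus
    type: prometheus
    # Fixed UID so that alert rules can reference this data source via datasourceUid
    uid: prometheus-uid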
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000

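  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    jsonData:
      # Index pattern; recent Grafana versions read it from jsonData.index
      # instead of the deprecated top-level database field
      index: "app-logs-*"
      timeField: "@timestamp"
      esVersion: "8.12.0"
      logMessageField: message
      logLevelField: level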

Adding a data source via the API

# Add a Prometheus data source
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",
    "isDefault": true
  }'

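# List existing data sources
curl http://admin:admin@localhost:3000/api/datasources

# Test the connection by running a query through the data source proxy (data source ID 1)
# The URL is quoted so that the shell does not interpret the "?"
curl 'http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up'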
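
The same request can also be assembled from a script. The following is a minimal Python sketch that uses only the standard library. The function name `build_datasource_request` is hypothetical, and the base URL and credentials are the ones from the curl examples above. The sketch only builds the request; it does not send it.

```python
import base64
import json
import urllib.request


def build_datasource_request(base_url, user, password, name, ds_type, ds_url):
    """Build (but do not send) a POST /api/datasources request."""
    payload = {
        "name": name,
        "type": ds_type,
        "url": ds_url,
        "access": "proxy",
        "isDefault": True,
    }
    # HTTP basic auth: base64 of "user:password"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request(
    "http://localhost:3000", "admin", "admin",
    "Prometheus", "prometheus", "http://prometheus:9090",
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running Grafana instance.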

Panel Configuration

CPU usage panel

// Grafana dashboard JSON: CPU panel example
{
  "dashboard": {
    "title": "Server Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 70 },
                { "color": "red", "value": 90 }
              ]
            }
          }
        }
      }
    ]
  }
}
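
The threshold steps in the panel above work as lower bounds: a value takes the color of the highest step it has reached, and the step with a `null` value is the base color. The lookup can be sketched in Python as follows. The helper name `threshold_color` is hypothetical, and the sketch assumes the steps are listed in ascending order, as they are in the JSON.

```python
def threshold_color(steps, value):
    """Return the color of the last step whose lower bound is <= value.

    Assumes steps are sorted ascending; a bound of None is the base step.
    """
    color = None
    for step in steps:
        bound = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


# Same steps as in the CPU panel: green base, yellow from 70, red from 90
STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 90},
]

colors = [threshold_color(STEPS, v) for v in (42, 70, 89.9, 95)]
```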

Common PromQL Queries

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage (%): 100 minus the percentage of space still available
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Network traffic (inbound/outbound), in bytes per second
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# HTTP requests per second, grouped by status
sum by(status) (rate(http_requests_total[5m]))

# Request latency P99
histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

# .NET application metrics
# GC heap memory
dotnet_total_memory_bytes

# Thread pool worker threads
dotnet_threadpool_active_threads

# Request duration P95
histogram_quantile(0.95, sum by(le) (rate(aspnetcore_request_duration_seconds_bucket[5m])))
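
`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below reproduces that arithmetic on made-up bucket data to show where a P99 or P95 value comes from. It is a simplified model, not the Prometheus implementation: it assumes 0 < q <= 1, at least two buckets, a first bucket that starts at 0, and a final `+Inf` bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            # Linear interpolation inside the bucket that holds the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets (seconds): 100 requests in total
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the latencies of interest.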

Deploying the Monitoring Stack with Docker Compose

# docker-compose.yml: Prometheus + Grafana
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.3.0
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'

  - job_name: 'myapp'
    metrics_path: /metrics
    static_configs:
      - targets: ['myapp:8080']
    # .NET applications need the prometheus-net package installed

Variables and Templates

Dashboard variable configuration

// Grafana variable definitions
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "current": { "text": "Prometheus", "value": "Prometheus" }
      },
      {
        "name": "instance",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(node_cpu_seconds_total, instance)",
        "refresh": 2,
        "includeAll": true,
        "multi": true,
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,10m,30m,1h",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s"
      }
    ]
  }
}

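Using variables in panels

# Use the instance variable in a query
100 - (avg by(instance) (rate(node_cpu_seconds_total{instance=~"$instance", mode="idle"}[$interval])) * 100)

# Use a variable in the panel title
# Title: CPU Usage - $instance

# Filter logs with variables (Loki data source); assumes app, env, and search variables are defined on the dashboard
{app="$app", env="$env"} |= "$search"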
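
The reason the query uses `=~` rather than `=` is that a multi-value variable is expanded into a regular-expression alternation. The Python sketch below is a simplified model of that substitution, not Grafana's actual implementation: the helper names are hypothetical, escaping is delegated to `re.escape`, and the "All" option is not handled.

```python
import re


def format_variable(values):
    """Format selected values: one value as-is, several as (a|b|c)."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def interpolate(query, variables):
    """Replace $name placeholders with the formatted variable values."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown variables untouched
        return format_variable(variables[name])

    return re.sub(r"\$(\w+)", replace, query)


selected = {"instance": ["node1:9100", "node2:9100"], "interval": ["5m"]}
expanded = interpolate(
    'rate(node_cpu_seconds_total{instance=~"$instance"}[$interval])', selected
)
```

With two instances selected, the label matcher becomes `instance=~"(node1:9100|node2:9100)"`, which only works with the regex operator.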

Common variable types

| Variable type | Purpose | Example |
|---|---|---|
| Query | Query values from a data source | `label_values(metric, label)` |
| Interval | Select a time interval | 1m, 5m, 10m, 1h |
| Datasource | Switch data sources | Prometheus, InfluxDB |
| Custom | User-defined options | dev, staging, production |
| Text box | Free-text input | Log search keyword |

Alert Rules

Grafana alert rule configuration

# /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1

groups:
  - orgId: 1
    name: server-alerts
    folder: Server Monitoring
    interval: 1m
    rules:
      - uid: cpu-alert
        title: High CPU usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              instant: true
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    params:
                      - 90
                    type: gt
                  operator:
                    type: and
        noDataState: OK
        execErrState: Alerting
        for: 5m
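        annotations:
          summary: "CPU usage above 90%"
          # $values.B.Value is the number produced by the reduce step B;
          # a bare $value expands to a string describing all expressions
          description: "Instance {{ $labels.instance }} CPU usage is {{ $values.B.Value }}%"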
        labels:
          severity: critical
          team: ops

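      - uid: memory-alert
        title: High memory usage
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus-uid
            model:
              expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
              instant: true
          # B reduces the query result to a single value per instance
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          # C is the alert condition referenced above: fire when B is above 85
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    params:
                      - 85
                    type: gt
        noDataState: OK
        execErrState: Alerting
        for: 5m
        annotations:
          summary: "Memory usage above 85%"
        labels:
          severity: warning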
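
Each rule is a small pipeline: a query (A), a reduce step that keeps the last value (B), and a threshold that serves as the condition (C). The `for: 5m` setting then requires the condition to hold continuously before the alert fires. The Python sketch below simulates that behavior for one instance. It is an illustration under stated assumptions, not Grafana's scheduler: the function names are hypothetical, and the one-minute interval and five-minute `for` duration are taken from the rule group above.

```python
def reduce_last(series):
    """Step B: reduce a series to its most recent value."""
    return series[-1]


def threshold_gt(value, limit):
    """Step C: the condition is true when the value is above the limit."""
    return value > limit


def evaluate(series_per_tick, limit=90.0, for_seconds=300, interval_seconds=60):
    """Return the alert state after each evaluation tick."""
    states = []
    pending_since = None
    for tick, series in enumerate(series_per_tick):
        now = tick * interval_seconds
        if not threshold_gt(reduce_last(series), limit):
            pending_since = None  # condition cleared: reset the timer
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now
        held = now - pending_since
        states.append("Alerting" if held >= for_seconds else "Pending")
    return states


# CPU stays at 95% for seven consecutive one-minute evaluations
steady = evaluate([[80.0, 95.0]] * 7)
# A single dip below the threshold restarts the pending period
dipped = evaluate([[95.0]] * 3 + [[50.0]] + [[95.0]] * 3)
```

A short spike therefore never leaves the Pending state, which is the purpose of the `for` setting.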

Configuring the notification policy via the API

# Set the notification policy tree (the provisioning API replaces the whole tree with PUT)
# The ops-team and ops-team-urgent contact points must already exist
curl -X PUT http://admin:admin@localhost:3000/api/v1/provisioning/policies \
  -H "Content-Type: application/json" \
  -d '{
    "receiver": "ops-team",
    "group_by": ["alertname", "instance"],
    "group_wait": "30s",
    "group_interval": "5m",
    "repeat_interval": "4h",
    "routes": [
      {
        "object_matchers": [["severity", "=", "critical"]],
        "receiver": "ops-team-urgent",
        "group_wait": "10s",
        "repeat_interval": "1h"
      }
    ]
  }'
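
A notification policy is a tree: an alert is checked against the child routes first and falls back to the root receiver when none of them match. The Python sketch below models that lookup for the policy above. It is a simplified illustration, not the Alertmanager routing algorithm: the function names are hypothetical, only the `=` and `!=` operators are supported, and the first matching route wins.

```python
def matches(labels, matchers):
    """Return True when every (label, operator, value) matcher is satisfied."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True


def pick_receiver(policy, labels):
    """Walk the policy tree and return the receiver for an alert's labels."""
    for route in policy.get("routes", []):
        if matches(labels, route.get("matchers", [])):
            child = dict(route)
            # A route without its own receiver inherits the parent's
            child.setdefault("receiver", policy["receiver"])
            return pick_receiver(child, labels)
    return policy["receiver"]


POLICY = {
    "receiver": "ops-team",
    "routes": [
        {"matchers": [("severity", "=", "critical")], "receiver": "ops-team-urgent"},
    ],
}

critical = pick_receiver(POLICY, {"alertname": "cpu", "severity": "critical"})
warning = pick_receiver(POLICY, {"alertname": "memory", "severity": "warning"})
```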

Notification Channels

Configuring contact points

# /etc/grafana/provisioning/alerting/contactpoints.yml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: ops-team
    receivers:
      - uid: email-receiver
        type: email
        settings:
          addresses: ops-team@example.com
          singleEmail: false

      - uid: webhook-receiver
        type: webhook
        settings:
          url: http://webhook-server:8080/alert
          httpMethod: POST

      - uid: dingtalk-receiver
        type: dingding
        settings:
          url: https://oapi.dingtalk.com/robot/send?access_token=xxx
          # Grafana's DingDing integration supports the link and actionCard message types
          msgType: link

      - uid: slack-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/xxx
          recipient: "#alerts"
          username: Grafana
          icon_emoji: ":grafana:"
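
On the receiving side, the webhook contact point needs a service that accepts the POST request. The Python sketch below shows one possible shape for such a receiver using only the standard library. The payload fields it reads (`alerts`, `status`, `labels`, `annotations`) follow the Alertmanager-style JSON that Grafana's webhook sends, but the exact shape should be treated as an assumption and checked against the Grafana version in use. The names `summarize_alerts`, `AlertHandler`, and `SAMPLE` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def summarize_alerts(payload):
    """Turn a webhook payload into one readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{status}] {name} on {instance}: {summary}".format(
            status=alert.get("status", "unknown"),
            name=labels.get("alertname", "unnamed"),
            instance=labels.get("instance", "n/a"),
            summary=annotations.get("summary", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Minimal handler for POST requests sent to /alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize_alerts(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


# Hand-written sample payload for illustration only
SAMPLE = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "High CPU usage", "instance": "node1:9100"},
            "annotations": {"summary": "CPU usage above 90%"},
        }
    ],
}

summary_lines = summarize_alerts(SAMPLE)
```

To receive real notifications, the handler could be served with `http.server.HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()`.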

Grafana INI Configuration

# /etc/grafana/grafana.ini
[server]
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com

[security]
admin_user = admin
admin_password = ${GRAFANA_PASSWORD}
allow_embedding = true

[auth.anonymous]
enabled = false

[auth.basic]
enabled = true

[smtp]
enabled = true
host = smtp.example.com:587
user = grafana@example.com
password = ${SMTP_PASSWORD}
from_address = grafana@example.com
from_name = Grafana Alert

# Unified alerting is the system that the provisioned rules and contact points use;
# the legacy [alerting] section is deprecated
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
max_attempts = 3

[dashboards]
default_home_dashboard_path = /var/lib/grafana/dashboards/home.json
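
Any of these settings can also be supplied as an environment variable, which is how the Docker Compose file sets `GF_SECURITY_ADMIN_USER` and `GF_USERS_ALLOW_SIGN_UP`. The naming rule is `GF_<SECTION>_<KEY>` in upper case, with `.` and `-` replaced by `_`. The Python sketch below applies that rule to a small INI fragment; the helper name `env_var_name` and the sample text are illustrative.

```python
import configparser


def env_var_name(section, key):
    """Map an INI section and key to Grafana's GF_<SECTION>_<KEY> form."""
    normalized = f"{section}_{key}".replace(".", "_").replace("-", "_")
    return "GF_" + normalized.upper()


INI_TEXT = """
[security]
admin_user = admin

[auth.anonymous]
enabled = false
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Build the equivalent environment variables for every setting in the fragment
overrides = {
    env_var_name(section, key): value
    for section in parser.sections()
    for key, value in parser[section].items()
}
```

Keeping secrets such as the admin password in environment variables avoids committing them to the INI file.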

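Advantages

1. Excellent visualization: a wide range of panel types and theme customization
2. Unified data sources: all monitoring data can be viewed on one platform
3. Centralized alerting: alert rules and notification channels are managed in one place
4. Rich community ecosystem: many community-shared dashboard templates can be imported directly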

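Disadvantages

1. Learning curve for complex queries: PromQL and LogQL syntax takes time to learn
2. Dashboard maintenance cost: dashboards need ongoing upkeep as the number of services grows
3. Alert rule management: maintaining and tuning a large number of alert rules is time-consuming
4. Resource consumption: many panels and alert rules increase Grafana's memory usage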

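Summary

Grafana is the core visualization platform of an observability stack. By connecting to data sources such as Prometheus, using variables to switch panels dynamically, and combining alert rules with multiple notification channels, it closes the monitoring loop from metric collection through visualization to alerting on anomalies. It is advisable to manage data sources, dashboards, and alert rules as code through the provisioning mechanism, to use Docker Compose to stand up the full monitoring stack quickly, and to start from community dashboard templates to speed up the rollout.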