Prometheus + Grafana Monitoring

SunnyFan

Introduction

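Prometheus is an open-source system monitoring and alerting toolkit, and Grafana is a visualization and dashboard platform. Used together, they are the de facto standard for cloud-native monitoring: Prometheus collects and stores metrics, while Grafana handles visualization and alerting. The combination covers .NET applications, Kubernetes clusters, servers, and more.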

Features

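1. Time-series database: Prometheus ships with a built-in TSDB
2. Pull model: Prometheus actively scrapes metrics from its targets over HTTP
3. PromQL: a powerful query language
4. Visualization: Grafana offers a rich set of charts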
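
To make the pull model concrete: a target only has to serve plain text at an HTTP endpoint, and Prometheus fetches it on a schedule. The following is a minimal Python sketch of that text exposition format. The function name `render_metric` and the sample values are illustrative assumptions, and the sketch skips label-value escaping that a real client library handles.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here; real client libraries do that.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical counter with one label, as a scrape of /metrics might return it
    print(render_metric(
        "orders_total", "Total number of orders", "counter",
        [({"status": "success"}, 42), ({"status": "failed"}, 3)],
    ))
```

In practice a client library produces this output; the sketch only shows what Prometheus reads on each scrape.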

Prometheus Deployment

Docker Deployment

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
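    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml:ro  # referenced by rule_files in prometheus.yml
      - prometheus_data:/prometheus
    extra_hosts:
      - "host.docker.internal:host-gateway"  # lets host.docker.internal resolve on Linux (Docker 20.10+)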
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123

volumes:
  prometheus_data:
  grafana_data:
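
The two retention flags above bound storage by age and by size, and Prometheus applies whichever limit is reached first. For capacity planning, the Prometheus documentation gives a rough formula: needed disk space = retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. A minimal Python sketch of that arithmetic follows; the series count in the example is a made-up figure, not a measurement.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB size estimate: retention * ingestion rate * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumption: 100,000 active series, scraped every 15s, kept for 30 days
    estimate = estimate_disk_bytes(100_000, 15, 30)
    print(f"{estimate / 1e9:.2f} GB")  # 34.56 GB
```

Under these assumed numbers the 10GB size limit would be reached well before the 30-day limit, so the effective retention would be shorter than 30 days.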

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # .NET application monitoring
  - job_name: 'dotnet-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8080']
        labels:
          app: 'myapp-api'
          env: 'production'

  # Multiple instances
  - job_name: 'dotnet-workers'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'host.docker.internal:8081'
          - 'host.docker.internal:8082'
          - 'host.docker.internal:8083'

  # Node Exporter: server monitoring
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Redis monitoring
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

# Alerting rules
rule_files:
  - 'alert_rules.yml'

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

.NET Application Integration

Exposing Metrics

# Install the prometheus-net packages
dotnet add package prometheus-net
dotnet add package prometheus-net.AspNetCore

// Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);

// prometheus-net needs no service registration for this basic setup

var app = builder.Build();

// Enable the Prometheus middleware
app.UseMetricServer("/metrics");       // serves the /metrics endpoint
app.UseHttpMetrics();                   // collects HTTP request metrics

app.MapGet("/", () => "Hello World!");
app.Run();

Custom Metrics

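using Prometheus;

/// <summary>
/// Custom Prometheus metrics
/// </summary>
public class OrderMetrics
{
    // Counter: total number of orders
    private static readonly Counter OrdersTotal = Metrics.CreateCounter(
        "orders_total",
        "Total number of orders",
        new CounterConfiguration
        {
            LabelNames = new[] { "status" } // label values: success/failed
        });

    // Histogram: order processing duration
    private static readonly Histogram OrderProcessingDuration = Metrics.CreateHistogram(
        "order_processing_duration_seconds",
        "Order processing duration (seconds)",
        new HistogramConfiguration
        {
            // 0.05, 0.1, 0.2, ..., 6.4 seconds. The buckets must extend past any latency
            // threshold you alert on, because histogram_quantile cannot report a value
            // above the largest finite bucket bound.
            Buckets = Histogram.ExponentialBuckets(0.05, 2, 8)
        });

    // Gauge: orders currently in progress
    private static readonly Gauge ActiveOrders = Metrics.CreateGauge(
        "active_orders",
        "Number of orders currently in progress");

    // Record a successful order
    public static void OrderSuccess()
    {
        OrdersTotal.WithLabels("success").Inc();
    }

    // Record a failed order
    public static void OrderFailed()
    {
        OrdersTotal.WithLabels("failed").Inc();
    }

    // Measure processing time; the duration is observed when the returned timer is disposed
    public static IDisposable MeasureProcessing()
    {
        return OrderProcessingDuration.NewTimer();
    }

    // Increment / decrement the number of active orders
    public static void OrderStarted() => ActiveOrders.Inc();
    public static void OrderCompleted() => ActiveOrders.Dec();
}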
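
It can help to see what a histogram actually stores. Each observation lands in the first bucket whose upper bound (`le`) is greater than or equal to the value, and Prometheus exposes the buckets as cumulative counts. The Python sketch below models that behaviour under simplified assumptions; the class name and the bucket bounds are illustrative and not part of prometheus-net.

```python
import bisect
import math


class CumulativeHistogram:
    """Toy model of a Prometheus histogram with cumulative `le` buckets."""

    def __init__(self, upper_bounds):
        # A +Inf bucket is always present and catches everything else
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)  # per-bucket, non-cumulative
        self.sum = 0.0

    def observe(self, value):
        # Index of the first bound that is >= value ("less than or equal" bucket)
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.sum += value

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed on /metrics."""
        total = 0
        result = []
        for bound, count in zip(self.upper_bounds, self.counts):
            total += count
            result.append((bound, total))
        return result


if __name__ == "__main__":
    histogram = CumulativeHistogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.5, 0.7, 3.0):
        histogram.observe(seconds)
    for bound, count in histogram.cumulative_buckets():
        print(f'le="{bound}" {count}')
```

The 3.0 second observation is only visible in the `+Inf` bucket, which is why bucket bounds should cover the latencies you care about.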

Using the Metrics in Business Code

/// <summary>
/// Order service with metrics instrumentation
/// </summary>
public class OrderService
{
    public async Task<OrderResult> ProcessOrderAsync(CreateOrderRequest request)
    {
        OrderMetrics.OrderStarted();

        using (OrderMetrics.MeasureProcessing())
        {
            try
            {
                var result = await DoProcessOrderAsync(request);
                OrderMetrics.OrderSuccess();
                return result;
            }
            catch
            {
                OrderMetrics.OrderFailed();
                throw;
            }
            finally
            {
                OrderMetrics.OrderCompleted();
            }
        }
    }
}

PromQL Queries

Common Queries

# HTTP request rate (per second)
rate(http_requests_received_total[5m])

# Request rate grouped by API
# (prometheus-net labels its HTTP metrics with code, method, controller and action)
sum(rate(http_requests_received_total[5m])) by (method, controller, action)

# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (prometheus-net exposes the status code in the "code" label)
sum(rate(http_requests_received_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_received_total[5m]))

# CPU usage (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes * 100

# Disk usage (%): 100 minus the percentage still available
100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  /
  node_filesystem_size_bytes{mountpoint="/"} * 100
)

# Container restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h])
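
The P95 query above relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified model of that calculation for classic histograms, assuming 0 < q <= 1 and at least one finite bucket. The function name matches the PromQL function for readability only, and the bucket data is invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with (+Inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank
    index = next(i for i, (_, cumulative) in enumerate(buckets) if cumulative >= rank)
    if index == len(buckets) - 1:
        # Rank falls into the +Inf bucket: the best available answer is the
        # largest finite upper bound
        return buckets[-2][0]
    upper, cumulative = buckets[index]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0
    # Linear interpolation within the bucket
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


if __name__ == "__main__":
    # Invented data: 100 requests, 10 of them slower than 1 second
    buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
    print(f"P50 = {histogram_quantile(0.50, buckets):.2f}")  # P50 = 0.42
    print(f"P95 = {histogram_quantile(0.95, buckets):.2f}")  # P95 = 1.00
```

Note the P95 result: the true value lies somewhere above 1 second, but the estimate is capped at the largest finite bucket bound. Quantile accuracy therefore depends on how the buckets are laid out.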

Alerting Rules

Alert Configuration

# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"

      # High error rate (prometheus-net exposes the status code in the "code" label)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_received_total{code=~"5.."}[5m]))
          /
          sum(rate(http_requests_received_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API error rate is too high"
          description: "5xx error rate is above 5%, current value: {{ $value | humanizePercentage }}"

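      # High memory usage
      - alert: HighMemoryUsage
        # The "> 0" filter skips containers without a memory limit (reported as 0),
        # which would otherwise divide by zero and always fire
        expr: |
          (container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage is too high"
          description: "Container {{ $labels.container }} is using more than 85% of its memory limit"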

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space is running low"
          description: "Less than 10% of disk space remains"

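      # Slow order processing
      # (can only fire if the histogram has finite buckets above 1 second)
      - alert: SlowOrderProcessing
        expr: |
          histogram_quantile(0.95, sum(rate(order_processing_duration_seconds_bucket[5m])) by (le)) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Order processing is too slow"
          description: "P95 processing time is above 1 second"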
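
Each rule above has a `for` clause, which makes an alert wait in the pending state until its expression has held for the whole duration. The Python sketch below models that state machine in units of evaluation intervals. It is a simplification (no `keep_firing_for`, no timing jitter), and the function name is an assumption made for illustration.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, True when the expression returned a result.
    for_evals: number of evaluation intervals covered by the `for` duration,
               e.g. for: 1m with evaluation_interval: 15s gives 4.
    """
    states = []
    consecutive = 0  # evaluations in a row for which the condition held
    for condition in condition_by_eval:
        if not condition:
            consecutive = 0  # any gap resets the pending timer
            states.append("inactive")
            continue
        consecutive += 1
        # The alert fires once the condition has held for `for_evals` full intervals
        states.append("firing" if consecutive > for_evals else "pending")
    return states


if __name__ == "__main__":
    # ServiceDown: for: 1m, evaluated every 15s -> fires on the 5th evaluation
    print(alert_states([True] * 5, for_evals=4))
    # A brief recovery sends the alert back to inactive and restarts the wait
    print(alert_states([True, True, False, True], for_evals=4))
```

This is why a short `for` duration reacts faster but is more sensitive to flapping, while a long one suppresses brief spikes.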

Grafana Dashboards

Data Source Configuration

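1. Open http://localhost:3000 and log in as admin with the password set in docker-compose.yml (admin123)
2. Configuration → Data Sources → Add data source (newer Grafana versions list this under Connections → Data sources)
3. Select Prometheus
4. URL: http://prometheus:9090
5. Save & Test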

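Commonly used community dashboard templates:

ID	Name	Description
12486	.NET metrics	.NET application monitoring
1860	Node Exporter	Server resources
11835	Redis	Redis monitoring
15760	K8s Views	Kubernetes cluster overview
1111	Nginx	Nginx monitoring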

How to Import

Dashboards → Import → enter the template ID → Load → select the data source → Import

Custom Dashboard Panels

// Example Grafana dashboard JSON
{
  "dashboard": {
    "title": ".NET Application Monitoring",
    "panels": [
      {
        "title": "HTTP request rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_received_total[5m])) by (method)",
            "legendFormat": "{{method}}"
          }
        ]
      },
      {
        "title": "P95 response time",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.5, "color": "yellow" },
                { "value": 1.0, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "Error rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_received_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_received_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 10,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
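
The `thresholds.steps` arrays in the JSON above decide the panel color: with absolute thresholds, a value takes the color of the highest step it has reached. A small Python sketch of that lookup follows. The function name is an assumption, and the sketch presumes every step carries a numeric `value`, as in this example.

```python
def threshold_color(value, steps):
    """Return the color of the highest threshold step that `value` has reached."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # the lowest step acts as the base color
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    # Steps copied from the "P95 response time" panel (unit: seconds)
    steps = [
        {"value": 0, "color": "green"},
        {"value": 0.5, "color": "yellow"},
        {"value": 1.0, "color": "red"},
    ]
    for latency in (0.2, 0.5, 1.7):
        print(latency, threshold_color(latency, steps))
```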

Advanced Prometheus Configuration

Service Discovery

# prometheus.yml: service discovery (instead of hand-maintained static_configs)

# Docker service discovery
scrape_configs:
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus]
        regex: "true"
        action: keep
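      # Rewrite the scrape address to use the port from the container's "prometheus_port" label.
      # The label holds only a port, so it is combined with the host part of __address__.
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__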

# Consul service discovery (recommended for microservice architectures)
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []  # an empty list discovers all services
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_health]
        regex: 'passing'
        action: keep

# Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
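
All three discovery examples above depend on `relabel_configs`. A rule joins the values of its `source_labels` with `;`, matches the result against a fully anchored regex, and then keeps, drops, or rewrites the target. The Python sketch below models a simplified subset of those semantics (keep, drop, replace). Python's `re` module stands in for Prometheus's RE2 engine, and the function name is an assumption.

```python
import re


def relabel(labels, source_labels, regex, action, target_label=None,
            replacement="$1", separator=";"):
    """Apply one simplified relabel rule.

    Returns the resulting label dict, or None if the target is dropped.
    """
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are anchored at both ends
    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace":
        if not match:
            return labels  # no match leaves the labels untouched
        # Translate $1 / ${1} group references into Python's \g<1> syntax
        template = re.sub(r"\$\{?(\d+)\}?",
                          lambda m: f"\\g<{m.group(1)}>", replacement)
        return {**labels, target_label: match.expand(template)}
    raise ValueError(f"unsupported action: {action}")


if __name__ == "__main__":
    pod = {
        "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
        "__meta_kubernetes_pod_annotation_prometheus_io_path": "/custom/metrics",
    }
    # Rule 1: keep only pods annotated with prometheus.io/scrape=true
    kept = relabel(pod, ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
                   "true", "keep")
    # Rule 2: copy the annotated path into __metrics_path__
    final = relabel(kept, ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
                    "(.+)", "replace", target_label="__metrics_path__")
    print(final["__metrics_path__"])  # /custom/metrics
```

Because the regex is anchored, a pattern such as `true` matches only the exact value `true`, not values that merely contain it.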

Remote Write and Long-Term Storage

# Prometheus keeps 15 days of data by default
# Long-term storage options: Thanos / Cortex / VictoriaMetrics

# Remote write configuration (sends samples to VictoriaMetrics)
global:
  evaluation_interval: 15s

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 5

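# Remote read configuration
# Note: VictoriaMetrics does not implement the Prometheus remote read API; query it directly
# instead (for example by adding it to Grafana as a Prometheus-type data source). Keep this
# block only for backends that support remote read, and point the URL at that backend.
remote_read:
  - url: "http://victoriametrics:8428/api/v1/read"
    read_recent: true  # also query the remote endpoint for time ranges that local storage fully covers (default: false)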

Prometheus Federation

# Federation: a global Prometheus aggregates data from the per-cluster instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="dotnet-api"}'
        - '{job="node"}'
        - '{__name__=~"job:.*"}'  # precomputed recording rules
    static_configs:
      - targets:
          - 'prometheus-cluster1:9090'
          - 'prometheus-cluster2:9090'
        labels:
          cluster: 'global'
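
Under this configuration the global Prometheus issues an ordinary HTTP GET against each target's `/federate` endpoint, passing every selector as a repeated `match[]` query parameter. The Python sketch below builds such a URL so the request can be reproduced by hand when debugging. The helper name is an assumption; the host name is taken from the config above.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build the /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url}/federate?{query}"


if __name__ == "__main__":
    url = federate_url(
        "http://prometheus-cluster1:9090",
        ['{job="dotnet-api"}', '{job="node"}', '{__name__=~"job:.*"}'],
    )
    print(url)
```

Fetching the printed URL with any HTTP client shows exactly which series the global instance would ingest from that cluster.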

Precomputing with Recording Rules

Precomputing Frequent Queries

# recording_rules.yml (add this file to rule_files in prometheus.yml so it is loaded)
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # Precompute the HTTP request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_received_total[5m])) by (job)

      # Precompute the error rate (prometheus-net exposes the status code in the "code" label)
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_received_total{code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_received_total[5m])) by (job)

      # Precompute P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      # Precompute P99 latency
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

  - name: system_metrics
    interval: 30s
    rules:
      - record: instance:cpu_usage:ratio
        expr: |
          100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

      - record: instance:memory_usage:ratio
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100
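
Most of the rules above are built on `rate()`, which turns an ever-increasing counter into a per-second rate and compensates for counter resets such as process restarts. The Python sketch below shows the core idea under simplifying assumptions: it divides by the time between the first and last sample and omits the extrapolation to the window edges that Prometheus also performs. The function name and sample data are illustrative.

```python
def simple_rate(samples):
    """Per-second increase of a counter, tolerant of counter resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter went backwards: assume a restart and count up from zero
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    # Invented scrape results every 15s; the counter resets between t=30 and t=45
    samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(f"{simple_rate(samples):.3f} requests/s")  # 1.667 requests/s
```

Without the reset handling, the drop from 160 to 10 would produce a large negative increase, which is why raw counter differences should not be used in place of `rate()`.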

Prometheus High-Availability Architecture

High-Availability Deployment

# docker-compose-ha.yml: highly available Prometheus deployment
version: '3.8'

services:
  prometheus-1:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_1_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'  # TSDB directory shared with thanos-sidecar-1 (--tsdb.path)
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-remote-write-receiver'

  prometheus-2:
    image: prom/prometheus:latest
    ports:
      - "9092:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_2_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-remote-write-receiver'

  # Thanos Sidecar: long-term storage and highly available queries
  thanos-sidecar-1:
    image: thanosio/thanos:latest
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-1:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - prometheus_1_data:/prometheus

  # Alertmanager (a single instance here; run more instances joined with --cluster.peer to form a cluster)
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--web.external-url=http://localhost:9093'

  # Alertmanager configuration (alert notifications)
  # alertmanager.yml
  # route:
  #   receiver: 'team-ops'
  #   group_by: ['alertname', 'cluster']
  #   group_wait: 10s
  #   group_interval: 5m
  #   repeat_interval: 4h
  #   routes:
  #     - match:
  #         severity: critical
  #       receiver: 'team-ops-critical'
  # receivers:
  #   - name: 'team-ops'
  #     email_configs:
  #       - to: 'ops@example.com'
  #   - name: 'team-ops-critical'
  #     webhook_configs:
  #       - url: 'http://webhook-receiver:8080/alerts'

volumes:
  prometheus_1_data:
  prometheus_2_data:
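
The commented-out `alertmanager.yml` above defines a routing tree: an alert starts at the root route and is handed to the first child route whose `match` labels all equal the alert's labels, falling back to the parent's receiver otherwise. The Python sketch below models that lookup in a simplified form. It ignores `continue`, regex matchers, and grouping, and the function name is an assumption.

```python
def pick_receiver(alert_labels, route, inherited=None):
    """Walk a simplified Alertmanager routing tree and return the receiver name."""
    receiver = route.get("receiver", inherited)  # children inherit the parent's receiver
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(alert_labels.get(name) == value for name, value in matchers.items()):
            # First matching child wins; descend into it
            return pick_receiver(alert_labels, child, receiver)
    return receiver


if __name__ == "__main__":
    # Routing tree mirroring the commented alertmanager.yml above
    route = {
        "receiver": "team-ops",
        "routes": [
            {"match": {"severity": "critical"}, "receiver": "team-ops-critical"},
        ],
    }
    print(pick_receiver({"alertname": "ServiceDown", "severity": "critical"}, route))
    print(pick_receiver({"alertname": "DiskSpaceLow", "severity": "warning"}, route))
```

With the rules from this article, `ServiceDown` (critical) would reach the webhook receiver, while the warning-level alerts would go to the email receiver.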

Built-in .NET Metrics

Metric	Description
http_requests_received_total	Total number of HTTP requests
http_request_duration_seconds	HTTP request duration
process_cpu_seconds_total	CPU time used
process_working_set_bytes	Memory usage
dotnet_gc_collections	Number of GC collections
dotnet_threadpool_threads	Number of thread pool threads