Prometheus + Grafana Monitoring
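Introduction
Prometheus is an open-source system monitoring and alerting toolkit, and Grafana is a visualization and dashboard platform. Used together they are the de facto standard for cloud-native monitoring: Prometheus scrapes and stores the metrics, while Grafana visualizes them and can alert on them. The combination covers .NET applications, Kubernetes clusters, servers, and more.
Features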
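Prometheus Deployment
Docker Deployment
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Rule file referenced by rule_files in prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
    # Lets host.docker.internal resolve on Linux hosts (Docker 20.10+)
    extra_hosts:
      - "host.docker.internal:host-gateway"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      # Demo password only; change it for anything beyond local testing
      - GF_SECURITY_ADMIN_PASSWORD=admin123
volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration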
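The scrape configuration in this section targets `node-exporter:9100`, `redis-exporter:9121` and `alertmanager:9093`, none of which the Compose file defines. The fragment below is a minimal sketch of those services, meant to be merged under the existing `services:` key. The Redis address and the local `alertmanager.yml` path are assumptions, and a real deployment would usually pin image versions rather than use `latest`.

```yaml
# Sketch only: merge under the existing "services:" key in docker-compose.yml
services:
  node-exporter:
    image: prom/node-exporter:latest
    # Listens on port 9100 inside the Compose network; no host port is needed
    # for Prometheus to reach it as node-exporter:9100.
  redis-exporter:
    image: oliver006/redis_exporter:latest
    environment:
      # Assumption: a Redis service named "redis" exists on the same network
      - REDIS_ADDR=redis://redis:6379
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      # Assumption: alertmanager.yml sits next to docker-compose.yml
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
```

Inside a container, node-exporter reports the container's view of the system unless the host filesystems are mounted into it, so host-level monitoring needs extra mounts or a host install.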
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # .NET application
  - job_name: 'dotnet-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8080']
        labels:
          app: 'myapp-api'
          env: 'production'
  # Multiple instances
  - job_name: 'dotnet-workers'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'host.docker.internal:8081'
          - 'host.docker.internal:8082'
          - 'host.docker.internal:8083'
  # Node Exporter: server monitoring
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  # Redis monitoring
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
# Alert rules
rule_files:
  - 'alert_rules.yml'
# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
.NET Application Integration
Exposing Metrics
# Install the prometheus-net packages
dotnet add package prometheus-net
dotnet add package prometheus-net.AspNetCore
// Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
// prometheus-net's default registry needs no service registration here
var app = builder.Build();
// Enable the Prometheus middleware
app.UseMetricServer("/metrics"); // exposes the /metrics endpoint
app.UseHttpMetrics();            // records HTTP request metrics
app.MapGet("/", () => "Hello World!");
app.Run();
Custom Metrics
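using Prometheus;

/// <summary>
/// Custom Prometheus metrics for orders
/// </summary>
public static class OrderMetrics
{
    // Counter: total number of orders
    private static readonly Counter OrdersTotal = Metrics.CreateCounter(
        "orders_total",
        "Total number of orders",
        new CounterConfiguration
        {
            LabelNames = new[] { "status" } // label values: success / failed
        });
    // Histogram: order processing duration
    private static readonly Histogram OrderProcessingDuration = Metrics.CreateHistogram(
        "order_processing_duration_seconds",
        "Order processing duration in seconds",
        new HistogramConfiguration
        {
            // Buckets extend past 1 s: histogram_quantile cannot report a value above
            // the highest finite bucket, so a "P95 > 1s" alert needs buckets beyond 1.0
            Buckets = new[] { 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 }
        });
    // Gauge: orders currently being processed
    private static readonly Gauge ActiveOrders = Metrics.CreateGauge(
        "active_orders",
        "Number of orders currently being processed");
    // Record a successful order
    public static void OrderSuccess()
    {
        OrdersTotal.WithLabels("success").Inc();
    }
    // Record a failed order
    public static void OrderFailed()
    {
        OrdersTotal.WithLabels("failed").Inc();
    }
    // Measure processing time; the duration is observed when the timer is disposed
    public static IDisposable MeasureProcessing()
    {
        return OrderProcessingDuration.NewTimer();
    }
    // Increment / decrement the number of active orders
    public static void OrderStarted() => ActiveOrders.Inc();
    public static void OrderCompleted() => ActiveOrders.Dec();
}
Using the Metrics in Business Code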
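/// <summary>
/// Order service with embedded metric instrumentation
/// </summary>
public class OrderService
{
    // DoProcessOrderAsync, OrderResult and CreateOrderRequest are application code not shown here
    public async Task<OrderResult> ProcessOrderAsync(CreateOrderRequest request)
    {
        OrderMetrics.OrderStarted();
        using (OrderMetrics.MeasureProcessing())
        {
            try
            {
                var result = await DoProcessOrderAsync(request);
                OrderMetrics.OrderSuccess();
                return result;
            }
            catch
            {
                OrderMetrics.OrderFailed();
                throw;
            }
            finally
            {
                // Runs on both success and failure, so the gauge always returns to its baseline
                OrderMetrics.OrderCompleted();
            }
        }
    }
}
PromQL Queries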
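Common Queries
# HTTP request rate (per second)
rate(http_requests_received_total[5m])
# Request rate grouped by method and endpoint
# (prometheus-net labels these series with code, method, controller, action and endpoint)
sum(rate(http_requests_received_total[5m])) by (method, endpoint)
# P95 response time (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate (the HTTP status code is in the "code" label)
sum(rate(http_requests_received_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_received_total[5m]))
# CPU usage (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes * 100
# Disk usage (%): 100 minus the percentage still available
100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  /
  node_filesystem_size_bytes{mountpoint="/"} * 100
)
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h])
Alert Rules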
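Alert Configuration
# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is unavailable"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"
      # High error rate (prometheus-net puts the HTTP status code in the "code" label)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_received_total{code=~"5.."}[5m]))
          /
          sum(rate(http_requests_received_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API error rate is too high"
          description: "5xx error rate is above 5%, current value: {{ $value | humanizePercentage }}"
      # High memory usage (containers without a memory limit report 0 and are excluded)
      - alert: HighMemoryUsage
        expr: |
          (container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0)) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage is too high"
          description: "Container {{ $labels.container }} is using more than 85% of its memory limit"
      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space is running low"
          description: "Less than 10% of disk space remains"
      # Slow order processing
      # Requires histogram buckets above 1 s: histogram_quantile never returns more than
      # the highest finite bucket, so with a top bucket of 1.0 this rule could not fire
      - alert: SlowOrderProcessing
        expr: |
          histogram_quantile(0.95, sum(rate(order_processing_duration_seconds_bucket[5m])) by (le)) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Order processing is too slow"
          description: "P95 processing time is above 1 second"
Grafana Dashboards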
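The data source described in this section can also be provisioned from a file when Grafana starts, which keeps the setup reproducible. The sketch below uses Grafana's data source provisioning format. The file name and host path are assumptions, and the file has to be mounted into the Grafana container under `/etc/grafana/provisioning/datasources/`.

```yaml
# Assumed host path: ./grafana/provisioning/datasources/prometheus.yml
# Mount it in docker-compose.yml, for example:
#   - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://prometheus:9090   # service name from docker-compose.yml
    isDefault: true
```

The URL uses the Compose service name rather than `localhost` because the request comes from inside the Grafana container.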
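Data Source Configuration
1. Open http://localhost:3000 and log in (user admin, password admin123 as set in docker-compose.yml)
2. Configuration → Data Sources → Add data source (Connections → Data sources in newer Grafana versions)
3. Select Prometheus
4. URL: http://prometheus:9090
5. Save & Test
Recommended Dashboard Templates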
| ID | Name | Description |
|---|---|---|
| 12486 | .NET metrics | .NET application monitoring |
| 1860 | Node Exporter | Server resources |
| 11835 | Redis | Redis monitoring |
| 15760 | K8s Views | Kubernetes cluster overview |
| 1111 | Nginx | Nginx monitoring |
How to Import
Dashboards → Import → enter the template ID → Load → select the data source → Import
Custom Dashboard Panels
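A custom dashboard like the one in this section can also be loaded from disk at startup instead of being built in the UI. The sketch below is a Grafana dashboard provider file. The provider name, both paths and the update interval are assumptions. File provisioning typically expects the dashboard object itself (the value under the `"dashboard"` key in the JSON that follows), not the wrapper used by the HTTP API.

```yaml
# Assumed location inside the container: /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'              # hypothetical provider name
    orgId: 1
    folder: ''                   # empty string: dashboards land in the General folder
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30    # how often Grafana rescans the directory
    options:
      # Assumption: dashboard JSON files are mounted at this path
      path: /var/lib/grafana/dashboards
```

Keeping the dashboard JSON in version control next to this file makes dashboard changes reviewable like any other configuration change.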
Grafana dashboard JSON example:
{
  "dashboard": {
    "title": ".NET Application Monitoring",
    "panels": [
      {
        "title": "HTTP request rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_received_total[5m])) by (method)",
            "legendFormat": "{{method}}"
          }
        ]
      },
      {
        "title": "P95 response time",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.5, "color": "yellow" },
                { "value": 1.0, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "Error rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_received_total{code=~\"5..\"}[5m])) / sum(rate(http_requests_received_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 10,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
Advanced Prometheus Configuration
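The Kubernetes job in the service discovery configuration in this section keeps only pods whose `prometheus.io/scrape` annotation is `true`, and takes the metrics path from `prometheus.io/path`. The manifest below sketches a pod that this job would pick up. The pod name, image and port are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-api                      # hypothetical name
  annotations:
    prometheus.io/scrape: "true"       # matched by the "keep" relabel rule
    prometheus.io/path: "/metrics"     # copied into __metrics_path__
spec:
  containers:
    - name: api
      image: registry.example.com/myapp-api:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080          # pod discovery yields one target per declared port
```

Annotation values must be strings, so `"true"` is quoted. Prometheus exposes each annotation as a `__meta_kubernetes_pod_annotation_*` label with unsupported characters replaced by underscores.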
Service Discovery Mechanisms
# prometheus.yml: service discovery (replaces hand-maintained static_configs)
scrape_configs:
  # Docker service discovery
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Keep only containers labelled prometheus=true
      - source_labels: [__meta_docker_container_label_prometheus]
        regex: "true"
        action: keep
      # Replace the port in __address__ with the one from the prometheus_port container label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '${1}:${2}'
        target_label: __address__
  # Consul service discovery (recommended for microservice architectures)
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []  # an empty list discovers all services
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_health]
        regex: 'passing'
        action: keep
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Remote Write and Long-Term Storage
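# Prometheus keeps 15 days of data by default
# Long-term storage options: Thanos / Cortex / VictoriaMetrics
# Remote write (send samples to VictoriaMetrics)
global:
  evaluation_interval: 15s
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 5
# Remote read
# VictoriaMetrics does not implement the Prometheus remote read API, so this block only
# applies to backends that do. Query VictoriaMetrics directly instead, for example as a
# Prometheus-type data source in Grafana.
remote_read:
  - url: "http://<remote-read-backend>/api/v1/read"  # placeholder for a backend that supports remote read
    read_recent: false  # false (the default): time ranges fully covered by local storage are served locally
Prometheus Federation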
# Federation: a global Prometheus aggregates data from the per-cluster instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="dotnet-api"}'
        - '{job="node"}'
        - '{__name__=~"job:.*"}'  # series precomputed by recording rules
    static_configs:
      - targets:
          - 'prometheus-cluster1:9090'
          - 'prometheus-cluster2:9090'
        labels:
          cluster: 'global'
Precomputing with Recording Rules
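Recorded series can replace raw expressions in alert rules, so the expensive query is evaluated once by the recording rule rather than again by each alert. The sketch below assumes the `job:http_error_rate:ratio5m` and `job:http_request_duration:p95` series defined in this section. The alert names and thresholds are illustrative.

```yaml
groups:
  - name: app_alerts_recorded          # hypothetical group name
    rules:
      - alert: HighErrorRateByJob
        expr: job:http_error_rate:ratio5m > 0.05   # recorded ratio, 0.05 = 5%
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for job {{ $labels.job }}"
      - alert: SlowRequestsByJob
        expr: job:http_request_duration:p95 > 1    # recorded P95 in seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1s for job {{ $labels.job }}"
```

Because the recording rules aggregate by `job`, these alerts fire per job rather than once for the whole system.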
Precompute Frequent Queries
# recording_rules.yml (must also be listed under rule_files in prometheus.yml)
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # Precomputed HTTP request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_received_total[5m])) by (job)
      # Precomputed error rate (the HTTP status code is in the "code" label)
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_received_total{code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_received_total[5m])) by (job)
      # Precomputed P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )
      # Precomputed P99 latency
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )
  - name: system_metrics
    interval: 30s
    rules:
      # Note: both rules below yield percentages (0-100) despite the "ratio" suffix
      - record: instance:cpu_usage:ratio
        expr: |
          100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
      - record: instance:memory_usage:ratio
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100
Prometheus High-Availability Architecture
High-Availability Deployment
# docker-compose-ha.yml: Prometheus high-availability deployment
version: '3.8'
services:
  # Two replicas scrape the same targets. For Thanos to deduplicate them, each replica
  # needs its own global.external_labels (for example replica: "1"), which means a
  # separate prometheus.yml per replica rather than the shared file mounted here.
  prometheus-1:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_1_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-remote-write-receiver'
  prometheus-2:
    image: prom/prometheus:latest
    ports:
      - "9092:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_2_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-remote-write-receiver'
  # Thanos Sidecar: long-term storage and highly available queries
  thanos-sidecar-1:
    image: thanosio/thanos:latest
    command:
      - 'sidecar'
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus-1:9090'
      - '--grpc-address=0.0.0.0:10901'
      - '--http-address=0.0.0.0:10902'
    volumes:
      - prometheus_1_data:/prometheus
  # Alertmanager (a single instance here; add peers with --cluster.peer to form a cluster)
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--web.external-url=http://localhost:9093'
# Alertmanager configuration (alert notifications)
# alertmanager.yml
# route:
#   receiver: 'team-ops'
#   group_by: ['alertname', 'cluster']
#   group_wait: 10s
#   group_interval: 5m
#   repeat_interval: 4h
#   routes:
#     - match:
#         severity: critical
#       receiver: 'team-ops-critical'
# receivers:
#   - name: 'team-ops'
#     email_configs:
#       - to: 'ops@example.com'
#   - name: 'team-ops-critical'
#     webhook_configs:
#       - url: 'http://webhook-receiver:8080/alerts'
volumes:
  prometheus_1_data:
  prometheus_2_data:
.NET Built-in Metrics
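| Metric | Description |
|---|---|
| http_requests_received_total | Total number of HTTP requests |
| http_request_duration_seconds | HTTP request duration |
| process_cpu_seconds_total | CPU time used |
| process_working_set_bytes | Memory usage |
| dotnet_gc_collections | Number of GC collections |
| dotnet_threadpool_threads | Number of thread pool threads |
The exact names of the GC and thread pool metrics depend on the prometheus-net version and on any runtime-metrics package in use; check the application's /metrics output for the names it actually exposes.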
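The process-level metrics in the table can feed rules directly. The fragment below is a sketch that records per-instance CPU usage and alerts on a large working set. The `dotnet-api` job label comes from the scrape configuration earlier in the article, while the rule names and the 1 GB threshold are assumptions to tune per service.

```yaml
groups:
  - name: dotnet_process               # hypothetical group name
    rules:
      # CPU cores used by the process, averaged over 5 minutes
      - record: instance:process_cpu:rate5m
        expr: rate(process_cpu_seconds_total{job="dotnet-api"}[5m])
      # Working set above roughly 1 GB for 10 minutes (threshold is illustrative)
      - alert: DotnetHighWorkingSet
        expr: process_working_set_bytes{job="dotnet-api"} > 1e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Working set above 1 GB on {{ $labels.instance }}"
```

A recorded value of 1.0 means the process is using the equivalent of one full CPU core.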
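Advantages
Disadvantages
Summary
Prometheus + Grafana is a well-established combination for monitoring .NET applications: prometheus-net exposes the metrics, Prometheus scrapes and stores them, and Grafana visualizes them. The core metrics are HTTP request rate, error rate, response time, and CPU/memory usage. In production, always configure alert rules.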
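Alert rules only help if the path from Prometheus through Alertmanager to the receiver actually works. One common way to check this is an always-firing "watchdog" rule, sketched below. The group name and the `severity: none` label are assumptions, and the Alertmanager route that forwards this alert to a heartbeat receiver is not shown.

```yaml
groups:
  - name: meta_alerts                  # hypothetical group name
    rules:
      - alert: Watchdog
        expr: vector(1)                # always returns a sample, so the alert always fires
        labels:
          severity: none               # assumption: routed separately from real alerts
        annotations:
          summary: "Always-firing alert that proves the alerting pipeline is working"
```

If the receiver stops getting this alert, something between Prometheus and the notification channel is broken.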
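Key Takeaways
- The core goal of DevOps work is to make delivery faster, more stable, and more auditable.
- Automation is not just scripting commands; it means designing failure handling, rollback, permissions, and observability in together.
- A production pipeline must define clear boundaries for artifacts, environments, credentials, configuration, and ownership.
Project Implementation Perspective
- Split the pipeline into build, test, artifact, deploy, verify, and rollback stages.
- Add logs, metrics, notifications, and manual fallback points to critical steps.
- Regularly rehearse scale-out, rollback, fault injection, and disaster-recovery failover.
Common Pitfalls
- Focusing only on successful deployments and ignoring failure recovery and audit trails.
- Hiding environment differences in ad hoc scripts or manual steps.
- Releasing more often without standardized artifacts and configuration management.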
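Next Steps
- Go on to GitOps, observability, platform engineering, and cost governance.
- Connect this topic with application architecture, security, access control, and backup and recovery.
- Build platform capabilities at the team level instead of reinventing the wheel in every project.
When to Use
- When you are ready to put Prometheus + Grafana monitoring into a real project, first validate the critical path in an isolated module or a minimal sample.
- It suits building automated delivery, infrastructure governance, monitoring and alerting, and production release systems.
- As the team grows, releases become more frequent, or environments multiply, this topic has a significant effect on delivery efficiency.
Implementation Advice
- Make every automated process idempotent, auditable, and reversible wherever possible.
- Manage artifacts, variables, credentials, and execution permissions in separate layers.
- Regularly rehearse scale-out, rollback, key rotation, and disaster recovery.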
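Troubleshooting Checklist
- First locate the failure: code, build, artifact, environment, or permissions.
- Check that pipeline variables, credentials, image tags, and target environment configuration are consistent.
- If the problem is intermittent, look first at concurrent releases, resource contention, and flaky external dependencies.
Review Questions
- If you brought Prometheus + Grafana monitoring into your current project, which inputs, outputs, and failure paths would you verify first?
- At what scale and under which boundary conditions is Prometheus + Grafana monitoring most likely to show problems? Which metrics or logs would you use to confirm them?
- Compared with the default setup or the alternatives, what are the biggest benefit and the biggest cost of adopting Prometheus + Grafana monitoring?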
