Building an Observability Platform
Introduction
Observability is a cornerstone of modern distributed systems. Through its three pillars (logs, metrics, and traces) it helps teams detect, locate, and resolve production problems quickly. As microservices and containerized architectures have become the norm, traditional monitoring can no longer keep up with the complexity of troubleshooting. This article walks through building an observability platform on OpenTelemetry, Loki, Tempo, and Grafana.
Characteristics
The Three Pillars of Observability
Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ Application layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Service A│ │ Service B│ │ Service C│ │ Service D│ │
│ │ OTel SDK │ │ OTel SDK │ │ OTel SDK │ │ OTel SDK │ │
│ └─────┬────┘ └─────┬────┘ └─────┬────┘ └─────┬────┘ │
│ │ │ │ │ │
│ ──────┴─────────────┴─────────────┴─────────────┴───── │
│ OTel Collector │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Receivers -> Processors -> Exporters │ │
│ └───────────┬───────────────┬──────────────┬─────────────┘ │
└──────────────┼───────────────┼──────────────┼──────────────────┘
│ │ │
┌───────▼──────┐ ┌─────▼──────┐ ┌────▼───────┐
│ Loki │ │ Tempo │ │ Prometheus │
│ (log storage) │ │ (tracing) │ │ (metrics) │
└───────┬──────┘ └─────┬──────┘ └────┬───────┘
│ │ │
┌───────▼───────────────▼──────────────▼───────┐
│ Grafana │
│ (unified dashboards) │
└──────────────────────────────────────────────┘
Comparing the Three Pillars
| Dimension | Logs | Metrics | Traces |
|---|---|---|---|
| Data shape | Event records | Aggregated numeric values | Request call chains |
| Data volume | Large | Small | Medium |
| Query method | Keyword / regex | PromQL | TraceID |
| Typical tools | Loki/ELK | Prometheus | Tempo/Jaeger |
| Alerting capability | Limited | Strong | Moderate |
| Storage cost | High | Low | Medium |
OpenTelemetry Integration
Deploying the OTel Collector
# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: observability
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Scrape the Collector's own metrics with the Prometheus receiver
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 15s
static_configs:
- targets: ['localhost:8888']
processors:
# Batching
batch:
send_batch_size: 1024
send_batch_max_size: 2048
timeout: 5s
# Memory limiter
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
# Resource attributes
resource:
attributes:
- key: cluster.name
value: production
action: upsert
- key: deployment.environment
value: prod
action: upsert
# Drop unwanted telemetry
filter:
error_mode: ignore
traces:
span:
- 'attributes["http.route"] == "/health"'
- 'attributes["http.route"] == "/ready"'
- 'attributes["http.route"] == "/metrics"'
# Sampling policy
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: error-policy
type: status_code
status_code:
status_codes:
- ERROR
- name: slow-policy
type: latency
latency:
threshold_ms: 1000
- name: sample-policy
type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
# Export to Loki
loki:
endpoint: http://loki.observability:3100/loki/api/v1/push
default_labels_enabled:
exporter: false
resource: true
# Export to Tempo
otlphttp:
endpoint: http://tempo.observability:4318
# Export to Prometheus
prometheusremotewrite:
endpoint: http://prometheus.observability:9090/api/v1/write
# Debug logging
logging:
loglevel: warn
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, resource, filter, tail_sampling, batch]
exporters: [otlphttp, logging]
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, resource, batch]
exporters: [prometheusremotewrite, logging]
logs:
receivers: [otlp]
processors: [memory_limiter, resource, filter, batch]
exporters: [loki, logging]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: observability
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:0.96.0
args:
- "--config=/etc/otelcol/config.yaml"
ports:
- containerPort: 4317
name: grpc
- containerPort: 4318
name: http
- containerPort: 8888
name: metrics
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/otelcol
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: observability
spec:
type: ClusterIP
ports:
- port: 4317
name: grpc
targetPort: 4317
- port: 4318
name: http
targetPort: 4318
selector:
app: otel-collector
Application-Side OTel SDK Integration
// OpenTelemetry integration for a Node.js application
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { resourceFromAttributes } from '@opentelemetry/resources'
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions'
import { SimpleLogRecordProcessor } from '@opentelemetry/sdk-logs'
const serviceName = process.env.SERVICE_NAME || 'unknown-service'
const otelEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4318'
const resource = resourceFromAttributes({
[ATTR_SERVICE_NAME]: serviceName,
[ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
'deployment.environment': process.env.NODE_ENV || 'development'
})
// Trace exporter
const traceExporter = new OTLPTraceExporter({
url: `${otelEndpoint}/v1/traces`
})
// Metric exporter
const metricExporter = new OTLPMetricExporter({
url: `${otelEndpoint}/v1/metrics`
})
// Log exporter
const logExporter = new OTLPLogExporter({
url: `${otelEndpoint}/v1/logs`
})
const sdk = new NodeSDK({
resource,
spanProcessors: [
new BatchSpanProcessor(traceExporter, {
maxQueueSize: 2048,
maxExportBatchSize: 512,
scheduledDelayMillis: 5000
})
],
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 15000
}),
logRecordProcessors: [
new SimpleLogRecordProcessor(logExporter)
],
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
'@opentelemetry/instrumentation-http': {
enabled: true,
// ignoreIncomingPaths was removed in newer versions; use a request hook instead
ignoreIncomingRequestHook: (req) => ['/health', '/ready', '/metrics'].includes(req.url ?? '')
},
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true }
})
]
})
// Start the SDK
sdk.start()
// Graceful shutdown
process.on('SIGTERM', async () => {
try {
await sdk.shutdown()
console.log('OpenTelemetry SDK shut down successfully')
} catch (err) {
console.error('Error shutting down OpenTelemetry SDK', err)
}
process.exit(0)
})
export { sdk }
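One detail that is easy to miss: tracing.ts has to run before any instrumented module is loaded, otherwise the auto-instrumentations cannot patch them. Below is a minimal sketch of an assumed entrypoint (file name and port are illustrative); alternatively, the compiled SDK file can be preloaded with node --require ./tracing.js.
// index.ts (assumed entrypoint) - import the SDK first so http/express/pg/redis
// are patched before the application code imports them
import './tracing'
import express from 'express'

const app = express()
app.get('/health', (_req, res) => { res.send('ok') })
app.listen(3000, () => console.log('listening on :3000'))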
Custom Metric Collection
// metrics.ts
import { metrics } from '@opentelemetry/api'
const meter = metrics.getMeter('my-service', '1.0.0')
// Counter: total number of HTTP requests
export const httpRequestsTotal = meter.createCounter('http_requests_total', {
description: 'Total number of HTTP requests',
unit: '1'
})
// Histogram: request latency
export const httpRequestDuration = meter.createHistogram('http_request_duration_ms', {
description: 'HTTP request latency in milliseconds',
unit: 'ms'
})
// UpDownCounter: currently active connections
export const activeConnections = meter.createUpDownCounter('active_connections', {
description: 'Number of currently active connections',
unit: '1'
})
// Observable gauge: process memory usage
meter.createObservableGauge('process_memory_usage_bytes', {
description: 'Process memory usage',
unit: 'By'
}, (observableResult) => {
const usage = process.memoryUsage()
observableResult.observe(usage.heapUsed, { type: 'heap_used' })
observableResult.observe(usage.rss, { type: 'rss' })
observableResult.observe(usage.external, { type: 'external' })
})
// Usage example: Express middleware
export function metricsMiddleware(req: any, res: any, next: any): void {
const start = Date.now()
// Increment active connections
activeConnections.add(1, { protocol: req.protocol })
res.on('finish', () => {
const duration = Date.now() - start
const labels = {
method: req.method,
route: req.route?.path || req.path,
status_code: String(res.statusCode)
}
httpRequestsTotal.add(1, labels)
httpRequestDuration.record(duration, labels)
activeConnections.add(-1, { protocol: req.protocol })
})
next()
}
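To close the loop, the middleware has to be registered on the application, and business metrics can be recorded through the same meter. A minimal sketch under assumed names (orders_created_total and the route are illustrative):
// app.ts (illustrative wiring)
import express from 'express'
import { metrics } from '@opentelemetry/api'
import { metricsMiddleware } from './metrics'

const app = express()
app.use(metricsMiddleware) // records http_requests_total / http_request_duration_ms

// A business metric alongside the technical ones
const meter = metrics.getMeter('my-service', '1.0.0')
const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Number of orders created',
  unit: '1'
})

app.post('/api/orders', (_req, res) => {
  ordersCreated.add(1, { channel: 'web' })
  res.status(201).json({ ok: true })
})

app.listen(3000)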
Loki Log Aggregation
Deploying Loki
# loki.yaml - Loki single-binary (monolithic) deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: loki
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: loki
template:
metadata:
labels:
app: loki
spec:
containers:
- name: loki
image: grafana/loki:2.9.4
args:
- "-config.file=/etc/loki/config.yaml"
- "-target=all"
ports:
- containerPort: 3100
name: http
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
volumeMounts:
- name: config
mountPath: /etc/loki
- name: storage
mountPath: /loki
volumes:
- name: config
configMap:
name: loki-config
- name: storage
persistentVolumeClaim:
claimName: loki-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-config
namespace: observability
data:
config.yaml: |
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
instance_addr: 127.0.0.1
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
storage_config:
filesystem:
directory: /loki/storage
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: loki_index_
period: 24h
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
max_query_length: 721h
max_query_parallelism: 32
ingestion_rate_mb: 20
ingestion_burst_size_mb: 30
per_stream_rate_limit: 5MB
per_stream_rate_limit_burst: 10MB
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
delete_request_store: filesystem
analytics:
reporting_enabled: false
Promtail Log Collection
# promtail.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
namespace: observability
data:
config.yaml: |
server:
http_listen_port: 3101
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki.observability:3100/loki/api/v1/push
tenant_id: default
batchwait: 1s
batchsize: 1048576
timeout: 10s
scrape_configs:
# Collect Kubernetes pod logs
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
pipeline_stages:
# Parse the CRI container log format
- cri: {}
# Extract structured log fields
- json:
expressions:
level: level
msg: msg
trace_id: trace_id
span_id: span_id
# Promote extracted fields to labels
- labels:
level:
app:
# Embed these fields into the log line (pack stage) instead of keeping them as labels
- pack:
labels:
- level
- app
- namespace
relabel_configs:
- source_labels:
- __meta_kubernetes_pod_namespace
target_label: namespace
- source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- source_labels:
- __meta_kubernetes_pod_label_app
target_label: app
- source_labels:
- __meta_kubernetes_namespace
action: replace
target_label: namespace
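The json stage above only works if the application emits one JSON object per line with matching keys. A minimal sketch using winston (an assumption; any JSON logger works). Note that winston puts the text under message rather than msg, so either rename the field or adjust the json stage expressions accordingly:
// logger.ts - structured JSON logs that the Promtail pipeline above can parse
import winston from 'winston'

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json() // one line per entry: {"level":"info","message":"...","timestamp":"..."}
  ),
  transports: [new winston.transports.Console()]
})

// Extra fields become JSON keys that the pipeline (or LogQL's | json) can extract
logger.info('order created', { trace_id: 'abc123def456', order_id: 42 })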
LogQL Query Examples
# Basic log selector
{app="productpage", namespace="default"}
# Filter lines containing ERROR
{app="productpage"} |= "ERROR"
# Regex filter
{app="productpage"} |~ "error|exception|timeout"
# Exclude health-check lines
{app="productpage"} != "health check" |~ "ERROR"
# Parse JSON logs
{app="productpage"} | json | level="error" | line_format "{{.timestamp}} [{{.level}}] {{.msg}}"
# Errors per minute
sum(count_over_time({app="productpage"} |= "error" [1m]))
# Correlate with a trace via trace_id
{app="productpage"} | json | trace_id="abc123def456"
# Error ratio per service
sum by (app) (
count_over_time({namespace="default"} |= "error" [5m])
)
/
sum by (app) (
count_over_time({namespace="default"} [5m])
)
# Top 10 most frequent error messages
topk(10,
sum by (msg) (
count_over_time({namespace="default"} | json | level="error" [1h])
)
)
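These queries can also be run outside Grafana against Loki's HTTP API, which is handy for smoke tests and custom tooling. A minimal sketch; the endpoint matches the in-cluster Service above, and the label values are assumptions:
// loki-query.ts - run a LogQL query via Loki's query_range endpoint
const LOKI_URL = 'http://loki.observability:3100'

async function queryLoki(logql: string, minutes = 5) {
  const endNs = BigInt(Date.now()) * 1_000_000n // Loki expects nanosecond timestamps
  const startNs = endNs - BigInt(minutes) * 60_000_000_000n
  const params = new URLSearchParams({
    query: logql,
    start: startNs.toString(),
    end: endNs.toString(),
    limit: '100'
  })
  const res = await fetch(`${LOKI_URL}/loki/api/v1/query_range?${params}`)
  if (!res.ok) throw new Error(`Loki query failed: ${res.status}`)
  return res.json()
}

queryLoki('{app="productpage"} |= "error"')
  .then((r) => console.log(JSON.stringify(r.data.result, null, 2)))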
Tempo Distributed Tracing
Deploying Tempo
# tempo.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tempo
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: tempo
template:
metadata:
labels:
app: tempo
spec:
containers:
- name: tempo
image: grafana/tempo:2.3.1
args:
- "-config.file=/etc/tempo/config.yaml"
ports:
- containerPort: 3200
name: http
- containerPort: 4317
name: grpc-otlp
- containerPort: 4318
name: http-otlp
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
volumeMounts:
- name: config
mountPath: /etc/tempo
- name: storage
mountPath: /var/tempo
volumes:
- name: config
configMap:
name: tempo-config
- name: storage
persistentVolumeClaim:
claimName: tempo-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
name: tempo-config
namespace: observability
data:
config.yaml: |
server:
http_listen_port: 3200
log_level: info
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
ingester:
max_block_duration: 5m
trace_idle_period: 10s
complete_block_timeout: 30m
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
pool:
max_workers: 100
queue_depth: 10000
metrics_generator:
registry:
external_labels:
source: tempo
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus.observability:9090/api/v1/write
send_exemplars: true
overrides:
defaults:
metrics_generator:
processors: [service-graphs, span-metrics]
generate_native_histograms: both
TraceQL Queries
# Look up a specific trace by ID
{ trace:id = "abc123def456" }
# Find slow requests (longer than 1 second)
{ span.http.status_code = 200 && duration > 1s }
# Find failed requests
{ status = error }
# Filter by service name
{ resource.service.name = "productpage" }
# Slow requests on a specific HTTP route
{ span.http.route = "/api/reviews" && duration > 500ms }
# Structural query: service call relationship
{ span.http.method = "GET" } -> { resource.service.name = "reviews" }
# Latency distribution via TraceQL metrics (requires Tempo's metrics-generator)
{ resource.service.name = "productpage" } | quantile_over_time(duration, .5, .9, .99)
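TraceQL can likewise be executed against Tempo's search API, for example to verify that spans are actually being ingested. A minimal sketch; the endpoint matches the Deployment above, and only the q and limit parameters are used because the API surface differs slightly across Tempo versions:
// tempo-search.ts - run a TraceQL search against Tempo
const TEMPO_URL = 'http://tempo.observability:3200'

async function searchTraces(traceql: string, limit = 20): Promise<string[]> {
  const params = new URLSearchParams({ q: traceql, limit: String(limit) })
  const res = await fetch(`${TEMPO_URL}/api/search?${params}`)
  if (!res.ok) throw new Error(`Tempo search failed: ${res.status}`)
  const body = await res.json()
  // Each hit carries a traceID that can be opened in Grafana Explore
  return (body.traces ?? []).map((t: { traceID: string }) => t.traceID)
}

searchTraces('{ resource.service.name = "productpage" && status = error }').then(console.log)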
Grafana Dashboards
Data Source Configuration
# grafana-datasources.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: observability
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus.observability:9090
isDefault: true
editable: true
jsonData:
exemplarTraceIdDestinations:
- name: TraceID
datasourceUid: tempo
url: '/explore?left={"queries":[{"expr":"${__value.text}","refId":"trace"}],"datasource":"tempo"}'
- name: Loki
type: loki
access: proxy
url: http://loki.observability:3100
editable: true
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"(\w+)"'
name: TraceID
url: '/explore?left={"queries":[{"expr":"${__value.text}","refId":"trace"}],"datasource":"tempo"}'
- name: Tempo
type: tempo
access: proxy
url: http://tempo.observability:3200
editable: true
jsonData:
tracesToMetrics:
datasourceUid: prometheus
tags:
- key: service.name
value: resource.service.name
tracesToLogsV2:
datasourceUid: loki
spanStartTimeShift: '-1h'
spanEndTimeShift: '1h'
filterByTraceID: true
filterBySpanID: true
customQuery: true
query: '{namespace="$${__span.tags.namespace}"} |= "$${__span.traceId}"'
nodeGraph:
enabled: true
Key Dashboard JSON Configuration
{
"dashboard": {
"title": "Service Overview",
"panels": [
{
"title": "Request rate (QPS)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Error rate (%)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"targets": [
{
"expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
"legendFormat": "{{service}}"
}
]
},
{
"title": "P99 latency",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_ms_bucket[5m])) by (le, service))",
"legendFormat": "{{service}} p99"
}
]
},
{
"title": "Recent error logs",
"type": "logs",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"targets": [
{
"expr": "{namespace=\"default\"} |= \"error\" | json | level != \"info\""
}
]
}
]
}
}
Alerting Rules
# alerting-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-alert-rules
namespace: observability
data:
alerts.yaml: |
groups:
- name: service-alerts
interval: 30s
rules:
# High error-rate alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 3m
labels:
severity: critical
team: platform
annotations:
summary: "Service {{ $labels.service }} has a high error rate"
description: "Error rate for service {{ $labels.service }} over the last 5 minutes is {{ $value | humanizePercentage }}"
# High-latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_ms_bucket[5m])) by (le, service)
) > 3000
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Service {{ $labels.service }} has high P99 latency"
description: "P99 latency for service {{ $labels.service }} is {{ $value }}ms"
# Pod restart alert
- alert: PodRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} is restarting frequently"
# Log-based error alert (this LogQL expression must be evaluated by the Loki ruler, not Prometheus)
- alert: LogErrors
expr: |
sum(count_over_time({namespace="default"} |= "error" | json | level="error" [5m])) > 100
for: 2m
labels:
severity: warning
annotations:
summary: "Abnormal number of error log entries"
SLO/SLI Management
Defining and Monitoring SLOs
# slo/availability-slo.yaml
# Generate SLO recording and alerting rules with Sloth
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: productpage-slo
namespace: observability
spec:
service: "productpage"
labels:
team: "platform"
slos:
# Availability SLO: 99.9%
- name: "availability"
objective: 99.9
description: "99.9% of requests to the productpage service return a non-5xx status code"
sli:
events:
errorQuery: sum(rate(http_requests_total{service="productpage",code=~"5.."}[{{.window}}]))
totalQuery: sum(rate(http_requests_total{service="productpage"}[{{.window}}]))
alerting:
name: ProductpageAvailabilitySLO
labels:
severity: critical
annotations:
summary: "Productpage availability SLO is at risk of being breached"
pageAlert:
labels:
severity: critical
routing_key: platform-oncall
ticketAlert:
labels:
severity: warning
alerting:
name: ProductpageAvailabilityBurnRate
shortWindow:
burnRateThreshold: 14.4
window: 5m
longWindow:
burnRateThreshold: 14.4
window: 1h
# Latency SLO: P99 < 2s
- name: "latency"
objective: 99.0
description: "99% of requests to the productpage service complete within 2 seconds"
sli:
events:
errorQuery: |
sum(rate(http_request_duration_ms_bucket{service="productpage",le="2000"}[{{.window}}]))
totalQuery: |
sum(rate(http_requests_total{service="productpage"}[{{.window}}]))
SLO Burn Rate Alerting
# Error-budget burn-rate calculation
# For a 30-day SLO window, a 99.9% SLO allows 43.2 minutes of unavailability
# Burn rate 1x = normal consumption (the budget lasts the full 30 days)
# Burn rate 14.4x = budget exhausted in about 2 days (page immediately)
# Burn rate 6x = budget exhausted in about 5 days (needs attention)
# Multi-window burn-rate alerting (recommended):
# Fast burn: 5m-window burn rate > 14.4 AND 1h-window burn rate > 14.4 -> Page
# Slow burn: 30m-window burn rate > 6 AND 6h-window burn rate > 6 -> Ticket
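The thresholds above follow directly from the SLO arithmetic; a small worked example makes the relationship explicit:
// burn-rate.ts - error-budget arithmetic for a 30-day, 99.9% SLO
const SLO = 0.999
const WINDOW_DAYS = 30

// Error budget expressed as downtime: (1 - 0.999) * 30 d * 24 h * 60 min = 43.2 minutes
const budgetMinutes = (1 - SLO) * WINDOW_DAYS * 24 * 60
console.log(`error budget: ${budgetMinutes.toFixed(1)} minutes`) // 43.2

// At burn rate N the budget is exhausted in WINDOW_DAYS / N days
const daysToExhaust = (burnRate: number) => WINDOW_DAYS / burnRate
console.log(daysToExhaust(14.4).toFixed(1)) // ~2.1 days -> page
console.log(daysToExhaust(6).toFixed(1))    // 5.0 days  -> ticket

// Burn rate observed in a window = error ratio in that window / (1 - SLO)
const burnRate = (errorRatio: number) => errorRatio / (1 - SLO)
console.log(burnRate(0.0144).toFixed(1)) // 14.4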
Signal Correlation
Correlating Traces with Logs and Metrics
// Inject the trace ID into application logs
import { trace, context, SpanStatusCode } from '@opentelemetry/api'
import winston from 'winston'
const tracer = trace.getTracer('my-service')
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
})
// Logging helper that automatically attaches trace_id and span_id
export function traceAwareLog(level: string, message: string, meta?: any) {
const span = trace.getSpan(context.active())
const traceId = span?.spanContext().traceId || ''
const spanId = span?.spanContext().spanId || ''
logger.log(level, message, {
...meta,
trace_id: traceId,
span_id: spanId
})
}
// Usage example
app.get('/api/reviews', async (req, res) => {
const span = tracer.startSpan('get-reviews')
await context.with(trace.setSpan(context.active(), span), async () => {
try {
traceAwareLog('info', 'Fetching review list')
const reviews = await fetchReviews()
// Add span attributes
span.setAttribute('reviews.count', reviews.length)
span.setAttribute('http.status_code', 200)
traceAwareLog('info', `Fetched ${reviews.length} reviews`)
res.json(reviews)
} catch (error) {
span.setAttribute('http.status_code', 500)
span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message })
traceAwareLog('error', 'Failed to fetch reviews', { error: (error as Error).message })
res.status(500).json({ error: 'Internal Server Error' })
} finally {
span.end()
}
})
})
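Correlation only works if the trace context actually crosses service boundaries. The auto-instrumentations handle this for supported HTTP clients and servers; for anything custom (message queues, webhooks, batch jobs) the context has to be injected and extracted explicitly with the W3C Trace Context propagator. A minimal sketch using the OpenTelemetry propagation API; the queue client in the usage comment is hypothetical:
// propagation.ts - manually propagate W3C Trace Context (traceparent / tracestate)
import { context, propagation, trace } from '@opentelemetry/api'

// Producer side: inject the active context into a carrier such as message headers
export function injectTraceContext(): Record<string, string> {
  const carrier: Record<string, string> = {}
  propagation.inject(context.active(), carrier)
  // carrier now looks like { traceparent: '00-<trace-id>-<span-id>-01', ... }
  return carrier
}

// Consumer side: restore the context from the carrier before starting new spans
export function withExtractedContext<T>(headers: Record<string, string>, fn: () => T): T {
  const extracted = propagation.extract(context.active(), headers)
  return context.with(extracted, fn)
}

// Usage sketch (the queue client is hypothetical):
//   producer: queue.publish({ body, headers: injectTraceContext() })
//   consumer: withExtractedContext(msg.headers, () =>
//     trace.getTracer('worker').startActiveSpan('process-message', (span) => { /* ... */ span.end() }))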
Cost Optimization
Storage Strategy
# Cost-optimization settings
# Loki data retention
# Hot data: 7 days (SSD)
# Warm data: 30 days (HDD)
# Cold data: 90 days (compressed object storage)
# Tempo retention
# traces: 14 days
# metrics: 30 days
# Prometheus retention
# --storage.tsdb.retention.time=30d
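Putting rough numbers on the tiers shows why this matters. A back-of-the-envelope sketch; the ingest volume and per-GB prices are made-up assumptions to be replaced with your own:
// storage-cost.ts - rough steady-state cost model for tiered log retention
interface Tier { name: string; days: number; pricePerGbMonth: number }

const dailyIngestGb = 50 // assumed log volume per day
const tiers: Tier[] = [
  { name: 'hot (SSD)', days: 7, pricePerGbMonth: 0.20 },
  { name: 'warm (HDD)', days: 23, pricePerGbMonth: 0.05 },          // days 8-30
  { name: 'cold (object storage)', days: 60, pricePerGbMonth: 0.01 } // days 31-90
]

for (const t of tiers) {
  const residentGb = dailyIngestGb * t.days // data resident in this tier at steady state
  const monthlyCost = residentGb * t.pricePerGbMonth
  console.log(`${t.name}: ~${residentGb} GB resident, ~$${monthlyCost.toFixed(2)}/month`)
}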
Sampling Strategy
# Tail-sampling configuration for the otel-collector
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
# Keep 100% of traces that contain errors
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
# Keep 100% of slow traces
- name: slow
type: latency
latency:
threshold_ms: 1000
# Sample 10% of normal traffic
- name: normal
type: probabilistic
probabilistic:
sampling_percentage: 10
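Tail sampling at the Collector can also be combined with head sampling in the SDK to cut export volume at the source. A minimal sketch of what tracing.ts above would gain; the 10% ratio mirrors the policy above. Note that spans dropped by head sampling never reach the Collector, so the error and latency tail policies only see the traffic that survives it:
// Addition to tracing.ts: parent-based head sampling in the SDK
import { NodeSDK } from '@opentelemetry/sdk-node'
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

const sampledSdk = new NodeSDK({
  // Follow the parent's decision when one exists; sample 10% of new root traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1)
  })
  // ...resource, spanProcessors, metricReader as configured earlier
})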
Self-Hosted vs. SaaS
Option Comparison
| Dimension | Self-hosted (Grafana stack) | SaaS (Datadog / New Relic) |
|---|---|---|
| Initial cost | Low (open source) | High (usage-based billing) |
| Operational cost | High (you run the infrastructure) | Low (managed service) |
| Customizability | High | Medium |
| Ramp-up speed | Slow | Fast |
| Data control | In-house | Depends on a third party |
| Scalability | Scale it yourself | Automatic |
| Best fit | Mid-size to large teams | Early, fast-moving stages |
Pros
- Unified view: logs, metrics, and traces are correlated seamlessly inside Grafana
- Fast troubleshooting: the trace ID ties logs, metrics, and traces together end to end
- Standardization: OpenTelemetry provides a vendor-neutral collection standard
- Open-source ecosystem: the entire Grafana stack is open source and free
- SLO-driven: error-budget-based alerting is more precise
Cons
- Learning curve: OpenTelemetry, LogQL, TraceQL, and friends all have to be learned
- Operational complexity: self-hosting means maintaining multiple components
- Storage cost: log and trace volumes are large, so storage is expensive
- Resource footprint: the OTel Collector and each backend need compute resources
- Cross-signal correlation needs configuration: links between data sources have to be wired up manually
Performance Considerations
- Batch exports: use BatchSpanProcessor to reduce the number of exporter calls
- Sampling control: sample 1%-10% of normal traffic, keep 100% of anomalous traffic
- OTel Collector resource limits: configure memory_limiter to prevent OOM kills
- Loki label hygiene: avoid high-cardinality labels (such as user_id) that bloat the index
- Prometheus metric hygiene: avoid high-cardinality labels and keep the series count under control
- Tempo storage: use object storage to lower cost
Summary
An observability platform is the black box (flight recorder) of a microservice architecture. With OpenTelemetry collecting telemetry in a unified way, Loki aggregating logs, Tempo storing traces, Prometheus managing metrics, and Grafana providing visualization, you get a complete observability stack. The key is to connect the three signals through the trace ID so that you can jump from an alert to the logs to the trace in one click.
Key Takeaways
- The three pillars: logs record events, metrics quantify trends, traces follow request paths
- OpenTelemetry is the CNCF standard for telemetry collection, with support for many languages and frameworks
- LogQL is Loki's query language; TraceQL is Tempo's
- SLO/SLI is the standard way to measure service quality, and burn rate is the core alerting algorithm
- Signal correlation links logs, metrics, and traces through the trace ID
- Grafana exemplars can link metrics directly to traces
Common Pitfalls
- Spreading effort evenly across all three pillars: metrics are for alerting, logs for debugging, traces for pinpointing; each has its own role
- Collecting everything at full volume: sampling is how you control cost; sample normal traffic lightly and keep anomalous traffic in full
- Watching only technical metrics: business metrics (order volume, payment success rate) matter just as much
- Inconsistent log formats: structured JSON logs are much easier to query and analyze than free text
- Ignoring trace-context propagation: cross-service calls must propagate the trace context correctly
- Too many or too few alerts: multi-window burn-rate alerting based on the error budget is the best practice
Learning Path
- Beginner: basic Prometheus metric collection and Grafana dashboards
- Intermediate: Loki log aggregation, Tempo tracing, OpenTelemetry integration
- Advanced: SLO/SLI management, custom metrics, sampling-strategy tuning
- Expert: multi-cluster observability, continuous profiling, AIOps integration
- Architect: observability platform design, cost optimization, multi-tenant isolation
When to Use
- Microservice architectures (systems with 5+ services)
- Teams that need to locate production faults quickly
- Strict availability requirements (SLA of 99.9% or higher)
- Distributed systems that need end-to-end tracing
- Day-to-day operations for DevOps/SRE teams
Adoption Advice
- Metrics first, then logs, then traces: introduce them in that order
- Consistent trace-ID propagation: make sure every service propagates and records the trace ID correctly
- Structured logging: use JSON logs everywhere and always include a trace_id field
- Build an SLO culture: define SLOs together with the business and make decisions based on the error budget
- Control cost: set sensible sampling rates and data-retention policies
- Automate alerting: use multi-window burn-rate alerting
Troubleshooting Checklist
Review Questions
- Why does OpenTelemetry unify the collection standard for logs, metrics, and traces?
- How does the W3C Trace Context standard propagate context across services?
- Why do SLO burn-rate alerts use a multi-window strategy?
- Why does Loki index only labels and not the log content itself?
- How do you control storage cost while keeping enough observability?
- What are the respective sweet spots of self-hosted and SaaS solutions?
