Chaos Engineering in Practice
Introduction
The goal of chaos engineering is not to "create failures" but to use controlled experiments to verify that a system can still maintain its critical business steady state under abnormal conditions. It is well suited to checking whether automatic recovery, rate limiting and degradation, timeouts and retries, alerting pipelines, and on-call response actually work, and it is especially useful for complex systems such as microservices, Kubernetes, distributed caches, primary-replica databases, and message queues.
The core idea of chaos engineering comes from Netflix's Chaos Monkey. The philosophy behind it: rather than waiting for production incidents to expose your system's weaknesses, proactively inject failures under controlled conditions to find and fix fragile points in advance. It is like seismic testing for a building: you don't want a real earthquake to be the first test of construction quality, so you simulate one in a lab to verify the structure's resilience.
Chaos Engineering Methodology
The Five-Step Experiment Method
# The five-step chaos experiment method (Principles of Chaos Engineering)

# Step 1: Define the steady state
# The key business indicators the system must satisfy during normal operation
# Error rate, latency, throughput, success rate, etc.

# Step 2: Hypothesize that the steady state will continue
# Assume the steady state holds in both the control group and the experiment group

# Step 3: Introduce variables (the experiment)
# Simulate events that can happen in the real world
# Server crashes, network latency, full disks, DNS failures, etc.

# Step 4: Observe the differences
# Compare steady-state indicators between the experiment group and the control group
# Look for evidence that the steady state has been broken

# Step 5: Review and improve
# Analyze the experiment findings
# Turn findings into engineering improvement items
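Steps 1 and 4 are mechanical enough to script. Below is a minimal steady-state probe as a sketch: it polls the error-rate indicator (the same PromQL used in the steady-state definition later in this article) and reports whether the steady state held over an observation window. The Prometheus address and thresholds are assumptions, not part of any particular setup.

# Minimal steady-state probe (hypothetical Prometheus URL and thresholds)
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumption: adjust to your stack
ERROR_RATE_QUERY = (
    "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])"
)
THRESHOLD = 0.01  # steady state: error rate < 1%

def error_rate() -> float:
    """Query Prometheus and return the current error rate (0.0 if no data)."""
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def steady_state_holds(duration_s: int = 300, interval_s: int = 15) -> bool:
    """Observe the indicator for the experiment window; fail fast on a breach."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        rate = error_rate()
        print(f"error_rate={rate:.4f} (threshold {THRESHOLD})")
        if rate >= THRESHOLD:
            return False  # steady state broken: stop the experiment and investigate
        time.sleep(interval_s)
    return True

if __name__ == "__main__":
    print("steady state held" if steady_state_holds() else "steady state BROKEN")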
Experiment Maturity Model

# Chaos engineering maturity model

# Level 1: Manual verification
# - Run chaos experiments by hand in a test environment
# - Verify basic fault-tolerance mechanisms (timeouts, retries, degradation)
# - Record experiment results in documents

# Level 2: Automated experiments
# - Automate experiments with tools such as Chaos Mesh or Litmus
# - Integrate experiments into the CI/CD pipeline
# - Verify steady-state indicators automatically

# Level 3: Continuous drills
# - Run GameDay drills on a regular schedule
# - Extend experiment coverage to production
# - Build a library of experiment templates

# Level 4: Chaos as a service
# - Offer chaos experiments as a platform
# - Trigger experiments on demand
# - Archive and track experiment results automatically

# Level 5: Resilience engineering culture
# - The whole team participates in chaos experiments
# - Experiments drive architecture evolution
# - Chaos experiments become routine engineering practice

Implementation
Define Steady-State Indicators and Experiment Hypotheses
# Example: steady-state indicator definition for the order service
steady_state:
  service: order-api
  environment: production
  indicators:
    - name: error_rate
      metric: "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])"
      target: "< 1%"
      severity: critical
    - name: p95_latency
      metric: "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
      target: "< 300ms"
      severity: warning
    - name: p99_latency
      metric: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
      target: "< 1000ms"
      severity: warning
    - name: order_success_rate
      metric: "rate(order_created_total{status='success'}[5m]) / rate(order_created_total[5m])"
      target: "> 99%"
      severity: critical
    - name: order_throughput
      metric: "rate(order_created_total[5m])"
      target: "> 100 / 5m"
      severity: warning
    - name: alert_delivery
      metric: "alertmanager notification latency"
      target: "critical alert within 2 minutes"
      severity: critical
hypothesis:
  - id: H001
    description: "After a single order-api Pod is killed, the error rate does not persistently exceed 1% within the following 5 minutes"
    blast_radius: low
    risk: low
  - id: H002
    description: "When the inventory service is delayed by 2 seconds, order-api triggers its degradation logic instead of timing out end to end"
    blast_radius: medium
    risk: medium
  - id: H003
    description: "When the Redis cache is unavailable, the system falls back to database queries and P99 latency stays under 3 seconds"
    blast_radius: medium
    risk: medium
  - id: H004
    description: "When the database primary fails, a replica takes over automatically within 30 seconds without data loss"
    blast_radius: high
    risk: high

# A chaos experiment should first answer explicit questions:
# 1. If a single instance dies, does the system recover automatically?
# 2. When a downstream dependency slows down, does the call chain avalanche?
# 3. Are alerts delivered within the expected time?
# 4. Are manual rollback and automatic recovery actually executable?
# 5. Do rate limiting and degradation work as expected?
# 6. Is data consistency preserved after a failure?

# Pre-experiment checklist
# 1. Monitoring dashboards are online and accessible
# 2. The alerting pipeline has been verified (test alerts actually get delivered)
# 3. The rollback owner is assigned and online
# 4. The change window has been approved
# 5. The blast radius is explicit and documented
# 6. Stop conditions are defined (the experiment stops automatically when metric thresholds fire)
# 7. The experiment record template is ready
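The first two checklist items can be verified automatically before every run. A minimal sketch, assuming hypothetical Grafana, Alertmanager, and Prometheus addresses (each exposes the health endpoint used below):

# Pre-flight checks before injecting any fault (hypothetical URLs)
import sys
import requests

CHECKS = {
    # assumption: replace with your real monitoring endpoints
    "grafana dashboard": "http://grafana:3000/api/health",
    "alertmanager": "http://alertmanager:9093/-/healthy",
    "prometheus": "http://prometheus:9090/-/healthy",
}

def preflight() -> bool:
    ok = True
    for name, url in CHECKS.items():
        try:
            resp = requests.get(url, timeout=5)
            status = "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}"
        except requests.RequestException as exc:
            status = f"UNREACHABLE ({exc})"
        if status != "OK":
            ok = False
        print(f"[preflight] {name}: {status}")
    return ok

if __name__ == "__main__":
    # Abort the experiment run if any dependency on the checklist is down
    sys.exit(0 if preflight() else 1)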
Kubernetes / Chaos Mesh Basic Experiments

# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-api-pod-kill
  namespace: chaos-testing
  labels:
    experiment: order-resilience
    hypothesis: H001
spec:
  action: pod-kill
  mode: one              # one = kill a single Pod
  # mode: all            # all = kill every matching Pod
  # mode: fixed          # fixed = kill a fixed number of Pods
  # mode: fixed-percent  # fixed-percent = kill a fixed percentage of Pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-api
  duration: "60s"
  scheduler:
    cron: "@every 5m"    # run on a schedule

# Pod failure experiment (the Pod stops responding but is not killed)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-api-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-api
  duration: "120s"

# Network delay experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-to-inventory-delay
  namespace: chaos-testing
  labels:
    experiment: order-resilience
    hypothesis: H002
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-api
  delay:
    latency: "2000ms"    # 2-second delay
    correlation: "100"   # delay correlation (0-100)
    jitter: "200ms"      # delay jitter
  direction: to          # to = outbound (order -> inventory)
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: inventory-api
    mode: all
  duration: "5m"

# Network packet-loss experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-to-payment-packet-loss
  namespace: chaos-testing
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-api
  loss:
    loss: "50"           # 50% packet loss
    correlation: "80"
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-api
    mode: all
  duration: "3m"

# Network partition experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-isolation
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-api
  direction: both        # both = isolate traffic in both directions
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: inventory-api
    mode: all
  duration: "2m"

# CPU stress experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: report-worker-cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: report-worker
  stressors:
    cpu:
      workers: 2         # number of stress workers
      load: 80           # CPU load percentage
  duration: "3m"

# Memory stress experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-memory-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  stressors:
    memory:
      workers: 1
      size: "256MB"      # amount of memory to occupy
  duration: "5m"

# IO stress experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: data-service-io-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: data-service
  delay:
    latency: "500ms"
    correlation: "80"
  path: "/data/*"
  duration: "5m"

# Time-offset experiment (verify scheduled jobs and certificate-expiry handling)
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: cert-renewal-time-shift
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: cert-renewer
  timeOffset: "-30d"     # shift time back 30 days
  duration: "10m"

# Apply the experiments
kubectl apply -f pod-chaos.yaml
kubectl apply -f network-chaos.yaml

# Check experiment status
kubectl get podchaos -A
kubectl get networkchaos -A
kubectl get stresschaos -A

# Inspect experiment details
kubectl describe networkchaos order-to-inventory-delay -n chaos-testing

# View experiment events
kubectl get events -n chaos-testing --sort-by='.lastTimestamp'

# Pause an experiment
kubectl annotate networkchaos order-to-inventory-delay \
  -n chaos-testing \
  experiment.chaos-mesh.org/pause="true"

# Resume an experiment
kubectl annotate networkchaos order-to-inventory-delay \
  -n chaos-testing \
  experiment.chaos-mesh.org/pause="false"
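The pause annotation also enables an automated stop condition: a watchdog can poll the steady-state indicator and pause the experiment the moment a hard threshold is breached. A minimal sketch, assuming a hypothetical Prometheus address; it reuses the same kubectl annotate command shown above.

# Stop-condition watchdog: auto-pause a Chaos Mesh experiment on a steady-state breach
import subprocess
import time
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumption
QUERY = "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])"
THRESHOLD = 0.05  # hard stop, well above the 1% steady-state target

def current_error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def pause_experiment(kind: str, name: str, namespace: str) -> None:
    # The same annotation as the manual pause command above
    subprocess.run(
        ["kubectl", "annotate", kind, name, "-n", namespace,
         "experiment.chaos-mesh.org/pause=true", "--overwrite"],
        check=True,
    )

if __name__ == "__main__":
    while True:
        rate = current_error_rate()
        print(f"error_rate={rate:.4f}")
        if rate >= THRESHOLD:
            print("stop condition hit, pausing experiment")
            pause_experiment("networkchaos", "order-to-inventory-delay", "chaos-testing")
            break
        time.sleep(15)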
Litmus Chaos Experiments

# Litmus Chaos: Pod delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-api-chaos
  namespace: litmus
spec:
  appinfo:
    appns: production
    applabel: "app=order-api"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
            - name: TARGET_PODS
              value: "1"

# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.6.yaml

# Create the ChaosServiceAccount
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scripts/master/pkg/litmus-account-generator/cluster-admin/rbac.yaml

# Check experiment results
kubectl get chaosexperiments -n litmus
kubectl get chaosresults -n litmus
kubectl logs -n litmus -l app=pod-delete -c chaos-runner

Application-Layer Degradation Verification
# Verify that call timeouts and degradation actually take effect
import requests
import time
import statistics

endpoint = "https://api.example.com/orders/preview"
payload = {"items": [{"skuId": 1001, "count": 2}]}

results = []    # latency of every completed response, in ms
errors = 0      # 5xx responses plus transport errors
timeouts = 0
fallbacks = 0

for i in range(50):
    start = time.time()
    try:
        resp = requests.post(endpoint, json=payload, timeout=5)
        cost = round((time.time() - start) * 1000, 2)
        results.append(cost)
        if resp.status_code >= 500:
            errors += 1
        # Check the degradation response header
        if resp.headers.get("X-Fallback") == "true":
            fallbacks += 1
        print(f"[{i}] status={resp.status_code} latency={cost}ms fallback={resp.headers.get('X-Fallback')}")
    except requests.exceptions.Timeout:
        timeouts += 1
        print(f"[{i}] TIMEOUT")
    except Exception as e:
        errors += 1
        print(f"[{i}] ERROR: {e}")
    time.sleep(1)

print("\n=== Summary ===")
print("Total requests: 50")
print(f"Completed: {len(results)}, Errors: {errors}, Timeouts: {timeouts}")
print(f"Fallbacks: {fallbacks}")
if results:
    results.sort()
    print(f"P50 latency: {statistics.median(results):.1f}ms")
    print(f"P95 latency: {results[min(int(len(results) * 0.95), len(results) - 1)]:.1f}ms")
    print(f"P99 latency: {results[min(int(len(results) * 0.99), len(results) - 1)]:.1f}ms")
# Load testing with wrk
wrk -t 4 -c 100 -d 60s --latency https://api.example.com/orders/preview \
  -s post.lua

# post.lua script
wrk.method = "POST"
wrk.body = '{"items":[{"skuId":1001,"count":2}]}'
wrk.headers["Content-Type"] = "application/json"

Observability Recommendations
# Suggested metrics to watch during experiments
observability:
  dashboards:
    - name: "Error Rate"
      panel: error_rate
      query: 'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])'
    - name: "P95 Latency"
      panel: p95_latency
      query: 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
    - name: "Throughput"
      panel: throughput
      query: 'rate(http_requests_total[5m])'
    - name: "Pod Restarts"
      panel: pod_restarts
      query: 'rate(kube_pod_container_status_restarts_total[5m])'
    - name: "Circuit Breaker"
      panel: circuit_breaker
      query: 'istio_circuit_breaker_opened_total'
  logs:
    key_fields:
      - trace_id
      - fallback_triggered
      - circuit_breaker_opened
      - retry_count
      - timeout_count
  alerts:
    - name: "High Error Rate"
      condition: "error_rate > 5%"
      severity: critical
    - name: "High Latency"
      condition: "p95_latency > 1000ms"
      severity: warning

Approval, Rollback, and Experiment Records
# Experiment record template
experiment_record:
  id: EXP-2026-0412-001
  name: inventory-delay-2026-04-12
  owner: sre-team
  env: staging
  start_time: "2026-04-12T14:00:00+08:00"
  end_time: "2026-04-12T14:10:00+08:00"
  duration: "10m"
  blast_radius: "order-api -> inventory-api only"
  hypothesis: H002
  rollback:
    command: "kubectl delete -f network-chaos.yaml"
    estimated_time: "30s"
    responsible: zhangsan
  success_criteria:
    - metric: order_success_rate
      threshold: ">= 99%"
    - metric: p95_latency
      threshold: "<= 800ms"
    - metric: alert_delivery
      threshold: "within 2 minutes"
  preconditions:
    - "Monitoring dashboard accessible"
    - "On-call team available"
    - "Rollback procedure documented"
    - "Change window approved"
  findings: []
  action_items: []
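A record template is only useful if it is enforced. The sketch below is a minimal pre-run validator, assuming the record above is saved as experiment.yaml; it requires PyYAML, which is an extra dependency, not something the original toolchain mandates.

# Pre-run validation of the experiment record (assumes experiment.yaml; requires PyYAML)
import sys
import yaml

REQUIRED_FIELDS = ["id", "owner", "env", "blast_radius", "hypothesis",
                   "rollback", "success_criteria", "preconditions"]

def validate(path: str) -> list[str]:
    with open(path) as f:
        record = yaml.safe_load(f)["experiment_record"]
    problems = [f"missing field: {field}"
                for field in REQUIRED_FIELDS if not record.get(field)]
    rollback = record.get("rollback") or {}
    # Without a rollback command and owner, the experiment must not start
    if not rollback.get("command") or not rollback.get("responsible"):
        problems.append("rollback must define both a command and a responsible person")
    return problems

if __name__ == "__main__":
    issues = validate("experiment.yaml")
    for issue in issues:
        print(f"[record] {issue}")
    sys.exit(1 if issues else 0)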
# End the experiment and restore the system
kubectl delete -f pod-chaos.yaml
kubectl delete -f network-chaos.yaml
kubectl delete -f stress-chaos.yaml

# Verify recovery
kubectl get pods -n production -l app=order-api
kubectl top pods -n production -l app=order-api

# Check for lingering impact
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
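One concrete check for lingering impact is to make sure no chaos objects survived the cleanup. A minimal sketch that shells out to kubectl and lists any remaining Chaos Mesh resources:

# List any Chaos Mesh resources that survived the cleanup
import json
import subprocess
import sys

# CRD plural names used by Chaos Mesh; extend as needed
CHAOS_KINDS = ["podchaos", "networkchaos", "stresschaos", "iochaos", "timechaos"]

def leftover_chaos() -> list[str]:
    found = []
    for kind in CHAOS_KINDS:
        proc = subprocess.run(
            ["kubectl", "get", kind, "-A", "-o", "json"],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:
            continue  # CRD not installed in this cluster; skip it
        for item in json.loads(proc.stdout).get("items", []):
            meta = item["metadata"]
            found.append(f"{kind}/{meta['namespace']}/{meta['name']}")
    return found

if __name__ == "__main__":
    leftovers = leftover_chaos()
    for name in leftovers:
        print(f"[leftover] {name}")
    sys.exit(1 if leftovers else 0)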
# The post-experiment review should answer at least:
# 1. Did the system behave as hypothesized? Was the steady state broken?
# 2. Did alerts fire as expected? With how much delay?
# 3. Did the team know how to recover? How long did recovery take?
# 4. Which link was most fragile? Did degradation/circuit breaking take effect?
# 5. What previously unrecognized problems were discovered?
# 6. Which engineering practices need improvement?

GameDay Drills
# GameDay drill process

# 1. Preparation (one week ahead)
# - Define drill goals and hypotheses
# - Prepare experiment scenarios and the script
# - Notify all participants
# - Prepare monitoring dashboards and a rollback plan

# 2. Drill start (day of the game)
# - Announce the start of the drill
# - Confirm everyone is in place
# - Run the experiments
# - Observe how the system reacts
# - Record findings

# 3. Drill end
# - Stop all experiments
# - Verify the system has recovered
# - Confirm all alerts have cleared

# 4. Review (1-3 days after the drill)
# - Organize the experiment findings
# - Hold a review meeting
# - Draw up an improvement plan
# - Track improvement items to completion

# GameDay playbook
game_day:
  name: "Production Resilience GameDay"
  date: "2026-04-12"
  participants:
    - role: facilitator
      name: zhangsan
    - role: observer
      name: lisi
    - role: sre_oncall
      name: wangwu
    - role: app_team
      name: zhaoliu
  schedule:
    - time: "14:00"
      action: "Drill begins; confirm everyone is in place"
    - time: "14:05"
      action: "Inject network delay (order -> inventory, 2s)"
      expected: "order-api triggers degradation; the error rate stays below 1%"
    - time: "14:10"
      action: "Observe for 5 minutes and record steady-state indicators"
    - time: "14:15"
      action: "Kill one order-api Pod"
      expected: "The Pod is rebuilt automatically; the error rate rises briefly and then recovers"
    - time: "14:20"
      action: "Remove all experiments"
    - time: "14:25"
      action: "Confirm the system has fully recovered"
    - time: "14:30"
      action: "Drill ends; the review begins"
Summary
The point of chaos engineering is not the experiments themselves but using experiments to verify system resilience. Only by first defining steady-state indicators, limiting the blast radius, and preparing a recovery plan, and then performing controlled injection on top of solid observability, can chaos experiments genuinely help a team find problems instead of creating bigger ones.
Key Takeaways
- Without steady-state indicators, there is no way to judge whether an experiment succeeded
- The blast radius must be controllable; start in test/staging environments or low-risk production windows
- Chaos experiments verify system resilience, not the performance of a single component
- Every experiment must have a rollback method, an owner, and stop conditions
- Findings must be turned into engineering improvement items, or the experiment has no value
- GameDay drills are an effective way for a team to build a resilience culture
Project Implementation Perspective
- Verify that a single Pod failure really does recover automatically
- Verify that upstream services degrade correctly when a downstream service has high latency
- Verify that the alerting platform and on-call chain respond in time
- Verify whether the business has single points of failure when the database, cache, or message queue fails
- Verify that rate limiting, circuit breaking, and degradation work as expected
- Verify that data consistency is preserved after failures
Common Pitfalls
- Running large experiments in production without monitoring or rollback in place
- Watching only technical metrics while ignoring business steady states such as order and payment success rates
- Skipping the post-experiment review, so that discovered problems are never fixed
- Starting with large-scale fault injection instead of piloting with a small blast radius
- Treating chaos engineering as "random destruction" without explicit experiment hypotheses
- Experimenting only in test environments and never validating real production behavior
Advanced Roadmap
- Build standard experiment templates with Chaos Mesh / Litmus / Gremlin
- Hook chaos experiments into the staging environment and routine drill schedules
- Establish GameDay mechanisms and runbooks for critical paths
- Combine chaos experiments with SLOs, capacity load testing, and disaster-recovery drills
- Build a chaos experiment platform with on-demand triggering and automated verification
- Bring chaos experiments into the CI/CD pipeline
When to Use
- Microservices, Kubernetes, and Service Mesh environments
- Business systems with strict requirements for high availability and automatic recovery
- Teams that need to verify degradation, circuit breaking, retries, and alerting pipelines
- Organizations where a platform team is driving resilience engineering
- Industries with extremely high availability requirements, such as finance and e-commerce
Adoption Advice
- Start with staging environments, small-scope experiments, and low-risk targets
- Verify one explicit hypothesis per experiment; never mix multiple variables in a single run
- Prepare dashboards, alerts, a rollback owner, and stop thresholds before each experiment
- During the review, turn discovered problems into concrete engineering improvement items
- Build a template library and gradually accumulate reusable experiment scenarios
- Organize GameDay drills regularly to build a resilience culture
Troubleshooting Checklist
- Check whether the experiment actually took effect, or whether the injection never matched its target
- Check whether anomalies come from the experiment itself or from pre-existing weaknesses in the business
- Check whether monitoring, logs, and alerts fully cover the experiment path
- Check whether resources fully recovered after the experiment, and whether any hidden impact remains
- Check whether the Chaos Mesh/Litmus Operator is running properly
- Check whether the experiment's label selector correctly matches the target Pods
Review Questions
- Which resilience hypothesis did this experiment verify, and what was the conclusion?
- If the experiment were scaled up by a factor of two, could the current system still cope?
- Are the alerting and recovery chains fast and clear enough?
- Which system weakness most deserves a spot in the next round of improvements?
- Did team members learn anything new from the experiment?
- Have the problems found in the experiment been turned into JIRA tickets?
