Distributed Tracing
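Introduction
Distributed tracing follows a request along its call chain across multiple microservices, which helps locate performance bottlenecks and failing nodes. Understanding the OpenTelemetry standard, W3C Trace Context propagation, and the tracing integration built into .NET 8 helps you build observable microservice systems.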
OpenTelemetry Integration
Basic Configuration
// dotnet add package OpenTelemetry.Extensions.Hosting
// dotnet add package OpenTelemetry.Exporter.Jaeger
// dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
// dotnet add package OpenTelemetry.Exporter.Prometheus.AspNetCore
// dotnet add package OpenTelemetry.Instrumentation.AspNetCore
// dotnet add package OpenTelemetry.Instrumentation.Http
// dotnet add package OpenTelemetry.Instrumentation.EntityFrameworkCore
// dotnet add package OpenTelemetry.Instrumentation.GrpcNetClient
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing
            // Resource: identifies the service that emits the telemetry
            .SetResourceBuilder(ResourceBuilder.CreateDefault()
                .AddService(
                    serviceName: "order-service",
                    serviceVersion: "1.0.0",
                    serviceInstanceId: Environment.MachineName))
            // Automatic tracing of incoming ASP.NET Core requests
            .AddAspNetCoreInstrumentation(options =>
            {
                options.Filter = context =>
                    !context.Request.Path.StartsWithSegments("/health"); // skip health checks
                options.EnrichWithHttpRequest = (activity, request) =>
                {
                    activity.SetTag("http.request.body.size", request.ContentLength);
                    activity.SetTag("http.user_agent", request.Headers["User-Agent"].ToString());
                };
                options.EnrichWithHttpResponse = (activity, response) =>
                {
                    activity.SetTag("http.response.body.size", response.ContentLength);
                };
            })
            // Automatic tracing of outgoing HttpClient calls
            .AddHttpClientInstrumentation(options =>
            {
                options.FilterHttpRequestMessage = request =>
                {
                    // Skip calls to localhost (for example internal health checks);
                    // requests without a URI are still traced
                    return !request.RequestUri?.Host.Contains("localhost") ?? true;
                };
            })
            // EF Core query tracing
            .AddEntityFrameworkCoreInstrumentation(options =>
            {
                options.SetDbStatementForText = true;
                options.SetDbStatementForStoredProcedure = true;
            })
            // gRPC client tracing
            .AddGrpcClientInstrumentation()
            // Custom ActivitySources (wildcard match on the source name)
            .AddSource("MyApp.*")
            // Exporter: Jaeger agent. The Jaeger exporter package is deprecated upstream;
            // recent Jaeger versions ingest OTLP directly.
            .AddJaegerExporter(options =>
            {
                options.AgentHost = builder.Configuration["Jaeger:Host"] ?? "localhost";
                options.AgentPort = builder.Configuration.GetValue<int>("Jaeger:Port", 6831);
            })
            // Alternative: generic OTLP export. Keep only one of the two exporters,
            // otherwise every span is exported twice.
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri(builder.Configuration["Otlp:Endpoint"] ?? "http://localhost:4317");
            });
    })
    .WithMetrics(metrics =>
    {
        metrics
            .SetResourceBuilder(ResourceBuilder.CreateDefault()
                .AddService("order-service"))
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddMeter("MyApp.*")
            .AddPrometheusExporter();
    });

var app = builder.Build();
// Expose the Prometheus scraping endpoint
app.MapPrometheusScrapingEndpoint();

Custom Spans
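The custom-span code in this section calls `activity?.SetTag(...)` with a null-conditional operator throughout. The reason is that `ActivitySource.StartActivity` returns `null` when nothing is listening to the source. The following is a minimal sketch using only `System.Diagnostics`; the source name `Demo.Orders` and the hand-written `ActivityListener` are assumptions made for illustration (in a real application the OpenTelemetry SDK registers the listener when `AddSource` is called).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical source name, chosen only for this sketch.
var source = new ActivitySource("Demo.Orders");

// No listener is registered yet, so StartActivity returns null.
Activity? before = source.StartActivity("CreateOrder");
Console.WriteLine(before is null); // True

// Register a listener that records every activity from this source.
var stopped = new List<string>();
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Orders",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
    ActivityStopped = a => stopped.Add(a.OperationName)
};
ActivitySource.AddActivityListener(listener);

string? parentSpanId;
string? childParentSpanId;
using (var parent = source.StartActivity("CreateOrder"))
{
    parentSpanId = parent?.SpanId.ToString();
    // Started while "CreateOrder" is Activity.Current, so it becomes its child.
    using (var child = source.StartActivity("ValidateOrder"))
    {
        childParentSpanId = child?.ParentSpanId.ToString();
    }
}

// The child is disposed first, so it stops before its parent.
Console.WriteLine(string.Join(" -> ", stopped));
```

Scoping each child in its own `using (...) { }` block is what makes it end where the step ends; a `using var` declaration would keep it open until the enclosing method returns.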
using System.Diagnostics;
using OpenTelemetry.Trace; // RecordException extension method

// Define the ActivitySources
public static class Tracing
{
    public static readonly ActivitySource OrderActivity = new("MyApp.Orders", "1.0.0");
    public static readonly ActivitySource PaymentActivity = new("MyApp.Payments", "1.0.0");
}
// Create custom spans inside a service
public class OrderService
{
    private readonly ILogger<OrderService> _logger;
    // _repository and _eventBus are injected dependencies (declarations omitted)
    public async Task<OrderDto> CreateOrderAsync(CreateOrderCommand command, CancellationToken ct)
    {
        // Create the span; StartActivity returns null when nothing is listening
        using var activity = Tracing.OrderActivity.StartActivity("CreateOrder", ActivityKind.Internal);
        // Add tags
        activity?.SetTag("order.user_id", command.UserId);
        activity?.SetTag("order.item_count", command.Items.Count);
        activity?.SetTag("order.total_amount", command.Items.Sum(i => i.Price * i.Quantity));
        try
        {
            // Each child span gets its own using block so that it ends when the step
            // ends; with "using var" the spans would stay open and nest inside each other.
            // Validation
            using (Tracing.OrderActivity.StartActivity("ValidateOrder"))
            {
                ValidateOrder(command);
            }
            // Pricing
            using (var pricingSpan = Tracing.OrderActivity.StartActivity("CalculatePricing"))
            {
                pricingSpan?.SetTag("pricing.strategy", "standard");
                var totalAmount = CalculateTotal(command.Items);
            }
            // Save to the database (EF Core creates its span automatically)
            var order = new Order { /* ... */ };
            await _repository.SaveAsync(order, ct);
            // Publish the event
            using (Tracing.OrderActivity.StartActivity("PublishOrderCreated"))
            {
                await _eventBus.PublishAsync(new OrderCreatedEvent(order.Id), ct);
            }
            // Add a span event
            activity?.AddEvent(new ActivityEvent("OrderCreated", tags: new ActivityTagsCollection
            {
                ["order.id"] = order.Id.ToString()
            }));
            return new OrderDto(order);
        }
        catch (Exception ex)
        {
            // Record the error on the span
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.SetTag("error.type", ex.GetType().Name);
            activity?.RecordException(ex);
            throw;
        }
    }
}
// Fine-grained tracing of a database operation
public class OrderRepository
{
    private readonly AppDbContext _db;
    public async Task<Order?> GetByIdAsync(Guid id, CancellationToken ct)
    {
        using var activity = Tracing.OrderActivity.StartActivity("GetOrderById", ActivityKind.Client);
        activity?.SetTag("db.operation", "SELECT");
        activity?.SetTag("db.table", "Orders");
        activity?.SetTag("order.id", id.ToString());
        var order = await _db.Orders
            .Include(o => o.Items)
            .FirstOrDefaultAsync(o => o.Id == id, ct);
        activity?.SetTag("db.result", order != null ? "found" : "not_found");
        return order;
    }
}
// Trace propagation on an HTTP client call
// (requires using System.Net.Http.Json for PostAsJsonAsync / ReadFromJsonAsync)
public class PaymentServiceClient
{
    private readonly HttpClient _httpClient;
    public async Task<PaymentResult> ChargeAsync(Guid orderId, decimal amount, CancellationToken ct)
    {
        using var activity = Tracing.PaymentActivity.StartActivity("ChargePayment", ActivityKind.Client);
        activity?.SetTag("payment.order_id", orderId.ToString());
        activity?.SetTag("payment.amount", amount);
        // HttpClient adds the traceparent header to the outgoing request
        var response = await _httpClient.PostAsJsonAsync("/api/payments/charge", new
        {
            OrderId = orderId,
            Amount = amount
        }, ct);
        response.EnsureSuccessStatusCode();
        var result = await response.Content.ReadFromJsonAsync<PaymentResult>(ct);
        activity?.SetTag("payment.transaction_id", result?.TransactionId);
        return result!;
    }
}

Trace Context Propagation
W3C Standard Format
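To make the `traceparent` layout concrete, here is a small sketch that builds the header value by hand and parses it back with `ActivityContext.TryParse`. It relies only on `System.Diagnostics`; the ids are random and the variable names are chosen for this illustration.

```csharp
using System;
using System.Diagnostics;

// Build a context with random ids and the "sampled" flag set.
var original = new ActivityContext(
    ActivityTraceId.CreateRandom(),
    ActivitySpanId.CreateRandom(),
    ActivityTraceFlags.Recorded);

// Layout: version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags.
string flags = ((int)original.TraceFlags).ToString("x2");
string traceparent = $"00-{original.TraceId}-{original.SpanId}-{flags}";
Console.WriteLine(traceparent);

// Parse the header back into a context, as a receiving service would.
bool ok = ActivityContext.TryParse(traceparent, null, out ActivityContext parsed);
Console.WriteLine($"{ok} {parsed.TraceId} {parsed.SpanId} {parsed.TraceFlags}");
```

The flags field carries the sampling decision, which is why hard-coding it to `01` when forwarding a context can mark an unsampled trace as sampled.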
// W3C Trace Context header format:
// traceparent: 00-{traceId}-{spanId}-{flags}
// tracestate: vendor-specific key=value pairs
// ASP.NET Core and HttpClient propagate these HTTP headers automatically,
// so HTTP calls need no manual handling.
// Manual propagation (non-HTTP scenarios such as message queues)
public class TracedMessagePublisher
{
    private readonly IConnection _connection;
    public Task PublishAsync<T>(T message, Activity? parentActivity = null)
    {
        using var channel = _connection.CreateModel();
        var properties = channel.CreateBasicProperties();
        // Propagate the trace context in the message headers
        var current = parentActivity ?? Activity.Current;
        if (current?.Id != null)
        {
            properties.Headers = new Dictionary<string, object>
            {
                // With the default W3C id format, Activity.Id is already the traceparent
                // value and carries the real sampling flag instead of a hard-coded "01".
                ["traceparent"] = current.Id,
                ["tracestate"] = current.TraceStateString ?? ""
            };
        }
        var body = JsonSerializer.SerializeToUtf8Bytes(message);
        channel.BasicPublish("events", typeof(T).Name, properties, body);
        return Task.CompletedTask;
    }
}
// Restore the trace context on the consumer side
public class TracedMessageConsumer
{
    public async Task HandleAsync(BasicDeliverEventArgs ea, CancellationToken ct)
    {
        // Read the trace context from the message headers
        // (RabbitMQ delivers string header values as byte[])
        string? traceparent = null;
        if (ea.BasicProperties.Headers?.TryGetValue("traceparent", out var tpObj) == true)
        {
            traceparent = Encoding.UTF8.GetString((byte[])tpObj);
        }
        // Create a span linked to the publisher's trace
        var parentContext = traceparent != null
            ? ActivityContext.Parse(traceparent, null)
            : default;
        using var activity = Tracing.OrderActivity.StartActivity(
            $"Process_{ea.RoutingKey}",
            ActivityKind.Consumer,
            parentContext);
        activity?.SetTag("messaging.system", "rabbitmq");
        activity?.SetTag("messaging.destination", ea.RoutingKey);
        activity?.SetTag("messaging.message_id", ea.BasicProperties.MessageId);
        // Handle the message...
    }
}

Sampling Strategy
Sampling Configuration
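Ratio-based head sampling works because the decision is a pure function of the trace id: every service that sees the same trace id and uses the same ratio reaches the same verdict without coordination. The sketch below illustrates that idea with a hypothetical `ShouldSample` helper; it is not the algorithm used by `TraceIdRatioBasedSampler`, only an assumption-laden illustration of the principle.

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;

// Hypothetical helper: maps the first 8 bytes of the trace id onto [0, 2^64)
// and samples the trace when the value falls below ratio * 2^64.
static bool ShouldSample(ActivityTraceId traceId, double ratio)
{
    if (ratio >= 1.0) return true;
    if (ratio <= 0.0) return false;
    Span<byte> bytes = stackalloc byte[16];
    traceId.CopyTo(bytes);
    ulong value = BinaryPrimitives.ReadUInt64BigEndian(bytes);
    return value < (ulong)(ratio * ulong.MaxValue);
}

// Roughly 10% of random trace ids should be sampled.
int sampled = 0;
for (int i = 0; i < 10_000; i++)
{
    if (ShouldSample(ActivityTraceId.CreateRandom(), 0.1)) sampled++;
}
Console.WriteLine($"sampled {sampled} of 10000");

// The same trace id always yields the same decision.
var fixedId = ActivityTraceId.CreateFromString("0af7651916cd43dd8448eb211c80319c".AsSpan());
bool first = ShouldSample(fixedId, 0.5);
bool second = ShouldSample(fixedId, 0.5);
Console.WriteLine(first == second); // True
```

Because the decision is taken before the request runs, a head sampler of this kind cannot know whether the request will fail or be slow.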
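// Head sampling: the decision is made when the trace starts
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        // SetSampler replaces any sampler set earlier, so call it only once.
        tracing.SetSampler(new TraceIdRatioBasedSampler(0.1)); // sample 10%
        // Or use a custom sampler instead:
        // tracing.SetSampler(new SmartSampler());
    });
// Custom sampler (requires using System.Linq)
public class SmartSampler : Sampler
{
    private readonly TraceIdRatioBasedSampler _baseSampler = new(0.1); // 10% by default
    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Never sample health checks. Which tags exist at span start depends on the
        // instrumentation; "http.path" is the key assumed here.
        if (parameters.Tags?.Any(t => t.Key == "http.path" && t.Value?.ToString()?.StartsWith("/health") == true) == true)
        {
            return new SamplingResult(false);
        }
        // A head sampler runs before the request is handled, so the status code and
        // the duration are normally not known yet. The two checks below only match
        // when the caller supplies these tags at span start; to keep every error and
        // slow request reliably, use tail sampling (for example in the OpenTelemetry Collector).
        // Sample all error requests
        if (parameters.Tags?.Any(t => t.Key == "http.status_code" && t.Value?.ToString() == "500") == true)
        {
            return new SamplingResult(true);
        }
        // Sample all slow requests
        if (parameters.Tags?.Any(t => t.Key == "http.duration" &&
            double.TryParse(t.Value?.ToString(), out var d) && d > 1000) == true)
        {
            return new SamplingResult(true);
        }
        // Everything else is sampled by ratio
        return _baseSampler.ShouldSample(parameters);
    }
}

Metrics Collection and Prometheus
Custom Metrics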
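Before wiring instruments into an exporter, it can help to see what an instrument actually emits. The sketch below uses a `MeterListener` from `System.Diagnostics.Metrics` to receive measurements in-process; the meter name `Demo.Orders` and the instrument names are assumptions for illustration, and the listener stands in for what the OpenTelemetry SDK does after `AddMeter`.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instruments for this sketch.
using var meter = new Meter("Demo.Orders");
var created = meter.CreateCounter<long>("orders.created");
var active = meter.CreateUpDownCounter<int>("orders.active");

long createdTotal = 0;
int activeNow = 0;

using var listener = new MeterListener();
// Subscribe only to instruments that belong to our meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Orders") l.EnableMeasurementEvents(instrument);
};
// One callback per measurement type; each one just aggregates a running sum.
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) => createdTotal += value);
listener.SetMeasurementEventCallback<int>((inst, value, tags, state) => activeNow += value);
listener.Start();

// UpDownCounter exposes Add with a positive or negative delta.
active.Add(1);
active.Add(1);
created.Add(1, new KeyValuePair<string, object?>("channel", "web"));
active.Add(-1);

Console.WriteLine($"created={createdTotal}, active={activeNow}");
```

A counter only ever goes up, while an up-down counter tracks a value that rises and falls, which is why the active-order count uses the latter.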
/// <summary>
/// Custom metrics built on System.Diagnostics.Metrics
/// </summary>
public class OrderMetrics
{
    private readonly Counter<long> _ordersCreated;
    private readonly Counter<long> _ordersFailed;
    private readonly Histogram<double> _orderAmount;
    private readonly UpDownCounter<int> _activeOrders;
    private readonly ObservableGauge<int> _pendingOrders;
    public OrderMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("MyApp.Orders");
        _ordersCreated = meter.CreateCounter<long>(
            "orders.created.total",
            description: "Total number of orders created");
        _ordersFailed = meter.CreateCounter<long>(
            "orders.failed.total",
            description: "Total number of failed orders");
        _orderAmount = meter.CreateHistogram<double>(
            "orders.amount",
            unit: "CNY",
            description: "Distribution of order amounts");
        _activeOrders = meter.CreateUpDownCounter<int>(
            "orders.active",
            description: "Number of orders currently being processed");
        // Observable instrument: the callback runs each time metrics are collected
        _pendingOrders = meter.CreateObservableGauge<int>(
            "orders.pending",
            description: "Number of pending orders",
            observeValue: () => GetPendingOrderCount());
    }
    public void RecordOrderCreated(decimal amount)
    {
        _ordersCreated.Add(1);
        _orderAmount.Record((double)amount);
    }
    public void RecordOrderFailed(string reason)
    {
        // Spell out the tag type; a target-typed new(...) is ambiguous across the Add overloads
        _ordersFailed.Add(1,
            new KeyValuePair<string, object?>("reason", reason));
    }
    // UpDownCounter has no Increment/Decrement; use Add with a signed delta
    public void IncrementActive() => _activeOrders.Add(1);
    public void DecrementActive() => _activeOrders.Add(-1);
    private static int GetPendingOrderCount() => 42; // placeholder: read from the database or a cache
}
// Registration
builder.Services.AddSingleton<OrderMetrics>();
// Usage inside a service
public class OrderService
{
    private readonly OrderMetrics _metrics;
    public OrderService(OrderMetrics metrics) => _metrics = metrics;
    public async Task<OrderDto> CreateOrderAsync(CreateOrderCommand cmd)
    {
        _metrics.IncrementActive();
        try
        {
            var order = await CreateInternal(cmd);
            _metrics.RecordOrderCreated(order.TotalAmount);
            return order;
        }
        catch (Exception ex)
        {
            _metrics.RecordOrderFailed(ex.GetType().Name);
            throw;
        }
        finally
        {
            _metrics.DecrementActive();
        }
    }
}
// Prometheus endpoint
app.MapPrometheusScrapingEndpoint();
// Visit /metrics to see the metrics in Prometheus format

Correlating Logs and Traces
Structured Logging Integration
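Log and trace correlation comes down to reading `Activity.Current` at the moment a log line is written and attaching its ids. The sketch below shows the mechanism with a hypothetical `FormatLog` helper and plain strings; a real setup would let the logging framework do this through an enricher or the OpenTelemetry log pipeline.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: appends the ids of the ambient activity, if any, to a log line.
static string FormatLog(string level, string message)
{
    Activity? current = Activity.Current;
    string ids = current is null
        ? "TraceId:-, SpanId:-"
        : $"TraceId:{current.TraceId}, SpanId:{current.SpanId}";
    return $"[{level}] {message} ({ids})";
}

// Starting an activity makes it Activity.Current for the current async flow.
var activity = new Activity("CreateOrder").Start();
string inside = FormatLog("ERR", "Order creation failed: insufficient stock");
string expectedTraceId = activity.TraceId.ToString();
activity.Stop();

// After Stop, Activity.Current falls back to the parent (none here).
string outside = FormatLog("INF", "Idle");

Console.WriteLine(inside);
Console.WriteLine(outside);
```

With the trace id present in every log line, a log search for one id returns the entries of a single request across all services that took part in it.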
/// <summary>
/// Log and trace correlation: include TraceId and SpanId in log records automatically
/// </summary>
// Note: UseSerilog (shown further down) replaces the built-in logging providers by
// default, so use either this pipeline or the Serilog setup, or pass writeToProviders: true.
builder.Logging.AddOpenTelemetry(logging =>
{
    logging.IncludeFormattedMessage = true;
    logging.IncludeScopes = true;
    // Export logs over OTLP
    logging.AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://localhost:4317");
    });
});
// Custom Serilog enricher
public class TraceLogEnricher : ILogEventEnricher
{
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        var activity = Activity.Current;
        if (activity != null)
        {
            logEvent.AddPropertyIfAbsent(
                propertyFactory.CreateProperty("TraceId", activity.TraceId.ToString()));
            logEvent.AddPropertyIfAbsent(
                propertyFactory.CreateProperty("SpanId", activity.SpanId.ToString()));
            // ParentSpanId is a non-nullable struct, so no null-conditional operator is needed
            logEvent.AddPropertyIfAbsent(
                propertyFactory.CreateProperty("ParentSpanId",
                    activity.ParentSpanId.ToString()));
        }
    }
}
// Log output format:
// [2024-01-15 10:30:00 ERR] OrderService (TraceId: abc123, SpanId: def456)
// Order creation failed: insufficient stock
// Serilog configuration example
builder.Host.UseSerilog((context, services, loggerConfig) =>
{
    loggerConfig
        .ReadFrom.Configuration(context.Configuration)
        .Enrich.FromLogContext()
        .Enrich.With<TraceLogEnricher>() // without this, {TraceId} and {SpanId} stay empty
        .Enrich.WithProperty("ServiceName", "order-service")
        .WriteTo.Console(
            outputTemplate:
                "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} " +
                "(TraceId:{TraceId}, SpanId:{SpanId}){NewLine}{Exception}")
        .WriteTo.Seq("http://localhost:5341");
});

Multi-Service Tracing
End-to-End Call Chain Example
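What ties the spans of several services into one trace is a pair of steps at every hop: the caller injects its context into the outgoing headers, and the callee extracts it and starts its own span as a child. The sketch below simulates both sides in one process with `DistributedContextPropagator` from `System.Diagnostics`; the dictionary standing in for HTTP headers and the operation names are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Stands in for the HTTP headers of one request between two services.
var headers = new Dictionary<string, string>();

// Caller side: an active span whose context is written into the headers.
var outgoing = new Activity("POST inventory/deduct (client)").Start();
DistributedContextPropagator.Current.Inject(outgoing, headers,
    static (carrier, name, value) => ((Dictionary<string, string>)carrier!)[name] = value);
outgoing.Stop();

// Callee side: read the trace context back out of the headers.
DistributedContextPropagator.Current.ExtractTraceIdAndState(headers,
    static (object? carrier, string name, out string? value, out IEnumerable<string>? values) =>
    {
        values = null;
        ((Dictionary<string, string>)carrier!).TryGetValue(name, out value);
    },
    out string? traceparent, out string? tracestate);

bool parsed = ActivityContext.TryParse(traceparent, tracestate, out ActivityContext remoteParent);

// Start the server-side span as a child of the remote caller.
var incoming = new Activity("POST inventory/deduct (server)");
incoming.SetParentId(remoteParent.TraceId, remoteParent.SpanId, remoteParent.TraceFlags);
incoming.Start();

Console.WriteLine($"same trace: {incoming.TraceId.ToString() == outgoing.TraceId.ToString()}");
Console.WriteLine($"parent is caller: {incoming.ParentSpanId.ToString() == outgoing.SpanId.ToString()}");
incoming.Stop();
```

In an ASP.NET Core application neither step is written by hand for HTTP: `HttpClient` performs the injection and the hosting layer performs the extraction.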
/// <summary>
/// Full chain: gateway -> order service -> inventory service -> payment service
/// </summary>
// 1. API gateway (YARP reverse proxy)
// YARP forwards the incoming trace context to the destination. To record the proxy's
// own spans, register its ActivitySource with tracing.AddSource("Yarp.ReverseProxy").
builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"));
// 2. Order service
public class OrderController : ControllerBase
{
    [HttpPost("/api/orders")]
    public async Task<IActionResult> CreateOrder(
        [FromBody] CreateOrderRequest request,
        [FromServices] OrderService orderService,
        [FromServices] InventoryClient inventoryClient,
        [FromServices] PaymentClient paymentClient)
    {
        // ASP.NET Core creates the "POST /api/orders" span automatically
        using var activity = Tracing.OrderActivity.StartActivity("CreateOrderWorkflow");
        activity?.SetTag("order.user_id", request.UserId);
        // 1. Deduct stock (trace context is propagated automatically)
        using (var invActivity = Tracing.OrderActivity.StartActivity("CheckInventory"))
        {
            var inventoryResult = await inventoryClient.DeductAsync(
                new DeductRequest(request.ProductId, request.Quantity));
            if (!inventoryResult.Success)
            {
                activity?.SetStatus(ActivityStatusCode.Error, "Insufficient stock");
                return BadRequest(new { message = "Insufficient stock" });
            }
        }
        // 2. Create the order
        var order = await orderService.CreateAsync(request);
        // 3. Start the payment
        using (var payActivity = Tracing.OrderActivity.StartActivity("InitPayment"))
        {
            var paymentResult = await paymentClient.ChargeAsync(
                new ChargeRequest(order.Id, order.TotalAmount));
            if (!paymentResult.Success)
            {
                // Record an event without aborting the workflow
                activity?.AddEvent(new ActivityEvent("PaymentFailed", tags: new ActivityTagsCollection
                {
                    ["payment.error"] = paymentResult.ErrorMessage ?? ""
                }));
            }
        }
        return Ok(order);
    }
}
// 3. HttpClient propagates the context automatically (AddHttpClientInstrumentation is configured)
builder.Services.AddHttpClient<InventoryClient>(client =>
{
    client.BaseAddress = new Uri("https://inventory-service");
});
builder.Services.AddHttpClient<PaymentClient>(client =>
{
    client.BaseAddress = new Uri("https://payment-service");
});

Grafana Visualization
Dashboard Configuration
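/// <summary>
/// Recommended Grafana dashboard metrics
/// </summary>
// Exact metric names depend on the instrumentation and exporter versions;
// check the /metrics output before building panels.
// 1. Service health panel
// - http.server.request.count (total requests; with the .NET 8 built-in metrics this is the count of http.server.request.duration)
// - http.server.request.duration (request latency P50/P95/P99)
// - http.server.active_requests (concurrent requests)
// - aspnetcore.request.rate (requests per second; usually computed with rate() over the request count)
// 2. Database panel
// - db.client.connections.active (active connections)
// - db.client.operations.duration (query latency)
// - ef.core.query.duration (EF Core query time)
// 3. External calls panel
// - http.client.request.duration (HttpClient call latency)
// - http.client.request.count (number of external calls; count of the duration histogram)
// - rpc.client.request.duration (gRPC call latency)
// 4. Custom business panel
// - orders.created.total (order creation trend)
// - orders.failed.total (order failure trend)
// - orders.amount (order amount distribution)
// - orders.active (active orders)
// 5. Alert rules
// - P99 latency > 2s
// - Error rate > 5%
// - Active connections > 80% of the maximum
// - Order failure rate > 1%
// Prometheus alerting rule example (goes in a rules file referenced by rule_files in
// prometheus.yml; alertmanager.yml only configures notification routing).
// The status label below is the one exported for the .NET 8 built-in metrics; adjust it to what /metrics shows.
// groups:
//   - name: api-alerts
//     rules:
//       - alert: HighErrorRate
//         expr: sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m])) / sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
//         for: 2m
//         labels:
//           severity: critical
//         annotations:
//           summary: "API error rate above 5%"
//       - alert: HighLatency
//         expr: histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))) > 2
//         for: 5m
//         labels:
//           severity: warning
//         annotations:
//           summary: "API P99 latency above 2 seconds"

.NET 9 New Features
Metrics and Tracing Enhancements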
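// OpenTelemetry-related enhancements in .NET 9
// 1. Built-in Meter support
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        // Built-in meters, collected once they are registered here
        // (the ASP.NET Core and HttpClient meters have shipped since .NET 8)
        metrics.AddMeter("Microsoft.AspNetCore.Hosting");
        metrics.AddMeter("Microsoft.AspNetCore.Routing");
        metrics.AddMeter("Microsoft.EntityFrameworkCore");
        metrics.AddMeter("System.Net.Http");
        metrics.AddMeter("System.Net.Security");
        // Custom meter
        metrics.AddMeter("MyApp.Business");
        // OTLP export
        metrics.AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://otel-collector:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        });
    });
// 2. Attribute-based filtering
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddAspNetCoreInstrumentation(options =>
        {
            // Filter specific paths. Filter is a single delegate: assigning it a second
            // time discards the first one, so combine all conditions in one expression.
            options.Filter = ctx =>
                !ctx.Request.Path.StartsWithSegments("/health") &&
                !ctx.Request.Path.StartsWithSegments("/metrics");
            // Filter runs when the request starts, before the status code is known.
            // Status-based rules (for example dropping 3xx redirects) therefore belong
            // in a span processor or in tail sampling, not here.
        });
    });
// 3. Exemplar support (attaches a TraceId to a metric data point)
// In a backend that supports exemplars, such as Prometheus, a metric sample can be followed to its trace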
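Summary
OpenTelemetry is the industry standard for distributed tracing, and .NET 8 integrates it closely through AddOpenTelemetry(). Automatic instrumentation covers ASP.NET Core requests, HttpClient calls, EF Core queries, and gRPC. Custom spans are created with ActivitySource.StartActivity() and support tags, events, and error recording. For HTTP calls, W3C Trace Context is propagated automatically through the traceparent header; message-queue scenarios need manual propagation. Sampling, configured through a Sampler, controls the data volume. Keeping every error and slow request is advisable, but because a head sampler decides before the outcome is known, this requires tail sampling in practice.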
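Key Points
- First work out whether the topic sits on the request path, the background-task path, or the infrastructure path.
- Server-side topics are rarely only about functional correctness; stability, performance, and observability matter as well.
- Any framework capability should be examined together with configuration, lifetime, exception propagation, and external dependencies.
Project Adoption Perspective
- Map out the full path: request entry, business execution, external calls, logging, and error responses.
- Add timeouts, retries, circuit breaking, tracing, and structured logging to critical paths.
- Separate configuration from secrets, and make clear where the differences between environments come from.
Common Pitfalls
- Piling up middleware or components without knowing the order in which they run on the request path.
- Ignoring lifetimes and runtime resource limits such as the thread pool and connection pools.
- Drawing conclusions about performance or reliability without monitoring or tests.
Next Steps
- Go deeper into runtime behavior, observability, release governance, and microservice coordination.
- Study the topic together with databases, caches, message queues, and authentication and authorization.
- Build team-level templates, including unified exception handling, configuration conventions, and infrastructure wrappers.
When to Use
- When you are ready to bring distributed tracing into a real project, start by validating the critical path in an isolated module or a minimal sample.
- It suits API services, background tasks, real-time communication, authentication and authorization, and microservice collaboration.
- Once requirements touch stability, performance, observability, and release processes, this kind of topic becomes an infrastructure capability.
Adoption Recommendations
- Define the request path and the failure paths first, then decide on middleware, filters, service boundaries, and dependency style.
- Add logs, metrics, traces, timeouts, and retry policies to critical paths.
- Keep environment configuration separate from secrets, and avoid hard-coding production parameters in code or images.
Troubleshooting Checklist
- First determine whether the problem occurs in routing, model binding, middleware, the business layer, or the infrastructure layer.
- Check DI lifetimes, configuration sources, serialization rules, and the authentication context.
- Look at the thread pool, connection pools, cache hit rate, and timeouts of external dependencies.
Review Questions
- If you brought distributed tracing into your current project, which inputs, outputs, and failure paths would you validate first?
- At what scale and under which boundary conditions is distributed tracing most likely to expose problems? Which metrics or logs would you use to confirm it?
- Compared with the default implementation or the alternatives, what are the biggest benefits and costs of adopting distributed tracing?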
