Linux Packet Capture and Network Troubleshooting

SunnyFan

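Introduction

Network troubleshooting is one of the core skills of an operations engineer. In distributed systems, network problems are among the most common causes of service failures: connection timeouts, packet loss, DNS resolution failures, high latency, and so on. A systematic set of diagnostic tools and methods can cut the time needed to locate a fault (a large part of MTTR) from hours to minutes.

This article walks through the Linux network diagnostics toolchain, from basic connectivity tests to in-depth packet analysis. It covers tcpdump, Wireshark, ss/netstat, mtr, dig/nslookup, curl, iptables and other core tools, and uses practical cases to show how to approach common network faults.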

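Core Tools at a Glance

Tool	Layer	Purpose
ping	Network	Basic connectivity test
traceroute/mtr	Network	Route path tracing
nslookup/dig	Application	DNS resolution diagnostics
ss/netstat	Transport	Connection state inspection
tcpdump	Data link	Packet capture
Wireshark	All layers	In-depth packet analysis
curl	Application	HTTP request debugging
iptables	Network	Firewall rule debugging
iperf3	Transport	Network performance testing
ip	Network	Network interface management
ethtool	Data link	NIC status diagnostics
conntrack	Network	Connection tracking table management
nsenter	All layers	Network namespace operations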

Layered Diagnostic Model

+--------------------------------------------------+
|              Application layer (HTTP/DNS/SSH)    |  curl, dig, nslookup
+--------------------------------------------------+
|              Transport layer (TCP/UDP)           |  ss, netstat, tc
+--------------------------------------------------+
|              Network layer (IP/ICMP)             |  ping, traceroute, mtr, ip route
+--------------------------------------------------+
|              Data link layer (ARP/Frame)         |  tcpdump, ethtool, ip link
+--------------------------------------------------+
|              Physical layer (NIC/Cable)          |  ethtool, dmesg, lspci
+--------------------------------------------------+

Diagnostic principle: work bottom-up, one layer at a time. Confirm the physical link first, then check the data link layer, and keep moving up until the fault is located.

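Basic Connectivity Diagnostics

The ping Command

# Basic connectivity test
ping -c 4 192.168.1.1

# Set the payload size (to test for MTU problems); -M do forbids fragmentation
ping -c 4 -M do -s 1472 192.168.1.1
# Note: 1472 + 28 (20-byte IP header + 8-byte ICMP header) = 1500 (standard Ethernet MTU)

# Fast ping (0.1 second interval)
ping -c 100 -i 0.1 192.168.1.1

# Set the TTL
ping -c 4 -t 64 192.168.1.1

# Record the route (many routers ignore this IP option)
ping -c 4 -R 192.168.1.1

# Set the per-reply timeout in seconds
ping -c 4 -W 3 192.168.1.1

# Choose the source interface or source address
ping -c 4 -I eth0 192.168.1.1
ping -c 4 -I 192.168.1.10 192.168.1.1

# Flood ping (requires root; use with care, test environments only)
ping -c 1000 -f 192.168.1.1

# Summary printed at the end of a ping run
# --- 192.168.1.1 ping statistics ---
# 4 packets transmitted, 4 received, 0% packet loss, time 3005ms
# rtt min/avg/max/mdev = 0.020/0.035/0.050/0.012 ms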
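
For scripted checks, the two summary lines are usually all that matters. The following is a minimal sketch that assumes iputils-style output like the sample above; the function name `ping_summary` is illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# Reduce ping output (read from stdin) to "<loss_percent> <avg_rtt_ms>".
# The average is "-" when no reply was received, because ping then prints no rtt line.
ping_summary() {
    awk '
        /packet loss/ {
            # The loss figure is the only field that ends in "%"
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^(rtt|round-trip)/ {
            # "rtt min/avg/max/mdev = 0.020/0.035/0.050/0.012 ms"
            split($0, halves, " = ")
            split(halves[2], rtt, "/")
            avg = rtt[2]
        }
        END {
            if (loss == "") loss = "-"
            if (avg == "") avg = "-"
            print loss, avg
        }
    '
}
```

Typical use would be `ping -c 10 -i 0.2 192.168.1.1 | ping_summary`, with the two numbers fed into an alert threshold or a log line.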

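Reading ping Output

# Normal reply
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.035 ms

# Destination host unreachable
From 192.168.1.10 icmp_seq=1 Destination Host Unreachable
# Cause: ARP resolution failed or the target host does not exist

# Network unreachable
From 192.168.1.1 icmp_seq=1 Destination Net Unreachable
# Cause: no route to the target network

# Timeout
# Linux (iputils) ping prints nothing for a lost reply; with -O it prints:
no answer yet for icmp_seq=1
# ("Request timeout for icmp_seq=1" is the macOS/BSD wording)
# Cause: dropped by a router on the path or blocked by the target's firewall

# Redirect
From 192.168.1.1: Redirect Host(New nexthop: 192.168.1.254)
# Cause: the route in use is not the best path, so the router sends a redirect

# Fragmentation needed
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1400)
# Cause: the packet is larger than the path MTU and the Don't Fragment flag is set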

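Route Tracing with traceroute

# Basic route trace
traceroute 8.8.8.8

# Use TCP SYN instead of UDP (gets through more firewalls; requires root)
traceroute -T -p 443 8.8.8.8

# Choose the source address
traceroute -s 192.168.1.10 8.8.8.8

# Choose the interface
traceroute -i eth0 8.8.8.8

# More simultaneous probes (faster); -N defaults to 16, -w is the wait time in seconds
traceroute -N 32 -w 3 8.8.8.8

# Show IP addresses instead of host names
traceroute -n 8.8.8.8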

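Live Diagnostics with mtr

# Install mtr (RHEL/CentOS)
yum install -y mtr

# Live trace (recommended)
mtr -n 8.8.8.8

# Report mode
mtr -n -r -c 10 8.8.8.8

# Reading the output
# Host                Loss%   Snt   Last   Avg  Best  Wrst StDev
# 1. 192.168.1.1       0.0%    10    0.3   0.4   0.2   0.8   0.2
# 2. 10.0.0.1          0.0%    10    1.2   1.5   1.0   2.3   0.4
# 3. ???             100.0%    10    0.0   0.0   0.0   0.0   0.0
# 4. 8.8.8.8           0.0%    10    5.3   5.8   4.9   7.2   0.7
# Hop 3 never answers, yet hop 4 shows no loss: that router ignores or rate-limits
# ICMP but still forwards traffic. Only loss that persists to the final hop is real.

# TCP mode
mtr -n -T -P 443 8.8.8.8

# Set the packet size
mtr -n -s 1000 8.8.8.8

# JSON output (easier to parse in scripts)
mtr -n -j -c 5 8.8.8.8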
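
The rule of thumb for reading a report (intermediate loss matters only if it carries through to the last hop) can be automated. The sketch below assumes the plain-text layout that `mtr -n -r` prints, where each hop line starts with a field such as `1.|--`; the function name `mtr_verdict` is made up for this example.

```shell
#!/usr/bin/env bash
# Classify an mtr report read from stdin:
#   "end-to-end loss: N%"     the final hop itself loses packets
#   "intermediate-only loss"  some hop shows loss but the final hop does not
#   "no loss"
mtr_verdict() {
    awk '
        index($1, "|--") > 0 {          # hop lines look like "  3.|-- host  0.0% ..."
            loss = $3
            sub(/%/, "", loss)
            last = loss + 0             # after the loop this holds the final hop
            if (last > 0) lossy++
        }
        END {
            if (last > 0)       print "end-to-end loss: " last "%"
            else if (lossy > 0) print "intermediate-only loss"
            else                print "no loss"
        }
    '
}
```

For example, `mtr -n -r -c 100 192.168.1.100 | mtr_verdict`. An "intermediate-only loss" verdict usually points at ICMP rate limiting rather than a faulty link.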

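DNS Diagnostics

The dig Command

# Basic query
dig example.com

# Short answer only
dig +short example.com

# Query a specific DNS server
dig @8.8.8.8 example.com

# Query a specific record type
dig example.com A        # IPv4 address
dig example.com AAAA     # IPv6 address
dig example.com CNAME    # alias record
dig example.com MX       # mail exchanger
dig example.com NS       # name servers
dig example.com TXT      # TXT records (SPF, DKIM)
dig example.com SOA      # start of authority
dig 10.1.168.192.in-addr.arpa PTR   # reverse lookup (PTR records live under in-addr.arpa)
dig -x 192.168.1.10      # reverse lookup (shorthand)

# Trace the resolution path from the root servers
dig +trace example.com

# Show query statistics, including "Query time" (on by default)
dig +stats example.com

# Query over TCP
dig +tcp example.com

# Use a specific DNS port
dig @8.8.8.8 -p 53 example.com

# Request DNSSEC records
dig +dnssec example.com

# Batch query (one name per line in the file)
dig +short -f domain_list.txt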

The nslookup Command

# Basic query
nslookup example.com

# Query a specific DNS server
nslookup example.com 8.8.8.8

# Interactive mode
nslookup
> server 8.8.8.8
> set type=MX
> example.com
> set type=A
> example.com
> exit

# Reverse lookup
nslookup 192.168.1.10

# Query all record types (many servers now refuse or minimise ANY queries)
nslookup -type=any example.com

DNS Troubleshooting Workflow

# 1. Check /etc/resolv.conf
cat /etc/resolv.conf

# 2. Test the local resolver (only if one runs on this host;
#    the systemd-resolved stub listens on 127.0.0.53 instead)
dig @127.0.0.1 example.com

# 3. Test an external resolver
dig @8.8.8.8 example.com

# 4. Check the DNS cache (nscd or systemd-resolved)
# systemd-resolved
resolvectl statistics
resolvectl flush-caches

# nscd
nscd -g
nscd -i hosts

# 5. Check /etc/hosts
grep example /etc/hosts

# 6. Check the lookup order in nsswitch.conf
grep hosts /etc/nsswitch.conf
# Typical output: hosts:      files dns myhostname

# 7. Check DNS resolution latency
dig +stats example.com | grep "Query time"
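
Steps 1 and 7 combine naturally: time the same query against every configured resolver. The helpers below are a sketch under the assumption of a standard resolv.conf layout and the usual `;; Query time: N msec` line in dig output; both function names are illustrative.

```shell
#!/usr/bin/env bash
# Print the nameserver addresses from a resolv.conf-style file.
# $1: path to the file (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Extract the number of milliseconds from dig output read on stdin.
# The relevant line looks like ";; Query time: 23 msec".
query_time_ms() {
    awk '/Query time:/ { print $4 }'
}

# Example loop (commented out because it needs network access):
# for ns in $(list_nameservers); do
#     echo "$ns $(dig @"$ns" example.com | query_time_ms) ms"
# done
```

Note that glibc only uses the first three `nameserver` entries, so a slow server near the top of the file affects every lookup.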

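Connection State Diagnostics

The ss Command (recommended replacement for netstat)

# Show all TCP sockets
ss -t -a

# Show all UDP sockets
ss -u -a

# Show listening ports
ss -tlnp

# Common combinations
ss -tunlp          # TCP + UDP + numeric + listening + process info
ss -s              # socket summary
ss -tn state established    # established connections
ss -tn state time-wait      # TIME_WAIT connections
ss -tn state close-wait     # CLOSE_WAIT connections

# Filter by port
ss -tlnp sport = :80
ss -tn dport = :443                          # dport only makes sense for connected sockets
ss -tlnp '( sport = :80 or sport = :443 )'

# Filter by address
ss -tn dst 192.168.1.100
ss -tn dst 192.168.1.0/24

# Socket summary
ss -s
# Sample output:
# Total: 1050 (kernel 1200)
# TCP:   85 (estab 42, closed 15, orphaned 0, timewait 15)
# Transport Total     IP        IPv6
# RAW       1         0         1
# UDP       5         3         2
# TCP       70        45        25
# INET      76        48        28
# FRAG      0         0         0

# Show the connections of one process (ss has no pid filter, so match on the -p output)
ss -tnp | grep 'pid=12345,'

# Show socket memory usage
ss -tm

# Show internal TCP information (RTT, congestion window, retransmissions)
ss -ti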

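The netstat Command (legacy tool)

# Show all listening ports
netstat -tlnp

# Show all connections
netstat -anp

# Show TCP protocol statistics
netstat -st

# Count TCP connections by state
netstat -anp | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
# Sample output:
#     42 ESTABLISHED
#     15 TIME_WAIT
#      5 CLOSE_WAIT
#      3 LISTEN

# Show the routing table
netstat -rn

# Show interface statistics
netstat -i

# Continuous monitoring (-c refreshes every second; use watch for another interval)
netstat -tlnp -c
watch -n 2 netstat -tlnp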

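TCP State Analysis

# TCP connection states and their meaning
LISTEN       - server is waiting for connections
SYN-SENT     - client has sent SYN and is waiting for SYN+ACK
SYN-RECV     - server has received SYN, sent SYN+ACK, and is waiting for ACK
ESTABLISHED  - connection is established and data can flow
FIN-WAIT-1   - active closer has sent FIN and is waiting for ACK
FIN-WAIT-2   - active closer has received ACK and is waiting for the peer's FIN
CLOSE-WAIT   - passive closer has received FIN and is waiting for the application to close
LAST-ACK     - passive closer has sent FIN and is waiting for ACK
TIME-WAIT    - active closer waits 2MSL before releasing the connection
CLOSING      - both sides closed at the same time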

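Troubleshooting CLOSE_WAIT

# Count CLOSE_WAIT connections (-H drops the header line so it is not counted)
ss -Htn state close-wait | wc -l

# Find the processes that own the CLOSE_WAIT connections
ss -tnp state close-wait

# Why CLOSE_WAIT connections pile up
# Common causes:
# 1. The application does not close the connection properly (close() is never called)
# 2. The application is too slow to close connections that have already received FIN
# 3. The thread pool is exhausted, so close requests are not processed

# Count the socket file descriptors held by a process (-s picks a single PID)
ls -la /proc/$(pidof -s java)/fd | grep -c socket

# Inspect connections with lsof
lsof -i :8080
lsof -p $(pidof -s java) | grep TCP | wc -l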
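
When several services run on one host, it helps to rank processes by how many CLOSE_WAIT sockets they hold. This sketch assumes the `users:(("name",pid=N,fd=M))` format that `ss -p` prints; the function name `close_wait_by_pid` is invented for the example.

```shell
#!/usr/bin/env bash
# Read `ss -Htnp state close-wait` output on stdin and print "<count> <pid>",
# highest count first. Only the first pid on each line is counted, so a socket
# shared between processes is attributed to the first owner listed.
close_wait_by_pid() {
    awk '
        match($0, /pid=[0-9]+/) {
            pid = substr($0, RSTART + 4, RLENGTH - 4)   # strip the "pid=" prefix
            count[pid]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -rn
}
```

Run it as root, for example `ss -Htnp state close-wait | close_wait_by_pid`; without sufficient privileges ss omits the process column for other users' sockets and those lines are skipped.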

Packet Capture with tcpdump

Basic Captures

# Install tcpdump
yum install -y tcpdump

# Capture all traffic on interface eth0
tcpdump -i eth0

# Capture traffic on a port
tcpdump -i eth0 port 80

# Capture traffic to or from a host
tcpdump -i eth0 host 192.168.1.100

# Capture traffic to or from a network
tcpdump -i eth0 net 192.168.1.0/24

# Capture TCP traffic
tcpdump -i eth0 tcp

# Capture UDP traffic
tcpdump -i eth0 udp

# Capture ICMP traffic
tcpdump -i eth0 icmp

# Capture ARP traffic
tcpdump -i eth0 arp

Advanced Filters

# Combine conditions
tcpdump -i eth0 host 192.168.1.100 and port 80
tcpdump -i eth0 host 192.168.1.100 and not port 22
tcpdump -i eth0 '(host 192.168.1.100 or host 192.168.1.101) and port 80'

# Filter by direction
tcpdump -i eth0 src host 192.168.1.100    # by source address
tcpdump -i eth0 dst host 192.168.1.100    # by destination address
tcpdump -i eth0 src port 8080             # by source port
tcpdump -i eth0 dst port 80               # by destination port

# Filter by TCP flags
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'           # SYN set
tcpdump -i eth0 'tcp[tcpflags] & tcp-ack != 0'           # ACK set
tcpdump -i eth0 'tcp[tcpflags] & tcp-fin != 0'           # FIN set
tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0'           # RST set
tcpdump -i eth0 'tcp[tcpflags] == tcp-syn'               # SYN only, no other flag

# Capture HTTP requests and responses (packets on port 80 that carry a payload; -l line-buffers for grep)
tcpdump -i eth0 -l -s 0 -A 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' | grep -i "GET\|POST\|HTTP"

# Capture a specific VLAN
tcpdump -i eth0 vlan 100

# Filter by TTL
tcpdump -i eth0 'ip[8] < 10'
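
The HTTP filter above works by computing the TCP payload length inside BPF: IP total length minus IP header length minus TCP header length. The same arithmetic, written as a small shell function (the name `tcp_payload_len` is illustrative), makes the bit operations easier to follow.

```shell
#!/usr/bin/env bash
# Compute the TCP payload length the way the BPF expression does.
#   $1 = ip[2:2]  IP total length in bytes
#   $2 = ip[0]    version/IHL byte; the low nibble is the header length in 32-bit words
#   $3 = tcp[12]  data-offset byte; the high nibble is the header length in 32-bit words
tcp_payload_len() {
    ip_hdr=$(( ($2 & 0xf) << 2 ))      # words * 4 = bytes
    tcp_hdr=$(( ($3 & 0xf0) >> 2 ))    # (nibble >> 4) * 4 collapses to >> 2
    echo $(( $1 - ip_hdr - tcp_hdr ))
}

# A bare SYN with TCP options: 60 - 20 - 40 = 0 bytes of payload, so the filter skips it.
# tcp_payload_len 60 0x45 0xa0
# A full-size segment on a 1500-byte MTU: 1500 - 20 - 20 = 1460 bytes.
# tcp_payload_len 1500 0x45 0x50
```

Because handshake segments and pure ACKs evaluate to 0, the `!= 0` test keeps only segments that carry application data.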

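Output Control

# Verbose output (more protocol detail per packet)
tcpdump -i eth0 -vv port 80

# Show link-layer information (MAC addresses)
tcpdump -i eth0 -e port 80

# Print packet contents as ASCII
tcpdump -i eth0 -A port 80

# Print packet contents as hex and ASCII, including the link-layer header
tcpdump -i eth0 -XX port 80

# Do not resolve host names or port names
tcpdump -i eth0 -nn port 80

# Timestamps
tcpdump -i eth0 -tttt port 80       # human-readable date and time
tcpdump -i eth0 -t port 80          # no timestamps

# Stop after a given number of packets
tcpdump -i eth0 -c 100 port 80

# Set the snapshot length (snaplen)
tcpdump -i eth0 -s 0 port 80        # use the default snaplen (262144 bytes in current releases)
tcpdump -i eth0 -s 65535 port 80    # explicit 65535 bytes, the default in older releases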

Saving and Reading Capture Files

# Save to a file (for analysis in Wireshark)
tcpdump -i eth0 -w /tmp/capture.pcap port 80

# Rotate the output file by size
tcpdump -i eth0 -w /tmp/capture.pcap -C 100 port 80
# Each file holds at most about 100 MB (-C counts in units of 1,000,000 bytes)

# Limit the total number of files
tcpdump -i eth0 -w /tmp/capture.pcap -C 100 -W 10 port 80
# Keeps at most 10 files and overwrites them in a ring

# Stop after saving a given number of packets
tcpdump -i eth0 -w /tmp/capture.pcap -c 1000 port 80

# Read a capture file
tcpdump -r /tmp/capture.pcap

# Read and filter
tcpdump -r /tmp/capture.pcap port 80
tcpdump -r /tmp/capture.pcap -nn host 192.168.1.100

# Save compressed
tcpdump -i eth0 -w - port 80 | gzip > /tmp/capture.pcap.gz

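Practical Capture Scenarios

# Scenario 1: diagnose TCP three-way handshake problems
tcpdump -i eth0 -nn -S host 192.168.1.100 and port 443
# Check that SYN, SYN-ACK and ACK all appear

# Scenario 2: diagnose connection resets
tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-rst|tcp-fin) != 0 and host 192.168.1.100'

# Scenario 3: capture MySQL queries (works only on unencrypted connections)
tcpdump -i eth0 -l -s 0 -A port 3306 | grep -i "SELECT\|INSERT\|UPDATE\|DELETE"

# Scenario 4: diagnose DNS problems
tcpdump -i eth0 -nn port 53

# Scenario 5: diagnose DHCP problems
tcpdump -i eth0 -nn port 67 or port 68

# Scenario 6: diagnose HTTP performance
# -ttt prints the time elapsed since the previous packet, so the gap between
# a request and the first response packet shows the server response time
tcpdump -i eth0 -nn -ttt port 80

# Scenario 7: capture a full TCP conversation with payload
tcpdump -i eth0 -nn -vv -X host 192.168.1.100 and port 8080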

Analysis with Wireshark

Installation and Basic Use

# Install Wireshark on Linux (GUI environment)
yum install -y wireshark

# tshark is the command-line version of Wireshark. On RHEL 8 and later it ships in
# wireshark-cli; on CentOS 7 it is part of the wireshark package
yum install -y wireshark-cli

# Basic capture
tshark -i eth0

# Capture a specific port (capture filter, BPF syntax)
tshark -i eth0 -f "port 80"

# Apply a display filter
tshark -i eth0 -Y 'http.request.method == "GET"'

# Print selected fields
tshark -i eth0 -Y "http" -T fields -e ip.src -e ip.dst -e http.host -e http.request.uri

# Count HTTP status codes
tshark -r /tmp/capture.pcap -Y "http.response" -T fields -e http.response.code | sort | uniq -c | sort -rn

# Export HTTP objects
tshark -r /tmp/capture.pcap --export-objects http,/tmp/http_objects/

# Count TCP retransmissions per address pair
tshark -r /tmp/capture.pcap -Y "tcp.analysis.retransmission" -T fields -e ip.src -e ip.dst | sort | uniq -c

Wireshark Display Filters (for use in the GUI)

# Basic filters
ip.addr == 192.168.1.100
ip.src == 192.168.1.100
ip.dst == 192.168.1.100
tcp.port == 80
tcp.dstport == 443
http
dns
tls

# HTTP filters
http.request.method == "POST"
http.response.code == 200
http.response.code >= 400
http.host contains "example.com"
http.content_type contains "json"

# TCP filters
tcp.flags.syn == 1
tcp.flags.reset == 1
tcp.analysis.retransmission
tcp.analysis.duplicate_ack
tcp.stream == 1

# DNS filters
dns.qry.name contains "example.com"
dns.flags.response == 1
dns.flags.rcode != 0

# TLS filters (Wireshark 3.0 renamed the ssl.* fields to tls.*)
tls.handshake.type == 1          # Client Hello
tls.handshake.type == 2          # Server Hello
tls.handshake.type == 11         # Certificate
tls.record.content_type == 23    # Application Data

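Analysing a TCP Stream

# In Wireshark: right-click a packet -> Follow -> TCP Stream
# This shows the full request and response content

# Things to check:
# 1. Does the three-way handshake complete (SYN -> SYN-ACK -> ACK)?
# 2. Are there many retransmissions (shown on a black background by default)?
# 3. Are there zero-window events (Zero Window)?
# 4. Are there TCP keepalives?
# 5. Does the four-way close complete normally?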

Debugging with curl

Debugging HTTP Requests

# Verbose mode (for HTTPS this already includes the TLS handshake)
curl -v http://example.com

# Even more detail: dump everything sent and received
curl --trace-ascii - https://example.com

# Show response headers only
curl -I http://example.com

# Show timing (custom format)
curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nStart: %{time_starttransfer}s\nTotal: %{time_total}s\nHTTP Code: %{http_code}\nSize: %{size_download} bytes\n" https://example.com

# Test an HTTP method
curl -X POST http://example.com/api \
    -H "Content-Type: application/json" \
    -d '{"key": "value"}'

# Follow redirects
curl -L -v http://example.com

# Pin the resolved address (bypass DNS)
curl --resolve example.com:443:192.168.1.100 https://example.com

# Set timeouts
curl --connect-timeout 5 --max-time 10 http://example.com

# Send a cookie
curl -b "session=abc123" http://example.com

# Save response headers and body
curl -D headers.txt -o body.txt http://example.com

# Test HTTP/2
curl --http2 -v https://example.com

# Download a file with a progress bar
curl -O -# http://example.com/largefile.zip

# Upload a file
curl -F "file=@/path/to/file.txt" http://example.com/upload

# Test HTTP basic authentication
curl -u username:password http://example.com

# Test TLS certificates
curl --cacert /path/to/ca.pem https://example.com
curl --insecure https://example.com    # skip certificate verification

# Imitate a browser request
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
     -H "Accept: text/html" \
     -H "Accept-Language: zh-CN,zh;q=0.9" \
     http://example.com
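
The `-w` timers are cumulative: each one is measured from the start of the request. Subtracting neighbouring values gives the time spent in each phase, which is usually what a diagnosis needs. A minimal sketch follows; the function name `curl_phases` and the output labels are illustrative.

```shell
#!/usr/bin/env bash
# Turn curl's cumulative timers (seconds) into per-phase durations (milliseconds).
# Arguments, in order:
#   time_namelookup time_connect time_appconnect time_starttransfer time_total
curl_phases() {
    awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
        # time_appconnect stays 0 for plain HTTP, so there is no TLS phase to report
        tls_ms = (tls > 0) ? (tls - conn) * 1000 : 0
        ready  = (tls > 0) ? tls : conn      # moment the request can be sent
        printf "dns=%.0fms tcp=%.0fms tls=%.0fms server=%.0fms transfer=%.0fms\n",
            dns * 1000, (conn - dns) * 1000, tls_ms,
            (ttfb - ready) * 1000, (total - ttfb) * 1000
    }'
}
```

A possible invocation is `curl_phases $(curl -o /dev/null -s -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' https://example.com)`. A large `dns` value points at the resolver, a large `tcp` or `tls` value at the network path, and a large `server` value at the backend.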

Batch Test Script with curl

#!/bin/bash
# Measure the response time of a list of endpoints
URLS=(
    "https://api.example.com/health"
    "https://api.example.com/users"
    "https://api.example.com/orders"
)

for url in "${URLS[@]}"; do
    # %{http_code} is 000 when the request fails or hits the 10-second limit
    result=$(curl -o /dev/null -s -w "%{http_code} %{time_total}s" --max-time 10 "${url}")
    echo "$(date '+%Y-%m-%d %H:%M:%S') ${url} -> ${result}"
done

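Debugging iptables Firewalls

Viewing Rules

# Show all rules
iptables -L -n -v
iptables -L -n -v --line-numbers

# Show a specific chain
iptables -L INPUT -n -v
iptables -L FORWARD -n -v

# Show NAT rules
iptables -t nat -L -n -v

# Show rules in their original command form, and with rule numbers
iptables -S INPUT
iptables -nL INPUT --line-numbers

# Inspect connection tracking (/proc/net/nf_conntrack is missing on some kernels; use conntrack -L there)
cat /proc/net/nf_conntrack | head
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max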

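Tracing Rules

# Insert a LOG rule for tracing
iptables -I INPUT -s 192.168.1.100 -j LOG --log-prefix "IPTABLES_DEBUG: " --log-level 4

# Watch the log (RHEL/CentOS path; elsewhere try journalctl -k -f or /var/log/kern.log)
tail -f /var/log/messages | grep IPTABLES_DEBUG

# Remove the LOG rule when done
iptables -D INPUT -s 192.168.1.100 -j LOG --log-prefix "IPTABLES_DEBUG: " --log-level 4

# Trace the full path of specific packets through the tables
iptables -t raw -A PREROUTING -s 192.168.1.100 -j TRACE
# Watch the trace log (with the nftables backend, use xtables-monitor --trace instead)
tail -f /var/log/messages | grep TRACE
# Remove the TRACE rule when done; it logs every matching packet
iptables -t raw -D PREROUTING -s 192.168.1.100 -j TRACE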

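Common Firewall Problems

# Problem 1: the port is unreachable although the service is running
# Check for a matching rule
iptables -L INPUT -n -v | grep 8080
# If no rule allows the port, add one:
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Problem 2: DNAT forwarding does not work
iptables -t nat -L -n -v
# Check that the FORWARD chain allows the traffic
iptables -L FORWARD -n -v
# Check and enable IP forwarding
sysctl net.ipv4.ip_forward
sysctl -w net.ipv4.ip_forward=1

# Problem 3: the conntrack table is full
cat /proc/sys/net/netfilter/nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count
# Enlarge the connection tracking table
sysctl -w net.netfilter.nf_conntrack_max=262144

# Flush all rules (use with great care)
# Set the default policies to ACCEPT first; flushing with a DROP policy cuts off remote access
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X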

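Connection Tracking

The conntrack Tool

# Install
yum install -y conntrack-tools

# List the connection tracking table
conntrack -L

# Count tracked connections
conntrack -C

# Filtered views
conntrack -L -s 192.168.1.100       # by source address
conntrack -L -d 192.168.1.100       # by destination address
conntrack -L -p tcp                  # TCP only
conntrack -L -p tcp --dport 80      # by destination port

# Delete specific entries (for debugging)
conntrack -D -s 192.168.1.100
conntrack -D -p tcp --orig-port-dst 80

# Watch connection tracking events
conntrack -E

# Show connection tracking statistics
conntrack -S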

Tuning Connection Tracking

# Show the current limit and usage
cat /proc/sys/net/netfilter/nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# Adjust connection tracking parameters
cat >> /etc/sysctl.conf << 'EOF'
# Size of the connection tracking table
net.netfilter.nf_conntrack_max = 262144

# Timeout for established TCP connections (the 5-day default is too long)
net.netfilter.nf_conntrack_tcp_timeout_established = 7200

# Timeout for TCP CLOSE_WAIT
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60

# Timeout for TCP FIN_WAIT
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 30

# UDP timeouts
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120
EOF

sysctl -p
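
Before the table fills up and the kernel starts dropping new connections, the count/max ratio can be checked from cron or a monitoring agent. The function below is a sketch; its name, the 80% default threshold and the file-path parameters (there so it can be pointed at test files) are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Report conntrack table usage and signal when it crosses a threshold.
#   $1 = count file (default /proc/sys/net/netfilter/nf_conntrack_count)
#   $2 = max file   (default /proc/sys/net/netfilter/nf_conntrack_max)
#   $3 = warning threshold in percent (default 80)
# Exit status: 0 below the threshold, 1 at or above it, 2 if a file cannot be read.
conntrack_usage() {
    count_file=${1:-/proc/sys/net/netfilter/nf_conntrack_count}
    max_file=${2:-/proc/sys/net/netfilter/nf_conntrack_max}
    threshold=${3:-80}
    count=$(cat "$count_file") || return 2
    max=$(cat "$max_file") || return 2
    pct=$(( count * 100 / max ))        # integer percentage, rounded down
    echo "conntrack: ${count}/${max} (${pct}%)"
    [ "$pct" -lt "$threshold" ]
}
```

For instance, `conntrack_usage || logger "conntrack table nearly full"` would write a syslog entry once usage reaches the threshold.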

Network Namespaces

Basic Operations

# Create namespaces
ip netns add ns1
ip netns add ns2

# List all namespaces
ip netns list

# Run a command inside a namespace
ip netns exec ns1 ip addr
ip netns exec ns1 ping 192.168.1.1

# Create a veth pair (a virtual cable)
ip link add veth1 type veth peer name veth2

# Move the interfaces into the namespaces
ip link set veth1 netns ns1
ip link set veth2 netns ns2

# Assign IP addresses
ip netns exec ns1 ip addr add 10.0.0.1/24 dev veth1
ip netns exec ns2 ip addr add 10.0.0.2/24 dev veth2

# Bring the interfaces up
ip netns exec ns1 ip link set veth1 up
ip netns exec ns2 ip link set veth2 up

# Test connectivity
ip netns exec ns1 ping 10.0.0.2

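Debugging Container Networking

# Find the PID that owns the container's network namespace
docker inspect --format '{{.State.Pid}}' container_name

# Enter the container's network namespace with nsenter
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' container_name)
nsenter -t ${CONTAINER_PID} -n ip addr
nsenter -t ${CONTAINER_PID} -n tcpdump -i eth0 -nn port 80
# With -n alone, /etc/resolv.conf and /etc/hosts still come from the host, so a
# container-only name such as backend may not resolve; use its IP address if so
nsenter -t ${CONTAINER_PID} -n curl -v http://backend:8080/health

# Capture directly inside the container's network namespace (the file lands on the host)
nsenter -t ${CONTAINER_PID} -n tcpdump -i eth0 -w /tmp/container.pcap

# Show the container's routes
nsenter -t ${CONTAINER_PID} -n ip route

# Show the container's DNS configuration
# (read it through the container's root file system; nsenter -n would show the host's file)
cat /proc/${CONTAINER_PID}/root/etc/resolv.conf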

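Network Performance Testing

Bandwidth Testing with iperf3

# Install
yum install -y iperf3

# Server (start on one machine)
iperf3 -s -p 5201

# Client (run the test from another machine)
iperf3 -c 192.168.1.10 -p 5201

# TCP bandwidth test (default)
iperf3 -c 192.168.1.10 -t 30         # run for 30 seconds
iperf3 -c 192.168.1.10 -P 4          # 4 parallel streams
iperf3 -c 192.168.1.10 -R            # reverse test (server sends)

# UDP bandwidth test
iperf3 -c 192.168.1.10 -u -b 1G      # target bandwidth 1 Gbps
iperf3 -c 192.168.1.10 -u -b 100M -l 1400   # set the datagram size

# Jitter and loss with small UDP packets (iperf3 does not report latency)
iperf3 -c 192.168.1.10 --udp -b 1M -l 64

# JSON output
iperf3 -c 192.168.1.10 -J

# Bind to a local address
iperf3 -c 192.168.1.10 -B 192.168.1.20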

Latency Testing

# Latency summary from ping (min/avg/max/mdev)
ping -c 100 -i 0.1 192.168.1.1 | tail -1

# Latency test with sockperf
yum install -y sockperf
sockperf sr --tcp -p 12345           # server
sockperf pp --tcp -i 192.168.1.10 -p 12345 -t 30   # client

# Testing with qperf (first start the server on 192.168.1.10 by running qperf with no arguments)
yum install -y qperf
qperf 192.168.1.10 tcp_lat           # TCP latency
qperf 192.168.1.10 tcp_bw            # TCP bandwidth
qperf 192.168.1.10 udp_lat           # UDP latency

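Common Network Fault Cases

Case 1: TCP Connection Timeout

# Symptom: the client times out connecting to the server

# 1. Check network connectivity
ping -c 3 192.168.1.100

# 2. Check whether the port is reachable
telnet 192.168.1.100 8080
# or with nc
nc -zv -w 5 192.168.1.100 8080

# 3. Capture and inspect the packets
tcpdump -i eth0 -nn host 192.168.1.100 and port 8080

# 4. Check the route
ip route get 192.168.1.100

# 5. Check the firewall (on the server)
iptables -L INPUT -n -v | grep 8080

# Common causes:
# - The service is not running or listens on a different port
# - A firewall blocks the traffic
# - No route to the destination
# - SYN packets are dropped (SYN flood protection)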

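Case 2: DNS Resolution Timeout

# Symptom: DNS lookups time out occasionally

# 1. Test DNS resolution
dig +stats example.com | grep "Query time"

# 2. Compare different DNS servers
dig @8.8.8.8 +stats example.com | grep "Query time"
dig @114.114.114.114 +stats example.com | grep "Query time"

# 3. Check resolv.conf
cat /etc/resolv.conf

# 4. Check that the DNS server is reachable
ping -c 3 8.8.8.8
telnet 8.8.8.8 53                    # tests TCP port 53 only; most queries use UDP

# 5. Capture DNS packets with tcpdump
tcpdump -i eth0 -nn port 53 -c 50

# Common causes:
# - The DNS server is unavailable
# - /etc/resolv.conf is misconfigured
# - The DNS server responds slowly
# - A device on the path intercepts DNS queries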

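Case 3: Packet Loss

# Symptom: file transfers are slow and connections drop occasionally

# 1. Locate the loss with mtr
mtr -n -r -c 100 192.168.1.100

# 2. Check interface statistics
ip -s link show eth0
# Look at errors, dropped and overruns for both RX and TX

# 3. Check the NIC
ethtool eth0
ethtool -S eth0 | grep -i error
ethtool -S eth0 | grep -i drop

# 4. Check kernel network statistics
cat /proc/net/dev | column -t
cat /proc/net/snmp

# 5. Check the NIC ring buffers
ethtool -g eth0

# Common causes:
# - Poor cable or fibre quality
# - NIC ring buffers too small
# - Network congestion
# - MTU mismatch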
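
Step 4 can be narrowed down to the interfaces that actually show errors or drops. The sketch below assumes the standard `/proc/net/dev` column order (eight receive counters followed by eight transmit counters); the function name `nic_errors` is illustrative.

```shell
#!/usr/bin/env bash
# List interfaces with non-zero error or drop counters.
# $1: file in /proc/net/dev format (defaults to /proc/net/dev)
nic_errors() {
    awk 'NR > 2 {                        # the first two lines are headers
        sub(/:/, " ")                    # "eth0:123" and "eth0: 123" both split cleanly
        # $1 iface | $2..$9 receive counters | $10..$17 transmit counters
        # $4 = rx errs, $5 = rx drop, $12 = tx errs, $13 = tx drop
        if ($4 + $5 + $12 + $13 > 0)
            printf "%s rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d\n", $1, $4, $5, $12, $13
    }' "${1:-/proc/net/dev}"
}
```

The counters are cumulative since boot, so running `nic_errors` twice a few minutes apart and comparing the output shows whether loss is still happening.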

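Case 4: Connection Problems Caused by MTU

# Symptom: small packets work, large packets time out

# 1. Compare packet sizes with fragmentation forbidden (-M do sets the DF bit)
ping -c 3 -M do -s 1372 192.168.1.100      # succeeds if the path MTU is at least 1400
ping -c 3 -M do -s 1472 192.168.1.100      # fails or times out if the path MTU is below 1500

# 2. Check the path MTU
# "ping: local error: message too long, mtu=N": the local interface MTU (or cached path MTU) is N
# "Frag needed and DF set (mtu = N)": a router on the path reports a next-hop MTU of N
# Silent timeouts: the ICMP error is being filtered (path MTU black hole)
tracepath -n 192.168.1.100                 # reports the discovered pmtu hop by hop

# 3. Show the interface MTU
ip link show eth0 | grep mtu

# 4. Change the MTU
ip link set dev eth0 mtu 1400

# 5. Capture fragments (non-zero fragment offset or More Fragments flag set)
tcpdump -i eth0 -nn 'ip[6:2] & 0x1fff != 0 or ip[6] & 0x20 != 0'

# Typical setting: VPNs and tunnels, where encapsulation lowers the usable MTU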
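
Probing sizes by hand gets tedious, so the largest payload that passes with DF set can be found by binary search. This is a sketch, not a replacement for tracepath: it assumes ICMP echo is allowed end to end, a local MTU of 1500, and the function names `probe_ping` and `find_path_mtu` are invented here. The probe is passed in by name so the search logic can be exercised without a network.

```shell
#!/usr/bin/env bash
# Probe: succeeds when one unfragmented echo with the given payload gets a reply.
#   $1 = target host, $2 = ICMP payload size in bytes
probe_ping() {
    ping -c 1 -W 2 -M do -s "$2" "$1" >/dev/null 2>&1
}

# Binary-search the largest payload that passes, then add 28 bytes
# (20-byte IP header + 8-byte ICMP header) to get the path MTU.
#   $1 = target host, $2 = probe function name (default probe_ping)
find_path_mtu() {
    target=$1
    probe=${2:-probe_ping}
    lo=0            # largest payload known to pass
    hi=1472         # 1472 + 28 = 1500, the assumed local MTU
    "$probe" "$target" "$lo" || { echo "unreachable"; return 1; }
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))    # round up so the loop always makes progress
        if "$probe" "$target" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo $(( lo + 28 ))
}
```

`find_path_mtu 192.168.1.100` needs about a dozen probes. On a path through a tunnel the result is typically 1500 minus the encapsulation overhead.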

Network Interface Diagnostics

The ip Command

# Show IP addresses
ip addr show
ip addr show eth0

# Show the MAC address
ip link show eth0

# Show the routing table
ip route show
ip route get 192.168.1.100

# Show the ARP (neighbour) table
ip neigh show
ip neigh show dev eth0

# Show VLANs
ip -d link show type vlan

# Show interface statistics
ip -s link show eth0
ip -s -s link show eth0    # more detail

# Show multicast addresses
ip maddr show
ip maddr show dev eth0

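NIC Diagnostics with ethtool

# Show basic NIC information
ethtool eth0

# Show link state
ethtool eth0 | grep "Link detected"

# Show speed and duplex mode
ethtool eth0 | grep -E "Speed|Duplex"

# Show NIC statistics counters
ethtool -S eth0

# Show ring buffer sizes
ethtool -g eth0

# Change ring buffer sizes (reduces drops; must not exceed the maximums shown by -g)
ethtool -G eth0 rx 4096 tx 4096

# Show offload settings
ethtool -k eth0

# Show driver information
ethtool -i eth0

# Show the channel (queue) counts used by RSS (receive-side scaling);
# the RSS indirection table itself is shown by ethtool -x
ethtool -l eth0

# Enable NIC features
ethtool -K eth0 gro on
ethtool -K eth0 tso on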

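Advantages

Rich tooling: Linux offers a large set of network diagnostic tools covering every protocol layer
Little to install: basic tools such as ping, ip and ss are normally preinstalled, and the rest (tcpdump, mtr, iperf3) are one package install away
Real-time diagnosis: network state and traffic can be observed live
Scriptable: every command-line tool can be scripted for automated monitoring
In-depth analysis: tcpdump combined with Wireshark allows analysis down to the individual packet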

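Disadvantages

Steep learning curve: there are many tools with complex options
Privileges required: capturing packets and changing network configuration need root (or capabilities such as CAP_NET_RAW and CAP_NET_ADMIN)
Time-consuming analysis: large captures take experience and time to analyse
Production risk: careless operations can disrupt the production network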

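Summary

Network troubleshooting is systematic work, and the key is to think in layers. Check each layer in turn from the physical layer up to the application layer, use tcpdump and Wireshark for in-depth analysis, use ss/netstat to understand connection state, and use curl to verify reachability at the application layer. With these tools and methods, the great majority of network faults can be located quickly.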

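Key Points

The layered diagnostic model and the bottom-up troubleshooting method
tcpdump capture filter syntax (BPF) and advanced usage
The TCP state machine, what each state means, and how to troubleshoot it
The DNS resolution process and common DNS faults
iptables rule matching order and debugging methods
Using network namespaces to debug container networking
How MTU affects network communication and how to diagnose it
How the conntrack table works and how to tune it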

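Common Pitfalls

Pinging without capturing: a successful ping does not mean the service is available; combine it with port tests and packet captures
Ignoring DNS: many timeouts are really slow DNS resolution rather than a broken network
Relying too much on netstat: ss performs better and should be preferred on hosts with many connections
Overlooking MTU: in VPN and tunnel setups, MTU mismatch is a common reason large packets are lost
Not saving captures: reading terminal output is inefficient; save a pcap file and analyse it in Wireshark
Ignoring ARP: IP conflicts or wrong ARP entries cause intermittent network problems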

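Going Further

Learn eBPF: get to know the newer eBPF-based network diagnostic tools (such as bpftrace)
Network performance tuning: study TCP BBR congestion control, NIC multi-queue, and DPDK in depth
Service mesh debugging: learn network diagnostics in Istio/Envoy environments
Automated monitoring: build a network monitoring stack with Prometheus and Grafana
SRE methodology: study the incident response process for network failures described in Google's SRE material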

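Where This Applies

Connectivity faults such as connection timeouts and connection refused
Performance problems such as packet loss and high latency
DNS resolution errors
Access problems caused by firewall rules
SSL/TLS certificate and handshake problems
Container network communication problems
Troubleshooting load balancers and reverse proxies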

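Putting It into Practice

Establish diagnostic SOPs: document the diagnostic steps for common network faults
Preinstall tools: install tcpdump, mtr, iperf3 and other diagnostic tools when provisioning new servers
Monitor first: deploy network quality monitoring (SmokePing, Blackbox Exporter)
Run regular drills: simulate network fault scenarios to sharpen the team's troubleshooting skills
Standardise captures: define a policy for saving packet captures so they are available for later analysis
Record what you learn: keep a case library of network faults with symptoms, investigation steps and fixes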

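Troubleshooting Checklist

Symptom	Likely cause	Commands
ping fails	Network unreachable / firewall	`ping`, `traceroute`, `iptables -L`
Port unreachable	Service not running / firewall	`ss -tlnp`, `telnet`, `nc -zv`
DNS resolution fails	DNS misconfiguration	`dig`, `nslookup`, `cat /etc/resolv.conf`
Connection timeout	Routing / firewall / MSS	`tcpdump`, `mtr`, `curl -v`
Many TIME_WAIT	Too many short-lived connections	`ss -s`, tune kernel parameters
Many CLOSE_WAIT	Application does not close connections	`ss -tnp state close-wait`, review application code
Heavy packet loss	Congestion / link quality	`mtr`, `ethtool -S`, `ip -s link`
SSL handshake fails	Certificate / protocol mismatch	`openssl s_client`, `curl -v`
Intermittent timeouts	High load / MTU / DNS	`tcpdump`, `mtr`, `dig +stats`
HTTP 502/504	Backend unavailable	`curl -v`, `ss -tlnp`, check backend logs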

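Review Questions

How would you use tcpdump and Wireshark to find the root cause of a TCP connection reset?
What do large numbers of TIME_WAIT and large numbers of CLOSE_WAIT each indicate, and how do the fixes differ?
How can you capture complete packets on a specific interface without affecting production?
In a container environment, how do you diagnose network communication problems between Pods?
What symptoms does an MTU mismatch cause, and how do you measure the path MTU accurately?
How would you build an automated network quality monitoring system?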