Linux High-Availability Cluster Solutions

SunnyFan


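Introduction

A Linux high-availability (HA) cluster is a set of servers that work together so that, when a single component fails, services are moved automatically to a healthy node and the business keeps running. In production, HA clusters are a foundational building block for critical services that target 99.99% availability or better.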
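To make the 99.99% figure concrete, the following sketch converts an availability target into a yearly downtime budget. The function name `downtime_minutes_per_year` is an assumption made for illustration, and the calculation assumes a 365-day year.

```bash
#!/usr/bin/env bash
# Hypothetical helper: minutes of downtime per 365-day year allowed by an
# availability target given in percent (for example 99.99).
downtime_minutes_per_year() {
    awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.9    # three nines
downtime_minutes_per_year 99.99   # four nines
```

At 99.99% the budget is roughly 52.6 minutes per year, which leaves little room for manual intervention and is the reason failover has to be automatic.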

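This article walks through the core technologies of Linux HA clusters: the Corosync + Pacemaker architecture, resource management, STONITH fencing, quorum, clustered service configuration (NFS, MySQL, PostgreSQL, Nginx), failure detection and recovery, cluster monitoring, and production deployment practice.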

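Features

Core characteristics of a Linux HA cluster:

Automatic failover: when a node failure is detected, services are moved to a healthy node automatically
Resource orchestration: IP addresses, file systems, services and other resources are managed together, including their start order
Quorum: prevents split-brain from corrupting data
Fencing: STONITH ensures that a failed node is really offline
Flexible policies: location, ordering and colocation constraints can be combined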

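Implementation

1. Cluster Architecture Overview

1.1 Corosync + Pacemaker Architecture

Layers of the Linux HA cluster architecture: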

    +----------------------------------+
    |        Application layer         |
    |   (Nginx/MySQL/PostgreSQL/...)   |
    +----------------------------------+
              |               |
    +-----------------+ +-----------------+
    |   Pacemaker     | |   Pacemaker     |
    | (resource mgmt) | | (resource mgmt) |
    +-----------------+ +-----------------+
              |               |
    +-----------------+ +-----------------+
    |   Corosync      | |   Corosync      |
    | (messaging and  | | (messaging and  |
    |  membership)    | |  membership)    |
    +-----------------+ +-----------------+
              |               |
    +-----------------+ +-----------------+
    |   Node A        | |   Node B        |
    | (active)        | | (standby)       |
    +-----------------+ +-----------------+

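Component responsibilities:
- Corosync: inter-node messaging, membership, and the quorum service
- Pacemaker: resource scheduling, failure detection, and recovery policy
- PCS: command-line tool for managing the cluster
- SBD/STONITH: node fencing mechanisms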

1.2 Preparing the Environment

# ==== System requirements ====
# - CentOS 7/8 or RHEL 7/8
# - At least 2 nodes (3 recommended so that quorum works naturally)
# - Network connectivity between nodes (a dedicated heartbeat network is recommended)
# - Time synchronized on all nodes (NTP/Chrony)
# - Host name resolution (/etc/hosts or DNS)

# ==== Configure host names and name resolution ====
# Set the host name on each node:

# Node 1
hostnamectl set-hostname node1.example.com

# Node 2
hostnamectl set-hostname node2.example.com

# Node 3
hostnamectl set-hostname node3.example.com

# Configure /etc/hosts (all nodes)
cat >> /etc/hosts << 'EOF'
192.168.1.10  node1.example.com  node1
192.168.1.11  node2.example.com  node2
192.168.1.12  node3.example.com  node3
# Virtual IP
192.168.1.100 vip.example.com    vip
EOF

# ==== Configure time synchronization ====
yum install -y chrony
systemctl enable --now chronyd

# Verify time synchronization
chronyc sources -v
chronyc tracking

# ==== Configure passwordless SSH ====
# Run on node1:
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
ssh-copy-id node1
ssh-copy-id node2
ssh-copy-id node3

# Verify
ssh node2 "hostname"
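Name resolution problems are easier to catch before the cluster is created. The sketch below checks that each node name appears in a hosts-format file; `missing_hosts` is a hypothetical helper, not part of any cluster tool, and it only inspects the file it is given (it does not query DNS).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every given name that does NOT appear as a
# host name or alias in a hosts-format file.
# Usage: missing_hosts <hosts_file> <name>...
missing_hosts() {
    local hosts_file=$1
    shift
    local name
    for name in "$@"; do
        if ! awk -v n="$name" '
            /^[[:space:]]*#/ { next }        # skip comment lines
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break      # rest of the line is a comment
                    if ($i == n) found = 1
                }
            }
            END { exit !found }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running `missing_hosts /etc/hosts node1 node2 node3` on each node should print nothing once the entries above are in place.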

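2. Installing and Configuring the Cluster

2.1 Installing the Cluster Software

# ==== CentOS 7 ====
yum install -y pacemaker corosync pcs psmisc policycoreutils-python

# ==== CentOS 8 / RHEL 8 ====
# The packages come from the High Availability repository, which has to be enabled first
# (repository id "ha" on CentOS 8.3 and later, "HighAvailability" on earlier 8.x releases).
dnf install -y pacemaker corosync pcs psmisc policycoreutils-python-utils

# Enable the pcsd service on all nodes
systemctl enable --now pcsd

# Set the password of the hacluster user (on all nodes; use the same password everywhere)
echo "hacluster:YourSecurePassword" | chpasswd

# ==== Firewall ====
# Open the ports of the predefined high-availability service
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload

# Or open the ports by hand:
firewall-cmd --permanent --add-port=2224/tcp       # pcsd
firewall-cmd --permanent --add-port=3121/tcp       # Pacemaker Remote
firewall-cmd --permanent --add-port=5403/tcp       # corosync-qnetd (QDevice)
firewall-cmd --permanent --add-port=5404-5406/udp  # Corosync
firewall-cmd --reload

# ==== SELinux ====
# Keep SELinux enabled and set the boolean the cluster daemons need
setsebool -P daemons_enable_cluster_mode 1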

2.2 Creating the Cluster

# ==== Authenticate the nodes ====
# Run on any one node.
# pcs 0.9 (CentOS/RHEL 7):
pcs cluster auth node1 node2 node3 -u hacluster -p YourSecurePassword
# pcs 0.10 (CentOS/RHEL 8):
# pcs host auth node1 node2 node3 -u hacluster -p YourSecurePassword

# ==== Create the cluster ====
# pcs 0.9 (CentOS/RHEL 7):
pcs cluster setup --name mycluster node1 node2 node3
# pcs 0.10 (CentOS/RHEL 8):
# pcs cluster setup mycluster node1 node2 node3

# Example output:
# Destroying cluster on nodes: node1, node2, node3...
# node1: Stopping Cluster (pacemaker)...
# node2: Stopping Cluster (pacemaker)...
# node3: Stopping Cluster (pacemaker)...
# Sending cluster config files to the nodes...
# node1: Succeeded
# node2: Succeeded
# node3: Succeeded

# ==== Start the cluster ====
pcs cluster start --all

# Start the cluster automatically at boot
pcs cluster enable --all

# ==== Verify the cluster state ====
pcs status cluster

# Example output:
# Cluster Summary:
#   * Stack: corosync
#   * Current DC: node1 (version 2.0.5)
#   * Last updated: Thu Apr 16 10:00:00 2026
#   * Last change:  Thu Apr 16 09:59:00 2026
#   * 3 nodes configured
#   * 0 resource instances configured
#
# Node List:
#   * Online: [ node1 node2 node3 ]

# Show Corosync membership
pcs status corosync

# Show node status
pcs status nodes

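2.3 Cluster-Wide Settings

# ==== Disable STONITH (temporarily; it is configured in section 3) ====
pcs property set stonith-enabled=false

# ==== Quorum policy ====
# Behaviour when the partition has no quorum: stop (default), freeze, ignore
pcs property set no-quorum-policy=stop

# ==== Failback policy ====
# resource-stickiness is a score, not a true/false switch: the higher it is, the more
# strongly a resource stays on its current node after the failed node comes back.
# pcs 0.9 syntax; on pcs 0.10.7 and later use: pcs resource defaults update resource-stickiness=100
pcs resource defaults resource-stickiness=100

# ==== Show all properties ====
pcs property list

# ==== Corosync heartbeat settings ====
# Edit /etc/corosync/corosync.conf (this overwrites the file generated by pcs cluster setup).
# The example targets Corosync 3 (knet transport, CentOS/RHEL 8).
# Comments are kept on their own lines because corosync.conf(5) only defines
# lines that start with # as comments.
cat > /etc/corosync/corosync.conf << 'EOF'
totem {
    version: 2
    cluster_name: mycluster
    transport: knet
    # Token timeout in milliseconds
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 50
    # Consensus timeout in milliseconds (at least 1.2 x token)
    consensus: 3600
    max_messages: 20
    # Encrypt and authenticate cluster traffic; requires /etc/corosync/authkey on every node
    # (these two options replace secauth from Corosync 2)
    crypto_cipher: aes256
    crypto_hash: sha256

    interface {
        linknumber: 0
        mcastport: 5405
    }
}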

nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2
        nodeid: 2
    }
    node {
        ring0_addr: node3
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 0
    wait_for_all: 1
    last_man_standing: 1
    last_man_standing_window: 10000
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
    debug: off
}
EOF

# Distribute the configuration to all nodes and reload Corosync
pcs cluster sync
pcs cluster reload corosync

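3. STONITH / Fencing

# ==== Why STONITH matters ====
# STONITH (Shoot The Other Node In The Head)
# guarantees that a failed node has really stopped, so split-brain cannot corrupt data

# ==== Enable STONITH ====
pcs property set stonith-enabled=true

# ==== IPMI fencing (physical servers) ====
# Each node has its own BMC, so create one fence device per node.
# Shown for node1; repeat with the BMC address of node2 and node3.
pcs stonith create fence_node1 fence_ipmilan \
    ipaddr=192.168.1.200 \
    login=admin \
    passwd=admin_password \
    lanplus=1 \
    cipher=1 \
    pcmk_host_list="node1" \
    op monitor interval=60s

# ==== SBD (fencing through shared storage) ====
# 1. Prepare a shared block device (at least 1 MB) that all nodes can see
# 2. Initialize SBD on it
sbd -d /dev/sdb create
sbd -d /dev/sdb dump

# 3. Configure SBD (all nodes)
cat > /etc/sysconfig/sbd << 'EOF'
SBD_DEVICE="/dev/sdb"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=no
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=15
EOF

# 4. Enable SBD (all nodes); the service starts together with the cluster stack
systemctl enable sbd
pcs property set stonith-enabled=true
# stonith-watchdog-timeout is meant for watchdog-only (diskless) SBD and must then be
# larger than SBD_WATCHDOG_TIMEOUT; with a shared device it can stay unset.

# 5. Create the SBD fence resource
# fence_sbd is the agent shipped with fence-agents on CentOS/RHEL;
# external/sbd is the equivalent plugin on crmsh-based distributions.
pcs stonith create sbd_fence fence_sbd devices=/dev/sdb \
    op monitor interval=15s timeout=60s

# ==== KVM/QEMU fencing (virtual machines) ====
# Requires fence_virtd on the hypervisor and its shared key on every guest
pcs stonith create vm_fence fence_xvm \
    pcmk_host_list="node1 node2 node3" \
    op monitor interval=60s

# ==== Verify fencing ====
# Test fencing (this reboots the target node!)
# pcs stonith fence node2

# Show the fencing configuration
# (pcs 0.10 replaces "pcs stonith show" with "pcs stonith status" and "pcs stonith config")
pcs stonith show
pcs stonith level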

4. Resource Management

4.1 Basic Resources

# Resources created with --group are appended to the group in creation order,
# so they are created here in the order in which they must start.

# ==== Virtual IP (VIP) ====
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 \
    cidr_netmask=24 \
    op monitor interval=30s \
    --group web-group

# ==== File system ====
pcs resource create WebData ocf:heartbeat:Filesystem \
    device="/dev/mapper/vg_web-lv_web" \
    directory="/var/www/html" \
    fstype="xfs" \
    op monitor interval=20s \
    --group web-group

# ==== Nginx service ====
pcs resource create Nginx systemd:nginx \
    op monitor interval=20s \
    op start timeout=40s \
    op stop timeout=40s \
    --group web-group

# ==== Show the resource group ====
# (pcs 0.10: pcs resource config web-group)
pcs resource show web-group
# Resource: web-group
#   VirtualIP  (ocf::heartbeat:IPaddr2):  Started node1
#   WebData    (ocf::heartbeat:Filesystem): Started node1
#   Nginx      (systemd:nginx):            Started node1

# ==== Start order (guaranteed by the group) ====
# Group members start in the order they were added and stop in reverse:
#   start: VirtualIP -> WebData -> Nginx
#   stop:  Nginx -> WebData -> VirtualIP
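The ordering rule for groups can be written down as a tiny sketch: the stop order is the member list reversed. `stop_order` is a hypothetical helper used only to illustrate the rule; Pacemaker applies this ordering internally.

```bash
#!/usr/bin/env bash
# Hypothetical helper: given group members in start order,
# print them in the order in which they are stopped.
stop_order() {
    local reversed="" member
    for member in "$@"; do
        # Prepend each member so the last one started comes first
        reversed="$member${reversed:+ $reversed}"
    done
    echo "$reversed"
}

stop_order VirtualIP WebData Nginx
```

For the group above this prints `Nginx WebData VirtualIP`, matching the stop sequence listed in the comments.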

4.2 Constraints

# ==== Location constraints ====
# Express which nodes a resource prefers

# Prefer node1
pcs constraint location web-group prefers node1=100

# Avoid node3 (pcs negates the score itself, so it is given as a positive number)
pcs constraint location web-group avoids node3=100

# Node attributes
pcs node attribute node1 rack=rack1
pcs node attribute node2 rack=rack1
pcs node attribute node3 rack=rack2

# Attribute-based rule
pcs constraint location web-group rule \
    score=100 rack eq rack1

# ==== Order constraints ====
# Define the start/stop order of resources.
# Members of one group are already ordered and colocated; the explicit constraints
# below show the syntax for resources that are not grouped.

# Start the IP first, then Nginx (kind: Mandatory | Optional | Serialize)
pcs constraint order VirtualIP then Nginx \
    kind=Mandatory

# Mount the storage first, then start the service
pcs constraint order WebData then Nginx \
    kind=Mandatory

# ==== Colocation constraints ====
# Define whether resources run on the same node

# Nginx must run on the same node as the VIP
pcs constraint colocation add Nginx with VirtualIP INFINITY

# Show all constraints
pcs constraint list
pcs constraint location
pcs constraint order
pcs constraint colocation

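4.3 Resource Operations and Maintenance

# ==== Resource operations ====

# Move a resource manually
pcs resource move VirtualIP node2

# Remove the constraint created by the move (the cluster may then place the resource freely again)
pcs resource clear VirtualIP

# Start/stop a resource manually
pcs resource enable Nginx
pcs resource disable Nginx

# Put a node into standby (it runs no resources)
pcs node standby node2

# Take the node out of standby
pcs node unstandby node2

# ==== Resource cleanup ====
# Clear the failed state of one resource
pcs resource cleanup Nginx

# Clear the failed state of all resources
pcs resource cleanup

# ==== Show resource status ====
pcs status resources
# List resource groups (pcs 0.9; on pcs 0.10 use: pcs resource group list)
pcs status groups

# Show the fail count of a resource
pcs resource failcount show Nginx

# Reset the fail count
pcs resource failcount reset Nginx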

5. Clustered Service Configuration

5.1 NFS High Availability

# ==== NFS HA cluster ====

# LVM volume group on shared storage
# (ocf:heartbeat:LVM names the volume group with volgrpname; RHEL 8 ships
#  ocf:heartbeat:LVM-activate instead)
pcs resource create LVM-Web ocf:heartbeat:LVM \
    volgrpname=vg_nfs \
    exclusive=true \
    op monitor interval=30s

# File system
pcs resource create FS-NFS ocf:heartbeat:Filesystem \
    device="/dev/vg_nfs/lv_nfs" \
    directory="/export/nfs" \
    fstype="xfs" \
    op monitor interval=20s

# NFS export
pcs resource create NFS-Export ocf:heartbeat:exportfs \
    clientspec="192.168.1.0/24" \
    options="rw,sync,no_root_squash" \
    directory="/export/nfs" \
    fsid=0 \
    op monitor interval=30s

# VIP
pcs resource create VIP-NFS ocf:heartbeat:IPaddr2 \
    ip=192.168.1.101 \
    cidr_netmask=24 \
    op monitor interval=30s

# NFS server
pcs resource create NFS-Server systemd:nfs-server \
    op monitor interval=30s

# Resource group: the NFS server must be running before the export is added,
# and the VIP comes last so clients only connect once everything is ready
pcs resource group add nfs-group LVM-Web FS-NFS NFS-Server NFS-Export VIP-NFS

# Constraint
pcs constraint location nfs-group prefers node1=100

5.2 MySQL High Availability

# ==== MySQL HA with master/slave replication ====

# MySQL resource (OCF resource agent)
pcs resource create mysql ocf:heartbeat:mysql \
    binary="/usr/sbin/mysqld" \
    config="/etc/my.cnf" \
    datadir="/var/lib/mysql" \
    user="mysql" \
    group="mysql" \
    log="/var/log/mariadb/mariadb.log" \
    pid="/var/run/mariadb/mariadb.pid" \
    socket="/var/lib/mysql/mysql.sock" \
    op start timeout=120s \
    op stop timeout=120s \
    op monitor interval=20s role=Master \
    op monitor interval=30s role=Slave

# Turn it into a master/slave resource
# (pcs 0.9 syntax; pcs 0.10 uses "pcs resource promotable mysql promoted-max=1 ..."
#  and names the clone mysql-clone)
pcs resource master mysql-master mysql \
    master-max=1 \
    master-node-max=1 \
    clone-max=2 \
    clone-node-max=1 \
    notify=true

# VIP
pcs resource create mysql-vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.102 \
    cidr_netmask=24 \
    op monitor interval=30s

# Colocation: the VIP follows the master instance
pcs constraint colocation add mysql-vip with master mysql-master INFINITY

# Order: promote MySQL first, then start the VIP
pcs constraint order promote mysql-master then start mysql-vip \
    kind=Mandatory

# Location preference
pcs constraint location mysql-master prefers node1=100

# ==== Check the MySQL cluster state ====
pcs status
# Master/Slave Set: mysql-master [mysql]
#   Masters: [ node1 ]
#   Slaves: [ node2 ]
#   Stopped: [ node3 ]
# mysql-vip  (ocf::heartbeat:IPaddr2): Started node1

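5.3 PostgreSQL High Availability

# ==== PostgreSQL HA (with Patroni) ====
# Patroni handles leader election and failover itself through etcd,
# independently of the Pacemaker cluster described above.

# Install Patroni (quoted so the shell does not try to glob the brackets)
pip3 install 'patroni[etcd]'

# Create the Patroni configuration file /etc/patroni/config.yml
mkdir -p /etc/patroni
cat > /etc/patroni/config.yml << 'EOF'
scope: pg-cluster
name: node1  # different on every node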

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd:
  hosts: 192.168.1.10:2379,192.168.1.11:2379,192.168.1.12:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 5
        max_replication_slots: 5
        wal_log_hints: "on"

  initdb:
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - host replication replicator 192.168.1.0/24 md5
    - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin_password
      options:
        - createrole
        - createdb
    replicator:
      password: replicator_password
      options:
        - replication

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/pgsql/data
  bin_dir: /usr/pgsql-14/bin
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    shared_buffers: "4GB"
    work_mem: "64MB"
    max_connections: 200

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
EOF

# Create the systemd service
cat > /etc/systemd/system/patroni.service << 'EOF'
[Unit]
Description=Patroni PostgreSQL Cluster
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/config.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now patroni

# Show the Patroni cluster state (illustrative output; the columns vary by Patroni version)
patronictl -c /etc/patroni/config.yml list
# +------------+--------+-------+----+---------+
# | Cluster    | Member | Host  | RL | State   |
# +------------+--------+-------+----+---------+
# | pg-cluster | node1  | node1 | TL | Leader  |
# | pg-cluster | node2  | node2 |    | Replica |
# | pg-cluster | node3  | node3 |    | Replica |
# +------------+--------+-------+----+---------+

# Manual switchover
patronictl -c /etc/patroni/config.yml switchover

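5.4 Nginx Load Balancer High Availability

# ==== Nginx load balancer HA (managed by Pacemaker) ====

# Create the Nginx resource
pcs resource create nginx-lb systemd:nginx \
    op monitor interval=10s \
    op start timeout=30s \
    op stop timeout=30s

# Create the VIP
pcs resource create lb-vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 \
    cidr_netmask=24 \
    nic=eth0 \
    op monitor interval=10s

# Create a health check script (for manual checks or external monitoring)
cat > /usr/local/bin/nginx_healthcheck << 'SCRIPT'
#!/bin/bash
# Nginx health check: succeeds only when the local server answers HTTP 200
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:80/)
if [ "$STATUS" != "200" ]; then
    exit 1
fi
exit 0
SCRIPT
chmod +x /usr/local/bin/nginx_healthcheck

# Tune the monitor operation.
# Note: "op monitor" takes operation options such as interval and timeout; it has no
# "command" option, and systemd: resources are monitored through the unit state.
# To let Pacemaker probe an HTTP endpoint, use the ocf:heartbeat:nginx agent instead
# (see "pcs resource describe ocf:heartbeat:nginx" for its monitoring parameters).
pcs resource update nginx-lb \
    op monitor interval=10s timeout=20s

# Create the resource group (VIP first, then Nginx)
pcs resource group add lb-group lb-vip nginx-lb

# Active-active alternative: run Nginx as a clone on two nodes.
# Use this instead of nginx-lb and lb-group; two resources must not manage the same nginx unit.
pcs resource create nginx-clone systemd:nginx \
    op monitor interval=10s \
    clone clone-max=2 clone-node-max=1

# The VIP runs on one node only, and only where a clone instance is active
# (pcs names the clone <resource>-clone, hence nginx-clone-clone)
pcs constraint colocation add lb-vip with nginx-clone-clone INFINITY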

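6. Quorum

# ==== Quorum configuration ====

# Show quorum status
pcs quorum status

# Quorum strategies:
# 1. Odd number of nodes (recommended): one side of a split always holds a majority
# 2. Quorum device (QDevice): recommended for 2-node clusters
# 3. Quorum disk (QDisk): arbitration through shared storage (legacy cman-based clusters)

# ==== Configure a QDevice (2-node cluster) ====

# On the arbitrator host: install, then let pcs initialize and start corosync-qnetd
yum install -y pcs corosync-qnetd
systemctl enable --now pcsd
pcs qdevice setup model net --enable --start

# On the cluster nodes
yum install -y corosync-qdevice

# Add the QDevice (the cluster nodes must first be authenticated against the
# arbitrator host with pcs cluster auth / pcs host auth)
pcs quorum device add model net host=qdevice.example.com algorithm=ffsplit

# Verify
pcs quorum config
pcs quorum status
# Quorum information
# ------------------
# Date:             Thu Apr 16 2026
# Quorum provider:  corosync_votequorum
# Nodes configured: 2
# QDevice:          qdevice.example.com
#   Model:          Net
#   Algorithm:      ffsplit
#   Tie-breaker:    lowest

# ==== Two-node cluster without a QDevice ====
# Not recommended; for testing only. Ignoring quorum loss lets both halves of a
# split cluster keep running resources, so working fencing is essential.
pcs property set no-quorum-policy=ignore
# Or let the partition that contains the lowest node ID keep quorum
# (quorum options can only be changed while the cluster is stopped)
pcs quorum update auto_tie_breaker=1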
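The advice to prefer an odd number of nodes follows from majority voting. The sketch below assumes one vote per node and no quorum device; the helper names `quorum_votes` and `tolerated_failures` are illustrative and not part of Corosync.

```bash
#!/usr/bin/env bash
# Votes needed for quorum with one vote per node: floor(n / 2) + 1
quorum_votes() {
    local nodes=$1
    echo $(( nodes / 2 + 1 ))
}

# Number of nodes that can fail while the remainder still holds quorum
tolerated_failures() {
    local nodes=$1
    echo $(( nodes - (nodes / 2 + 1) ))
}

for n in 2 3 4 5; do
    echo "nodes=$n quorum=$(quorum_votes "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Under these assumptions a 2-node cluster tolerates no failure at all, and a 4-node cluster tolerates one failure, the same as a 3-node cluster. That is why an even node count adds cost without adding fault tolerance unless a QDevice supplies the tie-breaking vote.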

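7. Failure Detection and Recovery

# ==== Failure detection ====

# Resource monitor interval
pcs resource update Nginx op monitor interval=10s timeout=20s

# Failure thresholds
# migration-threshold: number of failures on one node before the resource is moved away
# failure-timeout: time after which a failure record expires
pcs resource update Nginx meta migration-threshold=3 failure-timeout=300s

# ==== Recovery policies ====

# Policy 1: automatic failback
# With a stickiness (50) below the location preference for node1 (100, section 4.2),
# the resource returns to node1 once that node is healthy again.
# The higher the stickiness, the less likely a failback.
pcs resource update VirtualIP meta resource-stickiness=50

# Policy 2: no failback (stay on the new node)
pcs resource update VirtualIP meta resource-stickiness=INFINITY

# Policy 3: scheduled failback
# A time-based rule: the preference for node1 only applies Monday to Friday, 09:00 to 18:59.
# The node expression is needed; a rule with only a date-spec would score all nodes equally.
pcs constraint location VirtualIP rule \
    score=100 \
    '#uname' eq node1 and date-spec months=1-12 weekdays=1-5 hours=9-18

# ==== Show failure history ====
# Fail counts per resource and node
pcs resource failcount show Nginx
# Failed actions are also listed at the end of the full status output
pcs status --full

# Clear failure records
pcs resource cleanup

# ==== Node-level failure detection ====
# Corosync token timeout
# token: time without the token after which a node is declared lost
#        (default 1000 ms in Corosync 2.x, 3000 ms since Corosync 3.1)
# token_retransmits_before_loss_const: retransmit attempts before the token counts as lost

# Edit /etc/corosync/corosync.conf (5 second token timeout, consensus at 1.2 x token)
# totem {
#     token: 5000
#     consensus: 6000
# }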
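Both token/consensus pairs used in this article (3000/3600 and 5000/6000) follow the 1.2 x token relation that corosync.conf(5) documents as the minimum and default for consensus. A minimal sketch of that arithmetic, with `min_consensus_ms` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: smallest consensus timeout (ms) for a given token
# timeout (ms), using integer arithmetic for 1.2 x token.
min_consensus_ms() {
    local token_ms=$1
    echo $(( token_ms * 12 / 10 ))
}

min_consensus_ms 3000
min_consensus_ms 5000
```

When only token is changed and consensus is left out of the file, Corosync derives the consensus value itself, which avoids the two settings drifting apart.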

8. Cluster Monitoring

# ==== Basic monitoring commands ====

# Overall cluster status
pcs status

# Node status
pcs status nodes

# Resource status
pcs status resources

# Constraints
pcs constraint list

# Corosync status
pcs status corosync

# Pacemaker status
crm_mon -1     # one-shot output
crm_mon -Afr   # node attributes, fail counts and inactive resources

# ==== Log analysis ====
# Pacemaker log
journalctl -u pacemaker -f

# Corosync log
tail -f /var/log/cluster/corosync.log

# Collect logs and configuration from all nodes into a report archive
pcs cluster report /tmp/cluster-report

# ==== Monitoring script ====
cat > /usr/local/bin/ha_monitor.sh << 'SCRIPT'
#!/bin/bash
# HA cluster monitoring script

check_cluster_health() {
    echo "=== HA cluster health check $(date) ==="

    # Online nodes: names inside "Online: [ ... ]" in the crm_mon summary
    ONLINE=$(crm_mon -1 | sed -n 's/.*Online: \[\(.*\)\].*/\1/p' | wc -w)
    # Configured nodes: crm_node -l prints one line per known node
    TOTAL=$(crm_node -l | wc -l)
    echo "Nodes: $ONLINE/$TOTAL online"

    if [ "$ONLINE" -lt "$((TOTAL / 2 + 1))" ]; then
        echo "[CRITICAL] quorum may be lost!"
    fi

    # Resource state
    FAILED=$(pcs status resources | grep -Eci 'stopped|failed')
    if [ "$FAILED" -gt 0 ]; then
        echo "[WARNING] $FAILED resource(s) in an abnormal state:"
        pcs status resources | grep -Ei 'stopped|failed'
    else
        echo "[OK] all resources are running"
    fi

    # STONITH (the property defaults to true when it is not set)
    STONITH=$(crm_attribute --type crm_config --name stonith-enabled --query --quiet 2>/dev/null || echo true)
    if [ "$STONITH" != "true" ]; then
        echo "[WARNING] STONITH is not enabled"
    fi

    # Fail counts: crm_mon -f prints a "fail-count=N" entry per failed resource
    FAILCOUNT=$(crm_mon -1 -f | grep -c 'fail-count=')
    if [ "$FAILCOUNT" -gt 0 ]; then
        echo "[WARNING] resource failures recorded:"
        pcs resource failcount show
    fi

    echo "---"
}

check_cluster_health
SCRIPT
chmod +x /usr/local/bin/ha_monitor.sh

# Add to cron (every 5 minutes) without overwriting existing crontab entries
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/ha_monitor.sh >> /var/log/ha_monitor.log 2>&1") | crontab -
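Health checks like the one above depend on parsing status text, which is easy to get wrong. The sketch below isolates the node-count parsing so it can be tested against a captured sample such as the `Online: [ node1 node2 node3 ]` line shown in section 2.2. `count_online` is a hypothetical helper, and it assumes that bracketed format; lines for remote or guest nodes that end in "Online:" would be counted as well.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read cluster status text on stdin and print how many
# node names are listed inside "Online: [ ... ]".
count_online() {
    local n
    n=$(sed -n 's/.*Online: \[\(.*\)\].*/\1/p' | wc -w)
    # Arithmetic expansion strips the padding some wc implementations add
    echo $(( n ))
}

# Usage against a live cluster (assumes crm_mon is installed):
#   crm_mon -1 | count_online
```

Keeping the parser separate from the command that produces the text makes it possible to check the script on a workstation before it is deployed to a cluster node.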

9. Production Deployment Checklist

# ==== Production deployment checklist ====

echo "=== 1. Base environment ==="
# Time synchronization
chronyc tracking | grep "Last offset"
# Host name resolution
ping -c 1 node1 && ping -c 1 node2 && ping -c 1 node3

echo "=== 2. Cluster communication ==="
# Corosync ports
ss -ulnp | grep corosync
# Link status between nodes
corosync-cfgtool -s

echo "=== 3. Quorum ==="
pcs quorum status

echo "=== 4. STONITH ==="
pcs stonith show
pcs property show stonith-enabled

echo "=== 5. Resources ==="
pcs status resources
pcs constraint list

echo "=== 6. Configuration backup ==="
# Back up the cluster configuration, including the CIB (Cluster Information Base)
pcs config backup ha_backup_$(date +%Y%m%d).tar.bz2

# Back up the Corosync configuration
cp /etc/corosync/corosync.conf /etc/corosync/corosync.conf.bak

echo "=== 7. Baseline ==="
# Record the resource distribution in the normal state
pcs status > /var/log/ha_baseline_$(date +%Y%m%d).log

echo "=== 8. Failover test ==="
# Perform during a maintenance window:
# 1. Move the resources manually
# pcs resource move web-group node2
# 2. Verify that the service is reachable
# curl -I http://192.168.1.100
# 3. Drain a node (standby moves its resources away; it does not simulate a crash)
# pcs node standby node1
# 4. Restore
# pcs node unstandby node1
# pcs resource clear web-group
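One way to put a number on failover time during the test above is to poll the service until it answers again. The helper below is a sketch, not part of pcs: `measure_recovery` is a hypothetical name, it measures from the moment it is started, and the one-second polling interval limits its resolution.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a check command once per second until it succeeds.
# Prints the elapsed seconds on success, or "timeout" (exit status 1) when the
# check still fails after <timeout> seconds.
# Usage: measure_recovery <timeout_seconds> <check command...>
measure_recovery() {
    local timeout=$1
    shift
    local start=$SECONDS
    until "$@" >/dev/null 2>&1; do
        if (( SECONDS - start >= timeout )); then
            echo "timeout"
            return 1
        fi
        sleep 1
    done
    echo $(( SECONDS - start ))
}
```

Started right after the resource move or node standby, for example as `measure_recovery 120 curl -fsS -o /dev/null http://192.168.1.100/`, it gives an approximate recovery time that can be compared with the RTO target.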

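Advantages

High availability: automatic failover with little or no visible impact on the business
Standardized: Corosync + Pacemaker is the de facto standard for Linux HA
Flexible orchestration: supports complex resource dependencies and constraints
Scalable: clusters range from 2 nodes to dozens of nodes
Mature ecosystem: resource agents exist for mainstream databases, web servers, storage and more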

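Disadvantages

High complexity: configuration and tuning require a solid understanding of how clusters work
Hard to test: failure scenarios are difficult to simulate completely
Network dependency: a heartbeat network failure can lead to split-brain
Resource overhead: the cluster software itself consumes some resources
Learning curve: troubleshooting requires knowledge across Corosync, Pacemaker and the resource layer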

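Performance Considerations

Heartbeat timeout: tune token to the network latency (3 to 10 seconds suggested)
Monitor interval: do not monitor resources too frequently (10 to 30 seconds suggested)
Resource stickiness: set stickiness sensibly to avoid unnecessary moves
Quorum latency: QDevice network latency should stay below half of the token value
Log level: use the notice level in production; debug logging hurts performance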

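Summary

Key points for Linux HA clusters:

Architecture: Corosync + Pacemaker is the mainstream stack, and PCS simplifies its management
STONITH is mandatory: production clusters must enable node fencing to prevent split-brain
Quorum: use an odd number of nodes or a QDevice so that quorum stays available
Resource orchestration: use groups and constraints to manage resource dependencies
Continuous monitoring: build thorough monitoring and alerting
Regular drills: rehearse failover regularly to confirm that it works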

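Key Concepts

Concept	Description
Corosync	Cluster communication layer; handles messaging between nodes and membership
Pacemaker	Cluster resource manager; handles resource scheduling and failure recovery
PCS	Pacemaker/Corosync Configuration System; the command-line management tool
STONITH	Node fencing mechanism; ensures a failed node has really stopped
Quorum	Voting mechanism that prevents split-brain; requires agreement of a majority of nodes
CIB	Cluster Information Base; stores the entire cluster configuration
Constraints	Rules that control where resources run, in which order, and with which other resources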

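Common Misconceptions

Misconception: a two-node cluster is enough
- Two nodes cannot form a majority on their own, so the risk of split-brain is high
- Fix: use 3 nodes or configure a QDevice
Misconception: turning STONITH off saves trouble
- Disabling STONITH in production is extremely dangerous
- Fix: always configure a suitable fencing mechanism
Misconception: a cluster solves every failure
- Application bugs, data corruption and similar problems cannot be solved by a cluster
- Fix: treat the cluster as infrastructure-level protection; applications still need a robust design
Misconception: once configured, it can be left alone
- A cluster needs continuous monitoring, regular maintenance and drills
- Fix: set up monitoring, alerting and a regular drill schedule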

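Learning Path

Beginner: understand HA concepts and build a basic 2-node cluster
Intermediate: configure STONITH, quorum and resource constraints, and deploy real services
Advanced: multi-site clusters, monitoring of containerized clusters, operations automation
Expert: custom resource agents, multi-cluster federation, disaster recovery switchover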

Use Cases

Web service HA (Nginx/Apache)
Database HA (MySQL/PostgreSQL/MongoDB)
File sharing HA (NFS/CIFS)
Load balancer HA (HAProxy/Keepalived)
Virtualization platform HA (KVM/Libvirt)
Infrastructure protection for business-critical systems

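Practical Recommendations

Start with 3 nodes: this avoids the quorum problems of two-node clusters
Dedicated heartbeat network: configure a separate heartbeat network for better reliability
STONITH first: make sure fencing works before deploying services
Back up the CIB regularly: back up the cluster configuration after every change
Verify by rehearsal: complete drills for node failure, network partition and similar scenarios before going live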

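Troubleshooting Checklist

Problem	Likely cause	Resolution
Node offline	Corosync communication interrupted	Check the network, the firewall and the Corosync configuration
Resource fails to start	Configuration error or missing dependency	Review pcs status --full and check the logs
Split-brain	Quorum lost	Check the QDevice; use an odd number of nodes
No automatic failover	Stickiness too high or blocked by constraints	Adjust resource-stickiness and review the constraints
Fencing fails	STONITH misconfigured	Verify connectivity to the fence device and check its credentials
VIP unreachable	ARP cache not refreshed or network misconfigured	Run arping manually and check the network interface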

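Review Questions

What was the RTO (recovery time) of the last failover? Did it meet the SLA?
When was the most recent failover drill, and what problems did it reveal?
Has the cluster's STONITH configuration been tested for real?
Are resource fail counts cleared regularly? Is there a trend of recurring failures?
Is the quorum device working properly? Is its latency within an acceptable range?
Is the CIB backup strategy effective? Has the restore procedure been verified?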
