Database Monitoring and Alerting
Introduction
A database monitoring and alerting system is the last line of defense for keeping a database running reliably. An undetected replication lag can serve stale data to users; an uncaught slow query can drag down the entire system at peak load. A well-built monitoring system surfaces problems and raises alerts before they affect users.
Database monitoring boils down to answering four questions: Is the system healthy? (connections, QPS, TPS) Is it fast enough? (slow queries, response time) Is it safe? (replication state, backup integrity) Is there enough capacity? (disk space, connection pool, capacity planning)
A modern monitoring stack typically combines three components: Prometheus (metric collection and storage), Grafana (dashboards), and Alertmanager (alert routing), plus dedicated collectors such as mysqld_exporter and node_exporter, covering everything from the operating system up to the database engine.
Features
Key Monitoring Metrics
MySQL Core Metrics
-- 1. QPS (Queries Per Second)
SHOW GLOBAL STATUS LIKE 'Queries';
-- QPS = delta(Queries) / delta(time)
-- 2. TPS (Transactions Per Second)
SHOW GLOBAL STATUS LIKE 'Com_commit';
SHOW GLOBAL STATUS LIKE 'Com_rollback';
-- TPS = (delta(Com_commit) + delta(Com_rollback)) / delta(time)
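The two delta formulas above can be sketched in a few lines of Python. The counter values below are made-up samples ten seconds apart; in practice they would come from two successive SHOW GLOBAL STATUS reads:

```python
def counter_rates(prev: dict, curr: dict, elapsed_sec: float) -> dict:
    """Compute per-second rates from two snapshots of cumulative counters."""
    return {name: (curr[name] - prev[name]) / elapsed_sec for name in prev}

# Hypothetical SHOW GLOBAL STATUS samples taken 10 seconds apart
t0 = {"Queries": 1_000_000, "Com_commit": 50_000, "Com_rollback": 500}
t1 = {"Queries": 1_030_000, "Com_commit": 51_200, "Com_rollback": 520}

rates = counter_rates(t0, t1, elapsed_sec=10.0)
qps = rates["Queries"]                               # delta(Queries) / delta(time)
tps = rates["Com_commit"] + rates["Com_rollback"]    # commits + rollbacks per second
print(f"QPS={qps:.0f}, TPS={tps:.0f}")               # QPS=3000, TPS=122
```

Because the counters are cumulative since server start, only deltas between samples are meaningful; this is exactly what Prometheus's `rate()` function does over the scraped series.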
-- 3. Connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';     -- current connections
SHOW GLOBAL STATUS LIKE 'Max_used_connections';  -- historical peak
SHOW GLOBAL STATUS LIKE 'Threads_running';       -- threads actively executing
SHOW VARIABLES LIKE 'max_connections';           -- configured limit
-- Alert when: Threads_connected > max_connections * 0.8
-- 4. Slow queries
SHOW GLOBAL STATUS LIKE 'Slow_queries';          -- cumulative slow query count
SHOW VARIABLES LIKE 'long_query_time';           -- slow query threshold (seconds)
SHOW VARIABLES LIKE 'slow_query_log';            -- is the slow query log enabled?
-- Enable the slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                     -- log anything over 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';    -- also log queries that use no index
-- 5. InnoDB key metrics
SHOW GLOBAL STATUS LIKE 'Innodb_rows_read';       -- rows read
SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted';   -- rows inserted
SHOW GLOBAL STATUS LIKE 'Innodb_rows_updated';    -- rows updated
SHOW GLOBAL STATUS LIKE 'Innodb_rows_deleted';    -- rows deleted
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';  -- row lock waits
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';   -- total row lock wait time (ms)
-- Note: the Innodb_deadlocks counter exists in MariaDB/Percona builds only;
-- stock MySQL has no such status variable (use innodb_print_all_deadlocks there)
SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';
-- 6. Buffer pool hit rate
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical reads
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that had to hit disk
-- Hit rate = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
-- Alert when: hit rate < 0.95, which suggests the buffer pool is undersized
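The hit-rate formula above as a small helper; the counter values are illustrative, not measured:

```python
def buffer_pool_hit_rate(read_requests: int, disk_reads: int) -> float:
    """Hit rate = 1 - disk reads / logical reads, guarding against division by zero."""
    if read_requests == 0:
        return 1.0  # no traffic yet; treat as a perfect hit rate
    return 1.0 - disk_reads / read_requests

# Hypothetical values from SHOW GLOBAL STATUS
rate = buffer_pool_hit_rate(read_requests=2_000_000, disk_reads=40_000)
print(f"hit rate = {rate:.3f}")   # 0.980
if rate < 0.95:
    print("ALERT: buffer pool is likely undersized")
```

As with QPS, production checks should compute this over counter deltas in a recent window rather than over the totals since server start, otherwise an old incident can mask a current one.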
-- 7. Replication lag (run on the replica)
SHOW SLAVE STATUS\G
-- (SHOW REPLICA STATUS on MySQL 8.0.22+)
-- Key fields:
-- Seconds_Behind_Master: lag in seconds
-- Slave_IO_Running: IO thread state
-- Slave_SQL_Running: SQL thread state
-- Last_Error: most recent error
-- Alert when: Seconds_Behind_Master > 30
-- 8. Disk usage
SELECT
table_schema AS database_name,
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb,
ROUND(SUM(data_length) / 1024 / 1024, 2) AS data_mb,
ROUND(SUM(index_length) / 1024 / 1024, 2) AS index_mb,
COUNT(*) AS table_count
FROM information_schema.tables
GROUP BY table_schema
ORDER BY size_mb DESC;
-- 9. Currently active transactions and their locks
SELECT
t.trx_id,
t.trx_state,
t.trx_started,
TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS duration_sec,
t.trx_query,
t.trx_rows_locked,
t.trx_tables_locked
FROM information_schema.innodb_trx t
ORDER BY t.trx_started ASC;
-- 10. Lock waits with the waiting and blocking queries (MySQL 8.0, sys schema;
-- the raw performance_schema.data_lock_waits table has no query columns)
SELECT
wait_started,
locked_table,
locked_type,
waiting_pid,
waiting_query,
blocking_pid,
blocking_query
FROM sys.innodb_lock_waits;
Prometheus Exporter Configuration
Deploying mysqld_exporter
# docker-compose.yml: Prometheus + Grafana + exporters
version: '3.8'
services:
  # MySQL
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: root123
    ports:
      - "3306:3306"
    volumes:
      - mysql_data:/var/lib/mysql
      - ./mysql.cnf:/etc/mysql/conf.d/mysql.cnf

  # mysqld_exporter
  # Pinned to 0.14: the DATA_SOURCE_NAME env var was dropped in 0.15,
  # which takes a .my.cnf-style config file instead
  mysqld_exporter:
    image: prom/mysqld-exporter:v0.14.0
    environment:
      DATA_SOURCE_NAME: "root:root123@(mysql:3306)/"
    ports:
      - "9104:9104"
    depends_on:
      - mysql

  # node_exporter (OS-level metrics)
  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123
    volumes:
      - grafana_data:/var/lib/grafana

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  mysql_data:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rule files
rule_files:
  - "alerts/*.yml"

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape targets
scrape_configs:
  # MySQL
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld_exporter:9104']
        labels:
          instance: 'mysql-master'
  # Operating system
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: 'db-server-01'
  # PostgreSQL (if used)
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres_exporter:9187']
        labels:
          instance: 'pg-master'
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Alert Rule Configuration
# alerts/database_alerts.yml
groups:
  - name: mysql_alerts
    rules:
      # Too many connections
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected > mysql_global_variables_max_connections * 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MySQL connection count is high"
          description: "Instance {{ $labels.instance }} is above 80% of max_connections"
      # Slow queries increasing
      - alert: MySQLSlowQueriesIncreasing
        expr: rate(mysql_global_status_slow_queries[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL slow queries increasing"
          description: "Instance {{ $labels.instance }} is logging more than 10 slow queries per second"
      # Replication lag
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication lag"
          description: "Instance {{ $labels.instance }} is {{ $value }} seconds behind"
      # Replication stopped
      - alert: MySQLReplicationStopped
        expr: mysql_slave_status_slave_io_running == 0 or mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication stopped"
          description: "The IO or SQL thread on {{ $labels.instance }} has stopped"
      # Low InnoDB buffer pool hit rate
      - alert: MySQLLowBufferPoolHitRate
        expr: rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) / (rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) + rate(mysql_global_status_innodb_buffer_pool_reads[5m])) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "InnoDB buffer pool hit rate below 95%"
      # Deadlocks (this metric is only exposed by MariaDB/Percona builds;
      # stock MySQL has no deadlock counter)
      - alert: MySQLDeadlocks
        expr: increase(mysql_global_status_innodb_deadlocks[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "MySQL deadlocks detected"
          description: "Instance {{ $labels.instance }} hit {{ $value }} deadlocks in the last 5 minutes"

  - name: system_alerts
    rules:
      # Disk usage
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Server {{ $labels.instance }} has less than 20% free on the root partition"
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critically low disk space"
          description: "Server {{ $labels.instance }} has less than 10% free on the root partition"
      # CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
      # Memory usage
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
      # Disk IO latency
      - alert: HighDiskIOLatency
        expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk IO latency"
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/alert'
  - name: 'critical'
    # DingTalk / WeCom / email notifications
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/critical'
        send_resolved: true
    email_configs:
      - to: 'dba-team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
  - name: 'warning'
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/warning'
Slow Query Analysis
-- Slow query analysis, end to end
-- 1. Locate the slow query log
SHOW VARIABLES LIKE 'slow_query_log_file';
-- 2. Analyze it with mysqldumpslow
-- mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log
-- -s t: sort by query time
-- -t 10: show the top 10 entries
-- -s c: sort by execution count
-- -s l: sort by lock time
-- -s r: sort by rows returned
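Where mysqldumpslow is not installed, the slow log's `# Query_time:` header lines are simple enough to aggregate by hand. A rough sketch follows; the template normalization is far cruder than mysqldumpslow's abstraction, and the sample log text is invented:

```python
import re
from collections import defaultdict

# Matches the per-query header line in the MySQL slow log
HEADER = re.compile(r"# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+"
                    r"Rows_sent: (\d+)\s+Rows_examined: (\d+)")

def summarize_slow_log(text: str):
    """Group statements by a crude template and sum their query time,
    roughly what `mysqldumpslow -s t` produces."""
    totals = defaultdict(lambda: [0, 0.0])  # template -> [count, total_time]
    pending_time = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            pending_time = float(m.group(1))
        elif pending_time is not None and line.strip() \
                and not line.startswith(("#", "SET timestamp")):
            template = re.sub(r"\d+", "N", line.strip())  # replace literals with N
            totals[template][0] += 1
            totals[template][1] += pending_time
            pending_time = None
    # Sort by total query time, descending
    return sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)

sample = """\
# Query_time: 2.5  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 500000
SELECT * FROM orders WHERE user_id = 42;
# Query_time: 1.5  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 400000
SELECT * FROM orders WHERE user_id = 7;
"""
for template, (count, total_time) in summarize_slow_log(sample):
    print(f"count={count} total={total_time:.1f}s  {template}")
```

Both sample entries collapse to the same `user_id = N` template, so the report shows one statement executed twice with 4.0 s of total query time.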
-- 3. Analyze via performance_schema (MySQL 5.7+)
-- Enable statement instrumentation
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%statements%';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE '%statement/%';
-- Top 10 statements by total time
SELECT
DIGEST_TEXT AS sql_template,
COUNT_STAR AS exec_count,
ROUND(SUM_TIMER_WAIT / 1000000000000, 3) AS total_time_sec,
ROUND(AVG_TIMER_WAIT / 1000000000, 3) AS avg_time_ms,
ROUND(MAX_TIMER_WAIT / 1000000000, 3) AS max_time_ms,
SUM_ROWS_EXAMINED AS total_rows_examined,
SUM_ROWS_SENT AS total_rows_sent,
ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
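The magic divisors in the query above exist because performance_schema TIMER_WAIT columns count picoseconds: 10^12 ps per second and 10^9 ps per millisecond. A quick conversion helper, with invented row values:

```python
PS_PER_SEC = 10**12   # performance_schema timers are in picoseconds
PS_PER_MS = 10**9

def digest_row_to_readable(count_star: int, sum_timer_wait: int,
                           avg_timer_wait: int) -> dict:
    """Convert raw picosecond timer values to seconds/milliseconds,
    mirroring the ROUND(...) expressions in the SQL above."""
    return {
        "exec_count": count_star,
        "total_time_sec": round(sum_timer_wait / PS_PER_SEC, 3),
        "avg_time_ms": round(avg_timer_wait / PS_PER_MS, 3),
    }

# Hypothetical digest row: 1200 executions, 36 s total, 30 ms average
row = digest_row_to_readable(count_star=1200,
                             sum_timer_wait=36 * 10**12,
                             avg_timer_wait=30 * 10**9)
print(row)
```

Getting the unit conversion wrong by a factor of 1000 is a common mistake when building dashboards on top of these tables, so it is worth sanity-checking against a known-slow query.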
-- Full table scan suspects (rows examined far exceeds executions)
SELECT
DIGEST_TEXT AS sql_template,
COUNT_STAR AS exec_count,
SUM_ROWS_EXAMINED AS rows_examined,
ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_EXAMINED > COUNT_STAR * 1000
ORDER BY avg_rows_examined DESC
LIMIT 20;
-- 4. EXPLAIN a single slow query (EXPLAIN ANALYZE requires MySQL 8.0.18+)
EXPLAIN ANALYZE
SELECT o.*, u.username
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at >= '2024-01-01'
AND o.status = 1
ORDER BY o.created_at DESC
LIMIT 20;
-- 5. Queries that ran without using an index
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_NO_INDEX_USED
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_NO_INDEX_USED > 0
ORDER BY SUM_NO_INDEX_USED DESC
LIMIT 20;
Deadlock Monitoring
-- Deadlock monitoring and analysis
-- 1. Inspect the most recent deadlock
SHOW ENGINE INNODB STATUS\G
-- see the LATEST DETECTED DEADLOCK section
-- 2. Log every deadlock
SET GLOBAL innodb_print_all_deadlocks = ON;
-- deadlocks are then written to the error log
-- 3. Current lock waits (MySQL 5.7; this table was removed in 8.0,
-- use sys.innodb_lock_waits there instead)
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
b.trx_started AS blocking_started
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- 4. Kill the blocking transaction
-- KILL <blocking_thread>;
-- 5. Deadlock statistics: stock MySQL exposes no deadlock wait event in
-- performance_schema, so the query below may return nothing; rely on
-- innodb_print_all_deadlocks plus error log parsing instead
SELECT
EVENT_NAME,
COUNT_STAR AS deadlock_count
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE '%deadlock%';
-- 6. Lock wait timeout
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'; -- default: 50 seconds
-- 7. Common deadlock scenarios and fixes
/*
Scenario 1: cross-order updates
Txn A: UPDATE orders SET status=2 WHERE id=1; -> UPDATE orders SET status=2 WHERE id=2;
Txn B: UPDATE orders SET status=2 WHERE id=2; -> UPDATE orders SET status=2 WHERE id=1;
Fix: always update rows in a consistent order (e.g. ascending id)
Scenario 2: duplicate-key inserts
Txn A: INSERT INTO orders (id,...) VALUES (100,...); -- takes a lock
Txn B: INSERT INTO orders (id,...) VALUES (100,...); -- waits
Fix: use INSERT ... ON DUPLICATE KEY UPDATE, or catch the error and retry
Scenario 3: long transaction holding locks
Txn A: BEGIN; UPDATE ... ; -- never commits
Txn B: UPDATE on the same row; -- waits until timeout
Fix: keep transactions short; never make external calls inside a transaction
*/
Capacity Planning
-- Database capacity planning
-- 1. Growth by schema
SELECT
table_schema,
ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb,
ROUND(SUM(data_length) / 1024 / 1024 / 1024, 2) AS data_gb,
ROUND(SUM(index_length) / 1024 / 1024 / 1024, 2) AS index_gb,
SUM(table_rows) AS estimated_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY table_schema
ORDER BY size_gb DESC;
-- 2. Per-table size breakdown
SELECT
table_name,
table_rows,
ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb,
ROUND(index_length / (data_length + index_length) * 100, 1) AS index_ratio_pct
FROM information_schema.tables
WHERE table_schema = 'your_database'
ORDER BY (data_length + index_length) DESC;
-- 3. Disk space forecast
-- daily growth = current size / days running
-- days of headroom = (total disk space - current size) / daily growth
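The two forecast formulas above, in code form. The sizes and day counts are hypothetical, and this is a purely linear extrapolation; real growth is rarely linear, so recompute regularly from recent history:

```python
def capacity_forecast(disk_total_gb: float, current_gb: float, days_running: int):
    """Daily growth = current size / days running;
    days of headroom = (total disk - current size) / daily growth."""
    daily_growth_gb = current_gb / days_running
    remaining_days = (disk_total_gb - current_gb) / daily_growth_gb
    return daily_growth_gb, remaining_days

# Hypothetical: a 500 GB disk, database grew to 200 GB over 400 days
daily, remaining = capacity_forecast(disk_total_gb=500, current_gb=200,
                                     days_running=400)
print(f"growth {daily:.2f} GB/day, ~{remaining:.0f} days until the disk fills")
```

Feeding the by-schema sizes from query 1 into this function per database gives a rough per-tenant expansion schedule.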
-- 4. InnoDB buffer pool coverage
SELECT
ROUND(@@innodb_buffer_pool_size / 1024 / 1024 / 1024, 2) AS buffer_pool_gb,
ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_data_gb,
ROUND(@@innodb_buffer_pool_size / (SUM(data_length + index_length)) * 100, 1) AS coverage_pct
FROM information_schema.tables
WHERE engine = 'InnoDB';
-- 5. Connection pool sizing
SELECT
@@max_connections AS max_connections,
(SELECT variable_value FROM performance_schema.global_status
WHERE variable_name = 'Threads_connected') AS current_connections,
(SELECT variable_value FROM performance_schema.global_status
WHERE variable_name = 'Max_used_connections') AS peak_connections;
-- rule of thumb: max_connections = peak_connections * 1.5
Automated Remediation
"""数据库自动化修复脚本"""
import pymysql
import time
import smtplib
from datetime import datetime
class DatabaseAutoRemediation:
"""数据库问题自动修复"""
def __init__(self, db_config: dict):
self.config = db_config
self.actions_log = []
def get_connection(self):
return pymysql.connect(**self.config)
def check_and_kill_long_queries(self, max_duration_sec: int = 300):
"""自动 Kill 超过指定时间的查询"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE command = 'Query'
AND time > %s
AND info NOT LIKE '%%SHOW%%'
ORDER BY time DESC
""", (max_duration_sec,))
long_queries = cursor.fetchall()
for query in long_queries:
thread_id, user, host, db, cmd, duration, state, info = query
self._log_action(
f"发现长查询: thread={thread_id}, duration={duration}s, "
f"query={info[:100] if info else 'N/A'}"
)
# Kill 查询
if duration > max_duration_sec * 2:
cursor.execute(f"KILL {thread_id}")
self._log_action(f"已 Kill 线程 {thread_id}")
finally:
conn.close()
def check_replication_health(self):
"""检查复制健康状态"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("SHOW SLAVE STATUS")
result = cursor.fetchone()
if not result:
return # 不是从库
columns = [desc[0] for desc in cursor.description]
status = dict(zip(columns, result))
io_running = status.get('Slave_IO_Running') == 'Yes'
sql_running = status.get('Slave_SQL_Running') == 'Yes'
lag = int(status.get('Seconds_Behind_Master', 0) or 0)
if not io_running or not sql_running:
self._log_action(
f"复制中断! IO={io_running}, SQL={sql_running}, "
f"Error={status.get('Last_Error', 'N/A')}"
)
# 尝试自动恢复
if lag < 3600: # 延迟不超过1小时才尝试恢复
cursor.execute("STOP SLAVE")
cursor.execute("START SLAVE")
self._log_action("已尝试自动恢复复制")
elif lag > 60:
self._log_action(f"复制延迟: {lag} 秒")
finally:
conn.close()
def check_connection_usage(self):
"""检查连接使用率"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
current = int(cursor.fetchone()[1])
cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
max_conn = int(cursor.fetchone()[1])
usage = current / max_conn
if usage > 0.8:
self._log_action(
f"连接数告警: {current}/{max_conn} ({usage:.1%})"
)
# 查找空闲连接
cursor.execute("""
SELECT user, host, db, time, state
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 300
ORDER BY time DESC
""")
idle_conns = cursor.fetchall()
if idle_conns:
self._log_action(
f"发现 {len(idle_conns)} 个空闲超过 5 分钟的连接"
)
finally:
conn.close()
def _log_action(self, message: str):
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
log_entry = f"[{timestamp}] {message}"
self.actions_log.append(log_entry)
print(log_entry)
def run_all_checks(self):
"""执行所有检查"""
self._log_action("=== 开始数据库健康检查 ===")
self.check_and_kill_long_queries(max_duration_sec=300)
self.check_replication_health()
self.check_connection_usage()
self._log_action("=== 检查完成 ===")
return self.actions_log
# 使用示例
config = {
"host": "localhost",
"port": 3306,
"user": "monitor",
"password": "monitor_pass",
}
remediation = DatabaseAutoRemediation(config)
logs = remediation.run_all_checks()优点
Performance Considerations
- Scrape interval: 15-30 seconds for MySQL metrics; do not go below 10 seconds
- Exporter privileges: create a dedicated read-only monitoring user; never use root
- Prometheus storage: 30 days of retention costs roughly 1-2 GB per target; consider Thanos for long-term storage
- Grafana panels: keep a single panel to no more than about 20 series
- Alert suppression: use group_by and inhibit_rules to prevent alert storms
- Historical data: archive slow query logs and performance data regularly so they cannot fill the disk
Summary
A solid monitoring system is the foundation of a stable production database. The core is choosing the right metrics, setting sensible alert thresholds, and establishing a clear response process. The Prometheus + Grafana + Alertmanager stack covers the full pipeline from collection through visualization to alerting, and automated remediation scripts can take much of the routine inspection work off the DBA's plate.
Key Takeaways
- MySQL core metrics: QPS/TPS, connections, slow queries, buffer pool hit rate, replication lag
- Prometheus architecture: pull-based scraping, TSDB storage, PromQL queries
- Alert severity levels: critical (act immediately) / warning (watch) / info (record)
- Slow query analysis: mysqldumpslow, performance_schema, EXPLAIN ANALYZE
- Deadlock debugging: SHOW ENGINE INNODB STATUS, innodb_print_all_deadlocks
- Capacity planning: forecast disk, connection, and buffer pool needs from growth trends
- Automated remediation: kill long queries, restart replication, clean up idle connections
Common Pitfalls
- Monitoring only the database: ignoring OS metrics (CPU, disk IO, network) even though many database problems originate at the OS layer
- Too many alerts: alerting on every metric causes alert fatigue, and the alerts that matter get missed
- No baseline: without a performance baseline there is no way to tell whether a metric is abnormal
- Ignoring replication lag: letting lag run into minutes risks serving inconsistent data
- Manual inspection: relying on periodic human checks instead of automated monitoring
- No monitoring redundancy: a single-point monitoring system fails silently, and nobody notices
Learning Path
- Beginner: deploy Prometheus + Grafana, build basic dashboards and alerts
- Intermediate: custom business metrics, slow query analysis, capacity planning
- Advanced: automated remediation, anomaly detection (machine learning), intelligent alert aggregation
- Expert: company-wide observability platform, AIOps, predictive maintenance
When to Use
- Every production database environment
- Performance assurance for high-concurrency workloads
- Unified monitoring of many database instances
- Before/after performance comparison during migrations
- Unattended 24/7 automated operations
Rollout Plan
- Step 1: deploy Prometheus + Grafana + mysqld_exporter
- Step 2: import the widely used MySQL Grafana dashboard template (ID 7362)
- Step 3: configure the core alert rules and hook up notification channels
- Step 4: establish a performance baseline and record normal metric ranges
- Step 5: add automated remediation scripts to reduce manual intervention
- Ongoing: review alert effectiveness regularly; tune thresholds and rules
Retrospective Questions
- How many database alerts fired last month? How many were real problems?
- What is the most frequent slow query? Has it been optimized?
- How long will current capacity last? When is the next expansion due?
- How many times did replication lag exceed 30 seconds? What was the root cause?
- Is the monitoring system itself reliable? Are there blind spots it does not cover?
- Are alert response times within the SLA?
