Database Monitoring and Alerting
Introduction
A database monitoring and alerting system is the last line of defense for keeping a database running reliably. An undetected replication lag can serve stale data to users; an uncaught slow query can drag down the entire system at peak load. A well-built monitoring system surfaces problems and raises alerts before they affect users.
Database monitoring boils down to answering four questions: Is the system healthy? (connections, QPS, TPS) Is it fast enough? (slow queries, response time) Is it safe? (replication state, backup integrity) Is there enough capacity? (disk space, connection pool, capacity planning)
A modern monitoring stack typically combines three components: Prometheus (metric collection and storage), Grafana (dashboards), and Alertmanager (alert routing), plus dedicated collectors such as mysqld_exporter and node_exporter, covering everything from the operating system up to the database engine.
Features
Key Monitoring Metrics
MySQL Core Metrics
-- 1. QPS (Queries Per Second)
SHOW GLOBAL STATUS LIKE 'Queries';
-- QPS = delta(Queries) / delta(time)
-- 2. TPS (Transactions Per Second)
SHOW GLOBAL STATUS LIKE 'Com_commit';
SHOW GLOBAL STATUS LIKE 'Com_rollback';
-- TPS = (delta(Com_commit) + delta(Com_rollback)) / delta(time)
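The two delta formulas above can be sketched in a few lines of Python. The counter values below are made-up samples ten seconds apart; in practice they would come from two successive SHOW GLOBAL STATUS reads:

```python
def counter_rates(prev: dict, curr: dict, elapsed_sec: float) -> dict:
    """Compute per-second rates from two snapshots of cumulative counters."""
    return {name: (curr[name] - prev[name]) / elapsed_sec for name in prev}

# Hypothetical SHOW GLOBAL STATUS samples taken 10 seconds apart
t0 = {"Queries": 1_000_000, "Com_commit": 50_000, "Com_rollback": 500}
t1 = {"Queries": 1_030_000, "Com_commit": 51_200, "Com_rollback": 520}

rates = counter_rates(t0, t1, elapsed_sec=10.0)
qps = rates["Queries"]                               # delta(Queries) / delta(time)
tps = rates["Com_commit"] + rates["Com_rollback"]    # commits + rollbacks per second
print(f"QPS={qps:.0f}, TPS={tps:.0f}")               # QPS=3000, TPS=122
```

Because the counters are cumulative since server start, only deltas between samples are meaningful; this is exactly what Prometheus's `rate()` function does over the scraped series.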
-- 3. Connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';     -- current connections
SHOW GLOBAL STATUS LIKE 'Max_used_connections';  -- historical peak
SHOW GLOBAL STATUS LIKE 'Threads_running';       -- threads actively executing
SHOW VARIABLES LIKE 'max_connections';           -- configured limit
-- Alert when: Threads_connected > max_connections * 0.8
-- 4. Slow queries
SHOW GLOBAL STATUS LIKE 'Slow_queries';          -- cumulative slow query count
SHOW VARIABLES LIKE 'long_query_time';           -- slow query threshold (seconds)
SHOW VARIABLES LIKE 'slow_query_log';            -- is the slow query log enabled?
-- Enable the slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                     -- log anything over 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';    -- also log queries that use no index
-- 5. InnoDB key metrics
SHOW GLOBAL STATUS LIKE 'Innodb_rows_read';       -- rows read
SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted';   -- rows inserted
SHOW GLOBAL STATUS LIKE 'Innodb_rows_updated';    -- rows updated
SHOW GLOBAL STATUS LIKE 'Innodb_rows_deleted';    -- rows deleted
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';  -- row lock waits
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';   -- total row lock wait time (ms)
-- Note: the Innodb_deadlocks counter exists in MariaDB/Percona builds only;
-- stock MySQL has no such status variable (use innodb_print_all_deadlocks there)
SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';
-- 6. Buffer pool hit rate
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical reads
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that had to hit disk
-- Hit rate = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
-- Alert when: hit rate < 0.95, which suggests the buffer pool is undersized
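The hit-rate formula above as a small helper; the counter values are illustrative, not measured:

```python
def buffer_pool_hit_rate(read_requests: int, disk_reads: int) -> float:
    """Hit rate = 1 - disk reads / logical reads, guarding against division by zero."""
    if read_requests == 0:
        return 1.0  # no traffic yet; treat as a perfect hit rate
    return 1.0 - disk_reads / read_requests

# Hypothetical values from SHOW GLOBAL STATUS
rate = buffer_pool_hit_rate(read_requests=2_000_000, disk_reads=40_000)
print(f"hit rate = {rate:.3f}")   # 0.980
if rate < 0.95:
    print("ALERT: buffer pool is likely undersized")
```

As with QPS, production checks should compute this over counter deltas in a recent window rather than over the totals since server start, otherwise an old incident can mask a current one.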
-- 7. Replication lag (run on the replica)
SHOW SLAVE STATUS\G
-- (SHOW REPLICA STATUS on MySQL 8.0.22+)
-- Key fields:
-- Seconds_Behind_Master: lag in seconds
-- Slave_IO_Running: IO thread state
-- Slave_SQL_Running: SQL thread state
-- Last_Error: most recent error
-- Alert when: Seconds_Behind_Master > 30
-- 8. Disk usage
SELECT
table_schema AS database_name,
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb,
ROUND(SUM(data_length) / 1024 / 1024, 2) AS data_mb,
ROUND(SUM(index_length) / 1024 / 1024, 2) AS index_mb,
COUNT(*) AS table_count
FROM information_schema.tables
GROUP BY table_schema
ORDER BY size_mb DESC;
-- 9. Currently active transactions and their locks
SELECT
t.trx_id,
t.trx_state,
t.trx_started,
TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS duration_sec,
t.trx_query,
t.trx_rows_locked,
t.trx_tables_locked
FROM information_schema.innodb_trx t
ORDER BY t.trx_started ASC;
-- 10. Lock waits with the waiting and blocking queries (MySQL 8.0, sys schema;
-- the raw performance_schema.data_lock_waits table has no query columns)
SELECT
wait_started,
locked_table,
locked_type,
waiting_pid,
waiting_query,
blocking_pid,
blocking_query
FROM sys.innodb_lock_waits;
Prometheus Exporter Configuration
Deploying mysqld_exporter
# docker-compose.yml: Prometheus + Grafana + exporters
version: '3.8'
services:
  # MySQL
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: root123
    ports:
      - "3306:3306"
    volumes:
      - mysql_data:/var/lib/mysql
      - ./mysql.cnf:/etc/mysql/conf.d/mysql.cnf

  # mysqld_exporter
  # Pinned to 0.14: the DATA_SOURCE_NAME env var was dropped in 0.15,
  # which takes a .my.cnf-style config file instead
  mysqld_exporter:
    image: prom/mysqld-exporter:v0.14.0
    environment:
      DATA_SOURCE_NAME: "root:root123@(mysql:3306)/"
    ports:
      - "9104:9104"
    depends_on:
      - mysql

  # node_exporter (OS-level metrics)
  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123
    volumes:
      - grafana_data:/var/lib/grafana

  # Alertmanager
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  mysql_data:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rule files
rule_files:
  - "alerts/*.yml"

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape targets
scrape_configs:
  # MySQL
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld_exporter:9104']
        labels:
          instance: 'mysql-master'
  # Operating system
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: 'db-server-01'
  # PostgreSQL (if used)
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres_exporter:9187']
        labels:
          instance: 'pg-master'
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Alert Rule Configuration
# alerts/database_alerts.yml
groups:
  - name: mysql_alerts
    rules:
      # Too many connections
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected > mysql_global_variables_max_connections * 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MySQL connection count is high"
          description: "Instance {{ $labels.instance }} is above 80% of max_connections"
      # Slow queries increasing
      - alert: MySQLSlowQueriesIncreasing
        expr: rate(mysql_global_status_slow_queries[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL slow queries increasing"
          description: "Instance {{ $labels.instance }} is logging more than 10 slow queries per second"
      # Replication lag
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication lag"
          description: "Instance {{ $labels.instance }} is {{ $value }} seconds behind"
      # Replication stopped
      - alert: MySQLReplicationStopped
        expr: mysql_slave_status_slave_io_running == 0 or mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL replication stopped"
          description: "The IO or SQL thread on {{ $labels.instance }} has stopped"
      # Low InnoDB buffer pool hit rate
      - alert: MySQLLowBufferPoolHitRate
        expr: rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) / (rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]) + rate(mysql_global_status_innodb_buffer_pool_reads[5m])) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "InnoDB buffer pool hit rate below 95%"
      # Deadlocks (this metric is only exposed by MariaDB/Percona builds;
      # stock MySQL has no deadlock counter)
      - alert: MySQLDeadlocks
        expr: increase(mysql_global_status_innodb_deadlocks[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "MySQL deadlocks detected"
          description: "Instance {{ $labels.instance }} hit {{ $value }} deadlocks in the last 5 minutes"

  - name: system_alerts
    rules:
      # Disk usage
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Server {{ $labels.instance }} has less than 20% free on the root partition"
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critically low disk space"
          description: "Server {{ $labels.instance }} has less than 10% free on the root partition"
      # CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
      # Memory usage
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
      # Disk IO latency
      - alert: HighDiskIOLatency
        expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk IO latency"
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/alert'
  - name: 'critical'
    # DingTalk / WeCom / email notifications
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/critical'
        send_resolved: true
    email_configs:
      - to: 'dba-team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
  - name: 'warning'
    webhook_configs:
      - url: 'http://alertmanager-webhook:8080/warning'
Slow Query Analysis
-- Slow query analysis, end to end
-- 1. Locate the slow query log
SHOW VARIABLES LIKE 'slow_query_log_file';
-- 2. Analyze it with mysqldumpslow
-- mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log
-- -s t: sort by query time
-- -t 10: show the top 10 entries
-- -s c: sort by execution count
-- -s l: sort by lock time
-- -s r: sort by rows returned
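Where mysqldumpslow is not installed, the slow log's `# Query_time:` header lines are simple enough to aggregate by hand. A rough sketch follows; the template normalization is far cruder than mysqldumpslow's abstraction, and the sample log text is invented:

```python
import re
from collections import defaultdict

# Matches the per-query header line in the MySQL slow log
HEADER = re.compile(r"# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+"
                    r"Rows_sent: (\d+)\s+Rows_examined: (\d+)")

def summarize_slow_log(text: str):
    """Group statements by a crude template and sum their query time,
    roughly what `mysqldumpslow -s t` produces."""
    totals = defaultdict(lambda: [0, 0.0])  # template -> [count, total_time]
    pending_time = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            pending_time = float(m.group(1))
        elif pending_time is not None and line.strip() \
                and not line.startswith(("#", "SET timestamp")):
            template = re.sub(r"\d+", "N", line.strip())  # replace literals with N
            totals[template][0] += 1
            totals[template][1] += pending_time
            pending_time = None
    # Sort by total query time, descending
    return sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)

sample = """\
# Query_time: 2.5  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 500000
SELECT * FROM orders WHERE user_id = 42;
# Query_time: 1.5  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 400000
SELECT * FROM orders WHERE user_id = 7;
"""
for template, (count, total_time) in summarize_slow_log(sample):
    print(f"count={count} total={total_time:.1f}s  {template}")
```

Both sample entries collapse to the same `user_id = N` template, so the report shows one statement executed twice with 4.0 s of total query time.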
-- 3. Analyze via performance_schema (MySQL 5.7+)
-- Enable statement instrumentation
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%statements%';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE '%statement/%';
-- Top 10 statements by total time
SELECT
DIGEST_TEXT AS sql_template,
COUNT_STAR AS exec_count,
ROUND(SUM_TIMER_WAIT / 1000000000000, 3) AS total_time_sec,
ROUND(AVG_TIMER_WAIT / 1000000000, 3) AS avg_time_ms,
ROUND(MAX_TIMER_WAIT / 1000000000, 3) AS max_time_ms,
SUM_ROWS_EXAMINED AS total_rows_examined,
SUM_ROWS_SENT AS total_rows_sent,
ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
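The magic divisors in the query above exist because performance_schema TIMER_WAIT columns count picoseconds: 10^12 ps per second and 10^9 ps per millisecond. A quick conversion helper, with invented row values:

```python
PS_PER_SEC = 10**12   # performance_schema timers are in picoseconds
PS_PER_MS = 10**9

def digest_row_to_readable(count_star: int, sum_timer_wait: int,
                           avg_timer_wait: int) -> dict:
    """Convert raw picosecond timer values to seconds/milliseconds,
    mirroring the ROUND(...) expressions in the SQL above."""
    return {
        "exec_count": count_star,
        "total_time_sec": round(sum_timer_wait / PS_PER_SEC, 3),
        "avg_time_ms": round(avg_timer_wait / PS_PER_MS, 3),
    }

# Hypothetical digest row: 1200 executions, 36 s total, 30 ms average
row = digest_row_to_readable(count_star=1200,
                             sum_timer_wait=36 * 10**12,
                             avg_timer_wait=30 * 10**9)
print(row)
```

Getting the unit conversion wrong by a factor of 1000 is a common mistake when building dashboards on top of these tables, so it is worth sanity-checking against a known-slow query.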
-- Full table scan suspects (rows examined far exceeds executions)
SELECT
DIGEST_TEXT AS sql_template,
COUNT_STAR AS exec_count,
SUM_ROWS_EXAMINED AS rows_examined,
ROUND(SUM_ROWS_EXAMINED / COUNT_STAR, 0) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_EXAMINED > COUNT_STAR * 1000
ORDER BY avg_rows_examined DESC
LIMIT 20;
-- 4. EXPLAIN a single slow query (EXPLAIN ANALYZE requires MySQL 8.0.18+)
EXPLAIN ANALYZE
SELECT o.*, u.username
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at >= '2024-01-01'
AND o.status = 1
ORDER BY o.created_at DESC
LIMIT 20;
-- 5. Queries that ran without using an index
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_NO_INDEX_USED
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_NO_INDEX_USED > 0
ORDER BY SUM_NO_INDEX_USED DESC
LIMIT 20;
Deadlock Monitoring
-- Deadlock monitoring and analysis
-- 1. Inspect the most recent deadlock
SHOW ENGINE INNODB STATUS\G
-- see the LATEST DETECTED DEADLOCK section
-- 2. Log every deadlock
SET GLOBAL innodb_print_all_deadlocks = ON;
-- deadlocks are then written to the error log
-- 3. Current lock waits (MySQL 5.7; this table was removed in 8.0,
-- use sys.innodb_lock_waits there instead)
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
b.trx_started AS blocking_started
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- 4. Kill the blocking transaction
-- KILL <blocking_thread>;
-- 5. Deadlock statistics: stock MySQL exposes no deadlock wait event in
-- performance_schema, so the query below may return nothing; rely on
-- innodb_print_all_deadlocks plus error log parsing instead
SELECT
EVENT_NAME,
COUNT_STAR AS deadlock_count
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE '%deadlock%';
-- 6. Lock wait timeout
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'; -- default: 50 seconds
-- 7. Common deadlock scenarios and fixes
/*
Scenario 1: cross-order updates
Txn A: UPDATE orders SET status=2 WHERE id=1; -> UPDATE orders SET status=2 WHERE id=2;
Txn B: UPDATE orders SET status=2 WHERE id=2; -> UPDATE orders SET status=2 WHERE id=1;
Fix: always update rows in a consistent order (e.g. ascending id)
Scenario 2: duplicate-key inserts
Txn A: INSERT INTO orders (id,...) VALUES (100,...); -- takes a lock
Txn B: INSERT INTO orders (id,...) VALUES (100,...); -- waits
Fix: use INSERT ... ON DUPLICATE KEY UPDATE, or catch the error and retry
Scenario 3: long transaction holding locks
Txn A: BEGIN; UPDATE ... ; -- never commits
Txn B: UPDATE on the same row; -- waits until timeout
Fix: keep transactions short; never make external calls inside a transaction
*/
Capacity Planning
-- Database capacity planning
-- 1. Growth by schema
SELECT
table_schema,
ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb,
ROUND(SUM(data_length) / 1024 / 1024 / 1024, 2) AS data_gb,
ROUND(SUM(index_length) / 1024 / 1024 / 1024, 2) AS index_gb,
SUM(table_rows) AS estimated_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY table_schema
ORDER BY size_gb DESC;
-- 2. Per-table size breakdown
SELECT
table_name,
table_rows,
ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb,
ROUND(index_length / (data_length + index_length) * 100, 1) AS index_ratio_pct
FROM information_schema.tables
WHERE table_schema = 'your_database'
ORDER BY (data_length + index_length) DESC;
-- 3. Disk space forecast
-- daily growth = current size / days running
-- days of headroom = (total disk space - current size) / daily growth
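The two forecast formulas above, in code form. The sizes and day counts are hypothetical, and this is a purely linear extrapolation; real growth is rarely linear, so recompute regularly from recent history:

```python
def capacity_forecast(disk_total_gb: float, current_gb: float, days_running: int):
    """Daily growth = current size / days running;
    days of headroom = (total disk - current size) / daily growth."""
    daily_growth_gb = current_gb / days_running
    remaining_days = (disk_total_gb - current_gb) / daily_growth_gb
    return daily_growth_gb, remaining_days

# Hypothetical: a 500 GB disk, database grew to 200 GB over 400 days
daily, remaining = capacity_forecast(disk_total_gb=500, current_gb=200,
                                     days_running=400)
print(f"growth {daily:.2f} GB/day, ~{remaining:.0f} days until the disk fills")
```

Feeding the by-schema sizes from query 1 into this function per database gives a rough per-tenant expansion schedule.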
-- 4. InnoDB buffer pool coverage
SELECT
ROUND(@@innodb_buffer_pool_size / 1024 / 1024 / 1024, 2) AS buffer_pool_gb,
ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_data_gb,
ROUND(@@innodb_buffer_pool_size / (SUM(data_length + index_length)) * 100, 1) AS coverage_pct
FROM information_schema.tables
WHERE engine = 'InnoDB';
-- 5. Connection pool sizing
SELECT
@@max_connections AS max_connections,
(SELECT variable_value FROM performance_schema.global_status
WHERE variable_name = 'Threads_connected') AS current_connections,
(SELECT variable_value FROM performance_schema.global_status
WHERE variable_name = 'Max_used_connections') AS peak_connections;
-- rule of thumb: max_connections = peak_connections * 1.5
Automated Remediation
"""数据库自动化修复脚本"""
import pymysql
import time
import smtplib
from datetime import datetime
class DatabaseAutoRemediation:
"""数据库问题自动修复"""
def __init__(self, db_config: dict):
self.config = db_config
self.actions_log = []
def get_connection(self):
return pymysql.connect(**self.config)
def check_and_kill_long_queries(self, max_duration_sec: int = 300):
"""自动 Kill 超过指定时间的查询"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE command = 'Query'
AND time > %s
AND info NOT LIKE '%%SHOW%%'
ORDER BY time DESC
""", (max_duration_sec,))
long_queries = cursor.fetchall()
for query in long_queries:
thread_id, user, host, db, cmd, duration, state, info = query
self._log_action(
f"发现长查询: thread={thread_id}, duration={duration}s, "
f"query={info[:100] if info else 'N/A'}"
)
# Kill 查询
if duration > max_duration_sec * 2:
cursor.execute(f"KILL {thread_id}")
self._log_action(f"已 Kill 线程 {thread_id}")
finally:
conn.close()
def check_replication_health(self):
"""检查复制健康状态"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("SHOW SLAVE STATUS")
result = cursor.fetchone()
if not result:
return # 不是从库
columns = [desc[0] for desc in cursor.description]
status = dict(zip(columns, result))
io_running = status.get('Slave_IO_Running') == 'Yes'
sql_running = status.get('Slave_SQL_Running') == 'Yes'
lag = int(status.get('Seconds_Behind_Master', 0) or 0)
if not io_running or not sql_running:
self._log_action(
f"复制中断! IO={io_running}, SQL={sql_running}, "
f"Error={status.get('Last_Error', 'N/A')}"
)
# 尝试自动恢复
if lag < 3600: # 延迟不超过1小时才尝试恢复
cursor.execute("STOP SLAVE")
cursor.execute("START SLAVE")
self._log_action("已尝试自动恢复复制")
elif lag > 60:
self._log_action(f"复制延迟: {lag} 秒")
finally:
conn.close()
def check_connection_usage(self):
"""检查连接使用率"""
conn = self.get_connection()
try:
with conn.cursor() as cursor:
cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
current = int(cursor.fetchone()[1])
cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
max_conn = int(cursor.fetchone()[1])
usage = current / max_conn
if usage > 0.8:
self._log_action(
f"连接数告警: {current}/{max_conn} ({usage:.1%})"
)
# 查找空闲连接
cursor.execute("""
SELECT user, host, db, time, state
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 300
ORDER BY time DESC
""")
idle_conns = cursor.fetchall()
if idle_conns:
self._log_action(
f"发现 {len(idle_conns)} 个空闲超过 5 分钟的连接"
)
finally:
conn.close()
def _log_action(self, message: str):
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
log_entry = f"[{timestamp}] {message}"
self.actions_log.append(log_entry)
print(log_entry)
def run_all_checks(self):
"""执行所有检查"""
self._log_action("=== 开始数据库健康检查 ===")
self.check_and_kill_long_queries(max_duration_sec=300)
self.check_replication_health()
self.check_connection_usage()
self._log_action("=== 检查完成 ===")
return self.actions_log
# 使用示例
config = {
"host": "localhost",
"port": 3306,
"user": "monitor",
"password": "monitor_pass",
}
remediation = DatabaseAutoRemediation(config)
logs = remediation.run_all_checks()优点
Performance Considerations
- Scrape interval: 15-30 seconds for MySQL metrics; do not go below 10 seconds
- Exporter privileges: create a dedicated read-only monitoring user; never use root
- Prometheus storage: 30 days of retention costs roughly 1-2 GB per target; consider Thanos for long-term storage
- Grafana panels: keep a single panel to no more than about 20 series
- Alert suppression: use group_by and inhibit_rules to prevent alert storms
- Historical data: archive slow query logs and performance data regularly so they cannot fill the disk
Summary
A solid monitoring system is the foundation of a stable production database. The core is choosing the right metrics, setting sensible alert thresholds, and establishing a clear response process. The Prometheus + Grafana + Alertmanager stack covers the full pipeline from collection through visualization to alerting, and automated remediation scripts can take much of the routine inspection work off the DBA's plate.
Key Takeaways
- MySQL core metrics: QPS/TPS, connections, slow queries, buffer pool hit rate, replication lag
- Prometheus architecture: pull-based scraping, TSDB storage, PromQL queries
- Alert severity levels: critical (act immediately) / warning (watch) / info (record)
- Slow query analysis: mysqldumpslow, performance_schema, EXPLAIN ANALYZE
- Deadlock debugging: SHOW ENGINE INNODB STATUS, innodb_print_all_deadlocks
- Capacity planning: forecast disk, connection, and buffer pool needs from growth trends
- Automated remediation: kill long queries, restart replication, clean up idle connections
Common Pitfalls
- Monitoring only the database: ignoring OS metrics (CPU, disk IO, network) even though many database problems originate at the OS layer
- Too many alerts: alerting on every metric causes alert fatigue, and the alerts that matter get missed
- No baseline: without a performance baseline there is no way to tell whether a metric is abnormal
- Ignoring replication lag: letting lag run into minutes risks serving inconsistent data
- Manual inspection: relying on periodic human checks instead of automated monitoring
- No monitoring redundancy: a single-point monitoring system fails silently, and nobody notices
Learning Path
- Beginner: deploy Prometheus + Grafana, build basic dashboards and alerts
- Intermediate: custom business metrics, slow query analysis, capacity planning
- Advanced: automated remediation, anomaly detection (machine learning), intelligent alert aggregation
- Expert: company-wide observability platform, AIOps, predictive maintenance
When to Use
- Every production database environment
- Performance assurance for high-concurrency workloads
- Unified monitoring of many database instances
- Before/after performance comparison during migrations
- Unattended 24/7 automated operations
Rollout Plan
- Step 1: deploy Prometheus + Grafana + mysqld_exporter
- Step 2: import the widely used MySQL Grafana dashboard template (ID 7362)
- Step 3: configure the core alert rules and hook up notification channels
- Step 4: establish a performance baseline and record normal metric ranges
- Step 5: add automated remediation scripts to reduce manual intervention
- Ongoing: review alert effectiveness regularly; tune thresholds and rules
Retrospective Questions
- How many database alerts fired last month? How many were real problems?
- What is the most frequent slow query? Has it been optimized?
- How long will current capacity last? When is the next expansion due?
- How many times did replication lag exceed 30 seconds? What was the root cause?
- Is the monitoring system itself reliable? Are there blind spots it does not cover?
- Are alert response times within the SLA?
