Automating Operations with Ansible Playbooks
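Introduction
Ansible is an open-source automation tool maintained by Red Hat. It is popular for its agentless architecture (it works over SSH), its declarative YAML syntax, and its large module ecosystem. Unlike agent-based configuration management tools such as Puppet and Chef, or SaltStack in its default minion mode, Ansible needs no client software on managed hosts: an SSH connection, plus a Python interpreter for most modules, is enough for configuration management, application deployment, and task orchestration.
The Playbook is Ansible's core concept: a YAML file that describes the flow of automated tasks. With Playbooks, operators can turn complex manual procedures into automation that is repeatable, auditable, and version-controlled. This article covers inventory management, writing Playbooks, common modules, the variable system, conditionals and loops, role design, and secret management.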
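As a rough illustration of the Playbook idea described above, here is a minimal sketch that installs and starts Nginx. The file name `site.yml` and the `webservers` group are assumptions for the example. It uses fully qualified module names; on older installs the short names `yum` and `systemd` used elsewhere in this article are the safer choice.

```yaml
# site.yml (hypothetical file name)
---
- name: Install and start Nginx
  hosts: webservers          # assumed inventory group
  become: true
  tasks:
    - name: Ensure the nginx package is installed
      ansible.builtin.yum:
        name: nginx
        state: present

    - name: Ensure the nginx service is running and enabled at boot
      ansible.builtin.systemd:
        name: nginx
        state: started
        enabled: true
```

Running `ansible-playbook site.yml` a second time should report no changes, because both tasks describe a desired state rather than an action to perform.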
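Key Features
| Feature | Description |
|---|---|
| Agentless | Relies on SSH (and Python on managed hosts); no client to install |
| Idempotent | Repeated runs converge on the same result, so retries are safe |
| Declarative | YAML describes the desired state and is easy to read |
| Rich modules | Thousands of modules across ansible-core and collections cover mainstream operations scenarios |
| Roles | Organize Playbooks into modules that are easy to reuse and share |
| Jinja2 templates | Generate configuration files dynamically |
| Vault encryption | Store sensitive data encrypted |
| Mature ecosystem | Ansible Galaxy shares community roles |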
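Environment Setup
Installing the Control Node
# Install on CentOS 7
yum install -y epel-release
yum install -y ansible
# CentOS 8 / Rocky Linux (the ansible package comes from EPEL)
dnf install -y epel-release
dnf install -y ansible
# Install the latest release with pip
pip3 install ansible
# Install a specific version with dnf (if the repository still carries it)
dnf install -y ansible-2.9.27
# Verify the installation
ansible --version
# Sample output:
# ansible 2.9.27
# config file = /etc/ansible/ansible.cfg
# configured module search path = ['/root/.ansible/plugins/modules']
# python version = 3.6.8
Configuration File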
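# Show which configuration file is in effect. Ansible uses the first one it finds, in this order:
# ANSIBLE_CONFIG environment variable, ./ansible.cfg, ~/.ansible.cfg, /etc/ansible/ansible.cfg
ansible --version | grep "config file"
# Create the project directory (group_vars/ and host_vars/ sit next to the inventory file so Ansible loads them automatically)
mkdir -p /opt/ansible/{inventory/{group_vars,host_vars},playbooks,roles,vars,templates,files,scripts}
# Write the project-level configuration file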
cat > /opt/ansible/ansible.cfg << 'EOF'
[defaults]
# Inventory file
inventory = ./inventory/hosts
# Module library path
library = ./library
# Roles path
roles_path = ./roles
# Remote user
remote_user = root
# SSH private key (the ed25519 key generated in the SSH setup step)
private_key_file = ~/.ssh/id_ed25519
# Number of parallel forks
forks = 20
# Connection timeout in seconds
timeout = 30
# Log file
log_path = ./ansible.log
# Host key checking
host_key_checking = False
# Retry files
retry_files_enabled = True
retry_files_save_path = ./retry
# Suppress deprecation warnings
deprecation_warnings = False
# Facts caching
gathering = smart
fact_caching = jsonfile
fact_caching_connection = ./facts_cache
fact_caching_timeout = 86400
[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False
[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
pipelining = True
control_path = /tmp/ansible-%%h-%%p-%%r
EOF
Passwordless SSH Setup
# Generate a key pair
ssh-keygen -t ed25519 -C "ansible-control" -f ~/.ssh/id_ed25519
# Distribute the public key in bulk (using sshpass)
yum install -y sshpass
# Password-based distribution script
cat > /opt/ansible/scripts/copy_ssh_key.sh << 'SCRIPT'
#!/bin/bash
PASSWORD="YourPassword"
KEY_FILE="$HOME/.ssh/id_ed25519.pub"
HOSTS=(
  "192.168.1.10"
  "192.168.1.11"
  "192.168.1.12"
)
for host in "${HOSTS[@]}"; do
  echo "Copying SSH key to ${host}..."
  sshpass -p "${PASSWORD}" ssh-copy-id -i "${KEY_FILE}" -o StrictHostKeyChecking=no "root@${host}"
done
SCRIPT
chmod +x /opt/ansible/scripts/copy_ssh_key.sh
Inventory (Host List)
Static Inventory (INI Format)
# /opt/ansible/inventory/hosts
# Groups by environment
[production]
prod-web01 ansible_host=192.168.1.10
prod-web02 ansible_host=192.168.1.11
prod-db01 ansible_host=192.168.1.20
prod-db02 ansible_host=192.168.1.21
[staging]
stg-web01 ansible_host=192.168.2.10
stg-db01 ansible_host=192.168.2.20
# Groups by role
[webservers]
prod-web01
prod-web02
stg-web01
[dbservers]
prod-db01
prod-db02
stg-db01
# Groups by function
[monitoring]
prod-web01
prod-db01
# Nested group (note: webservers and dbservers also contain staging hosts, so this group is not production-only)
[prod:children]
webservers
dbservers
# Group variables (defined directly in the inventory)
[webservers:vars]
nginx_worker_processes=4
nginx_worker_connections=4096
[dbservers:vars]
mysql_max_connections=500
mysql_innodb_buffer_pool_size=1G
# Global variables
[all:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_port=22
ntp_server=ntp.aliyun.com
Static Inventory (YAML Format)
# /opt/ansible/inventory/hosts.yml
all:
  vars:
    ansible_python_interpreter: /usr/bin/python3
    ansible_ssh_port: 22
    ntp_server: ntp.aliyun.com
  children:
    production:
      hosts:
        prod-web01:
          ansible_host: 192.168.1.10
          nginx_port: 80
        prod-web02:
          ansible_host: 192.168.1.11
          nginx_port: 80
        prod-db01:
          ansible_host: 192.168.1.20
          mysql_role: master
        prod-db02:
          ansible_host: 192.168.1.21
          mysql_role: slave
    staging:
      hosts:
        stg-web01:
          ansible_host: 192.168.2.10
        stg-db01:
          ansible_host: 192.168.2.20
    webservers:
      hosts:
        prod-web01:
        prod-web02:
        stg-web01:
    dbservers:
      hosts:
        prod-db01:
        prod-db02:
        stg-db01:
Dynamic Inventory
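# Fetch hosts dynamically from a cloud platform (Alibaba Cloud as the example)
# Install the Alibaba Cloud SDK packages that the dynamic inventory depends on
pip3 install aliyun-python-sdk-core aliyun-python-sdk-ecs
# Use the dynamic inventory (aliyun_ecs.yml is the inventory source file, not shown here)
ansible all -i aliyun_ecs.yml --list-hosts
# Static and dynamic sources can also be combined
ansible all -i inventory/hosts -i aliyun_ecs.yml --list-hosts
Inventory Commands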
# List all hosts
ansible all --list-hosts
# List the hosts in a group
ansible webservers --list-hosts
# Show the inventory as a group tree
ansible-inventory --graph
# Show the variables of one host
ansible-inventory --host prod-web01
# Export the inventory as JSON
ansible-inventory --list
# Test connectivity
ansible all -m ping
ansible webservers -m ping
ansible prod-web01 -m ping
Common Modules in Detail
file Module (File and Directory Management)
- name: 文件和目录管理示例
hosts: webservers
tasks:
# 创建目录
- name: 创建应用目录
file:
path: /opt/myapp/logs
state: directory
mode: '0755'
owner: app
group: app
recurse: yes
# 创建文件
- name: 创建空文件
file:
path: /opt/myapp/.initialized
state: touch
mode: '0644'
# 创建符号链接
- name: 创建符号链接
file:
src: /opt/myapp/current
dest: /opt/myapp/latest
state: link
# 删除文件或目录
- name: 删除临时文件
file:
path: /tmp/old_config
state: absent
copy Module (Copying Files)
- name: 文件复制示例
hosts: webservers
tasks:
# 复制文件
- name: 复制配置文件
copy:
src: files/nginx.conf
dest: /etc/nginx/nginx.conf
owner: root
group: root
mode: '0644'
backup: yes
notify: reload nginx
# 直接写入内容
- name: 写入 welcome 页面
copy:
content: |
Welcome to {{ ansible_hostname }}
IP: {{ ansible_default_ipv4.address }}
Environment: {{ env }}
dest: /usr/share/nginx/html/index.html
mode: '0644'
# Extract an archive (with remote_src: no, unarchive first copies it from the control node)
- name: 解压应用包
unarchive:
src: files/myapp-1.0.tar.gz
dest: /opt/myapp/
owner: app
group: app
remote_src: no
template Module (Jinja2 Templates)
- name: Jinja2 模板示例
hosts: webservers
tasks:
- name: 生成 Nginx 配置文件
template:
src: templates/nginx.conf.j2
dest: /etc/nginx/nginx.conf
owner: root
group: root
mode: '0644'
backup: yes
validate: '/usr/sbin/nginx -t -c %s'
notify: reload nginx
- name: 生成应用配置
template:
src: templates/app.conf.j2
dest: /opt/myapp/config/app.conf
notify: restart myapp
Example Jinja2 template file:
# templates/nginx.conf.j2
user nginx;
worker_processes {{ ansible_processor_vcpus }};
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
worker_connections {{ nginx_worker_connections | default(1024) }};
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent"';
sendfile on;
tcp_nopush on;
keepalive_timeout {{ nginx_keepalive_timeout | default(65) }};
{% if nginx_gzip | default(true) %}
gzip on;
gzip_types text/plain text/css application/json;
{% endif %}
# Upstream servers (assumes the inventory defines an app_servers group)
upstream backend {
{% for host in groups['app_servers'] %}
server {{ hostvars[host]['ansible_host'] }}:{{ app_port | default(8080) }};
{% endfor %}
}
server {
listen {{ nginx_port | default(80) }};
server_name {{ nginx_server_name | default('localhost') }};
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
{# 健康检查端点 #}
location /health {
access_log off;
return 200 'OK';
}
}
}
yum/dnf Module (Package Management)
- name: 软件包管理示例
hosts: all
tasks:
# 安装单个包
- name: 安装 vim
yum:
name: vim-enhanced
state: present
# 安装多个包
- name: 安装基础工具
yum:
name:
- wget
- curl
- git
- htop
- net-tools
- bind-utils
- telnet
- tcpdump
state: present
# 安装指定版本
- name: 安装特定版本的 Docker
yum:
name: docker-ce-20.10.21-3.el7
state: present
allow_downgrade: yes
# 更新所有包
- name: 更新系统(谨慎操作)
yum:
name: '*'
state: latest
exclude: kernel*
when: system_update | default(false)
# 删除包
- name: 卸载不需要的包
yum:
name:
- postfix
- avahi
state: absent
systemd Module (Service Management)
- name: 服务管理示例
hosts: webservers
tasks:
# 启动并启用服务
- name: 启动 Nginx 并设置开机自启
systemd:
name: nginx
state: started
enabled: yes
daemon_reload: yes
# 重启服务
- name: 重启应用服务
systemd:
name: myapp
state: restarted
# 重载配置
- name: 重载 Nginx 配置
systemd:
name: nginx
state: reloaded
# 停止服务
- name: 停止维护模式
systemd:
name: maintenance
state: stopped
# 管理自定义 systemd 服务
- name: 部署 systemd 服务文件
template:
src: templates/myapp.service.j2
dest: /etc/systemd/system/myapp.service
notify:
- daemon reload
- restart myapp
shell and command Modules
- name: Shell 命令示例
hosts: all
tasks:
# command 模块(不经过 shell 解析,更安全)
- name: 检查磁盘使用率
command: df -h
register: disk_usage
changed_when: false
- name: 显示磁盘信息
debug:
msg: "{{ disk_usage.stdout_lines }}"
# shell 模块(支持管道、重定向等 shell 特性)
- name: 获取 CPU 使用率最高的进程
shell: ps aux --sort=-%cpu | head -11
register: top_cpu_procs
changed_when: false
- name: 检查服务是否运行
shell: systemctl is-active {{ item }}
register: service_status
failed_when: false
changed_when: false
loop:
- nginx
- docker
- redis
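    # args options: chdir sets the working directory; creates makes the task idempotent
    - name: Run the cleanup script safely
      shell: /opt/scripts/cleanup.sh
      args:
        chdir: /opt/scripts
        creates: /tmp/cleanup_done # skip the task when this file exists
        executable: /bin/bash
Variable System
Where Variables Are Defined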
# 1. group_vars directory (variables per group); Ansible loads it only from next to the inventory file or the playbook
# /opt/ansible/inventory/group_vars/webservers.yml
nginx_worker_processes: 4
nginx_worker_connections: 4096
nginx_keepalive_timeout: 65
nginx_gzip: true
# /opt/ansible/inventory/group_vars/dbservers.yml
mysql_max_connections: 500
mysql_innodb_buffer_pool_size: "1G"
mysql_slow_query_log: true
# /opt/ansible/inventory/group_vars/all.yml (global variables)
ntp_server: ntp.aliyun.com
timezone: Asia/Shanghai
dns_servers:
  - 223.5.5.5
  - 8.8.8.8
# 2. host_vars directory (variables per host)
# /opt/ansible/inventory/host_vars/prod-web01.yml
nginx_port: 80
nginx_server_name: www.example.com
app_port: 8080
# /opt/ansible/inventory/host_vars/prod-db01.yml
mysql_role: master
mysql_server_id: 1
mysql_binlog_format: ROW
# 3. Variables defined in the Playbook
- name: 使用 Playbook 变量
hosts: webservers
vars:
app_name: myapp
app_version: "2.1.0"
app_port: 8080
max_memory: "512m"
tasks:
- name: 部署应用
debug:
msg: "Deploying {{ app_name }} v{{ app_version }} on port {{ app_port }}"
# 4. Load external variable files
- name: 使用变量文件
hosts: webservers
vars_files:
- vars/common.yml
- vars/{{ env }}.yml
tasks:
- name: 显示环境信息
debug:
msg: "Environment: {{ env }}, DB Host: {{ db_host }}"
# 5. Pass variables on the command line
ansible-playbook deploy.yml -e "env=production"
ansible-playbook deploy.yml -e "@vars/production.yml"
ansible-playbook deploy.yml -e '{"app_name":"myapp","app_version":"2.0"}'
Facts (System Information Gathering)
- name: 使用 Facts
hosts: all
tasks:
- name: 显示主机信息
debug:
msg: |
主机名: {{ ansible_hostname }}
操作系统: {{ ansible_distribution }} {{ ansible_distribution_version }}
CPU 核数: {{ ansible_processor_vcpus }}
总内存: {{ ansible_memtotal_mb }} MB
IP 地址: {{ ansible_default_ipv4.address }}
架构: {{ ansible_architecture }}
内核版本: {{ ansible_kernel }}
- name: 根据系统版本安装软件
debug:
msg: "{{ ansible_distribution }} 使用 {{ 'yum' if ansible_distribution == 'CentOS' else 'apt' }}"
- name: 根据内存大小调整配置
set_fact:
jvm_heap_size: "{{ (ansible_memtotal_mb * 0.6 / 1024) | round(0, 'ceil') | int }}g"
Conditionals
- name: 条件判断示例
hosts: all
tasks:
# 基础条件判断
- name: CentOS 系统执行
yum:
name: nginx
state: present
when: ansible_distribution == "CentOS"
- name: Ubuntu 系统执行
apt:
name: nginx
state: present
when: ansible_distribution == "Ubuntu"
# 多条件组合
- name: 生产环境 Web 服务器配置
debug:
msg: "Configuring production web server"
when:
- env == "production"
- "'webservers' in group_names"
- ansible_memtotal_mb | int > 4096
# 条件或
- name: 安装监控代理
yum:
name: node_exporter
state: present
when: install_monitoring | default(true) or env == "production"
# 变量定义检查
- name: 使用自定义端口(如果定义了)
debug:
msg: "Custom port: {{ custom_port }}"
when: custom_port is defined
- name: 使用默认端口
debug:
msg: "Default port: 8080"
when: custom_port is not defined
# 注册变量判断
- name: 检查服务状态
command: systemctl is-active nginx
register: nginx_status
failed_when: false
changed_when: false
- name: Nginx 正在运行
debug:
msg: "Nginx is running"
when: nginx_status.rc == 0
- name: Nginx 未运行
debug:
msg: "Nginx is not running, starting..."
when: nginx_status.rc != 0
# 列表包含判断
- name: 是否为数据库服务器
debug:
msg: "This is a database server"
when: "'dbservers' in group_names"
Loops
- name: 循环示例
hosts: webservers
tasks:
# 标准循环
- name: 创建多个目录
file:
path: "{{ item }}"
state: directory
mode: '0755'
loop:
- /opt/app/bin
- /opt/app/conf
- /opt/app/logs
- /opt/app/data
- /opt/app/tmp
# 循环安装软件
- name: 安装依赖包
yum:
name: "{{ item }}"
state: present
loop:
- gcc
- make
- openssl-devel
- libffi-devel
# 字典循环
- name: 创建多个用户
user:
name: "{{ item.name }}"
shell: "{{ item.shell }}"
groups: "{{ item.groups }}"
append: yes
loop:
- { name: 'deploy', shell: '/bin/bash', groups: 'docker' }
- { name: 'app', shell: '/sbin/nologin', groups: 'app' }
- { name: 'monitor', shell: '/bin/bash', groups: 'monitor' }
# 嵌套变量循环
- name: 配置多个虚拟主机
template:
src: templates/vhost.conf.j2
dest: "/etc/nginx/conf.d/{{ item.name }}.conf"
loop: "{{ virtual_hosts }}"
notify: reload nginx
# Iterate over a dictionary with loop and dict2items (the replacement for the older with_dict)
- name: 部署多个应用
debug:
msg: "Deploying {{ item.key }} version {{ item.value.version }} on port {{ item.value.port }}"
loop: "{{ applications | dict2items }}"
when: applications is defined
# 使用 until 重试
- name: 等待服务启动
uri:
url: "http://localhost:{{ app_port }}/health"
status_code: 200
register: result
until: result.status == 200
retries: 10
delay: 5
Handlers
- name: Handlers example
  hosts: webservers
  tasks:
    - name: Deploy the Nginx configuration
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        validate: '/usr/sbin/nginx -t -c %s'
      notify:
        - reload nginx
        - restart firewall
    - name: Deploy the virtual host configuration
      template:
        src: templates/vhost.conf.j2
        dest: /etc/nginx/conf.d/default.conf
      notify: reload nginx
    - name: Update the firewall rules
      firewalld:
        port: "{{ nginx_port }}/tcp"
        permanent: yes
        state: enabled
      notify: restart firewall
    # flush_handlers is a task, not a handler: it runs the handlers notified so far
    # right away instead of at the end of the play. It does not run handlers that were never notified.
    - name: Flush pending handlers
      meta: flush_handlers
  handlers:
    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded
      listen: "reload nginx"
    - name: restart firewall
      systemd:
        name: firewalld
        state: restarted
Role Design
Role Directory Structure
roles/
└── nginx/
    ├── defaults/          # Default variables (lowest precedence)
    │   └── main.yml
    ├── vars/              # Role variables (higher precedence than defaults)
    │   └── main.yml
    ├── tasks/             # Task lists
    │   ├── main.yml
    │   ├── install.yml
    │   ├── configure.yml
    │   └── service.yml
    ├── handlers/          # Handlers
    │   └── main.yml
    ├── templates/         # Jinja2 template files
    │   ├── nginx.conf.j2
    │   └── vhost.conf.j2
    ├── files/             # Static files
    │   └── nginx.repo
    ├── meta/              # Role metadata (dependencies, etc.)
    │   └── main.yml
    └── tests/             # Tests
        ├── inventory
        └── test.yml
Role tasks/main.yml
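# roles/nginx/tasks/main.yml
---
# import_tasks is static, so each tag below is inherited by every imported task.
# With include_tasks the tag would apply only to the include statement itself,
# and running with --tags install would skip the tasks inside the file.
- name: Import install tasks
  import_tasks: install.yml
  tags: install
- name: Import configure tasks
  import_tasks: configure.yml
  tags: configure
- name: Import service tasks
  import_tasks: service.yml
  tags: service
# roles/nginx/tasks/install.yml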
---
- name: 添加 Nginx YUM 源
copy:
src: nginx.repo
dest: /etc/yum.repos.d/nginx.repo
mode: '0644'
- name: 安装 Nginx
yum:
name: nginx
state: present
notify: start nginx
- name: 创建必要目录
file:
path: "{{ item }}"
state: directory
owner: nginx
group: nginx
mode: '0755'
loop:
- /etc/nginx/conf.d
- /etc/nginx/ssl
- /var/log/nginx
# roles/nginx/tasks/configure.yml
---
- name: 生成主配置文件
template:
src: nginx.conf.j2
dest: /etc/nginx/nginx.conf
owner: root
group: root
mode: '0644'
backup: yes
validate: '/usr/sbin/nginx -t -c %s'
notify: reload nginx
- name: 生成虚拟主机配置
template:
src: vhost.conf.j2
dest: "/etc/nginx/conf.d/{{ item.name }}.conf"
loop: "{{ nginx_vhosts }}"
notify: reload nginx
- name: 部署 SSL 证书
copy:
src: "ssl/{{ item }}"
dest: "/etc/nginx/ssl/{{ item }}"
mode: '0600'
loop:
- server.crt
- server.key
when: nginx_ssl_enabled | default(false)
notify: reload nginx
# roles/nginx/tasks/service.yml
---
- name: 启动并启用 Nginx
systemd:
name: nginx
state: started
enabled: yes
daemon_reload: yes
- name: 验证 Nginx 状态
uri:
url: "http://localhost:{{ nginx_port | default(80) }}/health"
status_code: 200
register: health_check
until: health_check.status == 200
retries: 5
delay: 3
Role Default Variables
# roles/nginx/defaults/main.yml
nginx_port: 80
nginx_ssl_port: 443
nginx_ssl_enabled: false
nginx_worker_processes: "{{ ansible_processor_vcpus }}"
nginx_worker_connections: 1024
nginx_keepalive_timeout: 65
nginx_client_max_body_size: "50m"
nginx_gzip: true
nginx_server_name: localhost
nginx_vhosts: []
Role Handlers
# roles/nginx/handlers/main.yml
---
- name: start nginx
systemd:
name: nginx
state: started
enabled: yes
- name: reload nginx
systemd:
name: nginx
state: reloaded
- name: restart nginx
systemd:
name: nginx
state: restarted
Role Metadata
# roles/nginx/meta/main.yml
galaxy_info:
  author: ops-team
  description: Nginx web server installation and configuration
  company: MyCompany
  license: MIT
  min_ansible_version: "2.9" # quoted so YAML does not read it as a float
  platforms:
    - name: EL
      versions:
        - 7
        - 8
  galaxy_tags:
    - nginx
    - web
    - proxy
dependencies:
  - role: common
    vars:
      timezone: Asia/Shanghai
Playbook That Uses the Role
# /opt/ansible/playbooks/deploy_nginx.yml
---
- name: 部署 Nginx Web 服务器
hosts: webservers
become: yes
roles:
- role: common
tags: common
- role: nginx
tags: nginx
vars:
nginx_worker_connections: 4096
nginx_ssl_enabled: true
nginx_vhosts:
- name: default
server_name: www.example.com
port: 80
ssl: true
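upstream_port: 8080
Secret Management with Ansible Vault
Creating an Encrypted File
# Create an encrypted variables file
ansible-vault create vars/secrets.yml
# You are prompted for a password, then dropped into an editor
# Use a password file (avoids typing the password each time); keep it out of version control
echo "MyVaultPassword2024" > .vault_password
chmod 600 .vault_password
ansible-vault create --vault-password-file=.vault_password vars/secrets.yml
Encrypting Existing Files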
# Encrypt a file
ansible-vault encrypt vars/secrets.yml
# Encrypt several files
ansible-vault encrypt vars/db_password.yml vars/api_keys.yml
# View the contents of an encrypted file
ansible-vault view vars/secrets.yml
# Edit an encrypted file
ansible-vault edit vars/secrets.yml
# Decrypt a file
ansible-vault decrypt vars/secrets.yml
Contents of the Encrypted Variables File
# vars/secrets.yml (stored encrypted)
db_root_password: "SuperSecret123!"
db_app_password: "AppPassword456"
api_key: "sk-xxxxxxxxxxxxxxxxx"
jwt_secret: "my-jwt-secret-key-2024"
redis_password: "RedisPassword789"
Using Encrypted Variables
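# Supply the password from a file at run time
ansible-playbook deploy.yml --vault-password-file=.vault_password
# Prompt for the password interactively
ansible-playbook deploy.yml --ask-vault-pass
# Files encrypted with different passwords: pass one --vault-id per password, as label@source
# (.vault_password_dev is a hypothetical file name)
ansible-playbook deploy.yml --vault-id dev@.vault_password_dev --vault-id prod@prompt
Encrypting a Single String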
# Encrypt a single string (can be embedded in a Playbook)
ansible-vault encrypt_string 'MySecretPassword' --name 'db_password'
# Output:
# db_password: !vault |
# $ANSIBLE_VAULT;1.1;AES256
# 3638366135353...
# Paste the output directly into the Playbook
Asynchronous Tasks
- name: 异步任务示例
hosts: all
tasks:
# 异步执行长时间任务
- name: 大文件下载(异步执行)
get_url:
url: "https://releases.example.com/app-{{ app_version }}.tar.gz"
dest: "/opt/app-{{ app_version }}.tar.gz"
async: 3600 # 超时时间(秒)
poll: 0 # 不等待,立即返回
register: download_result
# 继续执行其他任务
- name: 准备目录
file:
path: /opt/app
state: directory
# 等待异步任务完成
- name: 等待下载完成
async_status:
jid: "{{ download_result.ansible_job_id }}"
register: job_result
until: job_result.finished
retries: 60
delay: 10
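    # Run a long update in the background on every host in parallel
    - name: Update the whole system in the background
      yum:
        name: '*'
        state: latest
      async: 3600
      poll: 0
      register: update_results
    # The task above has no loop, so the job ID is on update_results itself, not in update_results.results
    - name: Wait for the update to finish
      async_status:
        jid: "{{ update_results.ansible_job_id }}"
      register: update_status
      until: update_status.finished
      retries: 120
      delay: 30
Error Handling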
- name: 错误处理示例
hosts: webservers
tasks:
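# failed_when: false keeps a non-zero exit code from failing the task (ignore_errors: yes would instead report the failure and continue)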
- name: 检查旧服务状态
command: systemctl is-active old-service
register: old_service_status
failed_when: false
changed_when: false
- name: 停止旧服务(如果存在)
systemd:
name: old-service
state: stopped
when: old_service_status.rc == 0
# block/rescue/always 结构化错误处理
- block:
- name: 备份数据库
command: /opt/scripts/backup_db.sh
register: backup_result
- name: 执行数据库迁移
command: /opt/scripts/migrate_db.sh {{ app_version }}
register: migrate_result
rescue:
- name: 迁移失败,回滚数据库
command: /opt/scripts/rollback_db.sh
register: rollback_result
- name: 发送告警通知
mail:
to: ops@example.com
subject: "数据库迁移失败 - {{ ansible_hostname }}"
body: "迁移失败,已执行回滚操作。"
delegate_to: localhost
- name: 标记任务失败
fail:
msg: "数据库迁移失败,已执行回滚。请检查日志。"
always:
- name: 清理临时文件
file:
path: /tmp/migration_tmp
state: absent
- name: 记录操作日志
lineinfile:
path: /var/log/migration.log
line: "{{ ansible_date_time.iso8601 }} - Migration {{ 'succeeded' if (migrate_result is defined and migrate_result is succeeded) else 'failed' }}"
create: yes
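    # Custom failure condition: fail when /opt is more than 90% full.
    # df -P prints a header line first, so the usage column is read from the last line.
    - name: Check disk space
      command: df -P /opt
      register: disk_space
      changed_when: false
      failed_when: >
        (disk_space.stdout.splitlines()[-1].split()[4] | replace('%','') | int) > 90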
# 使用 assert 进行前置检查
- name: 前置条件检查
assert:
that:
- ansible_memtotal_mb | int > 2048
- ansible_processor_vcpus | int >= 2
- app_version is defined
fail_msg: "Preconditions not met: at least 2 GB of memory and 2 CPU cores are required, and app_version must be set"
success_msg: "Precondition checks passed"
Complete Playbook Example
# /opt/ansible/playbooks/deploy_app.yml
---
- name: 完整应用部署
hosts: webservers
become: yes
vars_files:
- vars/common.yml
- vars/{{ env }}.yml
- vars/secrets.yml
pre_tasks:
- name: 检查环境变量
assert:
that:
- env is defined
- app_version is defined
fail_msg: "必须指定 env 和 app_version 变量"
- name: 记录部署开始时间
set_fact:
deploy_start_time: "{{ ansible_date_time.iso8601 }}"
- name: 通知部署开始
debug:
msg: "开始部署 {{ app_name }} v{{ app_version }} 到 {{ env }} 环境"
roles:
- role: common
tags: [common, always]
- role: nginx
tags: [nginx, web]
tasks:
- name: 创建应用目录
file:
path: "{{ item }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_user }}"
mode: '0755'
loop:
- "{{ app_base_dir }}/bin"
- "{{ app_base_dir }}/conf"
- "{{ app_base_dir }}/logs"
- "{{ app_base_dir }}/data"
tags: prepare
- name: 下载应用包
get_url:
url: "{{ artifact_repo }}/{{ app_name }}/{{ app_version }}/{{ app_name }}.tar.gz"
dest: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
checksum: "sha256:{{ artifact_checksum }}"
register: download_result
until: download_result is succeeded
retries: 3
delay: 10
tags: download
- name: 解压应用包
unarchive:
src: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
dest: "{{ app_base_dir }}"
owner: "{{ app_user }}"
group: "{{ app_user }}"
remote_src: yes
notify: restart myapp
tags: deploy
- name: 生成应用配置
template:
src: templates/app.conf.j2
dest: "{{ app_base_dir }}/conf/app.conf"
owner: "{{ app_user }}"
mode: '0640'
notify: restart myapp
tags: configure
- name: 部署 systemd 服务
template:
src: templates/myapp.service.j2
dest: /etc/systemd/system/myapp.service
notify:
- daemon reload
- restart myapp
tags: service
post_tasks:
- name: 等待应用启动
uri:
url: "http://localhost:{{ app_port }}/health"
status_code: 200
register: health_result
until: health_result.status == 200
retries: 15
delay: 5
tags: verify
- name: 部署成功通知
debug:
msg: |
部署完成!
应用: {{ app_name }}
版本: {{ app_version }}
环境: {{ env }}
Time: {{ deploy_start_time }} -> {{ now(utc=true).strftime('%Y-%m-%dT%H:%M:%SZ') }}
tags: always
handlers:
- name: daemon reload
systemd:
daemon_reload: yes
- name: restart myapp
systemd:
name: myapp
state: restarted
CI/CD Integration
Jenkins Pipeline Integration
// Jenkinsfile
pipeline {
agent any
environment {
ANSIBLE_DIR = '/opt/ansible'
VAULT_PASSWORD = credentials('ansible-vault-password') // must be a "Secret file" credential so the variable holds a file path for --vault-password-file
}
stages {
stage('Syntax Check') {
steps {
sh """
cd ${ANSIBLE_DIR}
ansible-playbook playbooks/deploy_app.yml --syntax-check
"""
}
}
stage('Dry Run') {
steps {
sh """
cd ${ANSIBLE_DIR}
ansible-playbook playbooks/deploy_app.yml \
-e "env=${ENV}" \
-e "app_version=${VERSION}" \
--vault-password-file=${VAULT_PASSWORD} \
--check \
--diff
"""
}
}
stage('Deploy to Staging') {
when { branch 'develop' }
steps {
sh """
cd ${ANSIBLE_DIR}
ansible-playbook playbooks/deploy_app.yml \
-l staging \
-e "env=staging" \
-e "app_version=${VERSION}" \
--vault-password-file=${VAULT_PASSWORD}
"""
}
}
stage('Deploy to Production') {
when { branch 'main' }
steps {
input 'Confirm deployment to production?'
sh """
cd ${ANSIBLE_DIR}
ansible-playbook playbooks/deploy_app.yml \
-l production \
-e "env=production" \
-e "app_version=${VERSION}" \
--vault-password-file=${VAULT_PASSWORD}
"""
}
}
}
}
GitLab CI Integration
# .gitlab-ci.yml
stages:
- validate
- test
- deploy
variables:
ANSIBLE_DIR: "/opt/ansible"
validate_playbook:
stage: validate
script:
- ansible-playbook playbooks/deploy_app.yml --syntax-check
- ansible-lint playbooks/deploy_app.yml
deploy_staging:
stage: deploy
script:
- ansible-playbook playbooks/deploy_app.yml
-l staging
-e "env=staging app_version=$CI_COMMIT_TAG"
--vault-password-file=$VAULT_PASS_FILE
environment:
name: staging
only:
- tags
deploy_production:
stage: deploy
script:
- ansible-playbook playbooks/deploy_app.yml
-l production
-e "env=production app_version=$CI_COMMIT_TAG"
--vault-password-file=$VAULT_PASS_FILE
environment:
name: production
when: manual
only:
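- tags
Best Practices
Coding Conventions
- Always name tasks: every task must have a name field
- Use FQCNs: write fully qualified module names such as ansible.builtin.yum
- Use YAML syntax: avoid the key=value shorthand
- Use tags sensibly: they make selective runs easy
- Organize code with roles: once a Playbook grows beyond about 5 tasks, consider extracting a role
- Idempotence: make sure repeated runs neither fail nor report spurious changes
- Use assert for pre-flight checks: catch problems early
- Encrypt sensitive data: protect passwords and keys with Vault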
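Several of the conventions above can be seen together in one short play. This is only a sketch: the `app_version` variable and the tag names are assumptions chosen for the example.

```yaml
---
- name: Conventions sketch
  hosts: webservers
  become: true
  tasks:
    - name: Check preconditions before changing anything
      ansible.builtin.assert:
        that:
          - app_version is defined     # assumed variable
        fail_msg: "app_version must be set"
      tags: always

    # Discouraged shorthand:  yum: name=nginx state=present
    - name: Ensure nginx is installed
      ansible.builtin.yum:
        name: nginx
        state: present
      tags: install
```

With tags in place, `ansible-playbook play.yml --tags install` runs the install task plus anything tagged `always`.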
Project Structure
/opt/ansible/
├── ansible.cfg              # Project configuration
├── inventory/               # Inventory
│   ├── hosts
│   ├── group_vars/
│   │   ├── all.yml
│   │   ├── webservers.yml
│   │   └── dbservers.yml
│   └── host_vars/
│       ├── prod-web01.yml
│       └── prod-db01.yml
├── playbooks/               # Playbooks
│   ├── deploy_app.yml
│   ├── update_system.yml
│   └── backup_db.yml
├── roles/                   # Roles
│   ├── common/
│   ├── nginx/
│   └── mysql/
├── vars/                    # Shared variables
│   ├── common.yml
│   ├── production.yml
│   ├── staging.yml
│   └── secrets.yml          # Vault-encrypted
├── templates/               # Global templates
├── files/                   # Static files
├── scripts/                 # Helper scripts
└── .vault_password          # Vault password file (exclude from version control)
Advantages
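- Minimal dependencies: managed hosts need only SSH and Python, with no agent to install
- Low learning curve: YAML syntax lets operators get productive quickly
- Idempotent execution: repeated runs are safe
- Rich modules: thousands of modules cover mainstream scenarios
- Active community: Ansible Galaxy offers many ready-made roles
- Auditable: every change can be traced through Playbook version control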
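To make the Galaxy point concrete, community content is usually declared in a requirements file. The sketch below assumes the community role `geerlingguy.nginx` and the `community.general` collection purely as examples; in real use you would pin versions.

```yaml
# requirements.yml
---
roles:
  - name: geerlingguy.nginx      # example community role
collections:
  - name: community.general      # example collection
```

The roles can then be installed with `ansible-galaxy role install -r requirements.yml` and the collections with `ansible-galaxy collection install -r requirements.yml`.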
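Disadvantages
- Performance bottleneck: SSH connection overhead makes runs slow across many nodes
- Hard to debug: error messages from complex Playbooks are not always intuitive
- Weaker state management: less fine-grained than Puppet or SaltStack
- Limited Windows support: primarily Linux-oriented, with less complete Windows support
- Limits at large scale: performance degrades when a single control node manages thousands of nodes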
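Some of the scale concerns above can be softened with play-level settings, in addition to the `forks`, `pipelining`, and fact caching options already set in ansible.cfg. The following is a sketch under assumptions (the batch size and the task are illustrative), not a tuned configuration.

```yaml
---
- name: Rolling update tuned for a larger fleet
  hosts: webservers
  become: true
  gather_facts: false        # skip fact gathering when no task needs facts
  serial: "25%"              # work through the group in batches of a quarter
  tasks:
    - name: Ensure nginx is at the latest version
      ansible.builtin.yum:
        name: nginx
        state: latest
```

Batching with `serial` also limits the blast radius of a bad change, at the cost of a longer total run.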
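Summary
With its concise syntax, agentless architecture, and strong module ecosystem, Ansible has become one of the most popular automation tools for operations. Combining Playbooks and Roles yields automation that is reusable, maintainable, and auditable, and Ansible adapts well to CI/CD integration and multi-cloud management. It is a core skill for modern operations engineers.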
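Key Takeaways
- Static and dynamic host management with Inventory
- Playbook YAML syntax and task orchestration
- Correct use of the built-in modules (file, copy, template, yum, systemd, shell)
- Variable precedence and Facts gathering
- Conditionals, loops, and how Handlers are triggered
- The standard Role directory structure and designing for reuse
- Jinja2 template filters and control statements
- Protecting sensitive data with Vault encryption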
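Variable precedence is easier to remember with one variable defined in three places. The sketch below shows three separate files in one listing; the port values are made up for the example.

```yaml
# roles/nginx/defaults/main.yml  (role defaults: lowest precedence)
nginx_port: 80
---
# inventory/group_vars/webservers.yml  (inventory group variables override role defaults)
nginx_port: 8080
---
# playbook  (play vars override inventory variables)
- name: Show which value wins
  hosts: webservers
  vars:
    nginx_port: 8081
  roles:
    - nginx
  tasks:
    - name: Print the effective port
      ansible.builtin.debug:
        var: nginx_port
```

Here the play-level value 8081 takes effect. Passing `-e nginx_port=9090` on the command line would override all three, and a value in the role's vars/main.yml would beat the play vars but not the extra vars.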
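Common Pitfalls
- Overusing the shell/command modules: prefer the purpose-built idempotent modules
- Ignoring idempotence: shell is not idempotent by default, so set changed_when correctly or use creates/removes
- Inconsistent variable names: establish a single naming convention
- Oversized Playbooks: split them into Roles, each with a single responsibility
- Storing secrets in plain text: passwords and keys must be encrypted with Vault
- Skipping --check mode: do a dry run before executing in production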
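The first two pitfalls can be illustrated side by side. In this sketch the `init.sh` script is hypothetical and is assumed to create the marker file itself; the paths follow the examples used earlier in the article.

```yaml
---
- name: Prefer modules over shell
  hosts: webservers
  become: true
  tasks:
    # Instead of: shell: mkdir -p /opt/myapp/logs && chown app:app /opt/myapp/logs
    - name: Ensure the log directory exists
      ansible.builtin.file:
        path: /opt/myapp/logs
        state: directory
        owner: app
        group: app
        mode: "0755"

    # When a command is unavoidable, tell Ansible how to detect "already done"
    - name: Initialize the application once
      ansible.builtin.command: /opt/myapp/bin/init.sh   # hypothetical script
      args:
        creates: /opt/myapp/.initialized

    # Read-only commands never change anything
    - name: Read the kernel version
      ansible.builtin.command: uname -r
      register: kernel_version
      changed_when: false
```

The file task reports "changed" only when something actually differs, which is what makes `--check` runs and repeated executions trustworthy.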
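Next Steps
- Custom modules: learn to write your own Ansible modules in Python
- Custom plugins: explore developing callback plugins and filter plugins
- Ansible Tower/AWX: learn the enterprise-grade Ansible management platform
- Molecule testing: use Molecule to test Roles automatically
- Ansible Navigator: learn the newer command-line tool that runs Playbooks inside Execution Environments
- Network automation: manage network devices (switches, routers) with Ansible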
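Use Cases
- Bulk server initialization and configuration management
- Continuous deployment of applications
- Operating system patching and update management
- Database backup and restore operations
- Security baseline checks and hardening
- Network device configuration management
- Cloud resource orchestration
- Rapid setup of development and test environments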
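Adoption Advice
- Start small: begin with simple ad-hoc batch commands and move to Playbooks step by step
- Version control: keep all Playbooks and Roles in Git
- Code review: every Playbook change should go through review
- Separate environments: isolate environments with inventories and variable files
- Documentation first: every Role needs a README describing its purpose and usage
- Test coverage: test with Molecule or ansible-playbook --check
- Monitoring integration: feed Ansible run results into monitoring and alerting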
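One simple way to follow the "separate environments" advice is to let the `production` and `staging` inventory groups carry the environment-specific values. The two files below are a sketch; the staging host name is an assumption.

```yaml
# inventory/group_vars/production.yml
env: production
nginx_server_name: www.example.com
---
# inventory/group_vars/staging.yml
env: staging
nginx_server_name: staging.example.com   # assumed host name
```

A run such as `ansible-playbook playbooks/deploy_nginx.yml -l staging` then picks up the staging values without any `-e` arguments.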
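Troubleshooting Checklist
| Symptom | Likely cause | How to investigate |
|---|---|---|
| SSH connection fails | Key or network problem | ansible all -m ping -vvv |
| Module not found | Wrong Python path | Check ansible_python_interpreter |
| Variable undefined | Wrong variable file path | ansible-inventory --host HOST |
| Permission denied | sudo misconfiguration | Check become and become_user |
| Template rendering error | Jinja2 syntax error | Run with --check --diff and read the failing task's error; --syntax-check validates only the Playbook, not the templates |
| Handler not triggered | notify name mismatch | Make sure notify matches the handler name or listen topic; handlers run only when the task reports changed |
| Execution timeout | Slow target host | Adjust the timeout and async settings |
| Idempotence problem | shell task always reports changed | Add creates/removes, or changed_when: false for read-only commands |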
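For the "handler not triggered" row, a `listen` topic decouples the notify string from the handler names, which removes one common source of mismatches. This is a sketch; the topic name `web config changed` is made up for the example.

```yaml
---
- name: Handler wiring sketch
  hosts: webservers
  become: true
  tasks:
    - name: Deploy the virtual host configuration
      ansible.builtin.template:
        src: templates/vhost.conf.j2
        dest: /etc/nginx/conf.d/default.conf
      notify: web config changed       # topic name, matched by "listen" below

  handlers:
    - name: Validate the nginx configuration
      ansible.builtin.command: nginx -t
      changed_when: false
      listen: web config changed

    - name: Reload nginx
      ansible.builtin.systemd:
        name: nginx
        state: reloaded
      listen: web config changed
```

Both handlers run, in the order they are defined, but only when the template task reports a change.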
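Review Questions
- How is Ansible's variable precedence ordered from highest to lowest, and in which situations do precedence conflicts need attention?
- How would you design an Ansible project structure that supports multiple environments (development, test, production)?
- How do you keep a Playbook idempotent? Which modules are idempotent by nature, and which need extra handling?
- How can Ansible Vault passwords be managed safely in a CI/CD environment?
- What is the difference between a Role's defaults and vars directories, and when is each appropriate?
- How do you tune Ansible's execution performance at large scale (1000+ nodes)?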
