Terraform Module Design
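Introduction
A Terraform module is the core unit of reuse in infrastructure code. It plays the role that a function or class plays in application development: it encapsulates a bounded set of resources together with their input variables, outputs, and constraints. A good module does more than "move resources somewhere else"; it lets a team keep resource standards and delivery practices consistent across environments, projects, and clouds.
In Infrastructure as Code (IaC) practice, modules mark the shift from "writing Terraform configuration" to "designing an infrastructure API". A well-designed module behaves like a small API: callers need to know only what goes in and what comes out, not how it is implemented. That makes reuse, testing, and evolution of infrastructure more systematic.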
Features
Module Design Principles
Single Responsibility Principle
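# A module should have a single responsibility, like a function
# Good module boundaries:
# - vpc module: manages the VPC, subnets, route tables, NAT
# - ecs_service module: manages the ECS service and task definition
# - rds module: manages RDS instances and parameter groups
# - s3_bucket module: manages S3 buckets and policies
# Poor module boundaries:
# - infra module: contains all infrastructure (VPC + RDS + ECS + S3...)
#   Why: it violates single responsibility, is complex to call, and is hard to reuse
# Also poor:
# - subnet_a module, subnet_b module, route_table_a module
#   Why: over-splitting makes the modules harder to call
Interface Design Principles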
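# Module interface design principles
# 1. Minimal required variables
#    Expose only the variables the caller must provide
#    Use sensible defaults to reduce required inputs
# 2. Semantically clear naming
#    Variable names should express intent
#    Avoid abbreviations and vague names
# 3. Defensive validation
#    Use validation blocks to check inputs
#    Reject invalid input at the module boundary
# 4. Stable outputs
#    Outputs should be stable and useful
#    Avoid exposing internal implementation details that may change
# 5. Backward compatibility
#    New variables must have default values
#    Do not remove existing outputs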
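The five principles above can be sketched in a very small module. This is only an illustration under assumptions: the `s3_bucket` module contents, the variable names `bucket_name` and `force_destroy`, and the output name `bucket_arn` are hypothetical and do not come from the modules shown later in this article.

```hcl
# Hypothetical modules/s3_bucket (variables, resource, and output in one listing)

# Principle 1: only the bucket name is required.
variable "bucket_name" {
  type        = string
  description = "Globally unique bucket name"

  # Principle 3: reject invalid input at the module boundary.
  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.bucket_name))
    error_message = "Bucket name must be 3-63 characters: lowercase letters, digits, dots, or hyphens."
  }
}

# Principles 2 and 5: a descriptive name, plus a default so that callers
# written before this variable existed keep working unchanged.
variable "force_destroy" {
  type        = bool
  description = "Allow deleting the bucket even when it still contains objects"
  default     = false
}

resource "aws_s3_bucket" "this" {
  bucket        = var.bucket_name
  force_destroy = var.force_destroy
}

# Principle 4: expose a stable identifier callers need, not internal details.
output "bucket_arn" {
  description = "ARN of the bucket"
  value       = aws_s3_bucket.this.arn
}
```

A caller of this sketch would set only `bucket_name` and read only `bucket_arn`, which is what keeps the module free to change its internals later.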
Basic Module Structure, Variables, and Outputs
# Recommended project directory layout
terraform/
├── envs/                      # Environment configurations (root modules)
│   ├── dev/
│   │   ├── main.tf            # Main configuration
│   │   ├── variables.tf       # Variable declarations for the environment
│   │   ├── terraform.tfvars   # Variable values
│   │   ├── outputs.tf         # Outputs
│   │   ├── backend.tf         # State storage configuration
│   │   └── providers.tf       # Provider configuration
│   ├── staging/
│   │   └── ...
│   └── prod/
│       └── ...
├── modules/                   # Reusable modules
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── ecs_service/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── rds/
│   │   └── ...
│   └── s3_bucket/
│       └── ...
├── policies/                  # Policy documents
│   ├── s3-bucket-policy.json
│   └── iam-policy.json
└── scripts/                   # Helper scripts
    ├── plan.sh
    └── validate.sh
# modules/ecs_service/variables.tf
# Module variable definitions: this is the module's "API"
variable "service_name" {
type = string
description = "ECS service name"
validation {
condition = can(regex("^[a-z][a-z0-9-]{2,63}$", var.service_name))
error_message = "Service name must be 3-64 characters, lowercase alphanumeric with hyphens."
}
}
variable "cluster_arn" {
type = string
description = "Target ECS cluster ARN"
}
variable "task_execution_role_arn" {
type = string
description = "ECS task execution role ARN"
}
variable "subnet_ids" {
type = list(string)
description = "Private subnet IDs for the service"
validation {
condition = length(var.subnet_ids) >= 2
error_message = "At least 2 subnets are required for high availability."
}
}
variable "security_group_ids" {
type = list(string)
description = "Additional security group IDs"
default = []
}
variable "desired_count" {
type = number
description = "Desired number of tasks"
default = 2
validation {
condition = var.desired_count >= 1 && var.desired_count <= 100
error_message = "Desired count must be between 1 and 100."
}
}
variable "cpu" {
type = number
description = "Task CPU units (256, 512, 1024, 2048, 4096)"
default = 256
}
variable "memory" {
type = number
description = "Task memory (MB)"
default = 512
}
variable "container_image" {
type = string
description = "Container image URL"
}
variable "container_port" {
type = number
description = "Container exposed port"
default = 8080
}
variable "environment" {
type = string
description = "Environment name (dev/staging/prod)"
default = "dev"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging or prod."
}
}
variable "health_check_path" {
type = string
description = "Container health check path"
default = "/health"
}
variable "enable_alb" {
type = bool
description = "Attach the service to an existing Application Load Balancer target group"
default = true
}
variable "environment_variables" {
type = map(string)
description = "Environment variables for the container"
default = {}
}
variable "secrets" {
type = map(string)
description = "Secrets from AWS Secrets Manager (key = env var name, value = secret ARN)"
default = {}
}
variable "tags" {
type = map(string)
description = "Additional tags for all resources"
default = {}
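}
# Inputs that main.tf references but the listing above did not declare
variable "vpc_id" {
type = string
description = "VPC ID for the service security group"
}
variable "task_role_arn" {
type = string
description = "IAM role ARN assumed by the running task (optional)"
default = null
}
variable "alb_security_group_id" {
type = string
description = "ALB security group ID (needed when enable_alb is true)"
default = null
}
variable "target_group_arn" {
type = string
description = "ALB target group ARN (needed when enable_alb is true)"
default = null
}
# modules/ecs_service/main.tf
# Module resource definitions: this is the module's "implementation"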
data "aws_region" "current" {}
locals {
common_tags = merge(var.tags, {
Name = var.service_name
Environment = var.environment
ManagedBy = "terraform"
Service = var.service_name
Region = data.aws_region.current.name
})
# Merge system and user environment variables (user-supplied values win on key conflicts)
all_environment_variables = merge(
{
APP_ENV = var.environment
APP_NAME = var.service_name
REGION = data.aws_region.current.name
},
var.environment_variables
)
}
# Security group
resource "aws_security_group" "this" {
name = "${var.service_name}-sg"
description = "Security group for ${var.service_name}"
vpc_id = var.vpc_id
tags = local.common_tags
}
resource "aws_security_group_rule" "ingress_alb" {
count = var.enable_alb ? 1 : 0
type = "ingress"
from_port = var.container_port
to_port = var.container_port
protocol = "tcp"
security_group_id = aws_security_group.this.id
source_security_group_id = var.alb_security_group_id
}
resource "aws_security_group_rule" "egress" {
type = "egress"
from_port = 0
to_port = 0
protocol = "-1"
security_group_id = aws_security_group.this.id
cidr_blocks = ["0.0.0.0/0"]
}
# Log group
resource "aws_cloudwatch_log_group" "this" {
name = "/ecs/${var.service_name}/${var.environment}"
retention_in_days = var.environment == "prod" ? 90 : 30
tags = local.common_tags
}
# ECS task definition
resource "aws_ecs_task_definition" "this" {
family = var.service_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = var.task_execution_role_arn
task_role_arn = var.task_role_arn
container_definitions = jsonencode([
{
name = var.service_name
image = var.container_image
essential = true
portMappings = [{
containerPort = var.container_port
protocol = "tcp"
}]
environment = [
for k, v in local.all_environment_variables : {
name = k
value = v
}
]
secrets = [
for k, v in var.secrets : {
name = k
valueFrom = v
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.this.name
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = var.service_name
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}${var.health_check_path} || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
tags = local.common_tags
}
# ECS service
resource "aws_ecs_service" "this" {
name = var.service_name
cluster = var.cluster_arn
desired_count = var.desired_count
task_definition = aws_ecs_task_definition.this.arn
launch_type = "FARGATE"
network_configuration {
assign_public_ip = false
subnets = var.subnet_ids
security_groups = concat([aws_security_group.this.id], var.security_group_ids)
}
# Deployment configuration
deployment_circuit_breaker {
enable = true
rollback = true
}
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
# Rolling update: force a new deployment when the service is updated
force_new_deployment = true
# ALB integration
dynamic "load_balancer" {
for_each = var.enable_alb ? [1] : []
content {
target_group_arn = var.target_group_arn
container_name = var.service_name
container_port = var.container_port
}
}
tags = local.common_tags
# Explicit dependencies
depends_on = [
aws_security_group_rule.ingress_alb,
aws_security_group_rule.egress
]
}
# CloudWatch alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "${var.service_name}-high-cpu"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 300
statistic = "Average"
threshold = 80
alarm_description = "ECS service ${var.service_name} CPU utilization over 80%"
dimensions = {
ServiceName = aws_ecs_service.this.name
# The ClusterName dimension expects the cluster name, not the ARN
ClusterName = split("/", var.cluster_arn)[1]
}
tags = local.common_tags
}
# modules/ecs_service/outputs.tf
# Module outputs: these are the module's "return values"
output "service_name" {
description = "ECS service name"
value = aws_ecs_service.this.name
}
output "service_arn" {
description = "ECS service ARN"
value = aws_ecs_service.this.arn
}
output "task_definition_arn" {
description = "ECS task definition ARN"
value = aws_ecs_task_definition.this.arn
}
output "security_group_id" {
description = "Security group ID"
value = aws_security_group.this.id
}
output "log_group_name" {
description = "CloudWatch log group name"
value = aws_cloudwatch_log_group.this.name
}
output "cluster_name" {
description = "ECS cluster name (extracted from the cluster ARN)"
value = split("/", var.cluster_arn)[1]
}
Root Module Calls and Per-Environment Variable Isolation
# envs/dev/main.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
# Remote state
backend "s3" {
bucket = "company-terraform-state"
key = "mall/dev/terraform.tfstate"
region = "ap-southeast-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
provider "aws" {
region = var.region
default_tags {
tags = {
Project = var.project
Environment = var.environment
ManagedBy = "terraform"
}
}
}
# Call the VPC module
module "vpc" {
source = "../../modules/vpc"
project = var.project
environment = var.environment
vpc_cidr = "10.0.0.0/16"
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24"]
private_subnet_cidrs = ["10.0.10.0/24", "10.0.11.0/24"]
availability_zones = ["ap-southeast-1a", "ap-southeast-1b"]
}
# Call the ECS service module
module "order_service" {
source = "../../modules/ecs_service"
service_name = "order-service"
cluster_arn = module.ecs_cluster.cluster_arn
task_execution_role_arn = module.iam.ecs_task_execution_role_arn
subnet_ids = module.vpc.private_subnet_ids
vpc_id = module.vpc.vpc_id
desired_count = 2
cpu = 256
memory = 512
container_image = "${var.ecr_registry}/order-service:${var.order_service_version}"
container_port = 8080
environment = var.environment
enable_alb = true
alb_security_group_id = module.alb.alb_security_group_id
target_group_arn = module.alb.target_group_arn
environment_variables = {
DB_HOST = module.rds.endpoint
DB_PORT = module.rds.port
REDIS_HOST = module.redis.endpoint
}
secrets = {
DB_PASSWORD = aws_secretsmanager_secret.db_password.arn
}
}
# Call the RDS module
module "rds" {
source = "../../modules/rds"
project = var.project
environment = var.environment
subnet_ids = module.vpc.private_subnet_ids
vpc_id = module.vpc.vpc_id
instance_class = "db.t3.medium"
engine_version = "15.4"
allocated_storage = 20
multi_az = false
}
# envs/dev/terraform.tfvars
region = "ap-southeast-1"
project = "mall"
environment = "dev"
ecr_registry = "123456789012.dkr.ecr.ap-southeast-1.amazonaws.com"
order_service_version = "latest"
# envs/prod/terraform.tfvars
region = "ap-southeast-1"
project = "mall"
environment = "prod"
ecr_registry = "123456789012.dkr.ecr.ap-southeast-1.amazonaws.com"
order_service_version = "v2.3.1"
# envs/prod/main.tf
module "order_service" {
source = "../../modules/ecs_service"
service_name = "order-service"
cluster_arn = module.ecs_cluster.cluster_arn
task_execution_role_arn = module.iam.ecs_task_execution_role_arn
subnet_ids = module.vpc.private_subnet_ids
vpc_id = module.vpc.vpc_id
desired_count = 6 # more replicas in production
cpu = 512 # more CPU in production
memory = 1024 # more memory in production
container_image = "${var.ecr_registry}/order-service:${var.order_service_version}"
container_port = 8080
environment = var.environment
enable_alb = true
alb_security_group_id = module.alb.alb_security_group_id
target_group_arn = module.alb.target_group_arn
environment_variables = {
DB_HOST = module.rds.endpoint
DB_PORT = module.rds.port
REDIS_HOST = module.redis.endpoint
}
secrets = {
DB_PASSWORD = aws_secretsmanager_secret.db_password.arn
}
}
# Running Terraform
# Initialize
terraform -chdir=envs/dev init
# Preview the plan, or save it to a file
terraform -chdir=envs/dev plan
terraform -chdir=envs/dev plan -out=tfplan
# Inspect the saved plan
terraform -chdir=envs/dev show tfplan
# Apply changes (interactively, or from the saved plan)
terraform -chdir=envs/dev apply
terraform -chdir=envs/dev apply tfplan
# Destroy resources (use with caution!)
terraform -chdir=envs/dev destroy
# Format code
terraform -chdir=envs/dev fmt -recursive
# Validate the configuration
terraform -chdir=envs/dev validate
Version Management and Remote Registries
# modules/ecs_service/versions.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
# Note: the terraform block has no "description" argument; a registry takes
# the module description from the repository description and README.md
}
# Pin a specific module version with a Git ref
# envs/dev/main.tf
module "order_service" {
source = "git::https://github.com/example/terraform-modules.git//ecs_service?ref=v1.2.0"
# ...
}
# Use a module from the public Terraform Registry
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.0"
# ...
}
# Use a module from a private registry
module "ecs_service" {
source = "app.terraform.io/company/ecs-service/aws"
version = "~> 1.0"
# ...
}
# Use validation and locals to make the module easier to use
variable "environment" {
type = string
description = "Deployment environment"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be one of: dev, staging, prod."
}
}
variable "instance_type" {
type = string
description = "EC2 instance type"
default = "t3.medium"
validation {
condition = can(regex("^t3\\.", var.instance_type))
error_message = "Instance type must be a t3 family instance."
}
}
variable "allowed_cidrs" {
type = list(string)
description = "Allowed CIDR blocks for ingress"
validation {
condition = alltrue([for cidr in var.allowed_cidrs : can(regex("^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}/\\d{1,2}$", cidr))])
error_message = "All CIDR blocks must be valid IPv4 CIDR notation."
}
}
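locals {
# Merge tags
common_tags = merge(var.tags, {
Project = var.project
Environment = var.environment
ManagedBy = "terraform"
# Avoid a timestamp() tag here: its value changes on every run, so every
# tagged resource would show a diff on every plan
})
# Standardized name prefix
resource_name_prefix = "${var.project}-${var.environment}"
}
Module Testing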
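# modules/ecs_service/tests/main.tftest.hcl
# Terraform Test (Terraform 1.6+): test files end in .tftest.hcl and use
# top-level variables and run blocks
provider "aws" {
  region = "us-east-1"
}

variables {
  service_name            = "test-service"
  cluster_arn             = "arn:aws:ecs:us-east-1:123456789012:cluster/test"
  task_execution_role_arn = "arn:aws:iam::123456789012:role/test-execution"
  subnet_ids              = ["subnet-1", "subnet-2"]
  vpc_id                  = "vpc-12345"
  container_image         = "nginx:latest"
  environment             = "dev"
  enable_alb              = false
}

# Test cases. command = plan creates no real resources, so assertions may
# only use values known at plan time (IDs and ARNs are not)
run "service_name_is_correct" {
  command = plan
  assert {
    condition     = aws_ecs_service.this.name == "test-service"
    error_message = "Service name should be test-service"
  }
}

run "security_group_is_named_after_service" {
  command = plan
  assert {
    condition     = aws_security_group.this.name == "test-service-sg"
    error_message = "Security group name should be test-service-sg"
  }
}
# Run Terraform Test (run init first; plan mode still needs AWS credentials)
terraform -chdir=modules/ecs_service init
terraform -chdir=modules/ecs_service test
# Use Terratest (Go) for more complex tests
# test/ecs_service_test.go
# go test -v -run TestEcsService
CI/CD Integration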
# GitHub Actions: Terraform CI/CD
# .github/workflows/terraform.yml
name: Terraform CI/CD
on:
  push:
    paths:
      - 'terraform/**'
  pull_request:
    paths:
      - 'terraform/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      - name: Terraform fmt
        run: terraform fmt -check -recursive
      - name: Terraform init
        # Validation needs providers and modules, not the remote backend
        run: terraform -chdir=terraform/envs/dev init -backend=false
      - name: Terraform validate
        run: terraform -chdir=terraform/envs/dev validate
  plan:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      # AWS credentials must be configured before this step (not shown)
      - name: Terraform plan
        run: |
          terraform -chdir=terraform/envs/dev init
          terraform -chdir=terraform/envs/dev plan -out=tfplan
# Code quality checks
# Format
terraform fmt -recursive
# Validate
terraform validate
# Generate a plan
terraform plan -out=tfplan
# Inspect the plan
terraform show tfplan
# Security scanning with tfsec or checkov
tfsec terraform/
checkov -d terraform/
# Cost estimation with Infracost
infracost breakdown --path=terraform/envs/dev/
Remote State and State Management
# envs/prod/backend.tf
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "mall/prod/terraform.tfstate"
region = "ap-southeast-1"
dynamodb_table = "terraform-locks"
encrypt = true
# KMS encryption
# kms_key_id = "alias/terraform-state-key"
}
}
# State management commands
# Inspect the current state
terraform -chdir=envs/prod state list
terraform -chdir=envs/prod state show module.order_service.aws_ecs_service.this
# Move a resource into a new module
terraform -chdir=envs/prod state mv \
aws_ecs_service.old_name \
module.new_service.aws_ecs_service.this
# Remove a resource from state (without destroying it)
terraform -chdir=envs/prod state rm module.order_service.aws_ecs_service.this
# Import an existing resource (aws_ecs_service import IDs take the form cluster-name/service-name)
terraform -chdir=envs/prod import \
module.order_service.aws_ecs_service.this \
mall/order-service
# Release a stale state lock
terraform -chdir=envs/prod force-unlock <lock-id>
Advantages
Disadvantages
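Summary
The point of a Terraform module is not to "split resources into files" but to design a stable infrastructure contract. A good module has a clear responsibility, sensible defaults, well-defined outputs, the minimum necessary variables, and a verifiable strategy for versioning and state management.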
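Key Takeaways
- Split modules along responsibility boundaries, not by file count
- More variables do not mean more flexibility; fewer, clearly named variables are easier to maintain
- Remote state and locking are the foundation of team collaboration, not optional extras
- Module upgrades need a versioning strategy and change notes so they do not break existing environments
- validation blocks reject invalid input at the variable boundary
- locals can merge tags, standardize names, and reduce repetition
- outputs should expose only what callers actually need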
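One way to keep the variable count low, as the second point suggests, is to group related settings into a single object variable with optional attributes (supported since Terraform 1.3). The sketch below assumes a service module whose health check inputs are grouped this way; the variable name `health_check`, its fields, and the output name are hypothetical.

```hcl
# Hypothetical grouped input: one variable instead of four separate ones.
variable "health_check" {
  type = object({
    path     = optional(string, "/health")
    interval = optional(number, 30)
    timeout  = optional(number, 5)
    retries  = optional(number, 3)
  })
  description = "Container health check settings"

  # An empty object means "use every optional default above".
  default = {}

  # A validation block may compare attributes of the same variable.
  validation {
    condition     = var.health_check.timeout < var.health_check.interval
    error_message = "Health check timeout must be shorter than the interval."
  }
}

# Shows the effective settings after the optional defaults are applied.
output "health_check_effective" {
  description = "Health check settings with defaults filled in"
  value       = var.health_check
}
```

A caller that only wants a different path can pass `health_check = { path = "/ready" }` and still receive the other defaults.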
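Project Implementation View
- Network-layer modules: VPC, subnets, routing, security groups
- Compute-layer modules: ECS, EC2, ASG, K8s node groups
- Data-layer modules: RDS, Redis, object storage, message queues
- Platform-layer root module: composes network + platform + app service modules into a complete environment
- Shared modules: logging, monitoring, alerting, IAM roles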
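Common Pitfalls
- One module builds the network, the database, and the application, so its responsibility sprawls
- Every parameter is exposed as a variable, making the module complex to call and easy to misconfigure
- No remote state or locking, so team members overwrite each other's changes when they run Terraform concurrently
- Changes go straight to the shared module's main branch, with no versioning or rollback strategy
- The module outputs too many internal details, coupling callers to the implementation
- Variable validation is skipped, so bad input propagates to the resource creation stage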
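For the last pitfall, some rules span more than one input. Before Terraform 1.9 a variable's validation block could refer only to that variable, so a `precondition` is one way to express a cross-variable rule that fails at plan time. This is a minimal sketch under assumptions: the rule itself ("production runs at least 2 tasks") and the resource name `input_guard` are hypothetical.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment"
}

variable "desired_count" {
  type        = number
  description = "Desired number of tasks"
  default     = 2
}

# terraform_data is a built-in resource (Terraform 1.4+). It is used here
# only to carry the precondition; a real module would more likely attach
# the precondition to the resource it protects.
resource "terraform_data" "input_guard" {
  lifecycle {
    precondition {
      condition     = var.environment != "prod" || var.desired_count >= 2
      error_message = "Production services must run at least 2 tasks."
    }
  }
}
```

With `environment = "prod"` and `desired_count = 1`, the plan stops with the error message before any resource is created.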
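Next Steps
- Adopt SemVer versioning and a changelog for modules
- Introduce Terragrunt for multi-environment orchestration
- Integrate fmt, validate, and plan review into CI
- Build an internal Terraform Registry to distribute modules consistently
- Use Terraform Test to build automated module tests
- Learn CDKTF (Cloud Development Kit for Terraform)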
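As a starting point for the Terraform Test item, the sketch below shows a test that needs no cloud provider at all, which makes it cheap to run in CI. Both files are hypothetical: a tiny module that only computes a name prefix, and a test file for it. The `mall` project name is borrowed from the examples in this article; the file names are assumptions.

```hcl
# --- main.tf (hypothetical minimal module under test) ---
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging or prod."
  }
}

locals {
  name_prefix = "mall-${var.environment}"
}

output "name_prefix" {
  description = "Prefix used for resource names"
  value       = local.name_prefix
}

# --- tests/naming.tftest.hcl (a separate file; requires Terraform 1.6+) ---
run "builds_expected_prefix" {
  command = plan

  variables {
    environment = "dev"
  }

  assert {
    condition     = output.name_prefix == "mall-dev"
    error_message = "Prefix should combine the project name and the environment."
  }
}

# Negative test: the run passes only if the variable validation fails.
run "rejects_unknown_environment" {
  command = plan

  variables {
    environment = "qa"
  }

  expect_failures = [var.environment]
}
```

Running `terraform init` followed by `terraform test` in the module directory picks up test files from the module root and from its `tests/` directory.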
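When to Use
- The same infrastructure is built repeatedly across environments
- A team wants to manage cloud resource standards in one place
- A platform engineering or DevOps team is building up reusable infrastructure capabilities
- Network, compute, database, and similar resource templates need to be reused across projects
- An organization needs a standardized infrastructure delivery process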
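Implementation Advice
- Start modularizing with the most stable and most frequently reused resource types
- Design inputs and outputs first, then write the resource definitions, so the abstraction does not drift midway
- Give every module version constraints, variable validation, and tagging conventions
- Make state, the lock table, plan review, and apply permissions part of the delivery process
- Build a CI pipeline: fmt -> validate -> plan -> security scan -> review -> apply
- Update provider versions regularly and watch for security patches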
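Troubleshooting Checklist
- Check whether variables are missing, types match, and defaults are sensible
- Check that the values passed by the root module line up with the child module outputs they come from
- Check that the backend, state, and lock are configured correctly
- Check that provider versions, module sources, and the plan diff match expectations
- Check whether the state file is corrupted or locked
- Check that IAM permissions are sufficient for the operations in the plan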
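Retrospective Questions
- Is this module's responsibility boundary narrow enough?
- Can callers use it correctly by understanding only a few necessary inputs?
- After a module upgrade, how do existing environments migrate smoothly?
- Has the module captured a team-level infrastructure standard?
- Is the number of variables under control? Are there variables that could be grouped together?
- Is the module covered by automated tests?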
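On the migration question, one option is to record a refactor in configuration with a `moved` block (Terraform 1.1+) rather than asking every environment owner to run `terraform state mv` by hand. The sketch below is a fragment for a root module and reuses the addresses from the state commands earlier in this article; whether it fits depends on how the refactor was actually done.

```hcl
# Before the refactor, the root module declared the resource directly:
#   resource "aws_ecs_service" "old_name" { ... }
# After the refactor, the same service is created inside a module call:
#   module "new_service" { source = "../../modules/ecs_service" ... }

# Tells Terraform that the existing object was renamed, not replaced,
# so the next plan shows a move instead of a destroy and create.
moved {
  from = aws_ecs_service.old_name
  to   = module.new_service.aws_ecs_service.this
}
```

Because the block lives in version control, each environment picks up the move on its next plan, and the change is visible in review.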
