AI Image Generation

SunnyFan

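Introduction

AI image generation uses deep learning models to create new images from text descriptions or reference images. With tools such as DALL-E, Stable Diffusion, and Midjourney, it is now widely used in design, marketing, games, and content creation. This article covers the mainstream image generation techniques and how to use them in practice.

Modern AI image generation is usually traced back to generative adversarial networks (GAN, 2014), which were the first to show that deep learning could produce realistic images. Variational autoencoders (VAE), flow-based models, and diffusion models were developed in roughly the same period and the years that followed. The release of Stable Diffusion in 2022 marked the point where diffusion models became the mainstream paradigm for image generation. DALL-E 2 (2022) and DALL-E 3 (2023) further improved output quality and text understanding. In 2024, newer models such as Stable Diffusion 3 and FLUX continued to push the boundaries of image generation.

From a technical point of view, the core of modern image generation models is the denoising diffusion process: start from pure noise, remove the noise step by step, and end with a clean image. Each step is guided by a text condition (encoded by a vision-language model such as CLIP), so that the generated image matches the text description.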

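Features

1. Text-to-image: generate images from a text description
2. Image-to-image: restyle or transform a reference image
3. Image editing: local modification and repair
4. Model fine-tuning: custom styles and subjects
5. Controllable generation: precise control through ControlNet and similar tools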

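Core Principles of Diffusion Models

def explain_diffusion_process():
    """Core principles of diffusion models

    Forward process (adding noise):
    - Gaussian noise is added to the image step by step
    - After T steps the image is (approximately) pure noise
    - This process is fixed and requires no learning

    Reverse process (denoising):
    - Start from noise, then predict and remove noise step by step
    - Each step uses a denoising network (a U-Net in Stable Diffusion 1.x and SDXL) to predict the noise
    - The text condition steers the direction of generation

    Training objective:
    - Given a noised image x_t and a timestep t
    - Predict the added noise epsilon
    - Loss: L = ||epsilon - epsilon_pred||^2

    Sampling:
    - Start from random noise x_T
    - Repeatedly use the denoising network to predict x_{t-1} from x_t
    - After T steps a clean image is obtained

    Key hyperparameters:
    - num_inference_steps: number of denoising steps (more steps usually give finer detail but are slower, with diminishing returns)
    - guidance_scale (CFG): text guidance strength (higher follows the prompt more closely but reduces diversity)
    - scheduler: noise scheduling / sampling strategy (Euler, DDIM, DPM++, etc.)
    """
    print("Diffusion model essentials:")
    print("  Forward: image -> add noise step by step -> pure noise")
    print("  Reverse: pure noise -> denoise step by step -> clean image")
    print("  Condition: the CLIP text encoding steers generation")
    print("  More steps, finer detail: 20 steps (fast) vs 50 steps (high quality)")

explain_diffusion_process()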
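
The forward (noising) step and the training loss described above can be sketched in plain Python. This is a minimal illustration, not the training code of any particular model: the linear beta schedule, its endpoints, and the helper names `make_alpha_bars`, `add_noise`, and `mse` are all assumptions made for the example, and the loss is written as a mean rather than a sum.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    prod = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps in closed form."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return xt, eps

def mse(eps, eps_pred):
    """Training loss: mean squared error between true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [0.5, -0.25, 1.0, 0.0]            # a toy 4-"pixel" image
xt, eps = add_noise(x0, t=999, alpha_bars=alpha_bars, rng=rng)
print(alpha_bars[0], alpha_bars[-1])   # close to 1 at t=0, close to 0 at t=T-1
print(mse(eps, eps))                   # 0.0 for a perfect noise prediction
```

Because `alpha_bars` shrinks toward zero, the last timestep keeps almost nothing of `x0`, which is why sampling can start from pure noise.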

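OpenAI DALL-E

Text-to-Image

import requests
from openai import OpenAI

# Replace with your own key; in practice, prefer the OPENAI_API_KEY environment variable
client = OpenAI(api_key="your-api-key")

# Generate an image
response = client.images.generate(
    model="dall-e-3",
    prompt="An orange cat wearing sunglasses, sitting at a computer and writing code, cartoon style",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url
print(f"Image URL: {image_url}")

# Download the image
resp = requests.get(image_url, timeout=60)
resp.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
with open("generated_cat.png", "wb") as f:
    f.write(resp.content)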
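
Since each DALL-E model accepts only certain sizes and image counts, it can help to check a request locally before sending it. The sketch below is hypothetical helper code, not part of the `openai` package; the allowed sizes and limits in `ALLOWED_SIZES` and `MAX_IMAGES` are assumptions based on the OpenAI Images API documentation at the time of writing and should be checked against the current reference.

```python
# Allowed sizes and image counts per model: assumptions taken from the
# OpenAI Images API documentation at the time of writing; verify before use.
ALLOWED_SIZES = {
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
}
MAX_IMAGES = {"dall-e-2": 10, "dall-e-3": 1}

def validate_request(model, size, n=1):
    """Return a list of problems with the request; an empty list means it looks valid."""
    if model not in ALLOWED_SIZES:
        return [f"unknown model: {model}"]
    problems = []
    if size not in ALLOWED_SIZES[model]:
        problems.append(f"size {size} not supported by {model}")
    if not 1 <= n <= MAX_IMAGES[model]:
        problems.append(f"n={n} outside 1..{MAX_IMAGES[model]} for {model}")
    return problems

print(validate_request("dall-e-3", "1024x1024", n=1))  # []
print(validate_request("dall-e-3", "512x512", n=2))    # two problems reported
```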

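Image Editing

# DALL-E image editing (assumes `client` from the previous example)
with open("original.png", "rb") as image_file, open("mask.png", "rb") as mask_file:
    response = client.images.edit(
        model="dall-e-2",
        image=image_file,
        mask=mask_file,  # fully transparent (alpha = 0) areas mark the region to edit
        prompt="Replace the background with a beach sunset",
        n=1,
        size="1024x1024"
    )

edited_url = response.data[0].url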

Prompt Engineering Tips

def prompt_engineering_for_image_generation():
    """Prompt engineering tips for image generation

    1. Structured prompt:
       [subject], [style], [viewpoint], [lighting], [details], [quality modifiers]

    2. Quality modifiers:
       "highly detailed", "4K", "professional", "masterpiece",
       "sharp focus", "cinematic lighting", "volumetric light"

    3. Style keywords:
       "oil painting", "watercolor", "digital art", "photorealistic",
       "anime style", "concept art", "pixel art", "3D render"

    4. Negative prompt (Stable Diffusion style pipelines; the DALL-E API has no such parameter):
       "blurry, low quality, distorted, ugly, bad anatomy,
        watermark, text, extra fingers, deformed"

    5. Combination techniques (syntax depends on the tool):
       - Weights (Stable Diffusion WebUI syntax): "(red dress:1.2), (blue sky:0.8)"
       - Separators (supported by some front ends only): "cat | sitting | desk | coding"
       - References: "in the style of [artist]"
    """
    print("Prompt template:")
    print("  subject + style + lighting + quality")
    print("  e.g. 'a cat sitting on a desk, digital art, ")
    print("       cinematic lighting, highly detailed, 4K'")
    print("\nNegative prompt template:")
    print("  'blurry, low quality, distorted, watermark, text'")

prompt_engineering_for_image_generation()
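
The structured template above can be turned into a small helper so that prompts are assembled consistently. This is a minimal sketch; `build_prompt` and `build_negative_prompt` are hypothetical names introduced here, not functions from any library.

```python
def build_prompt(subject, style=None, view=None, lighting=None, details=None, quality=None):
    """Join the non-empty parts in the order: subject, style, view, lighting, details, quality."""
    parts = [subject, style, view, lighting]
    parts.extend(details or [])
    parts.extend(quality or [])
    return ", ".join(p.strip() for p in parts if p and p.strip())

def build_negative_prompt(terms):
    """De-duplicate negative terms (case-insensitively) while keeping first-seen order."""
    seen = []
    for term in terms:
        term = term.strip().lower()
        if term and term not in seen:
            seen.append(term)
    return ", ".join(seen)

prompt = build_prompt(
    "a cat sitting on a desk",
    style="digital art",
    lighting="cinematic lighting",
    quality=["highly detailed", "4K"],
)
negative = build_negative_prompt(["blurry", "low quality", "Blurry", "watermark"])
print(prompt)    # a cat sitting on a desk, digital art, cinematic lighting, highly detailed, 4K
print(negative)  # blurry, low quality, watermark
```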

Stable Diffusion

Local Deployment

# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

# Load the model
# If this repo id is unavailable, look for a mirror of the SD 1.5 weights on the Hugging Face Hub
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    safety_checker=None  # disables the built-in NSFW checker; omit this argument to keep it enabled
)
pipe = pipe.to("cuda")

# Text-to-image
prompt = "a futuristic city skyline at sunset, digital art, highly detailed"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("city_sunset.png")

Image-to-Image

from PIL import Image

# Image-to-image: style transfer (assumes `model_id` and the imports from the previous example)
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("photo.jpg").convert("RGB").resize((512, 512))

result = img2img_pipe(
    prompt="oil painting style, impressionist, vibrant colors",
    image=init_image,
    strength=0.7,          # modification strength, 0-1
    guidance_scale=7.5,
    num_inference_steps=30
)

result.images[0].save("oil_painting_style.png")

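Key Parameters Explained

def explain_generation_parameters():
    """Key image generation parameters

    1. num_inference_steps (denoising steps):
       - More steps generally improve quality, with diminishing returns; inference time grows linearly
       - SD 1.5: 20-50 steps recommended
       - SDXL: 25-50 steps recommended
       - Quick preview: 10-15 steps
       - Final output: 30-50 steps

    2. guidance_scale (CFG, classifier-free guidance):
       - Controls how strongly the text influences the result
       - Typical range: 1.0 - 20.0 (in diffusers, values <= 1.0 effectively turn guidance off)
       - 7.5: the SD default, balancing quality and diversity
       - > 10: follows the prompt more strictly, but may oversaturate
       - < 5: freer, but may drift from the prompt

    3. strength (image-to-image modification strength):
       - 0.0: keeps the original image unchanged
       - 1.0: ignores the original image entirely
       - 0.6-0.8: recommended range

    4. seed (random seed):
       - A fixed seed makes results reproducible
       - Same prompt + same seed + same settings (model, size, steps, scheduler) = same image,
         on the same hardware and software stack

    5. scheduler (noise scheduling / sampler):
       - Euler: the simplest, fast
       - Euler a: ancestral sampling; re-injects noise at each step, so results are more varied
         and keep changing as the step count grows
       - DPM++ 2M: often regarded as a good balance of quality and speed
       - DDIM: deterministic sampling
    """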
    print("Recommended settings:")
    print("  Quick preview:    steps=15, CFG=7.5, Euler")
    print("  Standard quality: steps=30, CFG=7.5, Euler a")
    print("  High quality:     steps=50, CFG=7.5, DPM++ 2M")
    print("  Precise control:  steps=30, CFG=7.5 + ControlNet")

explain_generation_parameters()
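
Two of the parameters above reduce to simple arithmetic. The sketch below shows the usual classifier-free guidance combination and how an img2img `strength` translates into the number of denoising steps actually run. The helper names are hypothetical, and the step formula is an assumption based on how diffusers img2img pipelines compute it, so treat it as an approximation.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: uncond + scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]

def img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps run for a given img2img strength."""
    return min(int(num_inference_steps * strength), num_inference_steps)

uncond = [0.0, 1.0]   # toy noise prediction without the prompt
text = [1.0, 3.0]     # toy noise prediction with the prompt
print(apply_cfg(uncond, text, 1.0))   # [1.0, 3.0]: identical to the text-conditioned prediction
print(apply_cfg(uncond, text, 7.5))   # [7.5, 16.0]: pushed further along (text - uncond)
print(img2img_steps(30, 0.5))         # 15
print(img2img_steps(30, 1.0))         # 30
```

With a scale of 1.0 the unconditional term cancels out, which is why that value effectively switches guidance off.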

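LoRA Fine-Tuning

Custom Style Training

# Using a fine-tuned LoRA (diffusers + peft)
# This snippet loads already-trained LoRA weights; training itself is a separate step
from diffusers import StableDiffusionPipeline
import torch

# Load the base model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load the LoRA weights (the path and file name are placeholders for your own training output)
pipe.load_lora_weights("./lora_output", weight_name="pytorch_lora_weights.safetensors")

# Generate with the custom style
# diffusers does not parse WebUI-style "<lora:name:0.8>" tags inside the prompt;
# the LoRA strength is passed through cross_attention_kwargs instead
prompt = "a cat sitting on a desk"
image = pipe(
    prompt,
    num_inference_steps=30,
    cross_attention_kwargs={"scale": 0.8},
).images[0]
image.save("custom_style_cat.png")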

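How LoRA Works

def explain_lora():
    """How LoRA (Low-Rank Adaptation) works

    Core idea:
    Fine-tune without modifying the original model weights, by adding a low-rank
    matrix decomposition next to them.

    Original weight: W (d x d)
    LoRA update: delta_W = A x B
    where A: (d x r), B: (r x d), r << d (r is typically 4-64)

    Parameter count comparison:
    - Full fine-tuning: d x d = 4096 x 4096 = 16,777,216 (about 16.8M)
    - LoRA (r=16): 2 x 4096 x 16 = 131,072 (about 131K, a 99.2% reduction)

    Advantages:
    - Fast training: only a small number of parameters are trained
    - Small storage: LoRA weights are usually only tens of MB
    - Swappable: multiple LoRAs can share one base model
    - The original model is left untouched

    Keys to training a LoRA:
    1. Prepare 10-50 high-quality images in the target style
    2. Add text captions (describing style and content)
    3. Train for 1000-5000 steps
    4. Adjust the LoRA weight to tune the style strength
    """
    print("LoRA essentials:")
    print("  Parameter count reduced by 99%+ (rank=16)")
    print("  Training data: 10-50 high-quality images")
    print("  Training steps: 1000-5000")
    print("  Storage size: typically 10-100 MB")

explain_lora()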
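
The parameter arithmetic and the low-rank update above can be checked with a few lines of plain Python. This is a toy sketch on tiny matrices, using the same A (d x r) and B (r x d) convention as the text; `lora_param_counts`, `matmul`, and `apply_lora` are hypothetical helpers, not functions from peft or diffusers.

```python
def lora_param_counts(d, r):
    """Parameters for full fine-tuning of a d x d matrix vs. a rank-r LoRA update."""
    full = d * d
    lora = 2 * d * r          # A is d x r, B is r x d
    return full, lora, 1.0 - lora / full

def matmul(X, Y):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (A @ B) as a new matrix; W itself is left unchanged."""
    delta = matmul(A, B)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

full, lora, saved = lora_param_counts(4096, 16)
print(full, lora, f"{saved:.1%}")   # 16777216 131072 99.2%

W = [[1.0, 0.0], [0.0, 1.0]]        # toy 2 x 2 base weight
A = [[1.0], [2.0]]                  # 2 x 1 (rank 1)
B = [[0.5, 0.0]]                    # 1 x 2
W_adapted = apply_lora(W, A, B, scale=0.8)
print(W_adapted)
```

The `scale` argument plays the same role as the LoRA weight used at inference time: 0 recovers the base model, larger values apply more of the learned style.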

ControlNet

Conditional Generation

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load ControlNet (Canny edge control)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Build the edge-map condition image
original = np.array(load_image("room.jpg"))
edges = cv2.Canny(original, 100, 200)      # single-channel edge map
edges = np.stack([edges] * 3, axis=-1)     # repeat to 3 channels so the pipeline receives an RGB image
control_image = Image.fromarray(edges)

# Conditional generation
result = pipe(
    prompt="a modern living room with blue walls, photorealistic",
    image=control_image,
    num_inference_steps=20
).images[0]

result.save("controlled_room.png")

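ControlNet Model Types

def explain_controlnet_models():
    """The different ControlNet model types

    1. Canny: edge detection
       - Extracts the outline edges of an image
       - Good for: precise control of object outlines and composition

    2. Depth: depth map
       - Controls the depth and layering of the image
       - Good for: scene depth and spatial layout

    3. Pose: human pose
       - Controls a person's pose and movement
       - Good for: generating figures in a given pose or action

    4. Normal: normal map
       - Controls surface orientation and lighting
       - Good for: product design and 3D-style effects

    5. Segmentation: semantic segmentation
       - Specifies what to generate region by region
       - Good for: precise per-region control

    6. Tile: tile-based control
       - Conditions on the image tile by tile while regenerating detail
       - Good for: high-resolution generation and upscaling
    """
    print("Choosing a ControlNet model:")
    print("  Canny: control outlines and composition")
    print("  Depth: control spatial depth")
    print("  Pose:  control human pose")
    print("  Segmentation: specify content per region")

explain_controlnet_models()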

Inpainting: Local Editing

# Inpainting: local image editing and repair
import numpy as np
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

# Load the original image and the mask
# Mask: white areas are the parts to regenerate
original_image = Image.open("photo.jpg").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("L").resize((512, 512))

result = pipe(
    prompt="a red sports car",
    image=original_image,
    mask_image=mask_image,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

result.save("inpainting_result.png")

# Creating a mask programmatically
def create_mask(height, width, x, y, w, h):
    """Create a mask with a white rectangle (the region to regenerate) on a black background"""
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y:y+h, x:x+w] = 255
    return Image.fromarray(mask)

# Edit only the central region; pass this as mask_image instead of loading a mask file
mask = create_mask(512, 512, 128, 128, 256, 256)
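
Inpainting pipelines may slightly alter pixels outside the mask, so a common follow-up is to paste only the masked region of the result back onto the original. The sketch below shows the idea on nested lists of grayscale values; `rect_mask` and `composite` are hypothetical helpers written for this illustration.

```python
def rect_mask(height, width, x, y, w, h):
    """Binary mask as nested lists: 255 inside the rectangle, 0 elsewhere."""
    return [[255 if (y <= r < y + h and x <= c < x + w) else 0 for c in range(width)]
            for r in range(height)]

def composite(original, generated, mask):
    """Take generated pixels where the mask is white, original pixels elsewhere."""
    return [[g if m > 127 else o for o, g, m in zip(o_row, g_row, m_row)]
            for o_row, g_row, m_row in zip(original, generated, mask)]

original = [[0] * 4 for _ in range(4)]     # toy 4 x 4 "original" image
generated = [[9] * 4 for _ in range(4)]    # toy 4 x 4 "inpainted" image
toy_mask = rect_mask(4, 4, x=1, y=1, w=2, h=2)
blended = composite(original, generated, toy_mask)
for row in blended:
    print(row)
# [0, 0, 0, 0]
# [0, 9, 9, 0]
# [0, 9, 9, 0]
# [0, 0, 0, 0]
```

With Pillow images, `Image.composite(generated, original, mask)` performs the same per-pixel selection.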

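IP-Adapter: Style Reference

# IP-Adapter: use a reference image to control style
# The image itself acts as the condition; a text prompt is optional
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

# IP-Adapter lets you:
# 1. Provide a reference image
# 2. Generate new images in a style consistent with the reference
# 3. Combine it with a text prompt for finer control

# Use cases:
# - Portrait style transfer
# - Consistent product design
# - A unified brand visual style

# Usage with the IP-Adapter support built into recent diffusers releases
# (the original IP-Adapter repository also ships a separate ip_adapter package with its own API)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference image influences the result

ref_image = load_image("reference.png")  # placeholder path to your reference image
images = pipe(prompt="a cat", ip_adapter_image=ref_image, num_images_per_prompt=4).images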

SDXL: High-Resolution Generation

# SDXL: Stable Diffusion XL for higher-quality generation
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

# SDXL works at a native resolution of 1024x1024
prompt = ("a majestic mountain landscape at golden hour, "
          "dramatic clouds, crystal clear lake reflection, "
          "professional photography, 8K, ultra detailed")
image = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality, distorted, watermark",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024,
).images[0]

image.save("sdxl_landscape.png")

# SDXL + Refiner: two-stage generation
from diffusers import StableDiffusionXLImg2ImgPipeline

# Stage 1: the base model generates the initial image
# Stage 2: the refiner model enhances details
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# The base model handles the first 80% of the denoising schedule and hands its latents
# to the refiner, which finishes the last 20%; both calls use the same num_inference_steps
base_latents = pipe(prompt=prompt, num_inference_steps=30, denoising_end=0.8, output_type="latent").images
refined = refiner(prompt=prompt, image=base_latents, num_inference_steps=30, denoising_start=0.8).images[0]
refined.save("sdxl_landscape_refined.png")

Upscale: Image Upscaling

# Image super-resolution
# Method 1: Real-ESRGAN
# Requires the realesrgan-ncnn-vulkan executable on PATH
# (a standalone binary, separate from `pip install realesrgan`)
import subprocess

import torch
from PIL import Image

def upscale_realesrgan(input_path, output_path, scale=4):
    """Upscale an image with Real-ESRGAN"""
    # Call the command-line tool; check=True raises if it exits with an error
    subprocess.run([
        "realesrgan-ncnn-vulkan",
        "-i", input_path,
        "-o", output_path,
        "-s", str(scale),
        "-n", "realesrgan-x4plus"
    ], check=True)

# Method 2: the Stable Diffusion x4 upscaler pipeline
from diffusers import StableDiffusionUpscalePipeline

upscale_pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("low_res.png").convert("RGB").resize((256, 256))
upscaled = upscale_pipe(
    prompt="highly detailed, sharp focus",
    image=low_res,
    num_inference_steps=30,
).images[0]

# 256x256 -> 1024x1024 (the model upscales by 4x)
upscaled.save("upscaled.png")

# Method 3: Highres Fix: generate at low resolution first, then upscale
# Commonly used in Stable Diffusion WebUI
# Steps: generate at 512x512 -> upscale to 1024x1024 -> refine details with img2img
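
The Highres Fix steps above can be planned with a little arithmetic before any model is run. The sketch below computes the target size, rounded down to a multiple of 8 because the Stable Diffusion VAE downsamples by a factor of 8, and lists the passes. The helper names and the default `strength=0.4` are assumptions for illustration, not WebUI settings.

```python
def highres_target(width, height, scale=2.0, multiple=8):
    """Scale a size and round each side down to a multiple of `multiple`."""
    def snap(v):
        return max(multiple, int(v * scale) // multiple * multiple)
    return snap(width), snap(height)

def highres_plan(width, height, scale=2.0, strength=0.4, steps=30):
    """Describe the passes: low-res generation, resize, then img2img refinement."""
    target = highres_target(width, height, scale)
    return [
        {"pass": "txt2img", "size": (width, height), "steps": steps},
        {"pass": "resize", "size": target},
        {"pass": "img2img", "size": target, "strength": strength, "steps": steps},
    ]

print(highres_target(512, 512))              # (1024, 1024)
print(highres_target(500, 500, scale=1.5))   # (744, 744)
for step in highres_plan(512, 768, scale=1.5):
    print(step)
```

A fairly low img2img strength keeps the composition from the first pass while letting the second pass add detail at the larger size.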

Batch Generation and Seed Control

# Batch generation and reproducibility
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Batch generation: produce several images at once and pick the best
prompt = "a beautiful sunset over the ocean, digital art"

# Method 1: several images in a single call
results = pipe(
    prompt=prompt,
    num_images_per_prompt=4,
    num_inference_steps=30,
    generator=torch.Generator("cuda").manual_seed(42),
).images

for i, img in enumerate(results):
    img.save(f"sunset_{i}.png")

# Method 2: a fixed seed for reproducible images
def reproducible_generate(pipe, prompt, seed=42, **kwargs):
    """Generate a reproducible image with a fixed seed"""
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt=prompt, generator=generator, **kwargs).images[0]

# Same prompt + same seed + same settings = the same image
# (on the same hardware and library versions)
img1 = reproducible_generate(pipe, prompt, seed=12345)
img2 = reproducible_generate(pipe, prompt, seed=12345)
# img1 and img2 should be identical

# Method 3: parameter sweep, testing different parameter combinations
import itertools

prompts = ["a cat", "a dog", "a bird"]
scales = [5.0, 7.5, 10.0]
steps = [20, 30, 50]

for p, cfg, step in itertools.product(prompts, scales, steps):
    # Reuse one seed so that differences come from the parameters, not from the noise
    generator = torch.Generator("cuda").manual_seed(42)
    img = pipe(prompt=p, guidance_scale=cfg, num_inference_steps=step, generator=generator).images[0]
    img.save(f"param_sweep_{p.replace(' ', '_')}_cfg{cfg}_step{step}.png")
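
For larger sweeps it can be convenient to build the list of jobs first and run them afterwards, with file names that are safe for any prompt. This is a minimal sketch with no dependency on diffusers; `slugify` and `sweep_jobs` are hypothetical helpers, and sharing one seed across all jobs is a design choice made here so that outputs stay comparable.

```python
import itertools
import re

def slugify(text, max_len=40):
    """Lowercase, replace runs of non-alphanumerics with '_', and trim the result."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return slug[:max_len] or "prompt"

def sweep_jobs(prompts, scales, steps, base_seed=42):
    """Yield one job dict per parameter combination, all sharing the same seed."""
    for p, cfg, n in itertools.product(prompts, scales, steps):
        yield {
            "prompt": p,
            "guidance_scale": cfg,
            "num_inference_steps": n,
            "seed": base_seed,
            "filename": f"sweep_{slugify(p)}_cfg{cfg}_steps{n}_seed{base_seed}.png",
        }

jobs = list(sweep_jobs(["a cat", "a dog/bird?"], [5.0, 7.5], [20, 30]))
print(len(jobs))              # 8
print(jobs[0]["filename"])    # sweep_a_cat_cfg5.0_steps20_seed42.png
print(jobs[-1]["filename"])   # sweep_a_dog_bird_cfg7.5_steps30_seed42.png
```

Each job dict carries everything needed to call the pipeline, so the generation loop only has to create a generator from `seed` and save to `filename`.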

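Inference Performance Optimization

# Performance optimization strategies for image generation
# (items 1-5 assume `pipe` is an already loaded pipeline on the GPU)

# 1. xFormers memory-efficient attention (lower VRAM use; requires the xformers package)
pipe.enable_xformers_memory_efficient_attention()

# 2. VAE slicing (decodes a batch one image at a time: lower peak VRAM, slightly slower)
pipe.enable_vae_slicing()
# For large single images such as 1024x1024 on low-VRAM GPUs, pipe.enable_vae_tiling() is the related option

# 3. Token merging (ToMe, via the tomesd package): lower compute
# Merges similar tokens to reduce the attention workload
# pipe.enable_attention_slicing()  # an older, different technique that trades speed for lower VRAM

# 4. ONNX Runtime / TensorRT acceleration (NVIDIA GPUs)
from diffusers import StableDiffusionPipeline
# python -m pip install optimum[onnxruntime-gpu]
# Pre-compiling the model can speed up inference, often cited as roughly 2-4x depending on hardware

# 5. Compile the model (torch.compile)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# The first inference is slow (compilation); later runs are often cited as 30-50% faster, varying by GPU and model

# 6. SDXL Turbo / LCM for few-step inference
# LCM (Latent Consistency Model) can generate good images in 4-8 steps
from diffusers import LCMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
# Swapping the scheduler alone is not enough for a standard SD 1.5 checkpoint:
# load LCM-LoRA weights as well (or use an LCM-distilled model)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

prompt = "a beautiful sunset over the ocean, digital art"
# 4-8 steps are enough here (instead of the usual 30-50)
image = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]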

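Content Safety and Filtering

# Safety filtering: prevent inappropriate content from being generated
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# The safety checker is enabled by default (unless safety_checker=None is passed)
# pipe.safety_checker detects flagged images and filters them out

# A custom safety filter
class CustomSafetyChecker:
    def __init__(self):
        self.blocked_keywords = ["violence", "nsfw", "drug"]

    def check_prompt(self, prompt):
        """Check whether the prompt contains a blocked keyword (simple substring match)"""
        lower_prompt = prompt.lower()
        for keyword in self.blocked_keywords:
            if keyword in lower_prompt:
                return False, f"Prompt contains a blocked keyword: {keyword}"
        return True, "passed"

    def check_image(self, image):
        """Check whether the generated image is safe"""
        # Placeholder: always returns True
        # An NSFW detection model or a CLIP-based classifier could be plugged in here
        return True

safety = CustomSafetyChecker()
is_safe, message = safety.check_prompt("a beautiful landscape")
print(f"Safety check: {message}")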
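
Plain substring matching also blocks harmless prompts, for example "drugstore" contains "drug". One possible refinement is whole-word matching with a regular expression, sketched below. `KeywordPromptFilter` is a hypothetical class written for this illustration; it is still only a keyword filter and will miss plurals, misspellings, and paraphrases.

```python
import re

class KeywordPromptFilter:
    """Block prompts that contain a whole-word match from a keyword list."""

    def __init__(self, blocked_keywords):
        escaped = [re.escape(k) for k in blocked_keywords]
        # \b anchors keep "drug" from matching inside "drugstore"
        self._pattern = re.compile(r"\b(" + "|".join(escaped) + r")\b", re.IGNORECASE)

    def check_prompt(self, prompt):
        """Return (is_allowed, matched_keyword_or_None)."""
        match = self._pattern.search(prompt)
        if match:
            return False, match.group(1).lower()
        return True, None

flt = KeywordPromptFilter(["violence", "nsfw", "drug"])
print(flt.check_prompt("a quiet drugstore at night"))   # (True, None)
print(flt.check_prompt("graphic Violence in a city"))   # (False, 'violence')
```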

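Comparison of Image Generation Techniques

Tool/Model	Characteristics	Deployment	Best for
DALL-E 3	High quality, strong prompt understanding	API call	General-purpose image generation
Stable Diffusion	Open source, fine-tunable	Local/cloud	Custom styles
Midjourney	Distinctive artistic style	Discord	Artistic creation
ControlNet	Precise conditional control	Local	Precise control of composition
LoRA	Lightweight fine-tuning	Local	Style transfer
Inpainting	Local editing	Local/API	Image repair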

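Advantages

1. Creative efficiency: generate large numbers of images quickly
2. Style variety: photorealistic, cartoon, oil painting, and more
3. Customization: LoRA fine-tuning for custom styles
4. Open-source ecosystem: an active Stable Diffusion community

Disadvantages

1. Copyright disputes: unresolved questions about training data copyright
2. Quality issues: details such as hands and text are often inaccurate
3. Hardware requirements: running locally needs a high-performance GPU
4. Controllability: getting a precise result still takes multiple attempts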

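Summary

The core tools for AI image generation are DALL-E (high quality through an API), Stable Diffusion (open source, deployable locally), ControlNet (precise control over generation), and LoRA (lightweight fine-tuning for custom styles). DALL-E suits quick generation, while Stable Diffusion suits customized needs. For image-to-image work, the img2img pipeline controls how much of the original is changed through its strength parameter. ControlNet controls composition precisely through condition images such as edge maps, depth maps, and poses. For real projects, a reasonable approach is to prototype quickly with the DALL-E API and use Stable Diffusion + LoRA for style customization.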