数据类型与数据结构

SunnyFan大约 13 分钟约 3771 字

数据类型与数据结构

简介

Python 提供了丰富的内置数据类型和数据结构，包括列表、元组、字典、集合等。理解这些数据结构的特点和适用场景是编写高效 Python 代码的基础。

Python 的数据类型系统是"强类型但动态类型"的——变量没有固定类型，但值一旦创建就有确定的类型。这意味着 1 + "2" 会抛出 TypeError（强类型），而 x = 1; x = "hello" 是合法的（动态类型）。理解这一点有助于避免类型相关的运行时错误。

从底层实现看，Python 的内置数据结构都是基于 C 实现的"泛型容器"——list 可以存储任意类型的对象，dict 的键可以是任意可哈希对象。这种灵活性带来了便利，但也意味着比静态语言的专用容器有更高的内存开销。例如，一个 Python int 对象实际占用 28 字节（C 语言的 long 只占 8 字节），一个 Python list 中每个元素都是一个指针（8 字节），加上每个指针对象自身的开销。

特点

1.List — 有序可变序列
2.Tuple — 有序不可变序列
3.Dict — 键值对映射
4.Set — 无序不重复集合
5.collections — 高级数据结构扩展
6.推导式 — 简洁的数据构建语法

列表（List）

基本操作

# 创建列表
numbers = [1, 2, 3, 4, 5]
mixed = [1, "hello", 3.14, True]
nested = [[1, 2], [3, 4], [5, 6]]

# 增删改查
numbers.append(6)           # 末尾添加 [1,2,3,4,5,6]
numbers.insert(0, 0)        # 指定位置插入 [0,1,2,3,4,5,6]
numbers.extend([7, 8])      # 扩展列表
numbers.remove(3)           # 删除第一个匹配
last = numbers.pop()        # 弹出最后一个
numbers.sort()              # 原地排序
numbers.reverse()           # 原地反转

# 切片
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(data[2:5])      # [2, 3, 4]
print(data[:3])        # [0, 1, 2]
print(data[-3:])       # [7, 8, 9]
print(data[::2])       # [0, 2, 4, 6, 8]  步长2
print(data[::-1])      # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  反转

# 列表推导式
squares = [x**2 for x in range(10)]
matrix = [[i*3+j for j in range(3)] for i in range(3)]
# [[0,1,2], [3,4,5], [6,7,8]]

# 展平嵌套列表
flat = [item for row in matrix for item in row]
# [0,1,2,3,4,5,6,7,8]

深入列表推导式

# 带条件的列表推导式
numbers = range(1, 21)
even_squares = [x ** 2 for x in numbers if x % 2 == 0]
print(even_squares)  # [4, 16, 36, 64, 100, 144, 196, 256, 324, 400]

# 带三元表达式的推导式
labels = ["偶数" if x % 2 == 0 else "奇数" for x in range(5)]
print(labels)  # ['偶数', '奇数', '偶数', '奇数', '偶数']

# 嵌套推导式展平
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [x for row in matrix for x in row]
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# 字典推导式
words = ["apple", "banana", "cherry"]
word_lengths = {word: len(word) for word in words}
print(word_lengths)  # {'apple': 5, 'banana': 6, 'cherry': 6}

# 集合推导式
sentence = "hello world hello python world"
unique_words = {word for word in sentence.split()}
print(unique_words)  # {'hello', 'world', 'python'}

# 生成器表达式（不创建列表，惰性求值）
total = sum(x ** 2 for x in range(1000000))  # 内存高效

常用操作

# 排序（自定义 key）
users = [
    {"name": "张三", "age": 30},
    {"name": "李四", "age": 25},
    {"name": "王五", "age": 35}
]

users.sort(key=lambda u: u["age"])
sorted_users = sorted(users, key=lambda u: u["name"], reverse=True)

# 多级排序
items = [
    {"category": "A", "priority": 2},
    {"category": "B", "priority": 1},
    {"category": "A", "priority": 1},
]
items.sort(key=lambda x: (x["category"], x["priority"]))

# 查找和统计
numbers = [1, 2, 3, 2, 4, 2, 5]
print(numbers.count(2))     # 3
print(numbers.index(3))     # 2
print(sum(numbers))          # 19
print(max(numbers))          # 5
print(min(numbers))          # 1

# zip 和 enumerate
names = ["Alice", "Bob", "Charlie"]
scores = [90, 85, 95]

for name, score in zip(names, scores):
    print(f"{name}: {score}")

result = dict(zip(names, scores))  # {'Alice': 90, 'Bob': 85, 'Charlie': 95}

# enumerate 同时获取索引和值
for idx, name in enumerate(names, start=1):
    print(f"{idx}. {name}")

# any 和 all
data = [True, False, True]
print(any(data))   # True（至少一个为 True）
print(all(data))   # False（不是全部为 True）

列表的内存与性能

# 列表的底层是动态数组（类似 C++ 的 vector）
# 添加元素的平均时间复杂度 O(1)（摊销）
# 中间插入/删除 O(n)（需要移动后续元素）

import sys

# 空列表的内存开销
empty = []
print(f"空列表: {sys.getsizeof(empty)} bytes")  # 56 bytes

# 随元素增长的内存
for n in [0, 5, 10, 50, 100]:
    lst = list(range(n))
    print(f"长度 {n}: {sys.getsizeof(lst)} bytes")

# 列表 vs 元组内存对比
lst = [1, 2, 3, 4, 5]
tup = (1, 2, 3, 4, 5)
print(f"列表: {sys.getsizeof(lst)} bytes")  # 104 bytes
print(f"元组: {sys.getsizeof(tup)} bytes")  # 80 bytes

# 预分配列表（避免频繁扩容）
size = 10000
pre_allocated = [None] * size  # 一次性分配

# 列表拼接的性能陷阱
# 慢：循环中用 + 拼接
result = []
for i in range(1000):
    result = result + [i]  # O(n) 每次创建新列表

# 快：使用 append
result = []
for i in range(1000):
    result.append(i)  # O(1) 摊销

# 更快：使用列表推导式
result = [i for i in range(1000)]

元组（Tuple）

不可变序列

# 创建元组
point = (3, 4)
rgb = (255, 128, 0)
single = (42,)           # 单元素元组需要逗号

# 解包
x, y = point
r, g, b = rgb

# 扩展解包
first, *rest = [1, 2, 3, 4, 5]
# first=1, rest=[2,3,4,5]

head, *middle, tail = [1, 2, 3, 4, 5]
# head=1, middle=[2,3,4], tail=5

# 元组作为字典键（列表不行）
locations = {
    (35.68, 139.69): "东京",
    (39.90, 116.40): "北京",
}

# namedtuple
from collections import namedtuple
Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
print(p.x, p.y)  # 3 4

深入元组与解包

# 嵌套解包
matrix = ((1, 2), (3, 4), (5, 6))
for (a, b) in matrix:
    print(f"row: {a}, {b}")

# 交换变量（利用元组解包）
a, b = 10, 20
a, b = b, a  # a=20, b=10

# 函数返回多个值（本质是返回元组）
def min_max(data: list) -> tuple[int, int]:
    return min(data), max(data)

result = min_max([3, 1, 4, 1, 5, 9])
lo, hi = result  # 解包

# 命名元组 vs dataclass
from collections import namedtuple
from dataclasses import dataclass

# namedtuple（轻量级，不可变）
UserNT = namedtuple("User", ["name", "age", "email"])
u1 = UserNT("Alice", 30, "alice@example.com")

# dataclass（功能更丰富）
@dataclass
class UserDC:
    name: str
    age: int
    email: str

u2 = UserDC("Bob", 25, "bob@example.com")

# 元组的哈希性（不可变才能做字典键或集合元素）
config = {
    ("db", "host"): "localhost",
    ("db", "port"): 5432,
    ("cache", "host"): "localhost",
    ("cache", "port"): 6379,
}

# 使用下划线忽略不需要的值
first, _, third = (1, 2, 3)
first, *_, last = range(10)  # first=0, last=9

字典（Dict）

键值对操作

# 创建字典
user = {
    "name": "张三",
    "age": 30,
    "email": "zhang@example.com"
}

# 增删改查
user["role"] = "admin"              # 添加/修改
age = user.get("age", 0)            # 安全获取
email = user.pop("email")           # 删除并返回
user.setdefault("status", "active") # 不存在则设置

# 遍历
for key in user:
    print(key, user[key])

for key, value in user.items():
    print(f"{key}: {value}")

for value in user.values():
    print(value)

# 字典合并（Python 3.9+）
defaults = {"theme": "light", "lang": "zh", "notify": True}
custom = {"theme": "dark", "font": 14}
config = defaults | custom  # 合并，后者覆盖

# 字典推导式
prices = {"apple": 5.5, "banana": 3.0, "cherry": 8.0}
expensive = {k: v for k, v in prices.items() if v > 4.0}
# {'apple': 5.5, 'cherry': 8.0}

高级用法

from collections import defaultdict, Counter

# defaultdict — 自动初始化
groups = defaultdict(list)
for name, dept in [("张三", "技术"), ("李四", "产品"), ("王五", "技术")]:
    groups[dept].append(name)
# {'技术': ['张三', '王五'], '产品': ['李四']}

# defaultdict 嵌套
nested_dict = defaultdict(lambda: defaultdict(int))
nested_dict["2024"]["01"] = 100
nested_dict["2024"]["02"] = 150
print(nested_dict["2024"])  # defaultdict(<class 'int'>, {'01': 100, '02': 150})

# Counter — 计数器
words = ["python", "java", "python", "c#", "python", "java"]
counter = Counter(words)
print(counter.most_common(2))  # [('python', 3), ('java', 2)]

# Counter 的数学运算
c1 = Counter("aabbc")
c2 = Counter("bcd")
print(c1 + c2)  # Counter({'b': 3, 'c': 2, 'a': 2, 'd': 1})
print(c1 - c2)  # Counter({'a': 2})
print(c1 & c2)  # Counter({'b': 1, 'c': 1})

# 字符统计
text = "hello world"
char_count = Counter(text.replace(" ", ""))
print(char_count.most_common(3))  # [('l', 3), ('o', 2), ('h', 1)]

# setdefault 模式（在遍历中构建分组字典）
data = [
    {"name": "Alice", "dept": "tech"},
    {"name": "Bob", "dept": "product"},
    {"name": "Charlie", "dept": "tech"},
]
by_dept = {}
for item in data:
    by_dept.setdefault(item["dept"], []).append(item["name"])
print(by_dept)  # {'tech': ['Alice', 'Charlie'], 'product': ['Bob']}

# OrderedDict（Python 3.7+ dict 默认有序）
from collections import OrderedDict
ordered = OrderedDict([("b", 2), ("a", 1), ("c", 3)])
ordered.move_to_end("a")  # 移到末尾

集合（Set）

集合操作

# 创建集合
a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}

# 集合运算
print(a | b)    # 并集 {1,2,3,4,5,6,7,8}
print(a & b)    # 交集 {4,5}
print(a - b)    # 差集 {1,2,3}
print(a ^ b)    # 对称差集 {1,2,3,6,7,8}

# 常用操作
a.add(6)          # 添加元素
a.discard(1)      # 安全删除（不存在不报错）
a.update([9, 10]) # 批量添加

# 去重
numbers = [1, 2, 2, 3, 3, 3, 4]
unique = list(set(numbers))  # [1, 2, 3, 4]

# 子集判断
print({1, 2} <= {1, 2, 3})   # True  子集
print({1, 2, 3} >= {1, 2})   # True  超集

# frozenset — 不可变集合（可做字典键）
fs = frozenset([1, 2, 3])

集合的实用模式

# 1. 去重但保持顺序
def unique_preserve_order(items: list) -> list:
    """去重并保持原始顺序"""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(unique_preserve_order(data))  # [3, 1, 4, 5, 9, 2, 6]

# 2. 集合运算实现标签过滤
required_tags = {"python", "backend"}
candidate_tags = {"python", "javascript", "backend", "frontend"}

has_all = required_tags.issubset(candidate_tags)  # True
has_any = required_tags.intersection(candidate_tags)  # {'python', 'backend'}
missing = required_tags - candidate_tags  # set()

# 3. 集合实现权限检查
admin_permissions = {"read", "write", "delete", "manage"}
user_permissions = {"read", "write"}

print(user_permissions.issubset(admin_permissions))  # True
print(admin_permissions - user_permissions)  # {'delete', 'manage'}

# 4. 使用集合加速成员检查
# 列表检查: O(n)
# 集合检查: O(1)
large_list = list(range(100000))
large_set = set(range(100000))

import time
start = time.time()
for _ in range(1000):
    _ = 99999 in large_list
print(f"列表查找: {time.time() - start:.4f}s")

start = time.time()
for _ in range(1000):
    _ = 99999 in large_set
print(f"集合查找: {time.time() - start:.4f}s")

collections 模块进阶

from collections import deque, ChainMap, OrderedDict, Counter

# 1. deque —— 双端队列
dq = deque([1, 2, 3], maxlen=5)  # 固定长度，超出自动丢弃另一端
dq.append(4)       # 右端添加
dq.appendleft(0)   # 左端添加
dq.pop()           # 右端弹出
dq.popleft()       # 左端弹出

# 用途：滑动窗口
from collections import deque

def sliding_window(data: list, window_size: int):
    """滑动窗口迭代器"""
    window = deque(maxlen=window_size)
    for item in data:
        window.append(item)
        if len(window) == window_size:
            yield list(window)

for w in sliding_window(range(10), 3):
    print(w)

# 2. ChainMap —— 合并多个字典（不创建副本）
defaults = {"color": "red", "size": "medium"}
user_config = {"color": "blue"}
combined = ChainMap(user_config, defaults)
print(combined["color"])  # blue（user_config 优先）
print(combined["size"])   # medium（从 defaults 获取）

# 修改只影响第一个字典
combined["shape"] = "round"
print(user_config)  # {'color': 'blue', 'shape': 'round'}
print(defaults)     # {'color': 'red', 'size': 'medium'}  未变

# 3. 有序字典的实用场景
# LRU 缓存模拟
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key: str):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # 移到末尾（最近使用）
        return self.cache[key]

    def put(self, key: str, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # 移除最久未使用

cache = LRUCache(3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
cache.put("d", 4)  # 淘汰 "a"
print(list(cache.cache.keys()))  # ['b', 'c', 'd']

字符串高级操作

# f-string 格式化（Python 3.6+）
name = "世界"
age = 25
print(f"你好, {name}! 年龄: {age}")

# 表达式
print(f"2 + 3 = {2 + 3}")
print(f"{'hello':^20}")  # 居中对齐，宽度 20

# 格式说明符
pi = 3.14159265
print(f"{pi:.2f}")       # 3.14
print(f"{pi:.4e}")       # 3.1416e+00
print(f"{1000000:,}")    # 1,000,000
print(f"{42:08d}")       # 00000042

# 日期格式化
from datetime import datetime
now = datetime.now()
print(f"{now:%Y-%m-%d %H:%M:%S}")

# 多行字符串
html = f"""
<div class="card">
    <h2>{name}</h2>
    <p>年龄: {age}</p>
</div>
"""

# 常用字符串方法
text = "  Hello, World!  "
print(text.strip())        # "Hello, World!"
print(text.split(", "))    # ["  Hello", "World!  "]
print(",".join(["a", "b", "c"]))  # "a,b,c"
print(text.replace("World", "Python"))
print(text.upper())        # "  HELLO, WORLD!  "
print(text.lower())        # "  hello, world!  "
print(text.startswith("  Hello"))  # True
print("hello123".isalnum())  # True
print("12345".isdigit())    # True

数据结构选型

数据结构	有序	可变	查找复杂度	适用场景
list	是	是	O(n)	有序集合、栈
tuple	是	否	O(n)	不可变数据
dict	是	是	O(1)	键值映射
set	否	是	O(1)	去重、集合运算
deque	是	是	O(1)两端	队列/双端队列
heapq	是	是	O(log n)	优先队列

优点

1.内置丰富 — list/dict/set 覆盖大部分场景
2.语法简洁 — 推导式一行搞定
3.高效操作 — dict/set 查找 O(1)
4.collections — defaultdict/Counter 等扩展类型

缺点

1.List 查找慢 — O(n) 不适合大数据量查找
2.内存占用 — 比静态语言占用更多内存
3.线程安全 — 内置类型非线程安全
4.类型混用 — 列表可以混合类型，容易出错

总结

Python 数据结构核心：list（有序可变，切片/推导式）、tuple（有序不可变，解包）、dict（键值对，O(1)查找）、set（去重，集合运算）。列表推导式是 Python 的特色语法。dict/set 查找高效 O(1)。collections 模块提供 defaultdict（自动初始化）、Counter（计数器）、deque（双端队列）等实用类型。选择原则：有序用 list，映射用 dict，去重用 set，不可变用 tuple。