Tutorial 2: Agents

What Is an Agent?

An agent is a system that perceives its environment and takes actions within it.

┌─────────────────────────────────────────────────────────────┐
│                         Environment                          │
│                                                              │
│    ┌─────────────┐                    ┌─────────────┐        │
│    │   Sensors   │◄───── percepts ────│             │        │
│    └──────┬──────┘                    │ Environment │        │
│           │                           │    state    │        │
│           ▼                           │             │        │
│    ┌─────────────┐                    │             │        │
│    │    Agent    │                    │             │        │
│    └──────┬──────┘                    │             │        │
│           │                           │             │        │
│           ▼                           │             │        │
│    ┌─────────────┐                    │             │        │
│    │  Actuators  │────── actions ────►│             │        │
│    └─────────────┘                    └─────────────┘        │
└─────────────────────────────────────────────────────────────┘

Some everyday examples:

  • Self-driving car: perceive (cameras, radar) → decide (AI) → act (steer, accelerate)

  • Customer-service bot: perceive (user input) → decide (NLP) → act (reply)

  • Robot vacuum: perceive (sensors) → decide (path planning) → act (move, clean)

The Structure of an Agent

class Agent:
    """Basic skeleton of an agent."""

    def __init__(self):
        self.state = None  # internal state

    def perceive(self, environment):
        """Sense the environment."""
        return environment.get_percept()

    def think(self, percept):
        """Deliberate: choose an action based on the percept."""
        # decide() is supplied by subclasses
        action = self.decide(percept)
        return action

    def act(self, action, environment):
        """Carry out an action."""
        environment.execute(action)

    def run(self, environment):
        """Main agent loop: perceive -> think -> act."""
        while True:
            percept = self.perceive(environment)
            action = self.think(percept)
            self.act(action, environment)

Rational Agents

A rational agent is one that, given its percept sequence, takes the action that maximizes its expected utility.

Key concepts:

  • Performance measure: the criterion for judging how well the agent behaves

  • Prior knowledge: what the agent knows about the environment in advance

  • Percept sequence: everything the agent has perceived so far

  • Available actions: the set of actions the agent can take

class RationalAgent(Agent):
    """A rational agent."""

    def __init__(self, performance_measure):
        super().__init__()
        self.performance_measure = performance_measure
        self.percept_history = []

    def think(self, percept):
        # Record the percept history
        self.percept_history.append(percept)

        # Pick the action that maximizes expected utility
        best_action = None
        best_utility = float('-inf')

        for action in self.get_possible_actions():
            expected_utility = self.estimate_utility(action, percept)
            if expected_utility > best_utility:
                best_utility = expected_utility
                best_action = action

        return best_action
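
RationalAgent leaves get_possible_actions() and estimate_utility() to subclasses. As a minimal sketch of what a subclass might look like (the heater actions and the 22-degree comfort target below are illustrative assumptions, not part of the tutorial):

class ComfortAgent(RationalAgent):
    """Hypothetical subclass: utility is closeness to a 22-degree target."""

    def get_possible_actions(self):
        return ['turn_on_heater', 'turn_off_heater', 'do_nothing']

    def estimate_utility(self, action, percept):
        # Assume heating/cooling shifts the temperature by about 2 degrees
        predicted = percept['temperature']
        if action == 'turn_on_heater':
            predicted += 2
        elif action == 'turn_off_heater':
            predicted -= 2
        return -abs(predicted - 22)

agent = ComfortAgent(performance_measure=None)
print(agent.think({'temperature': 18}))  # turn_on_heater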

Types of Agents

1. Simple Reflex Agents

Key trait: decides based only on the current percept, ignoring history.

class SimpleReflexAgent(Agent):
    """A simple reflex agent."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules  # condition-action rules

    def think(self, percept):
        """React directly according to the rules."""
        for condition, action in self.rules:
            if condition(percept):
                return action
        return None

# Example: a simple thermostat
thermostat_rules = [
    (lambda p: p['temperature'] < 20, 'turn_on_heater'),
    (lambda p: p['temperature'] > 25, 'turn_off_heater'),
    (lambda p: True, 'do_nothing')
]

thermostat = SimpleReflexAgent(thermostat_rules)
action = thermostat.think({'temperature': 18})
print(f"Action: {action}")  # turn_on_heater

Pros: simple and fast

Cons: cannot handle situations that require memory

2. Model-Based Reflex Agents

Key trait: maintains an internal model of the environment, so past states inform decisions.

class ModelBasedAgent(Agent):
    """A model-based reflex agent."""

    def __init__(self, transition_model, sensor_model, rules):
        super().__init__()
        self.transition_model = transition_model  # state-transition model
        self.sensor_model = sensor_model          # sensor model
        self.rules = rules
        self.state = None
        self.last_action = None

    def think(self, percept):
        # Update the internal state
        self.state = self.update_state(
            self.state,
            self.last_action,
            percept
        )

        # Choose an action based on the state
        for condition, action in self.rules:
            if condition(self.state):
                self.last_action = action
                return action

        return None

    def update_state(self, state, action, percept):
        """Update the state using the transition model and the percept."""
        if state is None:
            return self.sensor_model(percept)

        predicted_state = self.transition_model(state, action)
        return self.sensor_model(percept, predicted_state)
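
Note that update_state() calls sensor_model with one argument on the first step and two afterwards, so a concrete sensor model needs a default parameter. A minimal sketch under that assumption, using a hypothetical light-switch world (not from the tutorial):

# Hypothetical models for a light-switch world (illustrative only)
def transition_model(state, action):
    # Toggling flips the light; any other action leaves it unchanged
    if action == 'toggle':
        return {'light_on': not state['light_on']}
    return state

def sensor_model(percept, predicted_state=None):
    # Trust the percept when present; otherwise fall back on the prediction
    if percept is not None:
        return {'light_on': percept['light_on']}
    return predicted_state

rules = [
    (lambda s: not s['light_on'], 'toggle'),
    (lambda s: True, 'wait'),
]

agent = ModelBasedAgent(transition_model, sensor_model, rules)
print(agent.think({'light_on': False}))  # toggle
print(agent.think({'light_on': True}))   # wait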

3. Goal-Based Agents

Key trait: has an explicit goal and chooses actions that move toward it.

class GoalBasedAgent(Agent):
    """A goal-based agent."""

    def __init__(self, goal):
        super().__init__()
        self.goal = goal
        self.state = None

    def update_state(self, percept):
        """Update the internal state from the percept (identity here)."""
        return percept

    def think(self, percept):
        self.state = self.update_state(percept)

        # Search for a sequence of actions that reaches the goal
        plan = self.search_for_goal(self.state, self.goal)

        if plan:
            return plan[0]  # return the first action
        return None

    def search_for_goal(self, state, goal):
        """Search for a path to the goal."""
        # Use a search algorithm (covered in detail in the next tutorial)
        pass

# Example: a maze pathfinding agent
class MazeAgent(GoalBasedAgent):
    def __init__(self, goal_position):
        super().__init__(goal_position)

    def search_for_goal(self, current_pos, goal_pos):
        # Simplified: greedily pick directions that close the gap to the goal
        dx = goal_pos[0] - current_pos[0]
        dy = goal_pos[1] - current_pos[1]

        actions = []
        if dx > 0: actions.append('right')
        if dx < 0: actions.append('left')
        if dy > 0: actions.append('down')
        if dy < 0: actions.append('up')

        return actions if actions else ['stay']
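
A quick, hypothetical run, treating the percept as the agent's current position (which works because GoalBasedAgent's update_state() passes the percept through unchanged):

maze_agent = MazeAgent(goal_position=(2, 3))
print(maze_agent.think((0, 0)))  # right (the dx > 0 rule fires first)
print(maze_agent.think((2, 0)))  # down  (dx == 0, dy > 0)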

4. Utility-Based Agents

Key trait: considers not just whether a goal is reached, but how good the outcome is.

class UtilityBasedAgent(Agent):
    """A utility-based agent."""

    def __init__(self, utility_function):
        super().__init__()
        self.utility = utility_function
        self.state = None

    def think(self, percept):
        self.state = self.update_state(percept)

        # Evaluate the expected utility of each action
        best_action = None
        best_expected_utility = float('-inf')

        for action in self.get_possible_actions():
            # Compute the action's expected utility
            expected_utility = self.expected_utility(action)

            if expected_utility > best_expected_utility:
                best_expected_utility = expected_utility
                best_action = action

        return best_action

    def expected_utility(self, action):
        """Expected utility: sum of probability * utility over outcomes."""
        total = 0
        for outcome, probability in self.get_outcomes(action):
            total += probability * self.utility(outcome)
        return total
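
expected_utility() computes EU(action) = Σ P(outcome) × U(outcome). A minimal sketch of a concrete subclass (the bets, payoffs, and probabilities below are illustrative assumptions, not part of the tutorial):

class BetAgent(UtilityBasedAgent):
    """Hypothetical subclass: choose between a safe and a risky bet."""

    def update_state(self, percept):
        return percept

    def get_possible_actions(self):
        return ['safe_bet', 'risky_bet']

    def get_outcomes(self, action):
        # (outcome, probability) pairs; outcomes are payoffs here
        if action == 'safe_bet':
            return [(50, 1.0)]
        return [(200, 0.3), (0, 0.7)]

agent = BetAgent(utility_function=lambda payoff: payoff)
print(agent.think(percept=None))  # risky_bet (EU 60 beats 50)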

5. Learning Agents

Key trait: learns from experience and keeps improving.

class LearningAgent(Agent):
    """A learning agent (component skeleton)."""

    def __init__(self):
        super().__init__()
        self.performance_element = None  # decision-making component
        self.learning_element = None     # learning component
        self.critic = None               # evaluation component
        self.problem_generator = None    # exploration component

    def think(self, percept):
        # 1. Choose an action based on current knowledge
        action = self.performance_element.select_action(percept)

        # 2. Evaluate the action's effect
        feedback = self.critic.evaluate(percept, action)

        # 3. Learn and improve
        self.learning_element.learn(feedback)

        # 4. Occasionally explore new possibilities
        if self.should_explore():
            action = self.problem_generator.suggest_exploration()

        return action

Hands-On: A Learning Agent in PyTorch

Let's implement a simple learning agent that learns to find the treasure in a grid world:

import torch
import torch.nn as nn
import random

class GridWorld:
    """A simple grid-world environment."""

    def __init__(self, size=5):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]

    def reset(self):
        self.agent_pos = [0, 0]
        return self.get_state()

    def get_state(self):
        """Return the state vector."""
        return torch.tensor([
            self.agent_pos[0] / self.size,
            self.agent_pos[1] / self.size,
            self.goal_pos[0] / self.size,
            self.goal_pos[1] / self.size
        ], dtype=torch.float32)

    def step(self, action):
        """Apply an action; return the new state, reward, and done flag."""
        # Actions: 0=up, 1=down, 2=left, 3=right
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        dx, dy = moves[action]

        # Update the position (with boundary checks)
        new_x = max(0, min(self.size-1, self.agent_pos[0] + dx))
        new_y = max(0, min(self.size-1, self.agent_pos[1] + dy))
        self.agent_pos = [new_x, new_y]

        # Compute the reward
        if self.agent_pos == self.goal_pos:
            reward = 10.0  # reached the goal
            done = True
        else:
            reward = -0.1  # small per-step penalty to encourage speed
            done = False

        return self.get_state(), reward, done
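
A quick sanity check of the environment (hypothetical usage, not part of the tutorial's training code):

env = GridWorld(size=5)
print(env.reset())                  # tensor([0.0000, 0.0000, 0.8000, 0.8000])
state, reward, done = env.step(3)   # action 3 = right
print(env.agent_pos, reward, done)  # [0, 1] -0.1 False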

class SimpleQNetwork(nn.Module):
    """A small Q-network."""

    def __init__(self, state_size=4, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 32)
        self.fc2 = nn.Linear(32, 32)
        self.fc3 = nn.Linear(32, action_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
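
The network maps a 4-dimensional state vector to one Q-value per action. A quick, hypothetical shape check:

net = SimpleQNetwork()
q_values = net(torch.rand(4))  # a random dummy state
print(q_values.shape)          # torch.Size([4])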

class QLearningAgent:
    """A Q-learning agent."""

    def __init__(self, state_size=4, action_size=4):
        self.action_size = action_size
        self.q_network = SimpleQNetwork(state_size, action_size)
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=0.01)
        self.epsilon = 1.0  # exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.gamma = 0.99  # discount factor

    def select_action(self, state):
        """Choose an action (epsilon-greedy policy)."""
        if random.random() < self.epsilon:
            return random.randint(0, self.action_size - 1)

        with torch.no_grad():
            q_values = self.q_network(state)
            return q_values.argmax().item()

    def learn(self, state, action, reward, next_state, done):
        """One learning update."""
        # Compute the target Q-value
        with torch.no_grad():
            if done:
                target = torch.tensor(reward, dtype=torch.float32)
            else:
                next_q = self.q_network(next_state).max()
                target = reward + self.gamma * next_q

        # Compute the current Q-value
        current_q = self.q_network(state)[action]

        # Compute the loss and update
        loss = nn.MSELoss()(current_q, target)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Decay the exploration rate
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

        return loss.item()
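
The target computed in learn() is the standard one-step temporal-difference target from Q-learning: target = reward + gamma * max over a' of Q(next_state, a'), or just the reward at a terminal state. For example, with gamma = 0.99, a step reward of -0.1, and a best next-state Q-value of 2.0, the target is -0.1 + 0.99 * 2.0 = 1.88, and the update nudges Q(state, action) toward that value.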

# Train the agent
def train_agent(episodes=500):
    env = GridWorld(size=5)
    agent = QLearningAgent()

    rewards_history = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0

        for step in range(100):  # at most 100 steps per episode
            action = agent.select_action(state)
            next_state, reward, done = env.step(action)

            agent.learn(state, action, reward, next_state, done)

            state = next_state
            total_reward += reward

            if done:
                break

        rewards_history.append(total_reward)

        if (episode + 1) % 100 == 0:
            avg_reward = sum(rewards_history[-100:]) / 100
            print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")

    return agent, rewards_history

# Run the training
if __name__ == "__main__":
    print("Training the agent...")
    agent, rewards = train_agent(500)

    print("\nTesting the trained agent:")
    env = GridWorld(size=5)
    state = env.reset()

    print(f"Start: {env.agent_pos}, Goal: {env.goal_pos}")

    for step in range(20):
        action = agent.select_action(state)
        action_names = ['up', 'down', 'left', 'right']
        state, reward, done = env.step(action)
        print(f"Step {step+1}: {action_names[action]} -> position {env.agent_pos}")

        if done:
            print("Reached the goal!")
            break

Classifying Environments

Dimension        Type                   Description                               Example
---------------  ---------------------  ----------------------------------------  ------------------
Observability    Fully observable       The agent perceives the complete state    Chess
                 Partially observable   Only part of the state is perceivable     Poker
Determinism      Deterministic          Action outcomes are fully determined      Chess
                 Stochastic             Action outcomes are uncertain             Dice games
Continuity       Discrete               Finitely many states and actions          Board games
                 Continuous             Infinitely many states or actions         Autonomous driving
Agent count      Single-agent           Only one agent                            Sudoku
                 Multi-agent            Multiple interacting agents               Go

Key Concepts Recap

Concept      Explanation
-----------  -------------------------------------------------------
Agent        A system that perceives its environment and acts on it
Percept      How an agent acquires information about its environment
Action       How an agent exerts influence on its environment
Rationality  Choosing the action that maximizes expected utility
Utility      A number measuring how good a state or outcome is
Learning     The ability to improve behavior from experience

Next Steps

In the next tutorial we will study search algorithms, a core problem-solving method for agents.

Tutorial 3: Search Algorithms