Deep Reinforcement Learning (7): Multi-Agent Reinforcement Learning with IPPO and MADDPG



The multi-agent setting is more complex than the single-agent one, because each agent interacts directly or indirectly with the other agents while it interacts with the environment. Multi-agent reinforcement learning can be divided into the following categories:

  • Centralized reinforcement learning
    A single global learning unit carries out the learning task: it takes the overall state of the whole multi-agent system as input and outputs an action for each agent.
  • Independent reinforcement learning
    Each agent is an independent learner that considers only its own observations and its own interests.
  • Social reinforcement learning
    Combines independent reinforcement learning with social/economic models, simulating how individuals interact in human society and using methods from sociology and management science to regulate the relationships among agents.
  • Swarm reinforcement learning
    The centralized-training, decentralized-execution (CTDE) paradigm, which combines the advantages of centralized and independent learning. During training, agents learn jointly using global information; during execution, each agent selects actions using only its own observations and local information (see the sketch below).
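As a rough illustration of the CTDE idea before the concrete algorithms below, the following sketch shows a hypothetical setup in which a centralized critic sees the concatenated observations and actions of all agents during training, while each actor acts from its own local observation at execution time. The module names, dimensions, and variables (actors, critic, obs_dim, act_dim) are made up for illustration only and do not come from the original post.

import torch
import torch.nn as nn

# Hypothetical CTDE setup: names and dimensions are illustrative only.
n_agents, obs_dim, act_dim = 2, 8, 5

# Decentralized actors: one per agent, each maps a *local* observation to action logits.
actors = [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]

# Centralized critic: during training it sees the *joint* observations and actions.
critic = nn.Linear(n_agents * (obs_dim + act_dim), 1)

# Execution phase: each agent acts from its own observation only.
local_obs = [torch.randn(1, obs_dim) for _ in range(n_agents)]
actions = [torch.softmax(actor(o), dim=-1) for actor, o in zip(actors, local_obs)]

# Training phase: the critic is given global information (all observations and actions).
joint_input = torch.cat(local_obs + actions, dim=-1)
joint_value = critic(joint_input)  # shape (1, 1): a centralized value estimate
print(joint_value.shape)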

7.1 The IPPO Algorithm

  • For $N$ agents, initialize each agent's own policy and value function
  • for iteration $k = 0, 1, 2, \cdots$ do
    • All agents interact with the environment, each collecting its own trajectory
    • For each agent, compute advantage estimates with GAE based on its current value function
    • For each agent, update its policy by maximizing its PPO-clip objective (written out below)
    • For each agent, update its value function with a mean-squared-error loss
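For reference, the advantage estimate and the objective used in the loop above are the standard GAE and PPO-clip quantities, written here per agent $i$, with $V_\phi$ its value network, $\epsilon$ the clipping parameter, and the advantage accumulated from the TD errors $\delta_t$ (the symbols $\phi$ and $\theta_i$ are added here for notation):

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \qquad \hat A_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}$$

$$L^{\mathrm{CLIP}}(\theta_i) = \mathbb E_t\left[\min\left(\frac{\pi_{\theta_i}(a_t\mid o_t)}{\pi_{\theta_i^{\mathrm{old}}}(a_t\mid o_t)}\hat A_t,\ \mathrm{clip}\left(\frac{\pi_{\theta_i}(a_t\mid o_t)}{\pi_{\theta_i^{\mathrm{old}}}(a_t\mid o_t)},\,1-\epsilon,\,1+\epsilon\right)\hat A_t\right)\right]$$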

The Combat Environment

Code Implementation

Import the Combat environment

git clone https://github.com/boyu-ai/ma-gym.git 
import torch
import torch.nn.functional as F
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import sys
sys.path.append("../ma-gym")
from ma_gym.envs.combat.combat import Combat

The PPO algorithm

# PPO algorithm
class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return F.softmax(self.fc3(x), dim=1)


class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return self.fc3(x)


def compute_advantage(gamma, lmbda, td_delta):
    # Generalized Advantage Estimation: accumulate TD errors backwards in time
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)


# PPO with the clipped objective
class PPO:
    def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
                 lmbda, eps, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda
        self.eps = eps  # clipping range parameter of PPO
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)
        td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda, td_delta.cpu()).to(self.device)
        old_log_probs = torch.log(self.actor(states).gather(1, actions)).detach()
        log_probs = torch.log(self.actor(states).gather(1, actions))
        ratio = torch.exp(log_probs - old_log_probs)
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1 - self.eps, 1 + self.eps) * advantage  # clipped term
        actor_loss = torch.mean(-torch.min(surr1, surr2))  # PPO-clip loss
        critic_loss = torch.mean(F.mse_loss(self.critic(states), td_target.detach()))
        self.actor_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        self.actor_optimizer.step()
        self.critic_optimizer.step()

Parameter and environment setup

actor_lr = 3e-4
critic_lr = 1e-3
epochs = 10
episode_per_epoch = 1000
hidden_dim = 64
gamma = 0.99
lmbda = 0.97
eps = 0.2
team_size = 2         # number of agents per team
grid_size = (15, 15)  # size of the 2D grid
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Create the environment
env = Combat(grid_shape=grid_size, n_agents=team_size, n_opponents=team_size)
state_dim = env.observation_space[0].shape[0]
action_dim = env.action_space[0].n

Parameter sharing means using the same set of policy parameters for all agents. The prerequisite is that the agents are homogeneous, i.e. their state spaces and action spaces are identical and they optimize exactly the same objective.

  • Agents do not share a policy

# Create the agents (no parameter sharing)
agent1 = PPO(state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
             lmbda, eps, gamma, device)
agent2 = PPO(state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
             lmbda, eps, gamma, device)

  • Agents share the same policy

# Create the agent (parameter sharing)
agent = PPO(state_dim, hidden_dim, action_dim, actor_lr, critic_lr,
            lmbda, eps, gamma, device)

Training

win_list = []
for e in range(epochs):
    with tqdm(total=episode_per_epoch, desc='Epoch %d' % e) as pbar:
        for episode in range(episode_per_epoch):
            # Rollout buffer for agent1
            buffer_agent1 = {
                'states': [], 'actions': [], 'next_states': [],
                'rewards': [], 'dones': []
            }
            # Rollout buffer for agent2
            buffer_agent2 = {
                'states': [], 'actions': [], 'next_states': [],
                'rewards': [], 'dones': []
            }
            # Reset the environment
            s = env.reset()
            terminal = False
            while not terminal:
                # Take actions (no parameter sharing)
                a1 = agent1.take_action(s[0])
                a2 = agent2.take_action(s[1])
                # Take actions (parameter sharing)
                # a1 = agent.take_action(s[0])
                # a2 = agent.take_action(s[1])
                next_s, r, done, info = env.step([a1, a2])
                buffer_agent1['states'].append(s[0])
                buffer_agent1['actions'].append(a1)
                buffer_agent1['next_states'].append(next_s[0])
                # Add a reward of 100 for a win, otherwise a penalty of 0.1
                buffer_agent1['rewards'].append(
                    r[0] + 100 if info['win'] else r[0] - 0.1)
                buffer_agent1['dones'].append(False)
                buffer_agent2['states'].append(s[1])
                buffer_agent2['actions'].append(a2)
                buffer_agent2['next_states'].append(next_s[1])
                buffer_agent2['rewards'].append(
                    r[1] + 100 if info['win'] else r[1] - 0.1)
                buffer_agent2['dones'].append(False)
                s = next_s  # move to the next state
                terminal = all(done)
            # Update the policies (no parameter sharing)
            agent1.update(buffer_agent1)
            agent2.update(buffer_agent2)
            # Update the policy (parameter sharing)
            # agent.update(buffer_agent1)
            # agent.update(buffer_agent2)
            win_list.append(1 if info['win'] else 0)
            if (episode + 1) % 100 == 0:
                pbar.set_postfix({
                    'episode': '%d' % (episode_per_epoch * e + episode + 1),
                    'win prob': '%.3f' % np.mean(win_list[-100:])
                })
            pbar.update(1)

win_array = np.array(win_list)
# Average over every 100 episodes
win_array = np.mean(win_array.reshape(-1, 100), axis=1)
episode_list = np.arange(win_array.shape[0]) * 100
plt.plot(episode_list, win_array)
plt.xlabel('Episodes')
plt.ylabel('Win rate')
plt.title('IPPO on Combat')
plt.show()

[Figure: win rate curve of IPPO on the Combat environment]

7.2 The MADDPG Algorithm

  • for $e = 1 \to M$ do
    • Initialize a random process $\mathcal N$ for action exploration
    • Receive the initial observations $\mathbf x$ of all agents
    • for $t = 1 \to T$ do
      • For each agent $i$, select an action $a_i = \mu_{\theta_i}(o_i) + \mathcal N_t$ with its current policy
      • Execute the joint action $a = (a_1, \cdots, a_N)$, observe the rewards $r$ and the new observations $\mathbf x'$
      • Store $(\mathbf x, a, r, \mathbf x')$ in the replay buffer $\mathcal D$
      • $\mathbf x \leftarrow \mathbf x'$
      • for $i = 1 \to N$ do
        • Sample a random minibatch $(\mathbf x^j, a^j, r^j, \mathbf x'^j)$ from $\mathcal D$
        • Train the centralized critic network (see the loss written out below)
        • Train the agent's own actor network
        • Update the target actor network and the target critic network
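In the notation of the loop above (writing $\omega_i$ for agent $i$'s centralized critic parameters and using primes for the target networks; these symbols are added here for reference), the two training steps are the standard MADDPG updates: the critic is regressed onto a target computed with all agents' target actors, and the actor is updated through the centralized critic:

$$y = r_i + \gamma\, Q_i^{\omega_i'}\big(\mathbf x',\, \mu_1'(o_1'), \cdots, \mu_N'(o_N')\big), \qquad \mathcal L(\omega_i) = \mathbb E\big[(Q_i^{\omega_i}(\mathbf x, a_1, \cdots, a_N) - y)^2\big]$$

$$\nabla_{\theta_i} J(\mu_i) = \mathbb E\Big[\nabla_{\theta_i}\mu_i(o_i)\,\nabla_{a_i} Q_i^{\omega_i}(\mathbf x, a_1, \cdots, a_N)\big|_{a_i=\mu_i(o_i)}\Big]$$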

The MPE Environment

The scenario used here from the Multi-Agent Particle Environment (MPE) consists of one red adversarial agent (adversary), $N$ blue normal agents, and $N$ landmarks (typically $N = 2$), one of which is the target landmark (green). The normal agents know which landmark is the target, but the adversary does not. The normal agents cooperate: if any one of them gets close enough to the target landmark, every normal agent receives the same reward. The adversary is also rewarded for being close to the target landmark, but it has to guess which landmark is the target. The normal agents therefore need to cooperate and spread out over different landmarks in order to deceive the adversary.

[Figure: illustration of the simple_adversary scenario in MPE]

Gumbel-Softmax Approximate Sampling

Because each agent's action space in the MPE environment is discrete, while DDPG requires the agent's action to be differentiable with respect to its policy parameters, the Gumbel-Softmax trick is introduced to obtain an approximate, differentiable sample from a discrete distribution.

Suppose a random variable $Z$ follows a discrete distribution $\mathcal K = (a_1, \cdots, a_k)$, where $a_i \in [0, 1]$ denotes $P(Z = i)$ and $\sum_{i=1}^k a_i = 1$. Introduce a reparameterization factor $g_i$, a noise sample drawn from the Gumbel(0, 1) distribution:

$$g_i = -\log(-\log u), \quad u \sim \mathrm{Uniform}(0, 1)$$

The Gumbel-Softmax sample can then be written as

$$y_i = \frac{\exp\big((\log a_i + g_i)/\tau\big)}{\sum_{j=1}^k \exp\big((\log a_j + g_j)/\tau\big)}, \quad \forall\, i = 1, \cdots, k$$

The discrete value is obtained as $z = \arg\max_i y_i$, which is approximately equivalent to drawing a sample $z \sim \mathcal K$. Because $y$ is a differentiable function of $a$, the sample naturally carries gradients with respect to $a$. The temperature parameter $\tau > 0$ controls how closely the Gumbel-Softmax distribution approximates the discrete one: the smaller $\tau$, the closer the generated distribution is to $\mathrm{onehot}(\arg\max_i(\log a_i + g_i))$; the larger $\tau$, the closer it is to a uniform distribution.
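To make the gradient flow concrete, here is a small stand-alone sketch (independent of the implementation below) that draws one Gumbel-Softmax sample from given unnormalized log-probabilities and checks that gradients reach them; the variable names and the toy downstream objective are illustrative only.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Unnormalized log-probabilities of a 4-way discrete distribution (illustrative values).
logits = torch.tensor([[1.0, 0.5, -0.2, 0.1]], requires_grad=True)
tau = 0.5  # temperature: smaller -> closer to one-hot, larger -> closer to uniform

# Gumbel(0, 1) noise via g = -log(-log(u)), u ~ Uniform(0, 1).
u = torch.rand_like(logits)
g = -torch.log(-torch.log(u + 1e-20) + 1e-20)

# Relaxed (differentiable) sample y and its discretized arg-max z.
y = F.softmax((logits + g) / tau, dim=-1)
z = y.argmax(dim=-1)

# A toy downstream objective that depends on the relaxed sample;
# gradients flow back to the logits even though z itself is discrete.
loss = (y * torch.tensor([[0.0, 1.0, 2.0, 3.0]])).sum()
loss.backward()
print("sample y:", y.detach().numpy(), "argmax z:", z.item())
print("grad on logits:", logits.grad.numpy())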



Code Implementation

Import the MPE environment

git clone https://github.com/boyu-ai/multiagent-particle-envs.git --quiet
pip install -e multiagent-particle-envs
# Due to version issues of multiagent-particle-envs, gym needs to be pinned to a compatible version
pip install --upgrade gym==0.10.5 -q
import os
import time
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import random
import collections
import gym
import sys
sys.path.append("../multiagent-particle-envs")  # path where the cloned package is stored
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios
import rl_utils  # helper module (ReplayBuffer, moving_average) from the Hands-on RL (boyu-ai) code; used in training below

Create the environment

def make_env(name):
    scenario = scenarios.load(f'{name}.py').Scenario()
    world = scenario.make_world()
    return MultiAgentEnv(world, scenario.reset_world, scenario.reward,
                         scenario.observation)

env_id = "simple_adversary"
env = make_env(env_id)
state_dims = [state_space.shape[0] for state_space in env.observation_space]
action_dims = [action_space.n for action_space in env.action_space]
critic_input_dim = sum(state_dims) + sum(action_dims)

Define utility functions, including the Gumbel-Softmax sampling functions that make DDPG applicable to discrete action spaces.

# Turn the best action into its one-hot form
def onehot_from_logits(logits, eps=0.01):
    argmax_acs = (logits == logits.max(1, keepdim=True)[0]).float()
    # Generate random actions and convert them to one-hot form
    rand_acs = torch.autograd.Variable(
        torch.eye(logits.shape[1])[[
            np.random.choice(range(logits.shape[1]), size=logits.shape[0])
        ]],
        requires_grad=False
    ).to(logits.device)
    # Choose between the greedy and the random action with an epsilon-greedy rule
    return torch.stack([
        argmax_acs[i] if r > eps else rand_acs[i]
        for i, r in enumerate(torch.rand(logits.shape[0]))
    ])


# Sample noise from a Gumbel(0, 1) distribution
def sample_gumbel(shape, eps=1e-20, tens_type=torch.FloatTensor):
    U = torch.autograd.Variable(tens_type(*shape).uniform_(), requires_grad=False)
    return -torch.log(-torch.log(U + eps) + eps)


# Sample from the Gumbel-Softmax distribution
def gumbel_softmax_sample(logits, temperature):
    y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data)).to(logits.device)
    return F.softmax(y / temperature, dim=1)


# Sample from the Gumbel-Softmax distribution and discretize (straight-through estimator)
def gumbel_softmax(logits, temperature=1.0):
    y = gumbel_softmax_sample(logits, temperature)
    y_hard = onehot_from_logits(y)
    # Forward pass returns the one-hot action, backward pass uses the gradient of y
    y = (y_hard.to(logits.device) - y).detach() + y
    return y

Implement single-agent DDPG, including the actor network, the critic network, and the function that computes actions.

class ThreeLayerFC(torch.nn.Module):
    def __init__(self, num_in, num_out, hidden_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(num_in, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, num_out)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return self.fc3(x)


class DDPG:
    def __init__(self, state_dim, action_dim, critic_input_dim, hidden_dim,
                 actor_lr, critic_lr, device):
        self.actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.target_actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)

    def take_action(self, state, explore=False):
        action = self.actor(state)
        if explore:
            action = gumbel_softmax(action)
        else:
            action = onehot_from_logits(action)
        return action.detach().cpu().numpy()[0]

    def soft_update(self, net, target_net, tau):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - tau) + param.data * tau)

The MADDPG algorithm

class MADDPG:
    def __init__(self, env, device, actor_lr, critic_lr, hidden_dim,
                 state_dims, action_dims, critic_input_dim, gamma, tau):
        self.agents = [
            DDPG(state_dims[i], action_dims[i], critic_input_dim,
                 hidden_dim, actor_lr, critic_lr, device)
            for i in range(len(env.agents))
        ]
        self.gamma = gamma
        self.tau = tau
        self.critic_criterion = torch.nn.MSELoss()
        self.device = device

    @property
    def policies(self):
        return [agt.actor for agt in self.agents]

    @property
    def target_policies(self):
        return [agt.target_actor for agt in self.agents]

    def take_action(self, states, explore):
        # Hand each agent its own observation and let it act on it
        states = [
            torch.tensor([states[i]], dtype=torch.float, device=self.device)
            for i in range(len(env.agents))
        ]
        return [
            agent.take_action(state, explore)
            for agent, state in zip(self.agents, states)
        ]

    def update(self, sample, agent_id):
        current_agent = self.agents[agent_id]
        obs, acts, rewards, next_obs, done = sample

        # Update the critic network
        current_agent.critic_optimizer.zero_grad()
        # Compute the Q-target
        all_target_act = [
            onehot_from_logits(pi(next_obs_))
            for pi, next_obs_ in zip(self.target_policies, next_obs)
        ]
        # Concatenate the input of the target critic network
        target_critic_input = torch.cat((*next_obs, *all_target_act), dim=1)
        target_critic_value = rewards[agent_id].view(-1, 1) \
            + self.gamma * (1 - done[agent_id].view(-1, 1)) \
            * current_agent.target_critic(target_critic_input)
        # Compute the current Q-value
        critic_input = torch.cat((*obs, *acts), dim=1)
        critic_value = current_agent.critic(critic_input)
        # Update the critic network with the MSE loss
        critic_loss = self.critic_criterion(critic_value, target_critic_value.detach())
        critic_loss.backward()
        current_agent.critic_optimizer.step()

        # Update the actor network
        current_agent.actor_optimizer.zero_grad()
        logits = current_agent.actor(obs[agent_id])
        act = gumbel_softmax(logits)
        all_actor_acts = []
        for i, (pi, obs_) in enumerate(zip(self.policies, obs)):
            if i == agent_id:
                all_actor_acts.append(act)
            else:
                all_actor_acts.append(onehot_from_logits(pi(obs_)))
        vf_input = torch.cat((*obs, *all_actor_acts), dim=1)
        actor_loss = -current_agent.critic(vf_input).mean()
        actor_loss += (logits ** 2).mean() * 1e-3  # regularize the actor logits
        actor_loss.backward()
        current_agent.actor_optimizer.step()

    # Soft-update all target networks
    def update_all_target(self):
        for agt in self.agents:
            agt.soft_update(agt.actor, agt.target_actor, self.tau)
            agt.soft_update(agt.critic, agt.target_critic, self.tau)

Define a routine for evaluating the learned policies

def evaluate(env_id, maddpg, n_episode=10, episode_length=25):
    env = make_env(env_id)
    returns = np.zeros(len(env.agents))
    for _ in range(n_episode):
        obs = env.reset()
        for t_i in range(episode_length):
            actions = maddpg.take_action(obs, explore=False)
            obs, rew, done, info = env.step(actions)
            rew = np.array(rew)
            returns += rew / n_episode
    return returns.tolist()

Training

num_episodes = 5000
episode_length = 25
buffer_size = 100000  # the value was missing in the original post; 100000 is the value used in the Hands-on RL reference code
hidden_dim = 128
actor_lr = 1e-3
critic_lr = 1e-3
gamma = 0.99
tau = 0.005
batch_size = 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
update_interval = 50
minimal_size = 4000
epsilon = 0.3

maddpg = MADDPG(env, device, actor_lr, critic_lr, hidden_dim, state_dims,
                action_dims, critic_input_dim, gamma, tau)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)

return_list = []
total_step = 0
for episode in range(num_episodes):
    state = env.reset()
    for step in range(episode_length):
        actions = maddpg.take_action(state, explore=True)
        next_state, reward, done, _ = env.step(actions)
        replay_buffer.add(state, actions, reward, next_state, done)
        state = next_state
        total_step += 1
        # Once the replay buffer holds enough samples and the update interval is reached, update the networks
        if replay_buffer.size() >= minimal_size and total_step % update_interval == 0:
            sample = replay_buffer.sample(batch_size)

            # Rearrange the sampled batch so that each agent gets its own tensors
            def stack_array(x):
                rearranged = [[sub_x[i] for sub_x in x] for i in range(len(x[0]))]
                return [
                    torch.FloatTensor(np.vstack(ra)).to(device)
                    for ra in rearranged
                ]

            sample = [stack_array(x) for x in sample]
            # Update the critic and actor networks of every agent
            for agent_id in range(len(env.agents)):
                maddpg.update(sample, agent_id)
            # Update the target networks
            maddpg.update_all_target()
    if (episode + 1) % 100 == 0:
        ep_returns = evaluate(env_id, maddpg, n_episode=100)
        return_list.append(ep_returns)
        print(f'Episode: {episode + 1}, {ep_returns}')

return_array = np.array(return_list)
for i, agent_name in enumerate(["adversary", "agent0", "agent1"]):
    plt.figure()
    plt.plot(
        np.arange(return_array.shape[0]) * 100,
        rl_utils.moving_average(return_array[:, i], 9)
    )
    plt.xlabel("Episode")
    plt.ylabel("Returns")
    plt.title(agent_name)
Episode: 100, [-41.304, -6.82515, -6.82515]
Episode: 200, [-35.2446, -2.8429, -2.8429]
Episode: 300, [-27.023, 4.085, 4.085]
Episode: 400, [-17.635, -12.409, -12.409]
Episode: 500, [-15.068, -6.9104, -6.9104]
Episode: 600, [-16.269, -3.02317, -3.02317]
Episode: 700, [-11.7778, -5.9993, -5.9993]
Episode: 800, [-13.0006, 4.8817, 4.8817]
Episode: 900, [-11.3697, 3.548, 3.548]
Episode: 1000, [-11.0582, 3.97206, 3.97206]
Episode: 1100, [-12.112, 6.1136, 6.1136]
Episode: 1200, [-10.8363, 4.40725, 4.40725]
Episode: 1300, [-12.8032, 7.06, 7.06]
Episode: 1400, [-11.8538, 7.4386, 7.4386]
Episode: 1500, [-10.0543, 6.8339, 6.8339]
Episode: 1600, [-9.2806, 7.2865, 7.2865]
Episode: 1700, [-10.0836, 7.2733, 7.2733]
Episode: 1800, [-10.4314, 7.5415, 7.5415]
Episode: 1900, [-11.0025, 7.58345, 7.58345]
Episode: 2000, [-9.2294, 6.0727, 6.0727]
Episode: 2100, [-9.4188, 6.2431, 6.2431]
Episode: 2200, [-8.2239, 6.0182, 6.0182]
Episode: 2300, [-9.8365, 6.7572, 6.7572]
Episode: 2400, [-10.6255, 6.4565, 6.4565]
Episode: 2500, [-7.5542, 5.8697, 5.8697]
Episode: 2600, [-8.7832, 6.7296, 6.7296]
Episode: 2700, [-8.0892, 6.5939, 6.5939]
Episode: 2800, [-7.6937, 5.2278, 5.2278]
Episode: 2900, [-8.4698, 6.6716, 6.6716]
Episode: 3000, [-8.2417, 5.4646, 5.4646]
Episode: 3100, [-8.0954, 6.7612, 6.7612]
Episode: 3200, [-8.7608, 5.17524, 5.17524]
Episode: 3300, [-6.0495, 4.1814, 4.1814]
Episode: 3400, [-9.0465, 5.8535, 5.8535]
Episode: 3500, [-9.3274, 5.1028, 5.1028]
Episode: 3600, [-8.9446, 6.11715, 6.11715]
Episode: 3700, [-9.0769, 5.7206, 5.7206]
Episode: 3800, [-8.6009, 5.1042, 5.1042]
Episode: 3900, [-9.6136, 5.2459, 5.2459]
Episode: 4000, [-9.0453, 5.5292, 5.5292]
Episode: 4100, [-9.785, 5.8946, 5.8946]
Episode: 4200, [-9.2312, 5.58105, 5.58105]
Episode: 4300, [-8.4968, 5.2804, 5.2804]
Episode: 4400, [-8.9002, 5.0755, 5.0755]
Episode: 4500, [-10.9779, 6.7362, 6.7362]
Episode: 4600, [-8.6367, 5.6162, 5.6162]
Episode: 4700, [-9.9247, 5.18125, 5.18125]
Episode: 4800, [-8.9647, 5.47295, 5.47295]
Episode: 4900, [-9.0698, 5.803, 5.803]
Episode: 5000, [-10.1705, 6.03605, 6.03605]

[Figures: evaluation return curves for the adversary and the two normal agents]

