In this tutorial, I will showcase the upcoming TensorFlow 2.0 features through deep reinforcement learning (DRL), by implementing an Advantage Actor-Critic (A2C) agent that solves the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.0, I will do my best to make the DRL side of things approachable as well, including a brief overview of the field.
In fact, since the main focus of the 2.0 release is making developers' lives easier, I think now is a great time to get into DRL with TensorFlow. The full example covered in this post is under 150 lines of source code, available here or here.
Setup
Since TensorFlow 2.0 is still at an experimental stage, I recommend installing it in an isolated virtual environment. I personally prefer Anaconda, so I'll illustrate the installation with it:
```
> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview  # tf-nightly-gpu-2.0-preview for GPU version
```
Let's quickly verify that everything works as expected:
```python
>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True
```
Don't worry about the 1.13.x version; it just means it's an early preview build. What's important to note here is that we're in eager mode by default!
```python
>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)
```
If you're not yet familiar with eager mode, it essentially means that computation is executed at runtime, rather than through a pre-compiled graph. You can find a good overview in the TensorFlow documentation.
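To make that difference concrete, here is a minimal sketch of my own (the threshold value is an arbitrary illustration): in eager mode, TensorFlow ops return concrete values immediately and can be mixed freely with plain Python control flow, with no graph building or sessions involved.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
# ops run immediately and return concrete values
y = x * 2
print(y.numpy())  # [2. 4. 6.]

# ordinary Python control flow can branch on tensor values
if tf.reduce_sum(y) > 10:
    y = y - 1.0
print(y.numpy())  # [1. 3. 5.]
```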
Deep Reinforcement Learning
Generally speaking, reinforcement learning (RL) is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, receiving rewards in return. Most RL algorithms work by maximizing the sum of rewards the agent collects during one episode.
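That interaction maps directly onto the OpenAI Gym API we'll be using throughout this post. As a rough sketch (with a random policy standing in for the agent), one episode looks like this:

```python
import gym

env = gym.make('CartPole-v0')
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = env.action_space.sample()        # a random "policy"
    obs, reward, done, _ = env.step(action)   # act, observe, collect reward
    total_reward += reward
print(total_reward)
```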
The output of an RL-based algorithm is typically a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action, but typically it represents a conditional probability distribution over actions, given some state.

Actor-Critic Methods
RL algorithms are often grouped by the objective function they optimize. Value-based methods, such as DQN, work by reducing the error of expected state-action value estimates.
Policy Gradients methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent. Computing the gradients fully is usually intractable, so they are often estimated via Monte-Carlo methods.
The most popular approach is a hybrid of the two: actor-critic methods, where the agent's policy is optimized through policy gradients, while a value-based method is used as a bootstrap for the expected value estimates.
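For readers who prefer formulas, the standard one-step actor-critic update (textbook form, not taken from this post) nudges the policy parameters θ along a Monte-Carlo estimate of the policy gradient, with the critic V supplying a bootstrapped advantage:

```latex
\nabla_\theta J(\theta) \;\approx\; \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t),
\qquad
A(s_t, a_t) \;=\; r_t + \gamma V(s_{t+1}) - V(s_t)
```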
Deep Actor-Critic Methods
While much of the fundamental RL theory was developed for the tabular case, modern RL is done almost exclusively with function approximators, such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.

Asynchronous Advantage Actor-Critic
Over the years, a few improvements have been added to address the sample efficiency and stability of the learning process.
First, gradients are weighted by returns: discounted future rewards. This somewhat alleviates the credit assignment problem and resolves theoretical issues with infinite timesteps.
Second, an advantage function is used instead of raw returns. The advantage is formed as the difference between returns and some baseline, and can be seen as a measure of how good a given action is compared to some average.
Third, an additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies. In essence, entropy measures how random a probability distribution is, and it is maximized by the uniform distribution.
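As a quick numeric illustration of that last point (a sketch of my own, not from the post): the entropy H(p) = -Σ p·log p of a near-deterministic action distribution is much lower than that of a uniform one, so adding an entropy bonus to the objective pushes the policy away from collapsing prematurely.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum(p * log p), skipping zero-probability entries
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy(np.array([0.98, 0.02])))  # ~0.098: almost deterministic
print(entropy(np.array([0.5, 0.5])))    # ~0.693: uniform, the maximum for 2 actions
```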
Finally, multiple workers are used in parallel to accelerate sample gathering, while also helping decorrelate the samples during training.
Incorporating all these changes with deep neural networks, we arrive at two of the most popular modern algorithms: the asynchronous advantage actor-critic algorithm, A3C, and its synchronous sibling A2C. The difference between the two is more technical than theoretical: as the names suggest, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this, I will conclude our tour of DRL methods, as the focus of this blog post is more on the TensorFlow 2.0 features. Don't worry if you're still unsure about the subject; things should become clearer with the code examples. And if you want to learn more, one good resource to get started with is Spinning Up in Deep RL.
Advantage Actor-Critic with TensorFlow 2.0
Now that we've covered the basics, let's see what it takes to implement the backbone of a modern DRL algorithm: an actor-critic agent. As described in the previous section, for simplicity's sake we won't implement parallel workers, although most of the code would support it (interested readers can use this as an exercise opportunity).
As a testbed we will use the CartPole-v0 environment. While somewhat simplistic, it is still a great option to get started with. I always rely on it as a sanity check when implementing RL algorithms.
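For reference, here is what the environment's interface looks like (these are the standard CartPole-v0 specs: a 4-dimensional continuous observation, two discrete actions, and episodes capped at 200 steps):

```python
import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position/velocity, pole angle/velocity
print(env.action_space)       # Discrete(2): push the cart left or right
```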
Policy and Value via the Keras Model API
First, let's create the policy and value estimate NNs under a single model class:
```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl


class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)


class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__('mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)
```
Let's verify the model works as expected:
```python
import gym

env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)

obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value)  # [1] [-0.00145713]
```
Things to note here:
- Model layers and the execution path are defined separately
- There is no "input" layer; the model will accept raw numpy arrays
- Two computation paths can be defined in one model via the functional API
- A model can contain helper methods such as action sampling
- In eager mode everything works directly from raw numpy arrays
Random Agent
Now we can move on to the fun stuff: the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.
```python
class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward
```
Let's see how much our model scores with randomly initialized weights:
```python
agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum)  # 18 out of 200
```
Quite far from optimal. Next up is the training part!
Loss / Objective Functions
As I described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) functions. In actor-critic, we train on three objectives: improving the policy with advantage-weighted gradients plus entropy maximization, and minimizing the value estimation error.
```python
import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko


class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(lr=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value'] * kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # polymorphic CE loss function that supports sparse and weighted options
        # from_logits argument ensures transformation into normalized probabilities
        cross_entropy = kls.CategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        # thus under the hood a sparse version of CE loss will be executed
        actions = tf.cast(actions, tf.int32)
        policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
        # entropy loss can be calculated via CE over itself
        entropy_loss = cross_entropy(logits, logits)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy'] * entropy_loss
```
And we're done with the objective functions! Note how compact the code is: there are almost more comment lines than the code itself.
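One of those comments deserves a quick sanity check: "entropy loss can be calculated via CE over itself". The identity that motivates this trick is that the cross-entropy of a distribution with itself reduces to its entropy, H(p, p) = -Σ p·log p = H(p). A small numpy sketch of my own (the code above passes raw logits on both sides, which Keras first normalizes):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum(p * log q)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])
print(cross_entropy(p, p))     # 0.8018...
print(-np.sum(p * np.log(p)))  # same value: the entropy H(p)
```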
Agent Training Loop
Finally, there's the training loop itself. It's relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.
```python
class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])

                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()

            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t + 1] * (1 - dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
```
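To see what `_returns_advantages` is doing, here's a tiny worked example (the numbers are my own illustration): with gamma = 0.99, rewards [1, 1, 1], an episode boundary after the last step, and a bootstrap next_value of 0.5, returns are accumulated backwards, and the done flag cuts off bootstrapping across episode boundaries.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])  # episode ends after the 3rd step
next_value = 0.5                   # critic's estimate for the state after the batch

returns = np.append(np.zeros_like(rewards), next_value)
for t in reversed(range(rewards.shape[0])):
    returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
print(returns[:-1])  # [2.9701 1.99   1.    ] -- done flag zeroed out the bootstrap
```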
Training and Results
We're now all set to train our single-worker A2C agent on CartPole-v0! The training process shouldn't take longer than a couple of minutes. After training completes, you should see the agent successfully achieve the targeted 200 out of 200 score.
```python
rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env))  # 200 out of 200
```

In the source code, I included some additional helpers that print out running episode rewards and losses, along with a basic plotter for rewards_history.

Static Computational Graph
With all of this eager mode excitement, you might wonder whether static graph execution is still an option. Of course it is! Moreover, it takes just one additional line of code to enable it!
```python
with tf.Graph().as_default():
    print(tf.executing_eagerly())  # False

    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)

    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env))  # 200 out of 200
```
One thing to note: during static graph execution, we can't just have Tensors lying around, which is why we needed the ProbabilityDistribution trick during model definition. In fact, while I was looking for a way to execute in static mode, I discovered one interesting low-level detail about models built through the Keras API…
One More Thing…
Remember when I said TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I was wrong.
If you use the Keras API to build and manage your models, then it will attempt to compile them into static graphs under the hood. So what you end up getting is the performance of static computational graphs with the flexibility of eager execution.
You can check your model's status via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably don't need to; if Keras detects that there's no way around eager mode, it will back out of static execution on its own.
To illustrate that it really does run as a static graph, here's a simple benchmark:
```python
# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)
```
Eager Benchmark
```python
%%time

model = Model(env.action_space.n)
model.run_eagerly = True

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s
```
Static Benchmark
```python
%%time

with tf.Graph().as_default():
    model = Model(env.action_space.n)

    print("Eager Execution:  ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)

    _ = model.predict(obs)

######## Results #######

Eager Execution:   False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms
```
Default Benchmark
```python
%%time

model = Model(env.action_space.n)

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model.predict(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s
```
As you can see, eager mode lags behind static mode, and by default our model was indeed executed statically.
Conclusion
Hopefully this article has given you a taste of both DRL and TensorFlow 2.0. Note that TensorFlow 2.0 is still just a preview build, not even a release candidate, so everything is subject to change. If there's something about TensorFlow you particularly dislike, let its developers know!
A lingering question people may have: is TensorFlow better than PyTorch? Maybe; maybe not. Both are great libraries, so it's hard to declare a winner. If you're familiar with PyTorch, you've probably noticed that TensorFlow 2.0 not only caught up with it, but also avoided some of the pitfalls of the PyTorch API.
In either case, this competition has already brought positive results for developers on both sides, and I'm excited to see what the frameworks will become in the future.