DRL (Deep Reinforcement Learning): The PPO Algorithm (Proximal Policy Optimization)

PPO (Proximal Policy Optimization)

It is best to read about policy gradient methods before this article, otherwise the derivations below will be hard to follow.

PPO is OpenAI's default reinforcement learning algorithm.

  • On-policy: the agent that learns and the agent that interacts with the environment are the same (a small code sketch follows after the two formulas below).

$$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\left[R(\tau)\, \nabla \log p_\theta(\tau)\right]$$

  • Off-policy: the agent that learns and the agent that interacts with the environment are different.

Here "agent" can be understood as the actor mentioned earlier.

$$\nabla \bar{R}_\theta = E_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau)\right]$$
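To make the on-policy gradient concrete, here is a minimal REINFORCE-style loss in PyTorch (not from the original article; `policy_net`, `states`, `actions`, and `returns` are illustrative placeholders for a discrete-action policy). Minimizing this loss performs gradient ascent on $E_{\tau \sim p_\theta}[R(\tau)\,\nabla \log p_\theta(\tau)]$:

```python
# Minimal on-policy policy-gradient (REINFORCE-style) loss sketch in PyTorch.
# policy_net, states, actions, returns are illustrative placeholders.
import torch

def reinforce_loss(policy_net, states, actions, returns):
    """-E[ R(tau) * log p_theta(a_t|s_t) ]: minimizing this ascends E[R * grad log p_theta]."""
    logits = policy_net(states)                        # (batch, n_actions)
    log_probs = torch.log_softmax(logits, dim=-1)      # log p_theta(a|s)
    log_p_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # picked actions (long tensor)
    return -(returns * log_p_a).mean()
```

Because this is on-policy, the batch must be re-collected with the updated $\theta$ after every optimizer step, which is exactly the limitation discussed next.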

Why go off-policy?

Because on-policy, each sampled trajectory $\tau$ can only be used once: after the parameters $\theta$ are updated, you have to sample all over again.

The benefit of going off-policy is that one batch of samples can be used many times: instead of sampling from the original distribution $p(x)$, we sample from a proposal distribution $q(x)$ and use those samples to train the parameters $\theta$ of $p(x)$.

To do off-policy learning, we need the following method and its derivation.

Importance Sampling

$$E_{x \sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^i), \qquad E_{x \sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = E_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

  • Originally $x^i$ had to be sampled from the distribution $p(x)$, but by the derivation above we can instead sample $x^i$ from $q(x)$.
  • $p(x)$ is the actual distribution; $q(x)$ is a proposal distribution that stands in for it (a small numerical check follows after this list).
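As a sanity check (not part of the original text), here is a small NumPy sketch; the choices of $p$, $q$ and $f$ are arbitrary, but both estimators should land near the true value $E_{x \sim p}[x^2] = 1$ for $p = \mathcal{N}(0, 1)$:

```python
# Numeric check of importance sampling: estimate E_{x~p}[f(x)] with samples from q(x).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def pdf(x, mu, sigma):
    """Gaussian density, used for both p and q."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_p = rng.normal(0.0, 1.0, 100_000)          # direct sampling from p = N(0, 1)
x_q = rng.normal(0.5, 1.2, 100_000)          # sampling from the proposal q = N(0.5, 1.2)
w = pdf(x_q, 0.0, 1.0) / pdf(x_q, 0.5, 1.2)  # importance weights p(x)/q(x)

print(f(x_p).mean())        # ~1.0  direct Monte Carlo estimate
print((f(x_q) * w).mean())  # ~1.0  importance-sampling estimate
```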

With the proposal distribution $q(x)$ in hand, we can collect data with $\theta'$ and use it to train $\theta$. Because $\theta'$ stays fixed while $\theta$ is updated, its samples can be reused; we no longer need to re-sample after every update, as we did when sampling with $\theta$ itself.

But importance sampling has a catch: the variance is not the same.

Variance identity: $Var[X] = E[X^2] - (E[X])^2$

$$\text{Var}_{x \sim p}[f(x)] = E_{x \sim p}[f(x)^2] - \left(E_{x \sim p}[f(x)]\right)^2$$

$$\begin{aligned}
\text{Var}_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right] &= E_{x \sim q}\!\left[\left(f(x)\frac{p(x)}{q(x)}\right)^2\right] - \left(E_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]\right)^2 \\
&= \int q(x)\,f(x)^2\,\frac{p(x)^2}{q(x)^2}\,dx - \left(\int q(x)\,f(x)\,\frac{p(x)}{q(x)}\,dx\right)^2 \\
&= E_{x \sim p}\!\left[f(x)^2\,\frac{p(x)}{q(x)}\right] - \left(E_{x \sim p}[f(x)]\right)^2
\end{aligned}$$

Comparing the two variance expressions: by importance sampling the expectations are equal,
$$E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right],$$
so the second terms match, but the first terms differ by an extra factor $\frac{p(x)}{q(x)}$; the two variances coincide only when $p(x)$ and $q(x)$ are the same distribution.
If the ratio $\frac{p(x)}{q(x)}$ is far from 1, i.e. the actual distribution and the proposal distribution differ a lot, then the variance of the importance-sampling estimate can be very different (much larger) than the variance of the direct estimate.

In the ideal case, with enough samples this is not a problem. In practice, however, we cannot draw that many samples, so the sample estimates of the two sides of $E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x)\frac{p(x)}{q(x)}\right]$ can end up very different.
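To see this finite-sample problem numerically, the sketch below (same illustrative Gaussian setup as before, not from the original article) compares a proposal close to $p$ with one far from it; both estimators are unbiased in theory, but the mismatched one is unreliable at this sample size:

```python
# When q is far from p, the importance-sampling estimate is still unbiased
# in expectation, but its finite-sample behaviour becomes erratic.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_estimate(n, q_mu, q_sigma):
    x = rng.normal(q_mu, q_sigma, n)
    w = pdf(x, 0.0, 1.0) / pdf(x, q_mu, q_sigma)   # p(x)/q(x)
    return (f(x) * w).mean()

# q close to p = N(0, 1): estimates stay near the true value 1.0
print([round(is_estimate(1_000, 0.1, 1.0), 3) for _ in range(5)])
# q far from p: estimates are far from 1.0 and erratic at the same sample size
print([round(is_estimate(1_000, 3.0, 0.5), 3) for _ in range(5)])
```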

Gradient update

$$\begin{aligned}
\text{gradient for update} &= E_{(s_t,a_t) \sim \pi_\theta}\!\left[A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t)\right] \\
&= E_{(s_t,a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(s_t, a_t)}{p_{\theta'}(s_t, a_t)}\, A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t)\right] \\
&= E_{(s_t,a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)\, p_\theta(s_t)}{p_{\theta'}(a_t \mid s_t)\, p_{\theta'}(s_t)}\, A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t)\right]
\end{aligned}$$

Here $\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}$ is taken to be approximately 1 and cancelled (it is hard to compute, so we treat the two state distributions as roughly equal). Likewise, since the trajectories are collected with $\theta'$, the advantage $A^\theta(s_t, a_t)$ has to be estimated from those samples and is replaced by $A^{\theta'}(s_t, a_t)$.

Using the identity $\nabla f(x) = f(x)\,\nabla \log f(x)$ and looking at the gradient expression above, we can identify

$$f(x) = p_\theta(a_t \mid s_t), \qquad \nabla \log f(x) = \nabla \log p_\theta(a_t \mid s_t)$$

so that

$$\nabla f(x) = f(x)\,\nabla \log f(x) = p_\theta(a_t \mid s_t)\,\nabla \log p_\theta(a_t \mid s_t) = \nabla p_\theta(a_t \mid s_t)$$

The gradient therefore becomes

$$\nabla = E_{(s_t,a_t) \sim \pi_{\theta'}}\!\left[\frac{\nabla p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$$

and from this gradient we can read off the surrogate reward objective defined with respect to $\theta'$:

$$J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$$
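Here is a minimal PyTorch-style sketch of this surrogate objective, assuming a discrete-action policy; `policy_net`, `old_log_probs`, and `advantages` are illustrative placeholders recorded when the data was collected with $\theta'$. Autograd on the ratio reproduces the gradient derived above, since $\nabla \frac{p_\theta}{p_{\theta'}} = \frac{\nabla p_\theta}{p_{\theta'}}$:

```python
# Sketch of the off-policy surrogate objective J^{theta'}(theta) in PyTorch.
# old_log_probs were recorded when the data was collected with theta';
# advantages A^{theta'}(s_t, a_t) are treated as fixed numbers (no gradient).
import torch

def surrogate_loss(policy_net, states, actions, old_log_probs, advantages):
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_p_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(log_p_a - old_log_probs)      # p_theta / p_theta'
    return -(ratio * advantages.detach()).mean()    # minimize -J  <=>  maximize J
```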

The PPO algorithm

$$\begin{aligned}
& J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k) \\
& J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t) \\
& \text{If } KL(\theta, \theta^k) \geq KL_{\max},\ \text{increase } \beta \\
& \text{If } KL(\theta, \theta^k) \leq KL_{\min},\ \text{decrease } \beta
\end{aligned}$$

A KL-divergence penalty term is appended to the objective.

  • Initialize the policy parameters $\theta^0$.
  • In each iteration:
    • Use $\theta^k$ to interact with the environment, collect $\{s_t, a_t\}$, and compute the advantages $A^{\theta^k}(s_t, a_t)$.
    • Find the $\theta$ that optimizes $J_{PPO}(\theta)$ (a minimal sketch of this step with the adaptive $\beta$ rule follows below).
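A minimal sketch of one such update with the adaptive $\beta$ rule, assuming a discrete-action policy in PyTorch; all names and hyper-parameters here are illustrative, and in practice the KL check is usually applied after a full round of updates rather than per minibatch:

```python
# Sketch of one PPO (KL-penalty variant) update step with the adaptive beta rule.
# old_log_probs_all stores the full action distributions recorded under theta^k (no grad).
import torch

def ppo_penalty_step(policy_net, optimizer, states, actions, advantages,
                     old_log_probs_all, beta, kl_min, kl_max):
    log_probs_all = torch.log_softmax(policy_net(states), dim=-1)
    log_p_a = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_log_p_a = old_log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    ratio = torch.exp(log_p_a - old_log_p_a)          # p_theta / p_theta^k
    # KL(theta^k || theta), averaged over visited states
    kl = (old_log_probs_all.exp() * (old_log_probs_all - log_probs_all)).sum(-1).mean()

    loss = -(ratio * advantages.detach()).mean() + beta * kl   # = -J_PPO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Adapt beta: too much divergence -> penalize harder; too little -> relax.
    if kl.item() > kl_max:
        beta *= 2.0
    elif kl.item() < kl_min:
        beta /= 2.0
    return beta
```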

The second PPO variant (PPO2)

$$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \text{clip}\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\, 1 - \xi,\, 1 + \xi\right) A^{\theta^k}(s_t, a_t)\right)$$

clip is a selection (truncation) function:

  • When $\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} < 1 - \xi$, it returns $1 - \xi$.

  • When $\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} > 1 + \xi$, it returns $1 + \xi$.

  • When $1 - \xi < \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} < 1 + \xi$, it returns the ratio itself.

[Figure: the PPO2 clipped objective plotted against the ratio $\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}$, for $A > 0$ and $A < 0$.]

The green line shows the unclipped term $\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t)$.

The blue line shows the value after clipping.

The x-axis is the ratio $\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}$.

We do not want the true distribution $p_\theta(x)$ and the proposal distribution $p_{\theta'}(x)$ to differ too much, i.e. this ratio should ideally stay close to 1.

An attempt to interpret the figure:

  • When $A > 0$, the reward obtained from $(s, a)$ is good, meaning the decision made under $p_\theta(a_t \mid s_t)$ was right, so we want its probability to be as large as possible; the objective (the red line, the minimum of the two terms) pushes the probability up, but only up to a point.
  • When $A < 0$, the reward obtained from $(s, a)$ is bad, meaning the decision made under $p_\theta(a_t \mid s_t)$ was wrong, so we want its probability to be smaller; the objective (the red line) pushes the probability down, but not by too much (a clipped-loss sketch in code follows below).
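Putting the clipped objective into code, here is a minimal PyTorch sketch (not the article's own implementation); $\xi$ plays the role of the clip range, often written $\epsilon$ and set to around 0.2:

```python
# Sketch of the PPO2 clipped objective as a loss function in PyTorch.
# ratio = p_theta(a|s) / p_theta^k(a|s), computed from stored old log-probabilities.
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, xi=0.2):
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - xi, 1.0 + xi) * advantages
    # Taking the element-wise min keeps the objective conservative: the update
    # gets no extra credit for pushing the ratio outside [1 - xi, 1 + xi].
    return -torch.min(unclipped, clipped).mean()
```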
