Policy-Based Reinforcement Learning (2)
As introduced in the previous post:
$$V(s;\theta) = \sum_{a}\pi(a\mid s;\theta)\,Q_\pi(s,a)$$
Policy Gradient:
$$\begin{aligned}
\frac{\partial V(s;\theta)}{\partial\theta} &= \frac{\partial \sum_{a}\pi(a\mid s;\theta)\,Q_\pi(s,a)}{\partial\theta} \\
&= \sum_{a}\frac{\partial \pi(a\mid s;\theta)}{\partial\theta}\,Q_\pi(s,a) \\
&= \sum_{a}\pi(a\mid s;\theta)\,\frac{\partial \log\pi(a\mid s;\theta)}{\partial\theta}\,Q_\pi(s,a)
\qquad \left(\text{using }\ \frac{\partial \log\pi(\theta)}{\partial\theta} = \frac{1}{\pi(\theta)}\cdot\frac{\partial \pi(\theta)}{\partial\theta}\right) \\
&= \mathbb{E}_A\!\left[\frac{\partial \log\pi(A\mid s;\theta)}{\partial\theta}\,Q_\pi(s,A)\right]
\end{aligned}$$

Here the expectation is taken over $A \sim \pi(\cdot\mid s;\theta)$. Note that the second step treats $Q_\pi(s,a)$ as independent of $\theta$; this is a simplification used in the derivation, though the policy gradient theorem shows the final expectation form still holds in general.
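The equality of the first and last lines can be checked numerically. Below is a minimal sketch (the single state, the four actions, the softmax parameterization of $\pi$, and the fixed $Q_\pi$ values are all illustrative assumptions, not from the original): it computes the gradient of $V(s;\theta)$ by differentiating the sum directly, by the expectation form, and by finite differences, and all three agree.

```python
import numpy as np

np.random.seed(0)
theta = np.random.randn(4)           # assumed: policy parameters for one state, 4 actions
Q = np.array([1.0, 3.0, -2.0, 0.5])  # assumed fixed action values Q_pi(s, a)
n = len(theta)

def pi(theta):
    # Softmax policy pi(a|s; theta) (an assumed parameterization).
    e = np.exp(theta - theta.max())
    return e / e.sum()

def V(theta):
    # V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a)
    return pi(theta) @ Q

p = pi(theta)

# Form (1): differentiate the sum directly.
# For softmax, d pi_a / d theta_j = pi_a * (delta_aj - pi_j).
grad1 = np.zeros(n)
for a in range(n):
    dpi_a = p[a] * ((np.arange(n) == a) - p)
    grad1 += dpi_a * Q[a]

# Form (2): expectation of (d log pi(A)/d theta) * Q(s, A) under the policy.
# For softmax, d log pi_a / d theta = e_a - pi (the score function).
grad2 = np.zeros(n)
for a in range(n):
    score = (np.arange(n) == a).astype(float) - p
    grad2 += p[a] * score * Q[a]

# Finite-difference check of dV/dtheta.
eps = 1e-6
fd = np.array([(V(theta + eps * np.eye(n)[j]) - V(theta - eps * np.eye(n)[j])) / (2 * eps)
               for j in range(n)])

print(np.allclose(grad1, grad2), np.allclose(grad1, fd, atol=1e-5))
```

All three gradients coincide, confirming that the log-derivative trick only rewrites the sum, without changing its value.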
This yields two equivalent forms of the Policy Gradient:
$$\frac{\partial V(s;\theta)}{\partial\theta} = \frac{\partial \sum_{a}\pi(a\mid s;\theta)\,Q_\pi(s,a)}{\partial\theta} \quad (1)$$

$$\frac{\partial V(s;\theta)}{\partial\theta} = \mathbb{E}_A\!\left[\frac{\partial \log\pi(A\mid s;\theta)}{\partial\theta}\,Q_\pi(s,A)\right] \quad (2)$$
Form (1) is used when the action space is discrete, since the sum over actions can be computed directly. Form (2) is used when the action space is continuous: the sum becomes an integral that is generally intractable, so the expectation is instead estimated by Monte Carlo, sampling actions from $\pi(\cdot\mid s;\theta)$.
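This Monte Carlo use of form (2) can be sketched as follows (a minimal illustration; the single state, three actions, softmax policy, fixed $Q_\pi$ values, and sample count are all assumptions). A discrete action space is used here only so the estimate can be compared against the exact gradient from form (1); the same sampling scheme is what one would apply when the action space is continuous and form (1) is unavailable.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.2, -0.5, 0.1])  # assumed: policy parameters for one state, 3 actions
Q = np.array([2.0, -1.0, 0.5])      # assumed fixed action values Q_pi(s, a)
n = len(theta)

# Softmax policy pi(a|s; theta) (an assumed parameterization).
e = np.exp(theta - theta.max())
p = e / e.sum()

# Exact gradient via form (1): sum_a (d pi_a / d theta) * Q(s, a).
exact = np.zeros(n)
for a in range(n):
    exact += p[a] * ((np.arange(n) == a) - p) * Q[a]

# Monte Carlo estimate of form (2): sample A ~ pi(.|s; theta), then
# average the score-function term (d log pi(A)/d theta) * Q(s, A).
samples = rng.choice(n, size=200_000, p=p)
onehot = np.eye(n)[samples]                       # shape (N, n)
g_hat = ((onehot - p) * Q[samples, None]).mean(axis=0)

print(np.abs(g_hat - exact).max())  # estimation error shrinks as N grows
```

In an actual training loop only a single sampled action per step is used, giving an unbiased but noisy gradient estimate that stochastic gradient ascent averages out over many updates.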
Original article: https://blog.csdn.net/zhangsj1007/article/details/139581657