Machine Learning (10): Reinforcement Learning

Reinforcement learning

1 key concepts

  1. states
  2. actions
  3. rewards
  4. discount factor $\gamma$
  5. return
  6. policy $\pi$

2 return

  1. definition: the sum of the rewards that the system gets, weighted by the discount factor
  2. compute:
    • $R_i$: reward of state $i$
    • $\gamma$: discount factor (usually close to 1); discounting makes the agent "impatient", since earlier rewards count for more than later ones

$$return = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^{n-1} R_n$$
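As a quick illustration, here is a minimal sketch (with made-up reward values and $\gamma = 0.9$, not from the notes) of computing the return by summing the discounted rewards:

```python
# A minimal sketch (made-up reward values): computing the discounted return
# for a 4-step reward sequence with gamma = 0.9.
gamma = 0.9
rewards = [0, 0, 0, 100]   # R_1, R_2, R_3, R_4

# return = R_1 + gamma*R_2 + gamma^2*R_3 + gamma^3*R_4
ret = sum(gamma ** i * r for i, r in enumerate(rewards))
print(ret)                 # 0.9**3 * 100 = 72.9 (approximately)
```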

3 policy

A policy $\pi$ maps a state $s$ to some action $a$:

$$\pi(s) = a$$

The goal of reinforcement learning is to find a policy $\pi$ that maps every state $s$ to an action $a$ so as to maximize the return.
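A minimal sketch of what a deterministic policy looks like, for a tiny made-up problem with four states (the states and action names are illustrative only):

```python
# A deterministic policy stored as a lookup table mapping each state s to an action a.
policy = {0: "left", 1: "left", 2: "right", 3: "right"}

def pi(s):
    """pi(s) = a"""
    return policy[s]

print(pi(2))   # "right"
```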


4 state-action value function

1. definition

$Q(s, a)$ = the return if you

  • start in state $s$
  • take action $a$ once
  • behave optimally after that

2. usage

  1. the best possible return from state $s$ is $\max_a Q(s, a)$
  2. the best possible action in state $s$ is the action $a$ that gives $\max_a Q(s, a)$
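A minimal sketch of both uses, with a made-up $Q$ table (rows = states, columns = actions):

```python
import numpy as np

# Best return from state s is max_a Q(s, a); best action is the argmax over actions.
Q = np.array([
    [12.5, 10.0],   # Q(s=0, a=0), Q(s=0, a=1)
    [ 8.0, 20.0],   # Q(s=1, a=0), Q(s=1, a=1)
])

s = 1
best_return = Q[s].max()      # max_a Q(s, a) = 20.0
best_action = Q[s].argmax()   # a = 1
```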

5 Bellman equation

$s$: current state

$a$: current action

$s'$: state you get to after taking action $a$

$a'$: action that you take in state $s'$

$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$
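A minimal sketch of the Bellman equation in action, on a tiny made-up deterministic problem (four states in a row, states 0 and 3 terminal, assumed rewards): repeatedly applying the update converges to the optimal $Q$ values.

```python
import numpy as np

# Q(s, a) = R(s) + gamma * max_a' Q(s', a'), swept repeatedly until it stops changing.
gamma = 0.5
R = np.array([100.0, 0.0, 0.0, 40.0])   # reward of each state (assumed values)
terminal = {0, 3}
moves = {0: -1, 1: +1}                   # action 0 = step left, action 1 = step right

Q = np.zeros((4, 2))
for _ in range(50):                      # enough sweeps to converge on this toy problem
    for s in range(4):
        for a, step in moves.items():
            if s in terminal:
                Q[s, a] = R[s]           # no future reward once a terminal state is reached
            else:
                s_next = s + step
                Q[s, a] = R[s] + gamma * Q[s_next].max()

print(Q.round(2))
```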

6 Deep Q-Network

1. definition

use a neural network to learn $Q(s, a)$:

$$x = (s, a)$$
$$y = R(s) + \gamma \max_{a'} Q(s', a')$$
$$f_{w, b}(x) \approx y$$
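A minimal sketch of such a network in Keras (the layer sizes, state dimension, and number of actions are assumptions, not from the notes); the action is one-hot encoded and concatenated onto the state to form $x = (s, a)$:

```python
import tensorflow as tf

state_dim, num_actions = 8, 4

# f_{w,b}(x): maps x = (s, a) to a scalar estimate of Q(s, a)
q_network = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(state_dim + num_actions,)),   # x = (s, a)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                                   # scalar Q(s, a)
])
q_network.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```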


2. steps

  1. initialize the neural network randomly as a guess of $Q(s, a)$
  2. repeat:
    • take actions, getting tuples $(s, a, R(s), s')$
    • store the $N$ most recent $(s, a, R(s), s')$ tuples
  3. train the neural network (see the sketch after this list):
    • create a training set of $N$ examples using $x = (s, a)$ and $y = R(s) + \gamma \max_{a'} Q(s', a')$
    • train $Q_{new}$ such that $Q_{new}(x) \approx y$
    • set $Q = Q_{new}$
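A minimal sketch of this loop, assuming the `q_network` from the previous sketch, a stand-in random environment, one-hot encoded actions, and made-up hyperparameters (buffer size, batch size, $\gamma$); everything here is illustrative rather than a definitive implementation:

```python
import random
from collections import deque

import numpy as np

state_dim, num_actions, gamma, N = 8, 4, 0.95, 10_000
buffer = deque(maxlen=N)                        # keeps only the N most recent tuples

def env_step(s, a):
    """Stand-in environment: returns (R(s), s'). Replace with a real simulator."""
    return float(np.random.rand()), np.random.rand(state_dim).astype(np.float32)

def one_hot(a):
    return np.eye(num_actions, dtype=np.float32)[a]

s = np.random.rand(state_dim).astype(np.float32)
for step in range(200):
    a = np.random.randint(num_actions)          # in practice: epsilon-greedy, see below
    r, s_next = env_step(s, a)
    buffer.append((s, a, r, s_next))            # (s, a, R(s), s')
    s = s_next

    if len(buffer) >= 32:
        batch = random.sample(list(buffer), 32)
        x = np.array([np.concatenate([sb, one_hot(ab)]) for sb, ab, _, _ in batch])
        s_next_batch = np.array([sn for _, _, _, sn in batch])
        # Q(s', a') for every candidate action a' of every s' in the batch
        q_next = np.stack([
            q_network.predict(
                np.hstack([s_next_batch, np.tile(one_hot(ap), (len(batch), 1))]),
                verbose=0)[:, 0]
            for ap in range(num_actions)
        ], axis=1)
        y = np.array([rb for _, _, rb, _ in batch]) + gamma * q_next.max(axis=1)
        q_network.fit(x, y, epochs=1, verbose=0)   # train Q_new; here it overwrites Q directly
```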

3. optimizations

The refinements below ($\epsilon$-greedy exploration, mini-batches, and soft updates) make the basic algorithm work better in practice.


4. $\epsilon$-greedy policy

  1. with probability $1 - \epsilon$, pick the action $a$ that maximizes $Q(s, a)$ (exploitation)
  2. with probability $\epsilon$, pick an action $a$ at random (exploration)
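A minimal sketch of this selection rule (the Q values and $\epsilon$ below are made up):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.05):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: action with the largest Q(s, a)

print(epsilon_greedy(np.array([1.2, 3.4, 0.5])))  # 1 most of the time
```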

5. mini-batch

use a randomly sampled subset of the dataset on each gradient-descent step
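A minimal sketch (the batch size of 32 and the dummy tuples are assumptions):

```python
import random

# Sample a small mini-batch from the stored tuples instead of using all N of them
# on every gradient-descent step.
replay_buffer = [(i, i % 4, 0.0, i + 1) for i in range(10_000)]   # dummy (s, a, R(s), s') tuples
mini_batch = random.sample(replay_buffer, 32)
```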

6. soft update

instead of setting $Q = Q_{new}$ directly, blend the new parameters into the old ones (with a small $\alpha$, e.g. 0.01):

$$w = \alpha w_{new} + (1 - \alpha) w$$
$$b = \alpha b_{new} + (1 - \alpha) b$$
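A minimal sketch with dummy parameter values; $\alpha$ is an assumed small constant, so $Q$ changes only gradually on each update:

```python
import numpy as np

alpha = 0.01
w, b = np.ones(5), 0.5               # current parameters of Q
w_new, b_new = np.zeros(5), 0.0      # parameters just produced by training Q_new

w = alpha * w_new + (1 - alpha) * w
b = alpha * b_new + (1 - alpha) * b
```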
