Model-based value iteration and policy iteration pseudocode

Note that the symbols used in the pseudocode below have the following meanings:

  • MDP: Markov Decision Process;
  • V(s): Value function, the expected return starting from state s;
  • π(s): Policy; for a given state s, π(s) is the action the agent takes in that state according to the policy. A policy may be stochastic or deterministic; the pseudocode below assumes a deterministic policy;
  • R(s,a): Immediate reward when taking action a in state s;
  • P(s'|s,a): Transition probability from state s to state s' under an action a;
  • γ: Discount factor for future reward.
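
As a concrete reference for the sketches below, a small tabular MDP can be encoded with plain Python dictionaries. The layout used here (states, actions, P, R, gamma bundled into an mdp dict) is an illustrative assumption, not something prescribed by the course:

states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9  # discount factor

# R[(s, a)]: immediate reward for taking action a in state s
R = {
    ("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
    ("s1", "a0"): 0.0, ("s1", "a1"): 2.0,
}

# P[(s, a)]: mapping from next state s' to P(s' | s, a); each row sums to 1
P = {
    ("s0", "a0"): {"s0": 1.0},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s1": 1.0},
}

mdp = {"states": states, "actions": actions, "P": P, "R": R, "gamma": gamma}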

Value iteration:

function ValueIteration(MDP):
    // MDP is a Markov Decision Process
    V(s) = 0 for all states s  // Initialization

    repeat:
        delta = 0
        for each state s:
            v = V(s)
            V(s) = max over all actions a of [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]
            delta = max(delta, |v - V(s)|)
    until delta < θ  // θ is a small convergence threshold

    return V  // Optimal value function
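
A minimal Python sketch of the value iteration loop above, assuming the mdp dictionary layout from the earlier example; theta is an assumed convergence threshold:

def value_iteration(mdp, theta=1e-8):
    # Initialization: V(s) = 0 for all states s
    V = {s: 0.0 for s in mdp["states"]}
    gamma = mdp["gamma"]
    while True:
        delta = 0.0
        for s in mdp["states"]:
            v = V[s]
            # V(s) = max_a [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]
            V[s] = max(
                mdp["R"][(s, a)]
                + gamma * sum(p * V[s2] for s2, p in mdp["P"][(s, a)].items())
                for a in mdp["actions"]
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V  # (approximately) optimal value function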

function ExtractOptimalPolicy(MDP, V):
    // MDP is a Markov Decision Process, V is the optimal value function
    for each state s:
        π(s) = argmax over all actions a of [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]

    return π  // Optimal policy
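
The corresponding policy-extraction step, again assuming the mdp dictionary from the example above:

def extract_optimal_policy(mdp, V):
    gamma = mdp["gamma"]
    pi = {}
    for s in mdp["states"]:
        # π(s) = argmax_a [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]
        pi[s] = max(
            mdp["actions"],
            key=lambda a: mdp["R"][(s, a)]
            + gamma * sum(p * V[s2] for s2, p in mdp["P"][(s, a)].items()),
        )
    return pi

# Usage: V = value_iteration(mdp); pi = extract_optimal_policy(mdp, V)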

Policy iteration:

function PolicyIteration(MDP):
    // MDP is a Markov Decision Process
    Initialize a policy π arbitrarily

    repeat until policy converges:
        // Policy Evaluation
        V = EvaluatePolicy(MDP, π)

        // Policy Improvement
        π' = GreedyPolicyImprovement(MDP, V)

        if π' = π:
            break  // Policy has converged

        π = π'

    return π  // Optimal policy

function EvaluatePolicy(MDP, π):
    // MDP is a Markov Decision Process, π is a policy
    V(s) = 0 for all states s  // Initialization

    repeat:
        delta = 0
        for each state s:
            v = V(s)
            V(s) = Σ P(s' | s, π(s)) * [ R(s, π(s)) + γ * V(s') ]
            delta = max(delta, |v - V(s)|)
    until delta < θ  // θ is a small convergence threshold

    return V  // Value function under the given policy
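
A Python sketch of iterative policy evaluation for a deterministic policy, assuming the same mdp dictionary layout and an assumed threshold theta:

def evaluate_policy(mdp, pi, theta=1e-8):
    # Initialization: V(s) = 0 for all states s
    V = {s: 0.0 for s in mdp["states"]}
    gamma = mdp["gamma"]
    while True:
        delta = 0.0
        for s in mdp["states"]:
            v = V[s]
            a = pi[s]  # deterministic policy: one action per state
            # V(s) = Σ P(s' | s, π(s)) * [ R(s, π(s)) + γ * V(s') ]
            V[s] = sum(
                p * (mdp["R"][(s, a)] + gamma * V[s2])
                for s2, p in mdp["P"][(s, a)].items()
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V  # value function under the given policy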

function GreedyPolicyImprovement(MDP, V):
    // MDP is a Markov Decision Process, V is a value function
    for each state s:
        π(s) = argmax over all actions a of [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]

    return π  // Improved policy
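
Putting it together: a sketch of greedy policy improvement and the outer policy iteration loop. It reuses evaluate_policy and the mdp dictionary from the sketches above; initializing the policy with the first action in every state is an arbitrary choice:

def greedy_policy_improvement(mdp, V):
    gamma = mdp["gamma"]
    # π(s) = argmax_a [ R(s, a) + γ * Σ P(s' | s, a) * V(s') ]
    return {
        s: max(
            mdp["actions"],
            key=lambda a: mdp["R"][(s, a)]
            + gamma * sum(p * V[s2] for s2, p in mdp["P"][(s, a)].items()),
        )
        for s in mdp["states"]
    }

def policy_iteration(mdp):
    # Initialize a policy arbitrarily
    pi = {s: mdp["actions"][0] for s in mdp["states"]}
    while True:
        V = evaluate_policy(mdp, pi)                 # policy evaluation
        new_pi = greedy_policy_improvement(mdp, V)   # policy improvement
        if new_pi == pi:
            return pi  # policy has converged
        pi = new_pi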

The pseudocode above is based on the slides (PPT) from Shiyu Zhao's course [1].

References:

[1] https://www.bilibili.com/video/BV1sd4y167NS

[2] https://chat.openai.com/
