Background
The previous chapter covered the concepts. This chapter was originally going to continue with algorithms such as Q-learning and PPO, but I found them too math-heavy to get through, so instead I'll just write some code first to build intuition.
Note that this code runs in OpenAI's gym environment, which needs to be installed beforehand.
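On most setups, something like pip install gym stable-baselines3 should be enough. Note that the code below uses the older gym API (step() returning four values), so the gym and stable-baselines3 versions need to be compatible with each other.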
Code
Import the libraries
import gym  # the gym RL environment library
from gym import Env
from gym.spaces import Discrete, Box, Dict, Tuple, MultiBinary, MultiDiscrete
import numpy as np
import random
import os
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack  # frame stacking, to make training more efficient (imported but not used below)
from stable_baselines3.common.evaluation import evaluate_policy
Define a custom environment
class Shower_Env(Env):  # must inherit from gym's Env class
    def __init__(self):
        # 3 discrete actions: 0 = turn temperature down, 1 = hold, 2 = turn up
        self.action_space = Discrete(3)
        # the observation is the water temperature, a single value in [0, 100]
        self.observation_space = Box(low=np.array([0]), high=np.array([100]))
        # start near 38 degrees with a little random noise
        self.state = 38 + random.randint(-3, 3)
        # each episode (shower) lasts 60 timesteps
        self.shower_length = 60

    def step(self, action):
        self.state += action - 1      # map action {0, 1, 2} to {-1, 0, +1}
        self.shower_length -= 1
        # +1 reward while the temperature stays in the comfortable 37-39 range
        if self.state >= 37 and self.state <= 39:
            reward = 1
        else:
            reward = -1
        # the episode ends once the 60 timesteps are used up
        if self.shower_length <= 0:
            done = True
        else:
            done = False
        info = {}
        return self.state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.state = np.array([38 + random.randint(-3, 3)]).astype(float)
        self.shower_length = 60
        return self.state
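Before moving on, it's worth a quick sanity check that the custom environment conforms to the gym interface. Below is a minimal sketch using stable_baselines3's built-in environment checker (warn=True reports most issues as warnings rather than errors):
from stable_baselines3.common.env_checker import check_env

env = Shower_Env()
check_env(env, warn=True)  # complains if the spaces, reset() or step() deviate from the gym API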
A few notes on the space (data) types from gym used above:
Discrete
Discrete(3)  # three discrete values: 0, 1, 2
Box
Box(0, 1, (3,), float)  # the first two arguments give the range 0-1, (3,) is the shape (3 entries), float is the dtype
Box(low=np.array([-1, -2]), high=np.array([2.0, 4.0]), dtype=np.float64).sample()
# this sets per-dimension bounds: the first dimension lies in (-1, 2), the second in (-2, 4)
Tuple (works like Python's tuple)
Tuple((Discrete(3),Box(0,1,(4,4),float))).sample()
output:
(1,array([[0.82108611, 0.95457337, 0.25524844, 0.25637766],
[0.71330428, 0.75133684, 0.06802405, 0.60739341],
[0.1211024 , 0.30688017, 0.04670419, 0.85701807],
[0.81068156, 0.46448658, 0.17561785, 0.58225321]]))
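The imports above also pull in Dict, MultiBinary and MultiDiscrete, which are not used in this example. As a rough sketch of what they look like (the sampled values shown in the comments are just illustrative):
Dict({"position": Discrete(2), "velocity": Box(0, 1, (2,), float)}).sample()
# e.g. OrderedDict([('position', 1), ('velocity', array([0.43, 0.91], dtype=float32))])
MultiBinary(4).sample()            # e.g. array([0, 1, 1, 0], dtype=int8) - 4 independent binary flags
MultiDiscrete([5, 2, 2]).sample()  # e.g. array([3, 0, 1]) - one value per dimension, each with its own range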
Test the environment
env = Shower_Env()
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
Here the action is sampled at random; let's see what reward accumulates over the 60 timesteps:
Episode:1 Score:-48
Episode:2 Score:8
Episode:3 Score:-60
Episode:4 Score:-30
Episode:5 Score:-58
As you can see, it's pretty bad O.O
Train the model
log_path = os.path.join('Training', 'logs')
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=100000)
------------------------------------------
| rollout/ | |
| ep_len_mean | 60 |
| ep_rew_mean | 51.1 |
| time/ | |
| fps | 355 |
| iterations | 49 |
| time_elapsed | 282 |
| total_timesteps | 100352 |
| train/ | |
| approx_kl | 0.0016310867 |
| clip_fraction | 0.041 |
| clip_range | 0.2 |
| entropy_loss | -0.211 |
| explained_variance | 0.00269 |
| learning_rate | 0.0003 |
| loss | 48.5 |
| n_updates | 1026 |
| policy_gradient_loss | 0.00708 |
| value_loss | 105 |
------------------------------------------
<stable_baselines3.ppo.ppo.PPO at 0x1068b702fa0>
The last block above is what PPO logs during training; the most important entries are the top two.
ep_len_mean is the mean episode length (Episode Length Mean). An "episode" is the sequence of actions and state transitions from the environment's initial state until some termination condition is reached. A smaller ep_len_mean is usually better for tasks that can terminate early, but in this project it never changes, because the environment fixes every episode at 60 steps.
ep_rew_mean is the mean episode reward (Episode Reward Mean): the larger, the better.
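Since tensorboard_log=log_path was passed to PPO, the same ep_rew_mean / ep_len_mean curves can also be viewed in TensorBoard (assuming TensorBoard is installed) by pointing it at the log directory:
tensorboard --logdir Training/logs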
Save and evaluate the model
model.save('PPO')
evaluate_policy(model, env, n_eval_episodes=10, render=True)
OUT:(24.0, 54.99090833947008)
Looks okay. evaluate_policy returns the mean and standard deviation of the episode reward over the 10 evaluation episodes, so this is a mean reward of 24 with a standard deviation of about 55.
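If you come back to this in a new session, the saved weights can be reloaded before testing. A minimal sketch, assuming the PPO.zip file created by model.save('PPO') above:
from stable_baselines3 import PPO

model = PPO.load('PPO', env=env)  # stable_baselines3 appends .zip to the path automatically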
Test the model again
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0
    while not done:
        env.render()
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
Episode:1 Score:48
Episode:2 Score:56
Episode:3 Score:60
Episode:4 Score:16
Episode:5 Score:54
The scores are clearly much higher and much more stable than with random actions, which suggests the agent has found a reasonably good policy.
Of course, with more training the scores should become higher and more stable still, but that takes longer, so I won't demonstrate it here.
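For reference, if you do want to train for longer without watching it, stable_baselines3's callbacks can stop training once an evaluation reward threshold is reached. A rough sketch, assuming the same env as above and a target mean reward of 58 (the threshold and eval_freq are just illustrative choices):
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# stop as soon as the evaluation mean reward reaches 58 (the maximum possible is 60)
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=58, verbose=1)
eval_callback = EvalCallback(env,
                             callback_on_new_best=stop_callback,
                             eval_freq=10000,
                             best_model_save_path=os.path.join('Training', 'models'),
                             verbose=1)
model.learn(total_timesteps=500000, callback=eval_callback)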