The FQL Agent

Building on the new Finance environment, this section improves the simple DQL agent to raise its performance in the financial market context. The FQLAgent class can handle multiple features and a flexible number of lags. It also distinguishes between a learning environment (learn_env) and a validation environment (valid_env), which gives a more realistic picture of the agent's out-of-sample performance during training. The FQLAgent class has the same basic structure and the same RL/QL learning approach as the DQLAgent class.

In [69]: class FQLAgent:
             def __init__(self, hidden_units, learning_rate, learn_env, valid_env):
                 self.learn_env = learn_env
                 self.valid_env = valid_env
                 self.epsilon = 1.0
                 self.epsilon_min = 0.1
                 self.epsilon_decay = 0.98
                 self.learning_rate = learning_rate
                 self.gamma = 0.95
                 self.batch_size = 128
                 self.max_treward = 0
                 self.trewards = list()
                 self.averages = list()
                 self.performances = list()
                 self.aperformances = list()
                 self.vperformances = list()
                 self.memory = deque(maxlen=2000)
                 self.model = self._build_model(hidden_units, learning_rate)

             def _build_model(self, hu, lr):
                 model = Sequential()
                 model.add(Dense(hu, input_shape=(
                     self.learn_env.lags, self.learn_env.n_features),
                     activation='relu'))
                 model.add(Dropout(0.3, seed=100))
                 model.add(Dense(hu, activation='relu'))
                 model.add(Dropout(0.3, seed=100))
                 model.add(Dense(2, activation='linear'))
                 model.compile(
                     loss='mse',
                     optimizer=RMSprop(lr=lr)
                 )
                 return model

             def act(self, state):
                 # exploration: random action; exploitation: best action per the DQN
                 if random.random() <= self.epsilon:
                     return self.learn_env.action_space.sample()
                 action = self.model.predict(state)[0, 0]
                 return np.argmax(action)

             def replay(self):
                 # replay a random batch of remembered transitions and update the DQN
                 batch = random.sample(self.memory, self.batch_size)
                 for state, action, reward, next_state, done in batch:
                     if not done:
                         reward += self.gamma * np.amax(
                             self.model.predict(next_state)[0, 0])
                     target = self.model.predict(state)
                     target[0, 0, action] = reward
                     self.model.fit(state, target, epochs=1,
                                    verbose=False)
                 if self.epsilon > self.epsilon_min:
                     self.epsilon *= self.epsilon_decay

             def learn(self, episodes):
                 for e in range(1, episodes + 1):
                     state = self.learn_env.reset()
                     state = np.reshape(state, [1, self.learn_env.lags,
                                                self.learn_env.n_features])
                     for _ in range(10000):
                         action = self.act(state)
                         next_state, reward, done, info = \
                             self.learn_env.step(action)
                         next_state = np.reshape(next_state,
                                                 [1, self.learn_env.lags,
                                                  self.learn_env.n_features])
                         self.memory.append([state, action, reward,
                                             next_state, done])
                         state = next_state
                         if done:
                             treward = _ + 1
                             self.trewards.append(treward)
                             av = sum(self.trewards[-25:]) / 25
                             perf = self.learn_env.performance
                             self.averages.append(av)
                             self.performances.append(perf)
                             self.aperformances.append(
                                 sum(self.performances[-25:]) / 25)
                             self.max_treward = max(self.max_treward, treward)
                             templ = 'episode: {:2d}/{} | treward: {:4d} | '
                             templ += 'perf: {:5.3f} | av: {:5.1f} | max: {:4d}'
                             print(templ.format(e, episodes, treward, perf,
                                                av, self.max_treward), end='\r')
                             break
                     self.validate(e, episodes)
                     if len(self.memory) > self.batch_size:
                         self.replay()

             def validate(self, e, episodes):
                 # run the current policy greedily on the validation environment
                 state = self.valid_env.reset()
                 state = np.reshape(state, [1, self.valid_env.lags,
                                            self.valid_env.n_features])
                 for _ in range(10000):
                     action = np.argmax(self.model.predict(state)[0, 0])
                     next_state, reward, done, info = self.valid_env.step(action)
                     state = np.reshape(next_state, [1, self.valid_env.lags,
                                                     self.valid_env.n_features])
                     if done:
                         treward = _ + 1
                         perf = self.valid_env.performance
                         self.vperformances.append(perf)
                         if e % 20 == 0:
                             templ = 71 * '='
                             templ += '\nepisode: {:2d}/{} | VALIDATION | '
                             templ += 'treward: {:4d} | perf: {:5.3f} | '
                             templ += 'eps: {:.2f}\n'
                             templ += 71 * '='
                             print(templ.format(e, episodes, treward,
                                                perf, self.epsilon))
                         break
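The class above does not import anything itself; it relies on objects set up earlier in the chapter, such as random, numpy (np), deque, the Keras model, layers, and optimizer, matplotlib (plt), the set_seeds() helper, and the Finance environment class. The following is only a minimal sketch of what these dependencies might look like; the exact import paths and the body of set_seeds() are assumptions rather than code from this section, and newer Keras versions spell the RMSprop argument learning_rate instead of lr.

import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop

def set_seeds(seed=100):
    # assumed helper: fix the relevant random seeds for reproducibility
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)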
As the following Python code shows, the FQLAgent performs considerably better than the simple DQLAgent that solves the CartPole problem. This trading bot seems to learn trading quite effectively through its interaction with the financial market environment (see Figure 9-4).

In [70]: symbol = 'EUR='
         features = [symbol, 'r', 's', 'm', 'v']

In [71]: a = 0
         b = 2000
         c = 500

In [72]: learn_env = Finance(symbol, features, window=10, lags=6,
                             leverage=1, min_performance=0.85,
                             start=a, end=a + b, mu=None, std=None)

In [73]: learn_env.data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2000 entries, 2010-01-19 to 2017-12-26
         Data columns (total 5 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   EUR=    2000 non-null   float64
          1   r       2000 non-null   float64
          2   s       2000 non-null   float64
          3   m       2000 non-null   float64
          4   v       2000 non-null   float64
         dtypes: float64(5)
         memory usage: 93.8 KB

In [74]: valid_env = Finance(symbol, features, window=learn_env.window,
                             lags=learn_env.lags,
                             leverage=learn_env.leverage,
                             min_performance=learn_env.min_performance,
                             start=a + b, end=a + b + c,
                             mu=learn_env.mu, std=learn_env.std)

In [75]: valid_env.data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 500 entries, 2017-12-27 to 2019-12-20
         Data columns (total 5 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   EUR=    500 non-null    float64
          1   r       500 non-null    float64
          2   s       500 non-null    float64
          3   m       500 non-null    float64
          4   v       500 non-null    float64
         dtypes: float64(5)
         memory usage: 23.4 KB

In [76]: set_seeds(100)
         agent = FQLAgent(24, 0.0001, learn_env, valid_env)

In [77]: episodes = 61

In [78]: agent.learn(episodes)
         =======================================================================
         episode: 20/61 | VALIDATION | treward: 494 | perf: 1.169 | eps: 0.68
         =======================================================================
         =======================================================================
         episode: 40/61 | VALIDATION | treward: 494 | perf: 1.111 | eps: 0.45
         =======================================================================
         =======================================================================
         episode: 60/61 | VALIDATION | treward: 494 | perf: 1.089 | eps: 0.30
         =======================================================================
         episode: 61/61 | treward: 1994 | perf: 1.268 | av: 1615.1 | max: 1994

In [79]: agent.epsilon
Out[79]: 0.291602079838278

In [80]: plt.figure(figsize=(10, 6))
         x = range(1, len(agent.averages) + 1)
         y = np.polyval(np.polyfit(x, agent.averages, deg=3), x)
         plt.plot(agent.averages, label='moving average')
         plt.plot(x, y, 'r--', label='regression')
         plt.xlabel('episodes')
         plt.ylabel('total reward')
         plt.legend();

Figure 9-4: Average total reward of the FQLAgent in the Finance environment

The training and validation performances also reveal an interesting pattern, shown in Figure 9-5. The training performance exhibits a large variance, because in addition to exploiting the currently optimal policy, the agent keeps exploring. By contrast, the validation performance shows a much smaller variance, since it relies only on the exploitation of the currently optimal policy.

In [81]: plt.figure(figsize=(10, 6))
         x = range(1, len(agent.performances) + 1)
         y = np.polyval(np.polyfit(x, agent.performances, deg=3), x)
         y_ = np.polyval(np.polyfit(x, agent.vperformances, deg=3), x)
         plt.plot(agent.performances[:], label='training')
         plt.plot(agent.vperformances[:], label='validation')
         plt.plot(x, y, 'r--', label='regression (train)')
         plt.plot(x, y_, 'r-.', label='regression (valid)')
         plt.xlabel('episodes')
         plt.ylabel('gross performance')
         plt.legend();

Figure 9-5: Per-episode training and validation performance of the FQLAgent
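A quick plausibility check on the exploration schedule: replay() multiplies epsilon by epsilon_decay = 0.98 once per call, and it is called at the end of every episode as soon as the replay memory holds more than batch_size transitions, which already happens during the first episode here. After 61 episodes the exploration rate should therefore be roughly 0.98 ** 61, in line with the value reported in Out[79] above.

0.98 ** 61  # approximately 0.2916, matching agent.epsilon after 61 episodes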
