Combining policy gradient and Q-learning

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as: arXiv:1611.01626 [cs.LG]
  (or arXiv:1611.01626v3 [cs.LG] for this version)

Submission history

From: Brendan O'Donoghue [ view email
[v1] Sat, 5 Nov 2016 10:49:37 GMT (1094kb,D)
[v2] Mon, 6 Mar 2017 12:38:42 GMT (892kb,D)
[v3] Fri, 7 Apr 2017 15:20:05 GMT (893kb,D)
Privacy-preserving machine learning is becoming increasingly important in today's world where data privacy is a major concern. Federated learning and secure aggregation are two techniques that can be used to achieve privacy-preserving machine learning. Federated learning is a technique where the machine learning model is trained on data that is distributed across multiple devices or servers. In this technique, the model is sent to the devices or servers, and the devices or servers perform the training locally on their own data. The trained model updates are then sent back to a central server, where they are aggregated to create a new version of the model. The key advantage of federated learning is that the data remains on the devices or servers, which helps to protect the privacy of the data. Secure aggregation is a technique that can be used to protect the privacy of the model updates that are sent to the central server. In this technique, the updates are encrypted before they are sent to the central server. The central server then performs the aggregation operation on the encrypted updates, and the result is sent back to the devices or servers. The devices or servers can then decrypt the result to obtain the updated model. By combining federated learning and secure aggregation, it is possible to achieve privacy-preserving machine learning. This approach allows for the training of machine learning models on sensitive data while protecting the privacy of the data and the model updates.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值