Three reviewers gave comments in total. My heart is bleeding; it feels like my own child has been put through the ten great tortures.
The first reviewer's comments were harsh, but honestly I'm convinced they're fair (still feeling wronged, though).
The second was normal.
The third didn't seem familiar with this research direction.
My replies:
Reply to Reviewer 3:
Dear reviewer:
Thank you very much for your comments on the manuscript. Following your advice, we recognize its shortcomings and will do our best to improve the paper. Your questions are answered below:
We use a deep cyclical phased actor-critic because continuous control tasks have large-scale state and action spaces. The target critic network and the replay buffer are the same mechanisms as in [16], and Equation 12 also comes from that paper. The target critic is used in the computation of the critic loss, in Equations 12 and 13. Since PACEE is off-policy, replay buffers store experiences, and random sampling breaks sample correlation to some extent. The critic loss is computed as in DQN, and we use deterministic policy gradients to compute the actor loss. The only difference between C-PACEE and PACEE is that the actors of C-PACEE work cyclically, so we only show the PACEE algorithm.

The time complexity of the algorithm is O(n^2), and it consumes some memory due to the replay buffer and the deep neural networks. The average time for an episode is about 0.8053 in testing, and each MuJoCo environment requires approximately 8 hours of training on average. The CPU of our machine is an Intel® Core™ i7-7770. TRPO, PPO, and DDPG are the methods in references [23], [24], and [16], respectively, and Ant is a continuous control task in MuJoCo.
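The mechanisms described in this reply (a replay buffer with uniform random sampling, and a DQN-style mean-squared TD loss for the critic computed against a frozen target critic) can be sketched roughly as below. This is a toy illustration with a linear critic and made-up dimensions, not the paper's actual PACEE implementation; the zero stand-in for the target actor's action is also an assumption.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores off-policy transitions; uniform random sampling breaks
    sample correlation to some extent, as described in the reply."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s2, d = (np.array(x) for x in zip(*batch))
        return s, a, r, s2, d

    def __len__(self):
        return len(self.buf)

rng = np.random.default_rng(0)
state_dim, action_dim, gamma, tau = 3, 1, 0.99, 0.005  # illustrative values

# Toy linear critic Q(s, a) = w^T [s; a]; a separate target copy
# stabilises the TD target, as in DQN / DDPG [16].
w = rng.normal(size=state_dim + action_dim)   # online critic parameters
w_target = w.copy()                           # target critic parameters

def q(params, s, a):
    return np.concatenate([s, a], axis=-1) @ params

buffer = ReplayBuffer(capacity=10_000)
for _ in range(256):  # fill the buffer with random transitions
    s = rng.normal(size=state_dim)
    a = rng.normal(size=action_dim)
    buffer.push(s, a, rng.normal(), rng.normal(size=state_dim), False)

s, a, r, s2, d = buffer.sample(64)
a2 = np.zeros((64, action_dim))                    # stand-in for the target actor's action
y = r + gamma * (1.0 - d) * q(w_target, s2, a2)    # TD target uses the *target* critic
critic_loss = np.mean((y - q(w, s, a)) ** 2)       # DQN-style mean-squared TD error

# Polyak-average the online critic into the target critic.
w_target = (1.0 - tau) * w_target + tau * w
```

In a full DDPG-style method, the actor would then be updated with the deterministic policy gradient through the critic; the sketch stops at the critic loss.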
Reply to Reviewer 2:
Dear reviewer:
Thank you very much for your comments on the manuscript. Following your advice, we recognize its shortcomings and will do our best to improve the paper. Your questions are answered below:
- We think that \xi lies in (0,1). We found that if \xi is greater than 0.5, the generated experiences are dominated by the experience network, which is not conducive to the actors' learning. So we turned the parameter down so that the actors dominate the generated experiences, and finally found that 1e-5 is a good value.
- Yes
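The role of \xi can be illustrated with a toy mixing rule. The convex combination below is our own guess at the mechanism (blending experience-network output with actor output), not the paper's actual equation; the function name and arrays are purely illustrative.

```python
import numpy as np

def mix_experience(actor_out, expnet_out, xi):
    """Convex combination of actor and experience-network outputs.
    A small xi lets the actors dominate the generated experience.
    NOTE: this mixing rule is an illustrative assumption, not PACEE's formula."""
    assert 0.0 < xi < 1.0, "xi is assumed to lie in (0, 1)"
    return (1.0 - xi) * actor_out + xi * expnet_out

actor_out = np.array([1.0, 2.0])       # hypothetical actor output
expnet_out = np.array([100.0, 200.0])  # hypothetical experience-network output

dominated = mix_experience(actor_out, expnet_out, xi=0.9)   # experience network dominates
balanced = mix_experience(actor_out, expnet_out, xi=1e-5)   # actors dominate, as in the reply
```

With xi = 1e-5 the result is essentially the actor output, matching the reply's observation that a small \xi keeps the actors in control of the generated experiences.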
Reply to Reviewer 1:
Dear reviewer:
Thank you very much for your comments on the manuscript. Following your advice, we recognize its shortcomings and will do our best to improve the paper. Your questions are answered below:
- For environments like HalfCheetah, HumanoidStandup, Ant, and Swimmer, each episode is 1000 time steps, so they each amount to 1000 runs. Environments like Hopper and Walker2d have episode lengths of less than 1000, so they exceed 1000 runs, but all were trained for one million time steps.
- Yes. From the second experiment, we can see that both Reacher and InvertedPendulum converged within one million time steps.
- Table 1, combined with the figures of the experimental results, illustrates the advantages of our approaches.
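The run counts in the first answer follow from simple arithmetic on the training budget; the figures below are taken from the reply itself.

```python
# Episodes implied by a fixed training budget (figures from the reply above).
total_steps = 1_000_000   # one million training time steps per environment
episode_len = 1000        # e.g. HalfCheetah, HumanoidStandup, Ant, Swimmer

runs = total_steps // episode_len  # 1000 runs for fixed-length environments

# Environments whose episodes can end early (Hopper, Walker2d) fit more
# than `runs` episodes into the same one-million-step budget.
```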