Introductory blog posts on value iteration
https://blog.csdn.net/qq_40206371/article/details/120857850
https://zhuanlan.zhihu.com/p/33229439
https://artint.info/2e/html/ArtInt2e.Ch9.S5.SS2.html
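Understanding the Bellman equations: the state-value function and the action-value function, which can be converted into each other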
https://blog.csdn.net/WSRY_GJP/article/details/123524282
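A minimal sketch of that conversion, not taken from the linked post. It assumes a made-up MDP with 2 states and 2 actions; the names P, R, pi, gamma, q_from_v and v_from_q are all hypothetical. For a fixed policy pi, Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s') and V(s) = sum_a pi(a|s) Q(s,a).

```python
import numpy as np

# Hypothetical toy MDP (all numbers are invented for illustration).
gamma = 0.9  # discount factor

# P[s, a, s2] = probability of moving from state s to state s2 under action a
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# pi[s, a] = probability that the policy picks action a in state s
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])


def q_from_v(V):
    """Action values from state values: Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)."""
    return R + gamma * (P @ V)


def v_from_q(Q):
    """State values from action values: V(s) = sum_a pi(a|s) Q(s,a)."""
    return (pi * Q).sum(axis=1)


# Solve the Bellman expectation equation V = R_pi + gamma * P_pi V exactly.
R_pi = (pi * R).sum(axis=1)             # expected reward per state under pi
P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transition matrix under pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

Q = q_from_v(V)          # V -> Q
V_back = v_from_q(Q)     # Q -> V, should reproduce V
```

Converting V to Q and back returns the same V, which is just the Bellman expectation equation written in two steps.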
Monte-Carlo policy gradient (PG)
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
https://www.jianshu.com/p/af668c5d783d
https://www.zhihu.com/column/p/110881517?utm_medium=social&utm_source=weibo
https://cloud.tencent.com/developer/article/1711596
https://blog.csdn.net/qq_30615903/article/details/80747380
https://blog.csdn.net/suai9292/article/details/79910525
Introductory videos
https://www.bilibili.com/video/BV1yP4y1X7xF?p=2&vd_source=1579e7f4f3a932e731abeb9d99294b0c