Master index of LLM safety research: https://blog.csdn.net/WhiffeYF/article/details/142132328
Safe RLHF: Safe Reinforcement Learning from Human Feedback
https://arxiv.org/pdf/2310.12773
https://github.com/PKU-Alignment/safe-rlhf
https://www.doubao.com/chat/3556303170287106
PKU-Beaver: the first reproducible RLHF benchmark in China, open-sourced by a team from Peking University
Quick Overview
- Research motivation: