The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

What you need to know:

Citation