Scalable Oversight
Creating techniques for humans to effectively supervise AI systems on complex, high-stakes tasks.
The Problem
As AI systems reach and exceed human-level performance in specific domains, human supervisors can no longer easily tell whether the AI is doing the "right" thing. If an AI writes a million lines of code or designs a new protein, no human can simply "check" it for safety and correctness in a reasonable amount of time.
Challenges:
- The "Evaluation Gap" — The difficulty of judging whether an AI's output is actually better or just "sounds" more convincing.
- Reward Misspecification — When we can't evaluate the output easily, we might give rewards for the wrong things.
- Complex Tasks — Oversight of tasks that take days or weeks for humans to perform once.
What We're Working On
- RLAIF (Reinforcement Learning from AI Feedback) — Using a "judge" AI to help humans evaluate the "actor" AI, scaling our ability to provide supervision.
- Recursive Oversight — Developing hierarchies of AI systems that supervise each other, with humans at the very top.
- Task Decomposition — Breaking complex tasks into smaller, human-verifiable pieces to ensure the final result is built on safe foundations.
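To make the RLAIF direction above concrete, here is a minimal sketch of the core loop: a "judge" produces preference labels over pairs of "actor" outputs, standing in for per-example human evaluation. Everything here is illustrative — the `toy_judge` heuristic, `toy_actor`, and the dataset shape are assumptions for the sketch, not a description of any specific system.

```python
# Minimal RLAIF sketch: a "judge" ranks pairs of "actor" outputs,
# producing preference labels that can later train a reward model.
# The judge below is a toy heuristic stand-in for a real judge model.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # output the judge preferred
    rejected: str  # output the judge did not prefer


def toy_judge(prompt: str, a: str, b: str) -> int:
    """Stand-in judge: prefers the answer that shows its reasoning.
    A real system would query a judge LLM here instead."""
    score = lambda ans: ("because" in ans) + ("therefore" in ans)
    return 0 if score(a) >= score(b) else 1


def build_preference_dataset(
    prompts: List[str],
    actor: Callable[[str], Tuple[str, str]],
    judge: Callable[[str, str, str], int],
) -> List[PreferencePair]:
    """Sample two actor outputs per prompt and let the judge rank them."""
    pairs = []
    for p in prompts:
        a, b = actor(p)
        winner = judge(p, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(p, chosen, rejected))
    return pairs


# Usage: a toy actor that emits two candidate answers per prompt.
def toy_actor(prompt: str) -> Tuple[str, str]:
    return (f"Yes, because {prompt} holds.", "Yes.")


dataset = build_preference_dataset(["2+2=4"], toy_actor, toy_judge)
print(dataset[0].chosen)  # the judge prefers the answer with reasoning
```

The resulting preference pairs are exactly the format consumed by standard reward-model or preference-optimization training, which is how a judge model scales a small amount of human oversight across many actor outputs.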
Related Publications
No publications in this area yet. Check back soon.