Scalable Oversight

Creating techniques for humans to effectively supervise AI systems on complex, high-stakes tasks.

The Problem

As AI systems reach and exceed human-level performance in specific domains, human supervisors can no longer easily tell whether the AI is doing the "right" thing. If an AI writes a million lines of code or designs a new protein, a human cannot simply check the output for safety and correctness in any reasonable amount of time.

Challenges:

  • The "Evaluation Gap" — The difficulty of judging whether an AI's output is actually better or just "sounds" more convincing.
  • Reward Misspecification — When outputs are hard to evaluate, we risk rewarding proxies for quality, such as answers that look correct rather than answers that are correct.
  • Complex Tasks — Oversight of tasks that take days or weeks for humans to perform once.

What We're Working On

  • RLAIF (RL from AI Feedback) — Using a "judge" AI to help humans evaluate the "actor" AI, scaling our ability to provide supervision.
  • Recursive Oversight — Developing hierarchies of AI systems that supervise each other, with humans at the very top.
  • Task Decomposition — Breaking complex tasks into smaller, human-verifiable pieces to ensure the final result is built on safe foundations.
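To make the RLAIF idea concrete, here is a minimal, hypothetical sketch of the preference-labeling step: a "judge" scores two outputs from an "actor" and emits a preference pair that could train a reward model. In a real system the judge is a prompted language model; the heuristic `judge_score` function below is purely a stand-in, and all names are illustrative.

```python
def judge_score(task: str, answer: str) -> float:
    """Stand-in for a judge model: a trivial heuristic that rewards
    answers mentioning the task's final keyword. In practice this
    would be a prompted LLM returning a scalar rating."""
    keyword = task.split()[-1].lower()
    return float(keyword in answer.lower())

def label_preference(task: str, candidate_a: str, candidate_b: str) -> dict:
    """Compare two actor outputs with the judge and emit a preference
    pair in the (prompt, chosen, rejected) format commonly used for
    reward-model training."""
    score_a = judge_score(task, candidate_a)
    score_b = judge_score(task, candidate_b)
    chosen, rejected = (
        (candidate_a, candidate_b) if score_a >= score_b
        else (candidate_b, candidate_a)
    )
    return {"prompt": task, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "Explain photosynthesis",
    "Photosynthesis converts light into chemical energy.",
    "Plants are green.",
)
```

The design point is that the expensive human judgment is replaced by a cheap, repeatable judge call, so supervision scales with compute rather than with human hours; humans then audit the judge rather than every actor output.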

Related Publications

No publications in this area yet. Check back soon or view all research.