Scalable Oversight

Creating techniques for humans to effectively supervise AI systems on complex, high-stakes tasks.

The Problem

As AI systems reach and exceed human-level performance in specific domains, human supervisors can no longer easily tell whether the AI is doing the "right" thing. If an AI writes a million lines of code or designs a new protein, a human cannot simply check the output for safety and correctness in any reasonable amount of time.

Challenges:

  • The "Evaluation Gap" — The difficulty of judging whether an AI's output is actually better or just "sounds" more convincing.
  • Reward Misspecification — When outputs are hard to evaluate, we risk rewarding proxies for quality, such as answers that look correct rather than answers that are correct.
  • Complex Tasks — Oversight of tasks that take days or weeks for humans to perform once.

What We're Working On

  • RLAIF (RL from AI Feedback) — Using a "judge" AI to help humans evaluate the "actor" AI, scaling our ability to provide supervision.
  • Recursive Oversight — Developing hierarchies of AI systems that supervise each other, with humans at the very top.
  • Task Decomposition — Breaking complex tasks into smaller, human-verifiable pieces to ensure the final result is built on safe foundations.
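To make the RLAIF idea concrete, here is a minimal, hypothetical sketch of the preference-labeling step: a "judge" scores two outputs from an "actor" and emits a preference pair that could train a reward model. In a real system the judge is a prompted language model; the heuristic `judge_score` function below is purely a stand-in, and all names are illustrative.

```python
def judge_score(task: str, answer: str) -> float:
    """Stand-in for a judge model: a trivial heuristic that rewards
    answers mentioning the task's final keyword. In practice this
    would be a prompted LLM returning a scalar rating."""
    keyword = task.split()[-1].lower()
    return float(keyword in answer.lower())

def label_preference(task: str, candidate_a: str, candidate_b: str) -> dict:
    """Compare two actor outputs with the judge and emit a preference
    pair in the (prompt, chosen, rejected) format commonly used for
    reward-model training."""
    score_a = judge_score(task, candidate_a)
    score_b = judge_score(task, candidate_b)
    chosen, rejected = (
        (candidate_a, candidate_b) if score_a >= score_b
        else (candidate_b, candidate_a)
    )
    return {"prompt": task, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "Explain photosynthesis",
    "Photosynthesis converts light into chemical energy.",
    "Plants are green.",
)
```

The design point is that the expensive human judgment is replaced by a cheap, repeatable judge call, so supervision scales with compute rather than with human hours; humans then audit the judge rather than every actor output.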

Related Publications

No publications in this area yet. Check back soon or view all research.