Constitutional AI
Training AI systems to follow a set of explicit, human-defined principles and safety rules.
The Problem
Traditional fine-tuning (like RLHF) relies on human labels for millions of examples. This is slow, expensive, and often inconsistent. Moreover, the principles the model learns are implicit: buried in the reward model rather than stated anywhere explicitly. This makes it hard to audit or "patch" specific safety behaviors.
Key problems:
- Opaque Values — Not knowing exactly what ethical rules the model is following.
- Difficulty of Updating — To change a model's behavior, you might need to collect thousands of new human labels.
- Inconsistency — Human labelers often disagree, leading to muddled safety signals.
What We're Working On
- Explicit Principles — Training models to follow a "Constitution": a small set of human-readable rules (e.g., "be helpful," "never encourage violence").
- Self-Critique and Revision — Developing models that can use their own constitution to critique their drafts and rewrite them for safety.
- Auditability — By having explicit rules, we can trace a model's safety failures back to specific constitutional principles, making them easier to fix.
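The self-critique-and-revision step above can be sketched as a simple loop: generate a draft, ask the model to critique it against the constitution, and rewrite if a violation is found. This is a minimal sketch, assuming a stub `generate` function that stands in for a real language-model call; the constitution shown and the keyword-matching stub are illustrative, not a real training setup.

```python
# Illustrative constitution: a small set of human-readable rules.
CONSTITUTION = [
    "Be helpful.",
    "Never encourage violence.",
]

def generate(prompt):
    """Stub standing in for a language-model call (assumption, not a real API).

    For this sketch it treats the word "fight" as a constitutional
    violation and rewrites it; a real system would use the model itself.
    """
    if "critique" in prompt.lower():
        return ("Violates: Never encourage violence."
                if "fight" in prompt else "No violations found.")
    if "rewrite" in prompt.lower():
        # Recover the draft after the "Draft:" marker and soften it.
        return prompt.split("Draft:")[-1].replace("fight", "talk to").strip()
    return prompt

def constitutional_revision(draft):
    """Critique a draft against the constitution, then rewrite if needed."""
    rules = " ".join(CONSTITUTION)
    critique = generate(f"Critique this draft against {rules}. Draft: {draft}")
    if "No violations" not in critique:
        draft = generate(f"Rewrite to follow {rules}. Draft: {draft}")
    return draft

print(constitutional_revision("Go fight your neighbor."))
# Drafts that pass the critique step are returned unchanged.
```

Because the critique names the specific principle that was violated, the same loop also illustrates the auditability point: a safety failure traces back to one explicit rule rather than an opaque reward signal.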
Related Publications
1 paper in Constitutional AI