Constitutional AI
Training AI systems to follow a set of explicit, human-defined principles and safety rules.
The Problem
Traditional fine-tuning (like RLHF) relies on human labels for millions of examples. This is slow, expensive, and often inconsistent. Moreover, the principles the model learns are implicit: buried in the reward model rather than stated anywhere explicitly. This makes it hard to audit or "patch" specific safety behaviors.
Key problems:
- Opaque Values — Not knowing exactly what ethical rules the model is following.
- Difficulty of Updating — To change a model's behavior, you might need to collect thousands of new human labels.
- Inconsistency — Human labelers often disagree, leading to muddled safety signals.
What We're Working On
- Explicit Principles — Training models to follow a "Constitution": a small set of human-readable rules (e.g., "be helpful," "never encourage violence").
- Self-Critique and Revision — Developing models that can use their own constitution to critique their drafts and rewrite them for safety.
- Auditability — By having explicit rules, we can trace a model's safety failures back to specific constitutional principles, making them easier to fix.
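The self-critique-and-revision step above can be sketched as a simple loop: generate a draft, ask the model to critique it against the constitution, and rewrite if a violation is found. This is a minimal sketch, assuming a stub `generate` function that stands in for a real language-model call; the constitution shown and the keyword-matching stub are illustrative, not a real training setup.

```python
# Illustrative constitution: a small set of human-readable rules.
CONSTITUTION = [
    "Be helpful.",
    "Never encourage violence.",
]

def generate(prompt):
    """Stub standing in for a language-model call (assumption, not a real API).

    For this sketch it treats the word "fight" as a constitutional
    violation and rewrites it; a real system would use the model itself.
    """
    if "critique" in prompt.lower():
        return ("Violates: Never encourage violence."
                if "fight" in prompt else "No violations found.")
    if "rewrite" in prompt.lower():
        # Recover the draft after the "Draft:" marker and soften it.
        return prompt.split("Draft:")[-1].replace("fight", "talk to").strip()
    return prompt

def constitutional_revision(draft):
    """Critique a draft against the constitution, then rewrite if needed."""
    rules = " ".join(CONSTITUTION)
    critique = generate(f"Critique this draft against {rules}. Draft: {draft}")
    if "No violations" not in critique:
        draft = generate(f"Rewrite to follow {rules}. Draft: {draft}")
    return draft

print(constitutional_revision("Go fight your neighbor."))
# Drafts that pass the critique step are returned unchanged.
```

Because the critique names the specific principle that was violated, the same loop also illustrates the auditability point: a safety failure traces back to one explicit rule rather than an opaque reward signal.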
Related Publications
1 paper in Constitutional AI