Robustness

Building AI systems that behave reliably and safely even under adversarial conditions.

The Problem

Most AI models are surprisingly brittle. A small, intentional change to an input, called an "adversarial perturbation," can cause a state-of-the-art model to confidently produce the wrong answer. For safety-critical systems like autonomous vehicles or medical diagnostics, this brittleness is unacceptable.

Key issues:

  • Adversarial Attacks — Small, carefully crafted changes to images or text, often imperceptible to humans, that trick models into misclassification.
  • Distribution Shift — Accuracy dropping sharply when a model encounters data that differs even slightly from its training distribution.
  • Out-of-Distribution Detection — The difficulty models have in "knowing what they don't know."
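To make the first issue concrete, here is a minimal sketch of a one-step gradient-sign attack (in the style of FGSM) against a toy logistic-regression classifier. Everything here — the weights, the input, and the perturbation budget `eps` — is an illustrative assumption, not a description of any particular system; real attacks target deep networks via automatic differentiation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """One gradient-sign step: move x in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    # Gradient of the binary cross-entropy loss w.r.t. the input x is (p - y) * w.
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Toy classifier and a correctly classified point (all values hypothetical).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.2])   # score = w @ x + b = 0.8 > 0  -> predicted class 1
y = 1.0                    # true label

x_adv = fgsm_perturb(x, y, w, b, eps=0.5)
# Each coordinate of x_adv differs from x by at most eps, yet the
# perturbed score w @ x_adv + b is now negative: the prediction flips.
```

The point of the sketch is that the attacker never needs a large change: a bounded per-coordinate nudge, chosen using the model's own gradient, is enough to cross the decision boundary.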

What We're Working On

  • Formal Verification — Mathematically proving that a model's output will remain within a "safe" range for any input within a certain bound.
  • Adversarial Training — Training models against an "adversary" that actively finds their weaknesses, forcing the model to learn more robust features.
  • Uncertainty Estimation — Developing methods for models to signal high uncertainty when they encounter novel data, allowing for safe human intervention.
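As a sketch of the uncertainty-estimation idea above, one common (though not the only) approach is to flag inputs whose predictive distribution has high entropy. The logit values and the threshold below are illustrative assumptions; in practice the threshold would be tuned on held-out data.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def predictive_entropy(logits):
    """Entropy (in nats) of the softmax distribution; higher = less certain."""
    p = softmax(logits)
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical model outputs for a 3-class problem.
in_dist_logits = np.array([6.0, 1.0, 0.5])  # one class dominates: confident
ood_logits = np.array([1.1, 1.0, 0.9])      # near-uniform: uncertain

threshold = 0.8  # nats; an assumed cutoff, tuned on validation data in practice

def decide(logits):
    """Act on confident predictions; defer uncertain ones to a human."""
    if predictive_entropy(logits) > threshold:
        return "defer to human"
    return "accept prediction"
```

A confident prediction yields entropy near zero, while a near-uniform distribution approaches log(3) ≈ 1.1 nats, so the threshold cleanly separates the two cases and provides the hook for safe human intervention.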

Related Publications

No publications in this area yet.