Interpretability
Understanding the internal mechanisms of neural networks to predict and verify model behavior.
The Problem
Neural networks are often "black boxes": we can observe what goes in and what comes out, but the internal logic that transforms input to output is opaque. This is a major safety risk: if we don't know why a model made a decision, we can't be sure it will make safe decisions in novel or high-stakes scenarios.
Challenges include:
- Feature Attribution — Identifying which parts of the input were most important for a specific output.
- Mechanistic Understanding — Reverse-engineering the "circuits" or algorithms learned by the model internally.
- Scalability — As models grow to billions of parameters, interpretability methods must scale with them, surfacing meaningful patterns without being buried in noise.
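To make the feature-attribution challenge above concrete, here is a minimal sketch of one of the simplest attribution methods, input-times-gradient, applied to a toy linear "model". The weights and input are made up for illustration; this is not any particular tool's API.

```python
import numpy as np

# Toy "model": a linear scorer f(x) = w . x, standing in for a network's logit.
# For a linear model, the gradient of the output w.r.t. the input is just w,
# so the input-times-gradient attribution for feature i is w_i * x_i.
w = np.array([2.0, -1.0, 0.5])   # hypothetical learned weights
x = np.array([1.0, 3.0, 4.0])    # one input example

output = float(w @ x)            # model output
gradient = w                     # d(output)/dx for a linear model
attribution = x * gradient       # input-times-gradient, one score per feature

# For linear models the attributions exactly decompose the output.
assert np.isclose(attribution.sum(), output)
print(attribution)               # [ 2. -3.  2.]
```

For real networks the gradient is computed by backpropagation and the decomposition is only approximate, which is part of why attribution at scale remains an open challenge.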
What We're Working On
- Mechanistic Interpretability — Studying small-to-medium models to find universal "circuits" that can be generalized to larger systems.
- Feature Visualization — Developing tools to see what individual neurons or layers are "looking for" in visual and textual data.
- Verification — Using interpretability to verify that a model is reasoning correctly, rather than exploiting spurious correlations in its training data.
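The feature-visualization idea above is often implemented as activation maximization: run gradient ascent on the input to find what a unit responds to most strongly. The sketch below does this for a single made-up ReLU "neuron"; the weights, learning rate, and unit-norm constraint are all illustrative assumptions, not a description of our tooling.

```python
import numpy as np

# Activation maximization: ascend the input x to maximize a = relu(w . x).
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])   # hypothetical neuron weights
x = rng.normal(size=3)           # start from a random input

lr = 0.1
for _ in range(100):
    pre = w @ x
    grad = w if pre > 0 else np.zeros_like(w)  # d relu(w . x) / dx
    x = x + lr * grad            # step uphill on the activation
    x = x / np.linalg.norm(x)    # constrain the input to the unit sphere

# The optimized input aligns with w: the direction this neuron "looks for".
cos = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
print(cos)                       # approaches 1.0 as x aligns with w
```

For image or text models the same loop runs over pixels or embeddings with a real backward pass, plus regularizers to keep the optimized input interpretable.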
Related Publications
No publications in this area yet. Check back soon or view all research.