📊 Real Detection Data

Real LLM Drift Detection Results: Exact Outputs, Real Scores

Published March 13, 2026 · 6 min read · DriftWatch Research

Data source: These are real measurements from running the DriftWatch detection algorithm on 5 production-style prompts via the Claude API: two consecutive runs, same model checkpoint. Exact baseline and check outputs are shown below. No extrapolation. Your results will vary by model, provider, and prompt set.

Before launching DriftWatch publicly, we ran our own test suite to validate the algorithm. We expected near-zero drift (same model, two consecutive runs). We found a 0.575 on the first attempt. Here's the exact data.

How the Drift Score Works

The composite score combines: word_similarity (edit distance between outputs), validator_drift (pass/fail on format validators), and length_drift (normalized token count change).
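The three components can be sketched in a few lines. This is an illustrative reconstruction from the component descriptions above, not the DriftWatch source; in particular, the weights are assumptions and will not reproduce the exact scores reported below.

```python
# Sketch of a composite drift score from the three components described
# above. Weights are illustrative assumptions, not DriftWatch's actual values.
import difflib

def word_similarity_drift(baseline: str, check: str) -> float:
    """1 minus a character-level similarity ratio (0.0 = identical outputs)."""
    return 1.0 - difflib.SequenceMatcher(None, baseline, check).ratio()

def validator_drift(baseline_passes: list, check_passes: list) -> float:
    """Fraction of format validators whose pass/fail result flipped."""
    flips = sum(b != c for b, c in zip(baseline_passes, check_passes))
    return flips / len(baseline_passes)

def length_drift(baseline: str, check: str) -> float:
    """Normalized change in token (here: whitespace word) count."""
    b, c = len(baseline.split()), len(check.split())
    return abs(b - c) / max(b, c, 1)

def composite_drift(baseline, check, baseline_passes, check_passes,
                    weights=(0.5, 0.3, 0.2)):  # assumed weights
    w_v, w_w, w_l = weights
    return (w_v * validator_drift(baseline_passes, check_passes)
            + w_w * word_similarity_drift(baseline, check)
            + w_l * length_drift(baseline, check))
```

With weights summing to 1, the composite stays in [0, 1], which makes per-prompt thresholds comparable across categories.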

The Results (Exact Inputs and Outputs)

| Prompt | Category | Score | Baseline Output | Check Output |
|---|---|---|---|---|
| inst-01 | Instruction following | 0.575 ⚠️ | "Neutral." | "Neutral" |
| json-01 | JSON extraction | 0.316 | {"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."} (spaced) | {"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"} (compact, period stripped) |
| json-02 | JSON array | 0.000 | ["built", "tested", "deployed"] | Identical |
| json-03 | Nested JSON | 0.000 | Full nested object | Identical |
| inst-02 | Format compliance | 0.173 | Short numbered list | More verbose version |

Summary: avg drift 0.213, max 0.575, false positive rate 0%.

The 0.575: A Trailing Period Regression

Prompt: "Classify the sentiment of this review as exactly one word — positive, negative, or neutral. Reply with only that single word, nothing else."

Baseline output: "Neutral." (with trailing period)

Check output: "Neutral" (no period)

Both outputs look correct to a human. Both pass the validators: "Neutral" and "Neutral." each contain one word in the accepted set. But:

# Written against baseline behavior โ€” breaks on check output
if response.strip() == "Neutral.":
    sentiment = "neutral"
    
# Written against check output โ€” breaks on baseline
if response.strip() == "Neutral":
    sentiment = "neutral"
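One mitigation is to normalize the output before matching, so either variant parses and anything outside the accepted set fails loudly instead of silently. This is an illustrative sketch, not part of the DriftWatch algorithm:

```python
# Normalization sketch that tolerates both "Neutral." and "Neutral".
# Illustrative mitigation only, not DriftWatch code.
ACCEPTED = {"positive", "negative", "neutral"}

def normalize_label(response: str) -> str:
    """Strip surrounding whitespace and trailing punctuation, lowercase."""
    return response.strip().rstrip(".!?").lower()

def parse_sentiment(response: str) -> str:
    label = normalize_label(response)
    if label not in ACCEPTED:
        # Fail loudly on unexpected output instead of silently misbehaving.
        raise ValueError(f"unexpected sentiment output: {response!r}")
    return label
```

The key design choice is the explicit `ValueError`: a drifted output that escapes the accepted set now produces a visible failure rather than a silently wrong branch.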

The composite drift score: validator_drift=0.5, length_drift=0.125, word_similarity=0.0, overall=0.575.

This is the core problem with LLM behavioral drift: it doesn't trigger exceptions. A trailing period change produces no error, no log entry, no alert, just silent wrong behavior in any parser that was written against the old output format.

The 0.316: A JSON Format Shift

Prompt: "Return a JSON object for this contact record. Keys: name, email, company."

Baseline output:

{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}

Check output:

{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}

Two changes: whitespace removed after colons and commas, and the trailing period stripped from the company value. json.loads() returns a valid Python dict for both. But any downstream code that works on the raw string, such as a regex written against the spaced format or an exact match on the old company value, silently gets the wrong answer.
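Concretely, structured parsing survives the shift while string-level parsing does not. The regex below is hypothetical, written to illustrate a parser built against the baseline's spacing:

```python
# json.loads handles both formats; a spacing-sensitive regex does not.
import json
import re

baseline = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
check = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'

# Structured parsing: both are valid JSON, and the name field matches.
assert json.loads(baseline)["name"] == json.loads(check)["name"]

# String parsing: a regex expecting a space after the colon (hypothetical,
# written against the baseline) finds nothing in the compact output.
name_re = re.compile(r'"name": "([^"]+)"')
print(name_re.search(baseline).group(1))  # Sarah Chen
print(name_re.search(check))              # None, a silent failure
```

Note the dicts themselves are not identical either: the company value lost its trailing period, so even exact-value comparisons against the baseline break.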

Why the Stable Prompts Matter

Two of the five prompts scored exactly 0.000, and a third stayed low at 0.173. This is important: most prompts, most of the time, are stable. The risk is concentrated in format-sensitive prompts where small output changes break downstream string processing.

What to Do About It

  1. Monitor production prompts, not toy examples: the prompts that break are the ones already in prod
  2. Set behavioral baselines, not just error-rate alerts: a 0.575 score produces no exception
  3. Separate fast and slow degradation: a 0.575 spike is obvious; a prompt creeping from 0.05 to 0.28 over weeks is not
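Point 3 can be sketched with a two-rule check over a prompt's score history: an absolute threshold catches spikes, a rolling mean catches creep. The thresholds and window here are illustrative assumptions, not DriftWatch defaults:

```python
# Sketch: separate fast spikes from slow creep in a drift-score history.
# Thresholds and window size are illustrative assumptions.
SPIKE_THRESHOLD = 0.4   # assumed: flag a single bad run immediately
CREEP_THRESHOLD = 0.2   # assumed: flag a sustained elevated average
WINDOW = 14             # assumed: e.g. two weeks of daily checks

def classify_drift(history: list) -> str:
    """Classify the latest run given the score history (oldest first)."""
    latest = history[-1]
    if latest >= SPIKE_THRESHOLD:
        return "spike"
    window = history[-WINDOW:]
    if sum(window) / len(window) >= CREEP_THRESHOLD:
        return "creep"
    return "ok"
```

A history like `[0.05, ..., 0.575]` trips the spike rule instantly, while a run of scores hovering around 0.25 never trips it but still raises the rolling mean past the creep threshold.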

⚡ Run This on Your Production Prompts

Add your prompts and see your actual drift scores. Free tier: 3 prompts, no card, ~5 min setup.

Start Monitoring Free →