📊 Real Detection Data

Real LLM Drift Detection Results: Exact Outputs, Real Scores

Published March 13, 2026 · 6 min read · DriftWatch Research

Data source: These are real measurements from running the DriftWatch detection algorithm on 5 production-style prompts via the Claude API: two consecutive runs, same model checkpoint. Exact baseline and check outputs are shown below. No extrapolation. Your results will vary by model, provider, and prompt set.

Before launching DriftWatch publicly, we ran our own test suite to validate the algorithm. We expected near-zero drift (same model, two consecutive runs). We found a 0.575 on the first attempt. Here's the exact data.

How the Drift Score Works

The composite score combines: word_similarity (edit distance between outputs), validator_drift (pass/fail on format validators), and length_drift (normalized token count change).
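The three components can be sketched in a few lines. This is an illustrative reconstruction from the component descriptions above, not the DriftWatch source; in particular, the weights are assumptions and will not reproduce the exact scores reported below.

```python
# Sketch of a composite drift score from the three components described
# above. Weights are illustrative assumptions, not DriftWatch's actual values.
import difflib

def word_similarity_drift(baseline: str, check: str) -> float:
    """1 minus a character-level similarity ratio (0.0 = identical outputs)."""
    return 1.0 - difflib.SequenceMatcher(None, baseline, check).ratio()

def validator_drift(baseline_passes: list, check_passes: list) -> float:
    """Fraction of format validators whose pass/fail result flipped."""
    flips = sum(b != c for b, c in zip(baseline_passes, check_passes))
    return flips / len(baseline_passes)

def length_drift(baseline: str, check: str) -> float:
    """Normalized change in token (here: whitespace word) count."""
    b, c = len(baseline.split()), len(check.split())
    return abs(b - c) / max(b, c, 1)

def composite_drift(baseline, check, baseline_passes, check_passes,
                    weights=(0.5, 0.3, 0.2)):  # assumed weights
    w_v, w_w, w_l = weights
    return (w_v * validator_drift(baseline_passes, check_passes)
            + w_w * word_similarity_drift(baseline, check)
            + w_l * length_drift(baseline, check))
```

With weights summing to 1, the composite stays in [0, 1], which makes per-prompt thresholds comparable across categories.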

The Results (Exact Inputs and Outputs)

| Prompt | Category | Score | Baseline Output | Check Output |
|---|---|---|---|---|
| inst-01 | Instruction following | 0.575 ⚠️ | "Neutral." | "Neutral" |
| json-01 | JSON extraction | 0.316 | {"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."} (spaced) | {"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"} (compact, period stripped) |
| json-02 | JSON array | 0.000 | ["built", "tested", "deployed"] | Identical |
| json-03 | Nested JSON | 0.000 | Full nested object | Identical |
| inst-02 | Format compliance | 0.173 | Short numbered list | More verbose version |

Summary: avg drift 0.213, max 0.575, false positive rate 0%.

The 0.575: A Trailing Period Regression

Prompt: "Classify the sentiment of this review as exactly one word — positive, negative, or neutral. Reply with only that single word, nothing else."

Baseline output: "Neutral." (with trailing period)

Check output: "Neutral" (no period)

Both outputs look correct to a human. Both pass the validators: "Neutral" and "Neutral." each contain one word in the accepted set. But:

# Written against baseline behavior โ€” breaks on check output
if response.strip() == "Neutral.":
    sentiment = "neutral"
    
# Written against check output โ€” breaks on baseline
if response.strip() == "Neutral":
    sentiment = "neutral"
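One mitigation is to normalize the output before matching, so either variant parses and anything outside the accepted set fails loudly instead of silently. This is an illustrative sketch, not part of the DriftWatch algorithm:

```python
# Normalization sketch that tolerates both "Neutral." and "Neutral".
# Illustrative mitigation only, not DriftWatch code.
ACCEPTED = {"positive", "negative", "neutral"}

def normalize_label(response: str) -> str:
    """Strip surrounding whitespace and trailing punctuation, lowercase."""
    return response.strip().rstrip(".!?").lower()

def parse_sentiment(response: str) -> str:
    label = normalize_label(response)
    if label not in ACCEPTED:
        # Fail loudly on unexpected output instead of silently misbehaving.
        raise ValueError(f"unexpected sentiment output: {response!r}")
    return label
```

The key design choice is the explicit `ValueError`: a drifted output that escapes the accepted set now produces a visible failure rather than a silently wrong branch.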

The composite drift score: validator_drift=0.5, length_drift=0.125, word_similarity=0.0, overall=0.575.

This is the core problem with LLM behavioral drift: it doesn't trigger exceptions. A trailing period change produces no error, no log entry, no alert, just silent wrong behavior in any parser that was written against the old output format.

The 0.316: A JSON Format Shift

Prompt: "Return a JSON object for this contact record. Keys: name, email, company."

Baseline output:

{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}

Check output:

{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}

Two changes: whitespace removed after colons and commas, and the trailing period stripped from the company value. json.loads() returns a valid Python dict for both. But any downstream code that works on the raw string, such as a regex written against the spaced format or an exact match on the old company value, silently gets the wrong answer.
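Concretely, structured parsing survives the shift while string-level parsing does not. The regex below is hypothetical, written to illustrate a parser built against the baseline's spacing:

```python
# json.loads handles both formats; a spacing-sensitive regex does not.
import json
import re

baseline = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
check = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'

# Structured parsing: both are valid JSON, and the name field matches.
assert json.loads(baseline)["name"] == json.loads(check)["name"]

# String parsing: a regex expecting a space after the colon (hypothetical,
# written against the baseline) finds nothing in the compact output.
name_re = re.compile(r'"name": "([^"]+)"')
print(name_re.search(baseline).group(1))  # Sarah Chen
print(name_re.search(check))              # None, a silent failure
```

Note the dicts themselves are not identical either: the company value lost its trailing period, so even exact-value comparisons against the baseline break.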

Why the Stable Prompts Matter

Two of the five prompts scored exactly 0.000, and a third stayed low at 0.173. This is important: most prompts, most of the time, are stable. The risk is concentrated in format-sensitive prompts where small output changes break downstream string processing.

What to Do About It

  1. Monitor production prompts, not toy examples: the prompts that break are the ones already in prod
  2. Set behavioral baselines, not just error-rate alerts: a 0.575 score produces no exception
  3. Separate fast and slow degradation: a 0.575 spike is obvious; a prompt creeping from 0.05 to 0.28 over weeks is not
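Point 3 can be sketched with a two-rule check over a prompt's score history: an absolute threshold catches spikes, a rolling mean catches creep. The thresholds and window here are illustrative assumptions, not DriftWatch defaults:

```python
# Sketch: separate fast spikes from slow creep in a drift-score history.
# Thresholds and window size are illustrative assumptions.
SPIKE_THRESHOLD = 0.4   # assumed: flag a single bad run immediately
CREEP_THRESHOLD = 0.2   # assumed: flag a sustained elevated average
WINDOW = 14             # assumed: e.g. two weeks of daily checks

def classify_drift(history: list) -> str:
    """Classify the latest run given the score history (oldest first)."""
    latest = history[-1]
    if latest >= SPIKE_THRESHOLD:
        return "spike"
    window = history[-WINDOW:]
    if sum(window) / len(window) >= CREEP_THRESHOLD:
        return "creep"
    return "ok"
```

A history like `[0.05, ..., 0.575]` trips the spike rule instantly, while a run of scores hovering around 0.25 never trips it but still raises the rolling mean past the creep threshold.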

⚡ Run This on Your Production Prompts

Add your prompts and see your actual drift scores. Free tier: 3 prompts, no card, ~5 min setup.

Start Monitoring Free →