Before launching DriftWatch publicly, we ran our own test suite to validate the algorithm. We expected near-zero drift (same model, two consecutive runs). Instead we found a drift score of 0.575 on the first attempt. Here's the exact data.
The composite score combines: word_similarity (edit distance between outputs), validator_drift (pass/fail on format validators), and length_drift (normalized token count change).
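As a rough illustration of the three components (not DriftWatch's actual implementation: the weights and the exact normalization below are assumptions), the pieces might be computed like this:

```python
def word_similarity_drift(a: str, b: str) -> float:
    """Normalized Levenshtein edit distance between two outputs (0 = identical)."""
    m, n = len(a), len(b)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,                     # deletion
                dist[j - 1] + 1,                 # insertion
                prev + (a[i - 1] != b[j - 1]),   # substitution
            )
    return dist[n] / max(m, n, 1)

def length_drift(a: str, b: str) -> float:
    """Normalized change in token count (whitespace tokens as a stand-in)."""
    la, lb = len(a.split()), len(b.split())
    return abs(la - lb) / max(la, lb, 1)

def composite_score(word_sim: float, validator: float, length: float) -> float:
    # Hypothetical weights -- the real weighting is not documented here.
    return 0.5 * validator + 0.3 * word_sim + 0.2 * length
```

For the trailing-period case below, `word_similarity_drift("Neutral.", "Neutral")` is 1/8 = 0.125: one edit over an eight-character baseline.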
| Prompt | Category | Score | Baseline Output | Check Output |
|---|---|---|---|---|
| inst-01 | Instruction following | 0.575 ⚠️ | "Neutral." | "Neutral" |
| json-01 | JSON extraction | 0.316 | `{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}` (spaced) | `{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}` (compact, period stripped) |
| json-02 | JSON array | 0.000 | `["built", "tested", "deployed"]` | Identical |
| json-03 | Nested JSON | 0.000 | Full nested object | Identical |
| inst-02 | Format compliance | 0.173 | Short numbered list | More verbose version |
Summary: avg drift 0.213, max 0.575, false positive rate 0%.
Prompt: "Classify the sentiment of this review as exactly one word: positive, negative, or neutral. Reply with only that single word, nothing else."
Baseline output: "Neutral." (with trailing period)
Check output: "Neutral" (no period)
Both outputs look correct to a human. Both pass the validators: "Neutral" and "Neutral." each contain one word in the accepted set. But:
```python
# Written against baseline behavior: breaks on check output
if response.strip() == "Neutral.":
    sentiment = "neutral"

# Written against check output: breaks on baseline
if response.strip() == "Neutral":
    sentiment = "neutral"
```
The composite drift score: validator_drift=0.5, length_drift=0.125, word_similarity=0.0, overall=0.575.
This is the core problem with LLM behavioral drift: it doesn't trigger exceptions. A trailing-period change produces no error, no log entry, no alert, just silent wrong behavior in any parser that was written against the old output format.
Prompt: "Return a JSON object for this contact record. Keys: name, email, company."
Baseline output:

```json
{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}
```

Check output:

```json
{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}
```
Two changes: whitespace removed around the key-value pairs, and the trailing period stripped from the company field value. `json.loads()` returns a valid Python dict for both. But:

- Any exact-match check (`baseline == current`) fails
- "Acme Corp." became "Acme Corp", so data fidelity changed

Three of five prompts scored 0.000 or near it. This is important: most prompts, most of the time, are stable. The risk is concentrated in format-sensitive prompts where small output changes break downstream string processing.
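A parsed comparison separates exactly these two cases: the whitespace change disappears once both outputs are loaded, while the stripped period shows up as a real value change. A sketch using the example record above:

```python
import json

baseline = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
current = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'

# Exact string comparison fails even for purely cosmetic differences.
exact_match = baseline == current  # False

# Parsed comparison isolates the fields whose values actually changed.
b, c = json.loads(baseline), json.loads(current)
changed_fields = {k for k in b if b.get(k) != c.get(k)}  # {"company"}
```

Here only `company` differs semantically; the whitespace change is invisible after parsing.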
Add your prompts and see your actual drift scores. Free tier: 3 prompts, no card, ~5 min setup.
Start Monitoring Free →