May 29, 2026 · TrueLLMs
Detecting Frontier Model Substitution: What Actually Works in 2026
Old methods fail because self-consistency is not identity, mock fingerprints are fiction, and behavior probes can be forged by system prompts. Here is what actually works.
The first wave of LLM fingerprinting relied on self-identification, stylometry, and behavior probes. Those methods looked plausible in blog posts and broke the moment a proxy operator added a five-line system prompt. The second wave added logprobs and ITT rhythm, which are stronger but still incomplete when gateways strip logprobs or buffer streams. In mid-2026 the detection surface has consolidated around three signals that are hard to forge at scale: deterministic tokenizer-family slope fingerprints, capability floors, and differential-mode distribution tests against a trusted reference endpoint.
This post is about why the old signals failed, how the three real signals work, and where the honest boundaries still are.
Why the old tests do not hold up
Three categories of false confidence:
- Self-consistency is not identity. A model can be consistent with itself and still be a downgraded fine-tune or a distilled variant. Consistency only tells you the endpoint is stable, not that it is the flagship you paid for.
- Mock fingerprint libraries are fiction. Any "reference fingerprint" built from public blog posts or estimated token boundaries is not a measured ground truth. Comparing against fiction produces fiction.
- Behavior probes are forgeable. System prompts, output rewriting layers, and gateway post-processing can change refusal templates, Markdown density, and even "deterministic puzzle" answers without changing the underlying model. A proxy that knows the probe set can pass all of them and still serve a cheaper model for real traffic.
The three real signals
TrueLLMs currently scores on five dimensions, but the three that carry the most weight and are hardest to fake are:
1. Deterministic tokenizer-family fingerprint (slope method)
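The slope method regresses the reported prompt_tokens against the number of times a probe unit is repeated in the prompt. The intercept absorbs fixed overhead such as chat-template tokens, so as long as repeated units tokenize independently, the slope is the exact token count of one probe unit under the server's tokenizer. Because o200k_base is the current OpenAI tokenizer from GPT-4o through GPT-5.x, a match against exact js-tiktoken counts is deterministic evidence that the server is counting tokens with the OpenAI family tokenizer. For the flagship tier that is consistent with GPT-5.x, but the tokenizer alone does not distinguish between models inside the family.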
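As a rough illustration of the regression step, here is a minimal sketch in Python. The function names, the tolerance, and the readings are hypothetical; in a real audit the `reported` values would come from the `prompt_tokens` field the endpoint returns, and the expected per-unit count from a local tokenizer such as js-tiktoken.

```python
from __future__ import annotations


def fit_slope(unit_counts: list[int], prompt_tokens: list[int]) -> tuple[float, float]:
    """Least-squares fit of prompt_tokens = intercept + slope * unit_count."""
    n = len(unit_counts)
    if n < 2 or n != len(prompt_tokens):
        raise ValueError("need at least two paired observations")
    mean_x = sum(unit_counts) / n
    mean_y = sum(prompt_tokens) / n
    sxx = sum((x - mean_x) ** 2 for x in unit_counts)
    if sxx == 0:
        raise ValueError("unit counts must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(unit_counts, prompt_tokens))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def matches_expected(slope: float, expected_tokens_per_unit: int, tol: float = 1e-6) -> bool:
    """True when the fitted slope equals the locally computed per-unit count."""
    return abs(slope - expected_tokens_per_unit) <= tol


# Hypothetical readings: 11 tokens of fixed overhead, 7 tokens per probe unit.
counts = [1, 2, 4, 8, 16]
reported = [11 + 7 * k for k in counts]
slope, intercept = fit_slope(counts, reported)
```

With these made-up readings the fit recovers a slope of 7 and an intercept of 11, so `matches_expected(slope, 7)` holds while `matches_expected(slope, 8)` does not.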
For Claude and Gemini, whose tokenizers are closed-source, absolute mode cannot compute the exact count locally. Differential mode solves this: the trusted reference endpoint's prompt_tokens slope is the exact ground-truth tokenizer count for the real model. If the audited endpoint's slope differs from the reference slope for the same units, the audited endpoint is reporting counts from a different tokenizer. The mismatch itself is deterministic, it points to substitution, and it requires no local tokenizer.
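The differential comparison can be sketched the same way. This is an assumption-laden example, not the TrueLLMs implementation: the unit names, the readings, and the `tol` threshold are all hypothetical, and each list holds the `prompt_tokens` values reported for the same repetition counts.

```python
from __future__ import annotations


def slope(unit_counts: list[int], prompt_tokens: list[int]) -> float:
    """Least-squares slope of prompt_tokens against unit_counts."""
    n = len(unit_counts)
    mean_x = sum(unit_counts) / n
    mean_y = sum(prompt_tokens) / n
    sxx = sum((x - mean_x) ** 2 for x in unit_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(unit_counts, prompt_tokens))
    return sxy / sxx


def mismatched_units(
    counts: list[int],
    reference: dict[str, list[int]],
    audited: dict[str, list[int]],
    tol: float = 0.01,
) -> list[str]:
    """Names of probe units whose per-unit token count differs between endpoints."""
    flagged = []
    for unit in sorted(reference.keys() & audited.keys()):
        if abs(slope(counts, reference[unit]) - slope(counts, audited[unit])) > tol:
            flagged.append(unit)
    return flagged


# Hypothetical readings for two probe units at 1, 2, 4 and 8 repetitions.
counts = [1, 2, 4, 8]
reference = {"unit_a": [15, 20, 30, 50], "unit_b": [13, 16, 22, 34]}
audited = {"unit_a": [15, 20, 30, 50], "unit_b": [14, 18, 26, 42]}
flagged = mismatched_units(counts, reference, audited)
```

Here `unit_a` agrees on both endpoints, while `unit_b` costs 3 tokens per repetition on the reference and 4 on the audited endpoint, so only `unit_b` is flagged.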
2. Capability floor
Capability floor uses a small set of hard questions with ground-truth grading. The test is not about style or preference; it is about whether the model can clear a minimum behavioral bar expected from the claimed tier. In differential mode, the same items are run against both the audited endpoint and the trusted reference. A large delta in pass rate, especially with specific regressed items, is concrete evidence of a downgrade. It is weaker than the tokenizer signal because fine-tuning or safety layers can also cause misses, but it is independent of tokenizer tricks.
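A minimal sketch of the differential capability comparison, assuming each item has already been graded to a pass/fail boolean. The item IDs and the function name are hypothetical.

```python
from __future__ import annotations


def capability_delta(
    reference: dict[str, bool], audited: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return (reference pass rate minus audited pass rate, regressed item IDs)."""
    shared = sorted(reference.keys() & audited.keys())
    if not shared:
        raise ValueError("no shared items to compare")
    ref_rate = sum(reference[i] for i in shared) / len(shared)
    aud_rate = sum(audited[i] for i in shared) / len(shared)
    # Regressed items: the reference passes, the audited endpoint fails.
    regressed = [i for i in shared if reference[i] and not audited[i]]
    return ref_rate - aud_rate, regressed


# Hypothetical graded results for four hard items.
reference = {"q1": True, "q2": True, "q3": True, "q4": False}
audited = {"q1": True, "q2": False, "q3": False, "q4": False}
delta, regressed = capability_delta(reference, audited)
```

In this example the pass rate drops from 0.75 to 0.25 and the regressed items are `q2` and `q3`. Listing the specific regressed items matters because a reviewer can re-run them by hand.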
3. Differential mode (MMD + capability delta)
Differential mode requires the user to supply their own trusted official API key as a reference. TrueLLMs runs identical prompts against both endpoints and compares the results. MMD tests whether the response distributions differ; capability deltas measure whether the audited endpoint fails items the reference passes. Neither fabricates a baseline.
This is the strongest "is this the real thing?" test available in the current framework, but it is only as strong as the reference endpoint. If your reference key is itself compromised, the comparison is compromised.
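To make the MMD comparison concrete, here is a small sketch of a two-sample MMD permutation test over response strings. The character-level agreement kernel, the 64-character window, the permutation count, and the sample responses are assumptions chosen for illustration; they are not claimed to be what TrueLLMs uses.

```python
from __future__ import annotations

import random


def agreement_kernel(a: str, b: str, length: int = 64) -> float:
    """Fraction of the first `length` positions where the padded strings agree."""
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length


def mmd_squared(xs: list[str], ys: list[str]) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    kxx = sum(agreement_kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(agreement_kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(agreement_kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


def permutation_p_value(xs: list[str], ys: list[str], n_perm: int = 200, seed: int = 0) -> float:
    """Share of random relabelings whose MMD is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd_squared(pooled[: len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical samples: the two endpoints answer the same prompt differently.
reference_samples = ["aaaa"] * 10
audited_samples = ["bbbb"] * 10
p_value = permutation_p_value(reference_samples, audited_samples)
```

A small p-value means the audited responses are unlikely to come from the same distribution as the reference responses. In practice the samples would be drawn at temperature above 0 from identical prompts on both endpoints.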
Honest boundaries
There are four hard limits that the current framework cannot remove:
- Closed-source tokenizers need differential mode. Without a trusted reference, Claude and Gemini tokenizers cannot be verified locally.
- Image models are not covered. gpt-image-2 and similar models expose no chat prompt_tokens to fit a tokenizer slope against and have no text capability probes, and MMD as used here is a text-distribution test. Detecting image-model substitution would require image statistics and latency fingerprinting, which are currently unsupported.
- A lone tokenizer mismatch is only likely evidence. Gateways can normalize billing through a different tokenizer family while still serving the real model. A mismatch needs corroboration from another scored dimension before it can be treated as confirmed.
- MMD needs temperature > 0 and sufficient samples. At temperature 0 or with small sample sizes, the test is unavailable by design. TrueLLMs does not invent a baseline.
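Two of these limits, the corroboration requirement and the MMD preconditions, can be expressed as simple gating logic. The sketch below is hypothetical: the verdict labels, the `min_samples` default of 20, and the `Evidence` fields are assumptions for illustration, not TrueLLMs' actual scoring rules.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    # None means the signal could not be measured for this audit.
    tokenizer_mismatch: Optional[bool]
    capability_regressed: Optional[bool]
    mmd_rejects: Optional[bool]


def mmd_available(temperature: float, n_samples: int, min_samples: int = 20) -> bool:
    """MMD is only run with sampling enabled and enough responses per endpoint."""
    return temperature > 0 and n_samples >= min_samples


def verdict(e: Evidence) -> str:
    """A lone tokenizer mismatch is 'likely'; corroboration upgrades it."""
    corroborated = bool(e.capability_regressed) or bool(e.mmd_rejects)
    if e.tokenizer_mismatch and corroborated:
        return "confirmed"
    if e.tokenizer_mismatch:
        return "likely"
    if corroborated:
        return "suspicious"  # hypothetical label for non-tokenizer evidence alone
    return "no evidence"


lone = verdict(Evidence(tokenizer_mismatch=True, capability_regressed=None, mmd_rejects=None))
backed = verdict(Evidence(tokenizer_mismatch=True, capability_regressed=True, mmd_rejects=None))
```

Under these assumptions `lone` is "likely" and `backed` is "confirmed". An unavailable signal is kept as `None` so it is never counted as either agreement or disagreement.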
References
- Gao et al. Model Equality Testing: Which Model is this API Serving? ICLR 2025. arXiv:2410.20247.
- Pasquini et al. LLMmap: Fingerprinting Large Language Models. USENIX Security 2025. arXiv:2407.15847.
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs. arXiv:2504.04715, 2025.
- Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test. arXiv:2506.06975, 2025.
- IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation. arXiv:2602.22700, 2026.
- Log Probability Tracking of LLM APIs. arXiv:2512.03816, 2025.
Run an audit against your own proxy. It takes about a minute and stays in your browser.