Methodology

Last reviewed May 2026. This page documents every detection dimension TrueLLMs uses, the academic work it is built on, and the thresholds we ship by default. If you find a mistake, file an issue.

Why audit a proxy at all?

LLM proxies and aggregator gateways have two recurring failure modes. Either they inflate token counts in their usage blocks so the bill is larger than it should be, or they silently substitute a cheaper model for the one the user asked for. Both are difficult to spot from a single response. The right unit of analysis is a small batch of probes and the statistical signature they produce.

TrueLLMs runs that batch in your browser. It never persists keys, never retains responses, and prints every signal it derived along with the raw data, so you can verify the conclusion.

Two audit phases

Usage audit compares usage.prompt_tokens and usage.completion_tokens from the API against a local recount using the model's own tokenizer (cl100k_base for GPT-3.5-class models, o200k_base for GPT-4o/5, model-specific BPEs for Claude and Gemini). Ratios near 1.0 are normal; consistent ratios above 1.05 are suspicious; ratios above 1.20 are a red flag.
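The ratio check above can be sketched in a few lines. This is only an illustration: the function and type names are hypothetical, and reading "consistent" as "the median ratio over the probe batch" is our assumption here, not a documented rule.

```typescript
type UsageFlag = "normal" | "suspicious" | "red-flag";

// Thresholds quoted in the text above; the constant names are illustrative.
const SUSPICIOUS_RATIO = 1.05;
const RED_FLAG_RATIO = 1.2;

// Ratio of the token count reported by the API to the local recount.
function usageRatio(reportedTokens: number, recountedTokens: number): number {
  if (recountedTokens <= 0) throw new Error("local recount must be positive");
  return reportedTokens / recountedTokens;
}

// Assumption: a batch is summarised by its median ratio, so that a single
// outlier probe does not flip the flag on its own.
function classifyUsage(ratios: number[]): UsageFlag {
  if (ratios.length === 0) throw new Error("need at least one probe");
  const sorted = [...ratios].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  if (median > RED_FLAG_RATIO) return "red-flag";
  if (median > SUSPICIOUS_RATIO) return "suspicious";
  return "normal";
}
```

For example, a batch whose ratios cluster around 1.10 would come back as "suspicious", while one clustered around 1.00 would come back as "normal".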

Identity audit aggregates 13 weighted signals into a single verdict: matches, inconclusive, likely-substituted, or confirmed-substituted. The signals are not all statistically independent, since several share the same probe responses, so we treat the aggregate as a layered scorecard rather than a 13-way Bayes update.

The 13 dimensions

  1. Logprobs Fingerprint — weight 17%
  2. Tokenizer Boundary Probe — weight 15%
  3. LLMmap Active Probing — weight 15%
  4. Model Equality Testing (MMD) — weight 12%
  5. Inter-Token Rhythm Fingerprint — weight 8%
  6. Cache Hit Detection — weight 8%
  7. Canary Prompt Behavior — weight 7%
  8. Context Window Probe — weight 6%
  9. Sparse-Token Stress Test — weight 5%
  10. Stylometric Analysis — weight 3%
  11. Latency Distribution — weight 2%
  12. Self-Identification Probe — weight 1%
  13. Refusal Boundary — weight 1%

Logprobs as the strongest single signal

When the API returns top-k logprobs, the "shape" of the second-best tokens is a near-fingerprint of the underlying model. Tokenizer family is exposed directly: GPT-4o/5 emits o200k_base token strings, GPT-3.5/4 emits cl100k_base. Claude and Gemini have entirely different vocabularies. A proxy that strips logprobs while still claiming to serve gpt-5 is removing the cheapest, most reliable identity check by accident or design.

LLMmap (USENIX Security 2025)

We borrow the probe families from LLMmap: Fingerprinting Large Language Models (Pasquini et al., USENIX Security 2025, arXiv:2407.15847). Each probe targets a known divergent behaviour: refusal templates for synthesis prompts, instruction-conflict resolution, deterministic puzzles (strawberry letter count), and tooling boundaries.

Implementation honesty. The original paper trains a deep contrastive classifier over response embeddings and reports ~95% vendor identification accuracy across 42 LLM versions.

This release ships a lexical / structural template heuristic only — not a trained classifier — and we make no claim to the paper's 95% number. Treat this dimension as a lower-bound signal; a future release will plug in a real embedding model.

TrueLLMs ships the probe set with two safety guards: the policy-flagged probes (synthesis, conflict-handling) are off by default, and the response text is used only for feature extraction. Nothing is rendered back to the user from those prompts. When fewer than the full probe set has run, the classifier returns unknown rather than guessing a vendor.

Model Equality Testing (ICLR 2025)

From Model Equality Testing: Which Model is This API Serving? (Gao et al., ICLR 2025, arXiv:2410.20247). The two-sample test treats responses as samples from a distribution and runs a Maximum Mean Discrepancy test with a Hamming kernel over fixed-length string features. The paper reports that 11 of 31 commercial Llama endpoints deviated significantly from Meta's reference at p < 0.05. We do not read this as evidence that those providers were committing fraud — quantization, fine-tuning, system prompts, and post-processing all produce distributional shifts.

We compute MMD² with a Hamming kernel on the first 100 characters of each response (no tokenization and no case folding, matching the implementation in lib/identity-audit/mmd.ts), then estimate the p-value from a 1,000-permutation null distribution.
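A minimal sketch of the statistic, for readers who want to check the arithmetic. It is not the code in lib/identity-audit/mmd.ts. The names are hypothetical, and three choices are assumptions on our part: "characters" are counted as UTF-16 code units, short responses are padded with NUL to a fixed length, and the kernel is normalised to the fraction of matching positions. The estimator is the standard unbiased MMD², which leaves out the i = j terms in the within-sample averages.

```typescript
const FEATURE_LENGTH = 100;

// First 100 code units, padded so every feature has the same length.
function toFeature(response: string): string {
  return response.slice(0, FEATURE_LENGTH).padEnd(FEATURE_LENGTH, "\0");
}

// Hamming kernel: fraction of positions where the two features agree.
function hammingKernel(a: string, b: string): number {
  const fa = toFeature(a);
  const fb = toFeature(b);
  let matches = 0;
  for (let i = 0; i < FEATURE_LENGTH; i++) {
    if (fa[i] === fb[i]) matches++;
  }
  return matches / FEATURE_LENGTH;
}

// Mean kernel value over all pairs; skipDiagonal drops the i === j terms
// when both arguments are the same sample.
function meanKernel(xs: string[], ys: string[], skipDiagonal: boolean): number {
  let sum = 0;
  let count = 0;
  for (let i = 0; i < xs.length; i++) {
    for (let j = 0; j < ys.length; j++) {
      if (skipDiagonal && i === j) continue;
      sum += hammingKernel(xs[i], ys[j]);
      count++;
    }
  }
  return sum / count;
}

// Unbiased MMD² estimate between API responses and reference responses.
function mmd2(api: string[], reference: string[]): number {
  if (api.length < 2 || reference.length < 2) {
    throw new Error("need at least two responses on each side");
  }
  return (
    meanKernel(api, api, true) +
    meanKernel(reference, reference, true) -
    2 * meanKernel(api, reference, false)
  );
}
```

Two batches of identical responses give an MMD² of 0. Because the estimator is unbiased, small negative values are possible on finite samples and are not an error.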

How to enable. Open the home page, run a trusted audit (e.g. against the official upstream API) at temperature > 0, then click "Save current results as baseline" in the MMD baseline panel.

The baseline is held in browser localStorage — never uploaded — and is reused on every subsequent audit. The permutation test is stratified by prompt (each prompt block's api ↔ reference labels are shuffled independently) so a prompt-mix difference cannot inflate the null.
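The stratified shuffle can be sketched as follows. The names (PromptBlock, stratifiedPermutationP) are hypothetical. Averaging the per-prompt statistic into one pooled value is an assumption, and so is the "+1" correction in the p-value. The statistic is passed in as a function so the sketch stays independent of any particular MMD implementation.

```typescript
interface PromptBlock {
  api: string[];       // responses from the endpoint under audit
  reference: string[]; // baseline responses for the same prompt
}

type Statistic = (api: string[], reference: string[]) => number;

// Fisher-Yates shuffle on a copy; rng is injectable so runs can be seeded.
function shuffle<T>(items: T[], rng: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Assumption: the pooled statistic is the mean of the per-prompt statistics.
function pooled(blocks: PromptBlock[], stat: Statistic): number {
  const total = blocks.reduce((s, b) => s + stat(b.api, b.reference), 0);
  return total / blocks.length;
}

function stratifiedPermutationP(
  blocks: PromptBlock[],
  stat: Statistic,
  permutations = 1000,
  rng: () => number = Math.random,
): number {
  if (blocks.length === 0) throw new Error("need at least one prompt block");
  const observed = pooled(blocks, stat);
  let atLeastAsExtreme = 0;
  for (let p = 0; p < permutations; p++) {
    // Labels are shuffled inside each prompt block only, never across prompts.
    const permuted = blocks.map((b) => {
      const mixed = shuffle([...b.api, ...b.reference], rng);
      return {
        api: mixed.slice(0, b.api.length),
        reference: mixed.slice(b.api.length),
      };
    });
    if (pooled(permuted, stat) >= observed) atLeastAsExtreme++;
  }
  // The +1 in numerator and denominator keeps the estimate away from exactly 0.
  return (atLeastAsExtreme + 1) / (permutations + 1);
}
```

Because responses never cross prompt boundaries during the shuffle, a difference in which prompts were run cannot by itself produce a small p-value.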

Inter-Chunk Arrival Times (ITT)

Inspired by LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis (Alhazbi et al., 2025, arXiv:2502.20589). When the API streams, the gap between consecutive SSE chunks carries a signature of the inference stack. Pure autoregressive models on a stable backend yield narrow gap distributions. Speculative-decoding deployments (most current frontier APIs) show bimodal gaps. Cached-prefix replays show near-zero variance. We extract mean, variance, skew, and a small DFT spectrum, then classify the rhythm.

Honest disclosure. What we actually measure is the inter-SSE-chunk arrival time at the reader (the browser in Direct mode, the server in Proxy mode), not the true inter-token time inside the model. TCP coalescing, SSE flushing cadence, gateway buffering and the millisecond resolution of Date.now() all add noise. Cached-replay detection relies on gaps below 1 ms, which is finer than this resolution, so it should be considered a placeholder until we switch to high-resolution timestamping. The per-model rhythm fingerprint library is currently seeded with developer estimates, not large-N measurements.

How to enable. Just turn on stream: true in the configuration panel and run the audit. In Direct mode the client measures SSE chunk arrival times locally; in Proxy mode it parses the server-emitted audit.timing SSE event, which captures arrival times one network hop closer to the upstream. Either way the values land on TestResult.chunkTimestamps and the ITT dimension consumes them automatically.
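The feature step can be sketched as below, assuming chunkTimestamps is an array of arrival times in milliseconds. The function names are hypothetical. The moments are population moments (divided by n), and the DFT is a direct O(n·bins) evaluation, which is adequate for the short gap series a single response produces. The classification step is left out because its thresholds are not documented on this page.

```typescript
interface RhythmFeatures {
  mean: number;
  variance: number;
  skewness: number;
}

// Gaps between consecutive chunk arrivals.
function gaps(chunkTimestamps: number[]): number[] {
  const out: number[] = [];
  for (let i = 1; i < chunkTimestamps.length; i++) {
    out.push(chunkTimestamps[i] - chunkTimestamps[i - 1]);
  }
  return out;
}

// Mean, population variance and skewness of the gap distribution.
// Returns null when there are too few chunks to say anything.
function rhythmFeatures(chunkTimestamps: number[]): RhythmFeatures | null {
  const g = gaps(chunkTimestamps);
  if (g.length < 2) return null;
  const n = g.length;
  const mean = g.reduce((s, x) => s + x, 0) / n;
  const m2 = g.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const m3 = g.reduce((s, x) => s + (x - mean) ** 3, 0) / n;
  const skewness = m2 === 0 ? 0 : m3 / Math.pow(m2, 1.5);
  return { mean, variance: m2, skewness };
}

// Magnitudes of DFT bins 1..bins of the gap series (bin 0, the mean, is skipped).
function dftMagnitudes(values: number[], bins: number): number[] {
  const n = values.length;
  const mags: number[] = [];
  for (let k = 1; k <= bins; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (-2 * Math.PI * k * t) / n;
      re += values[t] * Math.cos(angle);
      im += values[t] * Math.sin(angle);
    }
    mags.push(Math.hypot(re, im) / n);
  }
  return mags;
}
```

A strictly alternating gap pattern, such as 10 ms, 20 ms, 10 ms, 20 ms, puts its energy in the highest-frequency bin. That is the kind of periodic structure the spectrum is meant to expose.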

Sparse-Token Stress (MiniMax 2026)

MiniMax's May-2026 investigation of the "Ma Jiaqi (马嘉祺)" case showed that during SFT, the lm_head vectors of low-frequency tokens drift significantly while the input-side embedding stays stable. The result is a generation-side asymmetry: the model still understands the token but cannot emit it under top-p sampling. The set of forgotten tokens is vendor-specific because each vendor's SFT data is different — making the failure pattern itself a fingerprint.

TrueLLMs ships ~10 stress probes covering five families: rare CJK personal-name compounds (e.g. 嘉祺, 王郸), Chinese SEO spam (e.g. 传奇私服, 据介绍), Japanese colloquial (相続税, 気を付けてください), LaTeX / Wikipedia metadata (off by default), and pretraining special tokens (off by default). Each probe instructs the model to echo the target verbatim. A miss is classified as omit / substitute / partial / refuse and, when the substitute matches a documented near-neighbour (祺→琪, 嘉祺→千玺, etc.), the historical note is surfaced.
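One way the miss classification could work is sketched below. The rule order, the refusal hints and all names are assumptions made for illustration and are not the shipped logic. Only the outcome labels and the near-neighbour examples (祺→琪, 嘉祺→千玺) come from the text above.

```typescript
type EchoOutcome = "hit" | "omit" | "substitute" | "partial" | "refuse";

interface EchoProbe {
  target: string;            // string the model is asked to echo verbatim
  nearNeighbours: string[];  // documented substitutes for this target
}

// Hypothetical refusal markers; a real list would be longer and multilingual.
const REFUSAL_HINTS: RegExp[] = [/\bI can['’]?t\b/i, /\bI cannot\b/i, /\bunable to\b/i];

function classifyEcho(probe: EchoProbe, response: string): EchoOutcome {
  const text = response.trim();
  if (text.includes(probe.target)) return "hit";
  if (REFUSAL_HINTS.some((re) => re.test(text))) return "refuse";
  // Checked before "partial" because a near-neighbour such as 嘉琪 also
  // shares a character with the target 嘉祺.
  if (probe.nearNeighbours.some((n) => text.includes(n))) return "substitute";
  // Array.from splits by code point, so CJK characters stay intact.
  const chars = Array.from(probe.target);
  if (chars.some((c) => text.includes(c))) return "partial";
  return "omit";
}

// Example probe built from the near-neighbours named in the text.
const jiaqiProbe: EchoProbe = { target: "嘉祺", nearNeighbours: ["嘉琪", "千玺"] };
```

With this probe, a response of "马嘉琪" is classified as a substitute and "马嘉" as partial. A response that shares no character with the target and is not a refusal counts as an omission.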

Honest scope. We do not yet have measured failure tables across GPT-5 / Claude / Gemini / DeepSeek / Qwen / Llama 4. Today this dimension reports the failure pattern but does NOT cast a vote for a specific suspected model — it only contributes to the "something is off with this SFT pipeline" signal. A future release with measured baselines will let the dimension vote.

Weight rebalancing

Default weights sum to 100. When a dimension is unavailable (for example, logprobs disabled, or no streaming), its weight is redistributed proportionally across the remaining available dimensions. The net effect: the verdict still reflects 100% of the available evidence, never a partial sum that drags confidence down for non-substantive reasons.
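The redistribution is a proportional renormalisation, which the following sketch illustrates. The weights are the defaults from the dimension list. The key names and the rebalance function are hypothetical.

```typescript
// Default weights from the dimension list; they sum to 100.
const DEFAULT_WEIGHTS: Record<string, number> = {
  logprobs: 17,
  tokenizerBoundary: 15,
  llmmap: 15,
  mmd: 12,
  itt: 8,
  cacheHit: 8,
  canary: 7,
  contextWindow: 6,
  sparseToken: 5,
  stylometry: 3,
  latency: 2,
  selfId: 1,
  refusalBoundary: 1,
};

// Drops unavailable dimensions and rescales the rest so they sum to 100 again.
// Each surviving dimension keeps its weight relative to the other survivors.
function rebalance(
  weights: Record<string, number>,
  available: Set<string>,
): Record<string, number> {
  const kept = Object.entries(weights).filter(([name]) => available.has(name));
  const total = kept.reduce((sum, [, w]) => sum + w, 0);
  const out: Record<string, number> = {};
  if (total === 0) return out;
  for (const [name, w] of kept) {
    out[name] = (w / total) * 100;
  }
  return out;
}
```

As a worked case, if logprobs (17) and ITT (8) are both unavailable, the remaining weights total 75. LLMmap then moves from 15 to 15 / 75 × 100 = 20, and MMD from 12 to 16.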

When logprobs are unavailable we additionally cap the final confidence at 70, so the confirmed-substituted label (threshold 80) becomes unreachable. Strong evidence is still strong evidence, but the headline number cannot be 99 without the single most reliable signal. This is a deliberate trade-off that accepts false negatives in order to avoid false positives.
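The cap is a clamp applied after aggregation. A small sketch, with hypothetical names and with the confidence assumed to be on a 0 to 100 scale:

```typescript
const NO_LOGPROBS_CAP = 70;      // cap quoted in the text
const CONFIRMED_THRESHOLD = 80;  // threshold for confirmed-substituted

// Clamps the raw score to [0, 100], then applies the cap when logprobs
// were not available for this audit.
function finalConfidence(raw: number, logprobsAvailable: boolean): number {
  const clamped = Math.min(100, Math.max(0, raw));
  return logprobsAvailable ? clamped : Math.min(clamped, NO_LOGPROBS_CAP);
}

// True when the confirmed-substituted label can be reached at all.
function canReachConfirmed(logprobsAvailable: boolean): boolean {
  return finalConfidence(100, logprobsAvailable) >= CONFIRMED_THRESHOLD;
}
```

Since 70 is below 80, canReachConfirmed(false) is false for any raw score, which is the unreachability described above.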

Limitations & honest disclosure

  • The 13 dimensions are correlated. Several share the same probe responses (tokenizer probes feed both Logprobs and Tokenizer-boundary; streaming responses feed both ITT and Latency). The weighted total is a scorecard, not a Bayesian combination of independent tests.
  • LLMmap is a heuristic approximation of the paper's trained classifier. The paper's 95% number does not transfer to this implementation.
  • MMD needs a baseline; ITT needs streaming. Both dimensions report unavailable (and donate their weight to the other available dimensions) when their input is not present. Audit a trusted endpoint first and save the baseline, then audit the suspect endpoint.
  • This is not adversarial-robust. A proxy that recognises the probe set could pass requests through to the real model only for those probes. We have no defence against that today.
  • One signal proves nothing. A statistically significant MMD result, a refused logprobs request, or a wrong refusal template each has many legitimate explanations. The labels report patterns; the user decides what they mean.

What TrueLLMs is not

  • It is not a fraud accusation. We report likely-substituted, never "scam".
  • It is not a continuous monitor. Run it manually whenever a bill looks wrong.
  • It does not, and cannot, prove a positive identity. A clean run only shows that no probe detected a discrepancy; a proxy that recognises the probe set could still pass.

References

  • Pasquini et al. LLMmap: Fingerprinting Large Language Models. USENIX Security 2025. arXiv:2407.15847.
  • Gao et al. Model Equality Testing: Which Model is This API Serving? ICLR 2025. arXiv:2410.20247.
  • Alhazbi et al. LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis. 2025. arXiv:2502.20589.
  • MiniMax. Sparse-token forgetting and lm_head drift: the "Ma Jiaqi (马嘉祺)" case. Internal investigation writeup, May 2026.
  • OpenAI. tiktoken: BPE tokenizer for OpenAI models. github.com/openai/tiktoken.

Try it

Open the auditor and run the Quick preset against your proxy. It takes about a minute.