When the API returns top-k logprobs, the alternative tokens at each position fingerprint the underlying model's tokenizer family. GPT-4o/5 emit o200k_base byte sequences; GPT-3.5/4 emit cl100k_base; Claude and Gemini use entirely separate vocabularies.
Algorithm
We request logprobs with top_logprobs=5 on a stable prompt, collect the alternative-token strings, and classify the BPE family by string-level features: presence of leading-space tokens, byte-level fallback patterns, and known multi-byte sequences for Chinese, Japanese, and emoji. A confidence score is produced as the fraction of unambiguous matches.
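The classification step can be illustrated with a minimal sketch. It assumes the alternative-token strings have already been collected from the API response, and it replaces the string-level feature checks described above with a simpler stand-in: exact membership in per-family reference vocabularies. The function name `classify_family`, the `TOY_VOCABS` table, and its placeholder token strings are all hypothetical and are not taken from the tool. The sketch also assumes "unambiguous" means a token that appears in exactly one family's vocabulary, and that the confidence is the winning family's unambiguous count divided by the total number of collected tokens.

```python
from collections import Counter
from typing import AbstractSet, Iterable, Mapping, Optional, Tuple


def classify_family(
    alt_tokens: Iterable[str],
    vocabularies: Mapping[str, AbstractSet[str]],
) -> Tuple[Optional[str], float]:
    """Return (best_family, confidence), with confidence in [0, 1].

    Assumption: a token is "unambiguous" when exactly one family's
    reference vocabulary contains it.
    """
    tokens = list(alt_tokens)
    if not tokens:
        return None, 0.0

    unambiguous = Counter()
    for token in tokens:
        # Families whose vocabulary contains this exact token string.
        owners = [name for name, vocab in vocabularies.items() if token in vocab]
        if len(owners) == 1:
            unambiguous[owners[0]] += 1

    if not unambiguous:
        return None, 0.0

    family, hits = unambiguous.most_common(1)[0]
    return family, hits / len(tokens)


# Hypothetical toy vocabularies with placeholder strings; real reference
# sets would be derived from the actual tokenizer vocabularies.
TOY_VOCABS = {
    "family_a": frozenset({" the", "alpha1", "alpha2"}),
    "family_b": frozenset({" the", "beta1"}),
}

sample = [" the", "alpha1", "alpha2", "beta1", "zzz"]
family, confidence = classify_family(sample, TOY_VOCABS)
print(family, confidence)
```

In this toy run, `" the"` is shared by both families and `"zzz"` belongs to neither, so only three of the five tokens count as unambiguous, and two of those point at `family_a`.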
Thresholds
| Condition | Verdict contribution |
| --- | --- |
| Family confidence ≥ 80% | Treat as fingerprint match |
| 50% ≤ confidence < 80% | Inconclusive, record sample |
| confidence < 50% | Treat as mismatch |
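
The threshold bands can be written as a small lookup. This is a sketch rather than the tool's own code: the function name `verdict_contribution` and the short return labels are hypothetical, and the confidence is assumed to be expressed as a percentage in the range 0 to 100.

```python
def verdict_contribution(confidence_pct: float) -> str:
    """Map a family confidence (percent, 0-100) to a verdict band."""
    if not 0.0 <= confidence_pct <= 100.0:
        raise ValueError("confidence_pct must be between 0 and 100")
    if confidence_pct >= 80.0:
        return "match"         # treat as fingerprint match
    if confidence_pct >= 50.0:
        return "inconclusive"  # inconclusive, record the sample
    return "mismatch"          # treat as mismatch


for value in (92.0, 80.0, 65.0, 49.9):
    print(value, verdict_contribution(value))
```

Both lower bounds are inclusive, so a confidence of exactly 80% counts as a match and exactly 50% counts as inconclusive.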
Limitations
Some proxies legitimately omit logprobs to save tokens. Logprobs unavailability is a flag, not a verdict. Confidence is capped at 70 in that case.
No single signal can prove malicious behavior. Proxies may show anomalies for legitimate reasons (regional routing, A/B testing, degradation strategies, cache optimization).
Token ratio deviation may result from ChatML wrapping, system prompt injection, or tokenizer version differences, and does not necessarily indicate intentional inflation.
Model identity judgment is based on statistical fingerprint matching, not cryptographic proof. Quantization, fine-tuning, and post-processing can all alter fingerprints.
MMD distribution tests are sensitive to temperature, sampling parameters, and system prompts. A significant p-value indicates a distributional difference, not proof of substitution.
Logprobs unavailability is increasingly common (many providers disable it by default in 2025-2026) and does not by itself indicate deception.
ITT rhythm fingerprinting is an early-stage technique. Network jitter, TCP coalescing, and gateway buffering can produce false signals.
This tool generates reference-grade evidence chains, not legal conclusions. Do not make definitive accusations based solely on this report.
The wording in the report refers to statistical "deviations" or "signal inconsistencies". Please do not use this to make fraud or deception claims against any service provider.