Model Equality Testing (Gao et al., ICLR 2025) uses a two-sample Maximum Mean Discrepancy (MMD) test to check whether two sets of responses come from the same distribution. First record a baseline from a trusted endpoint, then run audits against it. The baseline is stored in localStorage and is never uploaded.
MMD testing requires temperature > 0 to capture distribution differences. Current temperature = 0.
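The following is a minimal sketch of a two-sample MMD permutation test on text responses, not this tool's actual implementation. The character-level Hamming-style kernel, the 32-character window, and the names `hamming_kernel`, `mmd2`, and `permutation_p_value` are illustrative assumptions. It also shows why temperature > 0 matters: with deterministic output every sample in a set is the same string, so there is no spread to compare.

```python
import random


def hamming_kernel(a: str, b: str, length: int = 32) -> float:
    """Fraction of character positions where a and b agree.

    Assumption: strings are truncated or NUL-padded to `length`, so shared
    padding counts as agreement. A real audit may use a token-level kernel.
    """
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length


def mmd2(xs, ys, kernel=hamming_kernel) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


def permutation_p_value(xs, ys, n_perm: int = 200, seed: int = 0) -> float:
    """p-value for H0 'both samples come from the same distribution'.

    Labels are shuffled n_perm times; the p-value is the share of shuffles
    whose MMD is at least as large as the observed one (with +1 smoothing).
    """
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[: len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `xs` would hold the baseline responses and `ys` the audited responses to the same prompts. A small p-value suggests the two response distributions differ; it does not by itself identify the cause.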
Save current results as baseline (0)
Dashboard A · Token Usage Audit
Avg Prompt Ratio
1.000
normal
Avg Completion Ratio
1.000
normal
Fixed Offset Estimate
0.0
tokens / request
High Risk Samples
0
/ 0 samples
Overall Risk Level
Normal
Pattern: No significant deviation detected
Linear Regression Evidence
Insufficient samples to reliably identify a pattern (at least 3 valid samples required)
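One way to read the ratio and fixed-offset figures above is as the slope and intercept of a least-squares line through (expected tokens, reported tokens) pairs. The sketch below assumes that interpretation; the function name `fit_token_usage` and the rule that a valid sample has positive counts on both sides are hypothetical, while the 3-sample minimum comes from the message above.

```python
def fit_token_usage(expected, reported, min_samples=3):
    """Fit reported ≈ slope * expected + offset by ordinary least squares.

    Returns (slope, offset), or None when there are too few valid samples
    or all expected counts are identical (the slope is then undefined).
    """
    # Assumption: a sample is valid only if both counts are positive.
    pairs = [(e, r) for e, r in zip(expected, reported) if e > 0 and r > 0]
    if len(pairs) < min_samples:
        return None

    n = len(pairs)
    mean_x = sum(e for e, _ in pairs) / n
    mean_y = sum(r for _, r in pairs) / n
    sxx = sum((e - mean_x) ** 2 for e, _ in pairs)
    if sxx == 0:
        return None
    sxy = sum((e - mean_x) * (r - mean_y) for e, r in pairs)

    slope = sxy / sxx                    # multiplicative inflation (1.0 = none)
    offset = mean_y - slope * mean_x     # constant tokens added per request
    return slope, offset
```

For example, expected counts of 100, 200, and 300 tokens with reported counts of 115, 215, and 315 give a slope of 1.0 and an offset of 15 tokens per request, a pattern consistent with a fixed wrapper or injected system prompt rather than a proportional markup.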
Dashboard B · Model Identity Audit
Overall Verdict
Inconclusive
Based on 12 weighted signals. Unavailable dimensions' weights are proportionally transferred to other dimensions.
Anomaly Confidence
0/100
Higher means more deviation from claimed model
Model Comparison
Claimed: gpt-4o
Suspected: unknown
12-Dimension Signal Lights
Tokenizer Boundary
0% 20%
LLMmap Fingerprint
0% 18%
MMD Distribution Equivalence Test
0% 15%
ITT Rhythm Fingerprint
0% 10%
Response Latency & Speed
0% 2%
Self-Identification Probe
0% 1%
Canary Prompts
0% 8%
Refusal Boundary Probe
0% 1%
Context Window
0% 7%
Cache Hit Detection
0% 10%
Sparse-Token Stress Test
0% 5%
Stylometric Analysis
0% 3%
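The rule that unavailable dimensions pass their weight proportionally to the others amounts to renormalizing by the total weight of the dimensions that did produce a score. The sketch below assumes the second percentage on each signal line is that dimension's weight (the twelve values sum to 100) and that the overall score is a weighted average; the dictionary keys and the function name `anomaly_confidence` are illustrative, and the dashboard's actual aggregation may differ.

```python
# Weights in percent, read from the signal list above (assumed meaning).
WEIGHTS = {
    "tokenizer_boundary": 20,
    "llmmap_fingerprint": 18,
    "mmd_test": 15,
    "itt_rhythm": 10,
    "latency_speed": 2,
    "self_identification": 1,
    "canary_prompts": 8,
    "refusal_boundary": 1,
    "context_window": 7,
    "cache_hit": 10,
    "sparse_token": 5,
    "stylometric": 3,
}


def anomaly_confidence(scores):
    """Weighted average of per-dimension scores on a 0-100 scale.

    `scores` maps dimension name -> score, or None when the dimension is
    unavailable. Dividing by the available weight total is equivalent to
    handing each missing weight to the remaining dimensions in proportion
    to their own weights. Returns None if no dimension is available.
    """
    available = {k: s for k, s in scores.items() if s is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return None
    return sum(WEIGHTS[k] * s for k, s in available.items()) / total_weight
```

With only the tokenizer dimension (score 100) and the MMD dimension (score 0) available, the effective weights become 20/35 and 15/35, so the combined score is about 57 rather than 20.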
Evidence Chain
Expand each dimension to see the full reasoning process and raw evidence
No single signal can prove malicious behavior. Proxies may show anomalies for legitimate reasons (regional routing, A/B testing, degradation strategies, cache optimization).
Token ratio deviation may result from ChatML wrapping, system prompt injection, or tokenizer version differences — not necessarily intentional inflation.
Model identity judgment is based on statistical fingerprint matching, not cryptographic proof. Quantization, fine-tuning, and post-processing can all alter fingerprints.
MMD distribution tests are sensitive to temperature, sampling parameters, and system prompts. A significant p-value indicates a distributional difference, not proof of substitution.
Unavailable logprobs are increasingly common (many providers disable them by default in 2025-2026) and do not by themselves indicate deception.
ITT rhythm fingerprinting is an early-stage technique. Network jitter, TCP coalescing, and gateway buffering can produce false signals.
This tool generates reference-grade evidence chains, not legal conclusions. Do not make definitive accusations based solely on this report.
The wording in the report refers to statistical "deviations" or "signal inconsistencies". Please do not use this to make fraud or deception claims against any service provider.