Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.
Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser; your inputs never leave this tab.
Built by an independent researcher. Open source. Not affiliated with any model vendor.
TAF Agent: User Manual
What does it do?
Predicts the practical viability of any transformer LLM
before you spend GPU hours or money. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF, the Thermodynamic Attention Framework).
How to use: 7 modes
Profile: paste a model id → all recipes at once, producing a TAF Card. Best starting point.
Compare: 2-3 models side-by-side on the same recipe. Best when choosing between candidates.
Inspect config: paste a raw config.json → the tool parses it and runs the full Profile. For private models, in-development configs, or models not yet on the HF Hub.
Ask plain English: free-form question; an in-browser LLM picks the recipe. Best for casual exploration.
Recipe + form: manual selection, full parameter control. Best when you want exact control.
Phase diagram: scatter plot of 23 panel models on the (log θ, γ) plane. The Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into the Recipe form.
The 8 recipes available
X-1 Custom training vs API: compares cost of training your own model vs paying for API access.
Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.
X-2 Long Context Viability: predicts if a model serves a target context length reliably.
X-3 Budget pre-flight: given a $ budget, what model is feasible to train?
Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
X-5 Hardware selection: which GPU should I use to serve at target throughput?
Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.
X-19 KV Compression decision: should I use soft decay, hard cutoff, or literature methods?
Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
v0.4 (session 29 findings)
What's new in v0.4 (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).
X-21 Imprint Purity Diagnostic: predicts γ on RANDOM tokens via ν = -1/(2π); how clean is the model's RoPE prediction?
Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
Learned-imprint slope ν = -1/(2π): the RoPE rotation period 2π drives a positional bias on the weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED, not fitted (empirical err 0.3%).
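A minimal sketch of what that scaling implies, assuming the imprint law is linear in the natural log of N_params (the log base and the worked numbers are illustrative, not taken from the paper):

```python
import math

NU = -1.0 / (2 * math.pi)  # learned-imprint slope ν = -1/(2π), derived rather than fitted

def gamma_random_shift(n_params_a: float, n_params_b: float) -> float:
    """Predicted change in γ_random when moving from model A to model B,
    under the linear-in-log(N_params) imprint law quoted above:
        Δγ_random ≈ ν · (ln N_B - ln N_A)
    """
    return NU * (math.log(n_params_b) - math.log(n_params_a))

# e.g. going from a 1.4B-parameter model to an 8B one:
print(round(gamma_random_shift(1.4e9, 8e9), 3))  # ≈ -0.277
```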
X-22 Compute-Context Invariant: does γ × log(N²·D) lie in the panel band 51.2 ± 16.8? Detects scaling/training anomalies.
Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and the attention exponent into a single dimensionless number.
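In plain Python the check amounts to the following sketch; the band is the one quoted above, while the natural-log base and the 2σ in-band cut are assumptions:

```python
import math

K_MEAN, K_SIGMA = 51.2, 16.8  # panel band quoted above (CV = 0.329)

def compute_context_invariant(gamma: float, n_params: float, d_tokens: float):
    """K = γ·log(N²·D) and its z-score against the panel band.
    Natural log is assumed (it reproduces the band for typical models,
    e.g. 7B params × 2T tokens gives log(N²·D) ≈ 74)."""
    k = gamma * math.log(n_params ** 2 * d_tokens)
    z = (k - K_MEAN) / K_SIGMA
    verdict = "IN-BAND" if abs(z) <= 2.0 else "OUTLIER"  # 2σ cut is illustrative
    return k, z, verdict

print(compute_context_invariant(0.7, 7e9, 2e12))  # ≈ (51.6, 0.02, 'IN-BAND')
```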
X-23 IH-Phase Detector: pre- or post-induction-head? Cheap probe via sign(γ_text - γ_random).
Δγ as IH probe: sign(γ_text - γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
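The probe itself is one line of arithmetic; a sketch with made-up γ values:

```python
def ih_phase(gamma_text: float, gamma_random: float) -> str:
    """Induction-head phase from the sign of Δγ = γ_text - γ_random."""
    return "post-induction-head" if gamma_text - gamma_random > 0 else "pre-induction-head"

print(ih_phase(0.71, 0.43))  # illustrative values -> 'post-induction-head'
```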
γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1-1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1-1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.
v0.4: New diagnostics (session 31)
Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games + Socratic interrogation. Available in taf_browser.py §33.
v0.6 (2026-05-06): three new diagnostics live in the TAF Card under Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.
TAF Card layout (new in v0.6)
After clicking Generate full profile the card shows: a hero strip on top (architecture class + meta + 3 pills: aggregate verdict, γ headline, Anti-Ising if Phase A) and four expandable sections: Recipes (open by default: verdict per dimension), Diagnostics (key numbers, γ predicted vs observed, what-if explorer), Verification (Sage+Lean algebraic consistency, falsification F1-F23), Provenance & share (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline tooltip.
Preset list: 11 popular models curated. Just select from dropdown.
HF Hub fetch: paste any model id (e.g. Qwen/Qwen2.5-32B-Instruct),
click Fetch. The browser downloads config.json directly from HuggingFace and fills the form. Works for any public model.
Manual: fill the form fields directly with values from the model card.
v0.7: Anti-bullshit pack (4 new modes)
v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference: pure metadata + math.
Context Unmasker
Detects when max_position_embeddings is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. Use case: before paying GPU for 32k context, verify the model actually attends that far.
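The signals it inspects are ordinary config.json fields; a minimal Python sketch of the same idea (field names are the standard HF ones, the real tool's verdict thresholds are not reproduced here):

```python
import json, urllib.request

def fetch_config(model_id: str) -> dict:
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    with urllib.request.urlopen(url) as r:
        return json.load(r)

def unmask(cfg: dict) -> str:
    """Lists the red flags that make max_position_embeddings misleading."""
    declared = cfg.get("max_position_embeddings")
    flags = []
    if cfg.get("sliding_window") and cfg.get("use_sliding_window", True):
        flags.append(f"SWA window={cfg['sliding_window']}")            # attends locally
    if cfg.get("rope_scaling"):
        kind = cfg["rope_scaling"].get("rope_type") or cfg["rope_scaling"].get("type")
        flags.append(f"RoPE scaling={kind}")                            # YaRN / linear / dynamic NTK
    kv, heads = cfg.get("num_key_value_heads"), cfg.get("num_attention_heads")
    if kv and heads and kv < heads:
        flags.append(f"GQA {kv}/{heads}")
    return f"declares {declared} tokens; signals: {', '.join(flags) or 'none'}"

print(unmask(fetch_config("Qwen/Qwen2.5-7B")))
```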
Chat-template Sniffer
Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting --apply_chat_template silently halves multi-turn accuracy. Use case: before reporting a benchmark score, confirm you applied the template correctly.
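Detection boils down to looking for family-specific markers in the tokenizer's chat_template; a sketch with an illustrative subset of markers (the tool covers more families and edge cases):

```python
from transformers import AutoTokenizer

FAMILIES = [                      # marker -> family (illustrative subset)
    ("<|start_header_id|>", "llama-3"),
    ("<|im_start|>",        "chatml"),
    ("<start_of_turn>",     "gemma"),
    ("### Instruction:",    "alpaca"),
    ("[INST]",              "mistral"),
    ("<|user|>",            "phi-3"),
]

def sniff(model_id: str) -> str:
    tpl = AutoTokenizer.from_pretrained(model_id).chat_template or ""
    for marker, family in FAMILIES:
        if marker in tpl:
            return family
    return "custom" if tpl else "none"

# Once the family is known, remember the flag, e.g. for lm-eval:
#   lm_eval --model hf --model_args pretrained=<id> --apply_chat_template
print(sniff("Qwen/Qwen2.5-7B-Instruct"))  # -> 'chatml'
```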
Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from its public leaderboard; a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. Use case: before declaring "model A beats model B", verify their CIs don't overlap.
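For reference, the whole pipeline fits in a few dozen lines of Python; a sketch using the standard MM iteration for Bradley-Terry plus a percentile bootstrap (the vote rows and the Elo scaling constants are illustrative):

```python
import math, random
from collections import defaultdict

def bradley_terry(votes, iters=100):
    """Bradley-Terry strengths from (model_a, model_b, winner) rows via the
    standard MM/Zermelo iteration. Ties are skipped for brevity."""
    wins, games, models = defaultdict(float), defaultdict(float), set()
    for a, b, w in votes:
        models |= {a, b}
        games[(a, b)] += 1
        games[(b, a)] += 1
        if w in (a, b):
            wins[w] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            new_p[i] = (wins[i] + 0.5) / denom if denom else p[i]  # +0.5 avoids log(0)
        mean = sum(new_p.values()) / len(new_p)
        p = {m: v / mean for m, v in new_p.items()}
    return {m: 1000 + 400 * math.log10(v) for m, v in p.items()}  # Elo-like scale

def bootstrap_ci(votes, n_boot=200, level=0.95):
    """Percentile bootstrap over vote rows; two models whose intervals overlap
    are a statistical tie."""
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = [random.choice(votes) for _ in votes]
        for m, elo in bradley_terry(resampled).items():
            samples[m].append(elo)
    lo = (1 - level) / 2
    out = {}
    for m, s in samples.items():
        s.sort()
        out[m] = (s[int(lo * len(s))], s[int((1 - lo) * len(s)) - 1])
    return out

votes = [("model-a", "model-b", "model-a")] * 30 + [("model-a", "model-b", "model-b")] * 25
print(bradley_terry(votes))
print(bootstrap_ci(votes))
```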
Contamination Prior
Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. Use case: decide which scores to trust when comparing two models.
Quant-regime Classifier
Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb the shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. Use case: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.
Cross-framework Drift Bound
Same model, different scores on different setups. The tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically a chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. Use case: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.
Benchmark Saturation Detector
MMLU is saturated (top models at 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict (saturated / near-saturated / discriminative) plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from the DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. Use case: before you cite "92% on MMLU" or design an eval, check whether the benchmark still discriminates anything.
JSON CoT-aware Linter
Constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in the order your schema declares them. If you write { answer, reasoning } the model commits to answer first and CoT collapses into post-hoc justification. Paste any schema (or example response) → the linter classifies each field as reasoning, answer, or other, flags the ordering, and emits a reordered fix you can copy back. Use case: "My CoT prompt works in plaintext but degrades under JSON mode" → run the linter, find the inverted order, fix.
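The core of the check is just reading the property order out of the schema; a sketch with illustrative keyword lists (the tool's field classifier is richer):

```python
import json

REASONING_HINTS = ("reasoning", "rationale", "thought", "analysis", "explanation")
ANSWER_HINTS    = ("answer", "final", "result", "verdict", "label")

def lint_schema(schema_json: str) -> dict:
    """Flags answer-before-reasoning ordering in a JSON Schema's properties."""
    props = list(json.loads(schema_json).get("properties", {}))

    def kind(name: str) -> str:
        n = name.lower()
        if any(h in n for h in REASONING_HINTS):
            return "reasoning"
        if any(h in n for h in ANSWER_HINTS):
            return "answer"
        return "other"

    kinds = [(p, kind(p)) for p in props]
    first = {k: next((i for i, (_, kk) in enumerate(kinds) if kk == k), None)
             for k in ("answer", "reasoning")}
    bad = (first["answer"] is not None and first["reasoning"] is not None
           and first["answer"] < first["reasoning"])
    fix = [p for p, k in kinds if k == "reasoning"] + [p for p, k in kinds if k != "reasoning"]
    return {"fields": kinds, "answer_before_reasoning": bad, "suggested_order": fix}

print(lint_schema('{"properties": {"answer": {"type": "string"}, "reasoning": {"type": "string"}}}'))
```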
PEFT Anti-Pattern Checker
PEFT's get_peft_model(base, config) creates a FRESH adapter; it does not load saved weights from a path. Users who paste tutorial code and try to resume from a checkpoint silently throw away their training. peft #2115 has the canonical bug report. This linter scans your training script for the pattern + 3 related issues (QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio) and reports findings with line numbers and suggested fixes. Use case: before you launch a 10-hour LoRA fine-tune, paste your script → catch the silent bugs in 200ms.
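A stripped-down version of the resume check as a regex heuristic (the keyword list is illustrative; the real linter also covers the QLoRA ordering and target_modules checks):

```python
import re

def peft_resume_lint(script: str) -> list:
    """Flags the peft #2115 pattern: calling get_peft_model() when the intent
    is to resume from a saved adapter."""
    uses_get_peft = re.search(r"\bget_peft_model\s*\(", script)
    loads_adapter = re.search(r"\bPeftModel\.from_pretrained\s*\(", script)
    mentions_ckpt = re.search(r"checkpoint|adapter_model|resume", script, re.I)
    if uses_get_peft and mentions_ckpt and not loads_adapter:
        return ["get_peft_model() creates a fresh adapter; to resume, use "
                "PeftModel.from_pretrained(base, adapter_path) instead."]
    return []

snippet = """
model = AutoModelForCausalLM.from_pretrained(base_id)
model = get_peft_model(model, lora_config)   # user expects checkpoint weights here
trainer.train(resume_from_checkpoint=True)
"""
print(peft_resume_lint(snippet))
```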
Prompt-Cache Diff Predictor
Provider prompt caches each have different rules: Anthropic's cache_control breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill; the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. Use case: "I tweaked the system prompt and the bill jumped; what broke?" → paste both prompts, see exactly which provider stopped caching.
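The prediction reduces to a longest-common-prefix computation plus the provider thresholds quoted above; a sketch (the chars-per-token ratios are crude stand-ins for the tool's tokenizer profiles):

```python
import os

CHARS_PER_TOKEN = {"english": 4.0, "code": 3.2, "cjk": 1.5}  # rough, illustrative

def cache_hit_ratio(old: str, new: str, profile: str = "english") -> dict:
    """Per-provider fraction of the new prompt still served from cache."""
    prefix_tok = len(os.path.commonprefix([old, new])) / CHARS_PER_TOKEN[profile]
    new_tok = len(new) / CHARS_PER_TOKEN[profile]
    ratio = prefix_tok / new_tok if new_tok else 0.0
    return {
        "anthropic": ratio,                                # breaks at first diff in marked prefix
        "openai": ratio if prefix_tok >= 1024 else 0.0,    # auto-cache needs >=1024-token prefix
        "gemini": ratio if prefix_tok >= 32_000 else 0.0,  # context cache needs >=32K tokens
    }

# An edit near the top of the system prompt kills the shared prefix:
print(cache_hit_ratio("SYSTEM v1\n" + "x" * 8000, "SYSTEM v2\n" + "x" * 8000))
```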
Speculative-Decode Compatibility
Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected: you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. Use case: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.
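Conceptually the compatibility check is a vocabulary diff; a Python sketch using transformers instead of raw tokenizer.json (the open-weight model pair is just an example):

```python
from transformers import AutoTokenizer

def specdec_compatible(target_id: str, draft_id: str) -> dict:
    """Compares the vocabularies that speculative decoding requires to match."""
    t = AutoTokenizer.from_pretrained(target_id)
    d = AutoTokenizer.from_pretrained(draft_id)
    report = {
        "same_vocab_size": len(t.get_vocab()) == len(d.get_vocab()),
        "same_token_to_id_map": t.get_vocab() == d.get_vocab(),
        "same_special_tokens": t.all_special_tokens == d.all_special_tokens,
    }
    report["compatible"] = all(report.values())
    return report

# Open-weight pair (gated models would need HF auth):
print(specdec_compatible("Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen2.5-0.5B-Instruct"))
```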
Multilingual Tokenizer Tax
Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Use case: "My multilingual support added 30% to the bill; which language costs the most?" → paste real production text, see the exact per-tokenizer breakdown.
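The same measurement can be reproduced offline with the Python transformers library; a sketch with two open-weight tokenizers standing in for the 6 in-browser presets:

```python
from transformers import AutoTokenizer

TOKENIZERS = {                      # open-weight subset of the in-browser presets
    "qwen2.5": "Qwen/Qwen2.5-7B-Instruct",
    "phi-3.5": "microsoft/Phi-3.5-mini-instruct",
}

def token_tax(text: str) -> dict:
    """Real BPE token counts for the same text, per tokenizer."""
    return {name: len(AutoTokenizer.from_pretrained(repo).encode(text))
            for name, repo in TOKENIZERS.items()}

print(token_tax("The invoice is attached."))
print(token_tax("发票已附上，请查收。"))   # CJK text typically costs far more tokens
```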
LongScore
Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose LongScore: LC_l = (S_l - Base) / Base with Base = mean(S_short), then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. Use case: "I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization; how much accuracy do I actually lose?" → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).
Solutions Hub
tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). The search box matches across pain, scenario, and tool name. Use case: "I have problem X; does tafagent solve it, and if not, who does?"
The audit chain
Every result shows the full Computation Chain: each formula step with its inputs,
output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for the derivation.
The plain-English answer
After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are always correct (deterministic Python);
the synthesis is LLM-generated, so verify against the chain if in doubt.
Common parameters explained
θ (rope_theta): RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).
T_train: max context the model was trained on. From max_position_embeddings.
T_eval: your target inference context length. The key knob.
n_kv_heads < n_attention_heads: the model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.
has_SWA: model uses Sliding Window Attention (Mistral, gemma-2).
n_params: total parameter count. Threshold ~400M for induction-head emergence.
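Most of these come straight out of config.json; a minimal sketch of the mapping (field names are the standard HuggingFace ones; n_params is not stored in config.json and must come from the model card):

```python
def taf_params_from_config(cfg: dict) -> dict:
    """Maps standard HuggingFace config.json fields onto the TAF inputs above."""
    heads = cfg.get("num_attention_heads")
    kv = cfg.get("num_key_value_heads", heads)
    return {
        "theta": cfg.get("rope_theta", 10_000),          # RoPE base frequency
        "T_train": cfg.get("max_position_embeddings"),   # trained context
        "has_GQA": kv is not None and heads is not None and kv < heads,
        "has_SWA": bool(cfg.get("sliding_window")),
    }

# Llama-3-8B-like values:
print(taf_params_from_config({
    "rope_theta": 500000.0,
    "max_position_embeddings": 8192,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}))
```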
What to look for in verdicts
YES / GO: proceed with confidence; the numbers support the choice.
DEGRADED / TINY-MODEL: works but with caveats; read the action.
NO / MEMORY-LIMITED: don't proceed as-is; mitigation provided.
Privacy
Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.
Cardy ΔH — entropy shift between observed and nominal context
Falsification dashboard — checks 23 specific predictions (F1-F23)
Algebraic consistency — 8 mathematical identities the model must satisfy
Formally verified math
37 theorems machine-proven in Lean 4 + Mathlib4
Click any badge → opens the source line on GitHub
Verify yourself: lake build (≈5 s after cache fetch)
Export & share
JSON · Markdown · LaTeX (paper-ready)
Reproducible share link (state encoded in URL)
Submit to community registry on GitHub
v0.7 anti-bullshit pack
Unmask: config.json claims 32k? See if it actually attends that far.
Chat-template: exact CLI flag so lm-eval doesn't silently halve your accuracy.
Arena CI: recover the confidence intervals Chatbot Arena hides.
Contamination: rate 20+ benchmarks for contamination probability.
Quant: predict γ shift + ΔPPL for any (model × quant scheme) combo.
Drift: bug or noise? Predict the max admissible gap between two evals.
NIAH→Reason: does your "128k context" actually reason there, or just retrieve?
Saturation: is your benchmark still useful, or are all frontier models tied at the top?
JSON CoT: lints structured-output schemas for the answer-before-reasoning anti-pattern that silently breaks Chain-of-Thought.
PEFT Lint: catches the silent get_peft_model base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.
Cache Diff: predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.
Spec-Decode: verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).
Token Tax: real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).
LongScore: peer-reviewed degradation metric (100-LongBench, ACL 2025). Look up any model in the RULER + HELMET KBs (n=93). See how much your model actually drops past short context.
Solutions Hub: every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent; find.
Architectures supported (click to expand)
RoPE-MHA (Multi-Head Attention): each token position attends through several parallel heads at once.
RoPE-GQA (Grouped Query Attention): queries share fewer keys/values than heads (saves memory but pushes γ toward Hagedorn).
ALiBi (Attention with Linear Biases): position info is a learned slope added to attention scores, no rotation.
AbsPE (Absolute Position Embeddings): each position has a fixed learned vector added to the token embedding.
SWA (Sliding Window Attention): each token only attends within a fixed local window (Mistral, gemma-2 use this).
SSM (Mamba, State Space Model): a sequence layer that maintains internal state instead of attention (Mamba, Jamba use this).
Any HuggingFace public model
What do you want to do?
Pick a task. Each one opens the right tool below. Or scroll down for the full list of 22 modes.
Diagnose a model
Start here when you have a specific model id and want a full diagnostic: Profile runs all 5 recipes at once. Unmask checks if max_position_embeddings is honest. NIAH→Reason predicts the retrieval-vs-reasoning gap. LongScore looks up published RULER + HELMET data for the model and shows real degradation past short context (peer-reviewed metric). Quant predicts whether quantizing will break it. Inspect lets you paste raw config.json for private/in-dev models.
Will this specific model work for my use case?
Trust a benchmark score
When you see a score and want to know if it's real. Contamination rates 20+ benchmarks for the likelihood the model saw them during training. Drift tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). Arena CI reconstructs the confidence intervals Chatbot Arena hides; many top-Elo "wins" are statistically tied.
Should I believe this number? Bug or noise?
Set up an eval correctly
Before you run lm-eval-harness or vLLM serve, get the right CLI flag. Chat-template Sniffer detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact --apply_chat_template / --chat-template invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). Diagnose CLI generates the Python command to measure γ_obs on your local GPU.
Get the exact CLI flag for lm-eval / vLLM / transformers.
Side-by-side, or browse the empirical model landscape.
Manual / free-form
Recipe: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. Ask: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.
Pick a specific recipe by hand, or ask in plain English.
Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B),
click Profile. See all 5 recipes scored in seconds.
Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → Fetch → Generate.
Profile a model
One-click full diagnosis. Paste any HF model id (or pick a preset).
Tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget,
hardware) and produces a single TAF Card showing verdict per
dimension + key numbers + architecture classification.
Use case: "I'm evaluating Qwen2.5-32B for production;
what's its full viability profile?" → paste id → Profile → done.
For technicians: when you need a complete viability snapshot
of a candidate model. Outputs match the format of the paper's γ-decomposition section.
Use case: you have a private model not on HF Hub, or a config you're designing. Paste the raw JSON below and get a full TAF profile.
Architecture Inspector
Paste any config.json directly. The tool parses it and runs the full Profile.
Useful for: private models, in-development configs, models not yet on HuggingFace,
or comparing what your custom architecture would do.
Paste the raw config.json contents. The tool extracts the architectural
parameters and runs the full 5-recipe Profile.
Try: paste 3 popular 7-8B models (Meta-Llama-3-8B, Mistral-7B-v0.1, Qwen/Qwen2.5-7B), pick recipe X-2, T_eval=16000. See which best handles long context.
Compare models side-by-side
Same recipe, multiple models. Pick 2-3 candidate models and
one recipe. See verdicts in a single comparison table.
Use case: "I need long-context retrieval at 16K; which is
best: Llama-3-8B, Mistral-7B, or Qwen-7B?" → pick 3 + X-2 + 16K → see the winner.
For technicians: when choosing between 2-3 candidate models for
a specific deployment scenario. Compare their verdicts on the same recipe.
For X-2 / X-19 only. The context length all compared models will be
evaluated at. Other recipes use their own params.
Output: γ_obs, R², phase, KV cache budget D_90, KL anomaly,
full thermodynamic profile (Z, U, S, F, C_V, χ). Saved as JSON.
Pick options below and copy-paste the generated command on your local
machine (Python + transformers + numpy). Total wall time ≈ 5 min in
--fast mode on CPU; full mode 20-60 min on GPU.
Generated command:
Next steps:
(1) git clone https://github.com/karlesmarin/tafagent
(2) cd tafagent && pip install torch transformers numpy
(3) Run the command above.
(4) Result JSON lands in ./diagnose_results/ → upload it
to the Pick recipe mode (or paste it in Inspect config) for full TAF analysis.
Phase diagram (γ × θ)
Each dot is one model from the paper's empirical panel
(data/master_gamma_results.json). The x-axis is RoPE base θ
on a log scale; the y-axis is measured γ.
The Hagedorn line γ=1 separates Phase A (γ<1, global) from
Phase B (γ>1, local-collapsed).
Hover dots for details; click to populate the recipe form.
Context Unmasker
Paste a HuggingFace model id (or raw config.json). The tool checks for
sliding-window attention, RoPE scaling (YaRN/linear/dynamic NTK), and
GQA: anything that makes max_position_embeddings larger
than the practical effective context. Mistral-7B-v0.1 is the canonical
example: declared 32k, attends within ~4-8k.
Are you about to spend money on a model that won't actually attend that far? Paste an id and find out in 1 second. No GPU, no inference: just config.json arithmetic.
Or paste raw config.json (private / in-dev models)
Chat-template Sniffer
Paste an HF model id (or raw tokenizer_config.json). Detects the
chat-template family (Llama-3, ChatML, Mistral, Gemma, Phi-3,
Alpaca, DeepSeek, custom) and gives you the exact framework command
to use it correctly. lm-eval-harness silently halves accuracy if you
forget to apply it (issue #1841).
Did you forget --apply_chat_template? Most multi-turn evals fail by ~50% because the chat template wasn't applied. Paste a model id, get the exact CLI flag for your stack.
Or paste raw tokenizer_config.json (private models)
Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from the public leaderboard.
A 5-Elo gap can be statistically meaningless. Paste raw vote data
(model_a, model_b, winner) → the tool computes Bradley-Terry MLE +
bootstrap CIs and lists statistical ties (CI overlap).
Is GPT-4 actually better than Claude, or are they tied? Paste pairwise vote CSV (or click Load sample). Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and statistical-tie detection. All in browser.
Contamination Prior
Computes a Bayesian-ish prior on whether a benchmark score is contaminated, based on (model training cutoff date) × (benchmark release date) × (known corpus inclusion + leak history). Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated.
Should you trust your model's MMLU score? Enter the model's training cutoff date → the tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA…) and tells you which scores are likely contaminated.
Quant-regime Classifier
Predicts γ-shift (and downstream ΔPPL) for a given (model × quant scheme).
Generic claims like "AWQ ~95% retention" are too vague; TAF uses
d_head, GQA ratio, SWA flag, and model size to give an architecture-specific
verdict. Solves: the HF community widely reports unpredictable quant cliffs
(NF4 -2 PPL on Phi-3 but fine on Llama-3-8B).
Will quantizing your model break it? Paste an HF model id, pick a quant scheme → get the predicted γ-shift, expected ΔPPL band, and a recommended alternative if it's a cliff. Browser-only, no GPU, no calibration set required.
Cross-framework Drift Bound
Same model, different scores on different setups. Is the gap noise or
a real bug? Enter two scores with their (framework, dtype, batch,
chat-template) → the tool predicts the maximum allowable drift from
numerical noise alone. If the observed gap exceeds it → real bug, usually
a chat-template mismatch (lm-eval issue #1841) or KV-cache layout.
Your model gives 67.2 on lm-eval-hf and 65.1 on vLLM-served. Bug or noise? Enter both scores with (framework, dtype, batch, chat-template applied?). The tool predicts the noise band and flags real bugs. arXiv:2506.09501 documents this as a major eval reproducibility problem.
NIAH → Reasoning Gap
NIAH (Needle in a Haystack) tests retrieval: "find this fact in long text". Multi-hop reasoning tests inference: "combine facts X+Y at the start with fact Z at the end". The RULER paper (NVIDIA 2024) shows long-context models often pass NIAH but fail reasoning at the same context. This tool predicts both pass rates from architecture alone.
Your model claims 128k context. Will it actually reason at 64k, or just retrieve? Paste an HF model id and a target eval context → the tool predicts NIAH and multi-hop reasoning pass rates, the gap, and a "safe context" where reasoning stays ≥65%.
Benchmark Saturation Detector
MMLU is saturated (all frontier models at 88-94%). Reporting "92% on MMLU" is now meaningless. This tool tells you which benchmarks still discriminate frontier models, which are saturated, and what to use instead. Data: DemandSphere AI Frontier Tracker (CC BY-NC 4.0) refreshed 2026-05.
Is your benchmark still useful? Pick a benchmark to see top-3 frontier scores, spread, and a verdict (saturated / near-saturated / discriminative) plus recommended replacements.
Data: DemandSphere AI Frontier Model Tracker (CC BY-NC 4.0) · HF Open LLM Leaderboard v3 (open-weight historical) · last fetch 2026-05-05.
JSON CoT-aware Linter
Why this matters: constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in schema order. If your schema places answer before reasoning, the model commits to a final answer first and only then writes the rationale to justify it, defeating Chain-of-Thought entirely. Paste a JSON Schema (or example object) and the linter flags the ordering.
Reasoning before answer, always. Paste a JSON Schema or example response object → the linter reports whether reasoning fields come before answer fields and suggests a fix.
PEFT Anti-Pattern Checker
Why this matters: get_peft_model(base, config) creates a FRESH adapter; it does NOT load saved weights. Users who want to resume from a checkpoint must call PeftModel.from_pretrained(base, path). peft #2115 documents the silent base-model bug. This linter scans your training script for that pattern (and 3 others: QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio).
Don't burn 10 hours of training on a base model. Paste your PEFT setup code → the linter flags silent base-model loads, QLoRA ordering bugs, target_modules/arch mismatches, and lora_alpha conventions.
Prompt-Cache Diff Predictor
Why this matters: Anthropic's `cache_control` cache breaks at the first token diff in the marked prefix. OpenAI auto-caches prefixes ≥1024 tokens but invalidates on any change. Gemini context cache requires ≥32K tokens. A misplaced edit silently 10x's your bill, and the API never warns you. Paste old + new prompt, see per-provider hit ratio + cost delta.
Don't 10x your bill on a one-character edit. Paste your previous and current prompt → the predictor finds the longest common prefix, estimates tokens, and shows per-provider cache hit ratio + $ delta vs no-cache.
Speculative-Decode Compatibility
Why this matters: speculative decoding (vLLM, SGLang, llama.cpp, transformers) requires the draft and target model to share an EXACT vocabulary. Any token-id disagreement means the target rejects every draft token: you pay BOTH compute costs and get WORSE throughput than baseline. The system reports nominal output (just slower), so the bug is invisible in unit tests. This tool fetches `tokenizer.json` from HF Hub for both ids and compares.
Gated models (Llama, Mistral, Gemma) require HF login + license acceptance; this tool can't auth, so they return 401. Use open-weight pairs (Qwen, Phi, DeepSeek, Yi, StarCoder, Falcon) for demos.
Multilingual Tokenizer Tax
Why this matters: tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers: no estimation, real BPE encoding via transformers.js in your browser.
Don't 3× your bill on Chinese support. Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.
First-time load: the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local; your text never leaves the browser.
LongScore
Why this matters: every model claims a 128K context window, but accuracy degrades long before that. LongScore (a peer-reviewed metric from 100-LongBench, ACL 2025) measures relative degradation past short context. It disentangles base ability from true long-ctx capability, so you compare degradation, not raw scores. Lookup against the RULER + HELMET KBs (n=93 models).
How much does your model degrade past short context? Paste an HF model id → see LongScore (relative degradation) + per-length breakdown + HELMET 7-task scores when available. No GPU. No inference. Pure lookup against published benchmarks.
LongScore = mean over l ∈ {16K, 32K, 64K, 128K} of (S_l - Base) / Base, where Base = mean(S_4K, S_8K). Source: 100-LongBench, ACL 2025. Data: NVIDIA RULER (per-length, n=33) + HELMET (aggregate at 128K, n=60). 0 = no degradation; -0.30 = severe.
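The formula above in runnable form (the per-length scores in the example are made up):

```python
def longscore(scores: dict) -> float:
    """LongScore as defined above: mean over long lengths of (S_l - Base)/Base,
    with Base = mean(S_4K, S_8K)."""
    base = (scores["4K"] + scores["8K"]) / 2
    longs = ["16K", "32K", "64K", "128K"]
    return sum((scores[l] - base) / base for l in longs) / len(longs)

# Illustrative per-length scores for one model:
print(round(longscore({"4K": 92, "8K": 90, "16K": 88, "32K": 85,
                       "64K": 80, "128K": 66}), 3))  # -> -0.124
```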
Solutions Hub
Map of every documented LLM-eval pain we know about: which tafagent mode addresses it (if any), and the best-of-breed external tools the community already trusts. Goal: full coverage. If a canonical tool exists elsewhere, we link rather than rebuild.
Don't reinvent; find. 30+ pains mapped to tafagent modes + curated external tools. Browse by category, search by keyword, or see the gaps where new modes would help most.
Recipe
Inputs
Verdict
Computation Chain
Every number below is deterministic Python. Click a step to expand.
Plain-English Answer
TAF Card: full model profile
Comparison Table
Import a shared TAF result
Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally.
Same view as if you'd run it yourself.
Recent community submissions
Live feed from the public registry. Click any submission to view full analysis.
Browse all →
Paper predictions: falsification status
The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested.
Here's the live status of every prediction in the paper.