🔬 TAF Agent

Diagnose any transformer LLM in 30 seconds. Free. No GPU. No signup.

Predicts whether a model will work for your use case before you spend money or time. Everything runs in your browser — your inputs never leave this tab.

Built by an independent researcher. Open source. Not affiliated with any model vendor.


🎯 What do you want to do?

Pick a task. Each one opens the right tool below. Or scroll down for the full list of modes.

🔬 Diagnose a model Start here when you have a specific model id and want a full diagnostic: Profile runs all 5 recipes at once. Unmask checks whether the advertised max_position_embeddings is honest. NIAH→Reason predicts the retrieval-vs-reasoning gap. LongScore looks up published RULER + HELMET data for the model and shows the real degradation past short context (peer-reviewed metric). Quant predicts whether quantizing will break it. Inspect lets you paste raw config.json for private or in-development models.

Will this specific model work for my use case?

✓ Trust a benchmark score When you see a score and want to know if it's real. Contamination scores 20+ benchmarks for the likelihood the model saw them during training. Drift tells you whether a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). Arena CI reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.

Should I believe this number? Bug or noise?
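The Arena CI idea can be sketched with a normal-approximation confidence interval on the head-to-head win rate. This is a simplified stand-in for the tool's actual reconstruction, and the battle counts below are made up for illustration:

```python
import math

def winrate_ci(wins: int, losses: int, ties: int, z: float = 1.96):
    """95% normal-approximation CI on the head-to-head win rate.
    Ties are split evenly between the two models (one common convention)."""
    n = wins + losses + ties
    p = (wins + 0.5 * ties) / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical battle counts: 520 wins, 480 losses, 200 ties.
lo, hi = winrate_ci(520, 480, 200)
print(f"win-rate CI: [{lo:.3f}, {hi:.3f}]")
# An interval that straddles 0.5 means the Elo "win" is statistically a tie.
```

With these numbers the interval is roughly [0.49, 0.54], so despite a positive Elo gap the matchup is a statistical tie.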

βš™οΈ Set up an eval correctly Before you run lm-eval-harness or vLLM serve, get the right CLI flag. Chat-template Sniffer detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact --apply_chat_template / --chat-template invocation. Solves issue #1841 in lm-eval-harness (silent Γ·2 accuracy). Diagnose CLI generates the Python command to measure Ξ³_obs on your local GPU.

Get the exact CLI flag for lm-eval / vLLM / transformers.
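A minimal sketch of how a template sniffer can work: match distinctive literals from each family's Jinja chat template, then print the corresponding eval invocation. The marker strings are real tokens from those template families, but this tiny heuristic and the helper names are illustrative, not the tool's actual implementation; the model arguments in the emitted commands are placeholders.

```python
def sniff_family(chat_template: str) -> str:
    """Rough template-family heuristic: look for each family's
    distinctive special tokens inside the Jinja template source."""
    markers = [
        ("<|start_header_id|>", "llama-3"),
        ("<|im_start|>", "chatml"),
        ("[INST]", "mistral"),
        ("<|user|>", "phi-3"),
    ]
    for needle, family in markers:
        if needle in chat_template:
            return family
    return "custom-or-none"

def suggest_flags(family: str) -> str:
    """Emit eval invocations. The flags are real lm-eval / vLLM flags;
    MODEL and template.jinja are placeholders."""
    if family == "custom-or-none":
        return "no template detected: run lm-eval WITHOUT --apply_chat_template"
    return ("lm_eval --model hf --apply_chat_template ...   # lm-eval-harness\n"
            "vllm serve MODEL --chat-template template.jinja # vLLM, if overriding")

print(suggest_flags(sniff_family("<|im_start|>user\nhi<|im_end|>")))
```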

🆚 Compare models Compare: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). Phase diagram: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.

Side-by-side, or browse the empirical model landscape.

📋 Manual / free-form Recipe: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. Ask: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.

Pick a specific recipe by hand, or ask in plain English.

🎯 Modes: 7 available. Most users want 📇 Profile (one-click full diagnosis).
📇 Profile: paste a model id → 5-recipe TAF Card.
🆚 Compare: 2-3 models side-by-side on one recipe.
🔍 Inspect: paste raw config.json to debug parameters.
💬 Ask: free-form question, browser LLM picks the recipe.
📋 Recipe: manual selection with full form control.
🩺 Diagnose CLI: generate the Python command to measure γ on real weights.
📊 Phase diagram: explore 23 panel models on the (log θ, γ) plane.

Quickest start: paste any HuggingFace model id (e.g. meta-llama/Meta-Llama-3-8B), click Profile. See all 5 recipes scored in seconds.
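To illustrate the kind of config.json fields Inspect and Unmask examine, here is a minimal sketch. The field names (max_position_embeddings, rope_scaling, etc.) are standard HuggingFace config keys, but the sample values and the RoPE-factor "honesty" heuristic are assumptions for illustration, not necessarily the tool's exact check:

```python
import json

# Sample config.json fragment in the style of a RoPE-extended Llama model
# (values invented for illustration).
raw = """{
  "model_type": "llama",
  "max_position_embeddings": 131072,
  "rope_scaling": {"rope_type": "llama3", "factor": 8.0},
  "num_hidden_layers": 32,
  "hidden_size": 4096
}"""

cfg = json.loads(raw)
claimed = cfg["max_position_embeddings"]
scaling = cfg.get("rope_scaling") or {}
factor = scaling.get("factor", 1.0)
native = int(claimed / factor)  # rough pre-scaling training context

print(f"claimed context: {claimed}, RoPE factor: {factor}, native ~ {native}")
if factor > 1.0:
    print("context window is RoPE-extended; expect degradation past native length")
```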

💡 Quick start: pick any preset → click Generate. Or paste a model id from HF Hub trending → 📥 Fetch → Generate.

📇 Profile a model One-click full diagnosis. Paste any HF model id (or pick a preset). The tool runs all 5 recipes (long-context, KV-compression, custom-vs-API, budget, hardware) and produces a single TAF Card showing the verdict per dimension + key numbers + architecture classification.

Use case: "I'm evaluating Qwen2.5-32B for production — what's its full viability profile?" → paste id → Profile → done.

For technicians: when you need a complete viability snapshot of a candidate model. Outputs match the paper's §sec:gamma_decomposition format.

📂 Import a shared TAF result

Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.

🌐 Recent community submissions

Live feed from the public registry. Click any submission to view the full analysis. Browse all →


🔬 Paper predictions — falsification status

The TAF framework rests on falsifiable predictions (F1-F23). Each is empirically tested. Here's the live status of every prediction in the paper.