I just published a new LLM Benchmark tool called llmtester.
Over 50,000 tests across 13 categories!
This is the easiest way to benchmark LLMs and see where they go wrong.
- Interactive CLI - Keyboard-driven benchmark selection and configuration
- Multi-Provider Support - OpenAI, Anthropic, Together.ai, Groq, Fireworks AI, Perplexity, OpenRouter, and any OpenAI-compatible API
- LLM-as-Judge - Optional secondary model evaluation for code, math, SQL, bash, and truthfulness benchmarks
- Progress Tracking - Resume interrupted evaluations from where you left off
- Result Explorer - Built-in TUI to browse past results, filter by pass/fail, and inspect individual responses
- Config Persistence - Saves provider, endpoint, and model settings between runs
- Shuffle & Sampling - Run a percentage of each benchmark with optional shuffling for diverse distribution
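To make the shuffle-and-sampling feature concrete, here is a minimal sketch of how percentage sampling with optional shuffling can work. This is my own illustration, not llmtester's actual code: `sampleBenchmark`, the `TestCase` shape, and the parameter names are all hypothetical.

```typescript
// Hypothetical sketch of percentage sampling with optional shuffling.
// Not llmtester's real implementation.
interface TestCase {
  id: string;
  prompt: string;
}

function sampleBenchmark(
  cases: TestCase[],
  percent: number, // e.g. 10 runs 10% of the benchmark
  shuffle: boolean
): TestCase[] {
  const pool = [...cases];
  if (shuffle) {
    // Fisher-Yates shuffle for a uniform random order
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
  }
  // Always run at least one test, even for tiny percentages
  const count = Math.max(1, Math.round(pool.length * (percent / 100)));
  return pool.slice(0, count);
}
```

Shuffling before slicing is what gives the "diverse distribution": without it, a 10% run would always hit the same first tenth of the benchmark file.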
Run the latest version without installing anything: `npx llmtester`
What makes llmtester so awesome?
50,000+ tests, a second LLM that judges results on more complex tests, and the ability to fully explore past results.
I got tired of doing everything manually, so I built a full test runner.
It includes tests across many domains: grade-school math, advanced math, reasoning, programming, and even SQL. Some of these tests are impossible to grade without a judge; llmtester handles all of this for you!
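The LLM-as-judge flow boils down to asking a second model to grade an answer and parsing its verdict. A minimal sketch of that idea, assuming a plain PASS/FAIL protocol (the prompt wording, `buildJudgePrompt`, and `parseVerdict` are hypothetical and not llmtester's actual code):

```typescript
// Hypothetical judge helpers -- not llmtester's real API.
// Builds the grading prompt sent to the secondary (judge) model.
function buildJudgePrompt(question: string, expected: string, answer: string): string {
  return [
    "You are grading a model's answer.",
    `Question: ${question}`,
    `Reference answer: ${expected}`,
    `Model answer: ${answer}`,
    'Reply with exactly "PASS" or "FAIL".',
  ].join("\n");
}

// Extracts a pass/fail verdict from the judge model's reply,
// tolerating extra whitespace and case differences.
function parseVerdict(reply: string): boolean {
  const normalized = reply.trim().toUpperCase();
  return normalized.includes("PASS") && !normalized.includes("FAIL");
}
```

The point of a judge pass like this is that free-form answers (a proof, a bash one-liner, a SQL query) can be graded semantically rather than by exact string match.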
You can find the package on npm and GitHub.