I just published a new LLM Benchmark tool called llmtester.
Over 50,000 tests across 13 categories!
This is the easiest way to benchmark LLMs and see where they go wrong.
- Interactive CLI - Keyboard-driven benchmark selection and configuration
- Multi-Provider Support - OpenAI, Anthropic, Together.ai, Groq, Fireworks AI, Perplexity, OpenRouter, and any OpenAI-compatible API
- LLM-as-Judge - Optional secondary model evaluation for code, math, SQL, bash, and truthfulness benchmarks
- Progress Tracking - Resume interrupted evaluations from where you left off
- Result Explorer - Built-in TUI to browse past results, filter by pass/fail, and inspect individual responses
- Config Persistence - Saves provider, endpoint, and model settings between runs
- Shuffle & Sampling - Run a percentage of each benchmark with optional shuffling for diverse distribution
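To make the shuffle-and-sampling feature concrete, here is a minimal sketch of how percentage sampling with optional shuffling can work. This is my own illustration, not llmtester's actual code: `sampleBenchmark`, the `TestCase` shape, and the parameter names are all hypothetical.

```typescript
// Hypothetical sketch of percentage sampling with optional shuffling.
// Not llmtester's real implementation.
interface TestCase {
  id: string;
  prompt: string;
}

function sampleBenchmark(
  cases: TestCase[],
  percent: number, // e.g. 10 runs 10% of the benchmark
  shuffle: boolean
): TestCase[] {
  const pool = [...cases];
  if (shuffle) {
    // Fisher-Yates shuffle for a uniform random order
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
  }
  // Always run at least one test, even for tiny percentages
  const count = Math.max(1, Math.round(pool.length * (percent / 100)));
  return pool.slice(0, count);
}
```

Shuffling before slicing is what gives the "diverse distribution": without it, a 10% run would always hit the same first tenth of the benchmark file.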
Run the latest version without installing anything: `npx llmtester`
What makes llmtester so awesome?
50,000+ tests, a second LLM that judges results on more complex tests, and the ability to fully explore past results.
I got tired of doing everything manually, so I built a full test runner.
It includes tests across many domains: grade-school math, advanced math, reasoning, programming, and even SQL. Some of these tests are impossible to grade without a judge; llmtester handles all of this for you!
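The LLM-as-judge flow boils down to asking a second model to grade an answer and parsing its verdict. A minimal sketch of that idea, assuming a plain PASS/FAIL protocol (the prompt wording, `buildJudgePrompt`, and `parseVerdict` are hypothetical and not llmtester's actual code):

```typescript
// Hypothetical judge helpers -- not llmtester's real API.
// Builds the grading prompt sent to the secondary (judge) model.
function buildJudgePrompt(question: string, expected: string, answer: string): string {
  return [
    "You are grading a model's answer.",
    `Question: ${question}`,
    `Reference answer: ${expected}`,
    `Model answer: ${answer}`,
    'Reply with exactly "PASS" or "FAIL".',
  ].join("\n");
}

// Extracts a pass/fail verdict from the judge model's reply,
// tolerating extra whitespace and case differences.
function parseVerdict(reply: string): boolean {
  const normalized = reply.trim().toUpperCase();
  return normalized.includes("PASS") && !normalized.includes("FAIL");
}
```

The point of a judge pass like this is that free-form answers (a proof, a bash one-liner, a SQL query) can be graded semantically rather than by exact string match.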
You can find the package on npm and GitHub.