
petergpt/bullshit-benchmark

BullshitBench is an open-source benchmark by Peter Gostev that tests one thing most leaderboards ignore: will a model call out nonsense, or will it confidently run with a broken premise? The v2 set is 100 nonsensical, ill-posed or logically flawed prompts spread across five domains: software (40), finance (15), legal (15), medical (15) and physics (15). A three-judge panel sorts each model response into one of three buckets: clear pushback (the model rejects the broken premise outright), partial challenge (it flags the problem but still engages with it), or accepted nonsense (it treats the bad input as valid and answers anyway). The repo ships the prompt set, the scoring harness and a public HTML leaderboard viewer. The code is primarily Python, and the project is MIT-licensed.
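To make the three-bucket scoring concrete, here is a minimal sketch of how three judge labels per response could be collapsed into one verdict and then into per-bucket rates. The majority-vote rule, the tie-break to "partial challenge", and all names (`panel_verdict`, `bucket_rates`, the label strings) are assumptions for illustration; the repo's actual harness may aggregate judges differently.

```python
from collections import Counter

# Bucket names mirror the three categories described above (label strings are assumed).
BUCKETS = ("clear_pushback", "partial_challenge", "accepted_nonsense")


def panel_verdict(judge_labels):
    """Return the label at least two of three judges agree on.

    Assumption: if all three judges disagree, fall back to the middle
    bucket. The real harness may resolve ties another way.
    """
    label, votes = Counter(judge_labels).most_common(1)[0]
    return label if votes >= 2 else "partial_challenge"


def bucket_rates(verdicts):
    """Share of responses that landed in each bucket."""
    counts = Counter(verdicts)
    return {bucket: counts[bucket] / len(verdicts) for bucket in BUCKETS}


# Hypothetical judge labels for four responses from one model.
responses = [
    ("clear_pushback", "clear_pushback", "partial_challenge"),
    ("accepted_nonsense", "accepted_nonsense", "accepted_nonsense"),
    ("clear_pushback", "partial_challenge", "accepted_nonsense"),
    ("partial_challenge", "partial_challenge", "clear_pushback"),
]

verdicts = [panel_verdict(labels) for labels in responses]
rates = bucket_rates(verdicts)
print(verdicts)
print(rates)
```

With the made-up labels above, one response counts as clear pushback, two as partial challenge and one as accepted nonsense.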

other
Python

Why It Matters

Most benchmarks measure how smart a model is. BullshitBench measures whether it's honest, and the June 9, 2026 update (which evaluated 164 model variants and went viral via Gostev's X thread) exposed a gap nobody markets. Claude Opus 4.8 pushes back on bad premises about 95% of the time; GPT-5.5 sits near 45%, accepting more than half the nonsense thrown at it. The killer finding: cranking up reasoning effort barely helps. GPT-5.5 moved from ~45% to ~47% at maximum reasoning, and several high-reasoning variants scored a hair below their standard settings. That directly contradicts the assumption that pricier, slower-thinking reasoning models are the reliable ones. When a model starts from your false premise, extra reasoning often just builds a more convincing argument for the wrong answer. If you pick models for tasks where your own premise might be shaky (legal, medical, finance), this is the benchmark to check before you trust the output. Caveat: it's a focused 100-prompt probe of one behavior, not a general capability score. Read it as "does this model challenge me?", not "is this model good?"
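The 100-prompt caveat also limits how much a two-point move can mean. The sketch below computes a 95% Wilson score interval for a pushback rate, assuming each of the 100 prompts yields one pass/fail outcome; that scoring model and the function name `wilson_interval` are assumptions for illustration, not the repo's own method.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts: ~45% and ~47% pushback on a 100-prompt set.
standard = wilson_interval(45, 100)
max_reasoning = wilson_interval(47, 100)

# True when the two intervals share any range.
overlap = standard[1] > max_reasoning[0] and max_reasoning[1] > standard[0]

print(f"standard:      {standard[0]:.3f} to {standard[1]:.3f}")
print(f"max reasoning: {max_reasoning[0]:.3f} to {max_reasoning[1]:.3f}")
print("intervals overlap:", overlap)
```

Under these assumptions each interval spans roughly ten points either side of the estimate, so a move from ~45% to ~47% sits well inside the noise. Both runs use the same prompts, so a paired comparison would be the stricter test; the overlap check is only a rough guide.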

