Model Leaderboard
Below is the leaderboard of all models on the site, ordered by ELO from highest to lowest.
Columns Rank through # Jailbreaks are the main stats; the columns from Violent Crimes onward show resilience by request type.

| Rank | Model | Company | Country | ELO | Resilience | # Tests | # Jailbreaks | Violent Crimes | Non-Violent Crimes | Sex Crimes | Child Exploitation | Defamation | Specialized Advice | Privacy | Intellectual Property | Indiscriminate Weapons | Hate | Self-Harm | Sexual Content | Elections | Code Interpreter Abuse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | gpt-oss-120b | OpenAI | 🇺🇸 | 958 | 99% | 5136 | 49 | 99% | 99% | 99% | 99% | 100% | 99% | 99% | 99% | 99% | 99% | 99% | 99% | 98% | 99% |
| 🥈 | gpt-oss-20b | OpenAI | 🇺🇸 | 922 | 98% | 5052 | 90 | 99% | 100% | 96% | 98% | 100% | 100% | 98% | 99% | 99% | 98% | 98% | 100% | 90% | 99% |
| 🥉 | kimi-k2.5 (new) | Moonshot AI | 🇨🇳 | 891 | 96% | 1809 | 77 | 92% | 96% | 97% | 96% | 99% | 96% | 92% | 98% | 99% | 98% | 95% | 97% | 93% | 96% |
| 4 | qwen3-235b-a22b-instruct-2507 | Alibaba | 🇨🇳 | 820 | 94% | 5238 | 310 | 93% | 96% | 93% | 96% | 95% | 99% | 94% | 94% | 96% | 92% | 97% | 96% | 81% | 96% |
| 5 | qwen3-32b | Alibaba | 🇨🇳 | 784 | 89% | 5328 | 574 | 91% | 90% | 84% | 91% | 88% | 94% | 83% | 89% | 96% | 90% | 96% | 94% | 71% | 89% |
| 6 | qwen3-8b (new) | Alibaba | 🇨🇳 | 778 | 82% | 1728 | 303 | 84% | 90% | 70% | 73% | 81% | 92% | 70% | 86% | 91% | 79% | 91% | 97% | 71% | 76% |
| 7 | kimi-k2-instruct-0905 | Moonshot AI | 🇨🇳 | 725 | 75% | 2550 | 631 | 64% | 80% | 65% | 79% | 77% | 84% | 66% | 69% | 84% | 84% | 80% | 84% | 61% | 77% |
| 8 | mistral-small-3.2-24b-instruct-2506 | Mistral | 🇫🇷 | 661 | 76% | 5127 | 1252 | 66% | 83% | 71% | 79% | 72% | 86% | 71% | 76% | 87% | 86% | 84% | 74% | 48% | 77% |
| 9 | mistral-nemo-instruct-2407 | Mistral / Nvidia | 🇫🇷 | 651 | 73% | 5298 | 1423 | 66% | 72% | 70% | 80% | 69% | 77% | 69% | 72% | 77% | 87% | 85% | 69% | 59% | 71% |

Note: All statistics on this page are as judged by Qwen3-32B. Because Qwen3-32B is not a perfect judge, these figures are a close approximation of LLM jailbreak resilience rather than an exact measurement.
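
The page does not define how the Resilience column is computed. As a rough check, the reported figures appear consistent with the share of tests that did not end in a jailbreak, rounded to a whole percent. The sketch below tests that assumption against the # Tests and # Jailbreaks values copied from the table; the formula and the helper name `resilience_pct` are assumptions made for illustration, not the site's documented method.

```python
# (tests, jailbreaks, reported resilience %) copied from the leaderboard table.
ROWS = {
    "gpt-oss-120b": (5136, 49, 99),
    "gpt-oss-20b": (5052, 90, 98),
    "kimi-k2.5": (1809, 77, 96),
    "qwen3-235b-a22b-instruct-2507": (5238, 310, 94),
    "qwen3-32b": (5328, 574, 89),
    "qwen3-8b": (1728, 303, 82),
    "kimi-k2-instruct-0905": (2550, 631, 75),
    "mistral-small-3.2-24b-instruct-2506": (5127, 1252, 76),
    "mistral-nemo-instruct-2407": (5298, 1423, 73),
}


def resilience_pct(tests: int, jailbreaks: int) -> int:
    """Assumed formula: percentage of tests with no jailbreak, rounded."""
    return round(100 * (tests - jailbreaks) / tests)


# Compare the assumed formula with the reported Resilience value per model.
for model, (tests, jailbreaks, reported) in ROWS.items():
    computed = resilience_pct(tests, jailbreaks)
    status = "matches" if computed == reported else "differs"
    print(f"{model}: computed {computed}%, reported {reported}% ({status})")
```

Note that ELO and Resilience need not rank models identically: kimi-k2-instruct-0905 sits above mistral-small-3.2-24b-instruct-2506 by ELO (725 vs 661) while reporting slightly lower resilience (75% vs 76%).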