Model Leaderboard
Below is the leaderboard of all models on the site, ordered by ELO.
The first eight columns (Rank through # Jailbreaks) are the main stats; the remaining columns give resilience by request type.

| Rank | Model | Company | Country | ELO | Resilience | # Tests | # Jailbreaks | Violent Crimes | Non-Violent Crimes | Sex Crimes | Child Exploitation | Defamation | Specialized Advice | Privacy | Intellectual Property | Indiscriminate Weapons | Hate | Self-Harm | Sexual Content | Elections | Code Interpreter Abuse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | gpt-oss-20b | OpenAI | 🇺🇸 | 958 | 98% | 2436 | 51 | 98% | 100% | 94% | 98% | 100% | 100% | 98% | 98% | 99% | 97% | 97% | 99% | 95% | 99% |
| 🥈 | gpt-oss-120b | OpenAI | 🇺🇸 | 935 | 98% | 2394 | 41 | 99% | 98% | 98% | 97% | 99% | 99% | 98% | 98% | 98% | 99% | 97% | 99% | 97% | 97% |
| 🥉 | kimi-k2.5 (new!) | Moonshot AI | 🇨🇳 | 891 | 97% | 1161 | 34 | 94% | 98% | 97% | 99% | 99% | 97% | 96% | 99% | 100% | 100% | 96% | 100% | 90% | 97% |
| 4 | qwen3-235b-a22b-instruct-2507 | Alibaba | 🇨🇳 | 849 | 91% | 2487 | 234 | 88% | 93% | 88% | 94% | 91% | 98% | 90% | 89% | 93% | 94% | 95% | 96% | 72% | 92% |
| 5 | qwen3-32b | Alibaba | 🇨🇳 | 779 | 83% | 2466 | 429 | 84% | 82% | 74% | 86% | 80% | 91% | 72% | 83% | 93% | 86% | 93% | 90% | 63% | 81% |
| 6 | qwen3-8b (new!) | Alibaba | 🇨🇳 | 778 | 79% | 1326 | 278 | 83% | 87% | 63% | 68% | 78% | 90% | 60% | 82% | 89% | 80% | 89% | 97% | 68% | 70% |
| 7 | kimi-k2-instruct-0905 | Moonshot AI | 🇨🇳 | 725 | 75% | 2550 | 631 | 64% | 80% | 65% | 79% | 77% | 84% | 66% | 69% | 84% | 84% | 80% | 84% | 61% | 77% |
| 8 | mistral-small-3.2-24b-instruct-2506 | Mistral | 🇫🇷 | 671 | 62% | 2367 | 897 | 49% | 69% | 52% | 68% | 56% | 76% | 51% | 63% | 77% | 76% | 71% | 59% | 41% | 64% |
| 9 | mistral-nemo-instruct-2407 | Mistral / Nvidia | 🇫🇷 | 610 | 57% | 2394 | 1026 | 48% | 54% | 51% | 71% | 56% | 56% | 51% | 52% | 53% | 83% | 74% | 46% | 55% | 46% |
Note: All statistics on this page are as judged by Qwen3-32B. Qwen3-32B is not a perfect judge, so these figures are a close approximation of LLM jailbreak resilience rather than an exact measurement.
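
The page does not say how the Resilience column is computed. The figures in the table are consistent with it being the share of tests that did not end in a jailbreak, rounded to the nearest percent. The sketch below checks that reading against a few rows. The formula and the function name `resilience_pct` are assumptions made for illustration, not something the site documents.

```python
def resilience_pct(tests: int, jailbreaks: int) -> int:
    """Percent of tests that did not end in a jailbreak.

    Assumed formula (inferred from the table, not documented by the site):
    round(100 * (tests - jailbreaks) / tests)
    """
    if tests <= 0:
        raise ValueError("tests must be positive")
    if not 0 <= jailbreaks <= tests:
        raise ValueError("jailbreaks must be between 0 and tests")
    return round(100 * (tests - jailbreaks) / tests)


# (# Tests, # Jailbreaks) copied from three rows of the leaderboard.
rows = {
    "gpt-oss-20b": (2436, 51),
    "qwen3-32b": (2466, 429),
    "mistral-nemo-instruct-2407": (2394, 1026),
}

for model, (tests, jailbreaks) in rows.items():
    print(f"{model}: {resilience_pct(tests, jailbreaks)}%")
# gpt-oss-20b: 98%
# qwen3-32b: 83%
# mistral-nemo-instruct-2407: 57%
```

These values match the Resilience column for the same rows. Since the jailbreak counts come from the Qwen3-32B judge, any judging error carries straight through to this percentage.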