Model Leaderboard
Below is the leaderboard of all models on the site, ordered by ELO.
The first eight columns are main stats; the remaining columns give resilience by request type.

| Rank | Model | Company | Country | ELO | Resilience | # Tests | # Jailbreaks | Violent Crimes | Non-Violent Crimes | Sex Crimes | Child Exploitation | Defamation | Specialized Advice | Privacy | Intellectual Property | Indiscriminate Weapons | Hate | Self-Harm | Sexual Content | Elections | Code Interpreter Abuse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | gpt-oss-120b | OpenAI | 🇺🇸 | 1009 | 98% | 1923 | 38 | 99% | 98% | 98% | 98% | 99% | 99% | 98% | 99% | 98% | 98% | 96% | 99% | 97% | 96% |
| 🥈 | gpt-oss-20b | OpenAI | 🇺🇸 | 999 | 98% | 1935 | 48 | 98% | 100% | 94% | 97% | 100% | 100% | 97% | 97% | 99% | 97% | 96% | 99% | 93% | 99% |
| 🥉 | qwen3-235b-a22b-instruct-2507 | Alibaba | 🇨🇳 | 902 | 91% | 1872 | 161 | 87% | 95% | 90% | 94% | 92% | 98% | 89% | 92% | 93% | 94% | 96% | 97% | 73% | 93% |
| 4 | kimi-k2.5 (new) | Moonshot AI | 🇨🇳 | 864 | 97% | 615 | 18 | 92% | 98% | 100% | 100% | 100% | 95% | 97% | 98% | 100% | 100% | 95% | 100% | 90% | 95% |
| 5 | qwen3-32b | Alibaba | 🇨🇳 | 808 | 84% | 1938 | 319 | 84% | 82% | 77% | 89% | 82% | 91% | 73% | 85% | 95% | 88% | 95% | 92% | 61% | 80% |
| 6 | qwen3-8b (new) | Alibaba | 🇨🇳 | 770 | 78% | 762 | 170 | 81% | 97% | 55% | 63% | 75% | 90% | 60% | 82% | 90% | 81% | 88% | 98% | 62% | 60% |
| 7 | kimi-k2-instruct-0905 | Moonshot AI | 🇨🇳 | 654 | 75% | 2049 | 515 | 63% | 80% | 65% | 80% | 75% | 83% | 66% | 70% | 83% | 88% | 80% | 85% | 61% | 74% |
| 8 | mistral-small-3.2-24b-instruct-2506 | Mistral | 🇫🇷 | 619 | 62% | 1800 | 690 | 47% | 71% | 55% | 67% | 58% | 74% | 51% | 60% | 75% | 79% | 74% | 53% | 40% | 63% |
| 9 | mistral-nemo-instruct-2407 | Mistral / Nvidia | 🇫🇷 | 570 | 58% | 1845 | 784 | 48% | 55% | 53% | 75% | 59% | 53% | 52% | 49% | 50% | 82% | 73% | 46% | 58% | 47% |
Note: All statistics on this page are as judged by Qwen3-32B. Because Qwen3-32B is not a perfect judge, these figures represent a close approximation of LLM jailbreak resilience rather than an exact measurement.
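The page does not state how the Resilience column is computed. The sketch below is only an assumption: it treats resilience as the share of tests that did not end in a jailbreak, rounded to a whole percent. The function name `resilience_pct` and the `ROWS` dictionary are hypothetical, introduced here for illustration; the numbers are copied from the # Tests, # Jailbreaks, and Resilience columns of the table.

```python
# Assumption (not stated on the page): resilience = 1 - jailbreaks / tests,
# rounded to a whole percent.

# model -> (tests, jailbreaks, listed resilience in percent), copied from the table
ROWS = {
    "gpt-oss-120b": (1923, 38, 98),
    "gpt-oss-20b": (1935, 48, 98),
    "qwen3-235b-a22b-instruct-2507": (1872, 161, 91),
    "kimi-k2.5": (615, 18, 97),
    "qwen3-32b": (1938, 319, 84),
    "qwen3-8b": (762, 170, 78),
    "kimi-k2-instruct-0905": (2049, 515, 75),
    "mistral-small-3.2-24b-instruct-2506": (1800, 690, 62),
    "mistral-nemo-instruct-2407": (1845, 784, 58),
}


def resilience_pct(tests: int, jailbreaks: int) -> int:
    """Percentage of tests without a jailbreak, under the assumed definition."""
    return round(100 * (1 - jailbreaks / tests))


if __name__ == "__main__":
    # Compare the value computed under the assumption with the listed value.
    for model, (tests, jailbreaks, listed) in ROWS.items():
        computed = resilience_pct(tests, jailbreaks)
        status = "matches" if computed == listed else "differs"
        print(f"{model}: computed {computed}%, listed {listed}% ({status})")
```

For the nine rows in the table, this assumed formula reproduces the listed Resilience value in every case. That agreement is suggestive, not confirmation of the site's actual method. The ELO column cannot be derived this way from the figures shown.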