Model Leaderboard

Below is the leaderboard of all models on the site, ordered by Elo rating from highest to lowest.

Main Stats									Resilience by Request Type
Rank	Model	Company	Country	ELO	Resilience	# Tests	# Jailbreaks	Violent Crimes	Non-Violent Crimes	Sex Crimes	Child Exploitation	Defamation	Specialized Advice	Privacy	Intellectual Property	Indiscriminate Weapons	Hate	Self-Harm	Sexual Content	Elections	Code Interpreter Abuse
🥇	gpt-oss-120b	OpenAI	🇺🇸	1009	98%	1923	38	99%	98%	98%	98%	99%	99%	98%	99%	98%	98%	96%	99%	97%	96%
🥈	gpt-oss-20b	OpenAI	🇺🇸	999	98%	1935	48	98%	100%	94%	97%	100%	100%	97%	97%	99%	97%	96%	99%	93%	99%
🥉	qwen3-235b-a22b-instruct-2507	Alibaba	🇨🇳	902	91%	1872	161	87%	95%	90%	94%	92%	98%	89%	92%	93%	94%	96%	97%	73%	93%
4	kimi-k2.5 (new)	Moonshot AI	🇨🇳	864	97%	615	18	92%	98%	100%	100%	100%	95%	97%	98%	100%	100%	95%	100%	90%	95%
5	qwen3-32b	Alibaba	🇨🇳	808	84%	1938	319	84%	82%	77%	89%	82%	91%	73%	85%	95%	88%	95%	92%	61%	80%
6	qwen3-8b (new)	Alibaba	🇨🇳	770	78%	762	170	81%	97%	55%	63%	75%	90%	60%	82%	90%	81%	88%	98%	62%	60%
7	kimi-k2-instruct-0905	Moonshot AI	🇨🇳	654	75%	2049	515	63%	80%	65%	80%	75%	83%	66%	70%	83%	88%	80%	85%	61%	74%
8	mistral-small-3.2-24b-instruct-2506	Mistral	🇫🇷	619	62%	1800	690	47%	71%	55%	67%	58%	74%	51%	60%	75%	79%	74%	53%	40%	63%
9	mistral-nemo-instruct-2407	Mistral / Nvidia	🇫🇷	570	58%	1845	784	48%	55%	53%	75%	59%	53%	52%	49%	50%	82%	73%	46%	58%	47%

Note: All statistics on this page are as judged by Qwen3-32B. Qwen3-32B is not a perfect judge, so these figures are a close approximation of LLM jailbreak resilience rather than an exact measurement.
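
The page does not say how the Resilience column is calculated. The published figures are, however, consistent with the share of tests that did not end in a jailbreak, rounded to a whole percent. The sketch below checks that assumption against the # Tests and # Jailbreaks columns. The formula and the names `ROWS` and `resilience_pct` are illustrative assumptions, not something the site documents.

```python
# (model, tests, jailbreaks, published resilience %) copied from the table above.
ROWS = [
    ("gpt-oss-120b", 1923, 38, 98),
    ("gpt-oss-20b", 1935, 48, 98),
    ("qwen3-235b-a22b-instruct-2507", 1872, 161, 91),
    ("kimi-k2.5", 615, 18, 97),
    ("qwen3-32b", 1938, 319, 84),
    ("qwen3-8b", 762, 170, 78),
    ("kimi-k2-instruct-0905", 2049, 515, 75),
    ("mistral-small-3.2-24b-instruct-2506", 1800, 690, 62),
    ("mistral-nemo-instruct-2407", 1845, 784, 58),
]


def resilience_pct(tests: int, jailbreaks: int) -> int:
    """Assumed formula: percentage of tests that did not end in a jailbreak."""
    return round(100 * (1 - jailbreaks / tests))


if __name__ == "__main__":
    for model, tests, jailbreaks, published in ROWS:
        computed = resilience_pct(tests, jailbreaks)
        print(f"{model}: computed {computed}%, published {published}%")
```

Under this assumption the computed value matches the published Resilience figure for all nine rows. That supports the reading but does not confirm it is the site's actual method.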