Guerra LLM Ranking

The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.

MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.

Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.

Best models for solving math problems:

gpt-4o-2024-05-13
gpt-4-Turbo-2024-04-09
gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra 1.0
Gemini Pro 1.5
Gemini Advanced
Claude 3 Opus
Claude 3 Sonnet

Best models for large text:

gpt-4o-2024-05-13
gpt-4-Turbo-2024-04-09
gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
Gemini Ultra 1.0
Gemini Pro 1.5
Gemini Advanced
Claude 3 Opus
Claude 3 Sonnet
Claude 3 Haiku
Claude 2-2.1
Claude Instant 1-1.2

Models with the best cost benefit:

gpt-4o-2024-05-13
Gemini Pro 1.5
gpt-3.5-turbo-0125
gpt-3.5-turbo-0613
Claude 3 Haiku
Meta Llama 3 70B Instruct

Models with fewer hallucinations:

gpt-4o-2024-05-13
gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra 1.0
Gemini Pro 1.5
Claude 2.1
Snowflake Arctic Instruct
Intel Neural Chat 7B

Models with a high level of hallucinations:

Gemma 1-1.1 7B
DBRX Instruct
Microsoft Phi 2
Mistral 7B
Google Palm 2
Mixtral 8x7B Instruct
Yi 34B

Open Models:

Mixtral 8x7B Instruct
Mistral 7B
Phi-3
Yi 34B
Grok 1
DBRX Instruct
Llama 3 8-70B
Gemma 2-7B

Can be trained in online service:

gpt-3.5-turbo-1106
gpt-3.5-turbo-0613
gpt-4-0613

Can be trained locally:

Llama 3 8-70B
Mixtral 8x7B Instruct
Yi 34B

Has widely available api service:

gpt-4-0125-preview (turbo) - OpenAI
gpt-4-1106-preview (turbo) - OpenAI
gpt-4-0613 - OpenAI
gpt-4-0314 - OpenAI
gpt-3.5-turbo-1106 - OpenAI
gpt-4-0314 - OpenAI
Gemini Pro 1.0-1.5 - Openrouter with compatibility with OpenAI api, Google api service.
Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
Yi 34B - Deepinfra with compatibility with OpenAI api.

Models with the same level of GPT-4 Turbo:

Claude 3 Opus

Models with the same level of GPT-4 but lower than GPT-4 Turbo:

Gemini Ultra 1.0
Gemini Pro 1.5
Gemini Advanced
Gemini Pro (Bard/Online)
Claude 3 Sonnet

Models with the same level or better than GPT-3.5 but lower than GPT-4:

Claude 3 Haiku
Claude 2-2.1
Claude 1
Claude Instant 1-1.2
Phi-3 Medium
Llama 3 70B Instruct
Gemini-1.5-Flash-API-0514
Command R+

Versions of models already surpassed by fine-tune, new versions or new architectures:

gpt-4-0613
gpt-4-0314
Gemini Pro 1.0
Grok 1
Phi-2
DBRX Instruct
Mistral Medium
Gemma 1.0 7B
Zephyr-ORPO-141b-A35b-v0.1
Yi 1.0 34B
gpt-4-0613
gpt-4-0314
Claude 2-2.1
Claude Instant 1-1.2
Qwen 1.0
Falcon 180B
Llama 1 and Llama 2
Guanaco 65B
Palm 2 Chat Bison
Dolly V2
Alpaca
CodeLlama-34b-Instruct-hf
SOLAR-10.7B-Instruct-v1.0
Mistral-7B-v0.2
Mistral-7B-v0.1
MythoMax-L2
Zephyr 7B Alpha and Beta
Airoboros 70b
OpenChat-3.5-1210
StableLM Tuned Alpha
Stable Beluga 2

Best OpenAI Models:

gpt-4o-2024-05-13
gpt-4-Turbo-2024-04-09
gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-3.5-turbo-0613
gpt-3.5-turbo-0125

API services:

Openrouter
OpenAI
Google Cloud
Anthropic
Azure
Deepinfra
Together
OctoAI
Lepton
Fireworks
Perplexity
Groq
Mistral
NovitaAI
Cohere
DeepSeek