



Llama 3 currently comes in two sizes, with 8B and 70B parameters. (The B stands for billion; the parameter count indicates how complex a model is and how much of its training it can absorb.) So far it offers only text-based responses, but Meta says these are a big improvement over the previous version. Llama 3 showed more variety in responding to prompts, had fewer false refusals in which it declined to answer questions, and was able to reason better. Meta also said that Llama 3 now understands more instructions and writes better code than before.

In the post, Meta claims that in certain benchmark tests, both Llama 3 sizes outperformed similarly sized models such as Google's Gemma and Gemini, Mistral 7B, and Anthropic's Claude 3. In the MMLU benchmark, which measures general knowledge, Llama 3 8B performed significantly better than both Gemma 7B and Mistral 7B, and Llama 3 70B slightly outperformed Gemini Pro 1.5.

(It is perhaps noteworthy that Meta's 2,700-word post does not mention OpenAI's flagship model, GPT-4.)

Benchmark testing can help show how powerful an AI model is, but it is important to note that it is imperfect. The datasets used to benchmark models have been found to be part of some models' training data. This means that a model may already know the answers to the questions evaluators ask.
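As a rough illustration of that contamination problem, the sketch below flags benchmark questions that share a long run of words with a training document. The function names, the eight-word window, and the sample strings are all hypothetical choices made for this example, not anything described in Meta's post, and real contamination checks are considerably more involved.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated(question, training_docs, n=8):
    """True if any n-gram of the question also appears in a training document."""
    q = ngrams(question, n)
    return any(q & ngrams(doc, n) for doc in training_docs)


# Hypothetical data, for illustration only.
training_docs = [
    "forum post: which planet in our solar system has the most known moons answer saturn",
]
questions = [
    "Which planet in our solar system has the most known moons?",
    "What is the boiling point of water at sea level in Celsius?",
]

# One flag per question: the first overlaps the training text, the second does not.
flags = [contaminated(q, training_docs) for q in questions]
```

A question flagged this way tells evaluators little about a model's abilities, since the model may simply be recalling text it has already seen.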

Benchmark tests show that both sizes of Llama 3 outperform similarly sized language models. Screenshot: Emilia David / The Verge

Meta said human evaluators also rated Llama 3 higher than other models, including OpenAI's GPT-3.5. Meta says it created a new dataset for human evaluators to emulate real-world scenarios in which Llama 3 might be used. This dataset included use cases such as asking for advice, summarization, and creative writing. The company said the team working on the model did not have access to this new evaluation data, and that it did not influence the model's performance.

According to Meta's blog post, this evaluation set contains 1,800 prompts covering 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character/persona, open question answering, reasoning, rewriting, and summarization.

In human evaluation, Llama 3 performed better than most models, Meta said. Screenshot: Emilia David / The Verge

Llama 3 is expected to get larger model sizes (which can understand longer strings of instructions and data) and to be capable of more multimodal responses, such as generating an image or transcribing an audio file. Meta said that these larger versions, which have more than 400B parameters and can ideally learn more complex patterns than smaller versions of the model, are currently being trained, but initial performance testing shows that these models can answer many of the questions posed by benchmarks.

However, Meta did not release previews of these larger models or compare them to other larger models such as GPT-4.

