Today, Meta released the newest version of its Llama model family: Llama 4, enabling developers to build more personalized multimodal experiences. Thanks to our close collaboration with Meta, the vLLM teams at Red Hat and UC Berkeley have enabled Day 0 model support, meaning you can start inferencing Llama 4 with vLLM today. This is a big day for open source AI, demonstrating the strength of vLLM and its robust, collaborative community.
The Llama 4 release brings us two models: Llama 4 Scout and Llama 4 Maverick. Both Scout and Maverick come with BF16 weights, and Maverick also ships with an FP8-quantized version on Day 0. The FP8 support in vLLM and Hugging Face was built by Meta using Red Hat’s open source LLM Compressor, a library for quantizing LLMs for faster and more efficient inference with vLLM.
Read on to learn about Llama 4 Scout and Llama 4 Maverick, what’s new in the Llama 4 herd, and how to get started with inferencing in vLLM today.
Meet the Llama 4 herd: Scout and Maverick
The Llama 4 release comes with two model variations: Llama 4 Scout and Llama 4 Maverick.
Llama 4 Scout
Llama 4 Scout is a multimodal model with:
- 17 billion active parameters
- 16 experts
- 109 billion total parameters
Scout delivers industry-leading performance for its class, and it fits on a single NVIDIA H100 node. Scout also dramatically increases the supported context length, from 128K tokens in Llama 3 to an industry-leading 10 million tokens. This opens up a world of possibilities, including multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.
Llama 4 Maverick
Llama 4 Maverick is a general purpose LLM with:
- 17 billion active parameters
- 128 experts
- 400 billion total parameters
Maverick offers higher quality at a lower price compared to Llama 3.3 70B. It brings unparalleled, industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of sophisticated AI applications that bridge language barriers. Maverick is great for precise image understanding and creative writing. For developers, it offers state-of-the-art intelligence with high speed, optimized for the best response quality on tone and refusals.
The official release by Meta includes an FP8-quantized version of Llama 4 Maverick 128E, enabling the 128-expert model to fit on a single NVIDIA 8xH100 node and delivering more performance at lower cost.
What’s new in Llama 4?
The power of mixture of experts (MoE) architecture
Llama 4 Scout and Llama 4 Maverick are the first of Meta’s models that use a mixture of experts (MoE) architecture. In MoE models, a single token activates only a fraction of the total parameters. MoE architectures are more compute efficient for model training and inference and, given a fixed training FLOPs budget, deliver higher quality models compared to dense architectures.
To break it down, Llama 4 Maverick has 400 billion total parameters, but during inference each token is routed to an "expert", so only 17 billion parameters need to be processed per token. Furthermore, Llama 4 Maverick with 128 experts comes with FP8 weights, enabling the model to fit on a single 8xH100 node for faster and more efficient inference.
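To make the routing idea concrete, here is a minimal, self-contained sketch of top-k expert routing in PyTorch. The hidden size, expert count, and top_k value are illustrative placeholders, not Llama 4's actual configuration, and real MoE implementations fuse this logic into optimized kernels.

# Minimal sketch of mixture-of-experts routing (illustrative sizes, not Llama 4's real config).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, hidden=128, ffn=256, num_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)   # scores each token against every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, hidden]
        scores = self.router(x)                        # [tokens, num_experts]
        weights, indices = torch.topk(scores.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Each token only runs through its selected expert(s), so most parameters stay idle.
        for e, expert in enumerate(self.experts):
            mask = (indices == e).any(dim=-1)
            if mask.any():
                w = weights[mask][indices[mask] == e].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
        return out

tokens = torch.randn(8, 128)
print(TinyMoE()(tokens).shape)   # torch.Size([8, 128])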
For high accuracy recovery, Maverick leverages channelwise weight quantization and dynamic per-token activation quantization, applied in a non-uniform manner. The Red Hat team (led by Eliza Wszola) recently added a CUTLASS-based kernel for GroupedGEMM in vLLM. Maverick leverages this kernel, which follows on from our existing work with CUTLASS 3.5.1, explained in our blogs vLLM brings FP8 inference to the open source community and Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs.
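As a rough illustration of what channelwise weight scales and dynamic per-token activation scales mean, the following PyTorch sketch computes one scale per output channel for the weights and one scale per token for the activations at runtime. It is a simplified numeric example with assumed shapes; production FP8 kernels use hardware float8 types and fuse the scaling into the GEMM.

# Simplified sketch of channelwise weight + dynamic per-token activation scaling
# (illustrative only; real FP8 kernels use hardware float8 types and fused GEMMs).
import torch

FP8_MAX = 448.0                                    # max representable value of float8_e4m3

def quantize_weights_channelwise(w):               # w: [out_features, in_features]
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX   # one scale per output channel
    return (w / scale).clamp(-FP8_MAX, FP8_MAX), scale

def quantize_activations_per_token(x):             # x: [tokens, in_features]
    scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX   # one scale per token, computed at runtime
    return (x / scale).clamp(-FP8_MAX, FP8_MAX), scale

w = torch.randn(64, 32)
x = torch.randn(4, 32)
wq, w_scale = quantize_weights_channelwise(w)
xq, x_scale = quantize_activations_per_token(x)
# The matmul runs on the scaled values; the per-channel and per-token scales are re-applied afterward.
y = (xq @ wq.t()) * x_scale * w_scale.t()
print(torch.allclose(y, x @ w.t(), atol=1e-3))     # close to the unscaled result (no FP8 rounding applied here)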
Adoption of early fusion multimodality
The Llama 4 models are built with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. Unlike previous Llama models, Scout and Maverick don’t freeze the text parameters or use separate multimodal parameters while training with images and videos. Early fusion is a major step forward, since it enables the joint pre-training of models with large amounts of unlabeled text, image, and video data.
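Here is a minimal sketch of the early fusion idea, assuming toy shapes and a generic patch projection (not Llama 4's actual vision encoder or tokenizer): image patch embeddings and text token embeddings are concatenated into a single sequence and processed by one shared backbone.

# Minimal sketch of early fusion: image and text tokens share one backbone sequence.
# (Illustrative shapes only; not Llama 4's actual vision encoder or tokenizer.)
import torch
import torch.nn as nn

hidden = 256
text_embed = nn.Embedding(32000, hidden)           # maps text token ids into the shared hidden space
patch_proj = nn.Linear(588, hidden)                # projects flattened image patches into the same space
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=2
)

text_ids = torch.randint(0, 32000, (1, 16))        # 16 text tokens
patches = torch.randn(1, 9, 588)                   # 9 flattened image patches
fused = torch.cat([patch_proj(patches), text_embed(text_ids)], dim=1)   # one unified token sequence
print(backbone(fused).shape)                       # torch.Size([1, 25, 256])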
Protecting developers against severe risks
Meta’s hope for Llama 4 is to develop the most helpful, useful models for developers while protecting against and mitigating the most severe risks. Llama 4 models are built with the best practices outlined in Meta’s Developer Use Guide: AI Protections. This includes integrating mitigations at each layer of model development, from pre-training to post-training, as well as tunable system-level mitigations that shield developers from adversarial users. In doing so, the Meta team empowers developers to create helpful, safe, and adaptable experiences for their Llama-supported applications.
Day 0 vLLM support for inferencing Llama 4 models
As the leading commercial contributor to vLLM, Red Hat is excited that Meta has selected vLLM to support the immediate inferencing of Llama 4 models. This comes as no surprise to us. Originally developed at UC Berkeley, vLLM has become the de facto standard for open source inference serving, with 44,000 GitHub stars and approaching one million weekly PyPI installs.
The vLLM community’s close collaboration with Meta during the pre-release process ensures developers can deploy the latest models as soon as they are available.
Get started with inferencing Llama 4 in vLLM now
You can install vLLM seamlessly using pip:
pip install -U vllm
Once installed, you can run a simple command to serve any of the models in the Llama 4 family:
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8 \
--max-model-len 430000
The model in the above example is the FP8-quantized version of Llama 4 Maverick. You can experiment with other Llama 4 model variations by pointing to the appropriate model stub on Hugging Face or one of Red Hat's quantized models located here.
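Once the server is up, you can query it through vLLM's OpenAI-compatible API, for example with the openai Python client. The snippet below is a minimal sketch that assumes the default local endpoint on port 8000 and the Maverick FP8 model served above.

# Query the running vLLM server through its OpenAI-compatible API
# (assumes the default http://localhost:8000 endpoint and the Maverick FP8 model served above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize what a mixture of experts model is in two sentences."}],
)
print(response.choices[0].message.content)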
Conclusion
The release of the Llama 4 herd marks a pivotal moment in the world of open source AI. With the combination of a mixture of experts architecture and early fusion multimodality, these models enable developers to build more personalized multimodal experiences.
By partnering with the vLLM community, Meta is ensuring developers can take advantage of Llama 4 models immediately, with a focus on performance and lower deployment costs.
Red Hat is proud to be a top commercial contributor to vLLM, driving these innovations forward and empowering the community with open, efficient, and scalable AI solutions. For more details on getting started with vLLM, visit the GitHub repository.