Michael Goin's contributions
Article
How we optimized vLLM for DeepSeek-R1
Michael Goin
Explore inference performance improvements that help vLLM serve DeepSeek AI models more efficiently in this technical deep dive.
Article
vLLM V1: Accelerating multimodal inference for large language models
Michael Goin
Explore how vLLM's new multimodal AI inference capabilities enhance performance, scalability, and flexibility across diverse hardware platforms.
Article
Distributed inference with vLLM
Michael Goin
Explore how distributed inference works within vLLM in this recap of Neural Magic's vLLM Office Hours with Michael Goin and Murali Andoorveedu, a vLLM committer from CentML.
Article
vLLM brings FP8 inference to the open source community
Michael Goin
Explore the integration of FP8 in vLLM. Learn how to achieve up to a 2x reduction in latency on NVIDIA GPUs with minimal accuracy degradation.
Article
How Marlin pushes the boundaries of mixed-precision LLM inference
Michael Goin
Learn about Marlin, a mixed-precision matrix multiplication kernel that delivers a 4x speedup with FP16xINT4 computations at batch sizes up to 32.
Article
Sparse fine-tuning for accelerating large language models with DeepSparse
Robert Shaw
Sparse fine-tuning, combined with sparsity-aware inference software such as DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.
Article
SparseGPT: Remove 100 billion parameters for free
Robert Shaw
Compress large language models (LLMs) with SparseGPT to make your machine learning inference fast and efficient. Prune in one shot with minimal accuracy loss.
