Dan Alistarh
Dan Alistarh's contributions
Article
Cracking the code: How neural networks might actually “think”
Micah Adler
+2
Discover a new combinatorial approach to decoding AI’s hidden logic, exploring how neural networks truly compute and reason.
Article
Deployment-ready reasoning with quantized DeepSeek-R1 models
Eldar Kurtić
+3
Explore new open source quantized reasoning models based on the DeepSeek-R1-Distill suite that deliver near-perfect accuracy and inference speed improvements.
Article
2:4 Sparse Llama: Smaller models for efficient GPU inference
Eldar Kurtić
+4
Discover Sparse Llama: A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.
Article
2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs
Alexandre Marques
+5
Advancing AI efficiency is more critical than ever, and sparsity has proven to be a cornerstone in this pursuit.
Article
We ran over half a million evaluations on quantized LLMs—here's what we found
Eldar Kurtić
+3
Across 500K+ evaluations, quantized LLMs achieve near-full accuracy with minimal trade-offs, providing efficient, high-performance solutions for AI model deployment.
Article
How well do quantized models handle long-context tasks?
Eldar Kurtić
+3
4-bit and 8-bit quantized LLMs excel in long-context tasks, retaining over 99% accuracy across 4K to 64K sequence lengths.