Neural Magic
Software Development
Somerville, Massachusetts · 17,813 followers
We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.
About us
Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.
- Website: http://neuralmagic.com/
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: Somerville, Massachusetts
- Type: Privately Held
- Founded: 2018
- Specialties: machine learning, deep learning, and artificial intelligence
Locations
- Primary: 55 Davis Sq, Floor 3, Somerville, Massachusetts 02144, US
Updates
-
Neural Magic reposted this
Introducing Sparse FP8 Llamas and Kernels for NVIDIA Hopper GPUs! Building on the success of our Sparse INT4 model release at Neural Magic, we've now pushed the research one step further. We've combined 2:4 structured sparsity with NVIDIA's latest FP8 quantization technology through vLLM to enable:
• 1.7x lower latency and 1.5x higher throughput on Hopper GPUs
• Full accuracy recovery with easier quantization
• Open-source vLLM integration with CUTLASS kernels
• Open-source base and fine-tuned models
Dive into the blog for full details, and let us know how you plan to use it: https://lnkd.in/eyz4Zvkn
Ready to efficiently scale your GenAI deployments? Connect with us to learn more: https://lnkd.in/eqabaCdm
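A minimal sketch of running one of these checkpoints for offline inference in vLLM. The model ID below is illustrative, not a confirmed release name, and sparse FP8 kernels assume a Hopper-class GPU with a recent vLLM build:

```python
# Minimal sketch: offline inference with a 2:4-sparse FP8 Llama in vLLM.
# Assumptions: the model ID is hypothetical; sparse FP8 CUTLASS kernels
# require a Hopper-class GPU (H100) and a recent vLLM release.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-2of4-FP8")  # hypothetical ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does 2:4 structured sparsity mean?"], params)
print(outputs[0].outputs[0].text)
```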
-
Neural Magic reposted this
This will be a fun event! vLLM has seen incredible growth over 2024, and I will be sharing our lessons learned 😃
📢 Join us on Thursday, Dec. 19th for a special year-end vLLM Office Hours with Simon Mo, vLLM Project Maintainer! 🎯 Reflect on vLLM's 2024 achievements 🔮 Get an exclusive look at the vLLM 2025 roadmap Save your spot: https://lnkd.in/euF8m73q
vLLM Project Update: 2024 Retrospective and 2025 Roadmap
-
In our latest vLLM Office Hours, we explored Machete, Neural Magic's cutting-edge mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs. Lucas Wilkinson shared why Machete matters:
🔹 Built on CUTLASS 3.5.1, designed specifically for H100 GPUs and beyond
🔹 Supports w4a16, w8a16, and GPTQ models in vLLM v0.6.2+
🔹 Enables serving Llama 3.1 70B at 5 requests/sec on a single H100 while maintaining a median TTFT of <250ms and TPOT of <100ms
Machete represents a significant leap in mixed-precision inference, delivering superior performance in both compute-bound and memory-bound scenarios.
Want to learn more? Watch the session and explore how Machete works: https://lnkd.in/gRzVJvmA
Join our bi-weekly vLLM Office Hours to stay ahead in AI inference performance: https://lnkd.in/euF8m73q #vLLMOfficeHours
vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024
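As a rough way to check a throughput figure like the one above, here is a sketch that times offline generation with a w4a16 checkpoint in vLLM. The request count and context length are illustrative, and actual numbers depend on hardware and engine version:

```python
# Sketch: rough requests/sec measurement for a w4a16 model in vLLM.
# Assumptions: a single H100 and vLLM >= 0.6.2, where Machete handles the
# mixed-input (int4 weight x fp16 activation) GEMMs on Hopper GPUs.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",
          max_model_len=4096)  # cap context to keep KV-cache memory modest
params = SamplingParams(max_tokens=256)
prompts = ["Summarize the benefits of quantized inference."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"{len(outputs) / elapsed:.2f} requests/sec")
```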
-
As a leading commercial contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Each session features:
✅ A bi-weekly update on the vLLM project from Michael Goin
✅ Deep dives into special topics on AI efficiency and performance
✅ Live Q&A and a feedback loop
This week, we're exploring Machete, the next-generation mixed-input GEMM kernel designed for NVIDIA Hopper GPUs. Learn from Lucas Wilkinson how Machete:
🚀 Delivers up to 42% faster throughput on large models
📊 Optimizes performance with memory-bound techniques
🔄 Speeds up computations through pre-shuffling and upconversion routines (see the sketch below)
Join us to connect with the vLLM community, ask questions, and gain insights to accelerate your AI workloads.
[Bi-Weekly vLLM Office Hours] Deep Dive into Machete
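To make "upconversion" concrete, here is a toy NumPy sketch of a mixed-input GEMM: integer weights are upconverted to fp16 and scaled before the matmul. Machete fuses these steps inside a CUTLASS kernel and pre-shuffles the weight layout for fast conversion on Hopper; this illustration only shows the dataflow:

```python
# Toy mixed-input GEMM: fp16 activations x int4-style weights.
# Machete fuses dequantization into the GEMM and pre-shuffles weights for
# fast upconversion; this NumPy version only illustrates the dataflow.
import numpy as np

def upconvert(w_int, scales):
    # Dequantize integer weights to fp16 (int4 values stored as int8 here).
    return w_int.astype(np.float16) * scales.astype(np.float16)

def mixed_input_gemm(a_fp16, w_int, scales):
    w_fp16 = upconvert(w_int, scales)  # the "upconversion" step
    return a_fp16 @ w_fp16             # plain fp16 GEMM afterwards

a = np.random.randn(8, 64).astype(np.float16)
w = np.random.randint(-8, 8, size=(64, 32), dtype=np.int8)  # stand-in for packed int4
s = np.full((1, 32), 0.05, dtype=np.float32)                # per-column scales
print(mixed_input_gemm(a, w, s).shape)  # (8, 32)
```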
-
We're excited to attend #NeurIPS2024 next week in Vancouver! 🎉 Join us to learn about our SOTA model compression research and how we're helping enterprises succeed with #vLLM. Let's connect and discuss the future of AI! Mark Kurtz, Nir Shavit, Michael Goin, Saša Zelenović, and Jeannie Finks
-
Neural Magic reposted this
Final vLLM Office Hours for 2024! The last two vLLM sessions of the year are here, and they're packed with the latest updates so you can deploy your GenAI models efficiently, performantly, and accurately. Register here: https://lnkd.in/e43yCU9J
1️⃣ December 5, 2024 – Machete Deep Dive: Learn how this cutting-edge GEMM kernel accelerates large-model performance on NVIDIA Hopper GPUs by up to 42%.
2️⃣ December 19, 2024 – vLLM Roadmap Reveal: Get an insider's view of vLLM's 2025 roadmap and reflect on the achievements of 2024.
What questions would you like to see us answer in these sessions? Looking to dive in with Neural Magic to learn more about how we can improve your AI deployments? We'd love to connect: https://lnkd.in/eqabaCdm
-
Neural Magic reposted this
This is a great read. We talk a lot about the efficiency and power of smaller models, especially with alignment tuning. But there are also important model optimization techniques such as quantization and sparsification. Check out what the Neural Magic team has done with Sparse Llama 3.1 8B. https://lnkd.in/eKxeTFVa
-
Neural Magic reposted this
How far can we push LLM optimizations? Turns out, pretty far! A new study achieves 98% accuracy recovery on key benchmarks while removing 50% of Llama 3.1 8B's parameters using pruning. Pruning strategically removes unnecessary connections in a neural network to make it smaller and faster (a toy sketch of the 2:4 pattern follows below). 👀
TL;DR:
🔄 98.4% of original accuracy on the Open LLM Leaderboard v1 with 50% fewer parameters using a 2:4 sparsity pattern
🚀 30% higher throughput and 1.8x lower latency, up to 5.0x when combined with quantization
💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels
📈 Full recovery on fine-tuning tasks (GSM8K, Evol-CodeAlpaca, Ultrachat-200K)
⚡ 1.4-2.1x better multi-query throughput
🌱 Pruned using 13B training tokens, 26 hours on 32 H100s
🔧 Optimized for NVIDIA Ampere GPUs and newer
Blog: https://lnkd.in/ewUdQteg
Pruning is not a new technique, but it has been much harder to achieve good results and maintain performance across tasks with it than with quantization. Let's see if Neural Magic can change that.
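For intuition, here is a minimal NumPy sketch of the 2:4 pattern: in every contiguous group of four weights, only two survive. The study's actual recipe is far more careful about which weights to keep and recovers accuracy with retraining; this only shows magnitude-based selection of the pattern:

```python
# Toy 2:4 sparsity: in each contiguous group of 4 weights, keep the two
# largest-magnitude entries and zero the rest. Real recipes select weights
# more carefully and retrain to recover accuracy.
import numpy as np

def apply_2of4(w):
    groups = w.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries per group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
sw = apply_2of4(w)
# every group of 4 now has exactly 2 nonzeros
assert (np.count_nonzero(sw.reshape(-1, 4), axis=1) == 2).all()
```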
-
Neural Magic reposted this
We are excited to announce our first foundational sparse LLM: Sparse-Llama-3.1-8B-2of4! At Neural Magic, we have developed a very efficient recipe to produce 2:4 sparse models with minimal accuracy degradation on few-shot benchmarks. In addition, we show that these models can be fine-tuned on various downstream tasks (math, coding, chat) just as well as their dense counterparts. On top of that, we show how to quantize them to 4 bits with GPTQ while preserving the 2:4 sparsity pattern, achieving compounded gains from both sparsity and quantization in the vLLM engine on GPUs. Our sparse-quantized Llama model gets a 5x speedup on A5000 GPUs, 4.9x on A6000 GPUs, and 3.7x on A100s in single-stream latency, with 1.8x of the gains attributed to sparsity alone. Throughput scenarios showed a consistent 1.4x improvement, even when quantization alone had minimal impact.
Our full model-release blog is available at: https://lnkd.in/dcqNM-r7
The model and its sparse-quantized variants are fully open-sourced at: https://lnkd.in/dXjcYGds
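Since the checkpoints are open, the 2:4 pattern is easy to spot-check yourself. A small sketch, assuming a downloaded shard; the file and tensor names below are illustrative and depend on how the checkpoint is sharded:

```python
# Sketch: spot-check the 2:4 pattern in a downloaded checkpoint shard.
# The file and tensor names are illustrative; adjust to the actual checkpoint.
from safetensors.torch import load_file

state = load_file("model-00001-of-00002.safetensors")
w = state["model.layers.0.self_attn.q_proj.weight"]

groups = w.reshape(-1, 4)
nonzeros_per_group = (groups != 0).sum(dim=1)
print((nonzeros_per_group <= 2).float().mean())  # should print ~1.0 for a 2:4 model
```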