Neural Magic
Software Development
Somerville, Massachusetts · 17,813 followers
We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.
About us
Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.
- Website: http://neuralmagic.com/
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: Somerville, Massachusetts
- Type: Privately Held
- Founded: 2018
- Specialties: machine learning, deep learning, and artificial intelligence
Locations
- Primary: 55 Davis Sq, Floor 3, Somerville, Massachusetts 02144, US
Updates
-
Neural Magic reposted this
Introducing Sparse FP8 Llamas and Kernels for NVIDIA Hopper GPUs! Building on the success of our Sparse INT4 model release at Neural Magic, we've now pushed the research one step further. We've combined 2:4 structured sparsity with NVIDIA's latest FP8 quantization technology through vLLM to enable:
• 1.7x lower latency and 1.5x higher throughput on Hopper GPUs
• Full accuracy recovery with easier quantization
• Open-source vLLM integration with CUTLASS kernels
• Open-source base and fine-tuned models
Dive into the blog for full details, and let us know how you plan to use it: https://lnkd.in/eyz4Zvkn
Ready to efficiently scale your GenAI deployments? Connect with us to learn more: https://lnkd.in/eqabaCdm
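A minimal sketch of running one of these checkpoints for offline inference in vLLM. The model ID below is illustrative, not a confirmed release name, and sparse FP8 kernels assume a Hopper-class GPU with a recent vLLM build:

```python
# Minimal sketch: offline inference with a 2:4-sparse FP8 Llama in vLLM.
# Assumptions: the model ID is hypothetical; sparse FP8 CUTLASS kernels
# require a Hopper-class GPU (H100) and a recent vLLM release.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-2of4-FP8")  # hypothetical ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does 2:4 structured sparsity mean?"], params)
print(outputs[0].outputs[0].text)
```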
-
Neural Magic reposted this
This will be a fun event! vLLM has seen incredible growth over 2024, and I will be sharing our lessons learned 😃
📢 Join us on Thursday, Dec. 19th for a special year-end vLLM Office Hours with Simon Mo, vLLM Project Maintainer! 🎯 Reflect on vLLM's 2024 achievements 🔮 Get an exclusive look at the vLLM 2025 roadmap Save your spot: https://lnkd.in/euF8m73q
vLLM Project Update: 2024 Retrospective and 2025 Roadmap
-
In our latest vLLM Office Hours, we explored Machete, Neural Magic's cutting-edge mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs. Lucas Wilkinson shared why Machete matters:
🔹 Built on CUTLASS 3.5.1, designed specifically for H100 GPUs and beyond
🔹 Supports w4a16, w8a16, and GPTQ models in vLLM v0.6.2+
🔹 Enables serving Llama 3.1 70B at 5 requests/sec on a single H100 while maintaining a median TTFT of <250ms and TPOT of <100ms
Machete represents a significant leap in mixed-precision inference, delivering superior performance in both compute-bound and memory-bound scenarios.
Want to learn more? Watch the session and explore how Machete works: https://lnkd.in/gRzVJvmA
Join our bi-weekly vLLM Office Hours to stay ahead in AI inference performance: https://lnkd.in/euF8m73q #vLLMOfficeHours
vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024
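As a rough way to check a throughput figure like the one above, here is a sketch that times offline generation with a w4a16 checkpoint in vLLM. The request count and context length are illustrative, and actual numbers depend on hardware and engine version:

```python
# Sketch: rough requests/sec measurement for a w4a16 model in vLLM.
# Assumptions: a single H100 and vLLM >= 0.6.2, where Machete handles the
# mixed-input (int4 weight x fp16 activation) GEMMs on Hopper GPUs.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",
          max_model_len=4096)  # cap context to keep KV-cache memory modest
params = SamplingParams(max_tokens=256)
prompts = ["Summarize the benefits of quantized inference."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"{len(outputs) / elapsed:.2f} requests/sec")
```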
-
As a leading commercial contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Each session features:
✅ A bi-weekly update on the vLLM project from Michael Goin
✅ Deep dives into special topics on AI efficiency and performance
✅ Live Q&A and a feedback loop
This week, we're exploring Machete, the next-generation mixed-input GEMM kernel designed for NVIDIA Hopper GPUs. Learn from Lucas Wilkinson how Machete:
🚀 Delivers up to 42% faster throughput on large models
📊 Optimizes performance with memory-bound techniques
🔄 Speeds up computations through pre-shuffling and upconversion routines (see the sketch below)
Join us to connect with the vLLM community, ask questions, and gain insights to accelerate your AI workloads.
[Bi-Weekly vLLM Office Hours] Deep Dive into Machete
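To make "upconversion" concrete, here is a toy NumPy sketch of a mixed-input GEMM: integer weights are upconverted to fp16 and scaled before the matmul. Machete fuses these steps inside a CUTLASS kernel and pre-shuffles the weight layout for fast conversion on Hopper; this illustration only shows the dataflow:

```python
# Toy mixed-input GEMM: fp16 activations x int4-style weights.
# Machete fuses dequantization into the GEMM and pre-shuffles weights for
# fast upconversion; this NumPy version only illustrates the dataflow.
import numpy as np

def upconvert(w_int, scales):
    # Dequantize integer weights to fp16 (int4 values stored as int8 here).
    return w_int.astype(np.float16) * scales.astype(np.float16)

def mixed_input_gemm(a_fp16, w_int, scales):
    w_fp16 = upconvert(w_int, scales)  # the "upconversion" step
    return a_fp16 @ w_fp16             # plain fp16 GEMM afterwards

a = np.random.randn(8, 64).astype(np.float16)
w = np.random.randint(-8, 8, size=(64, 32), dtype=np.int8)  # stand-in for packed int4
s = np.full((1, 32), 0.05, dtype=np.float32)                # per-column scales
print(mixed_input_gemm(a, w, s).shape)  # (8, 32)
```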
-
We're excited to attend #NeurIPS2024 next week in Vancouver! 🎉 Join us to learn about our SOTA model compression research and how we're helping enterprises succeed with #vLLM. Let's connect and discuss the future of AI! Mark Kurtz, Nir Shavit, Michael Goin, Saša Zelenović, and Jeannie Finks
-
Neural Magic reposted this
Final vLLM Office Hours for 2024! The last two vLLM sessions of the year are here, and they're packed with the latest updates so you can deploy your GenAI models efficiently, performantly, and accurately. Register here: https://lnkd.in/e43yCU9J
1️⃣ December 5, 2024 – Machete Deep Dive: Learn how this cutting-edge GEMM kernel accelerates large-model performance on NVIDIA Hopper GPUs by up to 42%.
2️⃣ December 19, 2024 – vLLM Roadmap Reveal: Get an insider's view of vLLM's 2025 roadmap and reflect on the achievements of 2024.
What questions would you like to see us answer in these sessions? Looking to dive in with Neural Magic to learn more about how we can improve your AI deployments? We'd love to connect: https://lnkd.in/eqabaCdm
-
Neural Magic reposted this
This is a great read. We talk a lot about the efficiency and power of smaller models, especially with alignment tuning. But there are also important model optimization techniques such as quantization and sparsification. Check out what the Neural Magic team has done with Sparse Llama 3.1 8B. https://lnkd.in/eKxeTFVa
-
Neural Magic reposted this
How far can we push LLM optimizations? Turns out, pretty far! A new study achieves 98% accuracy recovery on key benchmarks while removing 50% of Llama 3.1 8B's parameters using pruning. Pruning strategically removes unnecessary connections in a neural network to make it smaller and faster (a toy sketch of the 2:4 pattern follows below). 👀
TL;DR:
🔄 98.4% of original accuracy on the Open LLM Leaderboard v1 with 50% fewer parameters using a 2:4 sparsity pattern
🚀 30% higher throughput and 1.8x lower latency, up to 5.0x when combined with quantization
💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels
📈 Full recovery on fine-tuning tasks (GSM8K, Evol-CodeAlpaca, Ultrachat-200K)
⚡ 1.4-2.1x better multi-query throughput
🌱 Pruned using 13B training tokens, 26 hours on 32 H100s
🔧 Optimized for NVIDIA Ampere GPUs and newer
Blog: https://lnkd.in/ewUdQteg
Pruning is not a new technique, but it has been much harder to achieve good results and maintain performance across tasks with it than with quantization. Let's see if Neural Magic can change that.
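For intuition, here is a minimal NumPy sketch of the 2:4 pattern: in every contiguous group of four weights, only two survive. The study's actual recipe is far more careful about which weights to keep and recovers accuracy with retraining; this only shows magnitude-based selection of the pattern:

```python
# Toy 2:4 sparsity: in each contiguous group of 4 weights, keep the two
# largest-magnitude entries and zero the rest. Real recipes select weights
# more carefully and retrain to recover accuracy.
import numpy as np

def apply_2of4(w):
    groups = w.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries per group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
sw = apply_2of4(w)
# every group of 4 now has exactly 2 nonzeros
assert (np.count_nonzero(sw.reshape(-1, 4), axis=1) == 2).all()
```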
-
Neural Magic reposted this
We are excited to announce our first foundational sparse LLM: Sparse-Llama-3.1-8B-2of4! At Neural Magic, we have developed a very efficient recipe to produce 2:4 sparse models with minimal accuracy degradation on few-shot benchmarks. In addition, we show that these models can be fine-tuned on various downstream tasks (math, coding, chat) just as well as their dense counterparts. On top of that, we show how to quantize them to 4 bits with GPTQ while preserving the 2:4 sparsity pattern, achieving compounded gains from both sparsity and quantization in the vLLM engine on GPUs. Our sparse-quantized Llama model gets a 5x speedup on A5000 GPUs, 4.9x on A6000 GPUs, and 3.7x on A100s in single-stream latency, with 1.8x of the gains attributed to sparsity alone. Throughput scenarios showed a consistent 1.4x improvement, even when quantization alone had minimal impact.
Our full model-release blog is available at: https://lnkd.in/dcqNM-r7
The model and its sparse-quantized variants are fully open-sourced at: https://lnkd.in/dXjcYGds
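Since the checkpoints are open, the 2:4 pattern is easy to spot-check yourself. A small sketch, assuming a downloaded shard; the file and tensor names below are illustrative and depend on how the checkpoint is sharded:

```python
# Sketch: spot-check the 2:4 pattern in a downloaded checkpoint shard.
# The file and tensor names are illustrative; adjust to the actual checkpoint.
from safetensors.torch import load_file

state = load_file("model-00001-of-00002.safetensors")
w = state["model.layers.0.self_attn.q_proj.weight"]

groups = w.reshape(-1, 4)
nonzeros_per_group = (groups != 0).sum(dim=1)
print((nonzeros_per_group <= 2).float().mean())  # should print ~1.0 for a 2:4 model
```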