Neural Magic


Somerville, Massachusetts 17,811 followers

We are on a mission to bring open-source LLMs and vLLM to every enterprise on the planet. The future of AI is open.

About us

Together with our community, we engineer sparse LLM, CV, and NLP models that are more efficient and performant in production. Why does this matter? Sparse models are more flexible and can achieve unrivaled latency and throughput performance on your private CPU and GPU infrastructure. Check us out on GitHub and join the Neural Magic Slack Community to get started with software-delivered AI.

Website
http://neuralmagic.com/
Industry
Software Development
Company size
51-200 employees
Headquarters
Somerville, Massachusetts
Type
Privately Held
Founded
2018
Specialties
machine learning, deep learning, and artificial intelligence

Locations

  • Primary

    55 Davis Sq

    Floor 3

    Somerville, Massachusetts 02144, US



Updates

  • Neural Magic reposted this

    View profile for Mark Kurtz

    Chief Technology Officer @ Neural Magic | Engineering Leader and ML Researcher

    Introducing Sparse FP8 Llamas and Kernels for NVIDIA Hopper GPUs! Building on the success of our Sparse INT4 model release at Neural Magic, we've now pushed the research one step further. We've combined 2:4 structured sparsity with NVIDIA's latest FP8 quantization technology through vLLM to enable:
    • 1.7X lower latency and 1.5X higher throughput on Hopper GPUs
    • Full accuracy recovery with easier quantization
    • Open-source vLLM integration with CUTLASS kernels
    • Open-source base and fine-tuned models
    Dive into the blog for full details, and let us know how you plan to use it: https://lnkd.in/eyz4Zvkn
    Ready to efficiently scale your GenAI deployments? Connect with us to learn more: https://lnkd.in/eqabaCdm

    • Inference performance and accuracy results for dense BF16, sparse BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
    • Server-based inference performance results for a multi-turn chat use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
    • Server-based inference performance results for a code completion use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
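
    A minimal serving sketch for context (not taken from the post): loading a sparse FP8 checkpoint with vLLM's offline Python API. The model id below is a placeholder rather than the actual release name; substitute the checkpoint linked in the blog, and note that FP8 weights assume an FP8-capable GPU such as an H100.

    # Hypothetical model id; replace with the released sparse FP8 Llama checkpoint.
    from vllm import LLM, SamplingParams

    llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-FP8-example")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Generate a single completion and print the text of the first candidate.
    outputs = llm.generate(["Explain 2:4 structured sparsity in one paragraph."], params)
    for out in outputs:
        print(out.outputs[0].text)
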
  • Neural Magic reposted this

    View profile for Simon Mo

    Lowering cost of inference, via open source

    This will be a fun event! vLLM has seen incredible growth over 2024, and I will be sharing our lessons learned 😃

  • In our latest vLLM Office Hours, we explored Machete, Neural Magic's cutting-edge mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs. Lucas Wilkinson shared why Machete matters:
    🔹 Built on CUTLASS 3.5.1, designed specifically for H100 GPUs and beyond
    🔹 Supports w4a16, w8a16, and GPTQ models in vLLM v0.6.2+
    🔹 Enables serving Llama 3.1 70B at 5 requests/sec on a single H100 while maintaining a median TTFT of <250 ms and TPOT of <100 ms
    Machete represents a significant leap in mixed-precision inference, delivering superior performance in both compute-bound and memory-bound scenarios. Want to learn more? Watch the session and explore how Machete works: https://lnkd.in/gRzVJvmA
    Join our bi-weekly vLLM Office Hours to stay ahead in AI inference performance: https://lnkd.in/euF8m73q #vLLMOfficeHours
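
    To make "mixed-input GEMM" concrete, here is a rough, unoptimized PyTorch illustration of the w4a16 idea: weights stay in 4-bit form and are upconverted to the 16-bit activation dtype before the matrix multiply. This is only a conceptual sketch of the data flow, not the Machete kernel itself; Machete fuses the upconversion into the GEMM, and the function and tensor names here are illustrative.

    import torch

    def w4a16_matmul(x_16, w_int4, scales):
        # Dequantize ("upconvert") the 4-bit weight values to the activation dtype,
        # then run an ordinary matmul; a fused kernel does this inside the GEMM.
        w_16 = (w_int4.to(x_16.dtype) - 8.0) * scales
        return x_16 @ w_16.t()

    x = torch.randn(2, 64, dtype=torch.bfloat16)            # 16-bit activations
    w = torch.randint(0, 16, (128, 64), dtype=torch.uint8)  # 4-bit values, stored unpacked here
    s = torch.full((128, 1), 0.05, dtype=torch.bfloat16)    # per-output-channel scales
    print(w4a16_matmul(x, w, s).shape)                      # torch.Size([2, 128])
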

  • As a leading commercial contributor to vLLM, Neural Magic partners with vLLM project committers and the vLLM team at UC Berkeley to host bi-weekly office hours. Join us to give feedback, ask questions, and hear about cutting-edge developments to accelerate your inference. Each session features:
    ✅ A bi-weekly update on the vLLM project from Michael Goin
    ✅ Deep dives into special topics on AI efficiency and performance
    ✅ Live Q&A and a feedback loop
    This week, we're exploring Machete, the next-generation mixed-input GEMM kernel designed for NVIDIA Hopper GPUs. Learn from Lucas Wilkinson how Machete:
    🚀 Delivers up to 42% faster throughput on large models
    📊 Optimizes performance with memory-bound techniques
    🔄 Speeds up computations through pre-shuffling and upconversion routines
    Join us to connect with the vLLM community, ask questions, and gain insights to accelerate your AI workloads.

    [Bi-Weekly vLLM Office Hours] Deep Dive into Machete
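
    For readers who want to exercise a Machete-backed deployment, a rough client-side sketch (not from the post): with a recent vLLM serving a w4a16 model through its OpenAI-compatible server (started separately via vLLM's serve entrypoint on the default port 8000), any OpenAI client can query it. The model id below is a placeholder.

    from openai import OpenAI

    # vLLM's OpenAI-compatible server does not validate the API key.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="neuralmagic/Meta-Llama-3.1-70B-Instruct-w4a16-example",  # hypothetical id
        messages=[{"role": "user", "content": "What does a mixed-input GEMM kernel do?"}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
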

  • Neural Magic reposted this

    View profile for Mark Kurtz

    Chief Technology Officer @ Neural Magic | Engineering Leader and ML Researcher

    Final vLLM Office Hours for 2024! The last two vLLM sessions of the year are here, and they're packed with the latest updates so you can deploy your GenAI models efficiently, performantly, and accurately. Register here: https://lnkd.in/e43yCU9J
    1️⃣ December 5, 2024 – Machete Deep Dive: Learn how this cutting-edge GEMM kernel accelerates large model performance on NVIDIA Hopper GPUs by up to 42%.
    2️⃣ December 19, 2024 – vLLM Roadmap Reveal: Get an insider's view of vLLM's 2025 roadmap and reflect on the achievements of 2024.
    What questions would you like to see us answer in these sessions? Looking to dive in with Neural Magic to learn more about how we can improve your AI deployments? We'd love to connect: https://lnkd.in/eqabaCdm

    • vLLM office hours schedule for end of 2024
  • Neural Magic reposted this

    View profile for Chris Wright

    Chief Technology Officer and Senior Vice President Global Engineering at Red Hat

    This is a great read. We talk a lot about the efficiency and power of smaller models, especially with alignment tuning. But there are also important model optimization techniques such as quantization and sparsification. Check out what the Neural Magic team has done with Sparse Llama 3.1 8B. https://lnkd.in/eKxeTFVa

    2:4 Sparse Llama: Smaller Models for Efficient GPU Inference

    https://neuralmagic.com

  • Neural Magic reposted this

    View profile for Philipp Schmid

    Technical Lead & LLMs at Hugging Face 🤗 | AWS ML HERO 🦸🏻♂️

    How far can we push LLM optimizations? Turns out, pretty far! A new study achieves 98% accuracy recovery on key benchmarks while removing 50% of Llama 3.1 8B's parameters using pruning. Pruning strategically removes unnecessary connections in a neural network to make it smaller and faster. 👀
    TL;DR:
    🔄 98.4% of original accuracy on the Open LLM Leaderboard v1 with 50% fewer parameters using the 2:4 sparsity pattern
    🚀 30% higher throughput and 1.8x lower latency, with gains up to 5.0x when combined with quantization
    💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels
    📈 Full recovery on fine-tuning tasks (GSM8K, Evol-CodeAlpaca, Ultrachat-200K)
    ⚡ 1.4-2.1x better multi-query throughput
    🌱 Pruned using 13B training tokens, 26 hours on 32 H100s
    🔧 Optimized for NVIDIA Ampere GPUs and newer
    Blog: https://lnkd.in/ewUdQteg
    Pruning is not a new technique, but it has been much harder to achieve good results and maintain performance across tasks compared to quantization. Let's see if Neural Magic can change that.

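
    For readers new to the 2:4 pattern mentioned above, a toy PyTorch illustration: in every contiguous group of four weights, the two smallest-magnitude entries are zeroed, leaving exactly 50% of the parameters. Real pruning recipes choose which weights to drop far more carefully; this only shows the structure of the pattern.

    import torch

    def apply_2of4_mask(w):
        groups = w.reshape(-1, 4)                  # view the weights in groups of 4
        idx = groups.abs().topk(2, dim=1).indices  # keep the 2 largest magnitudes per group
        mask = torch.zeros_like(groups).scatter_(1, idx, 1.0)
        return (groups * mask).reshape(w.shape)

    w = torch.randn(8, 8)
    sparse_w = apply_2of4_mask(w)
    print((sparse_w == 0).float().mean())  # ~0.5: half the weights are zero
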
  • Neural Magic reposted this

    View profile for Eldar Kurtić

    Machine Learning

    We are excited to announce our first foundational Sparse LLM: Sparse-Llama-3.1-8B-2of4! At Neural Magic, we have developed a very efficient recipe to produce 2:4 sparse models with minimal accuracy degradation on few-shot benchmarks. In addition, we show that these models can be fine-tuned on various downstream tasks (math, coding, chat) just as well as their dense counterparts. On top of that, we show how to quantize them to 4 bits with GPTQ while preserving the 2:4 sparsity pattern, achieving compounded gains from both sparsity and quantization in the vLLM engine on GPUs. Our sparse-quantized Llama model achieves a 5x speedup on A5000 GPUs, 4.9x on A6000 GPUs, and 3.7x on A100s in single-stream latency, with 1.8x of the gains attributed to sparsity alone. Throughput scenarios showed a consistent 1.4x improvement, even when quantization alone had minimal impact.
    Our full model-release blog is available at: https://lnkd.in/dcqNM-r7
    The model and its sparse-quantized variants are fully open-sourced at: https://lnkd.in/dXjcYGds

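
    As a rough sense of how such a checkpoint would be exercised in vLLM, here is a hedged single-stream latency sketch. The model id is a placeholder for one of the released Sparse-Llama variants (see the Hugging Face link in the post for the exact names), and absolute numbers will depend entirely on the GPU used.

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4-example")  # hypothetical id
    params = SamplingParams(temperature=0.0, max_tokens=256)

    # Time one request end to end and report a rough decode rate.
    start = time.perf_counter()
    out = llm.generate(["Write a short note on structured sparsity."], params)[0]
    elapsed = time.perf_counter() - start
    tokens = len(out.outputs[0].token_ids)
    print(f"{tokens} generated tokens in {elapsed:.2f}s ({tokens / elapsed:.1f} tok/s, single stream)")
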
