It’s day 4 of our 11 Days of Inference Acceleration Techniques. Today, we’re moving on to runtime-level optimization best practices.

𝐅𝐨𝐮𝐫𝐭𝐡 𝐭𝐢𝐩: 𝐓𝐚𝐤𝐞 𝐚𝐝𝐯𝐚𝐧𝐭𝐚𝐠𝐞 𝐨𝐟 𝐠𝐫𝐚𝐩𝐡 𝐜𝐨𝐦𝐩𝐢𝐥𝐚𝐭𝐢𝐨𝐧 📈

Graph compilers such as TVM, TensorRT, and OpenVINO take a model’s computation graph and generate optimized code tailored to the target hardware. Graph compilation can optimize the graph structure by merging redundant operations, performing kernel auto-tuning, improving memory reuse, preventing cache misses, and more.

But there are a few things to be aware of:
📝 Not all models compile equally.
📝 The impact of compilation on model performance can vary.

Hence, make sure to check the architecture’s ability to compile early in the process, so you don’t waste time and resources training a model that can’t be optimized for fast inference.

–

What’s the #11DaysofInferenceAccelerationTechniques? The Deci team is posting, for 11 days, a series of inference acceleration techniques for deep learning applications. If you’re looking for practical tips and best practices for improving inference, follow Deci AI so you won’t miss an update.

#deeplearning #machinelearning #neuralnetworks #computervision
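A quick smoke test is usually enough to act on this tip early. Below is a minimal sketch (using torchvision’s ResNet-18 purely as a stand-in architecture) that checks whether a candidate model goes through torch.compile and ONNX export, the usual entry points for compilers like TensorRT, OpenVINO, and TVM. It is illustrative only, not a specific Deci workflow.

```python
# Early compilability check: run this before investing in training.
import torch
import torchvision

model = torchvision.models.resnet18().eval()   # stand-in candidate architecture
dummy = torch.randn(1, 3, 224, 224)            # representative input shape

# 1) Graph-compile with PyTorch and run once to surface graph breaks early.
compiled = torch.compile(model)
with torch.no_grad():
    compiled(dummy)

# 2) Export to ONNX, a common handoff point for TensorRT / OpenVINO / TVM.
torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=17)
print("Model compiles and exports cleanly.")
```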
More Relevant Posts
-
𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀 𝗔𝗿𝗲 𝗜𝗻: 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆-𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗧𝗿𝗮𝗱𝗲-𝗢𝗳𝗳𝘀 𝗶𝗻 𝗟𝗟𝗠 𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻

In my recent posts, I've promised a detailed research paper summarizing our work on LLM quantization. Our team at Neural Magic has been hard at work running hundreds of thousands of evaluations and benchmarks, and I'm incredibly excited to share the results with everyone!

📊 Key Insights:
- 𝘄𝟴𝗮𝟴 𝗾𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 formats deliver up to 𝟴𝘅 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝘀𝗽𝗲𝗲𝗱𝘂𝗽𝘀 on high-performance GPUs, making them ideal for larger models or server deployments.
- 𝘄𝟰𝗮𝟭𝟲 𝗾𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 formats provide up to 𝟳𝘅 𝗰𝗼𝘀𝘁 𝗿𝗲𝗱𝘂𝗰𝘁𝗶𝗼𝗻 per request for smaller models and synchronous deployments.
- 𝗙𝗣𝟴 (𝘄𝟴𝗮𝟴) is essentially lossless, and both 𝗜𝗡𝗧𝟴 (𝘄𝟴𝗮𝟴) 𝗮𝗻𝗱 𝗜𝗡𝗧𝟰 (𝘄𝟰𝗮𝟭𝟲) maintain very high fidelity.
- 𝗔𝗪𝗤 𝗮𝗻𝗱 𝗚𝗣𝗧𝗤 perform similarly in academic benchmarks, but 𝗔𝗪𝗤 𝘀𝘁𝗿𝘂𝗴𝗴𝗹𝗲𝘀 in real-world scenarios.

(𝘕𝘰𝘵𝘦: 𝘸# 𝘳𝘦𝘧𝘦𝘳𝘴 𝘵𝘰 𝘵𝘩𝘦 𝘯𝘶𝘮𝘣𝘦𝘳 𝘰𝘧 𝘣𝘪𝘵𝘴 𝘶𝘴𝘦𝘥 𝘧𝘰𝘳 𝘸𝘦𝘪𝘨𝘩𝘵𝘴, 𝘢𝘯𝘥 𝘢# 𝘳𝘦𝘧𝘦𝘳𝘴 𝘵𝘰 𝘵𝘩𝘦 𝘯𝘶𝘮𝘣𝘦𝘳 𝘰𝘧 𝘣𝘪𝘵𝘴 𝘧𝘰𝘳 𝘢𝘤𝘵𝘪𝘷𝘢𝘵𝘪𝘰𝘯𝘴. 16 𝘳𝘦𝘧𝘦𝘳𝘴 𝘵𝘰 𝘧𝘱16 𝘰𝘳 𝘣𝘧16 𝘢𝘴 𝘵𝘩𝘦 𝘣𝘢𝘴𝘦𝘭𝘪𝘯𝘦.)

📄 The full paper is on arXiv as well: https://lnkd.in/eCThxFxt

If you want to make your models more 𝗰𝗼𝘀𝘁-𝗲𝗳𝗳𝗲𝗰𝘁𝗶𝘃𝗲, 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝘁, or 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲, reach out—we'd love to help!

🚀 Exciting things ahead! Stay tuned for:
- A 𝗻𝗲𝘄 𝗰𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗲𝗱 𝗺𝗼𝗱𝗲𝗹 𝗹𝗮𝘂𝗻𝗰𝗵 coming very soon.
- Fresh results on 𝗺𝘂𝗹𝘁𝗶-𝗺𝗼𝗱𝗮𝗹 𝗺𝗼𝗱𝗲𝗹𝘀.
- More 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝘀, 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀, and 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻𝘀 on the horizon.

#llms #quantization #optimization #genai #ai
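For readers new to the w#a# notation, here is a tiny, self-contained sketch of what a w4a16-style scheme does: weights are rounded to a 4-bit integer grid per output channel and dequantized back to floating point at matmul time, while activations stay in higher precision. This is purely illustrative and is not Neural Magic's implementation; the shapes are arbitrary.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 128)   # weights: [out_features, in_features]
x = torch.randn(8, 128)     # activations (fp16/bf16 in a real deployment)

# Per-output-channel symmetric quantization of weights to 4 bits (range [-8, 7]).
scale = w.abs().amax(dim=1, keepdim=True) / 7.0
w_q = torch.clamp(torch.round(w / scale), -8, 7)   # packed as int4 on disk/GPU in practice

# Dequantize back to floating point at matmul time ("a16": activations stay 16-bit).
w_dq = w_q * scale
y = x @ w_dq.T

print(y.shape, f"mean |w - w_dq| = {(w - w_dq).abs().mean():.4f}")
```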
-
OpenAI’s o3 Stuns the World: Solving Complex Mathematical Problems Instantly

In a historic leap for artificial intelligence, OpenAI’s o3 model has surpassed expectations, solving 85% of the most complex mathematical problems that even top mathematicians have struggled with. Among these, it tackled Artin’s primitive root conjecture with incredible precision. This achievement has left experts astonished, as AI now demonstrates abilities not just to assist, but to excel far beyond human capabilities in certain areas.

Here’s the computation that amazed the world, simplified for clarity:

The Problem
For primes p up to a limit, find the density of those satisfying ord_p(a) = ord_{p^2}(a), where ord_{p^2}(a) is the smallest k such that

\[ a^k \equiv 1 \pmod{p^2}, \]

and ord_p(a) is defined analogously modulo p.

Step-by-Step Calculation
1. Generate Primes: Start with primes up to a limit (e.g., 100).
2. Compute Orders: For each prime p, compute ord_p(a) and ord_{p^2}(a) by testing powers of a modulo p and modulo p^2 until the result is 1.
3. Compare Orders: If ord_p(a) = ord_{p^2}(a), add p to the set.
4. Calculate Density: Density = (number of primes in the set) / (number of primes considered).

Results for the chosen base and limit:
• Primes satisfying the condition: …
• Density: …

This achievement by OpenAI’s o3 is rewriting the rules, showcasing how AI is solving problems in seconds that human experts have debated for years. As these systems grow in power, we must rethink the future of collaboration between humans and machines, ensuring ethical and responsible integration. But for now, one thing is certain—the age of superintelligent problem-solving is here.

#AIRevolution #OpenAIo3 #MathGenius #AIProblemSolving #FutureOfTech #EthicalAI #AIInMath #SuperintelligentAI #TechBreakthroughs
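A small script in the spirit of the step-by-step calculation above. The base a = 2, the prime limit of 100, and the exact comparison condition (ord_p(a) equal to ord_{p^2}(a)) are assumptions made for illustration.

```python
def multiplicative_order(a: int, m: int) -> int:
    """Smallest k >= 1 with a**k ≡ 1 (mod m); assumes gcd(a, m) == 1."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k

def primes_up_to(n: int) -> list[int]:
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

a, limit = 2, 100
primes = primes_up_to(limit)
qualifying = []
for p in primes:
    if a % p == 0:
        continue  # order undefined when p divides a
    if multiplicative_order(a, p) == multiplicative_order(a, p * p):
        qualifying.append(p)

print("Primes satisfying the condition:", qualifying)
print("Density:", len(qualifying) / len(primes))
```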
-
🔍 Zero-Shot Object Detection (ZSOD): Revolutionizing Computer Vision! ZSOD is transforming the field by enabling models to identify unseen objects, but it also has its challenges. In my latest blog, I discuss innovative approaches to overcome these limitations and how we used GPT-4 Vision to enhance object detection. Check it out for a deep dive into this cutting-edge technology! #genai #computervision #ZeroShotObjectDetection
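As a concrete taste of zero-shot object detection, here is a short sketch using OWL-ViT from Hugging Face Transformers as a stand-in open-vocabulary detector. It is not the GPT-4 Vision pipeline from the blog, and the image path and text queries below are placeholders.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street_scene.jpg")              # placeholder image path
queries = [["a traffic cone", "a delivery robot"]]  # classes described only in text

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(queries[0][int(label)], round(score.item(), 3), box.tolist())
```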
-
A bit of #speculation about how o3 could work

It probably generates and evaluates possible “Chains of Thought” (CoTs) to solve tasks, using an approach similar to Monte Carlo tree search, guided by an evaluator model. This method overcomes the limitations of single-generation LLMs by recombining knowledge at test time through program generation and execution. The “o3” mechanism represents a form of deep-learning-guided program search, where the space of CoTs is explored using a base LLM as a guiding prior. However, this approach is computationally intensive, potentially requiring tens of millions of tokens and significant cost due to the vast program space and the necessity of backtracking. This approach, as noted, reflects state-of-the-art advancements in ARC-AGI systems.

The ARC-AGI dataset was created to test whether an AI can understand and solve #new and #never seen problems. #o3 scores about 88% with high compute (which is really costly), while the previous #o1 model scored about 32%.

#OpenAi #o3 #howItWorks
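To make the speculation concrete, here is a toy, self-contained sketch of evaluator-guided search over chains of thought. The generate_step and score functions are hypothetical stand-ins (random here) for a base LLM proposing next steps and an evaluator model ranking them; nothing here is based on disclosed details of o3.

```python
import heapq
import random

def generate_step(chain: str) -> list[str]:
    # Hypothetical: ask a base LLM for a few candidate next reasoning steps.
    return [chain + f" -> step{random.randint(0, 9)}" for _ in range(3)]

def score(chain: str) -> float:
    # Hypothetical: an evaluator model estimates how promising the chain is.
    return random.random()

def search(task: str, budget: int = 50) -> str:
    frontier = [(-score(task), task)]            # max-heap via negated scores
    best_chain, best_score = task, 0.0
    for _ in range(budget):
        if not frontier:
            break
        neg_s, chain = heapq.heappop(frontier)   # expand the most promising chain
        if -neg_s > best_score:
            best_score, best_chain = -neg_s, chain
        for nxt in generate_step(chain):         # branch into candidate continuations
            heapq.heappush(frontier, (-score(nxt), nxt))
    return best_chain

print(search("Task: solve ARC puzzle 007"))
```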
-
The importance of prompt engineering is perfectly summarized by a research manager at OpenAI that I interviewed: “The problem is not with prompt engineering. It’s a real and useful skill to have. The problem is when prompt engineering is the only thing people know.” To build production-ready AI applications, you need more than just prompt engineering. You need statistics, engineering, and classic ML knowledge to do experiment tracking, evaluation, and dataset curation. [Chip Huyen]
-
#TuesdayPaperThoughts Edition 10: Order-Preserving #RAG

In this week's edition of #TuesdayPaperThoughts, we highlight a recent paper from NVIDIA AI: "In Defense of RAG in the Era of Long-Context Language Models." While the paper explores the efficiency of Retrieval-Augmented Generation (RAG) vs. Long-Context LLMs, I believe these methods are complementary. Combining strong RAG techniques with Long-Context #LLMs offers the best performance. The key insight for me is that preserving the order of retrieved text can significantly impact performance—a seemingly intuitive yet powerful optimization, which this work objectively demonstrates.

Top 3 takeaways:
1️⃣ OP-RAG (Order-Preserving RAG) organizes relevant context not by relevance score but by the sequence in the original document, providing better alignment with the source material.
2️⃣ When compared to just Long-Context LLMs (#Llama 3.1 #70B) without RAG, OP-RAG delivers a 30% higher F1-Score on the EN.QA dataset and 20% more accuracy on the EN.MC dataset, while using 7x fewer tokens.
3️⃣ OP-RAG achieves an F1-Score of 44.43 on the EN.QA dataset vs. 38.40 for Vanilla RAG, and 88.65% accuracy on EN.MC compared to Vanilla RAG’s 81.22% (context retrieval size of 192 chunks).

This paper introduces an intuitive optimization to RAG that makes it even more effective. Despite the increasing focus on longer context models, RAG is here to stay.

Research Credits: Tan Yu, Anbang Xu, Rama Akkiraju
Paper Link: https://lnkd.in/gMmgFWy2

#LLMs #NVIDIA #Genloop #PrivateLLMs #CustomizedLLMs

P.S.: To everyone who messaged about last week's missing post—my apologies! There were no thoughts last Tuesday.
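Here is a minimal sketch of the order-preserving step itself: chunks are selected by relevance, but laid out in the prompt in their original document order. The Chunk class, scores, and texts below are made-up placeholders, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_position: int   # index of the chunk in the original document
    text: str
    score: float        # similarity to the query, e.g. cosine similarity

def build_op_rag_context(chunks: list[Chunk], top_k: int) -> str:
    # 1) Keep the top_k most relevant chunks (the standard RAG step).
    selected = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    # 2) Re-order them by document position instead of relevance score.
    selected.sort(key=lambda c: c.doc_position)
    return "\n\n".join(c.text for c in selected)

chunks = [
    Chunk(0, "Intro ...", 0.20),
    Chunk(1, "Method details ...", 0.91),
    Chunk(2, "Ablation ...", 0.35),
    Chunk(3, "Results table ...", 0.87),
]
print(build_op_rag_context(chunks, top_k=2))  # Method appears before Results, as in the source
```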
-
☕ Coffee Break Series - 𝙄𝙣𝙛𝙚𝙧𝙚𝙣𝙘𝙚 𝙊𝙥𝙩𝙞𝙢𝙞𝙯𝙖𝙩𝙞𝙤𝙣 𝙏𝙚𝙘𝙝𝙣𝙞𝙦𝙪𝙚𝙨 𝙛𝙤𝙧 𝙇𝙇𝙈 🚀

OpenAI Strawberry (o1) is out & we are finally seeing the paradigm of inference-time scaling popularized and deployed in production. The focus has shifted from giving the model more training data to giving it more time to think. Inference optimization is crucial for getting better response times cost-effectively.

Join our Coffee Break Series, where we’ll dive deep into the world of inference optimization strategies. Perfect for a quick yet insightful read during your coffee breaks! ☕

𝗦𝗲𝗿𝗶𝗲𝘀 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝘀:
1. Why inference optimization matters # paradigm shift
2. Understanding transformer inference & challenges
3. Solutions to optimize inference time
4. Optimizing model serving - Continuous batching
5. Optimizing model serving - Key-value caching
7. Optimizing model serving - Speculative inferencing
9. Scaling up LLM - Pipeline parallelism
10. Scaling up LLM - Tensor parallelism
11. Scaling up LLM - Sequence parallelism
12. Optimizing Model Architecture - Multi-head attention
13. Optimizing Model Architecture - Grouped-query attention
14. Optimizing Model Architecture - Flash attention
15. Optimizing Model Architecture - Efficient management of KV cache
16. Model Compression - Quantization
17. Model Compression - Sparsity
18. Model Compression - Distillation

Finally, we will explore the tools available in the market, compare them, and deep-dive into some coding and benchmarking strategies.

Follow Mastering LLM (Large Language Model) to get updates on this series and other interesting updates.

#llm #inferenceoptimization #kvcache #flashattention #agents #inference #optimization #Quantization #Distillation #vLLM #tensorRT #TGI #NIM #nvidia #deepspeed
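As a tiny preview of the serving-side topics above, the snippet below times generation with and without the key/value cache, using GPT-2 from Hugging Face purely as a small stand-in model.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Inference optimization matters because", return_tensors="pt")

for use_cache in (False, True):
    start = time.perf_counter()
    with torch.no_grad():
        # With the KV cache disabled, every new token recomputes attention
        # over the full sequence from scratch.
        model.generate(**inputs, max_new_tokens=64, use_cache=use_cache,
                       pad_token_id=tokenizer.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```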
-
I am really curious to see and understand what the concrete tools and components are for building real-life computing environments for generative AI model development. This article introduces a way to implement parallelism in such processes. Of course, the same techniques can be used in other HPC workloads such as scientific computing, rendering, etc. A toy sketch of the tensor-parallel idea follows the link below. https://lnkd.in/d3Bm56PG
[Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch
discuss.pytorch.org
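As a companion to the linked article, here is a single-process toy sketch of the core tensor-parallel idea: a linear layer's weight is sharded across ranks, each rank does a partial matmul, and the partial outputs are gathered. Real setups do this across GPUs with torch.distributed collectives (and, per the article, overlap that communication asynchronously); the shapes below are arbitrary.

```python
import torch

torch.manual_seed(0)
batch, d_in, d_out, world_size = 4, 8, 16, 2

x = torch.randn(batch, d_in)
weight = torch.randn(d_out, d_in)

# Reference: the full (non-parallel) linear layer.
reference = x @ weight.T

# "Shard" the weight row-wise over world_size ranks; each rank holds one shard.
shards = weight.chunk(world_size, dim=0)
partial_outputs = [x @ w_shard.T for w_shard in shards]  # each rank's local matmul
parallel = torch.cat(partial_outputs, dim=-1)            # the gather step across ranks

print(torch.allclose(reference, parallel))  # True: sharded compute matches the reference
```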
-
Anyone looking for LLM inference optimisation must look into this amazing series.
Why inference optimisation matters - https://lnkd.in/duu56cAR
Understanding transformer inference - https://lnkd.in/dNyktzjB
Follow Mastering LLM (Large Language Model) for upcoming topics.
-
🚀Key Takeaways from "An Image is Worth 1/2 Tokens After Layer 2"🚀

The study reveals inefficiencies in attention computation over visual tokens in Large Vision-Language Models (LVLMs). FastV, a versatile plug-and-play method, prunes redundant visual tokens after the early layers, optimizing computational efficiency and significantly reducing costs without sacrificing performance across image and video tasks.

FastV also lets you tune the trade-off between computational efficiency and performance. This customizable approach enables strong performance while compressing models for deployment on edge devices and in commercial settings (a toy sketch of the token-pruning idea follows the link below).

Read more about this innovative approach to AI, machine learning, and computer vision in the full article here: https://lnkd.in/gJ-DQ46M

#AI #MachineLearning #ComputerVision #Efficiency #Innovation
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
arxiv.org
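The snippet below is a toy illustration of the pruning idea in the paper's title: after an early layer, visual tokens are ranked by the attention they receive and roughly half are dropped. The shapes, the attention scores, and the 50% keep ratio are illustrative placeholders, not FastV's actual implementation.

```python
import torch

torch.manual_seed(0)
num_text, num_visual, hidden = 16, 576, 64
keep_ratio = 0.5

hidden_states = torch.randn(num_text + num_visual, hidden)
# Attention received by each visual token, e.g. averaged over heads and queries
# at the chosen early layer (random here for illustration).
attn_to_visual = torch.rand(num_visual)

k = int(num_visual * keep_ratio)
keep_idx = attn_to_visual.topk(k).indices.sort().values  # preserve token order

pruned = torch.cat([hidden_states[:num_text],             # keep all text tokens
                    hidden_states[num_text:][keep_idx]])  # keep top-k visual tokens
print(hidden_states.shape, "->", pruned.shape)            # (592, 64) -> (304, 64)
```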