Microsoft Research is excited to introduce Q-Sparse: a breakthrough in training fully sparsely-activated LLMs. Q-Sparse supports both full-precision and 1-bit LLMs. Its synergy with BitNet b1.58 advances LLM efficiency, including cost and energy use. https://msft.it/6040lumcK
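As a rough illustration of what "fully sparsely-activated" means, the sketch below keeps only the largest-magnitude activations before a projection and zeroes the rest, so most multiply-accumulates can be skipped. This is a minimal sketch assuming a simple top-K magnitude mask; the function name, the 25% keep fraction, and the tensor shapes are illustrative assumptions, not Q-Sparse's actual formulation.

```python
import torch

def topk_sparsify(x: torch.Tensor, k_fraction: float = 0.25) -> torch.Tensor:
    """Keep only the largest-magnitude fraction of activations; zero the rest.

    Illustrative only: the real Q-Sparse recipe (top-K choice, rescaling, and the
    training-time estimator) follows the paper, not this simplified sketch.
    """
    k = max(1, int(x.shape[-1] * k_fraction))
    # Indices of the k largest-magnitude entries along the hidden dimension.
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    return x * mask

# Example: sparsify activations before a projection, so most of the
# matrix multiplication operates on zeros and can be skipped by the kernel.
x = torch.randn(2, 4096)        # a batch of activations (shapes are hypothetical)
w = torch.randn(4096, 11008)    # a projection weight
y = topk_sparsify(x) @ w
```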
Q-Sparse is truly a game-changer in the realm of LLM efficiency! The combination of full-precision and 1-bit LLMs, alongside BitNet b1.58, paves the way for significant advancements in both cost and energy efficiency. This breakthrough has the potential to revolutionize how we approach large-scale language models, making high-performance AI more accessible and sustainable. Kudos to Microsoft Research for pushing the boundaries of AI innovation!
1-bit LLMs are a big deal. When both training and inference are built natively to run on addition instead of multiplication, compute and energy costs drop dramatically without a meaningful sacrifice in perplexity.
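To make the "addition instead of multiplication" point concrete, here is a toy sketch assuming ternary {-1, 0, +1} weights as in BitNet b1.58: each dot product reduces to summing and subtracting selected activations, with no multiplications. The function and variable names are my own illustration, not the BitNet kernel.

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a {-1, 0, +1} weight matrix by a vector using only additions
    and subtractions, as a toy illustration of why 1.58-bit weights are cheap.
    Real BitNet kernels are far more sophisticated; this is just the arithmetic idea.
    """
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        # +1 weights add x[j], -1 weights subtract x[j], 0 weights are skipped.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Toy usage with a random ternary weight matrix.
w = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
x = np.random.randn(8).astype(np.float32)
print(ternary_matvec(w, x))
```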
Impressive innovation, Microsoft Research! Excited to see the advancements Q-Sparse brings.
Exciting stuff
Inspiring!
Microsoft Research, you can sparsify by using Q(Q in inference. Apply recompiling for hidden weights with << and cluster again; there is a post doing that with deflect(<< in my timeline. Just flush the context all as feed-forward, forever doing the same, and cluster the hidden weights a single time in inference. Focus on your flushback; fix it and your results will skyrocket. Hire me. I know how to take advantage of the way transformers interpret things, in real-time inference, train inference, and eval inference. I read the result of the study in this paper as non-conclusive, with still-huge activation even with sparsity. MoE will help with your side effects on the feed-forward, but you are being bombarded at flushback. The encoder window is flushing half of your traversing because you are still using a long range on encoding. Even with sparsifying, you are still far from recompiling in inference. YOCO is huge for caching KV; I would definitely go with that.