We'd like to share our latest work: ZipAR, a simple and effective method that cuts auto-regressive image generation overhead by up to 91% without any training. 🔮 🔮 To address slow decoding in AR image generation models (such as Emu3, Lumina-mGPT, LlamaGen, and Janus), we propose a training-free parallel decoding framework that reduces image generation overhead by up to 91% while preserving almost all image detail and semantic information. This is early-stage work, and we are continuing to improve it. Feedback and discussion are very welcome! Paper link: https://lnkd.in/gizCB4bK Code link: https://lnkd.in/gjDvmEsh
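To build intuition for why parallel decoding helps, here is a toy scheduling sketch (my simplification, not the authors' code): if a token in row r only needs its left neighbors plus a local window of the row above, tokens across rows can be decoded in the same step, wavefront-style, instead of one at a time. The dependency rule and window size below are illustrative assumptions.

```python
# Toy wavefront schedule for grid-shaped AR decoding (hypothetical
# simplification of the locality idea; see the paper for the real method).

def decoding_steps(H, W, window):
    """Earliest parallel step at which token (r, c) can be decoded,
    assuming it depends on row r up to c-1 and row r-1 up to c+window."""
    step = [[0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])
            if r > 0:
                deps.append(step[r - 1][min(c + window, W - 1)])
            step[r][c] = 1 + max(deps, default=0)
    return step

H, W, window = 24, 24, 4
steps = decoding_steps(H, W, window)
parallel = max(max(row) for row in steps)   # length of the wavefront schedule
sequential = H * W                          # plain AR: one token per step
print(parallel, sequential)                 # far fewer parallel steps
```

With a small window the schedule length grows roughly as W + H * (window + 1) rather than H * W, which is where the large speedup comes from.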
Bohan Zhuang’s Post
-
Learn how to use ColPali as a re-ranker for highly relevant results using a multimodal index! Ravi Theja Desetty walks you through the technique:
💡 Cohere's multimodal embeddings for initial retrieval of both text and images
💡 We fetch the top 10 most relevant from both the text and image modalities
💡 ColPali generates multi-vector representations for both text and images in the same embedding space
💡 We re-rank to the top 5 for each modality before sending to the LLM
Check out the full video here: https://lnkd.in/gZWU3tmK
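The two-stage pattern above can be sketched in a few lines. The embeddings here are random stand-ins (in the real pipeline stage 1 would use Cohere's multimodal embeddings and stage 2 ColPali's multi-vector representations); what the sketch does show correctly is the retrieve-with-single-vectors, re-rank-with-MaxSim late-interaction structure.

```python
# Retrieve-then-re-rank sketch: cosine retrieval to top 10, MaxSim
# (late interaction) re-ranking to top 5. Embeddings are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def maxsim(query_vecs, doc_vecs):
    # ColPali-style late interaction: each query token vector takes its
    # best-matching document patch vector; the maxima are summed.
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

n_docs, dim, n_tok = 50, 64, 8
single = rng.normal(size=(n_docs, dim))        # one vector per doc (stage 1)
multi = rng.normal(size=(n_docs, n_tok, dim))  # multi-vector per doc (stage 2)
q_single = rng.normal(size=dim)
q_multi = rng.normal(size=(n_tok, dim))

# Stage 1: fetch the top 10 candidates with fast single-vector retrieval.
top10 = sorted(range(n_docs), key=lambda i: -cosine(q_single, single[i]))[:10]
# Stage 2: re-rank those candidates down to 5 with MaxSim.
top5 = sorted(top10, key=lambda i: -maxsim(q_multi, multi[i]))[:5]
print(top5)
```

The same flow would run once per modality (text and image) before sending the top results to the LLM.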
-
Great! I created a simple example using the Gemini 1.5 Flash model. Generating text from images: https://lnkd.in/dEtuy-j8 Generating text from audio and video: https://lnkd.in/dFdRXg9V I'll explore this approach as well.
✨ Meet PaliGemma → https://goo.gle/4eOAsRg This powerful open vision-language model is designed to be fine-tuned for strong performance on a wide range of vision-language tasks, including short video captioning, visual question answering, understanding text in images, object detection, & so much more.
-
In this episode, we discuss VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos by Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal. The paper introduces VideoTree, a novel framework that enhances the efficiency and accuracy of long-video question answering by selectively extracting and hierarchically organizing frames based on their relevance to the query. Unlike traditional methods that rely on dense and often redundant sampling of frames for LLM-based reasoning, VideoTree employs a dynamic, adaptive approach to identify and caption keyframes, forming a tree structure that reflects varying levels of detail where needed. Experiments demonstrate significant performance improvements and reduced inference times on benchmarks like EgoSchema, NExT-QA, and IntentQA.
arxiv preprint - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
podbean.com
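The adaptive idea in the summary above can be sketched as: cluster the frames, score each cluster's relevance to the query, and sample relevant clusters more densely than irrelevant ones. Everything below is a toy stand-in (random "frame features", my function names, flat clustering instead of the paper's tree), meant only to show the shape of the technique.

```python
# Toy query-adaptive keyframe selection: cluster frames, expand only the
# clusters relevant to the query. Features are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)

def kmeans(x, k, iters=20):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return centers, assign

n_frames, dim, k = 200, 32, 8
frames = rng.normal(size=(n_frames, dim))
query = rng.normal(size=dim)

centers, assign = kmeans(frames, k)
relevance = centers @ query                           # coarse score per cluster
relevant = set(np.argsort(relevance)[-3:].tolist())   # expand top clusters only

keyframes = []
for j in range(k):
    idx = np.where(assign == j)[0]
    take = 5 if j in relevant else 1   # denser sampling where it matters
    keyframes.extend(idx[:take].tolist())

print(len(keyframes))  # far fewer frames than dense sampling of all 200
```

In the real framework the expansion is recursive (hence the tree) and the kept frames are captioned for the LLM; the sketch stops at frame selection.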
-
Today in QEC on the arXiv https://lnkd.in/dtEGWbc7 A deep dive into the details of dynamical codes, such as figuring out what their distance is (which I know from experience can be tricky). They also show that these codes obey previous no-go theorems, meaning no non-Clifford gates in 2D.
-
Posterior sampling for solving inverse problems? No thanks 😉 We present Posterior-Mean Rectified Flow (PMRF): a novel photo-realistic image restoration algorithm which (provably) outperforms posterior sampling. It also beats current GAN-based and diffusion-based methods on a variety of tasks, including the challenging blind face image restoration problem. Project page: https://pmrf-ml.github.io Arxiv: https://lnkd.in/d2Z5vibp Code: https://lnkd.in/dcAATppy Hugging Face demo: https://lnkd.in/diWhfHd7 #inverseproblems #imageprocessing #computervision #flowmatching #deeplearning #machinelearning
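A toy sketch of the two-stage idea named above (my simplification, not the authors' code): first predict the posterior mean of the clean image given the degraded input, then transport that prediction toward the image distribution by integrating a rectified-flow ODE. Both "models" below are illustrative stand-ins.

```python
# PMRF-shaped pipeline sketch: posterior-mean estimate, then Euler
# integration of a flow ODE. Both networks are replaced by toy stand-ins.
import numpy as np

def posterior_mean_predictor(y):
    # Stand-in for an MMSE restoration network: here, mild smoothing.
    return 0.5 * (y + np.roll(y, 1))

def velocity(x, t, target):
    # Stand-in for the learned rectified-flow velocity field. Rectified
    # flows are trained on straight paths; here we fake that by pointing
    # straight at a fixed "clean" target.
    return target - x

def pmrf_restore(y, target, steps=10):
    x = posterior_mean_predictor(y)   # stage 1: posterior-mean estimate
    dt = 1.0 / steps
    for i in range(steps):            # stage 2: integrate the flow ODE
        x = x + dt * velocity(x, i * dt, target)
    return x

rng = np.random.default_rng(2)
clean = rng.normal(size=64)
degraded = clean + 0.5 * rng.normal(size=64)
restored = pmrf_restore(degraded, clean)
print(np.abs(degraded - clean).mean(), np.abs(restored - clean).mean())
```

The point of the real method is that starting the flow from the posterior mean, rather than from noise, provably gives lower distortion than posterior sampling at the same perceptual quality.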
-
Generating high-quality images with minimal distortion (and hallucinations) is HARD! BUT not for this guy! Check out Guy Ohayon's latest work, presenting a simple, well-justified, and extremely effective method to achieve exactly this. #GenAI
-
Discover how RIOS leverages Egnyte for advanced computational design and AI-powered image generation—without changing the way they work. Catch the full session on-demand now! https://lnkd.in/gUFJsfqu
-
DALL·E 3 is much better than previous versions because it uses a 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹 𝗶𝗺𝗮𝗴𝗲 𝗰𝗮𝗽𝘁𝗶𝗼𝗻𝗲𝗿 that generates highly detailed captions, overcoming a limitation of previous models trained on less informative image-text pairs scraped from the internet. By training on a blend of 95% synthetic and 5% ground-truth captions, DALL·E 3 achieves superior image generation, with more context and spatial information.
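The 95/5 blending recipe is just a per-example sampling choice during training-data construction. A minimal sketch (function names are mine, illustrative only):

```python
# Sketch of the 95/5 caption-blending recipe: pick the detailed synthetic
# caption most of the time, keep some ground-truth captions so the model
# still handles ordinary short prompts.
import random

def blended_caption(synthetic_caption, ground_truth_caption, rng,
                    synthetic_ratio=0.95):
    if rng.random() < synthetic_ratio:
        return synthetic_caption
    return ground_truth_caption

rng = random.Random(0)
picks = [blended_caption("synthetic", "ground_truth", rng)
         for _ in range(10_000)]
ratio = picks.count("synthetic") / len(picks)
print(ratio)  # close to 0.95
```

Keeping the 5% ground-truth share is the hedge against the model overfitting to the long, dense style of the synthetic captioner.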