We'd like to share our latest work: ZipAR, a simple and effective method that cuts auto-regressive image generation overhead by up to 91% without any training. 🔮 🔮 To address slow decoding in AR image generation models (such as Emu3, Lumina-mGPT, LlamaGen, and Janus), we propose a training-free parallel decoding framework that reduces image generation overhead by up to 91% while preserving almost all image detail and semantic information. This is early-stage work, and we are continuing to improve it. Feedback and discussion are very welcome! Paper link: https://lnkd.in/gizCB4bK Code link: https://lnkd.in/gjDvmEsh
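To build intuition for why parallel decoding helps, here is a toy scheduling sketch (my simplification, not the authors' code): if a token in row r only needs its left neighbors plus a local window of the row above, tokens across rows can be decoded in the same step, wavefront-style, instead of one at a time. The dependency rule and window size below are illustrative assumptions.

```python
# Toy wavefront schedule for grid-shaped AR decoding (hypothetical
# simplification of the locality idea; see the paper for the real method).

def decoding_steps(H, W, window):
    """Earliest parallel step at which token (r, c) can be decoded,
    assuming it depends on row r up to c-1 and row r-1 up to c+window."""
    step = [[0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])
            if r > 0:
                deps.append(step[r - 1][min(c + window, W - 1)])
            step[r][c] = 1 + max(deps, default=0)
    return step

H, W, window = 24, 24, 4
steps = decoding_steps(H, W, window)
parallel = max(max(row) for row in steps)   # length of the wavefront schedule
sequential = H * W                          # plain AR: one token per step
print(parallel, sequential)                 # far fewer parallel steps
```

With a small window the schedule length grows roughly as W + H * (window + 1) rather than H * W, which is where the large speedup comes from.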
Bohan Zhuang’s Post
-
Learn how to use ColPali as a re-ranker for highly relevant results using a multimodal index! Ravi Theja Desetty walks you through the technique:
💡 Cohere's multimodal embeddings for initial retrieval of both text and images
💡 We fetch the top 10 most relevant from both the text and image modalities
💡 ColPali generates multi-vector representations for both text and images in the same embedding space
💡 We re-rank to the top 5 for each modality before sending to the LLM
Check out the full video here: https://lnkd.in/gZWU3tmK
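The two-stage pattern above can be sketched in a few lines. The embeddings here are random stand-ins (in the real pipeline stage 1 would use Cohere's multimodal embeddings and stage 2 ColPali's multi-vector representations); what the sketch does show correctly is the retrieve-with-single-vectors, re-rank-with-MaxSim late-interaction structure.

```python
# Retrieve-then-re-rank sketch: cosine retrieval to top 10, MaxSim
# (late interaction) re-ranking to top 5. Embeddings are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def maxsim(query_vecs, doc_vecs):
    # ColPali-style late interaction: each query token vector takes its
    # best-matching document patch vector; the maxima are summed.
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

n_docs, dim, n_tok = 50, 64, 8
single = rng.normal(size=(n_docs, dim))        # one vector per doc (stage 1)
multi = rng.normal(size=(n_docs, n_tok, dim))  # multi-vector per doc (stage 2)
q_single = rng.normal(size=dim)
q_multi = rng.normal(size=(n_tok, dim))

# Stage 1: fetch the top 10 candidates with fast single-vector retrieval.
top10 = sorted(range(n_docs), key=lambda i: -cosine(q_single, single[i]))[:10]
# Stage 2: re-rank those candidates down to 5 with MaxSim.
top5 = sorted(top10, key=lambda i: -maxsim(q_multi, multi[i]))[:5]
print(top5)
```

The same flow would run once per modality (text and image) before sending the top results to the LLM.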
-
Great! I created a simple example using the Gemini 1.5 Flash model. Generating text from images: https://lnkd.in/dEtuy-j8 Generating text from audio and video: https://lnkd.in/dFdRXg9V I'll explore this approach as well.
✨ Meet PaliGemma → https://goo.gle/4eOAsRg This powerful open vision-language model is designed to be fine-tuned for strong performance on a wide range of vision-language tasks, including short video captioning, visual question answering, understanding text in images, object detection, & so much more.
-
In this episode, we discuss VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos by Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal. The paper introduces VideoTree, a novel framework that enhances the efficiency and accuracy of long-video question answering by selectively extracting and hierarchically organizing frames based on their relevance to the query. Unlike traditional methods that rely on dense and often redundant sampling of frames for LLM-based reasoning, VideoTree employs a dynamic, adaptive approach to identify and caption keyframes, forming a tree structure that reflects varying levels of detail where needed. Experiments demonstrate significant performance improvements and reduced inference times on benchmarks like EgoSchema, NExT-QA, and IntentQA.
arxiv preprint - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
podbean.com
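The adaptive idea in the summary above can be sketched as: cluster the frames, score each cluster's relevance to the query, and sample relevant clusters more densely than irrelevant ones. Everything below is a toy stand-in (random "frame features", my function names, flat clustering instead of the paper's tree), meant only to show the shape of the technique.

```python
# Toy query-adaptive keyframe selection: cluster frames, expand only the
# clusters relevant to the query. Features are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)

def kmeans(x, k, iters=20):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return centers, assign

n_frames, dim, k = 200, 32, 8
frames = rng.normal(size=(n_frames, dim))
query = rng.normal(size=dim)

centers, assign = kmeans(frames, k)
relevance = centers @ query                           # coarse score per cluster
relevant = set(np.argsort(relevance)[-3:].tolist())   # expand top clusters only

keyframes = []
for j in range(k):
    idx = np.where(assign == j)[0]
    take = 5 if j in relevant else 1   # denser sampling where it matters
    keyframes.extend(idx[:take].tolist())

print(len(keyframes))  # far fewer frames than dense sampling of all 200
```

In the real framework the expansion is recursive (hence the tree) and the kept frames are captioned for the LLM; the sketch stops at frame selection.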
-
Today in QEC on the arXiv https://lnkd.in/dtEGWbc7 A deep dive into the details of dynamical codes, such as figuring out what their distance is (which I know from experience can be tricky). They also show that these codes obey previous no-go theorems, meaning no non-Clifford gates in 2D.
-
Posterior sampling for solving inverse problems? No thanks 😉 We present Posterior-Mean Rectified Flow (PMRF): a novel photo-realistic image restoration algorithm which (provably) outperforms posterior sampling. It also beats current GAN-based and diffusion-based methods on a variety of tasks, including the challenging blind face image restoration problem. Project page: https://pmrf-ml.github.io Arxiv: https://lnkd.in/d2Z5vibp Code: https://lnkd.in/dcAATppy Hugging Face demo: https://lnkd.in/diWhfHd7 #inverseproblems #imageprocessing #computervision #flowmatching #deeplearning #machinelearning
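A toy sketch of the two-stage idea named above (my simplification, not the authors' code): first predict the posterior mean of the clean image given the degraded input, then transport that prediction toward the image distribution by integrating a rectified-flow ODE. Both "models" below are illustrative stand-ins.

```python
# PMRF-shaped pipeline sketch: posterior-mean estimate, then Euler
# integration of a flow ODE. Both networks are replaced by toy stand-ins.
import numpy as np

def posterior_mean_predictor(y):
    # Stand-in for an MMSE restoration network: here, mild smoothing.
    return 0.5 * (y + np.roll(y, 1))

def velocity(x, t, target):
    # Stand-in for the learned rectified-flow velocity field. Rectified
    # flows are trained on straight paths; here we fake that by pointing
    # straight at a fixed "clean" target.
    return target - x

def pmrf_restore(y, target, steps=10):
    x = posterior_mean_predictor(y)   # stage 1: posterior-mean estimate
    dt = 1.0 / steps
    for i in range(steps):            # stage 2: integrate the flow ODE
        x = x + dt * velocity(x, i * dt, target)
    return x

rng = np.random.default_rng(2)
clean = rng.normal(size=64)
degraded = clean + 0.5 * rng.normal(size=64)
restored = pmrf_restore(degraded, clean)
print(np.abs(degraded - clean).mean(), np.abs(restored - clean).mean())
```

The point of the real method is that starting the flow from the posterior mean, rather than from noise, provably gives lower distortion than posterior sampling at the same perceptual quality.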
-
Generating high-quality images with minimal distortion (and hallucinations) is HARD! BUT not for this guy! Check out Guy Ohayon's latest work, presenting a simple, well-justified, and extremely effective method to achieve exactly this. #GenAI
-
Discover how RIOS leverages Egnyte for advanced computational design and AI-powered image generation—without changing the way they work. Catch the full session on-demand now! https://lnkd.in/gUFJsfqu
-
DALL·E 3 is much better than previous versions because it uses a 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹 𝗶𝗺𝗮𝗴𝗲 𝗰𝗮𝗽𝘁𝗶𝗼𝗻𝗲𝗿 that generates highly detailed captions, overcoming a limitation of previous models trained on less informative image-text pairs scraped from the internet. By training on a blend of 95% synthetic and 5% ground-truth captions, DALL·E 3 achieves superior image generation, with more context and spatial information.
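The 95/5 blending recipe is just a per-example sampling choice during training-data construction. A minimal sketch (function names are mine, illustrative only):

```python
# Sketch of the 95/5 caption-blending recipe: pick the detailed synthetic
# caption most of the time, keep some ground-truth captions so the model
# still handles ordinary short prompts.
import random

def blended_caption(synthetic_caption, ground_truth_caption, rng,
                    synthetic_ratio=0.95):
    if rng.random() < synthetic_ratio:
        return synthetic_caption
    return ground_truth_caption

rng = random.Random(0)
picks = [blended_caption("synthetic", "ground_truth", rng)
         for _ in range(10_000)]
ratio = picks.count("synthetic") / len(picks)
print(ratio)  # close to 0.95
```

Keeping the 5% ground-truth share is the hedge against the model overfitting to the long, dense style of the synthetic captioner.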