Fan Zheyu's Final Year Project

On-demand dynamic temporal visual search in videoLLMs

Background and abstract:

Video large language models (videoLLMs), which couple a video encoder with a backbone LLM, have shown remarkable video understanding ability and mark a new step toward AGI. However, constrained by computational cost and context length, current videoLLMs tend to encode videos statically, unlike humans, who selectively process what they see. The standard pipeline sparsely and uniformly samples a predefined number of frames across the entire video, rather than dynamically producing the visual embeddings of interest and discarding unwanted information based on prompt cues, as a human would. Such methods not only lose crucial details because of the sparsity of frame sampling, but also introduce redundant information by encoding the entire video every time; both effects can degrade model performance, especially on long videos and temporal reasoning tasks. In addition, videoLLMs struggle with prompts about meta-information such as timestamps, since no corresponding encoding process exists.

To address these challenges, this work proposes On-demand Dynamic Temporal Visual Search for videoLLMs, a new paradigm applicable to any videoLLM that dynamically and selectively encodes the visual information a prompt calls for and discards the rest. With this paradigm, we seek to answer two fundamental questions: 1) to what extent is this process necessary; and 2) how well can it be performed.

To this end, we curated a dataset and built a plug-and-play implementation of the paradigm at the frame selection level using a GPT-4 based agent, and we evaluated it during both training and inference. Our findings show that the paradigm yields significant performance gains, shedding light on the critical role of dynamic temporal visual search in enhancing videoLLMs and highlighting its potential for advancing video understanding and processing.
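The abstract describes the frame-selection agent only at a high level, so the following is a minimal Python sketch contrasting the standard static pipeline (uniform sampling) with prompt-guided selection by a GPT-4 based agent. The helper names (uniform_sample, agent_select_frames) and the use of per-frame captions as a cheap textual proxy for visual content are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch (hypothetical, not the project's code): static uniform
# sampling vs. on-demand, prompt-guided frame selection via a GPT-4 agent.
from openai import OpenAI


def uniform_sample(num_video_frames: int, num_samples: int) -> list[int]:
    """Standard static pipeline: pick frame indices spread evenly
    across the whole video, ignoring the user's prompt entirely."""
    step = num_video_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def agent_select_frames(client: OpenAI, prompt: str,
                        frame_captions: list[str],
                        num_samples: int) -> list[int]:
    """On-demand selection: ask a GPT-4 based agent which frames are
    relevant to the prompt. Per-frame captions stand in for the visual
    content here; a real system could use lightweight visual features."""
    listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(frame_captions))
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Question about the video: {prompt}\n"
                f"Candidate frames:\n{listing}\n"
                f"Return the indices of the {num_samples} most relevant "
                f"frames as a comma-separated list, nothing else."
            ),
        }],
    )
    # Parse the agent's comma-separated index list back into integers.
    return sorted(int(tok) for tok in
                  reply.choices[0].message.content.split(","))
```

In this sketch, the indices returned by agent_select_frames would replace the uniform indices before the video encoder runs, so only prompt-relevant frames are embedded and the rest are discarded, which is the plug-and-play behavior the paradigm aims for.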
