A research team led by Prof. CHEN Changwen, Management Committee Member of PAIR, Interim Dean of the Faculty of Computer and Mathematical Sciences and Chair Professor of Visual Computing, has developed a novel video-language agent, VideoMind, that enables AI models to perform long-video reasoning and question-answering tasks by emulating the human way of thinking. The VideoMind framework incorporates an innovative Chain-of-Low-Rank Adaptation (LoRA) strategy to reduce the demand for computational resources and power, advancing the application of generative AI in video analysis. The findings have been submitted to leading AI conferences.
In designing VideoMind, the team drew on the human process of video understanding and introduced a role-based workflow. The framework comprises four roles: the Planner, which coordinates all other roles for each query; the Grounder, which localises and retrieves relevant moments; the Verifier, which validates the accuracy of the retrieved moments and selects the most reliable one; and the Answerer, which generates the query-aware answer. This progressive approach to video understanding, sketched below, helps address the challenge of temporally grounded reasoning that most AI models face.
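To make the division of labour concrete, the following minimal Python sketch shows how such a four-role pipeline could be orchestrated. All names, the stubbed responses, and the hard-coded candidate moments are illustrative assumptions, not VideoMind's actual API.

```python
"""A minimal sketch of a four-role video QA workflow (illustrative only)."""
from dataclasses import dataclass


@dataclass
class Moment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    score: float  # Verifier's reliability score


def ask_role(role: str, prompt: str) -> str:
    """Stub for querying the shared model under a role-specific adapter."""
    if role == "planner":
        return "grounder,verifier,answerer"  # canned plan for this demo
    return f"[{role}] {prompt}"


def answer_query(video_path: str, query: str) -> str:
    # 1. Planner decides which roles this query requires.
    plan = ask_role("planner", f"plan roles for: {query}").split(",")

    evidence = video_path
    if "grounder" in plan:
        # 2. Grounder proposes candidate moments (hard-coded stand-ins here;
        #    the real Grounder localises them in the video).
        candidates = [Moment(12.0, 34.5, 0.0), Moment(60.0, 75.0, 0.0)]

        # 3. Verifier scores each candidate and keeps the most reliable one.
        for m in candidates:
            verdict = ask_role("verifier", f"check {m.start}-{m.end}s")
            m.score = float(len(verdict))  # stand-in for a confidence score
        best = max(candidates, key=lambda m: m.score)
        evidence = f"{video_path}[{best.start:.1f}-{best.end:.1f}s]"

    # 4. Answerer generates the query-aware answer from the evidence.
    return ask_role("answerer", f"answer '{query}' using {evidence}")


print(answer_query("demo.mp4", "When does the goal happen?"))
```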
Another core innovation of the VideoMind framework lies in its adoption of a Chain-of-LoRA strategy. LoRA is a fine-tuning technique that has emerged in recent years; it adapts AI models for specific uses without full-parameter retraining. The Chain-of-LoRA strategy pioneered by the team applies four lightweight LoRA adapters within a unified model, each designed for calling a specific role. With this strategy, the model dynamically activates role-specific LoRA adapters during inference via self-calling, seamlessly switching among the roles. This eliminates the need to deploy multiple models, and the associated cost, while enhancing the efficiency and flexibility of the single model.
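As an illustration of this mechanism, the sketch below hot-swaps role-specific LoRA adapters on a single base model using the Hugging Face PEFT library's `load_adapter` and `set_adapter` calls. The model identifier and adapter paths are placeholders, not VideoMind's released checkpoints.

```python
# Illustrative adapter switching with Hugging Face PEFT; model id and
# adapter paths are placeholders, not VideoMind's released checkpoints.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-video-llm")  # placeholder id

# Attach the first role adapter, then register the remaining three
# on the same base weights.
model = PeftModel.from_pretrained(base, "adapters/planner", adapter_name="planner")
for role in ("grounder", "verifier", "answerer"):
    model.load_adapter(f"adapters/{role}", adapter_name=role)


def run_as(role: str, **inputs):
    """Activate one lightweight adapter and run inference in that role."""
    model.set_adapter(role)  # swaps adapters in place; no second model copy
    return model.generate(**inputs)
```

Because only the small adapter weights change between calls, switching roles costs far less than loading four separate models, which is the efficiency gain the Chain-of-LoRA design targets.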
VideoMind not only overcomes the performance limitations of AI models in video processing, but also serves as a modular, scalable and interpretable multimodal reasoning framework. Looking ahead, the team will expand the application of generative AI to areas such as intelligent surveillance, sports and entertainment video analysis, and video search engines.
Press release: https://www.polyu.edu.hk/en/media/media-releases/2025/0610_polyu-develops-novel-multi-modal-agent-to-facilitate-long-video-understanding-by-ai/
Research Units: PolyU Academy for Interdisciplinary Research; Faculty of Computer and Mathematical Sciences