Title of the Report: Addressing vision tasks with large foundation models: how far can we go without training?
Presenter: Dr. WANG Yiming
Affiliation: Fondazione Bruno Kessler (FBK), Italy
Date of the Report: December 26th, 2024 (Thursday) at 15:00
Location of the Report: Room 1104, Building A, Feicui Science and Education Building
Abstract:
Recent advances in Vision and Language Models (VLMs) have significantly impacted computer vision research, particularly thanks to their ability to interpret multimodal information within a unified representation space. Notably, the generalisation capability of VLMs, honed through pre-training on web-scale data, has enabled remarkable performance in zero-shot recognition.
As most public research institutes lack the resources to compete directly in developing such large models, we have explored new research opportunities through a training-free methodology that leverages pre-trained models and existing databases containing rich world knowledge.
In this talk, I will first present how we exploit VLMs to perform image classification without training, particularly in low-resource domains where images or their annotations are scarce, achieving highly competitive performance. I will then present how VLMs and Large Language Models (LLMs) can be synergised in a training-free manner to advance video understanding, in particular to recognise anomalous patterns in video content.
Biography of the Presenter: 
Yiming Wang is a Researcher in the Deep Visual Learning (DVL) Unit at Fondazione Bruno Kessler (FBK), Italy. She has expertise in vision-based scene understanding that facilitates automation and social good, covering diverse topics including static scene modelling, semantic understanding and video analysis.
She obtained her PhD in 2018 from Queen Mary University of London (QMUL) under the supervision of Prof. Andrea Cavallaro. Previously, she was a postdoctoral researcher in the Pattern Analysis and Computer Vision (PAVIS) research line at Istituto Italiano di Tecnologia (IIT), working mainly on active 3D vision. Her recent research focuses on training-free methods that leverage foundation models to address vision tasks.
She has served as a reviewer for many top-tier vision and robotics conferences and journals (Outstanding Reviewer, BMVC 2021) and as an Area Chair for ICRA'24, ECCV'24 and CVPR'25. She is an Associate Editor of the International Journal of Social Robotics (SoRo). She is currently responsible for an innovative project on low-carbon learning algorithms funded by CariVerona. She is also an ELLIS member.