Our Research

Fields of Interest

Multimodal Machine Learning

Multimodal machine learning is a subfield of machine learning and artificial intelligence that focuses on building models that can process and relate information from multiple modalities. The term "modality" refers to a particular way in which something exists or is experienced, such as images, text, audio, and sensor data. The core idea behind multimodal machine learning is to create models that can integrate these different types of data to perform some tasks more effectively than models that rely on a single modality.
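For illustration only, the minimal late-fusion sketch below shows one simple way such models can integrate image and text features; the encoder dimensions, hidden size, and number of classes are placeholder assumptions, not tied to any specific project here.

```python
import torch
import torch.nn as nn

# Minimal late-fusion sketch: image and text features are embedded
# separately, concatenated, and passed to a joint classification head.
# All sizes below are placeholder assumptions for illustration.
class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(torch.relu(fused))

# Usage with random stand-in features for a batch of 4 samples:
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
```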

Video Understanding

Video understanding and recognition is a dynamic and rapidly evolving field within computer vision and machine learning. It focuses on developing algorithms and models that can automatically analyze, interpret, and understand the content of video data. Advances in deep learning, computer vision, and artificial intelligence continue to push the boundaries of what is possible, enabling more accurate and robust video analysis systems that can transform industries and aspects of daily life.

Computer Vision and Machine Learning

Computer Vision is a field of study that seeks to develop techniques to help computers "see" and understand the world. It involves the automatic extraction, analysis, and understanding of useful information from digital images and videos. Computer Vision and Machine Learning often intersect and complement each other, enabling the development of systems that can perceive, understand, and interpret the visual world. Together, they are transformative fields that enable machines to perceive, learn, and make recommendations based on visual data. 

Current Projects

XEI - eXtreme Efficient Inference for long context range

The project aims to address the challenges associated with using large transformer architectures, which are crucial for modern AI and large language models, particularly when dealing with long context lengths. These models face significant computational demands when processing extensive inputs, such as text combined with images or videos, because the cost of attention grows quadratically with the context length.
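As a rough illustration of this scaling behaviour (not the XEI method itself), naive self-attention over n tokens materialises an n x n score matrix, so its compute and memory grow quadratically with context length. The sketch below uses arbitrary toy sizes to make this concrete:

```python
import torch

def naive_attention(q, k, v):
    # The score matrix has shape (n, n), so cost grows quadratically with n.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d = 64                       # toy feature dimension (assumption)
for n in (1_000, 4_000):     # example context lengths
    q = k = v = torch.randn(n, d)
    _ = naive_attention(q, k, v)
    # The float32 score matrix alone takes n * n * 4 bytes:
    print(f"n={n}: score matrix ~{n * n * 4 / 1e6:.0f} MB")
```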

The goal is to enable faster processing of long and diverse queries, facilitating new applications and setting new standards in long-context processing.

ELLIOT - An EU-Horizon project for robust general-purpose AI

ELLIOT is a Horizon Europe-funded project aiming to develop next-generation Multimodal Generalist Foundation Models: AI systems capable of learning general knowledge and patterns from massive amounts of data of various types, ranging from videos, images, and text to sensor signals, industrial time series, and satellite feeds, and efficiently transferring the generic knowledge learned in a generalist manner to a wide variety of downstream tasks. ELLIOT's models will empower new applications in the domains of media, earth modelling, robotic perception, autonomous driving, computer engineering, and workflow automation.

ERC Starting Grant GraViLa

GraViLa aims to learn semantic structures from multimodal data, capturing long-range concepts and relations through multimodal self-supervised learning without human annotation and representing them in the form of a graph.

By bridging the gap between these two parallel trends, multimodal supervision and graph-based representations, we combine their strengths in generating and processing topological data, which opens new ways of processing and understanding multimodal data and concepts at scale.

Link to EU website

BMBF Project STCL - Multi-modal Self-Supervised Learning for Spatial-temporal Concepts

The STCL project proposes using self-supervised multimodal learning to capture and align semantic concepts across modalities such as text, video, and audio. This approach eliminates the need for individual data annotations, allowing concepts to be learned jointly from texts and videos.
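A common way to realise this kind of annotation-free alignment, shown here only as a generic sketch rather than the specific STCL method, is a symmetric contrastive objective over paired video and text embeddings:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss: matching video/text pairs
    (same index in the batch) are pulled together, all other pairs are pushed
    apart. The pairing itself provides the supervision; no labels are needed."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(v))          # i-th video matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Usage with random stand-in embeddings (real encoders would produce these):
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```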

By leveraging self-supervised learning, the project aims to train models on large datasets, enabling more precise spatio-temporal localization of concepts. This advancement could significantly enhance the field of video analysis, making it possible to understand and interpret longer activities with greater accuracy.

MIT-IBM Sight and Sound

The Sight and Sound project focuses on learning from and recognizing multimodal data. We target feature representations and higher-level semantic concepts by training neural networks with multimodal data such as videos, sounds, and text.