Workshop on Video-Language Models

December 14, 2024

Vancouver Convention Centre
East Meeting Room 13

About the workshop

The growing relevance of video-language models in both academia and industry highlights the necessity for a dedicated workshop to address the unique challenges and opportunities this field presents. This workshop is designed to accelerate the development and practical application of video foundation models, which are crucial for interpreting and utilizing the extensive amounts of video data that make up a significant portion of global data. These models are increasingly vital for a range of applications, from video search and content creation to surveillance and robotics.

Check out the official NeurIPS workshop page for more information.

Our workshop will tackle four primary challenges:

  • First, the scarcity of high-quality, annotated video data is a significant barrier to progress. Unlike text and image data, which are abundant and often come with high-quality annotations, video data typically lacks such detailed annotations, limiting the development of advanced models.
  • Second, the sheer volume of video data demands significant advances in data processing techniques. Modern video models must process hundreds to thousands of frames per video, each requiring detailed analysis, so efficient processing methods are needed to handle this scale without sacrificing fine-grained information.
  • Third, the multimodal nature of video data requires sophisticated model designs that can integrate audio, visual, temporal, and textual data in a cohesive manner.
  • Last but not least, the community still lacks robust video-language alignment benchmarks, which makes it hard to evaluate and compare the capabilities of video-language models.

The goal is to push the boundaries of what video-language models can achieve while ensuring they are developed and deployed ethically and responsibly.

This workshop will serve as a platform for sharing knowledge, fostering collaborations, and setting future research directions in this essential and rapidly advancing field.

What the workshop offers

  • Resources such as access to embeddings for large-scale public video datasets and benchmarks for video-language alignment (see the Video Embeddings and Benchmark section).
  • Opportunity to explore the ethical implications of video foundation models, focusing on safety, reliability, and responsibility.
  • A series of expert talks, panel discussions, and collaborative sessions to discuss current advancements and tackle existing challenges.


Speakers

  • Gedas Bertasius, UNC Chapel Hill (confirmed)
  • Dima Damen, University of Bristol and Google DeepMind (confirmed)
  • Yong Jae Lee, University of Wisconsin-Madison (confirmed)
  • Doyup Lee, Runway (confirmed)
  • Ishan Misra, Meta Generative AI (confirmed)
  • Jianwei Yang, Microsoft (confirmed)

Accepted Papers

Oral

  • Wolf: Captioning Everything with a World Summarization Framework
  • TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
  • TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
  • VideoPhy: Evaluating Physical Commonsense for Video Generation
  • Taskverse: A Benchmark Generation Engine for Multi-modal Language Model

Posters

  • Read, Watch and Scream! Sound Generation from Text and Video
  • Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
  • MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
  • Too many frames, not all useful: Efficient Strategies for Long-Form Video QA
  • Can Video Large Language Models Comprehend Language in Videos?
  • Generative Timelines for Instructed Visual Assembly
  • GUI-WORLD: A GUI-oriented Video Dataset for Multimodal LLM-based Agents
  • VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
  • MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
  • Exploring In-Context Ensemble with Video-Language Models for Low-Level Workflow Understanding
  • Quo Vadis, Video Understanding with Vision-Language Foundation Models?
  • Language Repository for Long Video Understanding
  • VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
  • CinePile: A Long Video Question Answering Dataset and Benchmark
  • LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living
  • Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution
  • RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives
  • HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
  • Click & Describe: Multimodal Grounding and Tracking for Aerial Objects
  • Mobile OS Task Procedure Extraction from YouTube
  • IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
  • Matryoshka Multimodal Models


Program schedule

09:00 - 09:30   Morning break (coffee & tea available)
09:20 - 09:30   Opening remarks
09:30 - 10:10   Invited Talk 1: Dima Damen (online, recorded talk)
10:10 - 10:50   Invited Talk 2: Gedas Bertasius (in person)
10:50 - 11:30   Invited Talk 3: Yong Jae Lee (in person)
11:30 - 13:00   Lunch break
13:00 - 13:50   Oral session
13:50 - 14:30   Invited Talk 4: Ishan Misra (online, remote)
14:30 - 15:30   Poster session
15:00 - 15:30   Break (coffee & snacks provided)
15:30 - 16:10   Invited Talk 5: Jianwei Yang (in person)
16:10 - 16:50   Invited Talk 6: Doyup Lee (in person)
16:50 - 17:30   Panel discussion + closing (all speakers, open Q&A)

Organizers

  • Aiden Lee, Twelve Labs
  • Sangdoo Yun, NAVER AI Lab
  • Minjoon Seo, KAIST
  • Sangho Lee, AI2
  • Jiasen Lu, Apple
  • Mohaiminul Islam, UNC Chapel Hill
  • Yanbei Chen, Amazon AGI
  • Linjie Li, Microsoft
