Workshop on Video-Language Models

December 14 or 15, 2024

About the workshop

The growing relevance of video-language models in both academia and industry highlights the need for a dedicated workshop to address the unique challenges and opportunities of this field. This workshop is designed to accelerate the development and practical application of video foundation models, which are crucial for interpreting and utilizing the vast amounts of video that make up a significant portion of the world's data. These models are increasingly vital for a range of applications, from video search and content creation to surveillance and robotics.

Our workshop will tackle four primary challenges:

  • First, the scarcity of high-quality, annotated video data is a significant barrier to progress. Unlike text and image data, which are abundant and often come with high-quality annotations, video data typically lacks such detailed annotations, limiting the development of advanced models.
  • Second, the sheer volume of video data demands significant advances in data processing. Modern video models must process hundreds to thousands of frames per video, each requiring detailed analysis, so efficient processing methods are needed to handle this scale without sacrificing fine-grained information.
  • Third, the multimodal nature of video data requires sophisticated model designs that can integrate audio, visual, temporal, and textual data in a cohesive manner.
  • Fourth, the community still lacks robust video-language alignment benchmarks, which makes it hard to evaluate and compare the capabilities of video-language models.

The goal is to push the boundaries of what video-language models can achieve while ensuring they are developed and deployed ethically and responsibly.

This workshop will serve as a platform for sharing knowledge, fostering collaborations, and setting future research directions in this essential and rapidly advancing field.


What the Workshop Offers

  • Resources such as embeddings for large-scale public video datasets and benchmarks for video-language alignment (see the Video Embeddings and Benchmark section).
  • Opportunities to explore the ethical implications of video foundation models, focusing on safety, reliability, and responsibility.
  • A series of expert talks, panel discussions, and collaborative sessions to discuss current advancements and tackle existing challenges.


Speakers

Gedas Bertasius

Confirmed
UNC Chapel Hill
Video understanding, human behavior modeling, multi-modal deep learning, and transfer learning

Dima Damen

Confirmed
University of Bristol, Google DeepMind
Egocentric vision; automatic understanding of object interactions, actions, and activities using wearable visual (and depth) sensors

Yong Jae Lee

Confirmed
University of Wisconsin-Madison
Video, multimodal perception, embodied AI (vision for robotics, perception for action), and visual recognition.

Doyup Lee

Confirmed
Runway

Ishan Misra

Confirmed
Meta Generative AI
Multimodal, generative, and self-supervised computer vision

Jianwei Yang

Confirmed
Microsoft
Structured visual understanding at different levels, and how to leverage it for intelligent interaction with humans through language and with the environment through embodiment.

Panelists: TBD

Call for papers

We invite researchers to submit papers focusing on, but not limited to, the following topics related to video-language models:

  • Video-text alignment and multimodal understanding
  • Text-to-video generation and editing using natural language prompts
  • Video-to-text generation, including video captioning and description
  • Temporal reasoning and event understanding in video language models
  • Cross-modal retrieval between video and text
  • Video question answering and visual dialogue systems
  • Long-form video understanding and summarization
  • Ethical considerations and bias mitigation in video AI
  • Benchmarks and evaluation metrics for video-language tasks
  • Multimodal fusion techniques for video, audio, and text

Short, archival track: Submission of abstracts presenting early, novel ideas related to the workshop (up to 3 pages).

Long, non-archival track: Submission of papers relevant to our workshop, including papers accepted at other conferences.

Tentative Important Dates

  • Abstract Submission Deadline:
    September 10, 2024
  • Paper Submission Deadline:
    September 13, 2024
  • Review Bidding Period:
    September 13 - September 17, 2024
  • Review Deadline:
    October 11, 2024
  • Acceptance/Rejection Notification Date:
    October 14, 2024
  • Workshop Date:
    December 14 or 15, 2024

Awards

From the papers with the highest review scores, we will select one best paper and two runner-ups.

Video Embeddings and Benchmark: TBD

Processing videos and obtaining their embeddings is crucial for building powerful video-language models, but it is also computationally heavy and raises privacy concerns. We will release video embeddings and industry-grade evaluation benchmarks to facilitate research in this area.
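
As a minimal sketch of how such released embeddings might be consumed for video-text retrieval (the file names, array shapes, and the retrieve helper below are hypothetical illustrations, not a published interface):

    # Hypothetical sketch: video-text retrieval over precomputed video embeddings.
    # File names and shapes are assumptions; the actual release format is TBD.
    import numpy as np

    video_embs = np.load("video_embeddings.npy")   # assumed shape: (num_videos, dim)
    with open("video_ids.txt") as f:
        video_ids = f.read().split()               # assumed: one video ID per line

    def retrieve(text_emb: np.ndarray, k: int = 5) -> list[str]:
        """Return the IDs of the k videos most similar to a text embedding."""
        # Cosine similarity = dot product of L2-normalized vectors.
        v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
        t = text_emb / np.linalg.norm(text_emb)
        scores = v @ t
        top = np.argsort(-scores)[:k]
        return [video_ids[i] for i in top]

Pairing precomputed embeddings with a standard retrieval metric such as Recall@k would let participants benchmark video-language alignment without reprocessing raw videos.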

Program Schedule

Tentative

09:00 - 09:20   Morning break (coffee & tea available)
09:20 - 09:30   Opening remarks
09:30 - 10:10   Invited Talk 1: Dima Damen
10:10 - 10:50   Invited Talk 2: Gedas Bertasius
10:50 - 11:30   Invited Talk 3: Yong Jae Lee
11:30 - 13:00   Lunch break
13:00 - 13:50   Oral session
13:50 - 14:30   Invited Talk 4: Ishan Misra
14:30 - 15:30   Poster session
15:00 - 15:30   Break
15:30 - 16:10   Invited Talk 5: Jianwei Yang
16:10 - 16:50   Invited Talk 6: Doyup Lee
16:50 - 17:30   Panel discussion + closing

Organizers

  • Aiden Lee, Twelve Labs
  • Sangdoo Yun, NAVER AI Lab
  • Minjoon Seo, KAIST
  • Sangho Lee, AI2
  • Jiasen Lu, Apple
  • Mohaiminul Islam, UNC Chapel Hill
  • Yanbei Chen, Amazon AGI
  • Linjie Li, Microsoft
