Exploring the Next Generation of Data
in the Era of Foundation Models

News:

Recording of the tutorial and workshop on YouTube will be available to the general public once the 3-month embargo lifts.

[March 2025] We are hosting a CVPR 2025 Workshop NeXD. Join us!

[March 2025] We are hosting a CVPR 2025 Tutorial VAST, please stop by and participate!

Our Mission

Data has become more pivotal than ever, driving advancements from the first generation of deep learning models to the emergence of foundation models. The significant momentum behind foundation models has spurred their continuous integration into various applications, such as autonomous driving, medical diagnostics, and AI-powered chatbots. To ensure reliable and safe model development, the quality of the massive datasets these models depend on has gained increasing interest and attention. Given the sheer scale of raw data, it is essential to develop scalable methods to evaluate and select data based on the data quality and relevance to both general and specific tasks. Furthermore, foundation models themselves are now utilized to understand and uncover more data for further training, creating a feedback loop between models and datasets.

Our ultimate goal is to tackle this enormous challenge surrounding the next generation of data from numerous angles. In our trials, we consider several factors: definitions of quality data, bias-free data, scalability, generating data, ethical data gathering, continuous data gathering, and hallucination-free foundation models for data mining.

Overview

Today, we have started tackling this behemoth challenge through 4 different directions, with more in the works!

More details on the papers and their respective webpages are available below.

Our Works

SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen*, Nadine Chang*, Sifei Liu, Jose M. Alvarez
KDD 2025
Paper and code coming soon!

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, Jose M. Alvarez
CVPR 2025
Paper / Code

Enhancing Autonomous Driving Safety with Collision Scenario Integration
Zi Wang, Shiyi Lan, Xinglong Sun, Nadine Chang, Zhenxin Li, Zhiding Yu, Jose M. Alvarez
IROS 2025
Paper / Project Page

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
CVPR 2025
Paper / Code