OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Hanrong Ye*†, C.-H. Huck Yang†, Arushi Goel†, Wei Huang†, Ligeng Zhu†, Yuanhang Su†, Sean Lin†, An-Chieh Cheng†, Zhen Wan†, Jinchuan Tian†, Yuming Lou†, Dong Yang†, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin*^†, Pavlo Molchanov^
NVIDIA
*Corresponding Author | †Core Contribution | ^Equal Advisory

Abstract

OmniVinci is our systematic study of model architecture and data curation for omni-modal LLMs, resulting in a model that achieves state-of-the-art performance in joint perception of images, videos, audio, and text. On the architecture side, we present three key innovations: (i) OmniAlignNet, which strengthens alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping, which captures relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding, which encodes absolute temporal information in omni-modal embeddings. On the data side, we introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our 9B model outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6x reduction from Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
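
To make (ii) and (iii) concrete, the sketch below shows one common way such temporal mechanisms are realized: tokens from both streams are merged into a single time-ordered sequence (relative alignment), and each token is then rotated by an angle derived from its absolute timestamp rescaled into a fixed range (the constraint). This is an illustrative reading, not OmniVinci's released code; the function names, the clamp-and-rescale rule, and all shapes and constants are assumptions.

# Speculative PyTorch sketch of the two temporal mechanisms, under common
# RoPE conventions. Names, shapes, and the rescaling rule are assumptions,
# not OmniVinci's released implementation.
import torch

def temporal_embedding_grouping(vis_tok, vis_t, aud_tok, aud_t):
    """Merge vision and audio tokens into one sequence ordered by timestamp,
    so tokens from the same moment end up adjacent (relative alignment)."""
    tokens = torch.cat([vis_tok, aud_tok], dim=0)      # (Nv+Na, D)
    times = torch.cat([vis_t, aud_t], dim=0)           # (Nv+Na,)
    order = torch.argsort(times)
    return tokens[order], times[order]

def constrained_rotary_time_embedding(x, t, t_max=300.0, n_pos=100, base=10000.0):
    """Rotate embedding halves by angles derived from absolute timestamps
    that are first clamped and rescaled into a fixed position range (the
    constraint), so clips of any duration share the same angle budget."""
    d = x.size(-1) // 2
    inv_freq = base ** (-torch.arange(d, dtype=x.dtype, device=x.device) / d)
    pos = t.clamp(max=t_max) / t_max * n_pos           # bounded position index
    ang = pos[:, None] * inv_freq[None, :]             # (N, d) rotation angles
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: order both streams by time, then stamp absolute time into each token.
vis_tok, aud_tok = torch.randn(8, 64), torch.randn(4, 64)
vis_t = torch.linspace(0.0, 7.0, 8)                    # frame timestamps (s)
aud_t = torch.linspace(0.0, 6.0, 4)                    # audio-window timestamps (s)
tokens, times = temporal_embedding_grouping(vis_tok, vis_t, aud_tok, aud_t)
tokens = constrained_rotary_time_embedding(tokens, times)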

OmniVinci Method

Figure 1: OmniVinci Architecture Overview
Figure 2: OmniAlignNet Module for Cross-Modal Alignment
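
OmniAlignNet (Figure 2) aligns pooled vision and audio embeddings in a shared latent space. Implementation details are not given on this page, so the following is a minimal CLIP-style sketch of what such a module typically looks like: two projection heads plus a symmetric InfoNCE loss over matched vision-audio clip pairs. The class name, head shapes, temperature initialization, and loss choice are all our assumptions rather than the paper's actual design.

# Minimal CLIP-style sketch of a vision-audio alignment module. The class
# name, projection shapes, and loss choice are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAlignNetSketch(nn.Module):
    """Project vision and audio clip embeddings into one shared latent
    space and pull matched pairs together with a symmetric InfoNCE loss."""

    def __init__(self, vision_dim, audio_dim, shared_dim=512):
        super().__init__()
        self.vision_proj = nn.Sequential(
            nn.Linear(vision_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, shared_dim))
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, shared_dim))
        # Learnable temperature, initialized as in CLIP (log(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, vision_emb, audio_emb):
        # vision_emb: (B, Dv), audio_emb: (B, Da); row i of each comes from
        # the same clip, so the similarity matrix should peak on the diagonal.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()        # (B, B)
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: vision->audio plus audio->vision.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Usage with dummy pooled clip features:
net = OmniAlignNetSketch(vision_dim=1024, audio_dim=768)
loss = net(torch.randn(16, 1024), torch.randn(16, 768))

A symmetric objective is the usual choice for this kind of alignment because it keeps the shared space from drifting toward either modality; whether OmniVinci uses this exact formulation is not stated here.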

Benchmark Results

Table 1: Omni-Modal Understanding Results
Table 2: Video Understanding Results
Table 3: MMAR Audio Results
Table 4: MMAU Audio Results
Table 5: ASR Benchmark Results
Table 6: Image Understanding Results

Demonstrations

1. Detailed Video Content Analysis

Prompt: [Video] + "What are they talking about in detail?"

OmniVinci Output:

The video features Jensen Huang with gray hair and glasses, dressed in a black leather jacket, speaking into a microphone in a modern office setting. The background includes a glass wall and a green panel, with a blurred view of desks and chairs. Jensen Huang discusses the evolution of artificial intelligence, focusing on the impact of deep learning and the development of AI frameworks. He emphasizes the importance of data, computing power, and software methodologies in driving AI advancements. The conversation highlights the transformative potential of AI, particularly in perception and generative capabilities, and the need for a new approach to software development. Jensen Huang also mentions the creation of a new AI supercomputer, DGX-1, and its impact on the industry. The video concludes with Jensen Huang discussing the second phase of AI, generative AI, and its potential to revolutionize the way we understand and create information.

2. Detailed Visual Narration

Prompt: [Video] + "Describe the video in detail."

OmniVinci Output:

The video begins with Jensen Huang in a modern, well-lit room with large windows and a view of greenery outside. He, dressed in a black jacket and white pants, is seated at a table, writing a message on a black card with a gold pen. The message reads, "To Robot, Enjoy Your New Brain!" followed by a signature. He then places the card on the table and begins to open a large black gift box with a gold ribbon and bow. The scene transitions to a close-up of the gift box on the table, with the person's hand visible. The focus then shifts to a robot wearing a white hard hat with the "NVIDIA" logo, standing in a workshop or industrial setting. The robot holds the same black gift box with the gold ribbon and bow, and it opens the box to reveal the black card with the message. The robot examines the card closely. The narrative continues with the robot, still in the workshop setting, holding the black gift box. The robot opens the box, revealing a sleek, white device with a black screen, nestled in crumpled black paper. The robot examines the device closely, then places it back into the box and closes it. The scene transitions to a different setting, where the robot is now in a modern office environment with green walls and multiple computer monitors. The robot stands behind the closed gift box, gesturing with its hands as if explaining or presenting something. The video wraps up with the robot in the modern office environment, gesturing with its hands. The scene transitions to a close-up of the robot's face, showing its detailed features and expressive eyes. The robot continues to gesture, emphasizing its points. The video concludes with the NVIDIA logo displayed on a white background, indicating the brand associated with the robot and the gift.

3. Detailed Video and Audio Analysis

Prompt: [Video] + "Assess the video, followed by a detailed description of it's video and audio contents."

OmniVinci Output:

The video features a podcast discussion between two individuals, one in a light blue shirt and the other in a black blazer, set in a professional studio environment with beige walls and a microphone setup. The conversation revolves around the topic of artificial intelligence (AI), specifically focusing on the limitations and challenges of achieving human-level AI through neural networks (NNs). The individual in the light blue shirt, who is the primary speaker, expresses skepticism about the feasibility of NNs reaching human-level intelligence, emphasizing the need for a different approach. He highlights the limitations of NNs in handling complex tasks and the potential for diminishing returns on investment in this area. The individual in the black blazer, who listens attentively and occasionally interjects, shares his perspective on the potential of AI and the importance of understanding the physical world for AI systems to reason and plan effectively. He also discusses the challenges of achieving human-level AI and the need for a collaborative approach among the research community to address these challenges. The discussion is analytical and focused on the technical and practical aspects of AI, with both individuals providing insights and perspectives on the topic.

4. In-Context Speech Transcription

Prompt: [Video] + "Transcribe the speech."

OmniVinci Output:

The first part is: let's just go from what the AI can do. Let me work backwards. Agentic AI, as I mentioned, at its foundation is reasoning. We now have AIs that can reason, which is fundamentally about breaking a problem down step by step. Maybe it approaches a problem in a few different ways and selects the best answer. Maybe it solves the same problem in a variety of ways, and ensures it has the same answer: consistency checking. Or maybe, after it's done deriving the answer, it plugs it back into the equation, maybe a quadratic equation, to confirm that, in fact, that's the right answer instead of just one-shot blurting it out.

5. Speech-Prompted Visual Intelligence

Audio Prompt 1:

OmniVinci Output 1:

I see a kitchen setting with a countertop, a sink, and various kitchen items. There's a person's hand visible.... The countertop has a cutting board, a spoon, and a bottle of cleaning product. There's also a bowl with what appears to be pasta or noodles.

OmniVinci Voice-Output 1 (Target Speaker):

Audio Prompt 2:

OmniVinci Output 2:

Yes, you can use the soap dispenser located on the countertop next to the sink.

OmniVinci Voice-Output 2 (Target Speaker):

6. Interactive Game Playing (Taboo)

Audio Prompt in Video: "...it's kind of similar to the human brain, a lot of layers of nodes and weight connections... it's kind of like originate from the 1950s."

OmniVinci Output (Guessed Word):

The word is "Neural Network".

Citation

Please consider citing our paper and this framework if they are helpful in your research.

@article{omnivinci2025,
      title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
      author={Hanrong Ye and Chao-Han Huck Yang and Arushi Goel and Wei Huang and Ligeng Zhu and Yuanhang Su and Sean Lin and An-Chieh Cheng and Zhen Wan and Jinchuan Tian and Yuming Lou and Dong Yang and Zhijian Liu and Yukang Chen and Ambrish Dantrey and Ehsan Jahangiri and Sreyan Ghosh and Daguang Xu and Ehsan Hosseini-Asl and Danial Mohseni Taheri and Vidya Murali and Sifei Liu and Jason Lu and Oluwatobi Olabiyi and Frank Wang and Rafael Valle and Bryan Catanzaro and Andrew Tao and Song Han and Jan Kautz and Hongxu Yin and Pavlo Molchanov},
      journal={arXiv preprint},
      year={2025}
}