Eagle 2.5: Boosting Long-Context Post-Training
for Frontier Vision-Language Models

Guo Chen1,2    Zhiqi Li1,2    Shihao Wang1,3    Jindong Jiang1,4    Yicheng Liu2    Lidong Lu2    De-An Huang1   
Wonmin Byeon1    Matthieu Le1    Max Ehrlich1    Tuomas Rintamaki1    Tyler Poon1    Tong Lu2    Limin Wang2    Bryan Catanzaro1    Jan Kautz1    Andrew Tao1   
Zhiding Yu1    Guilin Liu1   

Eagle 2.5 is a versatile multimodal model designed to efficiently process extensive contextual information with consistent performance scaling as input length increases.

Information-first sampling preserves essential visual and semantic information through Image Area Preservation and Automatic Degradation Sampling.

Progressive training incrementally expands context length during training, enhancing the model's ability to process inputs of varying sizes.

Eagle-Video-110K is a diverse video dataset with dual annotation approaches for comprehensive long-form understanding.

Paper · Eagle Code · Demo · Eagle-2.5 Weights (Coming Soon) · Eagle-Video-110K (Coming Soon)

Long-Context Vision-Language Understanding

Why Long-Context Understanding?

Despite significant advances in multimodal learning, many vision-language models (VLMs) remain focused on short-context tasks, with long-context understanding under-explored. This gap is particularly evident in both long video comprehension and high-resolution image/video understanding, where the processing of extended visual contexts remains an open challenge.

Challenges in Long-Context VLMs

The development of long-context VLMs is still in its early stages, hindered by fundamental challenges in dataset construction, architecture design, training strategies, and computation/memory bottlenecks. While prior studies have explored extending context length, key limitations remain: suboptimal performance compared to proprietary models, inconsistent improvements as visual input increases, and unclear optimal training strategies.

Eagle 2.5: Consistent Performance Scaling

Unlike models that are optimized merely to accommodate long multimodal sequences without improving on them, Eagle 2.5 benefits from increased input length: performance improves consistently as more context is provided. Our model achieves superior context coverage and exhibits consistent performance scaling with increasing frame counts, attaining results competitive with larger models such as GPT-4o and Qwen2.5-VL-72B while maintaining a significantly smaller parameter footprint.

Eagle 2.5 scaling properties

SOTA Performance across Image and Video Understanding

Eagle 2.5 demonstrates exceptional performance across a wide range of image and video understanding benchmarks, achieving competitive results compared to both open-source and proprietary models with significantly larger parameter counts.

Video Understanding

Eagle2.5-8B shows remarkable capabilities on multiple video benchmarks, achieving 74.8 on MVBench, 82.0 on Perception Test, and 72.2 on EgoSchema, outperforming similarly sized models such as InternVL2.5-8B (72.0 on MVBench) and Qwen2.5-VL-8B (69.6, 70.5, and 65.0 on the same three benchmarks). It particularly excels in long-form video understanding, scoring 77.6 on MLVU and 66.4 on LongVideoBench, surpassing even InternVL2.5-78B (75.7 and 63.6, respectively). On VideoMME (without subtitles), Eagle 2.5 achieves 72.4, coming extremely close to 72B-parameter models while using far fewer parameters.

Video understanding benchmark results

Image Understanding

Eagle2.5-8B demonstrates versatile image understanding across document comprehension (94.1 on DocVQA, 87.5 on ChartQA), information extraction (80.4 on InfoVQA, 83.7 on TextVQA), and optical character recognition (869 out of 1000 on OCRBench). The model also shows balanced capabilities in general perception and reasoning (66.2 on MMStar, 76.7 on RWQA, 81.7 on MMB-1.1), domain-specific knowledge (55.8 on MMMU, 84.5 on AI2D), visual hallucination assessment (54.7 on HallB), and mathematical reasoning (67.8 on MathVista).

Image understanding benchmark results

Training Strategy

Our approach combines two key components for effective long-context training: an information-first sampling strategy that establishes the sampling criteria, and a progressive training schedule built on that strategy which governs the entire training process.

Overview of the training strategy

Information-First Sampling

In multimodal training, how visual content is sampled is critical. Multi-image documents typically comprise dozens of pages with ultra-high-resolution images, while video content varies drastically in length, from mere seconds to hours. To manage this diversity, we introduce information-first sampling, which preserves information along both the visual and semantic dimensions.
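
As a rough illustration of these two ideas, the minimal Python sketch below tiles each image with a grid chosen to retain as much of the original area and aspect ratio as possible, and degrades the frame count and per-frame resolution only as far as needed to fit a visual-token budget. The tile size, token-cost model, scoring heuristic, and function names are assumptions for illustration, not Eagle 2.5's actual implementation.

```python
# Hedged sketch of information-first sampling: Image Area Preservation picks a
# tiling grid that keeps as much of the original image area/aspect ratio as
# possible; the degradation step trades frames and resolution against a fixed
# context budget. Tile size, budgets, and the cost model are illustrative
# assumptions, not Eagle 2.5's actual implementation.
from dataclasses import dataclass

TILE = 448                     # assumed vision-encoder tile size
TOKENS_PER_TILE = 256          # assumed visual tokens produced per tile


@dataclass
class Frame:
    width: int
    height: int


def best_tile_grid(w: int, h: int, max_tiles: int) -> tuple[int, int]:
    """Choose (cols, rows) maximizing preserved image area while staying
    close to the original aspect ratio and under the tile budget."""
    aspect = w / h
    best, best_score = (1, 1), float("-inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            grid_aspect = cols / rows
            # Downscale factor needed to fit the image into this grid,
            # squared to give the fraction of original area retained.
            scale = min(cols * TILE / w, rows * TILE / h, 1.0)
            area_kept = scale ** 2
            aspect_penalty = abs(grid_aspect - aspect) / aspect
            score = area_kept - 0.5 * aspect_penalty
            if score > best_score:
                best, best_score = (cols, rows), score
    return best


def degrade_to_budget(frames: list[Frame], token_budget: int,
                      max_tiles: int = 12, min_frames: int = 8):
    """Drop frames uniformly first, then shrink the per-frame tile budget,
    until the estimated visual-token count fits the context budget."""
    while True:
        grids = [best_tile_grid(f.width, f.height, max_tiles) for f in frames]
        cost = sum(c * r * TOKENS_PER_TILE for c, r in grids)
        if cost <= token_budget:
            return list(zip(frames, grids))
        if len(frames) > min_frames:          # prefer dropping frames first
            frames = frames[::2] if len(frames) // 2 >= min_frames else frames[:min_frames]
        elif max_tiles > 1:                   # then reduce per-frame resolution
            max_tiles -= 1
        else:
            return list(zip(frames, grids))   # cannot degrade further


# Example: a 2-hour video sampled at 1 fps, squeezed into a 64K-token budget.
video = [Frame(1920, 1080)] * 7200
kept = degrade_to_budget(video, token_budget=64_000)
print(len(kept), "frames kept at", kept[0][1], "tiles each")
```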

Post-Training Schedule

We introduce a comprehensive post-training framework consisting of two complementary strategies, summarized in the overview below.

Overview of the post-training schedule
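
As a hedged illustration of how a progressive schedule could be wired up, the sketch below steps the maximum context length through successive stages and rebuilds the dataloader at each stage so information-first sampling can spend the larger token budget. The stage lengths, epoch counts, and helper callables (build_dataloader, train_stage) are placeholders rather than the actual training configuration.

```python
# Hedged sketch of a progressive post-training schedule: the maximum context
# length grows across stages so the model first adapts to moderate sequences
# before seeing the longest ones. Stage lengths, epochs, and the helper
# callables (build_dataloader, train_stage) are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Stage:
    max_context_tokens: int   # sequence-length cap for this stage
    epochs: int               # how long to train at this cap


# Assumed schedule; the real stage boundaries are set by the training recipe.
SCHEDULE = [
    Stage(max_context_tokens=32_768, epochs=1),
    Stage(max_context_tokens=65_536, epochs=1),
    Stage(max_context_tokens=131_072, epochs=1),
]


def progressive_post_training(
    model,
    build_dataloader: Callable[[int], Iterable],   # repacks data to a token cap
    train_stage: Callable[[object, Iterable, int], None],
    schedule: list[Stage] = SCHEDULE,
):
    """Run one training stage per context-length cap, reusing the same model.

    Each stage rebuilds the dataloader so the sampling step can spend the
    larger token budget on more frames or higher-resolution tiles.
    """
    for stage in schedule:
        loader = build_dataloader(stage.max_context_tokens)
        train_stage(model, loader, stage.epochs)
    return model
```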

Eagle-Video-110K: A Diverse Video Dataset for Long-Context Understanding

Addressing Insufficient Video Length in Existing Datasets

Existing video datasets often contain videos that are too short for comprehensive long-context understanding. Eagle-Video-110K addresses this limitation by curating a diverse collection of longer videos from multiple sources, including VidChapters, MiraData, InternVid-10M, Panda-70M, Vript, Shot2Story, ViTT, and WebVid-10M. Using a diversity-driven strategy based on CLIP embeddings and similarity thresholds, we identify and select novel videos that significantly extend the average duration available for training and evaluation.

Comparison of video lengths across datasets
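
The selection step can be sketched as a simple greedy filter over pooled CLIP embeddings: a candidate video is kept only if its maximum cosine similarity to all previously kept videos stays below a threshold. The threshold value, the single pooled embedding per video, and the greedy ordering are assumptions for illustration rather than the exact curation pipeline.

```python
# Hedged sketch of diversity-driven video selection: a video is kept only if
# its CLIP embedding is sufficiently dissimilar from everything already kept.
# The similarity threshold and the use of one pooled embedding per video are
# assumptions; the actual curation pipeline may differ.
import numpy as np


def select_novel_videos(embeddings: np.ndarray, threshold: float = 0.85) -> list[int]:
    """Greedily keep videos whose max cosine similarity to the kept set
    stays below `threshold`. `embeddings` is (num_videos, dim), one pooled
    CLIP embedding per candidate video."""
    # L2-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(normed):
        if not kept or np.max(normed[kept] @ emb) < threshold:
            kept.append(i)
    return kept


# Example: filter 2,000 candidate embeddings (e.g., mean-pooled frame features).
candidates = np.random.randn(2_000, 512).astype(np.float32)
selected = select_novel_videos(candidates, threshold=0.85)
print(f"kept {len(selected)} of {len(candidates)} candidate videos")
```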

Dual Annotation Pipeline for Comprehensive Understanding

Eagle-Video-110K features a novel dual annotation pipeline that combines both top-down and bottom-up approaches for comprehensive video understanding. In the top-down approach, we leverage human-annotated chapters as semantically meaningful video segments, generating chapter-level dense captions with GPT-4o and long-form QA pairs with GPT-4. Simultaneously, our bottom-up approach focuses on localized spatiotemporal details by generating clip-level QA pairs enhanced with time interval references and textual context anchors. This hierarchical methodology enables both fine-grained temporal understanding and high-level semantic comprehension, creating a dataset that effectively supports long-context video reasoning.

Dual annotation pipeline
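
To make the two annotation streams concrete, the sketch below shows one possible data layout: top-down chapter records carrying dense captions and long-form QA, and bottom-up clip records carrying time-interval references, textual context anchors, and localized QA. The field names and structure are illustrative assumptions, not the released dataset schema.

```python
# Hedged sketch of how the dual annotations might be organized: top-down
# chapter-level captions and long-form QA, plus bottom-up clip-level QA tied
# to explicit time intervals and textual context anchors. Field names and
# structure are illustrative assumptions, not the released dataset schema.
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class ChapterAnnotation:
    """Top-down: human-annotated chapter with a dense caption (GPT-4o)
    and long-form QA pairs (GPT-4) spanning the whole chapter."""
    start_sec: float
    end_sec: float
    title: str
    dense_caption: str
    long_form_qa: list[QAPair] = field(default_factory=list)


@dataclass
class ClipAnnotation:
    """Bottom-up: short clip with QA grounded by a time-interval reference
    and a textual context anchor locating the clip within the video."""
    start_sec: float
    end_sec: float
    context_anchor: str
    qa: list[QAPair] = field(default_factory=list)


@dataclass
class VideoAnnotation:
    video_id: str
    duration_sec: float
    chapters: list[ChapterAnnotation] = field(default_factory=list)
    clips: list[ClipAnnotation] = field(default_factory=list)
```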

Impact on Video Understanding Performance

Eagle-Video-110K significantly enhances model performance across both mainstream long and short video benchmarks. Most notably, it substantially improves the model's capability to handle high frame counts (≥128 frames) by providing training data with longer videos that were previously absent in open-source training sets. This improvement is particularly evident on the Video-MME benchmark, where Eagle 2.5 trained with Eagle-Video-110K demonstrates consistent performance scaling with increasing frame counts, unlike other models that plateau or degrade with longer inputs.

Performance comparison on video understanding