AI

Juan Pablo Balarini • 11 MAR 2026

How Eagerworks Built a Top-10 Global Multimodal Embedding Model for RAG for $500


Eagerworks’ Embedding Model for RAG Ranks Top-10 Globally in the ViDoRe2 Benchmark

When we released eager-embed-v1, it didn’t just perform well: it entered the global Top 10 of the ViDoRe2 multimodal retrieval benchmark (December 2025).

ViDoRe2 multimodal retrieval benchmark comparison of embedding models for Retrieval Augmented Generation, showing eager-embed-v1 ranked among leading open source multimodal embedding models with competitive performance and smaller model size.

Model: https://huggingface.co/eagerworks/eager-embed-v1

Training Code: https://github.com/eagerworks/eager-embed

In multilingual retrieval scenarios, it outperforms OpenAI’s CLIP ViT-Base, Google’s SigLIP, and ColPali v1.3, all while maintaining a dramatically more efficient single-vector architecture and being trained with just $500 in compute.

This places Eagerworks alongside the world's most advanced AI research teams and technology providers.

But this project did not start as a benchmark exercise.

It started as a concrete problem.

Building AI Systems for Real-World Retrieval Augmented Generation (RAG)

Eagerworks is a product-driven software development company that works with founders to design, build, and scale high-quality digital products. Alongside client work, we also build our own products and actively contribute to the open-source machine learning ecosystem.

We build products like DocsHunter, which allows companies to search and run Retrieval Augmented Generation (RAG) across millions of private documents in a secure way. We face similar retrieval challenges in products like BetterFirm, where understanding complex, visually rich data is critical, and Scouty, where search quality directly impacts user experience.

And that’s where the real challenge appeared.

Why Embedding Architecture Matters in Production AI Infrastructure

Embedding models are the backbone of modern AI systems. If embeddings are weak or inefficient, everything built on top of them suffers, especially enterprise search and large-scale RAG pipeline systems.

While building our products, we ran into limitations with existing embedding models.

Some models didn’t scale effectively; others were too expensive to operate in production. Some lost critical visual information, while others exploded storage and retrieval costs.

These constraints directly impacted retrieval quality, operational costs, and overall system performance. In DocsHunter, for example, this meant struggling to answer complex questions where visual layout mattered as much as text.

Benchmarks were not the issue. Architecture was.

We needed something that worked in practice for scalable AI systems, not just something that scored well on leaderboards.

Diagram showing how an embedding model converts different data types (text, audio, image) into numerical vector representations for semantic search and retrieval.

The Architectural Limits of Modern RAG Pipelines

1. Text-Only Embeddings Fail on Visually Rich Documents in Enterprise Search

Despite major advances in AI, most production systems still rely on text-only embeddings. In practice, that means extracting text from documents, typically via Optical Character Recognition (OCR), and embedding only the extracted text.

This creates a fundamental limitation: the pipeline strips away the very signals that make documents useful (layout, visual hierarchy, spatial relationships, and non-textual elements), so information is lost before retrieval even begins.

How Traditional RAG Pipelines Break on Visual Content

Most modern RAG pipelines are built around text-only embedding models, and the usual flow looks like this:

This works reasonably well for plain text. 
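As a minimal sketch (the helper functions below are hypothetical stand-ins, not a real OCR or embedding API), that text-only flow looks like:

```python
# Minimal sketch of a typical text-only RAG indexing flow. Every helper
# here is a toy stand-in so the shape of the pipeline is visible.

def run_ocr(page_bytes: bytes) -> str:
    # Stand-in for an OCR engine: pretend the page contains this text.
    return "Q3 revenue grew 12% year over year."

def chunk(text: str, max_words: int = 50) -> list[str]:
    # Naive fixed-size chunking by word count.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(texts: list[str]) -> list[list[float]]:
    # Stand-in for a text embedding model: one fixed-size vector per chunk.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

pages = [b"<pdf page bytes>"]
chunks = [c for page in pages for c in chunk(run_ocr(page))]
vectors = embed(chunks)  # one single vector per chunk goes into the index
```

Everything the OCR step dropped (layout, charts, visual hierarchy) is already gone by the time `embed` runs.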

But as soon as documents become visually rich (presentations, financial reports, slides, blueprints, dashboards, dense PDFs), single-vector text embeddings start to fail.

Meaning often lives in:

When these documents are reduced to OCR text, several trade-offs appear:

In short: text-only embeddings work for plain text.

They break on visually rich documents.

Typical single-vector embedding dimensions range from 768 to 2048. While increasing dimensionality can improve expressiveness, it also increases storage footprint and infrastructure costs.
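A quick back-of-the-envelope calculation (illustrative figures, not benchmarks from this post) shows how dimensionality translates into index size:

```python
# Back-of-the-envelope storage for single-vector embeddings at scale.
def index_size_gb(num_docs: int, dim: int, bytes_per_value: int = 4) -> float:
    # float32 = 4 bytes per value; halve for float16, quarter for int8.
    return num_docs * dim * bytes_per_value / 1e9

# 100M chunks at the two ends of the common dimensionality range:
small = index_size_gb(100_000_000, 768)    # ≈ 307 GB
large = index_size_gb(100_000_000, 2048)   # ≈ 819 GB
```

Moving from 768 to 2048 dimensions nearly triples the raw index footprint before any vector-database overhead is counted.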

2. Why Multimodal Multi-Vector Models Increase AI Infrastructure Cost at Scale

Multimodal models solve the visual blindness problem. They can process images, layout, and text jointly, improving retrieval quality for visually dense content.

However, because images are more complex than text, most multimodal models output multi-vector representations (matrices) instead of a single vector per document.

This means that, instead of comparing a single query vector to a single document vector, the system must compare many query vectors against many document vectors and then combine those comparisons to determine which result is most relevant.
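The many-to-many comparison above is typically implemented as ColBERT-style late interaction (MaxSim). A toy sketch with random tensors, assuming roughly 1,000 patch vectors per page (a ColPali-like figure, not a measurement from this post):

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    # ColBERT-style late interaction: for every query vector, take its best
    # match among all document vectors, then sum those per-query maxima.
    sims = query_vecs @ doc_vecs.T          # (num_query_vecs, num_doc_vecs)
    return sims.max(dim=1).values.sum()

torch.manual_seed(0)
query = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)     # 8 query vectors
doc = torch.nn.functional.normalize(torch.randn(1000, 128), dim=-1)    # ~1000 vectors per page
score = maxsim_score(query, doc)
```

A single query–document score already costs an 8×1000 similarity matrix; at retrieval time this must run against every candidate document, which is where the operational cost comes from.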

This introduces new operational complexity:

At a small scale, this works.

At a large scale, it becomes significantly more expensive to store, slower to query, and more complex to operate.

For systems handling millions, or even hundreds of millions, of embeddings, these trade-offs are not theoretical. The costs compound quickly, turning architectural decisions into infrastructure constraints.

A Real-World Enterprise RAG Case: Scaling DocsHunter to Hundreds of Millions of Embeddings

For one of our clients alone, DocsHunter handles hundreds of millions of embeddings. 

Using multi-vector embeddings would have pushed storage and retrieval costs beyond what was reasonable.

Text-only embeddings, meanwhile, failed to answer complex questions about visual content, while the existing multimodal multi-vector models capable of handling it were simply too expensive to operate at scale.


Storage comparison between dense embeddings and ColBERT-style multi-vector embeddings in large-scale RAG systems, showing the infrastructure cost difference when indexing millions of document pages.
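The gap in that comparison comes down to simple arithmetic. A back-of-the-envelope sketch, with assumed sizes (one 2560-dim float16 vector per page versus roughly 1,000 ColBERT-style vectors of 128 dims per page, a ColPali-like figure):

```python
# Illustrative storage math for dense vs multi-vector indexes (assumed sizes).
def index_gb(n_pages: int, vecs_per_page: int, dim: int, bytes_per_value: int = 2) -> float:
    # bytes_per_value=2 assumes float16 storage.
    return n_pages * vecs_per_page * dim * bytes_per_value / 1e9

dense = index_gb(100_000_000, 1, 2560)       # single dense vector per page
colbert = index_gb(100_000_000, 1000, 128)   # ColBERT-style multi-vector
ratio = colbert / dense                      # multi-vector storage multiple
```

Under these assumptions, the multi-vector index is 50× larger, before accounting for the more expensive MaxSim retrieval on top of it.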

So we had a structural trade-off:

We needed something in the middle.

Finding the Best Embedding Model for RAG: The Optimum on the Pareto Curve

Finding the best embedding model is not an easy task, and definitely not about leaderboard scores alone. What really matters is the balance between:

We needed:

But no model met those constraints, so we decided to build our own under real-world budget and infrastructure constraints.

Introducing Eagerworks’ eager-embed-v1: An Open Source Multimodal Embedding Model for Enterprise RAG

We created eager-embed-v1, a single-vector multimodal embedding model that resolves the core architectural trade-offs outlined above, designed for scalable retrieval, enterprise RAG, and semantic search in production AI architectures.

Our goal with this open-source AI model was simple: balance retrieval quality, storage efficiency, speed, and cost, without sacrificing performance. 

We built this multimodal embedding model on top of a Vision-Language Model (VLM), integrating it into a modern AI software development and custom AI deployment workflow. It runs on any modern GPU, requires minimal storage thanks to its compact embedding dimension, and enables fast retrieval, making it well-suited for production AI systems at scale.

Designing RAG Embeddings for Scalable AI Infrastructure

Instead of choosing between blind scalability (text-only embeddings) and expensive accuracy (multi-vector multimodal systems), we deliberately designed a single-vector multimodal architecture.

eager-embed-v1 was built to sit at a very specific point on the Pareto curve relevant to modern AI system architecture. It encodes both textual and visual information into a single 2560-dimensional dense vector.

Multi-vector systems (such as ColBERT-style architectures) optimize for peak accuracy, but introduce two structural costs at scale:

With tens or hundreds of millions of documents, this becomes operationally prohibitive.

Text-only embeddings sit on the opposite extreme: cheap and fast, but blind to visual structure, charts, layout, and images.

eager-embed-v1 targets the middle of the Pareto curve: a single dense vector that preserves multimodal structure while keeping infrastructure costs under control.

One vector per chunk. One cosine similarity function. Real-world scalability.
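That runtime story can be sketched in a few lines; the vectors below are random stand-ins for real eager-embed-v1 outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy stand-ins for eager-embed-v1 outputs: one L2-normalized 2560-dim
# vector per chunk (in production these come from the model itself).
doc_embs = F.normalize(torch.randn(10_000, 2560), dim=-1)
query_emb = F.normalize(torch.randn(1, 2560), dim=-1)

# With normalized vectors, cosine similarity is one matrix-vector product.
scores = (doc_embs @ query_emb.T).squeeze(-1)   # (10_000,)
top_k = scores.topk(5).indices                  # indices of the 5 best chunks
```

The same operation maps directly onto any vector database that supports cosine or dot-product search; no late-interaction scoring stage is needed.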

What This Architecture Enables in Enterprise Search and Semantic Search

By encoding both visual and textual information into a single dense vector, eager-embed-v1 simplifies retrieval across the entire stack.

This architectural decision allows us to:

The result is:

In practice, this enables:

Use cases that were previously too expensive or operationally complex now become feasible in production AI systems.

Model Specifications and Characteristics

We built:

Despite its capabilities, eager-embed-v1 remains one of the smallest models in its class and was trained with just $500 in compute, as part of a broader AI cost optimization effort.

Open Source AI Model for Production RAG Pipelines

As part of our broader open source machine learning initiatives, we’ve made the model fully open-source and available to the community.

Explore the model: https://huggingface.co/eagerworks/eager-embed-v1

Check the Training Code: https://github.com/eagerworks/eager-embed

Benchmark Results: Top-10 Multimodal Retrieval in ViDoRe2

When we released eager-embed-v1, it didn’t just perform well. It entered the global Top 10 of the ViDoRe2 multimodal retrieval benchmark (December 2025).

This places Eagerworks alongside the most advanced AI research teams and technology providers in the world.

In multilingual retrieval scenarios, eager-embed-v1 scores roughly 6.8× higher than OpenAI’s CLIP ViT-Base (56.4 vs 8.3 average score), 80% higher than Google’s SigLIP (56.4 vs 31.4 average score), and 5% higher than ColPali v1.3, while maintaining a dramatically more efficient single-vector architecture.

What makes this especially relevant for production teams is that eager-embed-v1:

This makes eager-embed-v1 one of the most capable open-source embedding models for RAG available today, combining enterprise-grade multimodal retrieval performance with architectural simplicity and cost efficiency.

Production Validation: Running at Enterprise Scale

The Top-10 global ranking in ViDoRe2 validates the approach. But the real validation is this:

This is what product-driven AI engineering looks like.

For us, this ranking is not about chasing leaderboards. It validates our core philosophy: build AI systems that work in the real world, at scale, under constraints, with ownership.

Getting Started: Using eager-embed-v1 in Your RAG Pipeline

We recommend using quantization for production deployments to further reduce memory usage and improve inference speed.

The model integrates cleanly into existing RAG pipelines and can be used for:

Now, let’s look at a minimal example of how to load the model and extract embeddings.

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE
).to(DEVICE).eval()

# Encode a chat-formatted message (text and/or images) into one embedding
def encode_message(message):
    with torch.no_grad():
        texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)

        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)

        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

        # Last-token pooling: take the final hidden state of the last token,
        # then L2-normalize so cosine similarity reduces to a dot product.
        last_hidden_state = model_outputs.hidden_states[-1]
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
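A hypothetical call site for encode_message. The message layout follows the common Qwen-VL chat convention; the field names and file path here are illustrative, so check them against the model card before use:

```python
# Build chat-style messages for a text query and an image document
# (illustrative structure, assuming the Qwen-VL chat message format).
query_message = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Query: quarterly revenue by region"}],
    }
]
doc_message = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "page_001.png"}],  # hypothetical path
    }
]

# query_emb = encode_message(query_message)  # (1, hidden_dim), L2-normalized
# doc_emb = encode_message(doc_message)
# score = (query_emb @ doc_emb.T).item()     # cosine similarity via dot product
```

Because both modalities land in the same embedding space, the text query can be scored directly against the image document with a single dot product.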

When Architecture Beats Hype in Enterprise RAG Systems

This project proves something important:

You don’t need billion-dollar budgets to compete globally.
You need architectural clarity.

eager-embed-v1 was trained with $500. It ranks in the global Top 10. It runs in production.

Because we optimized for the right constraints.

At Eagerworks, we don’t build AI experiments. We build AI systems that scale.

And when architecture matters more than hype, that difference shows.

If you're designing large-scale RAG systems, multimodal retrieval pipelines, or enterprise search architectures, and you’re looking for an AI development company, AI software development, or a machine learning consulting partner for custom AI development, get in touch! We’d love to help.

Technical Overview: Architecture and Training of eager-embed-v1

High-Level Multimodal Embedding Architecture

eager-embed-v1 is built on top of a Vision-Language Model (VLM) and trained using contrastive learning with in-batch negatives.

The core design goal was simple but restrictive: encode visually rich documents (text, layout, images) into a single fixed-size vector, without incurring multi-vector storage or retrieval overhead.

At a high level:

Rather than relying on ColBERT-like multi-vector architectures, we deliberately produce one dense vector per document chunk or query.

The runtime story becomes extremely simple: One forward pass. One vector. One cosine similarity computation.

Single-Vector vs Multi-Vector RAG Embeddings

Compared to multi-vector architectures (such as ColBERT-like approaches), eager-embed-v1 strikes a strong balance between embedding dimensionality and retrieval accuracy while maintaining operational efficiency. 

Unlike those approaches, it does not require MaxSim scoring functions, simplifying RAG architecture, RAG system design, and overall machine learning system design.

Multi-vector approaches can achieve strong accuracy, but they introduce two non-negotiable costs at scale:

At tens or hundreds of millions of documents, this becomes operationally prohibitive.

Pure text embeddings sit on the opposite extreme: cheap and fast, but blind to images, layout, charts, and structure.

eager-embed-v1 deliberately targets the middle of this trade-off.

It generates a single dense vector per document chunk that captures both textual and visual information within the same representation space.

Extracting Dense Embeddings for Semantic Search and RAG

Architecturally, eager-embed-v1 encodes text, images, or mixed visual documents (PDF pages, slides, screenshots) into a single dense vector.

Images are processed through the vision encoder and fused with text tokens inside the language model. After full cross-modal fusion, embeddings are extracted directly from the final hidden layer of the decoder.

Concretely:

Unlike traditional embedding models that rely on an additional projection head, eager-embed-v1 extracts embeddings directly from the model’s internal representations.

This design has several advantages:

Fine-Tuning for Dense Retrieval: Training an Embedding Model for RAG

We fine-tuned the model using the Tevatron framework, which is purpose-built for dense retrieval systems.

Tevatron was a good fit because:

Training was done with a standard contrastive objective:

This setup closely mirrors how the model is used at inference time, which helped reduce train–test mismatch.
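A standard contrastive objective with in-batch negatives reduces to cross-entropy over the query–document similarity matrix. A minimal sketch (the temperature value is illustrative, not the one used in training):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    # q, d: (batch, dim) L2-normalized query / document embeddings, where
    # d[i] is the positive for q[i] and every d[j != i] is an in-batch negative.
    logits = (q @ d.T) / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
q = F.normalize(torch.randn(16, 64), dim=-1)
d = F.normalize(torch.randn(16, 64), dim=-1)
loss = in_batch_contrastive_loss(q, d)
```

Because the loss directly rewards the query embedding for scoring its positive above every other document in the batch, training optimizes exactly the dot-product ranking used at inference.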

Retrieval-Oriented Dataset Construction for Multimodal RAG

The training data was assembled with a single explicit goal: to optimize retrieval behavior rather than generic multimodal understanding.

Most of the dataset was designed under the same retrieval-first assumptions later reflected in the multimodal extensions of MTEB. In practice, this meant prioritizing query–document style supervision over captioning or instruction-style data.

Each training example follows a retrieval-oriented structure:

Rather than relying on a single source, the dataset mixes several concrete datasets that are commonly used to evaluate multimodal and document retrieval, especially those later incorporated into MTEB and ViDoRe-style benchmarks. 

Concretely, training draws from:

All datasets were adapted to a consistent query–document format so that training conditions closely match both benchmark evaluation and production retrieval workloads.
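As a sketch only, a record in such a unified query–document format might look like this (the field names are hypothetical, not the actual Tevatron schema):

```python
# Hypothetical record layout after adapting the sources to one
# query–document format (illustrative field names and paths only).
example = {
    "query": "What was the operating margin in Q3?",
    "positive": {"type": "image", "path": "report_2024_p17.png"},
    "negatives": [
        {"type": "image", "path": "report_2024_p03.png"},  # visually similar page
        {"type": "text", "text": "Unrelated press release body..."},
    ],
}
```

Note the mix of textual and visual negatives in one record; as discussed below, visual negatives are what stop the model from collapsing into text-only behavior.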

A deliberate choice was to keep documents as documents. We avoided aggressively decomposing inputs into small regions or patches, even when visual grounding datasets made that tempting. This better matches real-world retrieval, where the unit of retrieval is usually a page or slide, not a bounding box.

One important lesson from this setup is that visual negatives matter as much as textual ones. Without them, models default to text-only behavior. When negatives are only text-based, the model quickly learns to rely on text tokens and ignore layout or imagery. Including visually similar but semantically different documents forces the model to encode charts, tables, and overall structure more faithfully.

Overall, the dataset is less about raw scale and more about alignment between training data, evaluation benchmarks, and production retrieval workloads.

Why Qwen3-VL Fits Modern AI System Architecture

We evaluated several vision-language models before settling on Qwen3-VL-4B-Instruct, and the choice was driven by very practical considerations rather than benchmark chasing.

Qwen3-VL stood out for a few reasons:

In practice, Qwen3-VL allowed us to treat document understanding as a first-class problem, rather than bolting OCR + heuristics on top of a text-only embedding model.

Training Under AI Infrastructure and Cost Constraints ($500 Compute)

One of the goals of this project was to prove that state-of-the-art embeddings do not require massive budgets.

Initial experiments and validation were done on a single RTX 3090, which allowed us to iterate quickly on:

Final training was run on:

Console output from a Tevatron-based training run displaying loss values, gradient norms, learning rate schedule, and checkpoint progress while training a multimodal embedding model for RAG

NVIDIA-SMI output displaying an 8x NVIDIA GeForce RTX 5090 training setup, with GPU memory usage, power consumption, temperatures, and CUDA driver information used for training a multimodal embedding model.

The total compute cost, including failed runs and ablations, was approximately $500.

This was only possible because we:

Engineering Challenges in Production Machine Learning Infrastructure

During training, we encountered and resolved several low-level challenges, including padding strategies when using Flash Attention, and we optimized the setup to run efficiently on limited hardware, an important part of AI infrastructure and machine learning cost management.

Some multimodal pipelines default to left padding. While harmless for generation, it silently degrades embedding quality.

The fix was simple but critical:

This is the kind of issue that doesn’t crash training; it silently degrades embedding quality at inference.
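One general way to make last-token extraction robust to padding (a sketch of the technique, not necessarily the exact fix applied here) is to index the last non-pad token via the attention mask instead of blindly taking position -1:

```python
import torch

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Locate the last *real* (non-pad) token per sequence and gather its
    # hidden state, so the pooled embedding never lands on a pad token,
    # whichever side the tokenizer padded on.
    lengths = attention_mask.sum(dim=1) - 1          # index of last real token
    batch_idx = torch.arange(hidden.size(0))
    return hidden[batch_idx, lengths]

hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
# Sequence 0 is right-padded (last real token at position 1),
# sequence 1 is full length (last real token at position 3).
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = last_token_pool(hidden, mask)   # picks positions 1 and 3
```

With naive `hidden[:, -1]` indexing, sequence 0 would have pooled a pad token's hidden state, which is exactly the kind of silent degradation described above.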

Technical Specifications:

Diagram of eager-embed-v1 architecture converting text or image inputs into a single dense embedding vector using a vision encoder and Qwen3-VL model.

Stay updated!
