Juan Pablo Balarini • 11 MAR 2026

How Eagerworks Built a Top-10 Global Multimodal Embedding Model for RAG for $500

Eagerworks’ Embedding Model for RAG Ranks Top-10 Globally in the ViDoRe2 Benchmark

When we released eager-embed-v1, it didn’t just perform well, it entered the global Top 10 of the ViDoRe2 multimodal retrieval benchmark (December 2025).

ViDoRe2 multimodal retrieval benchmark comparison of embedding models for Retrieval Augmented Generation, showing eager-embed-v1 ranked among leading open source multimodal embedding models with competitive performance and smaller model size.

Model: https://huggingface.co/eagerworks/eager-embed-v1

Training Code: https://github.com/eagerworks/eager-embed

In multilingual retrieval scenarios, it delivers:

80% higher performance than Google’s Siglip model (56.4 vs 31.4 Average score)
674% better performance than OpenAI’s Clip Vit Base (56.4 vs 8.3 Average score)
5% better performance than Colpali v1.3, despite eager-embed being a single-vector model (vs multi-vector in Colpali)

All while maintaining a dramatically more efficient single-vector architecture and being trained with just $500 in compute.

This places Eagerworks alongside the world's most advanced AI research teams and technology providers.

But this project did not start as a benchmark exercise.

It started as a concrete problem.

Building AI Systems for Real-World Retrieval Augmented Generation (RAG)

Eagerworks is a product-driven software development company that works with founders to design, build, and scale high-quality digital products. Alongside client work, we also build our own products and actively contribute to the open-source machine learning ecosystem.

We build products like DocsHunter, which allows companies to search and run Retrieval Augmented Generation (RAG) across millions of private documents in a secure way. We face similar retrieval challenges in products like BetterFirm, where understanding complex, visually rich data is critical, and Scouty, where search quality directly impacts user experience.

And that’s where the real challenge appeared.

Why Embedding Architecture Matters in Production AI Infrastructure

Embedding models are the backbone of modern AI systems. If embeddings are weak or inefficient, everything built on top of them suffers, especially enterprise search and large-scale RAG pipeline systems.

While building our products, we ran into limitations with existing embedding models.

Some models didn’t scale effectively, others were too expensive to operate in production, and some models lost critical visual information, while others exploded in storage and retrieval costs.

These constraints directly impacted retrieval quality, operational costs, and overall system performance. In DocsHunter, for example, this meant struggling to answer complex questions where visual layout mattered as much as text.

Benchmarks were not the issue. Architecture was.

We needed something that worked in practice for scalable AI systems, not just something that scored well on leaderboards.

Alt text: Diagram showing how an embedding model converts different data typed (text, audio, image) into numerical vector representations for semantic search and retrieval.

The Architectural Limits of Modern RAG Pipelines

1. Text-Only Embeddings Fail on Visually Rich Documents in Enterprise Search

Despite major advances in AI, most production systems still rely on text-only embeddings. In practice, that means extracting text from documents, typically via Optical Character Recognition (OCR), and embedding only the extracted text.

This creates a fundamental limitation: the pipeline strips away the very signals that make documents useful (layout, visual hierarchy, spatial relationships, and non-textual elements), so information is lost before retrieval even begins.

How Traditional RAG Pipelines Break on Visual Content?

Most modern RAG pipelines are built around text-only embedding models, and the usual flow looks like this:

Extract text from documents (often using OCR)
Split the content into chunks
Generate text embeddings
Store those vectors in a vector database and then
Embed the user’s query into a vector
Match the query vector against stored vectors
Retrieve the closest vectors at query time
Send retrieved chunks to the LLM for answer generation

This works reasonably well for plain text.

But as soon as documents become visually rich (presentations, financial reports, slides, blueprints, dashboards, dense PDFs), single-vector text embeddings start to fail.

Meaning often lives in:

Tables
Charts
Diagrams
Images
Layout and spatial structure (columns, callouts, captions, headers, footnotes)

When these documents are reduced to OCR text, several trade-offs appear:

You must run OCR (often slow + error-prone)
You lose layout and hierarchy
Charts/diagrams become “flattened” into noisy or incomplete text
Preprocessing complexity increases (cleanup, chunking hacks, table parsers)
You miss critical non-textual signals entirely

In short: text-only embeddings work for plain text.

They break on visually rich documents.

Typical single-vector embedding dimensions range from 768 to 2048. While increasing dimensionality can improve expressiveness, it also increases storage footprint and infrastructure costs.

2. Why Multimodal Multi-Vector Models Increase AI Infrastructure Cost at Scale

Multimodal models solve the visual blindness problem. They can process images, layout, and text jointly, improving retrieval quality for visually dense content.

However, because images are more complex than text, most multimodal models output multi-vector representations (matrices) instead of a single vector per document.

This means that, instead of comparing a single query vector to a single document vector, the system must compare many query vectors against many document vectors and then combine those comparisons to determine which result is most relevant.

This introduces new operational complexity:

Max-sim or late-interaction scoring
Multiple similarity computations per query
Complex ranking logic

At a small scale, this works.

At a large scale, it becomes significantly more expensive to store, slower to query, and more complex to operate.

For systems handling millions, or even hundreds of millions, of embeddings, these trade-offs are not theoretical. The costs compound quickly, turning architectural decisions into infrastructure constraints.

A Real-World Enterprise RAG Case: Scaling DocsHunter to Hundreds of Millions of Embeddings

For one of our clients alone, DocsHunter handles hundreds of millions of embeddings.

Using multi-vector embeddings would have pushed storage and retrieval costs beyond what was reasonable.

Also, text-only embeddings failed to answer complex questions about visual content. At the same time, existing multimodal multi-vector models capable of handling this were simply too expensive to operate at scale.

Storage comparison between dense embeddings and ColBERT-style multi-vector embeddings in large-scale RAG systems, showing the infrastructure cost difference when indexing millions of document pages.

So we had a structural trade-off:

Text-only → scalable but blind
Multi-vector multimodal → accurate but expensive

We needed something in the middle.

Finding the Best Embedding Model for RAG: The optimal in the Pareto Curve

Finding the best embedding model is not an easy task, and definitely not about leaderboard scores alone. What really matters is the balance between:

Retrieval performance
Model size (inference cost)
Storage requirements
Retrieval speed
License suitability for commercial use

We needed:

Better than text-only
Far more efficient than multi-vector multimodal models
Commercially usable
Architecturally simple

But, no model met those constraints. Thus, we decided to build our own, under real-world budget and infrastructure constraints.

Introducing Eagerworks’ eager-embed-v1: An Open Source Multimodal Embedding Model for Enterprise RAG

We created eager-embed-v1, a single-vector multimodal embedding model that resolves the core architectural trade-offs outlined above, designed for scalable retrieval, enterprise RAG, and semantic search in production AI architectures.

Our goal with this open-source AI model was simple: balance retrieval quality, storage efficiency, speed, and cost, without sacrificing performance.

We built this multimodal embedding model on top of a Vision-Language Model (VLM), integrating it into a modern AI software development and custom AI deployment workflow. And it runs on any modern GPU, requires minimal storage thanks to its compact embedding dimension, and enables fast retrieval, making it well-suited for production AI systems at scale.

Designing RAG Embeddings for Scalable AI Infrastructure

Instead of choosing between blind scalability (text-only embeddings) and expensive accuracy (multi-vector multimodal systems), we deliberately designed a single-vector multimodal architecture.

eager-embed-v1 was built to sit at a very specific point on the Pareto curve relevant to modern AI system architecture. It encodes both textual and visual information into a single 2560-dimensional dense vector.

Multi-vector systems (such as ColBERT-style architectures) optimize for peak accuracy, but introduce two structural costs at scale:

Storage grows linearly with document length
Retrieval requires max-sim or late-interaction scoring
Query-time similarity computations multiply

With tens or hundreds of millions of documents, this becomes operationally prohibitive.

Text-only embeddings sit on the opposite extreme: cheap and fast, but blind to visual structure, charts, layout, and images.

eager-embed-v1 targets the middle of the Pareto curve: a single dense vector that preserves multimodal structure while keeping infrastructure costs under control.

One vector per chunk. One cosine similarity function. Real-world scalability.

What This Architecture Enables in Enterprise Search and Semantic Search

By encoding both visual and textual information into a single dense vector, eager-embed-v1 simplifies retrieval across the entire stack.

This architectural decision allows us to:

Avoid expensive max-sim or late-interaction scoring
Keep similarity search limited to fast cosine computation
Significantly reduce storage requirements
Maintain strong multimodal retrieval performance

The result is:

Predictable storage growth
Millisecond retrieval latency
Infrastructure simplicity
Scalability across hundreds of millions of documents

In practice, this enables:

Multimodal semantic search at scale
Faster enterprise document retrieval
Duplicate detection across documents and images
Large-scale knowledge systems
RAG pipelines that work reliably with visually rich documents

Use cases that were previously too expensive or operationally complex now become feasible in production AI systems.

Model Specifications and Characteristics

We built:

A model that produces a single 2560-dimensional dense vector
Full text + image support
Cosine similarity only
No max-sim or late-interaction scoring
Apache 2.0 license
A production-ready architecture, based on Qwen3-VL-4B-Instruct

Despite its capabilities, eager-embed-v1 remains one of the smallest models in its class and was trained with just $500 in compute, as part of a broader AI cost optimization effort.

Open Source AI Model for Production RAG Pipelines

As part of our broader open source machine learning initiatives, we’ve made the model fully open-source and available to the community.

Explore the model: https://huggingface.co/eagerworks/eager-embed-v1

Check the Training Code: https://github.com/eagerworks/eager-embed

Benchmark Results: Top-10 Multimodal Retrieval in ViDoRe2

When we released eager-embed-v1, it didn’t just perform well. It entered the global Top 10 of the ViDoRe2 multimodal retrieval benchmark (December 2025).

This places Eagerworks alongside the most advanced AI research teams and technology providers in the world.

In multilingual retrieval scenarios, eager-embed-v1 delivers 674% higher performance than OpenAI’s Clip Vit Base (56.4 vs 8.3 Average score) and 80% higher than Google’s Siglip model (56.4 vs 31.4 Average score), and 5% better performance than Colpali v1.3, while maintaining a dramatically more efficient single-vector architecture.

What makes this especially relevant for production teams is that eager-embed-v1:

It is fully open source
It is commercially usable
It exposes architectural decisions transparently
It was trained using real-world documents
It is already running in production systems

This makes eager-embed-v1 one of the most capable open-source embedding models for RAG available today, combining enterprise-grade multimodal retrieval performance with architectural simplicity and cost efficiency.

Production Validation: Running at Enterprise Scale

The Top-10 global ranking in ViDoRe2 validates the approach. But the real validation is this:

It runs in production.
It handles hundreds of millions of embeddings.
It competes globally.
It was built under real-world product constraints.

This is what product-driven AI engineering looks like.

For us, this ranking is not about chasing leaderboards. It validates our core philosophy: build AI systems that work in the real world, at scale, under constraints, with ownership.

Getting Started: Using eager-embed-v1 in Your RAG Pipeline

We recommend using quantization for production deployments to further reduce memory usage and improve inference speed.

The model integrates cleanly into existing RAG pipelines and can be used for:

Document search
Multimodal retrieval
Duplicate detection
Enterprise knowledge systems

Now, let’s look at a minimal example of how to load the model and extract embeddings.

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE
).to(DEVICE).eval()

# Function to Encode Message
def encode_message(message):
    with torch.no_grad():
        texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)

        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)

        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

        last_hidden_state = model_outputs.hidden_states[-1]
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings

When Architecture Beats Hype in Enterprise RAG Systems

This project proves something important:

You don’t need billion-dollar budgets to compete globally.
You need architectural clarity.

eager-embed-v1 was trained with $500. It ranks in the global Top 10. It runs in production.

Because we optimized for the right constraints.

At Eagerworks, we don’t build AI experiments. We build AI systems that scale.

And when architecture matters more than hype, that difference shows.

If you're designing large-scale RAG systems, multimodal retrieval pipelines, or enterprise search architectures, and you’re looking for an ai development company, ai software development, or machine learning consulting partner for custom ai development, get in touch! We’d love to help.

Technical Overview: Architecture and Training of eager-embed-v1

High-Level Multimodal Embedding Architecture

eager-embed-v1 is built on top of a Vision-Language Model (VLM) and trained using contrastive learning with in-batch negatives.

The core design goal was simple but restrictive: encode visually rich documents (text, layout, images) into a single fixed-size vector, without incurring multi-vector storage or retrieval overhead.

At a high level:

Inputs can be text, images, or mixed visual documents (PDF pages, slides, screenshots)
Images are encoded through the vision encoder
Visual tokens are fused with text tokens inside the language model
Embeddings are extracted from the final hidden layer
The last token representation is selected
The vector is L2-normalized

Rather than relying on CoIBERT-like multi-vector architectures, we deliberately produce one dense vector per document chunk or query.

The runtime story becomes extremely simple: One forward pass. One vector. One cosine similarity computation.

Single-Vector vs Multi-Vector RAG Embeddings

Compared to multi-vector architectures (such as ColBERT-like approaches), eager-embed-v1 strikes a strong balance between embedding dimensionality and retrieval accuracy while maintaining operational efficiency.

Unlike those approaches, it does not require max-sim distance functions, simplifying RAG architecture, RAG system design, and overall Machine Learning system design.

Multi-vector approaches can achieve strong accuracy, but they introduce two non-negotiable costs at scale:

Storage grows linearly with sequence length
Retrieval requires complex max-sim or late-interaction scoring
Query-time similarity computations multiply

At tens or hundreds of millions of documents, this becomes operationally prohibitive.

Pure text embeddings sit on the opposite extreme: cheap and fast, but blind to images, layout, charts, and structure.

eager-embed-v1 deliberately targets the middle of this trade-off.

It generates a single dense vector per document chunk that captures both textual and visual information within the same representation space.

Extracting Dense Embeddings for Semantic Search and RAG

Architecturally, eager-embed-v1 encodes text, images, or mixed visual documents (PDF pages, slides, screenshots) into a single dense vector.

Images are processed through the vision encoder and fused with text tokens inside the language model. After full cross-modal fusion, embeddings are extracted directly from the final hidden layer of the decoder.

Concretely:

output_hidden_states=True is enabled
The final-layer hidden states are retrieved
The last token representation (after full multimodal context) is selected
The vector is L2-normalized

Unlike traditional embedding models that rely on an additional projection head, eager-embed-v1 extracts embeddings directly from the model’s internal representations.

This design has several advantages:

No extra parameters
Fully fused multimodal signal
Minimal architectural overhead
Transparent and simple runtime behavior

Fine-Tuning for Dense Retrieval: Training an Embedding Model for RAG

We fine-tuned the model using the Tevatron framework, which is purpose-built for dense retrieval systems.

Tevatron was a good fit because:

It natively supports contrastive learning with in-batch negatives.
It is designed for retrieval models, not generic instruction tuning.
It integrates cleanly with DeepSpeed for multi-GPU training.

Training was done with a standard contrastive objective:

Queries and documents are encoded independently.
Similarity is computed via dot product (cosine after normalization).
The model is optimized to maximize similarity for positive pairs and minimize it for all other items in the batch.

This setup closely mirrors how the model is used at inference time, which helped reduce train–test mismatch.

Retrieval-Oriented Dataset Construction for Multimodal RAG

The training data was assembled with a single explicit goal: to optimize retrieval behavior rather than generic multimodal understanding.

Most of the dataset was designed under the same retrieval-first assumptions later reflected in the multimodal extensions of MTEB. In practice, this meant prioritizing query–document style supervision over captioning or instruction-style data.

Each training example follows a retrieval-oriented structure:

A query, written in natural language and often intentionally underspecified.
A positive document, which can be plain text, an image, or a visually rich document page.
In-batch negatives, which act as hard negatives at scale.

Rather than relying on a single source, the dataset mixes several concrete datasets that are commonly used to evaluate multimodal and document retrieval, especially those later incorporated into MTEB and ViDoRe-style benchmarks.

Concretely, training draws from:

ViDoRe-style document retrieval datasets, where queries are matched against full document pages or screenshots rather than cropped regions.
DocVQA-style datasets, using pages as the unit of retrieval instead of bounding-box answers.
InfographicVQA and ChartQA-like data, to force the model to rely on layout, charts, and visual structure rather than surface text alone.
Slide and presentation retrieval datasets, where queries target visually dense slides with mixed text and graphics.
Multilingual document retrieval datasets aligned with MTEB, where queries and documents may be in different languages.
Synthetic query generation over document collections, used to increase coverage and control query difficulty while keeping supervision retrieval-oriented.

All datasets were adapted to a consistent query–document format so that training conditions closely match both benchmark evaluation and production retrieval workloads.

A deliberate choice was to keep documents as documents. We avoided aggressively decomposing inputs into small regions or patches, even when visual grounding datasets made that tempting. This better matches real-world retrieval, where the unit of retrieval is usually a page or slide, not a bounding box.

One important lesson from this setup is that visual negatives matter as much as textual ones. Without them, models default to text-only behavior. When negatives are only text-based, the model quickly learns to rely on text tokens and ignore layout or imagery. Including visually similar but semantically different documents forces the model to encode charts, tables, and overall structure more faithfully.

Overall, the dataset is less about raw scale and more about alignment between training data, evaluation benchmarks, and production retrieval workloads.

Why Qwen3-VL Fits Modern AI System Architecture

We evaluated several vision-language models before settling on Qwen3-VL-4B-Instruct, and the choice was driven by very practical considerations rather than benchmark chasing.

Qwen3-VL stood out for a few reasons:

Strong document understanding: It handles tables, charts, screenshots, and mixed layouts unusually well compared to other open VLMs.
Long-context support: This matters for long documents.
Stable multimodal fusion: Visual tokens have a measurable impact on the final representations, which matters when embeddings are taken after full multimodal fusion.
Permissive license (Apache 2.0): This was a hard requirement for commercial usage.
Solid engineering foundations: Flash Attention 2 support, a clean Hugging Face integration, and predictable memory behavior, which made training and inference easier to reason about.

In practice, Qwen3-VL allowed us to treat document understanding as a first-class problem, rather than bolting OCR + heuristics on top of a text-only embedding model.

Training Under AI Infrastructure and Cost Constraints ($500 Compute)

One of the goals of this project was to prove that state-of-the-art embeddings do not require massive budgets.

Initial experiments and validation were done on a single RTX 3090, which allowed us to iterate quickly on:

Embedding extraction strategy.
Padding and attention behavior.
Dataset balance.

Final training was run on:

8× RTX 5090 GPUs (256 GB of VRAM total)
AMD EPYC 9534 64-Core CPU (128 threads)
256 GB RAM
DeepSpeed ZeRO optimizations

Console output from a Tevatron-based training run displaying loss values, gradient norms, learning rate schedule, and checkpoint progress while training a multimodal embedding model for RAG

NVIDIA-SMI output displaying an 8x NVIDIA GeForce RTX 5090 training setup, with GPU memory usage, power consumption, temperatures, and CUDA driver information used for training a multimodal embedding model.

The total compute cost, including failed runs and ablations, was approximately $500.

This was only possible because we:

Choose a 4B-parameter model, not a 30B+.
Used dense single-vector embeddings.
Avoided multi-stage or multi-tower architectures.
Kept the runtime pipeline simple.

Engineering Challenges in Production Machine Learning Infrastructure

During training, we encountered and resolved several low-level challenges, including padding strategies when using Flash Attention, and optimized the setup to run efficiently on limited hardware, an important part of AI infrastructure cost and machine learning cost management.

Some multimodal pipelines default to left padding. While harmless for generation, it silently degrades embedding quality.

The fix was simple but critical:

Enforce right padding consistently.
Ensure attention masks align with embedding extraction positions.

This is the kind of issue that doesn’t crash training; it silently degrades embedding quality on inference.

Technical Specifications:

Type: single-vector (dense)
Modalities: image and text
Embedding dimension: 2560
Max tokens: 256K
Parameters: 4B
Memory usage: 8465 MB
Languages: 32 supported languages
Attention: Flash Attention 2
Training: contrastive learning
License: Apache 2.0

Diagram of eager-embed-v1 architecture converting text or image inputs into a single dense embedding vector using a vision encoder and Qwen3-VL model.

Stay updated!

Juan Pablo Balarini

March 11, 2026