ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

## 🔥 News

- 🎉 We have released the evaluation code, allowing you to customize your own evaluation pipeline.
- 🎉 Our framework integrates various embedding models, enabling you to build your own retriever.
- 🎉 We have released the ViDoSeek dataset, which is designed for Retrieval-Augmented Generation over large collections of visually rich documents.
## 🚀 Overview

- We introduce ViDoSeek, a benchmark specifically designed for retrieval-reason-answer tasks over visually rich documents, fully suited for evaluating RAG within a large document corpus.
- We propose ViDoRAG, a novel RAG framework that uses a multi-agent, actor-critic paradigm for iterative reasoning, improving the noise robustness of generation models.
- We introduce a GMM-based multi-modal hybrid retrieval strategy to effectively integrate the visual and textual pipelines.
- Extensive experiments demonstrate the effectiveness of our method: ViDoRAG significantly outperforms strong baselines by over 10%, establishing a new state-of-the-art on ViDoSeek.
## 🔍 ViDoSeek Dataset

We release the ViDoSeek dataset, designed for retrieval-reason-answer tasks over visually rich documents. In ViDoSeek, each query has a unique answer and specific reference pages.

The provided JSON structure includes a unique identifier (`uid`) to distinguish queries, the query content (`query`), a reference answer (`reference_answer`), and metadata (`meta_info`) containing the original file name (`file_name`), reference page numbers (`reference_page`), data source type (`source_type`), and query type (`query_type`):
```json
{
    "uid": "04d8bb0db929110f204723c56e5386c1d8d21587_2",
    "query": "What is the temperature of Steam explosion of Pretreatment for Switchgrass and Sugarcane bagasse preparation?",
    "reference_answer": "195-205 Centigrade",
    "meta_info": {
        "file_name": "Pretreatment_of_Switchgrass.pdf",
        "reference_page": [10, 11],
        "source_type": "Text",
        "query_type": "Multi-Hop"
    }
}
```
You can use Git LFS to download the annotation files and original documents from Hugging Face or ModelScope. For the expected file format, refer to `./data/ExampleDataset`.
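Once downloaded, the annotations can be consumed with plain JSON tooling. The snippet below is a minimal sketch assuming the annotations are stored as a JSON list of records shaped like the example above; the file path is illustrative, not the dataset's actual layout:

```python
import json
from collections import Counter

# Illustrative path; point this at the annotation file you downloaded.
with open("./data/ViDoSeek/annotations.json") as f:
    records = json.load(f)

# Summarize the query mix by query_type and source_type.
print(Counter(r["meta_info"]["query_type"] for r in records))
print(Counter(r["meta_info"]["source_type"] for r in records))
```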
Then, you can use the following script to convert the original files into images:

```bash
python ./scripts/pdf2images.py
```
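Conceptually, this step renders each PDF page as an image. The sketch below shows one way to do this with the pdf2image package (an assumption; the repository script may use a different backend, and the exact output naming is handled by the script itself):

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler to be installed

pdf_path = Path("./data/ExampleDataset/pdf/Pretreatment_of_Switchgrass.pdf")  # illustrative input
out_dir = Path("./data/ExampleDataset/img")
out_dir.mkdir(parents=True, exist_ok=True)

# Render each page to a JPEG named <document>_<page>.jpg.
for page_idx, page in enumerate(convert_from_path(str(pdf_path), dpi=200), start=1):
    page.save(out_dir / f"{pdf_path.stem}_{page_idx}.jpg", "JPEG")
```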
Optionally, you can use OCR models or Vision-Language Models (VLMs) to recognize text within the images:

```bash
## Traditional OCR models
python ./scripts/ocr_triditional.py

## VLMs as OCR models (optional)
python ./scripts/ocr_vlms.py
```
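As a rough illustration of the traditional OCR path, the sketch below runs Tesseract via pytesseract over the rendered page images and collects the text in a single JSON file; the actual scripts may use a different OCR backend and output format:

```python
import json
from pathlib import Path

import pytesseract  # assumes the Tesseract binary is installed
from PIL import Image

img_dir = Path("./data/ExampleDataset/img")
ocr_results = {}
for img_path in sorted(img_dir.glob("*.jpg")):
    # Map each page image to its recognized text.
    ocr_results[img_path.stem] = pytesseract.image_to_string(Image.open(img_path))

Path("./data/ExampleDataset/ocr.json").write_text(json.dumps(ocr_results, ensure_ascii=False, indent=2))
```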
## 💻 Running ViDoRAG

ViDoRAG is a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval.
### Dependencies

```bash
# Create environment
conda create -n vidorag python=3.10
conda activate vidorag

# Clone project
git clone https://github.com/alibaba-nlp/ViDoRAG.git
cd ViDoRAG

# Install requirements
pip install -r requirements.txt
```
Below is a step-by-step guide to help you run the entire framework on your own dataset. You can also use individual modules independently:
### Step 1. Build the Index Database

Our framework is built on the foundation of Llama-Index. We preprocess the corpus in advance and then build an index database.

Before embedding the whole dataset, you can run `./llm/vl_embedding.py` to check whether the embedding model loads correctly:

```bash
python ./llm/vl_embedding.py
```

Then, you can run `ingestion.py` to embed the whole dataset:

```bash
# Document ingestion and multi-modal embedding
python ./ingestion.py
```
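Conceptually, ingestion walks over the page images, embeds each page, and writes one index node per page. The sketch below only illustrates that shape: `embed_page` stands in for whichever embedding call `vl_embedding.py` exposes, and the node layout is an assumption rather than the framework's actual on-disk format:

```python
import json
from pathlib import Path

def ingest(img_dir, node_dir, embed_page):
    """embed_page: any callable that maps an image path to an embedding (list of floats)."""
    node_dir = Path(node_dir)
    node_dir.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(img_dir).glob("*.jpg")):
        # One node per page image, pairing the image path with its embedding.
        node = {"image_path": str(img_path), "embedding": embed_page(img_path)}
        (node_dir / f"{img_path.stem}.node.json").write_text(json.dumps(node))

# Example with a dummy embedder; swap in the real model for actual ingestion.
ingest("./data/ExampleDataset/img", "./data/ExampleDataset/colqwen_ingestion", lambda p: [0.0] * 128)
```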
### Step 2. Run the Multi-Modal Retriever

Try the basic single-modal search engine:

```python
from search_engine import SearchEngine

# Initialize the engine
search_engine = SearchEngine(dataset='ViDoSeek', node_dir_prefix='colqwen_ingestion', embed_model_name='vidore/colqwen2-v1.0')

# Retrieve some results
recall_results = search_engine.search('some query')
```
Try the dynamic single-modal search engine with GMM:

```python
from search_engine import SearchEngine

# Initialize the engine
search_engine = SearchEngine(dataset='ViDoSeek', node_dir_prefix='colqwen_ingestion', embed_model_name='vidore/colqwen2-v1.0')

# Set parameters of the dynamic retriever
search_engine.gmm = True
search_engine.input_gmm = 20  # The default setting is K

# Retrieve some results using dynamic recall
recall_results = search_engine.search('some query')
```
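To give an intuition for what GMM-based dynamic recall is doing, the sketch below fits a two-component Gaussian Mixture over the top-K similarity scores and keeps only the candidates in the higher-mean component. This is an illustrative approximation, not the framework's internal implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dynamic_cutoff(scores, max_candidates=20):
    """Keep only candidates whose scores fall into the high-mean GMM component."""
    scores = np.asarray(scores, dtype=float)[:max_candidates]
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    high_component = int(np.argmax(gmm.means_.ravel()))
    return [i for i, label in enumerate(labels) if label == high_component]

# Scores with a clear gap: only the first three candidates survive the cutoff.
print(dynamic_cutoff([0.92, 0.90, 0.88, 0.41, 0.38, 0.36, 0.35]))
```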
Try the dynamic hybrid multi-modal search engine:

```python
from search_engine import HybridSearchEngine

# Initialize the engine
hybrid_search_engine = HybridSearchEngine(dataset='ViDoSeek', embed_model_name_vl='vidore/colqwen2-v0.1', embed_model_name_text='BAAI/bge-m3', gmm=True)

# Retrieve some results using dynamic recall
hybrid_recall_results = hybrid_search_engine.search('some query')
```
Optionally, you can test these features directly in `search_engine.py`.
### Step 3. Run Multi-Agent Generation

You can directly use our script for generation in `vidorag_agents.py`, or integrate it into your own framework:

```python
from llms.llm import LLM
from vidorag_agents import ViDoRAG_Agents

vlm = LLM('qwen-vl-max')
agent = ViDoRAG_Agents(vlm)
answer = agent.run_agent(
    query='Who is Tim?',
    images_path=[
        './data/ExampleDataset/img/00a76e3a9a36255616e2dc14a6eb5dde598b321f_1.jpg',
        './data/ExampleDataset/img/00a76e3a9a36255616e2dc14a6eb5dde598b321f_2.jpg',
    ],
)
print(answer)
```
### Step 4. Run Evaluation

For end-to-end evaluation, we employ an LLM-based assessment:

```bash
python eval.py
    --experiment_type retrieval_infer  ## choose from retrieval_infer / dynamic_hybird_retrieval_infer / vidorag
    --dataset ViDoSeek                 ## dataset folder name
    --embed_model_name_vl              ## VL embedding model name
    --embed_model_name_text            ## text embedding model name
    --embed_model_name                 ## only for single-embedding-model eval, when the VL/text models above are not used
    --generate_vlm                     ## VLM name, e.g., gpt-4o / qwen-vl-max
```
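The LLM-based assessment boils down to asking a judge model whether the predicted answer matches the reference. The sketch below is an illustrative version of that idea, not the exact prompt or interface used by `eval.py`; `call_llm` stands in for whatever text-generation callable you have available:

```python
# Hypothetical judge prompt; eval.py's actual prompt may differ.
JUDGE_PROMPT = (
    "Question: {query}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Does the model answer convey the same meaning as the reference? Answer 'yes' or 'no'."
)

def judge(call_llm, query, reference, prediction):
    """call_llm: any callable that takes a prompt string and returns the model's reply."""
    verdict = call_llm(JUDGE_PROMPT.format(query=query, reference=reference, prediction=prediction))
    return verdict.strip().lower().startswith("yes")
```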
## 📝 Citation

```bibtex
@article{wang2025vidorag,
  title={ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents},
  author={Wang, Qiuchen and Ding, Ruixue and Chen, Zehui and Wu, Weiqi and Wang, Shihang and Xie, Pengjun and Zhao, Feng},
  journal={arXiv preprint arXiv:2502.18017},
  year={2025}
}
```