<h1 align="center">RAG-Retrieval</h1> <p align="center"> <a href="https://pypi.org/project/rag-retrieval/#description"> <img alt="Build" src="https://img.shields.io/pypi/v/rag-retrieval?color=brightgreen"> </a> <a href="https://www.pepy.tech/projects/rag-retrieval"> <img alt="Build" src="https://static.pepy.tech/personalized-badge/rag-retrieval?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"> </a> <a href="https://github.com/NLPJCL/RAG-Retrieval"> <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue"> </a> <a href="https://github.com/NLPJCL/RAG-Retrieval/blob/master/LICENSE"> <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"> </a> </p>

English | 中文

RAG-Retrieval provides end-to-end code for training, inference, and distillation of RAG retrieval models.

## Community

Join our WeChat group chat

## News

## Features

## Quick Start

### Installation

For training (all):

```bash
conda create -n rag-retrieval python=3.8 && conda activate rag-retrieval
# To avoid a mismatch between the automatically installed torch and your local CUDA,
# it is recommended to manually install a compatible torch version before the next step.
pip install -r requirements.txt
```
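
After manually installing torch, a quick sanity check (plain PyTorch calls, nothing RAG-Retrieval-specific) can confirm that the build matches your local CUDA:

```python
import torch

# Installed torch version, the CUDA version it was built against,
# and whether this build can actually see a GPU.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```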

For prediction (reranker):

```bash
# To avoid a mismatch between the automatically installed torch and your local CUDA,
# it is recommended to manually install a compatible torch version before this step.
pip install rag-retrieval
```

### Training

Each model type lives in its own subdirectory; for example, embedding models are trained as shown below, and the other types work similarly. Detailed procedures can be found in the README file in each subdirectory.

```bash
cd ./rag_retrieval/train/embedding
bash train_embedding.sh
```
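
For the embedding task, the training data is typically a jsonl file of query/positive/negative triplets in the bge style. The sample below is a hypothetical illustration only; the exact schema (field names, number of negatives, optional scores) is defined in the subdirectory's README:

```python
import json

# Hypothetical bge-style training sample; field names are assumptions,
# not the confirmed RAG-Retrieval schema.
sample = {
    "query": "what is dense retrieval",
    "pos": ["Dense retrieval encodes queries and documents into vectors."],
    "neg": ["ColBERT scores query-document pairs with late interaction."],
}

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```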

### Inference

RAG-Retrieval ships a lightweight Python library, rag-retrieval, which provides a unified interface for calling various RAG reranker models.

For detailed usage and caveats of the rag-retrieval package, please refer to the Tutorial.
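
As a rough sketch of what such a unified reranker interface looks like in practice (the `Reranker` class and `compute_score` method below are assumed names for illustration, not confirmed API; see the Tutorial for the real interface):

```python
# Hypothetical usage sketch of the rag-retrieval package;
# `Reranker` and `compute_score` are assumptions, see the Tutorial for the actual API.
from rag_retrieval import Reranker

ranker = Reranker("BAAI/bge-reranker-base")

query = "what is dense retrieval"
docs = [
    "Dense retrieval encodes queries and documents into vectors.",
    "ColBERT scores query-document pairs with late interaction.",
]

# Score each (query, document) pair; higher scores mean more relevant.
scores = ranker.compute_score([[query, doc] for doc in docs])
print(scores)
```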

## Experimental Results

### Results of the reranker model on the MTEB Reranking task

| Model | Model Size (GB) | T2Reranking | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|---|
| bge-reranker-base | 1.11 | 67.28 | 35.46 | 81.27 | 84.10 | 67.03 |
| bce-reranker-base_v1 | 1.11 | 70.25 | 34.13 | 79.64 | 81.31 | 66.33 |
| rag-retrieval-reranker | 0.41 | 67.33 | 31.57 | 83.54 | 86.03 | 67.12 |

Here, rag-retrieval-reranker was trained with the RAG-Retrieval code on top of the hfl/chinese-roberta-wwm-ext model, using the training data of the bge-reranker model.

### Results of the ColBERT model on the MTEB Reranking task

| Model | Model Size (GB) | Dim | T2Reranking | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|---|---|
| bge-m3-colbert | 2.24 | 1024 | 66.82 | 26.71 | 75.88 | 76.83 | 61.56 |
| rag-retrieval-colbert | 0.41 | 1024 | 66.85 | 31.46 | 81.05 | 84.22 | 65.90 |

Here, rag-retrieval-colbert was trained with the RAG-Retrieval code on top of the hfl/chinese-roberta-wwm-ext model, using the training data of the bge-reranker model.

### Fine-tuning the open-source BGE series models with domain data

| Model | T2Reranking |
|---|---|
| bge-v1.5-embedding | 66.49 |
| bge-v1.5-embedding finetune | 67.15 (+0.66) |
| bge-m3-colbert | 66.82 |
| bge-m3-colbert finetune | 67.22 (+0.40) |
| bge-reranker-base | 67.28 |
| bge-reranker-base finetune | 67.57 (+0.29) |

Rows ending in finetune indicate that we used RAG-Retrieval to fine-tune the corresponding open-source model, with the T2-Reranking training set as the training data.

It is worth noting that the training sets of the three open-source bge models already include T2-Reranking, and that data is relatively general, so fine-tuning on it brings only a modest improvement. Fine-tuning an open-source model on a vertical-domain dataset, however, yields a larger gain.

## Star History

[Star History Chart]

## License

RAG-Retrieval is licensed under the MIT License.