<!-- * @Description: * @Author: shenlei * @Modified: linhui * @Date: 2023-12-19 10:31:41 * @LastEditTime: 2024-05-13 17:05:35 * @LastEditors: shenlei -->
<h1 align="center">BCEmbedding: Bilingual and Crosslingual Embedding for RAG</h1>
<div align="center">
  <a href="./LICENSE"> <img src="https://img.shields.io/badge/license-Apache--2.0-yellow"> </a>
  <a href="https://twitter.com/YDopensource"> <img src="https://img.shields.io/badge/follow-%40YDOpenSource-1DA1F2?logo=twitter&style={style}"> </a>
</div>
<br>
<p align="center"> <strong style="background-color: green;">English</strong> | <a href="./README_zh.md" target="_Self">简体中文</a> </p>
<br>

Bilingual and Crosslingual Embedding (BCEmbedding) in English and Chinese, developed by NetEase Youdao, encompasses EmbeddingModel and RerankerModel. The EmbeddingModel specializes in generating semantic vectors, playing a crucial role in semantic search and question-answering, and the RerankerModel excels at refining search results and ranking tasks.

BCEmbedding serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source implementation widely integrated in various Youdao products like Youdao Speed Reading and Youdao Translation.

Distinguished for its bilingual and crosslingual proficiency, BCEmbedding excels in bridging Chinese and English linguistic gaps. The figure below summarizes its RAG evaluation results across multiple domains:

<img src="./Docs/assets/rag_eval_multiple_domains_summary.jpg">

Our Goals

Provide a bilingual and crosslingual two-stage retrieval model repository for the RAG community, comprising EmbeddingModel and RerankerModel, that can be used directly without finetuning.

Third-party Examples

🌐 Bilingual and Crosslingual Superiority

Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. BCEmbedding, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.

EmbeddingModel supports Chinese (ch) and English (en), with support for more languages coming soon, while RerankerModel supports Chinese (ch), English (en), Japanese (ja) and Korean (ko).

💡 Key Features

🚀 Latest Updates

🍎 Model List

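| Model Name | Model Type | Languages | Parameters | Weights |
|---|---|---|---|---|
| bce-embedding-base_v1 | EmbeddingModel | ch, en | 279M | Huggingface, China mainland mirror |
| bce-reranker-base_v1 | RerankerModel | ch, en, ja, ko | 279M | Huggingface, China mainland mirror |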

📖 Manual

Installation

First, create a conda environment and activate it.

conda create --name bce python=3.10 -y
conda activate bce

Then install BCEmbedding with a minimal installation. To avoid CUDA version conflicts, first manually install a torch build that is compatible with your system's CUDA version:

pip install BCEmbedding==0.1.5

Or install from source (recommended):

git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .

Quick Start

1. Based on BCEmbedding

Use EmbeddingModel; the cls pooler is the default.

from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
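
Once sentences are encoded, semantic search comes down to comparing vectors. The following is a minimal sketch of ranking passages by cosine similarity. It uses plain Python lists as stand-ins for the vectors returned by model.encode, and the helper names cosine_similarity and rank_by_cosine are hypothetical, not part of BCEmbedding.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, passage_vecs):
    """Return (passage_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy 2-D vectors stand in for real embeddings (assumption for illustration).
toy_query = [1.0, 0.0]
toy_passages = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranking = rank_by_cosine(toy_query, toy_passages)
```

If the embeddings are already L2-normalized, the cosine similarity equals the plain dot product, which is why inner-product search is a common choice for normalized vectors.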

Use RerankerModel to calculate relevance scores and rerank:

from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)
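
The two methods are related: reranking amounts to ordering passages by their relevance scores. The sketch below illustrates that idea only and is not the library's implementation; sort_by_score is a hypothetical helper, and the scores are made-up stand-ins for the output of compute_score.

```python
def sort_by_score(passages, scores, top_n=None):
    """Pair each passage with its score and sort by descending score."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]

# Made-up scores stand in for model.compute_score(sentence_pairs).
ranked = sort_by_score(
    ['passage_0', 'passage_1', 'passage_2'],
    [0.12, 0.87, 0.45],
    top_n=2,
)
```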

NOTE:

2. Based on transformers

For EmbeddingModel:

from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
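
The last two lines above apply cls pooling followed by L2 normalization. As a hedged sketch of what these steps compute, and how cls pooling differs from the mean pooling used by some other models, the example below works on a single sequence with plain Python lists in place of tensors. The function names are hypothetical.

```python
import math

def cls_pool(token_vectors):
    """cls pooling: use the first token's vector as the sentence embedding."""
    return list(token_vectors[0])

def mean_pool(token_vectors, attention_mask):
    """mean pooling: average the vectors of real (non-padding) tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence: 3 token positions (the last is padding), hidden size 2.
hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, mask)
unit_vec = l2_normalize(mean_vec)
```

The two poolers generally give different vectors for the same input, so a model should be used with the pooler it was trained with.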

For RerankerModel:

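import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]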

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
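
The final sigmoid maps each raw logit into the interval (0, 1). Because the function is strictly increasing, it does not change the ranking of passages; it only puts scores on a bounded scale, which makes thresholds easier to choose. A minimal sketch with made-up logits:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits stand in for the model's raw outputs.
logits = [-2.0, 0.0, 3.0]
probs = [sigmoid(x) for x in logits]

# Ranking by logits and ranking by sigmoid scores give the same order.
order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```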

3. Based on sentence_transformers

For EmbeddingModel:

from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
# The model was updated for sentence-transformers. To download the new version, first delete the cached copy at "`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1" or "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1".
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)

For RerankerModel:

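from sentence_transformers import CrossEncoder

# query and passages, paired as in the RerankerModel example above
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]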

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)

Embedding and Reranker Integrations for RAG Frameworks

1. Used in langchain

We provide BCERerank in BCEmbedding.tools.langchain, which inherits the advanced preprocessing and tokenization of RerankerModel.

pip install langchain==0.1.0
pip install langchain-community==0.0.9
pip install langchain-core==0.1.7
pip install langsmith==0.0.77

# We provide advanced preprocessing and tokenization for reranking.
from BCEmbedding.tools.langchain import BCERerank

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS

from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain.retrievers import ContextualCompressionRetriever


# init embedding model
embedding_model_name = 'maidalun1020/bce-embedding-base_v1'
embedding_model_kwargs = {'device': 'cuda:0'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
  model_name=embedding_model_name,
  model_kwargs=embedding_model_kwargs,
  encode_kwargs=embedding_encode_kwargs
)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker = BCERerank(**reranker_args)

# init documents
documents = PyPDFLoader("BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# example 1. retrieval with embedding and reranker
retriever = FAISS.from_documents(texts, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.3, "k": 10})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=retriever
)

response = compression_retriever.get_relevant_documents("What is Llama 2?")
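
Conceptually, the pipeline above is a two-stage retrieval: the embedding-based retriever selects k candidates, and the reranker keeps the top_n best of them. The following sketch shows that control flow in plain Python. It is not how langchain implements it; two_stage_retrieve and the toy word-overlap scorers are hypothetical stand-ins for the embedding model and the reranker.

```python
def two_stage_retrieve(query, passages, embed_score, rerank_score, k=10, top_n=5):
    """Stage 1 keeps the k best passages by a cheap score; stage 2 reranks them."""
    stage1 = sorted(passages, key=lambda p: embed_score(query, p), reverse=True)[:k]
    stage2 = sorted(stage1, key=lambda p: rerank_score(query, p), reverse=True)[:top_n]
    return stage2

# Toy scorers (assumptions for illustration only).
def overlap(query, passage):
    """Number of distinct words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def overlap_ratio(query, passage):
    """Shared words divided by passage length, favoring focused passages."""
    return overlap(query, passage) / len(passage.lower().split())

docs = ["llama 2 is a language model", "apples are fruits", "llama 2"]
result = two_stage_retrieve("what is llama 2", docs, overlap, overlap_ratio, k=2, top_n=1)
```

The design rationale is cost: the first stage is cheap enough to run over every passage, while the second stage scores each query-passage pair jointly and is therefore applied only to the short candidate list.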

2. Used in llama_index

We provide BCERerank in BCEmbedding.tools.llama_index, which inherits the advanced preprocessing and tokenization of RerankerModel.

pip install llama-index==0.9.42.post2

# We provide advanced preprocessing and tokenization for reranking.
from BCEmbedding.tools.llama_index import BCERerank

import os
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
from llama_index.retrievers import VectorIndexRetriever

# init embedding model and reranker model
embed_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 32, 'device': 'cuda:0'}
embed_model = HuggingFaceEmbedding(**embed_args)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker_model = BCERerank(**reranker_args)

# example #1. extract embeddings
query = 'apples'
passages = [
        'I like apples', 
        'I like oranges', 
        'Apples and oranges are fruits'
    ]
query_embedding = embed_model.get_query_embedding(query)
passages_embeddings = embed_model.get_text_embedding_batch(passages)

# example #2. rag example
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=400, chunk_overlap=80)
nodes = node_parser.get_nodes_from_documents(documents[0:36])
index = VectorStoreIndex(nodes, service_context=service_context)

query = "What is Llama 2?"

# example #2.1. retrieval with EmbeddingModel and RerankerModel
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=10, service_context=service_context)
retrieval_by_embedding = vector_retriever.retrieve(query)
retrieval_by_reranker = reranker_model.postprocess_nodes(retrieval_by_embedding, query_str=query)

# example #2.2. query with EmbeddingModel and RerankerModel
query_engine = index.as_query_engine(node_postprocessors=[reranker_model])
query_response = query_engine.query(query)

⚙️ Evaluation

Evaluate Semantic Representation by MTEB

We provide evaluation tools for embedding and reranker models, based on MTEB and C_MTEB.

First, install MTEB:

pip install mteb==1.1.1

1. Embedding Models

Run the following command to evaluate your_embedding_model (e.g. maidalun1020/bce-embedding-base_v1) in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]).

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls

The total evaluation tasks contain 114 datasets of "Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering".

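NOTE:

For models that use a mean pooler, pass "--pooler mean":

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {mean_pooler_models} --pooler mean

Some models, such as jinaai/jina-embeddings-v2-base-en, also need the "--trust_remote_code" flag:

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code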

2. Reranker Models

Run the following command to evaluate your_reranker_model (e.g. "maidalun1020/bce-reranker-base_v1") in bilingual and crosslingual settings (e.g. ["en", "zh", "en-zh", "zh-en"]).

python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1

The evaluation tasks contain 12 datasets of "Reranking".

3. Metrics Visualization Tool

We provide a one-click script to summarize evaluation results of embedding and reranker models as Embedding Models Evaluation Summary and Reranker Models Evaluation Summary.

python BCEmbedding/tools/eval_mteb/summarize_eval_results.py --results_dir {your_embedding_results_dir | your_reranker_results_dir}

Evaluate RAG by LlamaIndex

LlamaIndex is a well-known data framework for LLM-based applications, particularly RAG. A LlamaIndex Blog post evaluated popular embedding and reranker models in a RAG pipeline and attracted considerable attention. We follow its pipeline to evaluate BCEmbedding.

First, install LlamaIndex, and upgrade transformers to 4.36.0:

pip install transformers==4.36.0

pip install llama-index==0.9.22

Export your OpenAI and Cohere API keys and the OpenAI base URL (e.g. "https://api.openai.com/v1") as environment variables:

export OPENAI_BASE_URL={openai_base_url}  # https://api.openai.com/v1
export OPENAI_API_KEY={your_openai_api_key}
export COHERE_APPKEY={your_cohere_api_key}
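
Before starting a long evaluation run, it can help to confirm that these variables are actually set. The snippet below is a small sketch of such a check; missing_env_vars is a hypothetical helper and is not part of the evaluation scripts.

```python
import os

REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "COHERE_APPKEY")

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
example_missing = missing_env_vars({"OPENAI_API_KEY": "sk-placeholder"})
```

Calling missing_env_vars() with no arguments checks the real environment; an empty list means all three variables are present.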

1. Metrics Definition
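
As a hedged illustration, assuming the metrics in question are hit rate and mean reciprocal rank (MRR), the two retrieval metrics reported in the LlamaIndex Blog evaluation, they can be computed as sketched below. The function names and the one-relevant-document-per-query setup are assumptions made for this example.

```python
def hit_rate(ranked_ids_per_query, relevant_id_per_query, k=10):
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)

def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(relevant_id_per_query)

# Three toy queries: relevant doc at rank 1, at rank 2, and not retrieved.
rankings = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d4", "d0"]
hr = hit_rate(rankings, relevant, k=2)
mrr = mean_reciprocal_rank(rankings, relevant)
```

Hit rate only asks whether the relevant document was retrieved at all, while MRR also rewards placing it near the top, so a reranker mainly shows its effect in MRR.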

2. Reproduce LlamaIndex Blog

In order to compare our BCEmbedding with other embedding and reranker models fairly, we provide a one-click script to reproduce results of the LlamaIndex Blog, including our BCEmbedding:

# There should be two GPUs available at least.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_reproduce.py

Then, summarize the evaluation results by:

python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_reproduce_results

Results reproduced from the LlamaIndex Blog can be checked in Reproduced Summary of RAG Evaluation, with some obvious conclusions:

3. Broad Domain Adaptability

The LlamaIndex Blog evaluation is monolingual, uses a small amount of data, and covers a single domain (only the "llama2" paper). To evaluate broad domain adaptability as well as bilingual and crosslingual capability, we follow the blog to build a multi-domain evaluation dataset named CrosslingualMultiDomainsDataset, covering "Computer Science", "Physics", "Biology", "Economics", "Math", and "Quantitative Finance" (see Details).

First, run the following command to evaluate the most popular and powerful embedding and reranker models:

# There should be two GPUs available at least.
CUDA_VISIBLE_DEVICES=0,1 python BCEmbedding/tools/eval_rag/eval_llamaindex_multiple_domains.py

Then, run the following script to summarize the evaluation results:

python BCEmbedding/tools/eval_rag/summarize_eval_results.py --results_dir BCEmbedding/results/rag_results

The summary of multiple domains evaluations can be seen in <a href="#1-multiple-domains-scenarios" target="_Self">Multiple Domains Scenarios</a>.

📈 Leaderboard

Semantic Representation Evaluations in MTEB

1. Embedding Models

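| Model | Dimensions | Pooler | Instructions | Retrieval (47) | STS (19) | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | AVG (119) |
|---|---|---|---|---|---|---|---|---|---|---|
| bge-base-en-v1.5 | 768 | cls | Need | 37.14 | 55.06 | 75.45 | 59.73 | 43.00 | 37.74 | 47.19 |
| bge-base-zh-v1.5 | 768 | cls | Need | 47.63 | 63.72 | 77.40 | 63.38 | 54.95 | 32.56 | 53.62 |
| bge-large-en-v1.5 | 1024 | cls | Need | 37.18 | 54.09 | 75.00 | 59.24 | 42.47 | 37.32 | 46.80 |
| bge-large-zh-v1.5 | 1024 | cls | Need | 47.58 | 64.73 | 79.14 | 64.19 | 55.98 | 33.26 | 54.23 |
| gte-large | 1024 | mean | Free | 36.68 | 55.22 | 74.29 | 57.73 | 42.44 | 38.51 | 46.67 |
| gte-large-zh | 1024 | cls | Free | 41.15 | 64.62 | 77.58 | 62.04 | 55.62 | 33.03 | 51.51 |
| jina-embeddings-v2-base-en | 768 | mean | Free | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
| m3e-base | 768 | mean | Free | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
| m3e-large | 1024 | mean | Free | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
| e5-large-v2 | 1024 | mean | Need | 35.98 | 55.23 | 75.28 | 59.53 | 42.12 | 36.51 | 46.52 |
| multilingual-e5-base | 768 | mean | Need | 54.73 | 65.49 | 76.97 | 69.72 | 55.01 | 38.44 | 58.34 |
| multilingual-e5-large | 1024 | mean | Need | 56.76 | 66.79 | 78.80 | 71.61 | 56.49 | 43.09 | 60.50 |
| bce-embedding-base_v1 | 768 | cls | Free | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |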

NOTE:

2. Reranker Models

| Model | Reranking (12) | AVG (12) |
|---|---|---|
| bge-reranker-base | 59.04 | 59.04 |
| bge-reranker-large | 60.86 | 60.86 |
| bce-reranker-base_v1 | 61.29 | 61.29 |

NOTE:

RAG Evaluations in LlamaIndex

1. Multiple Domains Scenarios

<img src="./Docs/assets/rag_eval_multiple_domains_summary.jpg">

NOTE:

🛠 Youdao's BCEmbedding API

For users who prefer not to download and configure the model on their own systems, BCEmbedding is available through Youdao's API, which avoids manual setup and maintenance. Detailed instructions and API documentation are available at Youdao BCEmbedding API.

🧲 WeChat Group

Scan the QR code below to join the WeChat group.

<img src="./Docs/assets/Wechat.jpg" width="20%" height="auto">

✏️ Citation

If you use BCEmbedding in your research or project, please feel free to cite and star it:

@misc{youdao_bcembedding_2023,
    title={BCEmbedding: Bilingual and Crosslingual Embedding for RAG},
    author={NetEase Youdao, Inc.},
    year={2023},
    howpublished={\url{https://github.com/netease-youdao/BCEmbedding}}
}

🔐 License

BCEmbedding is licensed under the Apache 2.0 License.

🔗 Related Links

Netease Youdao - QAnything

FlagEmbedding

MTEB

C_MTEB

LlamaIndex | LlamaIndex Blog

HuixiangDou