<img src="./imgs/FlagOpen.png">

<h1 align="center">⚡️BGE: One-Stop Retrieval Toolkit For Search and RAG</h1>


<p align="center"> <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d"> <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding"> <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE"> <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"> </a> <a href="https://huggingface.co/C-MTEB"> <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/research/baai_general_embedding"> <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.3.0-red"> </a> </p> <h4 align="center"> <p> <a href=#news>News</a> | <a href=#installation>Installation</a> | <a href=#quick-start>Quick Start</a> | <a href=#community>Community</a> | <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/research">Projects</a> | <a href=#model-list>Model List</a> | <a href="#contributors">Contributor</a> | <a href="#citation">Citation</a> | <a href="#license">License</a> <p> </h4>

English | 中文

BGE (BAAI General Embedding) focuses on retrieval-augmented LLMs and currently consists of the following projects:

<img src="./imgs/projects.png"/>

## News


## Installation

### Using pip:

If you do not want to finetune the models, you can install the package without the finetune dependency:

```shell
pip install -U FlagEmbedding
```

If you want to finetune the models, you can install the package with the finetune dependency:

```shell
pip install -U FlagEmbedding[finetune]
```

### Install from source:

Clone the repository and install:

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
# If you do not want to finetune the models, you can install the package without the finetune dependency:
pip install .
# If you want to finetune the models, you can install the package with the finetune dependency:
# pip install .[finetune]
```

For development in editable mode:

```shell
# If you do not want to finetune the models, you can install the package without the finetune dependency:
pip install -e .
# If you want to finetune the models, you can install the package with the finetune dependency:
# pip install -e .[finetune]
```
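Whichever path you choose, a quick sanity check confirms the package is importable (a minimal sketch; the version and location printed will depend on your environment):

```shell
pip show FlagEmbedding   # prints the installed version and location
python -c "from FlagEmbedding import FlagAutoModel; print('FlagEmbedding imports OK')"
```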

## Quick Start

First, load one of the BGE embedding models:

```python
from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5',
                                     query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                                     use_fp16=True)
```

Then, feed some sentences to the model and get their embeddings:

```python
sentences_1 = ["I love NLP", "I love machine learning"]
sentences_2 = ["I love BGE", "I love text retrieval"]
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
```

Once we have the embeddings, we can compute their similarity by inner product:

```python
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
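BGE v1.5 embeddings are normalized by default in FlagEmbedding, so the inner product above behaves as cosine similarity and can be used to rank passages for a query directly. Below is a minimal retrieval sketch using the `encode_queries`/`encode_corpus` helpers (the example texts are made up; `encode_queries` prepends the retrieval instruction configured above to each query):

```python
import numpy as np

from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5',
                                     query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                                     use_fp16=True)

queries = ["how do embedding models work"]
passages = [
    "Embedding models map text into dense vectors.",
    "BGE stands for BAAI General Embedding.",
    "Paris is the capital of France.",
]

q_emb = model.encode_queries(queries)   # retrieval instruction is prepended to each query
p_emb = model.encode_corpus(passages)   # passages are embedded as-is

scores = q_emb @ p_emb.T                # cosine similarity, since embeddings are normalized
for i in np.argsort(-scores[0])[:2]:    # indices of the two best-matching passages
    print(f"{scores[0][i]:.4f}  {passages[i]}")
```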

For more details, you can refer to embedder inference, reranker inference, embedder finetune, reranker finetune, and evaluation.
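Reranker models are cross-encoders: instead of comparing pre-computed embeddings, they read a (query, passage) pair jointly and output a relevance score, which is slower than embedding comparison but usually more accurate for the final ranking stage. A minimal sketch, assuming the BAAI/bge-reranker-v2-m3 checkpoint (higher score means more relevant):

```python
from FlagEmbedding import FlagReranker

# A cross-encoder scores each (query, passage) pair jointly.
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [
    ["what is BGE", "BGE stands for BAAI General Embedding."],
    ["what is BGE", "Paris is the capital of France."],
]
scores = reranker.compute_score(pairs, normalize=True)  # normalize=True applies a sigmoid, mapping scores into [0, 1]
print(scores)  # the first pair should score higher
```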

If you're unfamiliar with any of the related concepts, please check out the tutorial. If something is missing there, let us know.

For more interesting topics related to BGE, take a look at research.

## Community

We are actively maintaining the community of BGE and FlagEmbedding. Let us know if you have any suggestions or ideas!

We are currently updating the tutorials, aiming to create a comprehensive and detailed tutorial for beginners on text retrieval and RAG. Stay tuned!

The following content will be released in the upcoming weeks:

<details> <summary>The whole tutorial roadmap</summary> <img src="./Tutorials/tutorial_map.png"/> </details>

## Model List

bge is short for BAAI general embedding.

| Model | Language | Description | query instruction for retrieval |
|:------|:---------|:------------|:--------------------------------|
| BAAI/bge-en-icl | English | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on few-shot examples | Provide instructions and few-shot examples freely based on the given task. |
| BAAI/bge-multilingual-gemma2 | Multilingual | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
| BAAI/bge-m3 | Multilingual | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector/ColBERT), Multi-Linguality, and Multi-Granularity (8192 tokens); see the sketch after this table | |
| LM-Cocktail | English | Fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail | |
| BAAI/llm-embedder | English | A unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
| BAAI/bge-reranker-v2-m3 | Multilingual | A lightweight cross-encoder model with strong multilingual capabilities; easy to deploy, with fast inference | |
| BAAI/bge-reranker-v2-gemma | Multilingual | A cross-encoder model suitable for multilingual contexts; performs well in both English proficiency and multilingual capabilities | |
| BAAI/bge-reranker-v2-minicpm-layerwise | Multilingual | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese, and allows freedom to select layers for output, facilitating accelerated inference | |
| BAAI/bge-reranker-v2.5-gemma2-lightweight | Multilingual | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese, and allows freedom to select layers, compress ratio and compress layers for output, facilitating accelerated inference | |
| BAAI/bge-reranker-large | Chinese and English | A cross-encoder model which is more accurate but less efficient | |
| BAAI/bge-reranker-base | Chinese and English | A cross-encoder model which is more accurate but less efficient | |
| BAAI/bge-large-en-v1.5 | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-base-en-v1.5 | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-small-en-v1.5 | English | Version 1.5 with a more reasonable similarity distribution | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-large-zh-v1.5 | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh-v1.5 | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh-v1.5 | Chinese | Version 1.5 with a more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-large-en | English | An embedding model which maps text into a vector | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-base-en | English | A base-scale model with ability similar to bge-large-en | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-small-en | English | A small-scale model with competitive performance | `Represent this sentence for searching relevant passages:` |
| BAAI/bge-large-zh | Chinese | An embedding model which maps text into a vector | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh | Chinese | A base-scale model with ability similar to bge-large-zh | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh | Chinese | A small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
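As referenced in the bge-m3 row above, a single encode call of BGEM3FlagModel can return all three representation types at once. A minimal sketch (the printed shapes assume bge-m3's 1024-dimensional hidden size):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["BGE M3 supports dense, sparse and multi-vector retrieval."]
output = model.encode(sentences,
                      return_dense=True,         # one normalized vector per sentence
                      return_sparse=True,        # token -> lexical weight mapping
                      return_colbert_vecs=True)  # one vector per token (ColBERT-style)

print(output['dense_vecs'].shape)       # (1, 1024)
print(output['lexical_weights'][0])     # e.g. {token_id: weight, ...}
print(output['colbert_vecs'][0].shape)  # (num_tokens, 1024)
```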

## Contributors

Thanks to all our contributors for their efforts, and a warm welcome to new members joining in!

<a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors"> <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" /> </a>

## Citation

If you find this repository useful, please consider giving it a star :star: and a citation.

```bibtex
@misc{bge_m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{cocktail,
  title={LM-Cocktail: Resilient Tuning of Language Models via Model Merging},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing},
  year={2023},
  eprint={2311.13534},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{llm_embedder,
  title={Retrieve Anything To Augment Large Language Models},
  author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
  year={2023},
  eprint={2310.07554},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}

@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

FlagEmbedding is licensed under the MIT License.