
<h1 align="center">FlagEmbedding</h1> <p align="center"> <a href="https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d"> <img alt="Build" src="https://img.shields.io/badge/BGE_series-🤗-yellow"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding"> <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE"> <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"> </a> <a href="https://huggingface.co/C-MTEB"> <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow"> </a> <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding"> <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red"> </a> </p> <h4 align="center"> <p> <a href=#news>News</a> | <a href=#installation>Installation</a> | <a href=#quick-start>Quick Start</a> | <a href=#community>Community</a> | <a href="#projects">Projects</a> | <a href=#model-list>Model List</a> | <a href="#contributor">Contributor</a> | <a href="#citation">Citation</a> | <a href="#license">License</a> <p> </h4>

English | 中文

FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the projects described in the Projects section below.

News


Installation

Using pip:

pip install -U FlagEmbedding

Install from source:

Clone the repository and install:

git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .

For development in editable mode:

pip install -e .

Quick Start

First, load one of the BGE embedding models:

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

Then, feed some sentences to the model and get their embeddings:

sentences_1 = ["I love NLP", "I love machine learning"]
sentences_2 = ["I love BGE", "I love text retrieval"]
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)

Once we have the embeddings, we can compute their similarity via inner product:

similarity = embeddings_1 @ embeddings_2.T
print(similarity)
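
For retrieval tasks, queries are usually encoded with encode_queries, which prepends the query_instruction_for_retrieval given above, while passages are encoded with plain encode. A minimal sketch (the query and passage strings below are illustrative):

queries = ["What is BGE?"]
passages = ["BGE is a family of general embedding models released by BAAI.",
            "Paris is the capital of France."]

# encode_queries adds the instruction prefix to each query; passages are encoded as-is
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)

scores = q_embeddings @ p_embeddings.T
print(scores)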

Community

We are actively maintaining the community of BGE and FlagEmbedding. Let us know if you have any suggestions or ideas!

We are currently updating the tutorials. Our aim is to create a comprehensive and detailed tutorial for beginners on text retrieval and RAG. Stay tuned!

The following content will be released in the upcoming weeks:

<details> <summary>The whole tutorial roadmap</summary> <img src="./Tutorials/tutorial_map.png"/> </details>

Projects

BGE-M3 (Paper, Code)

In this project, we introduce BGE-M3, the first embedding model which supports:

- Multi-Functionality: dense retrieval, sparse retrieval, and multi-vector (ColBERT-style) retrieval within a single model.
- Multi-Linguality: support for many working languages.
- Multi-Granularity: inputs of different granularities, up to 8192 tokens.

The training code and fine-tuning data will be open-sourced in the near future.
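
As a sketch of how the three retrieval modes can be accessed, assuming the BGEM3FlagModel interface and the output keys (dense_vecs, lexical_weights, colbert_vecs) used in the BGE-M3 documentation:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # fp16 speeds up inference with a small accuracy trade-off

sentences = ["What is BGE M3?", "BGE M3 is a multilingual, multi-granularity embedding model."]

# Request dense, sparse (lexical), and multi-vector (ColBERT) outputs in one call
output = model.encode(sentences,
                      return_dense=True,
                      return_sparse=True,
                      return_colbert_vecs=True)

dense_vecs = output['dense_vecs']            # dense embeddings for standard similarity search
lexical_weights = output['lexical_weights']  # per-token weights for sparse retrieval
colbert_vecs = output['colbert_vecs']        # per-token vectors for multi-vector retrieval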

Visualized-BGE

In this project, we introduce Visualized-BGE, which integrates image token embeddings into the BGE text embedding framework. Visualized-BGE can be used for various hybrid-modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid-modal retrieval tasks. It can also serve as a base model for downstream fine-tuning on such tasks.

LongLLM QLoRA

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts.

Activation Beacon

The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Activation Beacon condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context with a limited context window. It is an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. For more details, please refer to our paper and code.

LM-Cocktail

LM-Cocktail automatically merges fine-tuned models and the base model using a simple function to compute merging weights. LM-Cocktail can be used to improve performance on a target domain without decreasing general capabilities beyond that domain, as well as to generate a model for new tasks without fine-tuning. You can use it to merge LLMs (e.g., Llama) or embedding models. For more details, please refer to our report: LM-Cocktail and code.
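
For illustration, a minimal sketch of merging two models at fixed weights, assuming the mix_models helper from the LM_Cocktail package; the fine-tuned model path, weights, and output path below are placeholders:

from LM_Cocktail import mix_models

# Merge a base embedding model with a fine-tuned one at fixed weights.
# Replace the fine-tuned path with your own checkpoint.
mixed_model = mix_models(
    model_names_or_paths=["BAAI/bge-base-en-v1.5", "path/to/your-fine-tuned-bge"],
    model_type='encoder',     # use 'decoder' to merge LLMs such as Llama
    weights=[0.5, 0.5],       # merging weights for the two models
    output_path='./mixed_embedding_model')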

LLM Embedder

LLM Embedder is fine-tuned based on feedback from LLMs. It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation, Long-Range Language Modeling, In-Context Learning, and Tool Learning. For more details, please refer to the report and ./FlagEmbedding/llm_embedder/README.md.
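
A brief sketch of task-aware retrieval with llm-embedder, assuming the LLMEmbedder class and its encode_queries / encode_keys helpers described in the llm_embedder README; the task name "qa" and the example texts are illustrative assumptions:

from FlagEmbedding import LLMEmbedder

model = LLMEmbedder('BAAI/llm-embedder', use_fp16=False)

queries = ["Who wrote 'Pride and Prejudice'?"]
keys = ["Jane Austen wrote the novel Pride and Prejudice.",
        "The Eiffel Tower is located in Paris."]

# Each retrieval task uses its own instruction pair; "qa" targets knowledge retrieval
task = "qa"
query_embeddings = model.encode_queries(queries, task=task)
key_embeddings = model.encode_keys(keys, task=task)

similarity = query_embeddings @ key_embeddings.T
print(similarity)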

BGE Reranker

A cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.
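
For example, re-ranking candidate passages with the FlagReranker interface from this repository (a sketch; the query and passages are illustrative):

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)  # fp16 for faster inference

query = "What is a panda?"
passages = ["The giant panda is a bear species endemic to China.",
            "Paris is the capital of France."]

# The reranker scores each (query, passage) pair directly with full attention
scores = reranker.compute_score([[query, p] for p in passages])
print(scores)  # higher score means more relevant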

LLM Reranker

We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/llm_reranker/README.md.
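
A sketch of using the layerwise variant, assuming the LayerWiseFlagLLMReranker class and its cutoff_layers argument described in the llm_reranker README:

from FlagEmbedding import LayerWiseFlagLLMReranker

reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise', use_fp16=True)

# cutoff_layers selects which layer(s) produce the relevance score,
# trading a little accuracy for faster inference
score = reranker.compute_score(
    ['What is a panda?', 'The giant panda is a bear species endemic to China.'],
    cutoff_layers=[28])
print(score)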

BGE Embedding

BGE embedding is a general embedding model. We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned. Refer to our report: c-pack and code for more details.

BGE uses the last hidden state of [cls] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance.
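
If you load the model directly with Hugging Face Transformers instead of FlagModel, the [CLS] pooling (plus the usual normalization so that inner product equals cosine similarity) looks roughly like this:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')
model.eval()

sentences = ["I love NLP", "I love machine learning"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**inputs)
    # [CLS] pooling: take the last hidden state of the first token
    sentence_embeddings = model_output[0][:, 0]
    # Normalize so that inner product equals cosine similarity
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

print(sentence_embeddings.shape)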

C-MTEB

A benchmark for Chinese text embedding. This benchmark has been merged into MTEB. Refer to our report: c-pack and code for more details.
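
As a sketch, evaluating an embedding model on a C-MTEB task through the mteb package might look like the following, assuming the classic MTEB interface and using T2Retrieval as an example task name:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh-v1.5")

# T2Retrieval is one of the Chinese retrieval tasks included in C-MTEB
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model, output_folder="results/bge-base-zh-v1.5")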

Model List

bge is short for BAAI general embedding.

| Model | Language | | Description | query instruction for retrieval |
|:---|:---|:---|:---|:---|
| BAAI/bge-en-icl | English | | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on a few-shot examples | Provide instructions and few-shot examples freely based on the given task. |
| BAAI/bge-multilingual-gemma2 | Multilingual | - | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
| BAAI/bge-m3 | Multilingual | Inference / Fine-tune | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), Multi-Linguality, and Multi-Granularity (8192 tokens) | |
| LM-Cocktail | English | | Fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail | |
| BAAI/llm-embedder | English | Inference / Fine-tune | A unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
| BAAI/bge-reranker-v2-m3 | Multilingual | Inference / Fine-tune | A lightweight cross-encoder model with strong multilingual capabilities; easy to deploy, with fast inference | |
| BAAI/bge-reranker-v2-gemma | Multilingual | Inference / Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English proficiency and multilingual capabilities | |
| BAAI/bge-reranker-v2-minicpm-layerwise | Multilingual | Inference / Fine-tune | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers for output, facilitating accelerated inference | |
| BAAI/bge-reranker-v2.5-gemma2-lightweight | Multilingual | Inference | A cross-encoder model suitable for multilingual contexts; performs well in both English and Chinese; allows freedom to select layers, compress ratio, and compress layers for output, facilitating accelerated inference | |
| BAAI/bge-reranker-large | Chinese and English | Inference / Fine-tune | A cross-encoder model which is more accurate but less efficient | |
| BAAI/bge-reranker-base | Chinese and English | Inference / Fine-tune | A cross-encoder model which is more accurate but less efficient | |
| BAAI/bge-large-en-v1.5 | English | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
| BAAI/bge-base-en-v1.5 | English | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
| BAAI/bge-small-en-v1.5 | English | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
| BAAI/bge-large-zh-v1.5 | Chinese | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章: |
| BAAI/bge-base-zh-v1.5 | Chinese | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章: |
| BAAI/bge-small-zh-v1.5 | Chinese | Inference / Fine-tune | Version 1.5 with more reasonable similarity distribution | 为这个句子生成表示以用于检索相关文章: |
| BAAI/bge-large-en | English | Inference / Fine-tune | Embedding model which maps text into a vector | Represent this sentence for searching relevant passages: |
| BAAI/bge-base-en | English | Inference / Fine-tune | A base-scale model with ability similar to bge-large-en | Represent this sentence for searching relevant passages: |
| BAAI/bge-small-en | English | Inference / Fine-tune | A small-scale model with competitive performance | Represent this sentence for searching relevant passages: |
| BAAI/bge-large-zh | Chinese | Inference / Fine-tune | Embedding model which maps text into a vector | 为这个句子生成表示以用于检索相关文章: |
| BAAI/bge-base-zh | Chinese | Inference / Fine-tune | A base-scale model with ability similar to bge-large-zh | 为这个句子生成表示以用于检索相关文章: |
| BAAI/bge-small-zh | Chinese | Inference / Fine-tune | A small-scale model with competitive performance | 为这个句子生成表示以用于检索相关文章: |

Contributors

We thank all our contributors for their efforts and warmly welcome new members to join us!

<a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors"> <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" /> </a>

Citation

If you find this repository useful, please consider giving it a star :star: and a citation.

@misc{bge_m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{cocktail,
      title={LM-Cocktail: Resilient Tuning of Language Models via Model Merging}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing},
      year={2023},
      eprint={2311.13534},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{llm_embedder,
      title={Retrieve Anything To Augment Large Language Models}, 
      author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
      year={2023},
      eprint={2310.07554},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

FlagEmbedding is licensed under the MIT License.