
uniem


The goal of the uniem project is to build the best general-purpose text embedding models for Chinese.

This project contains the training, fine-tuning, and evaluation code for the models; the models and datasets are open-sourced on the HuggingFace Hub.

🌟 Important Updates

🔧 Using M3E

The M3E series models are fully compatible with sentence-transformers. By simply swapping in the model name, you can use M3E Models seamlessly in any project that supports sentence-transformers, such as chroma, guidance, and semantic-kernel.

Installation

pip install sentence-transformers

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
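`model.encode` returns one embedding vector per input sentence, and semantic similarity is typically measured as the cosine of the angle between two such vectors. A minimal sketch of that computation, using small stand-in vectors in place of real model output:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors for illustration; in practice these would come
# from model.encode(['Hello World!', '你好,世界!']).
emb_en = [0.1, 0.3, 0.5]
emb_zh = [0.2, 0.4, 0.4]
print(round(cosine_sim(emb_en, emb_zh), 4))  # 0.9578
```

Scores close to 1 indicate semantically similar sentences; scores near 0 indicate unrelated ones.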

🎨 Fine-tuning

uniem provides a very easy-to-use finetune API: adapt a model with just a few lines of code!

from datasets import load_dataset

from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Use m3e-small as the base model for fine-tuning
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
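The STS-B subset used above consists of scored sentence pairs. As a purely illustrative sketch (the field names and the 0–5 score convention are assumptions about a typical STS-B layout, not verified against this dataset or uniem's API), high-scoring pairs are the natural positives for contrastive fine-tuning:

```python
# Hypothetical records mirroring an STS-B-style layout; field names are
# assumptions for illustration, not verified against shibing624/nli_zh.
records = [
    {"sentence1": "一个男人在弹吉他。", "sentence2": "有人在演奏乐器。", "label": 4.0},
    {"sentence1": "一个男人在弹吉他。", "sentence2": "一只猫在睡觉。", "label": 0.0},
]

# Treat pairs scored >= 3 as semantically similar positives.
positives = [r for r in records if r["label"] >= 3.0]
print(len(positives))  # 1
```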

For fine-tuning details, see the uniem fine-tuning tutorial or <a target="_blank" href="https://colab.research.google.com/github/wangyuxinwhy/uniem/blob/main/examples/finetune.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

If you want to run it locally, run the following commands to set up the environment:

conda create -n uniem python=3.10
conda activate uniem
pip install uniem

💯 MTEB-zh

Chinese embedding models lack a unified evaluation benchmark, so we built MTEB-zh, a Chinese benchmark modeled on MTEB. We have run a cross-comparison of 6 models across a variety of datasets; see MTEB-zh for the detailed evaluation methodology and code.

Text Classification

| Dataset | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul | uer | erlangshen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TNews | 0.43 | 0.4443 | 0.4827 | 0.4866 | 0.4594 | 0.3084 | 0.3539 | 0.4361 |
| JDIphone | 0.8214 | 0.8293 | 0.8533 | 0.8692 | 0.746 | 0.7972 | 0.8283 | 0.8356 |
| GubaEastmony | 0.7472 | 0.712 | 0.7621 | 0.7663 | 0.7574 | 0.735 | 0.7534 | 0.7787 |
| TYQSentiment | 0.6099 | 0.6596 | 0.7188 | 0.7247 | 0.68 | 0.6437 | 0.6662 | 0.6444 |
| StockComSentiment | 0.4307 | 0.4291 | 0.4363 | 0.4475 | 0.4819 | 0.4309 | 0.4555 | 0.4482 |
| IFlyTek | 0.414 | 0.4263 | 0.4409 | 0.4445 | 0.4486 | 0.3969 | 0.3762 | 0.4241 |
| Average | 0.5755 | 0.5834 | 0.6157 | 0.6231 | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |
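As a sanity check, each entry in the Average row is the arithmetic mean of the six dataset scores in its column. For example, for m3e-base:

```python
# Per-dataset scores for m3e-base, taken from the classification table above.
scores = [0.4827, 0.8533, 0.7621, 0.7188, 0.4363, 0.4409]
average = sum(scores) / len(scores)
print(round(average, 4))  # 0.6157, matching the Average row
```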

Retrieval Ranking

| Metric | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer | erlangshen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| map@1 | 0.4684 | 0.6133 | 0.5574 | 0.626 | 0.6256 | 0.25203 | 0.08647 | 0.25394 |
| map@10 | 0.5877 | 0.7423 | 0.6878 | 0.7656 | 0.7627 | 0.33312 | 0.13008 | 0.34714 |
| mrr@1 | 0.5345 | 0.6931 | 0.6324 | 0.7047 | 0.7063 | 0.29258 | 0.10067 | 0.29447 |
| mrr@10 | 0.6217 | 0.7668 | 0.712 | 0.7841 | 0.7827 | 0.36287 | 0.14516 | 0.3751 |
| ndcg@1 | 0.5207 | 0.6764 | 0.6159 | 0.6881 | 0.6884 | 0.28358 | 0.09748 | 0.28578 |
| ndcg@10 | 0.6346 | 0.7786 | 0.7262 | 0.8004 | 0.7974 | 0.37468 | 0.15783 | 0.39329 |
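For reference, mrr@k scores a ranking by the reciprocal rank of the first relevant document within the top k results. A minimal illustrative sketch (MTEB-zh's exact implementation may differ in details):

```python
def mrr_at_k(relevance, k=10):
    # relevance: 1/0 flags for the ranked results, best-ranked first.
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# First relevant document appears at rank 3 -> reciprocal rank is 1/3.
print(mrr_at_k([0, 0, 1, 0, 1]))
```

The reported score is this value averaged over all queries in the dataset.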

🤝 Contributing

If you would like to add evaluation datasets or models to MTEB-zh, feel free to open an issue or a PR; I will support it as soon as possible. Looking forward to your contributions!

📜 License

uniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.

🏷 Citation

Please cite this model using the following format:

@software{Moka_Massive_Mixed_Embedding,
  author = {Wang Yuxin and Sun Qingxuan and He Sicheng},
  title = {M3E: Moka Massive Mixed Embedding Model},
  year = {2023}
}