bert4vector

Vector computation, storage, retrieval, and similarity calculation

Documentation | Bert4torch | Examples | Source code

1. Installation

pip install bert4vector
pip install git+https://github.com/Tongjilibo/bert4vector

2. Quick start

# Encode sentences into vectors
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sentences = ['喜欢打篮球的男生喜欢什么样的女生', '西安下雪了?是不是很冷啊?', '第一次去见女朋友父母该如何表现?', '小蝌蚪找妈妈怎么样', '给我推荐一款红色的车', '我喜欢北京']
vecs = model.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)
print(vecs.shape)
# (6, 312)
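
The call above passes normalize_embeddings=False, so the returned rows are presumably not scaled to unit length. The following is a minimal sketch of normalizing the rows and computing pairwise cosine similarities with NumPy. It assumes `vecs` is a 2-D array shaped like the output above; random values stand in for real embeddings so that it runs without a model, and `l2_normalize` is a hypothetical helper, not part of bert4vector.

```python
import numpy as np

# Stand-in for the output of model.encode(...): 6 sentences, 312 dimensions.
# Random values are used only so this sketch runs without a model.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(6, 312)).astype(np.float32)

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (hypothetical helper)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

unit = l2_normalize(vecs)
cosine = unit @ unit.T  # (6, 6) matrix of pairwise cosine similarities
print(cosine.shape)
```

With unit-length rows, a plain dot product equals the cosine similarity, which is why normalizing once up front is convenient when many comparisons follow.
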

# Compute pairwise similarity between two lists of sentences
from bert4vector.core import BertSimilarity
text2vec = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sent1 = ['你好', '天气不错']
sent2 = ['你好啊', '天气很好']
similarity = text2vec.similarity(sent1, sent2)
print(similarity)
# [[0.9075422  0.42991278]
#  [0.19584633 0.72635853]]
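
In the printed matrix, entry [i][j] appears to score sent1[i] against sent2[j]: the largest values sit where the paired sentences are paraphrases. The sketch below picks the best match in sent2 for each sentence in sent1. The scores are copied from the output above into a NumPy array, which is an assumption about the return type rather than something the source states.

```python
import numpy as np

sent1 = ['你好', '天气不错']
sent2 = ['你好啊', '天气很好']
# Values copied from the printed output above.
similarity = np.array([[0.9075422, 0.42991278],
                       [0.19584633, 0.72635853]])

# For each row (a sentence in sent1), find the column with the highest score.
best = similarity.argmax(axis=1)
for i, j in enumerate(best):
    print(f"{sent1[i]} -> {sent2[j]} ({similarity[i, j]:.4f})")
```
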

# Build a corpus and search it for the closest entries
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
model.add_corpus(['你好', '我选你', '天气不错', '人很好看'])
print(model.search('你好'))
# {'你好': [{'corpus_id': 0, 'score': 0.9999, 'text': '你好'},
#           {'corpus_id': 3, 'score': 0.5694, 'text': '人很好看'}]}
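
The search result shown above maps each query to a list of hits carrying `corpus_id`, `score`, and `text`. As a small sketch of post-processing that structure, the code below drops low-scoring hits. The result dictionary is copied from the output above; `filter_hits` and the 0.6 cutoff are hypothetical and not part of bert4vector.

```python
# Result copied from the printed output above.
results = {'你好': [{'corpus_id': 0, 'score': 0.9999, 'text': '你好'},
                    {'corpus_id': 3, 'score': 0.5694, 'text': '人很好看'}]}

def filter_hits(results, min_score):
    """Keep only hits whose score is at least min_score (hypothetical helper)."""
    return {query: [hit for hit in hits if hit['score'] >= min_score]
            for query, hits in results.items()}

confident = filter_hits(results, min_score=0.6)  # 0.6 is an arbitrary cutoff
for query, hits in confident.items():
    print(query, [hit['text'] for hit in hits])
```

A sensible cutoff depends on the model and the data, so it is worth checking against a few known good and bad matches before fixing a value.
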
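
# Deploy the similarity model as an API service
from bert4vector.pipelines import SimilaritySever
server = SimilaritySever('/data/pretrain_ckpt/embedding/BAAI--bge-base-zh-v1.5')
port = 8000  # example value (an assumption, not from the original); use any free port
server.run(port=port)
# For calling the API, see './examples/api.py'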

3. Supported sentence-embedding weights

| Model category | Model name | Weight source | Weight link | Notes (if any) |
| --- | --- | --- | --- | --- |
| simbert | simbert | 追一科技 (Zhuiyi Technology) | Tongjilibo/simbert-chinese-base, Tongjilibo/simbert-chinese-small, Tongjilibo/simbert-chinese-tiny | |
| | simbert_v2/roformer-sim | 追一科技 (Zhuiyi Technology) | junnyu/roformer_chinese_sim_char_base, junnyu/roformer_chinese_sim_char_ft_base, junnyu/roformer_chinese_sim_char_small, junnyu/roformer_chinese_sim_char_ft_small | roformer_chinese_sim_char_base, roformer_chinese_sim_char_ft_base, roformer_chinese_sim_char_small, roformer_chinese_sim_char_ft_small |
| embedding | text2vec-base-chinese | shibing624 | shibing624/text2vec-base-chinese | text2vec-base-chinese |
| | m3e | moka-ai | moka-ai/m3e-base | m3e-base |
| | bge | BAAI | BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-base-zh-v1.5, BAAI/bge-small-en-v1.5, BAAI/bge-small-zh-v1.5 | bge-large-en-v1.5, bge-large-zh-v1.5, bge-base-en-v1.5, bge-base-zh-v1.5, bge-small-en-v1.5, bge-small-zh-v1.5 |
| | gte | thenlper | thenlper/gte-large-zh, thenlper/gte-base-zh | gte-base-zh, gte-large-zh |

*Notes:

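  1. Names shown in highlighted format (e.g. Tongjilibo/simbert-chinese-small) can be downloaded directly over the network
  2. Speed up downloads with a mirror site (for users in mainland China)
    • HF_ENDPOINT=https://hf-mirror.com python your_script.py
    • Run export HF_ENDPOINT=https://hf-mirror.com, then run your Python code
    • Or set it at the top of your Python code as follows
    import os
    os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"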
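
The in-code option works only if the variable is set before the libraries that download weights are imported, since such libraries commonly read it at import time. A minimal sketch, assuming you also want to respect an endpoint that was already exported in the shell:

```python
import os

# Set the mirror only if the caller has not already chosen an endpoint.
os.environ.setdefault('HF_ENDPOINT', 'https://hf-mirror.com')

# Import libraries that download weights only after this point, e.g.
# from bert4vector.core import BertSimilarity
print(os.environ['HF_ENDPOINT'])
```

Using `setdefault` rather than plain assignment keeps a value passed on the command line from being overwritten by the script.
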
    

4. Version history

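| Release date | bert4vector version | Description |
| --- | --- | --- |
| 20240928 | 0.0.5 | Minor changes; the API now supports reset |
| 20240710 | 0.0.4 | Added literal recall based on longest common subsequence; some features can be used without installing torch |
| 20240628 | 0.0.3 | Added several literal recall methods; added API deployment |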

5. Update history

6. Reference