⚠️ IMPORTANT UPDATE (2024/10/8): JMTEB, a leaderboard that evaluates embedding models on a more diverse set of tasks, has been released. We recommend referring to it.
JapaneseEmbeddingEval
- JSTS/JSICK: Spearman's rank correlation coefficient
  - Cosine similarity was used to calculate the similarity of sentence pairs.
- MIRACL: top-30 recall
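A minimal sketch of the two metrics listed above, written with the standard library only. The function names and toy data are hypothetical and are not taken from this repository's evaluation code. The recall definition (fraction of a query's positives found in the top k) is an assumption, since the exact variant is not stated here. In practice `scipy.stats.spearmanr` would be the usual choice for the correlation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def recall_at_k(ranked_ids, positive_ids, k=30):
    """Fraction of a query's positives found among the top-k ranked ids (assumed definition)."""
    top_k = set(ranked_ids[:k])
    return sum(1 for p in positive_ids if p in top_k) / len(positive_ids)


# Toy data (hypothetical): embeddings for three sentence pairs and gold similarity scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-identical
    ([1.0, 0.0], [0.5, 0.5]),  # related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
gold = [4.8, 2.5, 0.3]
predicted = [cosine_similarity(u, v) for u, v in pairs]
rho = spearman(predicted, gold)
print(round(rho, 3))  # 1.0, because both score lists induce the same ordering
```

Spearman's coefficient only compares orderings, so a model is not penalised for producing cosine values on a different scale from the gold scores.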
Model | #dims | #params | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
---|---|---|---|---|---|---|
BAAI/bge-m3(dense_vecs) | 1024 | 567M | 0.802 | 0.798 | 0.910 [1] | 0.837 |
jinaai/jina-embeddings-v3 | 1024 | 12M | 0.819 | 0.782 | 0.862 | 0.821 |
MU-Kindai/SBERT-JSNLI-base | 768 | 110M | 0.766 | 0.652 | 0.326 | 0.581 |
MU-Kindai/SBERT-JSNLI-large | 1024 | 337M | 0.774 | 0.677 | 0.278 | 0.576 |
bclavie/fio-base-japanese-v0.1 [2] | 768 | 111M | 0.863 | 0.894 | 0.718 | 0.825 |
cl-nagoya/ruri-small | 768 | 67M | 0.821 | 0.833 | 0.791 [1] | 0.815 |
cl-nagoya/ruri-base | 768 | 111M | 0.833 | 0.823 | 0.846 [1] | 0.834 |
cl-nagoya/ruri-large | 1024 | 337M | 0.842 | 0.819 | 0.864 [1] | 0.842 |
cl-nagoya/sup-simcse-ja-base | 768 | 111M | 0.809 | 0.827 | 0.527 | 0.721 |
cl-nagoya/sup-simcse-ja-large | 1024 | 337M | 0.831 | 0.831 | 0.507 | 0.723 |
cl-nagoya/unsup-simcse-ja-base | 768 | 111M | 0.789 | 0.790 | 0.487 | 0.689 |
cl-nagoya/unsup-simcse-ja-large | 1024 | 337M | 0.814 | 0.796 | 0.485 | 0.699 |
colorfulscoop/sbert-base-ja | 768 | 110M | 0.742 | 0.657 | 0.254 | 0.551 |
intfloat/multilingual-e5-small | 384 | 117M | 0.789 | 0.814 | 0.847 [1] | 0.817 |
intfloat/multilingual-e5-base | 768 | 278M | 0.796 | 0.806 | 0.845 [1] | 0.816 |
intfloat/multilingual-e5-large | 1024 | 559M | 0.819 | 0.794 | 0.883 [1] | 0.832 |
intfloat/multilingual-e5-large-instruct | 1024 | 559M | 0.832 | 0.822 | 0.876 [1] | 0.844 |
oshizo/sbert-jsnli-luke-japanese-base-lite | 768 | 133M | 0.811 | 0.726 | 0.497 | 0.678 |
pkshatech/GLuCoSE-base-ja-v2 | 768 | 133M | 0.809 | 0.849 | 0.879 [1] | 0.846 |
pkshatech/RoSEtta-base-ja | 768 | 190M | 0.790 | 0.835 | 0.845 [1] | 0.823 |
pkshatech/GLuCoSE-base-ja | 768 | 133M | 0.818 | 0.757 | 0.692 | 0.755 |
pkshatech/simcse-ja-bert-base-clcmlp | 768 | 111M | 0.801 | 0.735 | 0.544 | 0.693 |
API | | | | | | |
text-embedding-3-large | 3072 | | 0.838 | 0.812 | 0.841 [3] | 0.830 |
text-embedding-3-small | 1536 | | 0.781 | 0.804 | 0.795 [3] | 0.793 |
text-embedding-ada-002 | 1536 | | 0.790 | 0.790 | 0.728 [3] | 0.769 |
textembedding-gecko-multilingual@001 | 768 | | 0.801 | 0.804 | 0.800 [3] | 0.801 |
LLM | | | | | | |
intfloat/e5-mistral-7b-instruct | 4096 | 7.3B | 0.836 | 0.836 | 0.885 | 0.852 |
oshizo/japanese-e5-mistral-7b_slerp | 4096 | 7.3B | 0.846 | 0.842 | 0.886 | 0.858 |
oshizo/japanese-e5-mistral-1.9b | 4096 | 1.9B | 0.826 | 0.833 | 0.797 | 0.819 |
ColBERT | | | | | | |
bclavie/jacolbert_first_100 [4] | 128/token | 111M | | | 0.872 [3] | |
bclavie/JaColBERTv2 [4] | 128/token | 111M | | | 0.918 [3] | |
BAAI/bge-m3(colbert_vecs) | 1024/token | 567M | 0.799 | 0.798 | 0.917 [1] | 0.838 |
BAAI/bge-m3(colbert+sparse+dense) | 1024/token [5] | 567M | 0.800 | 0.805 | 0.926 [1] | 0.844 |
Reranker | | | | | | |
hotchpotch/japanese-bge-reranker-v2-m3-v1 | - | 567M | | | 0.947 [1] | |
Sparse Retrieval | | | | | | |
hotchpotch/japanese-splade-base-v1 | - | 111M | | | 0.925 [1] | |
Datasets
- JSTS valid-v1.1
  - https://github.com/yahoojapan/JGLUE
  - 1,457 sentence pairs
- JSICK test
  - https://github.com/verypluming/JSICK
  - 4,927 sentence pairs
- MIRACL dev
  - https://huggingface.co/datasets/miracl/miracl
  - 860 Japanese queries
  - To reduce computation time, the passages to be searched were selected from the 6,953,614 Japanese passages in miracl/miracl-corpus as follows:
    - positive passage for each query
    - 300 hard negatives for each query
      - Hard negative mining was performed using intfloat/multilingual-e5-base.
  - Scores for models other than intfloat/multilingual-e5-base can be inflated only in the following case, but we believe the effect is negligible.
    - A negative that intfloat/multilingual-e5-base ranks below its top 300, and that is therefore absent from the search set, would have been ranked within the top 30 by the model under evaluation, pushing the positive out of the top 30.
  - Some queries have more than 30 potential positive documents in miracl-corpus. In this case, even a very good model may fail to rank the ground-truth positive documents within the top 30. We estimate that such queries make up about 7% to 10% of the 860 queries. The estimate was obtained by looking up the TyDi QA data for the same query as each MIRACL dev query and checking whether the TyDi QA answer phrase appeared in at least 30 of the 300 hard-negative documents.
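The selection procedure and the estimate described above can be sketched as follows. This is an illustration under stated assumptions, not this repository's code: the function names are hypothetical, the per-query selections are assumed to be merged into one shared search set, and "the answer phrase was in a document" is assumed to mean plain substring matching.

```python
def build_candidate_pool(positives_by_query, miner_ranking_by_query, n_negatives=300):
    """Collect each query's positives plus its top-n mined hard negatives.

    positives_by_query: {query_id: set of positive passage ids}
    miner_ranking_by_query: {query_id: passage ids ranked by the mining model}
    Assumption: all per-query selections are merged into one shared pool.
    """
    pool = set()
    for qid, positives in positives_by_query.items():
        pool.update(positives)
        # Hard negatives are the highest-ranked passages that are not positives.
        negatives = [d for d in miner_ranking_by_query[qid] if d not in positives]
        pool.update(negatives[:n_negatives])
    return pool


def has_many_potential_positives(answer_phrase, hard_negative_texts, threshold=30):
    """True if the answer phrase occurs in at least `threshold` hard negatives.

    Assumption: occurrence is tested by plain substring matching.
    """
    hits = sum(1 for text in hard_negative_texts if answer_phrase in text)
    return hits >= threshold


# Toy example (hypothetical ids and texts), with 2 negatives per query for brevity.
positives = {"q1": {"d1"}, "q2": {"d5"}}
miner_ranking = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d3", "d5", "d6", "d7"],
}
pool = build_candidate_pool(positives, miner_ranking, n_negatives=2)
print(sorted(pool))  # ['d1', 'd2', 'd3', 'd5', 'd6']

texts = ["東京は日本の首都", "大阪は西日本の都市", "東京タワー"]
print(has_many_potential_positives("東京", texts, threshold=2))  # True
```

Passages such as `d4` and `d7` never enter the pool, which is why a model that would have ranked them highly can only look better, never worse, on this reduced search set.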
Footnotes
1. These models have been fine-tuned using the MIRACL dataset, so the MIRACL task is not an unseen task for them. For detailed information on each model, please refer to the following links: multilingual-e5, BGE-M3, hotchpotch/japanese-bge-reranker-v2-m3-v1, hotchpotch/japanese-splade-base-v1, Ruri, pkshatech/GLuCoSE-base-ja-v2, pkshatech/RoSEtta-base-ja
2. According to the blog post about fio-base-japanese-v0.1, the tasks are not unseen by the model, which makes it hard to compare directly with the other models.
3. Evaluated on only the first 100 of the 860 queries.
4. JaColBERT is a retrieval model. It is optimised only for document retrieval tasks, not for semantic similarity/entailment tasks like JSTS or JSICK.
5. The embedding dimension for dense is 1024, sparse is one float value per unique token, and colbert is 1024 per token.