Awesome

⚠️ IMPORTANT (2024/10/8): JMTEB, a leaderboard that evaluates embedding models on a more diverse set of tasks, has been published. We recommend referring to it.

JapaneseEmbeddingEval

JSTS/JSICK: Spearman's rank correlation coefficient
- Cosine similarity was used to calculate the similarity of sentence pairs.
MIRACL: top30 recall
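
As a rough illustration of the JSTS/JSICK metric, the sketch below scores each sentence pair by the cosine similarity of its two embeddings and then computes Spearman's rank correlation against the gold labels. It is a minimal pure-Python sketch, not the evaluation script used for the table; the toy vectors, gold scores, and all function names are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: (embedding_a, embedding_b) per sentence pair.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # almost the same direction
    ([1.0, 0.0], [0.5, 0.5]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [4.8, 2.5, 0.3]  # hypothetical human similarity labels

predicted = [cosine_similarity(u, v) for u, v in pairs]
rho = spearman(predicted, gold)  # close to 1.0: both rankings agree
```

In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written ranking code; it is spelled out here only to keep the example dependency-free.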

Model	#dims	#params	JSTS valid-v1.1	JSICK test	MIRACL dev	Average
BAAI/bge-m3(dense_vecs)	1024	567M	0.802	0.798	0.910¹	0.837
jinaai/jina-embeddings-v3	1024	572M	0.819	0.782	0.862	0.821
MU-Kindai/SBERT-JSNLI-base	768	110M	0.766	0.652	0.326	0.581
MU-Kindai/SBERT-JSNLI-large	1024	337M	0.774	0.677	0.278	0.576
bclavie/fio-base-japanese-v0.1 ²	768	111M	0.863	0.894	0.718	0.825
cl-nagoya/ruri-small	768	67M	0.821	0.833	0.791¹	0.815
cl-nagoya/ruri-base	768	111M	0.833	0.823	0.846¹	0.834
cl-nagoya/ruri-large	1024	337M	0.842	0.819	0.864¹	0.842
cl-nagoya/sup-simcse-ja-base	768	111M	0.809	0.827	0.527	0.721
cl-nagoya/sup-simcse-ja-large	1024	337M	0.831	0.831	0.507	0.723
cl-nagoya/unsup-simcse-ja-base	768	111M	0.789	0.790	0.487	0.689
cl-nagoya/unsup-simcse-ja-large	1024	337M	0.814	0.796	0.485	0.699
colorfulscoop/sbert-base-ja	768	110M	0.742	0.657	0.254	0.551
intfloat/multilingual-e5-small	384	117M	0.789	0.814	0.847¹	0.817
intfloat/multilingual-e5-base	768	278M	0.796	0.806	0.845¹	0.816
intfloat/multilingual-e5-large	1024	559M	0.819	0.794	0.883¹	0.832
intfloat/multilingual-e5-large-instruct	1024	559M	0.832	0.822	0.876¹	0.844
oshizo/sbert-jsnli-luke-japanese-base-lite	768	133M	0.811	0.726	0.497	0.678
pkshatech/GLuCoSE-base-ja-v2	768	133M	0.809	0.849	0.879¹	0.846
pkshatech/RoSEtta-base-ja	768	190M	0.790	0.835	0.845¹	0.823
pkshatech/GLuCoSE-base-ja	768	133M	0.818	0.757	0.692	0.755
pkshatech/simcse-ja-bert-base-clcmlp	768	111M	0.801	0.735	0.544	0.693
API
text-embedding-3-large	3072		0.838	0.812	0.841³	0.830
text-embedding-3-small	1536		0.781	0.804	0.795³	0.793
text-embedding-ada-002	1536		0.790	0.790	0.728³	0.769
textembedding-gecko-multilingual@001	768		0.801	0.804	0.800³	0.801
LLM
intfloat/e5-mistral-7b-instruct	4096	7.3B	0.836	0.836	0.885	0.852
oshizo/japanese-e5-mistral-7b_slerp	4096	7.3B	0.846	0.842	0.886	0.858
oshizo/japanese-e5-mistral-1.9b	4096	1.9B	0.826	0.833	0.797	0.819
ColBERT
bclavie/jacolbert_first_100 ⁴	128/token	111M			0.872³
bclavie/JaColBERTv2 ⁴	128/token	111M			0.918³
BAAI/bge-m3(colbert_vecs)	1024/token	567M	0.799	0.798	0.917¹	0.838
BAAI/bge-m3(colbert+sparse+dense)	1024/token⁵	567M	0.800	0.805	0.926¹	0.844
Reranker
hotchpotch/japanese-bge-reranker-v2-m3-v1	-	567M			0.947¹
Sparse Retrieval
hotchpotch/japanese-splade-base-v1	-	111M			0.925¹
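
The Average column appears to be the unweighted mean of the three task scores; this is inferred from the numbers in the table, not stated by the authors. A minimal check under that assumption (the function name is hypothetical):

```python
def average_score(jsts, jsick, miracl):
    """Unweighted mean of the three task scores, rounded to three decimals."""
    return round((jsts + jsick + miracl) / 3, 3)

# Scores copied from the table rows above.
bge_m3_dense = average_score(0.802, 0.798, 0.910)
e5_mistral = average_score(0.836, 0.836, 0.885)
```

A few rows differ from this recomputation by 0.001, presumably because the published averages were taken over unrounded scores.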

Datasets

JSTS valid-v1.1
- https://github.com/yahoojapan/JGLUE
- 1,457 sentence pairs
JSICK test
- https://github.com/verypluming/JSICK
- 4,927 sentence pairs
MIRACL dev
- https://huggingface.co/datasets/miracl/miracl
- 860 Japanese queries
- To reduce computation time, the passages to be searched were selected from the 6,953,614 Japanese passages in miracl/miracl-corpus as follows:
  1. Positive passages for each query
  2. 300 hard negatives for each query
  - Hard negative mining was performed using intfloat/multilingual-e5-base
  - Scores for models other than intfloat/multilingual-e5-base can be overestimated, but only in the following case, and we believe the effect is negligible.
    - A negative that intfloat/multilingual-e5-base ranks below its top 300 (and that is therefore missing from the reduced corpus) would have been ranked within the top 30 by the evaluated model, pushing the positive down to rank 31 or lower.
- Some queries have more than 30 potential positive documents in miracl-corpus. For these queries, even a very good model may fail to rank the ground-truth positive documents within the top 30. We estimate that such queries make up about 7% to 10% of the 860 queries. The estimate was obtained by looking up the tydiqa data for the same query as each miracl dev query and counting the queries for which the tydiqa answer phrase appears in at least 30 of the 300 hard-negative documents.
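
The reduced search corpus and the top-30 recall metric described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the corpus is the union of all queries' positives and hard negatives, and that recall is the fraction of a query's positives found in the top k. The function names and toy IDs are hypothetical.

```python
def build_candidate_pool(positives_by_query, hard_negatives_by_query):
    """Union of every query's positive passages and its mined hard negatives."""
    pool = set()
    for qid, positives in positives_by_query.items():
        pool.update(positives)
        pool.update(hard_negatives_by_query.get(qid, []))
    return pool

def recall_at_k(ranked_ids, positive_ids, k=30):
    """Fraction of a query's positives that appear among the top-k passages."""
    if not positive_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(positive_ids)) / len(positive_ids)

# Hypothetical toy data: two queries, two hard negatives each (300 in the real setup).
positives = {"q1": ["p1"], "q2": ["p2", "p3"]}
hard_negatives = {"q1": ["n1", "n2"], "q2": ["n2", "n3"]}

pool = build_candidate_pool(positives, hard_negatives)

# A model's ranking of the pool for q2, best match first (made up for the example).
ranking_q2 = ["n2", "p2", "n3", "p3", "p1", "n1"]
recall_top2 = recall_at_k(ranking_q2, positives["q2"], k=2)  # only p2 is in the top 2
recall_top4 = recall_at_k(ranking_q2, positives["q2"], k=4)  # p2 and p3 are both found
```

Because the pool only contains passages that the mining model ranked highly, a passage outside it can never displace a positive, which is the source of the possible overestimation noted above.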

Footnotes

¹ These models have been fine-tuned using the MIRACL dataset, so the MIRACL task is not an unseen task for them. For detailed information on each model, please refer to the following links: multilingual-e5, BGE-M3, hotchpotch/japanese-bge-reranker-v2-m3-v1, hotchpotch/japanese-splade-base-v1, Ruri, pkshatech/GLuCoSE-base-ja-v2, pkshatech/RoSEtta-base-ja
² According to the blog post about fio-base-japanese-v0.1, the tasks aren't unseen by the model, which makes it hard to compare directly with the other models.
³ Evaluated on only the first 100 of the 860 queries.
⁴ JaColBERT is a retrieval model. It is optimised only for document retrieval tasks, and not for semantic similarity/entailment tasks like JSTS or JSICK.
⁵ The embedding dimension for dense is 1024, sparse is one float value per unique token, and colbert is 1024 per token.
