
Trials of pre-trained BERT models for the medical domain in Japanese

These models are designed for the Japanese medical domain.
The medical corpora were scraped for academic use from Today's diagnosis and treatment: premium, a collection of 15 Japanese-language digital references for clinicians published by IGAKU-SHOIN Ltd.
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) available at https://dumps.wikimedia.org/jawiki/.

Our demonstration models

medBERTjp - MeCab-IPAdic
- Pre-trained following the MeCab-IPAdic-tokenized Japanese BERT model.
- Japanese tokenizer: MeCab + Byte Pair Encoding (BPE)
- Requires ipadic-py or a manual installation of IPAdic.
- max_seq_length=128
medBERTjp - Unidic-2.3.0
- Japanese tokenizer: MeCab + BPE
- Requires Unidic v2.3.0+2020-10-08, installed via unidic-py.
- max_seq_length=128
medBERTjp - MeCab-IPAdic-NEologd-JMeDic
- Japanese tokenizer: MeCab + BPE
- Requires both mecab-ipadic-NEologd and J-MeDic (MANBYO_201907_Dic-utf8.dic) to be installed.
- max_seq_length=128
medBERTjp - SentencePiece (Old: v0.1-sp)
- Japanese tokenizer: SentencePiece, following the Sentencepiece Japanese BERT model
- Uses a SentencePiece tokenization customized for the medical domain.
- max_seq_length=128
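
The two-stage tokenization and the max_seq_length=128 limit listed above can be illustrated with a small, dependency-free sketch. It is not the code used to build these models. The input is segmented by hand as a stand-in for MeCab output, the vocabulary is a hypothetical toy set, and the subword step uses the greedy longest-match lookup (WordPiece style) that BERT tokenizers apply at inference time, not a BPE merge procedure. The function names are illustrative only.

```python
from typing import List, Set

MAX_SEQ_LENGTH = 128  # value listed for every model above


def split_into_subwords(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match split; non-initial pieces carry a '##' prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(words: List[str], vocab: Set[str], max_len: int = MAX_SEQ_LENGTH) -> List[str]:
    """Subword-split pre-segmented words, add special tokens, cut to max_len."""
    pieces = [p for w in words for p in split_into_subwords(w, vocab)]
    pieces = pieces[: max_len - 2]  # reserve room for [CLS] and [SEP]
    return ["[CLS]"] + pieces + ["[SEP]"]


# Hypothetical toy vocabulary and hand-segmented input (stand-in for MeCab output).
toy_vocab = {"糖尿", "##病", "の", "治療"}
words = ["糖尿病", "の", "治療"]
tokens = encode(words, toy_vocab)
print(tokens)  # ['[CLS]', '糖尿', '##病', 'の', '治療', '[SEP]']
```

Inputs longer than the limit are cut so that the special tokens still fit, which is why long clinical texts usually need to be split into shorter segments before encoding.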

Requirements

To use the models only:

Transformers (>=2.11.0)
fugashi, a Cython wrapper for MeCab
- ipadic, unidic-py, mecab-ipadic-NEologd, and J-MeDic: as required by the chosen model.
SentencePiece is installed automatically with Transformers.
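
Before loading a model, it can help to check which of these packages are importable. The sketch below uses only the standard library. The module names are the usual import names of the packages above (`unidic` for unidic-py, `ipadic` for ipadic-py). mecab-ipadic-NEologd and J-MeDic are MeCab dictionaries and not Python packages, so they are not covered by this check.

```python
from importlib.util import find_spec

# Import names of the packages listed above; which optional ones you need
# depends on the model variant.
REQUIRED = ["transformers", "fugashi"]
OPTIONAL = ["ipadic", "unidic", "sentencepiece"]


def installed(modules):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


status = installed(REQUIRED + OPTIONAL)
for name, ok in status.items():
    print(f"{name:15s} {'found' if ok else 'missing'}")
```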

Usage

See the code examples in tokenization_example.ipynb, or try example_google_colab.ipynb on Google Colab.
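
For orientation, a minimal loading sketch with Transformers might look like the following. It is an assumption-laden illustration, not a copy of the notebooks: `MODEL_DIR` is a hypothetical local path, the directory is assumed to contain a vocabulary file, a config, and PyTorch weights for a MeCab-based model, and the required MeCab dictionary is assumed to be installed. The SentencePiece model and the NEologd/J-MeDic variant may need a different tokenizer setup, so refer to the notebooks for those.

```python
from pathlib import Path

# Hypothetical local directory holding a downloaded model (vocab.txt,
# config.json, PyTorch weights); replace with your own path.
MODEL_DIR = Path("./medbertjp-mecab-ipadic")


def load_tokenizer_and_model(model_dir):
    """Load a MeCab-based Japanese BERT tokenizer and masked-LM model."""
    # Imported lazily so this file can be read without Transformers installed.
    from transformers import BertForMaskedLM, BertJapaneseTokenizer

    tokenizer = BertJapaneseTokenizer.from_pretrained(str(model_dir))
    model = BertForMaskedLM.from_pretrained(str(model_dir))
    return tokenizer, model


if MODEL_DIR.is_dir():
    tokenizer, model = load_tokenizer_and_model(MODEL_DIR)
    print(tokenizer.tokenize("糖尿病の治療"))
else:
    print(f"{MODEL_DIR} not found; download a model first.")
```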

Funding

This work was supported by Council for Science, Technology and Innovation (CSTI), cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).

Licenses

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />The pretrained models are distributed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.
They are freely available for academic purposes and individual research, but commercial use is restricted.

The code in this repository is licensed under the Apache License, Version 2.0.