Trials of pre-trained BERT models for the medical domain in Japanese
These models are designed to be adapted to the Japanese medical domain.
The medical corpora were scraped for academic use from Today's diagnosis and treatment: premium, a collection of 15 Japanese-language digital references for clinicians published by IGAKU-SHOIN Ltd.
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) on https://dumps.wikimedia.org/jawiki/.
Our demonstration models
- medBERTjp - MeCab-IPAdic
  - Pre-trained model following the MeCab-IPAdic-tokenized Japanese BERT model.
  - Japanese tokenizer: MeCab + Byte Pair Encoding (BPE)
  - Requires ipadic-py or a manual install of IPAdic.
  - max_seq_length=128
- medBERTjp - Unidic-2.3.0
- medBERTjp - MeCab-IPAdic-NEologd-JMeDic
  - Japanese tokenizer: MeCab + BPE
  - Requires installing both mecab-ipadic-NEologd and J-MeDic (MANBYO_201907_Dic-utf8.dic).
  - max_seq_length=128
- medBERTjp - SentencePiece (Old: v0.1-sp)
  - Japanese tokenizer: SentencePiece, following the Sentencepiece Japanese BERT model
  - Uses SentencePiece tokenization customized for the medical domain.
  - max_seq_length=128
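
The MeCab-based variants listed above pair word segmentation with BPE and cap inputs at `max_seq_length=128`. The following is a minimal sketch of that two-stage idea, not the repository's implementation. The pre-segmented words stand in for MeCab output, and the merge table `TOY_MERGES`, the helper names `apply_bpe`, `tokenize`, and `build_input`, and the `[CLS]`/`[SEP]` framing are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical merge table: a lower rank means the merge was learned earlier
# and is therefore applied first. Real models learn thousands of merges.
TOY_MERGES: Dict[Tuple[str, str], int] = {
    ("糖", "尿"): 0,
    ("糖尿", "病"): 1,
    ("治", "療"): 2,
    ("不", "全"): 3,
}


def apply_bpe(word: str, merges: Dict[Tuple[str, str], int]) -> List[str]:
    """Split one word into subwords by applying BPE merges in rank order."""
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned merge.
        ranked = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges
        ]
        if not ranked:
            break  # no learned merge applies any more
        _, i = min(ranked)  # best-ranked pair, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def tokenize(words: List[str], merges: Dict[Tuple[str, str], int]) -> List[str]:
    """Stage 2: apply BPE to each word produced by the word segmenter."""
    tokens: List[str] = []
    for word in words:
        tokens.extend(apply_bpe(word, merges))
    return tokens


def build_input(tokens: List[str], max_seq_length: int = 128) -> List[str]:
    """Truncate so that [CLS] + tokens + [SEP] fits within max_seq_length."""
    body = tokens[: max_seq_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]


if __name__ == "__main__":
    # Stage 1 is assumed: these words are hard-coded in place of MeCab output.
    words = ["腎不全", "の", "治療"]
    print(build_input(tokenize(words, TOY_MERGES)))
```

Sequences longer than the limit are simply cut here; whether the actual models truncate or split long documents is not stated in this README.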
Requirements
For just using the models:
- Transformers (>=2.11.0)
- fugashi, a Cython wrapper for MeCab
- ipadic, unidic-py, mecab-ipadic-NEologd, and J-MeDic, as required by the model you choose.
- SentencePiece is installed automatically alongside Transformers.
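
As a quick way to check an environment against the list above, the sketch below reports which Python packages are installed and whether Transformers meets the 2.11.0 minimum. The distribution names `ipadic`, `unidic`, and `sentencepiece` are assumptions about how these packages are published on PyPI, and the helper names are hypothetical. mecab-ipadic-NEologd and J-MeDic are dictionaries installed outside pip, so they are not checked here.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Package name -> minimum version (None means any version is accepted).
REQUIRED: Dict[str, Optional[str]] = {"transformers": "2.11.0", "fugashi": None}
# Assumed PyPI names for the optional packages; adjust to your setup.
OPTIONAL = ["ipadic", "unidic", "sentencepiece"]


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.11.0' into (2, 11, 0), ignoring suffixes such as 'rc1' or '.dev0'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(found: Optional[str], minimum: Optional[str]) -> bool:
    """True if the package is installed and at least as new as the minimum."""
    if found is None:
        return False
    if minimum is None:
        return True
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.11' compares equal to '2.11.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    for name, minimum in REQUIRED.items():
        found = installed_version(name)
        status = "ok" if meets_minimum(found, minimum) else "MISSING or too old"
        print(f"{name}: {found} ({status})")
    for name in OPTIONAL:
        print(f"{name} (optional): {installed_version(name)}")
```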
Usage
See the code examples in tokenization_example.ipynb, or try example_google_colab.ipynb on Google Colab.
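
For orientation before opening the notebooks, here is a minimal sketch of loading one of the MeCab-based models with Transformers. It assumes the model files have been downloaded to a local directory that is compatible with `BertJapaneseTokenizer`. The path `./medbertjp-ipadic` and the helper `load_medbertjp` are hypothetical, and the SentencePiece variant would need a different tokenizer class. The notebooks remain the authoritative reference.

```python
from pathlib import Path


def load_medbertjp(model_dir: str):
    """Load a tokenizer and masked-LM model from a local directory (assumed layout)."""
    path = Path(model_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")

    # Imported lazily so the path check works even without Transformers installed.
    from transformers import BertForMaskedLM, BertJapaneseTokenizer

    tokenizer = BertJapaneseTokenizer.from_pretrained(str(path))
    model = BertForMaskedLM.from_pretrained(str(path))
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


if __name__ == "__main__":
    # Hypothetical path; point this at wherever the model files were unpacked.
    tokenizer, model = load_medbertjp("./medbertjp-ipadic")
    print(tokenizer.tokenize("糖尿病の治療"))
```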
Funding
This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).
Licenses
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />The pretrained models are distributed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.
They are freely available for academic purposes or individual research, but restricted for commercial use.
The code in this repository is licensed under the Apache License, Version 2.0.