Build PMC-OA

This is the pipeline we used to build PMC-OA. You may need to adapt it before using it for your own purposes.

Installation

Setup ENV

git clone https://gitee.com/lin_wei_hung/build-pmcoa.git
cd build-pmcoa

conda create -n pubmed python=3.8  # not tested with other versions
conda activate pubmed

pip install -r requirements.txt
python setup.py develop  # developer mode, so local changes take effect

Run the script

python src/fetch_oa.py --volumes 0 1 2 3 4 5 6 7 8 9  # all 10 PMC OA volumes
python src/fetch_oa.py --volumes 0 1 2  # volumes 0, 1, and 2 only
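
To illustrate how a flag like --volumes can accept several integers, here is a minimal sketch using argparse. It is not the code in src/args/args_oa.py; the function names build_parser and parse_volumes are hypothetical, and the default of all ten volumes is an assumption based on the commands above.

```python
import argparse


def build_parser():
    # Hypothetical parser; the real options live in src/args/args_oa.py.
    parser = argparse.ArgumentParser(description="Sketch of a --volumes option")
    parser.add_argument(
        "--volumes",
        type=int,
        nargs="+",                # one or more volume indices
        choices=range(10),        # assumption: volumes are numbered 0 to 9
        default=list(range(10)),  # assumption: download everything by default
        help="PMC OA volume indices to download",
    )
    return parser


def parse_volumes(argv):
    # Return the requested volumes, de-duplicated and sorted.
    args = build_parser().parse_args(argv)
    return sorted(set(args.volumes))


if __name__ == "__main__":
    print(parse_volumes(["--volumes", "0", "1", "2"]))
```

With nargs="+", argparse collects every value after the flag into a list, which is why the volume numbers are separated by spaces rather than commas on the command line.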

About PMC-OA

PMC-OA (PubMed Open Access Subset) is built from publicly available papers in PubMed, which can be downloaded from the PubMed page.

For copyright reasons, papers with a non-commercial-use license and papers with no license are not included in PMC-OA. You can customize the repo for your own purposes.
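
The license rule above could be expressed as a small filter. This is only a sketch under assumptions: the label format ("CC BY", "CC BY-NC", empty for no license) and the function name is_included are hypothetical and do not come from this repo.

```python
def is_included(license_label):
    # Assumption: an empty or missing label means the paper has no license.
    if not license_label:
        return False
    # Split labels such as "CC BY-NC-SA" into tokens: CC, BY, NC, SA.
    tokens = license_label.upper().replace("-", " ").split()
    # Drop anything carrying a non-commercial (NC) clause.
    return "NC" not in tokens


if __name__ == "__main__":
    for label in ["CC BY", "CC BY-NC", "CC0", ""]:
        print(repr(label), is_included(label))
```

To keep non-commercial papers for your own use, the NC check is the part you would relax.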

Structure

setup.py
src/
  |--fetch_oa.py: main script for downloading PMC-OA
  |--args/
  | |--args_oa.py: Configuration for the pipeline
  |--parser/
  | |--parse_oa.py: Parses web pages into a list of <img, caption> pairs
  |--utils/
  | |--io.py:
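
As a rough illustration of what "parsing a web page into <img, caption> pairs" can look like, here is a standard-library sketch. It is not the logic in parse_oa.py: it assumes each figure is a <figure> element holding an <img> and a <figcaption>, which real article pages may not follow, and FigureParser and extract_pairs are hypothetical names.

```python
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect (image src, caption text) pairs from <figure> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._in_figure = False
        self._in_caption = False
        self._img = None
        self._caption_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            # Start a fresh figure and forget any previous state.
            self._in_figure = True
            self._img = None
            self._caption_parts = []
        elif self._in_figure and tag == "img" and self._img is None:
            # Keep only the first image of the figure.
            self._img = dict(attrs).get("src")
        elif self._in_figure and tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._in_figure:
            # Join text fragments and collapse runs of whitespace.
            caption = " ".join("".join(self._caption_parts).split())
            if self._img:
                self.pairs.append((self._img, caption))
            self._in_figure = False

    def handle_data(self, data):
        if self._in_caption:
            self._caption_parts.append(data)


def extract_pairs(html):
    parser = FigureParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Figures without an image source are skipped, so a caption-only block never produces a pair.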

Contribution

Fork the repository
Create a Feat_xxx branch
Commit your code
Create a Pull Request

Limitation

Some papers are only available in PDF format; the figures in those papers cannot be obtained by this pipeline.
Media files other than images, such as files with the .mp4 or .avi suffix, might also be downloaded.
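
One way to work around the second limitation is to filter the downloaded files by suffix afterwards. The sketch below is not part of the pipeline; the suffix set IMAGE_SUFFIXES and the helper names is_image and keep_images are assumptions you would adjust to your data.

```python
from pathlib import PurePosixPath

# Assumption: these suffixes cover the image formats you want to keep.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp"}


def is_image(path):
    # Compare the suffix case-insensitively, so "fig1.JPG" also matches.
    return PurePosixPath(path).suffix.lower() in IMAGE_SUFFIXES


def keep_images(paths):
    # Preserve the input order and drop videos and other media.
    return [p for p in paths if is_image(p)]


if __name__ == "__main__":
    print(keep_images(["a/fig1.JPG", "a/movie.mp4", "b/clip.avi", "b/fig2.png"]))
```

An allow-list of image suffixes is used rather than a block-list of video suffixes, so unexpected media types are dropped by default.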

Cite

@article{lin2023pmc,
  title={PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents},
  author={Lin, Weixiong and Zhao, Ziheng and Zhang, Xiaoman and Wu, Chaoyi and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal={arXiv preprint arXiv:2303.07240},
  year={2023}
}