The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study

License: Open Data Commons Attribution

This repository contains the datasets and source code used in our paper The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study.

Datasets

NOTE: If you are working on graph mining tasks (e.g., node classification, link prediction) in homogeneous, heterogeneous, attributed, or text-rich networks, we also provide MAPLE in a graph format. See README_Graph.md for details.

The MAPLE benchmark contains 20 datasets across 19 fields for scientific literature tagging. You can download the datasets from HERE. Unzipping the downloaded file produces a folder MAPLE/. Put this folder under the main directory ./ of this code repository.

There are 23 folders under MAPLE/, one per dataset. The 20 datasets with MAG labels are the ones used in the main text of our paper; the other 3 datasets, with MeSH labels, are described in the "Additional Datasets with MeSH Labels" section below. Statistics of the 20 "main" datasets are as follows:

Dataset Statistics

| Folder | Field | #Papers | #Labels | #Venues | #Authors | #References |
| --- | --- | --- | --- | --- | --- | --- |
| Art | Art | 58,373 | 1,990 | 98 | 54,802 | 115,343 |
| Philosophy | Philosophy | 59,296 | 3,758 | 98 | 36,619 | 198,010 |
| Geography | Geography | 73,883 | 3,285 | 98 | 157,423 | 884,632 |
| Business | Business | 84,858 | 2,392 | 97 | 100,525 | 685,034 |
| Sociology | Sociology | 90,208 | 1,935 | 98 | 85,793 | 842,561 |
| History | History | 113,147 | 2,689 | 99 | 84,529 | 284,739 |
| Political_Science | Political Science | 115,291 | 4,990 | 98 | 93,393 | 480,136 |
| Environmental_Science | Environmental Science | 123,945 | 694 | 100 | 265,728 | 1,217,268 |
| Economics | Economics | 178,670 | 5,205 | 97 | 135,247 | 1,042,253 |
| CSRankings | Computer Science (Conference) | 263,393 | 13,613 | 75 | 331,582 | 1,084,440 |
| Engineering | Engineering | 270,006 | 10,683 | 100 | 430,046 | 1,867,276 |
| Psychology | Psychology | 372,954 | 7,641 | 100 | 460,123 | 2,313,701 |
| Computer_Science | Computer Science (Journal) | 410,603 | 15,540 | 96 | 634,506 | 2,751,996 |
| Geology | Geology | 431,834 | 7,883 | 100 | 471,216 | 1,753,762 |
| Mathematics | Mathematics | 490,551 | 14,271 | 98 | 404,066 | 2,150,584 |
| Materials_Science | Materials Science | 1,337,731 | 6,802 | 99 | 1,904,549 | 5,457,773 |
| Physics | Physics | 1,369,983 | 16,664 | 91 | 1,392,070 | 3,641,761 |
| Biology | Biology | 1,588,778 | 64,267 | 100 | 2,730,547 | 7,086,131 |
| Chemistry | Chemistry | 1,849,956 | 35,538 | 100 | 2,721,253 | 8,637,438 |
| Medicine | Medicine | 2,646,105 | 36,619 | 100 | 4,345,385 | 7,405,779 |

Data Format

In each folder (e.g., Art/), you can see four files: authors.txt, labels.txt, papers.json, and venues.txt.
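As a quick sanity check after unzipping, the sketch below reports which of the four files are missing from a dataset folder. The helper name `missing_files` and the constant `EXPECTED_FILES` are illustrative and not part of this repository.

```python
from pathlib import Path

# File names taken from the description above; the tuple itself is a
# hypothetical helper constant, not something shipped with the repository.
EXPECTED_FILES = ("authors.txt", "labels.txt", "papers.json", "venues.txt")


def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For example, `missing_files("MAPLE/Art")` should return an empty list when the folder is complete.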

authors.txt has 3 columns: author id, normalized author name, and original author name:

12035	stephen rickerby	Stephen Rickerby
127649	clementine deliss	Clementine Deliss
1395514	tomas garciasalgado	Tomás García-Salgado
...

venues.txt has 3 columns: venue id, normalized venue name, and original venue name:

26308392	the journal of aesthetics and art criticism	The Journal of Aesthetics and Art Criticism
93676754	modern language review	Modern Language Review
998751717	classical world	Classical World
...

labels.txt has 3 columns: label id, label name, and depth of the label (1-5, with 1 being the coarsest and 5 being the finest):

2780583484	papyrus	2
2778949450	scientific writing	2
2780412351	purgatory	2
...
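The three text files above can be loaded with a few lines of Python. This is a minimal sketch that assumes the files are UTF-8 encoded, tab-separated, and have no header row; the function names are illustrative and not part of this repository.

```python
def read_rows(path, num_columns=3):
    """Return the rows of a tab-separated file that have exactly num_columns fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == num_columns:  # skips blank or malformed lines
                rows.append(fields)
    return rows


def load_names(path):
    """Map id -> (normalized name, original name) for authors.txt or venues.txt."""
    return {row[0]: (row[1], row[2]) for row in read_rows(path)}


def load_labels(path):
    """Map label id -> (label name, depth as int) for labels.txt."""
    return {row[0]: (row[1], int(row[2])) for row in read_rows(path)}
```

Ids are kept as strings so that they match the string ids used in papers.json.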

papers.json contains the text and metadata of each paper. Each line is a JSON record representing one paper. For example (shown here across multiple lines for readability),

{
  "paper": "2333162778",
  "venue": "103229351",
  "year": "1987",
  "title": "the life and unusual ideas of adelbert ames jr",
  "label": [
    "554144382", "153349607"
  ],
  "author": [
    "2162173344"
  ],
  "reference": [
    "132232344", "378964350", "562124327", ...
  ],
  "abstract": "this paper is a summary of the life and major achievements of adelbert ames jr an american ...",
  "title_raw": "The Life and Unusual Ideas of Adelbert Ames, Jr.",
  "abstract_raw": "This paper is a summary of the life and major achievements of Adelbert Ames, Jr., an American ..."
}
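Because papers.json stores one record per line, it can be streamed without loading the whole file into memory, which matters for the larger datasets. The sketch below assumes UTF-8 encoding; `iter_papers` and `label_frequencies` are illustrative names, not functions from this repository.

```python
import json
from collections import Counter


def iter_papers(path):
    """Yield one dict per non-empty line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_frequencies(papers):
    """Count how many papers carry each label id."""
    counts = Counter()
    for paper in papers:
        # "label" holds a list of label-id strings, as in the example above.
        counts.update(paper.get("label", []))
    return counts
```

Note that all ids and the "year" field are strings in the example record, so convert with `int(paper["year"])` if you need numeric comparison.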

Additional Datasets with MeSH Labels

The three additional datasets, Biology_MeSH, Chemistry_MeSH, and Medicine_MeSH, are constructed from Biology, Chemistry, and Medicine, respectively, by obtaining the MeSH labels of each paper and removing papers without MeSH labels.

Dataset Statistics

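| Folder | Field | #Papers | #Labels | #Venues | #Authors | #References |
| --- | --- | --- | --- | --- | --- | --- |
| Biology_MeSH | Biology-MeSH | 1,379,393 | 25,039 | 100 | 2,486,814 | 6,876,739 |
| Chemistry_MeSH | Chemistry-MeSH | 762,129 | 21,585 | 87 | 1,498,358 | 5,928,908 |
| Medicine_MeSH | Medicine-MeSH | 1,536,660 | 25,188 | 100 | 2,791,165 | 7,190,021 |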

Data Format

In each folder (e.g., Biology_MeSH/), you can see five files: authors.txt, labels.txt, labels_mesh.txt, papers.json, and venues.txt.

authors.txt and venues.txt have the same format as in the 20 "main" datasets.

labels.txt has 2 columns: MeSH label id and original MeSH label name:

D000818	Animals
D001824	Body Constitution
D005075	Biological Evolution
...

labels_mesh.txt has at least 2 columns: MeSH label id, normalized MeSH label name, and then one column for each entry term (i.e., synonym) of the MeSH label, if any:

D000818	animals	animalia
D001824	body constitution	body constitutions	constitution body	constitutions body
D005075	biological evolution	evolution biological
...
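Since the number of columns varies per line, labels_mesh.txt is easiest to parse by splitting each line and treating everything after the id as a name or synonym. The sketch below builds a lookup from any normalized name or entry term to its MeSH label id. It assumes a UTF-8, tab-separated file; `load_mesh_terms` is an illustrative name, not a function from this repository.

```python
def load_mesh_terms(path):
    """Map each normalized label name or entry term to its MeSH label id."""
    term_to_id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:  # need at least an id and a name
                continue
            mesh_id, terms = fields[0], fields[1:]
            for term in terms:
                # If a term occurs under several ids, the first one seen wins.
                term_to_id.setdefault(term, mesh_id)
    return term_to_id
```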

papers.json has the same format as in the 20 "main" datasets. The only difference is that the "label" field now contains all MeSH labels of the paper. For example,

{
  "paper": "1816482797",
  "venue": "166515463",
  "year": "2015",
  "title": "proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic ...",
  "label": [
    "D005810", "D005808", "D020125", "D019295", "D030541", ...
  ],
  "author": [
    "2303839782", "2953263946", "2160643821", ...
  ],
  "reference": [
    "80748578", "1563940013", "1570281893", ...
  ],
  "abstract": "the role of rare missense variants in disease causation remains difficult to interpret ...",
  "title_raw": "Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic ...",
  "abstract_raw": "The role of rare missense variants in disease causation remains difficult to interpret ..."
}

Running Parabel

The Parabel code is written in C++ and adapted from the original implementation by Prabhu et al. To run it, execute the following script:

cd ./Parabel/
./run.sh

P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./Parabel/scores.txt. The prediction results can be found in ./Parabel/Sandbox/Results/{dataset}/score_mat.txt.
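For readers unfamiliar with these metrics, the sketch below computes P@k and NDCG@k for a single paper using the definitions commonly used in extreme multi-label classification. It is an illustration only: the evaluation scripts in this repository may differ in details, and the function names are assumptions. Reported scores are typically averages of these per-paper values over the test set.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / k


def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k predictions divided by the best achievable DCG."""
    # A hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, label in enumerate(ranked[:k])
        if label in relevant
    )
    # Ideal ranking: all true labels (up to k of them) placed first.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Here `ranked` is a list of predicted label ids sorted by decreasing score and `relevant` is the set of true label ids. The base of the logarithm cancels in the NDCG ratio, so using base 2 does not affect the result.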

Running Transformer

GPUs are required. We use one NVIDIA GeForce GTX 1080 Ti GPU in our experiments.

The Transformer code is written in Python 3.6 and adapted from the original implementation by Xun et al. First, install the dependencies:

cd ./Transformer/
pip3 install -r requirements.txt

Then, download the GloVe embeddings (originally from here). After unzipping the downloaded file, put the resulting data/ folder under ./Transformer/. You can then run the code:

./run.sh

P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./Transformer/scores.txt. The prediction results can be found in ./Transformer/predictions.txt.

Running OAG-BERT

GPUs are required. We use one NVIDIA GeForce GTX 1080 Ti GPU in our experiments.

The code of OAG-BERT is written in Python 3.7. It is adapted from the original implementation by Liu et al. You need to first install PyTorch >= 1.7.1, and then the CogDL package. These two steps can be done by running the following:

cd ./OAGBERT/
./setup.sh

Then, you can run the code.

./run.sh

P@k and NDCG@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./OAGBERT/Parabel/scores.txt. The prediction results can be found in ./OAGBERT/Parabel/Sandbox/Results/{dataset}/score_mat.txt.

References

If you find the MAPLE benchmark or this repository useful, please cite our paper:

@inproceedings{zhang2023effect,
  title={The effect of metadata on scientific literature tagging: A cross-field cross-model study},
  author={Zhang, Yu and Jin, Bowen and Zhu, Qi and Meng, Yu and Han, Jiawei},
  booktitle={WWW'23},
  pages={1626--1637},
  year={2023}
}

The MAPLE benchmark is constructed from the Microsoft Academic Graph:

@inproceedings{sinha2015overview,
  title={An overview of microsoft academic service (mas) and applications},
  author={Sinha, Arnab and Shen, Zhihong and Song, Yang and Ma, Hao and Eide, Darrin and Hsu, Bo-June and Wang, Kuansan},
  booktitle={WWW'15},
  pages={243--246},
  year={2015}
}

The three classifiers in this repository are from the following three papers:

@inproceedings{prabhu2018parabel,
  title={Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising},
  author={Prabhu, Yashoteja and Kag, Anil and Harsola, Shrutendra and Agrawal, Rahul and Varma, Manik},
  booktitle={WWW'18},
  pages={993--1002},
  year={2018}
}

@inproceedings{xun2020correlation,
  title={Correlation networks for extreme multi-label text classification},
  author={Xun, Guangxu and Jha, Kishlay and Sun, Jianhui and Zhang, Aidong},
  booktitle={KDD'20},
  pages={1074--1082},
  year={2020}
}

@inproceedings{liu2022oag,
  title={Oag-bert: Towards a unified backbone language model for academic knowledge services},
  author={Liu, Xiao and Yin, Da and Zheng, Jingnan and Zhang, Xingjian and Zhang, Peng and Yang, Hongxia and Dong, Yuxiao and Tang, Jie},
  booktitle={KDD'22},
  pages={3418--3428},
  year={2022}
}