


This repository collects multiple Text-Attributed Graph (TAG) datasets from various sources and provides a unified approach for preprocessing and loading them. We also offer a standardized task generation pipeline for evaluating the performance of GNNs and LLMs on these datasets. The TAGLAS technical report is available on arXiv. The project is still under construction, so please expect more datasets and features in the future. Stay tuned!



The currently included datasets are:

| Dataset (key) | Avg. #nodes | Avg. #edges | #graphs | Task level | Task | Split (train/val/test) | Domain | Description | Source |
|---|---|---|---|---|---|---|---|---|---|
| Cora_node (cora_node) | 2708 | 10556 | 1 | Node | 7-way classification | 140/500/2068 | Co-Citation | Predict the category of papers. | Graph-LLM, OFA |
| Cora_link (cora_link) | 2708 | 10556 | 1 | Link | Binary classification | 17944/1056/2112 | Co-Citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA |
| Pubmed_node (pubmed_node) | 19717 | 88648 | 1 | Node | 3-way classification | 60/500/19157 | Co-Citation | Predict the category of papers. | Graph-LLM, OFA |
| Pubmed_link (pubmed_link) | 19717 | 88468 | 1 | Link | Binary classification | 150700/8866/17730 | Co-Citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA |
| Arxiv (arxiv) | 169343 | 1166243 | 1 | Node | 40-way classification | 90941/29799/48603 | Citation | Predict the category of papers. | OGB, OFA |
| WikiCS (wikics) | 11701 | 216123 | 1 | Node | 10-way classification | 580/1769/5847 | Wiki page | Predict the category of wiki pages. | PyG, OFA |
| Product-subset (products) | 54025 | 144638 | 1 | Node | 47-way classification | 14695/1567/36982 | Co-purchase | Predict the category of products. | TAPE |
| FB15K237 (fb15k237) | 14541 | 310116 | 1 | Link | 237-way classification | 272115/17535/20466 | Knowledge graph | Predict the relationship between two entities. | OFA |
| WN18RR (wn18rr) | 40943 | 93003 | 1 | Link | 11-way classification | 86835/3034/3134 | Knowledge graph | Predict the relationship between two entities. | OFA |
| MovieLens-1m (ml1m) | 9923 | 2000418 | 1 | Link | regression/5-way | 850177/50011/100021 | Movie rating | Predict the rating between users and movies. | PyG |
| Chembl_pretrain (chemblpre) | 25.87 | 55.92 | 365065 | Graph | 1048-way binary classification | 341952/0/0 | molecular | Predict the effectiveness of molecule to multiple assays. | GIMLET, OFA |
| PCBA (pcba) | 25.97 | 56.20 | 437929 | Graph | 128-way binary classification | 349854/43650/43588 | molecular | Predict the effectiveness of molecule to multiple assays. | GIMLET, OFA |
| HIV (hiv) | 25.51 | 54.94 | 41127 | Graph | Binary classification | 32901/4113/4113 | molecular | Predict the effectiveness of molecule to hiv. | GIMLET, OFA |
| BBBP (bbbp) | 24.06 | 51.91 | 2039 | Graph | Binary classification | 1631/204/204 | molecular | Predict the effectiveness of molecule to brain blood barrier. | GIMLET, OFA |
| BACE (bace) | 34.09 | 73.72 | 1513 | Graph | Binary classification | 1210/151/152 | molecular | Predict the effectiveness of molecule to BACE1 protease. | GIMLET, OFA |
| toxcast (toxcast) | 18.76 | 38.50 | 8575 | Graph | 588-way binary classification | 6859/858/858 | molecular | Predict the effectiveness of molecule to multiple assays. | GIMLET, OFA |
| esol (esol) | 13.29 | 27.35 | 1128 | Graph | Regression | 902/113/113 | molecular | Predict the solubility of the molecule. | GIMLET, OFA |
| freesolv (freesolv) | 8.72 | 16.76 | 642 | Graph | Regression | 513/64/65 | molecular | Predict the free energy of hydration of the molecule. | GIMLET, OFA |
| lipo (lipo) | 27.04 | 59.00 | 4200 | Graph | Regression | 3360/420/420 | molecular | Predict the lipophilicity of the molecule. | GIMLET, OFA |
| cyp450 (cyp450) | 24.52 | 53.02 | 16896 | Graph | 5-way binary classification | 13516/1690/1690 | molecular | Predict the effectiveness of molecule to CYP450 enzyme family. | GIMLET, OFA |
| tox21 (tox21) | 18.57 | 38.59 | 7831 | Graph | 12-way binary classification | 6264/783/784 | molecular | Predict the effectiveness of molecule to multiple assays. | GIMLET, OFA |
| muv (muv) | 24.23 | 52.56 | 93087 | Graph | 17-way binary classification | 74469/9309/9309 | molecular | Predict the effectiveness of molecule to multiple assays. | GIMLET, OFA |
| ExplaGraphs (expla_graph) | 5.17 | 4.25 | 2766 | Graph | Question Answering | 1659/553/554 | Commonsense | Common sense reasoning. | G-retriver |
| SceneGraphs (scene_graph) | 19.13 | 68.44 | 100000 | Graph | Question Answering | 59978/19997/20025 | scene graph | Scene graph question answering. | G-retriver |
| MAG240m-subset (mag240m) | 5875010 | 26434726 | 1 | Node | 153-way classification | 900722/63337/63338/132585 | Citation | Predict the category of papers. | OGB |
| Ultrachat200k (ultrachat200k) | 3.72 | 2.72 | 449929 | Graph | Question Answering | 400000/20000/29929 | Conversation | Answer the question given previous conversation. | UltraChat200k |




You can directly clone the repository into your working project by using the following command:

git clone https://github.com/JiaruiFeng/TAGLAS.git

We will provide a more user-friendly installation method in the future.



Load datasets

The basic way to load a dataset is by using its key. The dataset key can be found in the table above. For example, to load the Arxiv dataset:

from TAGLAS import get_dataset
dataset = get_dataset("arxiv")

You can also load multiple datasets at the same time:

from TAGLAS import get_datasets
dataset_list = get_datasets(["arxiv", "pcba"])

By default, all data files are saved in the ./TAGDataset directory under the repository root. If you want to change the data path, set the root parameter when loading the dataset:

from TAGLAS import get_datasets
dataset_list = get_datasets(["arxiv", "pcba"], root="your_path")

The functions above load each dataset with its default settings, which is suitable for most cases. However, some datasets accept additional arguments. For further control over the loading process, pass these arguments directly when loading the dataset:

from TAGLAS import get_dataset
dataset = get_dataset("fb15k237", to_undirected=False)

Finally, you can also import a dataset class directly:

from TAGLAS.datasets import Arxiv
dataset = Arxiv()

Data key description and basic usage

All data samples are stored as TAGData objects, a class that inherits from the Data class in the torch_geometric package. Each kind of information is stored under its own key. Most datasets contain the following keys:

Some datasets may also contain:

Here is a simple demonstration:

from TAGLAS import get_dataset
dataset = get_dataset("arxiv")
# Get node text feature for the whole dataset.
x = dataset.x
# Get the first graph sample in the dataset.
data = dataset[0]
# Get edge text feature for the sample.
edge_attr = data.edge_attr

Feature mapping

For graph-level datasets, every _map key (such as node_map or edge_map) stores, for each data sample, the indices of its features in the global feature list shared by all samples. The global features can be accessed by:

from TAGLAS import get_dataset
dataset = get_dataset("hiv")
# Get the global node text features.
x = dataset.x
# Get the global edge text features.
edge_attr = dataset.edge_attr

The feature for a specific sample can be obtained by:

from TAGLAS import get_dataset
dataset = get_dataset("hiv")
# Global node text features
x = dataset.x
data = dataset[0]
# Get the node text features of sample 0 by indexing the global features with its node_map.
sample_x = [x[i] for i in data.node_map]
# We also provide direct access to the text feature of each sample by:
sample_x = dataset[0].x

For node/edge-level datasets, since they contain only one graph, the local map is also the global map, and the logic remains the same. The reason we store the features this way is to avoid repeated text features, especially for large datasets with only a few unique text features (like molecule datasets).
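
To make the storage scheme concrete, here is a minimal, library-free sketch of the idea: texts are deduplicated into one global list, and each sample keeps only a list of indices into it. The helper build_global_features and the toy data are hypothetical and not part of TAGLAS; they only illustrate the role that x and node_map play.

```python
from typing import Dict, List, Tuple


def build_global_features(
    samples: List[List[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Deduplicate texts across samples (hypothetical helper, not TAGLAS code).

    Returns the list of unique texts (the "global" features) and, for each
    sample, a list of indices into that list (playing the role of node_map).
    """
    global_x: List[str] = []
    index_of: Dict[str, int] = {}
    node_maps: List[List[int]] = []
    for texts in samples:
        node_map = []
        for text in texts:
            if text not in index_of:
                # First time this text is seen: append it to the global list.
                index_of[text] = len(global_x)
                global_x.append(text)
            node_map.append(index_of[text])
        node_maps.append(node_map)
    return global_x, node_maps


# Two toy "molecules" whose atoms share a small vocabulary of descriptions.
samples = [
    ["carbon atom", "oxygen atom", "carbon atom"],
    ["carbon atom", "nitrogen atom"],
]
global_x, node_maps = build_global_features(samples)

# Recover the features of sample 0 by indexing the global list with its map.
sample_x = [global_x[i] for i in node_maps[0]]
```

Five node texts are stored as three unique strings here; for molecule datasets with hundreds of thousands of graphs and a small atom vocabulary, the saving is much larger.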


Supported tasks

In this repository, we provide a unified way to generate tasks based on datasets. Currently, we support the following five task types:

Load tasks

To load a specific task, simply call:

from TAGLAS import get_task
# Load default node-level task on cora
task = get_task("cora_node", "default")
# Load subgraph_text edge-level task on pubmed and val split
task = get_task("pubmed_link", "subgraph_text", split="val")

Similarly, you can load multiple tasks at the same time:

from TAGLAS import get_tasks
# Load QA tasks on all datasets.
tasks = get_tasks(["cora_node", "arxiv", "wn18rr", "scene_graph"], "QA")
# Specify task type for each dataset.
tasks = get_tasks(["cora_node", "arxiv"], ["QA", "subgraph_text"])

By default, generated tasks are not saved. For faster loading and repeated experiments, you can save and reload generated tasks by:

from TAGLAS import get_task
# save_data saves the generated task into the corresponding folder. load_saved tries to load a saved task first and only generates a new one if none is found.
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True)
# By default, the saved task file is named after the important arguments used (like split, hop...). You can also specify the name yourself:
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True, save_name="your_name")
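
The save_data / load_saved behaviour described above follows a common load-or-generate caching pattern. The sketch below illustrates that pattern in plain Python; it is not the TAGLAS implementation. The helpers cache_name and load_or_generate, the file-naming scheme, and the use of pickle are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def cache_name(task_type: str, **kwargs: Any) -> str:
    """Build a file name from the arguments that identify a generated task.

    Hypothetical naming scheme; the actual TAGLAS file names may differ.
    """
    parts = [task_type] + [f"{k}={kwargs[k]}" for k in sorted(kwargs)]
    return "_".join(parts) + ".pkl"


def load_or_generate(
    folder: Path,
    name: str,
    generate: Callable[[], Any],
    save_data: bool = True,
    load_saved: bool = True,
) -> Any:
    """Return the cached object if present, otherwise generate (and save) it."""
    path = folder / name
    if load_saved and path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    data = generate()
    if save_data:
        folder.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(data, f)
    return data


calls = []


def expensive_generation() -> dict:
    """Stand-in for task generation; records each call."""
    calls.append(1)
    return {"samples": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    name = cache_name("subgraph_text", split="test", hop=3)
    # First call generates and saves; second call loads from disk.
    first = load_or_generate(Path(tmp), name, expensive_generation)
    second = load_or_generate(Path(tmp), name, expensive_generation)
```

Including the identifying arguments in the file name keeps tasks generated with different settings from overwriting each other, which is presumably why the default name is derived from them.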

You can also construct a task directly from a dataset:

from TAGLAS.datasets import Arxiv
from TAGLAS.tasks import SubgraphTextNPTask
dataset = Arxiv()
# Load subgraph_text node-level task on Arxiv dataset.
task = SubgraphTextNPTask(dataset)

Convert text feature to sentence embedding

For the default_text, subgraph_text, and QA task types, we also provide a function to convert raw text features to sentence embeddings:

from TAGLAS import get_task
from TAGLAS.tasks.text_encoder import SentenceEncoder
encoder_name = "ST"
encoder = SentenceEncoder(encoder_name)
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
arxiv_task.convert_text_to_embedding(encoder_name, encoder)

In TAGLAS, we implement several commonly used LLMs for sentence embedding, including ST (Sentence Transformer), BERT (vanilla BERT), e5 (E5), llama2_7b (Llama2-7b), and llama2_13b (Llama2-13b). You can load a different model by passing the respective model key to SentenceEncoder. Additionally, you can implement your own sentence embedding model, as long as it has a __call__ method that converts a list of input texts into embeddings.
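
As a rough illustration of the custom-encoder contract, the sketch below defines a toy class whose __call__ takes a list of texts and returns one vector per text. HashingEncoder is hypothetical and not part of TAGLAS. It returns plain Python lists to stay dependency-free; a real encoder would more likely return a tensor, and the exact return type TAGLAS expects is an assumption to verify against the library.

```python
import hashlib
from typing import List, Sequence


class HashingEncoder:
    """Toy bag-of-words hashing encoder (hypothetical, not part of TAGLAS)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        """Map each text to a fixed-size vector of token counts per bucket."""
        embeddings = []
        for text in texts:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # Stable hash so the same token always lands in the same bucket.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.dim
                vec[bucket] += 1.0
            embeddings.append(vec)
        return embeddings


encoder = HashingEncoder(dim=8)
embeddings = encoder(["graph neural network", "graph"])
```

An object like this would then be passed in place of the SentenceEncoder instance in the snippet above, together with a name of your choice.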


For all tasks in TAGLAS, we provide a unified collate function. Call it by:

from TAGLAS import get_task
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
# Call collate function to get a batch of data
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])

The collate function is implemented based on torch_geometric.loader.dataloader.Collater, with one major difference: for text feature keys like x and edge_attr, the batch stores only the unique text features. Each _map key then stores, for every element in the batch (node or edge), the index of its text feature among those unique features.

from TAGLAS import get_task
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])
# to get node text features for all nodes in the batch
x = batch.x[batch.node_map]
# to get edge text features for all edges in the batch
edge_attr = batch.edge_attr[batch.edge_map]

In this way, the batch data is more memory and computation efficient, as each unique text only needs to be encoded once.
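
The "encode once" benefit can be shown with a small library-free sketch. CountingEncoder and encode_with_map are hypothetical helpers, and the toy batch stands in for batch.x and batch.node_map. The encoder is called only on the unique texts, and the result is expanded to one row per node through the map.

```python
from typing import Callable, List, Sequence


def encode_with_map(
    unique_texts: Sequence[str],
    index_map: Sequence[int],
    encode: Callable[[Sequence[str]], List[List[float]]],
) -> List[List[float]]:
    """Encode each unique text once, then expand to one row per element."""
    unique_emb = encode(unique_texts)
    return [unique_emb[i] for i in index_map]


class CountingEncoder:
    """Toy encoder (hypothetical) that records how many texts it has encoded."""

    def __init__(self) -> None:
        self.num_encoded = 0

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        self.num_encoded += len(texts)
        # Stand-in embedding: character count and word count of each text.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


# A toy batch with 6 nodes but only 2 distinct node texts.
batch_x = ["carbon atom", "hydrogen atom"]
batch_node_map = [0, 0, 1, 0, 1, 0]

encoder = CountingEncoder()
node_emb = encode_with_map(batch_x, batch_node_map, encoder)
```

Six node embeddings are produced while the encoder processes only two texts; with an LLM-based encoder, this is where most of the savings come from.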


For each dataset and task, we provide a default evaluation tool based on torchmetrics. Specifically, for each dataset, we support two types of evaluation based on its supported task types.

To get an evaluator for a certain task, simply call:

from TAGLAS import get_evaluator, get_evaluators
# Get the default evaluator for the cora_node task. metric_name is a string giving the name of the metric.
metric_name, evaluator = get_evaluator("cora_node", "subgraph_text")
# Get QA evaluator for arxiv
metric_name, evaluator = get_evaluator("arxiv", "QA")
# Get evaluator for multiple input tasks.
metric_name_list, evaluator_list = get_evaluators(["cora_node", "arxiv"], "QA")

Issues and Bugs

The project is still in development. If you encounter any issues or bugs while using it, please feel free to open an issue in the GitHub repository.


If you find TAGLAS helpful in your project, please consider citing it. Thank you!

      title={TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models}, 
      author={Jiarui Feng and Hao Liu and Lecheng Kong and Yixin Chen and Muhan Zhang},