ScanDL

[Figure: ScanDL output]

This repository contains the code to reproduce the experiments in ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts.

Summary

[Figure: ScanDL workflow]

Setup

Clone this repository

git clone git@github.com:dili-lab/scandl
cd scandl

Install requirements

The code is based on PyTorch and the Hugging Face libraries.

pip install -r requirements.txt

Download data

The CELER data can be downloaded from this link; follow the instructions given there.

The ZuCo data can be downloaded from this OSF repository. You can use scripts/get_zuco_data.sh to download the ZuCo data automatically. Note that ZuCo is a large dataset and requires a lot of storage.

In the file CONSTANTS.py, adapt the path to the folder that contains both the celer and the zuco directories. If you use the bash script scripts/get_zuco_data.sh mentioned above, the zuco path is data/. Make sure there are no whitespaces in the names of the zuco directories (there may be some after downloading the data). Check sp_load_celer_zuco.load_zuco() for the expected spelling of the directories.
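The whitespace check on the directory names can be automated. The following is a minimal sketch, not part of the repository: the helper name find_dirs_with_whitespace is hypothetical, and data is assumed to be the ZuCo location, as is the case when scripts/get_zuco_data.sh was used. It only reports offending directories; rename them by hand to match the spelling expected by sp_load_celer_zuco.load_zuco().

```python
from pathlib import Path


def find_dirs_with_whitespace(root):
    """Return all directories below `root` whose own name contains whitespace."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_dir() and any(ch.isspace() for ch in path.name)
    )


if __name__ == "__main__":
    # Assumption: the ZuCo data lives in "data", as with scripts/get_zuco_data.sh.
    for path in find_dirs_with_whitespace("data"):
        print(f"directory name contains whitespace: {path}")
```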

Preprocess data

Preprocessing the eye-tracking data takes time. It is thus recommended to perform the preprocessing once for each setting and save the preprocessed data in a directory processed_data. This not only saves time when training is run several times but also ensures that every training run in the same setting uses the same data splits. To preprocess and save the data, run

python -m scripts.create_data_splits
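Why saving the preprocessed data fixes the splits can be illustrated with a small, generic sketch. It is not the code of scripts.create_data_splits; the function load_or_create_split, its parameters, and the JSON file format are assumptions made for illustration only. The split is computed once, written to disk, and simply read back on every later call.

```python
import json
import random
from pathlib import Path


def load_or_create_split(item_ids, out_file, test_fraction=0.2, seed=0):
    """Create a train/test split once, save it, and reuse it on later calls."""
    out_file = Path(out_file)
    if out_file.exists():
        # A split was saved earlier: reuse it so every run sees the same data.
        return json.loads(out_file.read_text())

    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    split = {"test": ids[:n_test], "train": ids[n_test:]}

    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps(split))
    return split
```

Because the saved file takes precedence, a later call with a different seed still returns the original split.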

Training

Execute the following commands to perform the training.

Notes

Training Commands

To execute the training commands below, you need GPUs set up with CUDA.

New Reader setting

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion reader

New Sentence setting

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion sentence

New Reader/New Sentence setting

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion combined

Cross-dataset setting

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference zuco \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --notes cross_dataset \
    --data_split_criterion scanpath
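The four commands above differ only in --inference, --data_split_criterion and, for the cross-dataset setting, --notes. As a convenience, they can be generated from one table. This is a minimal sketch: the names COMMON_FLAGS, SETTINGS, and build_train_command are hypothetical and not part of the repository, and all flag values are copied from the commands above.

```python
import shlex

# Flags shared by the four training commands above (values copied from them).
COMMON_FLAGS = {
    "--corpus": "celer",
    "--load_train_data": "processed_data",
    "--num_transformer_heads": "8",
    "--num_transformer_layers": "12",
    "--hidden_dim": "256",
    "--noise_schedule": "sqrt",
    "--learning_steps": "80000",
    "--log_interval": "500",
    "--eval_interval": "500",
    "--save_interval": "5000",
}

# Flags that differ between the settings.
SETTINGS = {
    "reader": {"--inference": "cv", "--data_split_criterion": "reader"},
    "sentence": {"--inference": "cv", "--data_split_criterion": "sentence"},
    "combined": {"--inference": "cv", "--data_split_criterion": "combined"},
    "cross_dataset": {
        "--inference": "zuco",
        "--notes": "cross_dataset",
        "--data_split_criterion": "scanpath",
    },
}


def build_train_command(setting, nproc_per_node=4, master_port=12233):
    """Return the training command for `setting` as an argument list."""
    flags = {**COMMON_FLAGS, **SETTINGS[setting]}
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_port={master_port}",
        "--use_env", "scripts/sp_run_train.py",
    ]
    for flag, value in flags.items():
        cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    # Print one shell-ready command per setting.
    for name in SETTINGS:
        print(shlex.join(build_train_command(name)))
```

The returned list can be passed to subprocess.run directly. The flags appear in a different order than in the commands above, which should not matter for named command-line arguments.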

Ablation: without positional embedding and BERT embedding (New Reader/New Sentence)

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train_ablation.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 50 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion combined \
    --notes ablation-no-pos-bert

Ablation: without condition (sentence): unconditional scanpath generation (New Reader/New Sentence)

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train_ablation_no_condition.py \
    --corpus celer \
    --inference cv \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule sqrt \
    --learning_steps 80000 \
    --log_interval 50 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion combined \
    --notes ablation-no-condition

Ablation: cosine noise schedule (New Reader/New Sentence)

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule cosine \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion combined

Ablation: linear noise schedule (New Reader/New Sentence)

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12233 \
    --use_env scripts/sp_run_train.py \
    --corpus celer \
    --inference cv \
    --load_train_data processed_data \
    --num_transformer_heads 8 \
    --num_transformer_layers 12 \
    --hidden_dim 256 \
    --noise_schedule linear \
    --learning_steps 80000 \
    --log_interval 500 \
    --eval_interval 500 \
    --save_interval 5000 \
    --data_split_criterion combined

Inference

NOTES

If you run several inference processes at the same time, make sure to choose a different --seed for each of them. During training, the model is saved at many checkpoints. To run inference on every checkpoint, omit the argument --run_only_on. However, inference is time-consuming, so it is sensible to run it only on selected checkpoints. To do so, pass the exact path to the saved model to --run_only_on.
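To pick a checkpoint for --run_only_on, it can help to list what was saved. The sketch below is not part of the repository: list_checkpoints and seeds_for_processes are hypothetical helpers, and the glob pattern assumes that checkpoint files follow the ema_*.pt naming that appears in the inference commands of this README.

```python
from pathlib import Path


def list_checkpoints(model_dir):
    """Map each sub-directory of `model_dir` (e.g. fold-0) to its ema_*.pt files."""
    model_dir = Path(model_dir)
    checkpoints = {}
    for path in sorted(model_dir.rglob("ema_*.pt")):
        # Files that sit directly in model_dir are grouped under ".".
        key = str(path.parent.relative_to(model_dir))
        checkpoints.setdefault(key, []).append(path)
    return checkpoints


def seeds_for_processes(n_processes, first_seed=60):
    """Return one distinct seed per concurrent inference process."""
    return [first_seed + i for i in range(n_processes)]
```

Each listed path can be passed to --run_only_on as is, and each concurrent process gets its own entry from seeds_for_processes as its --seed.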

Inference Commands

Adapt the placeholders [MODEL_DIR], [FOLD_IDX], and [STEPS] in the commands below.

For the cross-validation settings:

python -u scripts/sp_run_decode.py \
    --model_dir checkpoint-path/[MODEL_DIR] \
    --seed 60 \
    --split test \
    --cv \
    --no_gpus 1 \
    --bsz 24 \
    --run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt' \
    --load_test_data processed_data

Cross-dataset:

python -u scripts/sp_run_decode.py \
    --model_dir checkpoint-path/[MODEL_DIR] \
    --seed 60 \
    --split test \
    --no_gpus 1 \
    --bsz 24 \
    --run_only_on 'checkpoint-path/[MODEL_DIR]/ema_0.9999_0[STEPS].pt' \
    --load_test_data processed_data

Ablation: without positional embedding and BERT embedding (New Reader/New Sentence)

python -u scripts/sp_run_decode_ablation.py \
    --model_dir checkpoint-path/[MODEL_DIR] \
    --seed 60 \
    --split test \
    --cv \
    --no_gpus 1 \
    --bsz 24 \
    --load_test_data processed_data \
    --run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt'

Ablation: without condition (sentence): unconditional scanpath generation (New Reader/New Sentence)

python -u scripts/sp_run_decode_ablation_no_condition.py \
    --model_dir checkpoint-path/[MODEL_DIR] \
    --seed 60 \
    --split test \
    --cv \
    --no_gpus 1 \
    --bsz 24 \
    --run_only_on 'checkpoint-path/[MODEL_DIR]/fold-[FOLD_IDX]/ema_0.9999_0[STEPS].pt'

Evaluation

To run the evaluation on the ScanDL output, again pass the model directory in generation_outputs as [MODEL_DIR].

Use the argument --cv for the evaluation of all cross-validation settings.

For all cases except for the Cross-dataset:

python -m scripts.sp_eval --generation_outputs [MODEL_DIR] --cv

For the Cross-dataset setting:

python -m scripts.sp_eval --generation_outputs [MODEL_DIR]

Psycholinguistic Analysis

To run the psycholinguistic analysis, first compute the reading measures and the psycholinguistic effects. Set [MODEL_DIR] to the model directory in generation_outputs.

NOTES

python model_analyses/psycholinguistic_analysis.py --model [MODEL_DIR] --steps [N_STEPS] --setting [SETTING] --seed [SEED]

The reading measure files will be stored in the directory pl_analysis/reading_measures.

To fit the generalized linear models, run

Rscript --vanilla model_analyses/compute_effects.R --setting [SETTING] --steps [N_STEPS]

The fitted models will be saved as RDS-files in the directory model_fits.

To compare the effect sizes between the different models, run

Rscript --vanilla model_analyses/analyze_fit.R --setting [SETTING] --steps [N_STEPS]

Citation

If you are using ScanDL, please consider citing our work:

@inproceedings{bolliger2023scandl,
    author = {Bolliger, Lena S. and Reich, David R. and Haller, Patrick and Jakobi, Deborah N. and Prasse, Paul and J{\"a}ger, Lena A.},
    title = {{S}can{DL}: {A} Diffusion Model for Generating Synthetic Scanpaths on Texts},
    booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
    year = {2023},
    publisher = {Association for Computational Linguistics},
}

Acknowledgements

As indicated in the paper, our code is based on the implementation of DiffuSeq.