What is Dataset Distillation Learning?
After cloning the repo, download the distilled data and pretrained models from Google Drive. Then, create the conda environment:
conda create -n learning python=3.10.12
conda activate learning
pip install -r requirements.txt
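A quick way to sanity-check the downloaded artifacts is to load them with PyTorch. A minimal sketch, assuming the Google Drive download is unpacked into the repo root; the file paths below are placeholders, not the actual names:

    import torch

    # Placeholder paths -- substitute the actual filenames from the Google Drive download.
    distilled = torch.load("distilled_data/distilled.pt", map_location="cpu")
    pretrained_state = torch.load("pretrained_models/model.pt", map_location="cpu")
    print(type(distilled), type(pretrained_state))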
Distilled vs. Real Data
To reproduce the analyses from Section 3 of the paper, refer to the two Jupyter notebooks in experiment_code/replacement_analysis.
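At its core, this analysis trains identical networks on distilled data and on equally sized subsets of real data, then compares their test accuracy. The sketch below illustrates that comparison under simplifying assumptions (a linear probe stands in for the paper's architecture; distilled_x/distilled_y, real_x/real_y, and test_loader are assumed to be loaded already). The notebooks implement the full protocol.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_and_eval(train_x, train_y, test_loader, epochs=20, device="cuda"):
        # Train a fresh classifier on the given data, then report test accuracy.
        # A linear model stands in for the paper's architecture in this sketch.
        model = nn.Sequential(nn.Flatten(), nn.Linear(train_x[0].numel(), 10)).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        loader = DataLoader(TensorDataset(train_x, train_y), batch_size=64, shuffle=True)
        model.train()
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                nn.functional.cross_entropy(model(x.to(device)), y.to(device)).backward()
                opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                correct += (model(x.to(device)).argmax(1) == y.to(device)).sum().item()
                total += y.numel()
        return correct / total

    # Compare distilled data against an equally sized random subset of real data.
    # idx = torch.randperm(len(real_x))[: len(distilled_x)]
    # acc_distilled = train_and_eval(distilled_x, distilled_y, test_loader)
    # acc_real = train_and_eval(real_x[idx], real_y[idx], test_loader)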
Information Captured by Distilled Data
Predictions of models trained on distilled data are similar to those of models trained with early stopping
Generate a pool of subset-trained models and early-stopped models by running:
python experiment_code/agreement_analysis/generate_models.py
Finally, compare and visualize the prediction differences using the accompanying Jupyter notebook.
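The comparison itself boils down to an agreement rate: the fraction of test inputs on which two models predict the same label. A minimal sketch of that metric (the model and loader names are placeholders):

    import torch

    @torch.no_grad()
    def prediction_agreement(model_a, model_b, loader, device="cuda"):
        # Fraction of examples on which the two models predict the same label.
        model_a.eval()
        model_b.eval()
        agree = total = 0
        for x, _ in loader:
            x = x.to(device)
            agree += (model_a(x).argmax(1) == model_b(x).argmax(1)).sum().item()
            total += x.size(0)
        return agree / total

    # e.g. prediction_agreement(distilled_trained_model, early_stopped_model, test_loader)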
Recognition on the distilled data is learned early in the training process
Refer to the Jupyter notebook.
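For reference, the measurement behind this claim tracks how quickly a model trained on real data becomes accurate on the distilled examples. A hedged sketch of that per-epoch probe (train_one_epoch and the tensors are hypothetical placeholders; the notebook contains the actual experiment):

    import torch

    @torch.no_grad()
    def distilled_accuracy(model, distilled_x, distilled_y, device="cuda"):
        # Accuracy of the current checkpoint on the distilled examples.
        model.eval()
        preds = model(distilled_x.to(device)).argmax(1)
        return (preds == distilled_y.to(device)).float().mean().item()

    # Inside a standard real-data training loop:
    # for epoch in range(num_epochs):
    #     train_one_epoch(model, real_loader, optimizer)  # hypothetical helper
    #     print(epoch, distilled_accuracy(model, distilled_x, distilled_y))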
Distilled data stores little information beyond what would be learned early in training
Refer to the Jupyter notebook.
Note: computing the Hessian approximation over the full training data takes a long time (10+ hours on an L40 GPU). Comment out this Hessian computation if compute resources are limited (the Hessian calculation on distilled data is notably less resource-intensive).
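For context on the cost gap: Hessian computations scale with the number of examples the loss touches, so the handful of distilled images is far cheaper than the full training set. Below is a minimal sketch of one common approximation, a Hutchinson trace estimate via Hessian-vector products; this is an illustration, not the notebook's exact procedure:

    import torch

    def hessian_trace_estimate(model, loss_fn, x, y, n_samples=10):
        # Hutchinson estimator: E[v^T H v] = tr(H) for random +/-1 vectors v,
        # computed with double-backward Hessian-vector products.
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat_grad = torch.cat([g.reshape(-1) for g in grads])
        estimate = 0.0
        for _ in range(n_samples):
            v = torch.randint_like(flat_grad, 2) * 2.0 - 1.0  # Rademacher +/-1 entries
            hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
            estimate += (torch.cat([h.reshape(-1) for h in hv]) @ v).item()
        return estimate / n_samples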
Semantics of Captured Information
Use the Jupyter notebook to generate the qualitative analysis (Figure 10) of the paper. For the quantitative analysis, refer to the LLaVA repo.