Scaling Laws for Data Filtering

Registering data buckets

Register the buckets in all_paths_128.py. This file contains the following information:

Estimating data bucket parameters

This step involves estimating the scaling parameters for each bucket of interest.

Grid search to find the bucket scaling parameters

Grid search is used to find the best scaling parameters for each bucket. It is implemented in grid_search.py, and the objective it minimizes is defined in objective.py. We chose grid search because of the instabilities observed in scipy-based optimization methods.
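
As a rough illustration of the idea (not the contents of grid_search.py or objective.py), the sketch below fits a simple power law by exhaustive grid search. The curve `a * n**b + d`, the mean-squared-error objective, the grid ranges, and all function names are assumptions made for this example. Only the roles of `a`, its upper limit, and the irreducible loss `d` come from this README.

```python
import itertools

import numpy as np


def power_law(n, a, b, d):
    """Hypothetical scaling curve: error = a * n**b + d (assumed form)."""
    return a * np.power(n, b) + d


def mse_objective(a, b, n, y, d):
    """Mean squared error between the assumed curve and observed errors."""
    return float(np.mean((power_law(n, a, b, d) - y) ** 2))


def grid_search(n, y, a_grid, b_grid, d):
    """Evaluate every (a, b) pair on the grid and keep the lowest objective."""
    best_params, best_loss = None, float("inf")
    for a, b in itertools.product(a_grid, b_grid):
        loss = mse_objective(a, b, n, y, d)
        if loss < best_loss:
            best_params, best_loss = (a, b), loss
    return best_params, best_loss


# Synthetic example: the observations are generated from a known grid point.
d = 0.1                                   # irreducible loss, held fixed
a_grid = np.linspace(0.0, 0.02, 21)       # 0.02 plays the role of a_upper
b_grid = np.linspace(-1.0, 0.0, 11)
n = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # samples seen (arbitrary units)
y = power_law(n, a_grid[10], b_grid[3], d)

best_params, best_loss = grid_search(n, y, a_grid, b_grid, d)
print(best_params, best_loss)
```

A grid search like this needs no gradients or starting point, so it avoids the convergence issues an iterative optimizer can run into. The cost is that the number of evaluations grows with the product of the grid sizes.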

Objective Functions

objective.py implements scaling laws based on FADU, as well as those inspired by the work on Scaling Data Constrained Language Models.

python process_128_grid.py --a_upper 0.02 --objective effective_utility  --d 0.1

Here, a_upper sets the upper limit of the grid search over a, and d is the irreducible loss. Refer to ablations/finding_a.py if you want to jointly minimize a across the pools. Copy the obtained scaling parameters into results/parameter_values.py under an appropriate key name.

Finding best bucket combination

python estimate_best_pool.py --key given_key_name