# datapipelines

Iterable datapipelines for pytorch training.
The functions `sdata.create_dataset()` and `sdata.create_loader()` provide the interfaces for your pytorch training code, where the former returns a dataset and the latter a wrapper around a pytorch dataloader.
A dataset as returned by `sdata.create_dataset()` consists of 5 main modules, which should be defined in a yaml-config (a hypothetical sketch of such a config follows the list):

- A base `datapipeline`, which reads data as tar files from the local fs and assembles them into samples. Each sample comes as a python-dict.
- A list of `preprocessors`, which can be used either to transform the entries of a sample or to filter out unsuitable samples. The former kind are called `mappers`, the latter `filters`. This repository provides a basic set of mappers and filters which implement basic (not too application-specific) data transforms and filters.
- A list of `decoders`, whose elements can either be defined as a string matching one of the predefined webdataset image decoders or as a custom decoder (in the config-style) for handling more specific needs. Note that decoding will be skipped altogether when setting `decoders=None` (or, in config-style yaml, `decoders: null`).
- A list of `postprocessors`, which are used to filter or transform the data after it has been decoded and should again be either `mappers` or `filters`.
- An `error_handler`: a webdataset-style function for handling any errors which occur in the `datapipeline`, `preprocessors`, `decoders` or `postprocessors`.
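To make this structure concrete, a minimal yaml-config wiring up these five modules could look roughly like the sketch below. Only the five top-level keys come from the list above; the `target`/`params` instantiation style, all class paths and all parameter values are assumptions made for illustration, so refer to the configs in `examples/configs/` for the actual layout.

```yaml
# Hypothetical dataset config; the top-level keys follow the five modules
# above, all targets and parameter values are placeholder assumptions.
datapipeline:
  target: sdata.SomeTarDataPipeline          # hypothetical class path
  params:
    urls: "/data/shards/{00000..00099}.tar"  # local webdataset tar shards

preprocessors:                               # mappers/filters, run before decoding
  - target: my_project.mappers.RenameKeys    # hypothetical mapper
    params:
      key_map:
        jpg: image

decoders:
  - "pil"              # a predefined webdataset image decoder
# decoders: null       # alternatively, skip decoding altogether

postprocessors:                              # mappers/filters, run after decoding
  - target: my_project.filters.MinimumSizeFilter  # hypothetical filter
    params:
      min_size: 256

error_handler:
  target: webdataset.warn_and_continue       # webdataset-style handler
```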
A wrapper around a pytorch dataloader, which can be plugged into your training, is returned by `sdata.create_loader()`. You can pass the dataset either as an `IterableDataset` as returned by `sdata.create_dataset()` or via the config which would instantiate this dataset. Apart from the known `batch_size`, `num_workers`, `partial` and `collation_fn` parameters for pytorch dataloaders, the function can be configured via the following arguments:
- `batched_transforms`: a list of batched `mappers` and `filters` which transform an entire training batch before it is passed on, defined in the same style as the `preprocessors` and `postprocessors` from above.
- `loader_kwargs`: additional keyword arguments for the dataloader (such as `prefetch_factor`, ...).
- `error_handler`: a webdataset-style function for handling any errors which occur in the `batched_transforms`.
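To show how the two entry points fit together, here is a minimal sketch of a training-side usage, assuming `sdata.create_dataset()` accepts the loaded config object and `sdata.create_loader()` takes the dataset as its first argument. The config path, the use of OmegaConf for loading, the hyperparameter values and the plain-callable batched mapper are all placeholder assumptions.

```python
import sdata
from omegaconf import OmegaConf  # assumption: yaml configs load fine with OmegaConf


def scale_images(batch):
    """Hypothetical batched mapper: a callable acting on the whole batch dict."""
    batch["image"] = batch["image"].float() / 255.0
    return batch


# Path is a placeholder; see examples/configs/ for real configs.
config = OmegaConf.load("examples/configs/my_dataset.yaml")

# Build the IterableDataset described by the config ...
dataset = sdata.create_dataset(config)

# ... and wrap it in a pytorch dataloader. The keyword names are the ones
# documented above; the concrete values are arbitrary placeholders.
loader = sdata.create_loader(
    dataset,                            # alternatively, pass the dataset config itself
    batch_size=8,
    num_workers=4,
    partial=False,                      # assumed webdataset semantics: drop an incomplete final batch
    batched_transforms=[scale_images],  # assumption: plain callables are accepted here
    loader_kwargs={"prefetch_factor": 2},
)

for batch in loader:
    ...  # each batch is ready for the training step
```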
## Examples

For the following examples, it is most instructive to look at the configs in `examples/configs/`; they show how everything fits together.

For a simple example, see `examples/image_simple.py` together with its config in `examples/configs/`.

NOTE: You have to add your own dataset in tar form, which should follow the webdataset format. To find the parts which have to be adapted, search for comments containing `USER:` in the respective config.
## Installation

### Pytorch 2 and later

```bash
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install wheel
pip3 install -r requirements_pt2.txt
```
### Pytorch 1.13

```bash
python3 -m venv .pt1
source .pt1/bin/activate
pip3 install wheel
pip3 install -r requirements_pt1.txt
```