Awesome
dps
data process service
Overview
The data process pipeline of recommendation system research usually follows the fashion of load raw data, filter data, reindex data, split train/val/test datasets, save file and also negative sampling for training recommendation models. This repo serves as a general tool for those above data process operations.
Running examples
data process of a sampled taobao CTR dataset including 1M user-item interactions.
cd examples
python taobao_ctr.py
Requirements
numpy
pandas
scipy
absl-py
Components
The best entry point to use the following components is a DataFrame with 'uid', 'iid' and 'ts' in its columns.
loader
CsvLoader: load csv file
CooLoader: load coo file (sparse matrix in coordidate format)
JsonLoader: load json file
filter
CFFilter: k-core filter
DuplicationFilter: filter duplicated records with the earliest record left
reindexer
Reindexer: reindex uid and iid, start from 0
splitter
AbsoluteSplitter: split the dataset with test and validation sample number fixed
PercentageSplitter: split the dataset proportionally in chronological order
RandomSplitter: split the dataset randomly
SkewSplitter: split the dataset into biased and unbiased parts according to related literatures (PF, CausE, DICE).
generator
CooGenerator: generate sparse matrix in coordinate format
LilGenerator: generate sparse matrix in lists in list format
DokGenerator: generate sparse matrix in dictionary of keys format
transformer
SparseTransformer: perform sparse matrix format transformation from coo to lil and dok
saver
CsvSaver: save DataFrame to file
CooSaver: save coo matrix to file
JsonSaver: save dict to file
reporter
CsvReporter: report statistics of the data
sampler
PointSampler: negative sampling for pointwise optimization such as logloss
PairSampler: negtive sampling for pairwise optimization such as bprloss