Product-based Neural Networks for User Response Prediction
Note: An extended version of the conference paper is available at https://arxiv.org/abs/1807.00311 and has been accepted by TOIS.
Compared with this simple demo, a more detailed implementation of the journal paper is at https://github.com/Atomu2014/product-nets-distributed, which has large-scale data access, multi-GPU support, and distributed training support.
Note: I would like to share some interesting and advanced discussions from the extended version.
Note: If you run into any problems, you can contact me at kevinqu16@gmail.com. I respond quickly by email.
This repository maintains the demo code of the paper Product-based Neural Network for User Response Prediction and other baseline models, implemented with TensorFlow.
The paper was published at ICDM 2016.
Introduction to User Response Prediction
User response prediction plays a fundamental and crucial role in today's business, especially in personalized recommender systems and online display advertising.
Unlike traditional machine learning tasks, user response prediction typically involves categorical features grouped by different fields, which we call multi-field categorical data, e.g.:
ad. request={
'weekday': 3,
'hour': 18,
'IP': 255.255.255.255,
'domain': xxx.com,
'advertiser': 2997,
'click': 1
}
In practice, these categorical features are usually one-hot encoded for training.
However, this representation results in sparsity.
To cope with data sparsity, linear models (e.g., LR), latent factor-based models (e.g., FM, FFM), tree models (e.g., GBDT), and DNN models (e.g., FNN, DeepFM) have been proposed.
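To make the sparsity concrete, here is a minimal sketch of one-hot encoding for multi-field records. The helper names (build_index, one_hot_indices) and the second record (including 'yyy.com') are hypothetical and only for illustration; they are not part of this repository.

```python
def build_index(records, fields):
    """Assign one global index to every (field, value) pair seen in the data."""
    index = {}
    for rec in records:
        for f in fields:
            index.setdefault((f, rec[f]), len(index))
    return index


def one_hot_indices(rec, fields, index):
    """Return the positions of the ones; every other position is zero."""
    return [index[(f, rec[f])] for f in fields]


fields = ['weekday', 'hour', 'domain', 'advertiser']
records = [
    {'weekday': 3, 'hour': 18, 'domain': 'xxx.com', 'advertiser': 2997, 'click': 1},
    # hypothetical second record, made up for this example
    {'weekday': 4, 'hour': 18, 'domain': 'yyy.com', 'advertiser': 2997, 'click': 0},
]

index = build_index(records, fields)
active = one_hot_indices(records[1], fields, index)
print(len(index), active)
```

Each record activates exactly one position per field, so the number of non-zero entries stays equal to the number of fields while the vector length grows with every new value.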
A core problem in user response prediction is how to represent complex feature interactions. Industrial applications prefer feature engineering and simple models. As GPU servers become more widely available, it is promising to design more complex models that explore feature interactions automatically. Through our analysis and experiments, we find a coupled gradient issue in latent factor-based models and an insensitive gradient issue in DNN models.
Take FM as an example: the gradient of each feature vector is the sum over the other feature vectors. Even if two features are independent, FM can hardly learn two orthogonal feature vectors for them. The gradient issue of DNNs is discussed in the paper Failures of Gradient-based Deep Learning.
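The coupled gradient of FM can be checked numerically. The sketch below uses only the second-order FM term, $\sum_{i<j} \langle v_i, v_j\rangle x_i x_j$, and NumPy; the function names and the toy sizes are assumptions made for this example.

```python
import numpy as np


def fm_pairwise(V, x):
    """Second-order FM term: sum over i<j of <v_i, v_j> * x_i * x_j."""
    xv = V * x[:, None]
    s = xv.sum(axis=0)
    return 0.5 * (s @ s - (xv * xv).sum())


def fm_grad(V, x):
    """Gradient w.r.t. each v_i: x_i * (sum_j x_j v_j - x_i v_i)."""
    xv = V * x[:, None]
    s = xv.sum(axis=0)
    return x[:, None] * (s[None, :] - xv)


rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))          # 4 features, embedding size 3
x = np.array([1.0, 1.0, 0.0, 1.0])   # features 0, 1 and 3 are active

G = fm_grad(V, x)
# The gradient of v_0 is exactly the sum of the other active vectors.
print(np.allclose(G[0], V[1] + V[3]))
```

Because every active vector is updated with the sum of all the others, co-occurring features are pulled toward each other regardless of whether they are actually related.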
In order to solve these issues, we propose to use product operators in DNNs to help explore feature interactions. We discuss these issues in an extended paper, which was submitted to TOIS in Sep. 2017 and will be released later. Any discussion is welcome; please contact kevinqu16@gmail.com.
Product-based Neural Networks
From our discussion of previous work, we believe a good predictor should have a good feature extractor (to convert sparse features into dense representations) as well as a powerful classifier (e.g., a DNN as a universal approximator). Since FM is good at representing feature interactions, we introduce product operators into DNNs. The proposed PNN models follow this architecture: an embedding layer to represent sparse features, a product layer to explore feature interactions, and a DNN classifier.
For the product layer, we propose two types of product operators in the paper: inner product and outer product. These operators output $n(n-1)/2$ feature interactions, which are concatenated with the embeddings and fed to the following fully connected layers.
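As a rough illustration of the inner-product case, the sketch below computes the $n(n-1)/2$ pairwise inner products of $n$ field embeddings and concatenates them with the flattened embeddings. It is written in NumPy for a single example; inner_product_layer is a hypothetical name, not the function used in models.py.

```python
import numpy as np


def inner_product_layer(E):
    """E has shape (n_fields, k). Returns embeddings plus pairwise inner products."""
    n = E.shape[0]
    rows, cols = np.triu_indices(n, k=1)       # all pairs with i < j
    p = np.sum(E[rows] * E[cols], axis=1)      # n(n-1)/2 inner products
    return np.concatenate([E.reshape(-1), p])  # input to the first hidden layer


E = np.arange(12, dtype=float).reshape(4, 3)   # 4 fields, embedding size 3
z = inner_product_layer(E)
print(z.shape)
```

With 4 fields of size 3, the output has 12 embedding entries followed by 6 interaction entries.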
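The inner product is easy to understand. The outer product is equivalent to projecting the embeddings into a hidden space and computing the inner product of the projected embeddings: weighting the outer product $uv^T$ element-wise by a kernel $w$ and summing all entries gives
$\sum_{i,j} (uv^T\odot w)_{ij} = u^Twv$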
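The identity behind the outer product can be verified numerically: summing all entries of the element-wise product of $uv^T$ and a kernel matrix equals the bilinear form $u^Twv$. The sketch below uses NumPy with arbitrary random values; the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
u = rng.normal(size=k)
v = rng.normal(size=k)
W = rng.normal(size=(k, k))            # kernel matrix

lhs = np.sum(np.outer(u, v) * W)       # sum of (u v^T) element-wise times W
rhs = u @ W @ v                        # u^T W v
print(np.isclose(lhs, rhs))
```

The right-hand side never materializes the $k \times k$ outer product, which is why the bilinear form is the cheaper way to compute it.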
Since there are $n(n-1)/2$ feature interactions, we proposed some tricks to reduce complexity. However, we find that these tricks restrict model capacity and are unnecessary. In a recent update of the code, we removed the tricks for better performance.
In our implementation, we add the parameter kernel_type: {mat, vec, num} for the outer product.
The default type is mat, and you can switch to the other types to save time and memory.
A potential risk arises when training the first hidden layer. Feature embeddings and interactions are concatenated and fed to the first hidden layer, but the embeddings and interactions have different distributions. A simple remedy is to add a linear transformation to the embeddings to balance the distributions. Layer norm is also worth trying.
How to Use
For simplicity, we provide the iPinYou dataset at make-ipinyou-data.
Follow the instructions there and update the soft link data:
XXX/product-nets$ ln -sfn XXX/make-ipinyou-data/2997 data
Then run main.py:
cd python
python main.py
As for datasets, we built a repository on GitHub, APEX-Datasets, which serves as a benchmark in our lab. This repository contains detailed data processing, feature engineering, data storage/buffering/access, and other implementations. For better I/O performance, the benchmark provides HDF5 APIs. Currently we provide download links to 4 large-scale ad-click datasets (already processed): Criteo-8day, Avazu, iPinYou-all, and Criteo Challenge. More datasets will be added later.
This code was originally written for Python 2.7; numpy, scipy, and tensorflow are required.
A recent update makes it compatible with Python 3.x,
so you can use it as a starting point with whichever Python version you prefer.
LR, FM, FNN, CCPM, DeepFM, and PNN are all implemented in models.py, based on TensorFlow.
You can train any of the models in main.py and configure parameters via a dict.
More models and an MXNet implementation will be released in the extended version.
Practical Issues
In this section, we share selected discussions from my emails and issues.
Note: Two advanced discussions, on the overfitting of Adam and on the performance gain of DNNs, are presented in the extended version. You are welcome to discuss relevant problems through issues or emails.
1. Sparse Regularization (L2)
L2 regularization is fundamental in controlling over-fitting. For sparse input, we suggest sparse regularization, i.e., regularizing only the activated weights/neurons. Traditional L2 regularization penalizes all parameters $\Vert w\Vert$, $w = [w_1, \dots, w_n]$, even when some inputs are zero ($x_i = 0$), which means every parameter $w_i$ receives a non-zero gradient for every training example $x$. Sparse regularization instead penalizes only the non-zero terms, $\Vert xw \Vert$, where $xw$ is the element-wise product.
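The difference between the two penalties shows up directly in the gradients. The following NumPy sketch assumes a squared penalty with coefficient lam (both the squared form and the name lam are assumptions for this example) and a one-hot style input x.

```python
import numpy as np


def dense_l2_grad(w, lam):
    """Gradient of (lam / 2) * ||w||^2: every weight is touched."""
    return lam * w


def sparse_l2_grad(w, x, lam):
    """Gradient of (lam / 2) * ||x * w||^2: only weights with x_i != 0 are touched."""
    return lam * (x ** 2) * w


w = np.array([0.5, -1.0, 2.0, 3.0])
x = np.array([1.0, 0.0, 0.0, 1.0])   # only features 0 and 3 are active
lam = 0.1

print(dense_l2_grad(w, lam))
print(sparse_l2_grad(w, x, lam))
```

With sparse regularization, weights of features that do not appear in the current example are left alone, so rare features are not shrunk on every step.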
2. Initialization
Initializing weights with small random numbers is a common and reliable choice in deep learning.
Usually we use a uniform or normal distribution centered at 0.
An empirical choice is to set the standard deviation near $\sqrt{1/n}$, where $n$ is the input dimension.
Another choice is xavier: for a uniform distribution, xavier uses $\sqrt{3/node_i}$, $\sqrt{3/node_o}$, or $\sqrt{6/(node_i+node_o)}$ as the upper/lower bound, where $node_i$ and $node_o$ are the numbers of input and output nodes of the layer.
This keeps the variance roughly the same across different layers.
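A minimal NumPy sketch of the xavier uniform rule with the $\sqrt{6/(node_i+node_o)}$ bound is given below. The function name and the layer sizes are chosen for illustration; in TensorFlow you would normally use the built-in initializer instead.

```python
import numpy as np


def xavier_uniform(n_in, n_out, rng):
    """Sample weights from U(-b, b) with b = sqrt(6 / (n_in + n_out))."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))


rng = np.random.default_rng(0)
W = xavier_uniform(256, 128, rng)

# A uniform distribution on (-b, b) has variance b^2 / 3 = 2 / (n_in + n_out).
print(W.var(), 2.0 / (256 + 128))
```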
3. Learning Rate
For deep neural networks with many parameters, a large learning rate often causes divergence.
SGD with a small learning rate usually performs well but converges slowly.
For extremely sparse input, adaptive learning rate methods, e.g., AdaGrad, Adam, and FTRL, converge much faster.
This blog compares most of the adaptive algorithms.
Even though adaptive algorithms speed up training and sometimes escape local minima, there is no guarantee of better generalization performance.
To sum up, Adam and AdaGrad are good choices. Adam converges faster than AdaGrad, but also overfits more easily.
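One way to see why adaptive methods help on sparse input is to look at a plain AdaGrad update, sketched below in NumPy. The toy setup (one frequent feature, one rare feature, unit gradients) and the default values of lr and eps are assumptions made for this example.

```python
import numpy as np


def adagrad_step(w, g, acc, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-coordinate step lr / (sqrt(sum of g^2) + eps)."""
    acc = acc + g * g
    w = w - lr * g / (np.sqrt(acc) + eps)
    return w, acc


w = np.zeros(2)
acc = np.zeros(2)
for t in range(10):
    # Feature 0 appears in every step, feature 1 only in the last step.
    g = np.array([1.0, 1.0 if t == 9 else 0.0])
    w_prev = w
    w, acc = adagrad_step(w, g, acc)

last_step = w_prev - w
print(last_step)
```

In the last step the rarely seen feature still moves by about lr, while the frequent feature moves by about lr divided by the square root of 10, so rare features are not starved by a globally small learning rate.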
4. Data Processing
Usually you need to build a feature map to convert categorical data into a one-hot representation. These features usually follow a long-tailed distribution, resulting in an extremely large feature space, e.g., IP addresses. A simple remedy is to remove low-frequency features using a threshold, which dramatically reduces the input dimension without much loss of performance.
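A frequency-thresholded feature map can be sketched in a few lines. The function name, the min_count parameter, and the sample IP values are hypothetical; the data processing actually used for the datasets may differ.

```python
from collections import Counter


def build_feature_map(values, min_count):
    """Map each value seen at least min_count times to a column index."""
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= min_count)
    return {v: i for i, v in enumerate(kept)}


# Toy long-tailed field: two frequent values and two values seen only once.
ips = ['1.1.1.1'] * 5 + ['2.2.2.2'] * 3 + ['3.3.3.3', '4.4.4.4']
fmap = build_feature_map(ips, min_count=2)
print(fmap)
```

Values missing from the map are simply dropped at encoding time; mapping them to one shared "other" index is a common alternative.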
For unbalanced datasets, a typical positive/negative ratio is 0.1% - 1%, and Facebook has published a paper discussing negative down-sampling. Negative down-sampling can speed up training and reduce the dimension, but requires calibration in some cases.
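When negatives are kept with probability w, predictions made in the down-sampled space are too high. A commonly used correction rescales the odds by w, as sketched below; the function and variable names are assumptions for this example.

```python
def calibrate(p, w):
    """Map a prediction p from the down-sampled space back to the original space.

    w is the fraction of negative examples kept (0 < w <= 1).
    """
    return p / (p + (1.0 - p) / w)


true_ctr = 0.001
w = 0.1
# Click rate observed after keeping only a fraction w of the negatives.
sampled_ctr = true_ctr / (true_ctr + (1.0 - true_ctr) * w)
print(sampled_ctr, calibrate(sampled_ctr, w))
```

The correction follows from the odds: down-sampling divides the number of negatives by 1/w, so the odds p/(1-p) in the sampled space must be multiplied by w to recover the original odds.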
5. Normalization
There are two kinds of normalization: feature level and instance level.
Feature-level normalization operates within one field, e.g., setting the mean of the field to 0 and the variance to 1.
Instance-level normalization keeps different records consistent, e.g., for a multi-value field whose length varies from 5 to 100 values, you can set the magnitude to 1 by shifting and scaling.
Besides, batch/weight/layer normalization are worth trying when the network grows deeper.
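Both kinds of normalization can be sketched in NumPy as follows. The function names are hypothetical, and the instance-level version assumes that "magnitude" means the L2 norm and applies scaling only; other choices are possible.

```python
import numpy as np


def feature_level_normalize(column):
    """Normalize one field across all records to zero mean and unit variance."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()


def instance_level_normalize(values):
    """Scale the values of one record's multi-value field to unit L2 norm."""
    values = np.asarray(values, dtype=float)
    return values / np.linalg.norm(values)


col = feature_level_normalize([1.0, 2.0, 3.0, 4.0, 10.0])
short = instance_level_normalize(np.ones(5))     # a record with 5 values
long = instance_level_normalize(np.ones(100))    # a record with 100 values
print(col.mean(), col.std(), np.linalg.norm(short), np.linalg.norm(long))
```

After instance-level scaling, a record with 100 values contributes the same overall magnitude as a record with 5 values.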
6. Continuous/Discrete/Multi-value Feature
Most features in user response prediction have discrete values (categorical features). The key difference between continuous and discrete features is that only continuous features are comparable in value. For example, {male: 0, female: 1} and {male: 1, female: 0} are equivalent.
When the data contains both continuous and discrete values, one solution is to discretize the continuous values by bucketing. Taking 'age' as an example, you can set [0, 12] as children, [13, 18] as teenagers, [19, ~] as adults, and so on.
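The age example can be written with the standard bisect module. The constant and function names below are hypothetical, and the boundaries are the ones from the text.

```python
from bisect import bisect_left

AGE_UPPER_BOUNDS = [12, 18]                      # inclusive upper bounds of the first two buckets
AGE_LABELS = ['children', 'teenagers', 'adults']


def age_bucket(age):
    """Return the bucket id: 0 for [0, 12], 1 for [13, 18], 2 for 19 and above."""
    return bisect_left(AGE_UPPER_BOUNDS, age)


for age in (5, 12, 13, 18, 19, 40):
    print(age, AGE_LABELS[age_bucket(age)])
```

The resulting bucket id is then treated like any other categorical value and one-hot encoded.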
Multi-value features are special cases of discrete features, e.g., recently reviewed items = [item2, item7, item11], [item1, item4, item9, item13].
This type of data is also called set data, with one key property, permutation invariance, which is discussed in the paper DeepSet.
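One simple permutation-invariant way to handle such a field is to sum the embeddings of its items. The NumPy sketch below assumes a random embedding table and a hypothetical sum_pool helper; it is an illustration, not the method used in this repository.

```python
import numpy as np


def sum_pool(item_ids, embedding):
    """Represent a set of items by the sum of their embedding vectors."""
    return embedding[item_ids].sum(axis=0)


rng = np.random.default_rng(0)
embedding = rng.normal(size=(20, 4))     # 20 items, embedding size 4

a = sum_pool([2, 7, 11], embedding)
b = sum_pool([11, 2, 7], embedding)      # same set, different order
c = sum_pool([1, 4, 9, 13], embedding)   # a set of a different length
print(np.allclose(a, b), c.shape)
```

The output has a fixed size regardless of how many items the record contains, and reordering the items does not change it.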
7. Activation Function
Do not use sigmoid in hidden layers; use tanh or relu instead.
More recently, selu has been proposed to maintain a fixed point during training.
8. Numerical Stability Parameters
Adaptive optimizers usually require hyperparameters for numerical stability, e.g., $\epsilon$ in Adam and the initial value of the accumulator in AdaGrad. Sometimes these parameters have a large impact on model convergence and performance.
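The effect of $\epsilon$ is easiest to see on the first Adam step for a very small gradient. The sketch below follows the standard Adam update with bias correction; the gradient value and the two $\epsilon$ settings are arbitrary choices for illustration.

```python
import math


def adam_first_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient g."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)              # bias-corrected first moment
    v_hat = v / (1 - beta2)              # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)


g = 1e-6                                  # a tiny gradient
small_eps = adam_first_step(g, eps=1e-8)
large_eps = adam_first_step(g, eps=1e-4)
print(small_eps, large_eps)
```

With a small $\epsilon$ the step is close to lr even though the gradient is tiny, while a larger $\epsilon$ shrinks the step by roughly two orders of magnitude, which is one reason these settings can change convergence noticeably.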