Yandex Cup 2022: Like Prediction, 2nd place solution
This solution uses a two-stage recommender system: candidate selection with several methods, followed by ranking with GBDT.
Environment and running
- if you want to use cuDF, you can install the conda distribution from https://rapids.ai/start.html
- on top of it, run `pip install -r requirements.txt`
- running with USE_CUDF=0 may be problematic, because cudf-only methods such as .to_pandas() are used in some places
- to run the pipeline, put the unzipped files into `data` and run `dvc repro`
- to run only training & CV, use `dvc repro train_lightgbm_cv`
- you can also use `dvc exp run` with different params, e.g. `dvc exp run -S train.working_dir=data/processed/sample` will run the pipeline on a small sample of data
Hardware
All experiments were run on a rig with 512GB RAM and an A100 GPU. The most memory-intensive step is model training, which takes ~250GB RAM at peak. The GPU is only needed for fast calculation of co-occurrence features with cudf, but it's possible to use pandas instead (set env USE_CUDF=0). The full pipeline with inference takes ~8 hours when the stages are executed sequentially with a GPU.
Candidate selection
Next-item co-occurrence
- calculate a dictionary over all consecutive pairs in train & test data: `{item: {next_item: count}}`
- get candidates as the most common next items in the dictionary, looked up by key: the last item, the pre-last item, etc.
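The two steps above can be sketched in plain Python. This is a minimal illustration rather than the repository's code: the function names, the `rank` argument (0 for the last item, 1 for the pre-last one) and the `sessions` input format (one list of item ids per user, in chronological order) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_next_item_counts(sessions):
    """Build {item: {next_item: count}} over all consecutive pairs."""
    counts = defaultdict(Counter)
    for items in sessions:
        for current_item, next_item in zip(items, items[1:]):
            counts[current_item][next_item] += 1
    return counts


def next_item_candidates(counts, history, rank=0, top_k=300):
    """Most common followers of the item `rank` steps back from the end of history."""
    if rank >= len(history):
        return []
    key = history[-1 - rank]
    followers = counts.get(key, Counter())
    return [item for item, _ in followers.most_common(top_k)]


sessions = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
counts = build_next_item_counts(sessions)
last_item_candidates = next_item_candidates(counts, ["x", "a"], rank=0, top_k=1)
pre_last_candidates = next_item_candidates(counts, ["a", "z"], rank=1, top_k=2)
```

Calling `next_item_candidates` once per rank and concatenating the results gives one candidate list per history position.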
Smart co-occurrence
- get candidates by co-occurrence counts aggregated over the last N history items (sum is used as the aggregate)
- one thing I missed during the competition and checked afterwards: adding a weight based on the rank of the action in history. Using 1 / (rank + 1) as the weight boosts recall by 10% and precision by 30% compared to an evenly weighted sum.
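As a rough sketch of the weighted aggregation described above, assuming a precomputed mapping `cooc` of the form `{item: {other_item: count}}` and that rank 0 is the most recent action. The function name and the `last_n`, `top_k` and `rank_weighted` arguments are hypothetical.

```python
from collections import Counter


def smart_cooc_candidates(cooc, history, last_n=16, top_k=1500, rank_weighted=True):
    """Sum co-occurrence counts over the last `last_n` history items.

    With rank_weighted=True each history item contributes with weight
    1 / (rank + 1), where rank 0 is the most recent action.
    """
    scores = Counter()
    recent_first = history[-last_n:][::-1]
    for rank, item in enumerate(recent_first):
        weight = 1.0 / (rank + 1) if rank_weighted else 1.0
        for other, count in cooc.get(item, {}).items():
            scores[other] += weight * count
    return [item for item, _ in scores.most_common(top_k)]


cooc = {"a": {"x": 6, "y": 1}, "b": {"y": 4}}
history = ["a", "b"]  # "b" is the most recent item
even = smart_cooc_candidates(cooc, history, rank_weighted=False)
weighted = smart_cooc_candidates(cooc, history, rank_weighted=True)
```

In this toy example the evenly weighted sum ranks `x` first (6 vs 5), while the rank weights favour `y`, which co-occurs with the most recent item (4.5 vs 3).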
Implicit BM25 (Item2Item)
- train an item-item recommender and take the items most similar to the last item as candidates
- other types of I2I from implicit also work, but BM25 is slightly better in terms of recall
Implicit ALS
- use the implicit ALS model for candidates. I used `recalculate_user=True`, but using real user factors could be a bit better
Popular items
- since popular tracks change over time, popularity is counted using only the last items in user sessions
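One way to read the bullet above is as a counter over the tail of each session. The sketch below assumes chronologically ordered sessions. The `last_n` argument is hypothetical, since the number of trailing items actually used is not stated here.

```python
from collections import Counter


def popular_candidates(sessions, last_n=1, top_k=200):
    """Count only the last `last_n` items of every session and return the top ones."""
    counts = Counter()
    for items in sessions:
        counts.update(items[-last_n:])
    return [item for item, _ in counts.most_common(top_k)]


sessions = [
    ["old", "old2", "new"],
    ["old", "new"],
    ["old", "hit"],
    ["x", "hit"],
    ["y", "new"],
]
recent_popular = popular_candidates(sessions, last_n=1, top_k=2)
```

Here `old` appears three times overall but never at the end of a session, so it is not counted as popular.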
Last artist items
- recommend the top tracks of the user's last liked artist
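A minimal sketch of this candidate source, assuming a `track_artist` mapping from track id to artist id and measuring "top" by like counts over all sessions. Both function names and the mapping are illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_artist_top_tracks(sessions, track_artist):
    """Count likes per track, grouped by the track's artist."""
    per_artist = defaultdict(Counter)
    for items in sessions:
        for track in items:
            artist = track_artist.get(track)
            if artist is not None:
                per_artist[artist][track] += 1
    return per_artist


def last_artist_candidates(per_artist, track_artist, history, top_k=100):
    """Top tracks of the artist of the user's most recent like."""
    if not history:
        return []
    artist = track_artist.get(history[-1])
    if artist is None:
        return []
    return [track for track, _ in per_artist.get(artist, Counter()).most_common(top_k)]


track_artist = {"t1": "A", "t2": "A", "t3": "B"}
sessions = [["t1", "t3"], ["t1", "t2"], ["t1"], ["t2", "t3"]]
per_artist = build_artist_top_tracks(sessions, track_artist)
artist_candidates = last_artist_candidates(per_artist, track_artist, ["t3", "t2"])
```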
Features
- score and rank from each candidate engine
- co-occurrence aggregated stats (mean, max, std, min)
- als similarity aggregated stats
- i2i similarity aggregated stats
- item/artist statistics with different offsets (last 10, 50 actions, etc.)
- user features: number of likes, unique artists, likes per artist
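To illustrate what "aggregated stats" can mean for a (user, candidate) pair, the sketch below aggregates the co-occurrence counts between a candidate and every item in the user's history. The function name, the feature names and the use of population standard deviation are assumptions; the same pattern would apply to ALS or i2i similarities.

```python
from statistics import mean, pstdev


def cooc_aggregate_features(cooc, history, candidate):
    """Mean, max, std and min of co-occurrence counts between history items and a candidate."""
    values = [cooc.get(item, {}).get(candidate, 0) for item in history]
    if not values:
        return {"cooc_mean": 0.0, "cooc_max": 0.0, "cooc_std": 0.0, "cooc_min": 0.0}
    return {
        "cooc_mean": mean(values),
        "cooc_max": max(values),
        "cooc_std": pstdev(values),  # assumption: population std
        "cooc_min": min(values),
    }


cooc = {"a": {"c": 4}, "b": {"c": 2}}
features = cooc_aggregate_features(cooc, ["a", "b", "z"], "c")
```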
Ranker
tl;dr: LightGBM with lambdarank objective. Some things to note:
- 3-fold CV with averaged predictions
- downsample negative items with rate 0.3 (e.g. we keep 300k negatives out of 1M)
- use a custom numba MRR implementation for early stopping
- 100 early stopping rounds, 600 iterations on average
- hyperparams were tuned once on a small subset of data, almost at the beginning; trying to change any of them later did not help
  - learning rate: 0.04
  - l1 reg: ~1
  - l2 reg: ~8
  - colsample: 0.6
  - subsample: 0.6
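The settings above can be sketched as follows, in plain Python without numba or LightGBM so the example stays self-contained. The parameter dictionary maps the listed values onto LightGBM's documented parameter names; `bagging_freq`, the helper names, the row format, the `seed` and the MRR cutoff `k` are assumptions, and the repository's numba implementation may differ.

```python
import random

LGBM_PARAMS = {
    "objective": "lambdarank",
    "learning_rate": 0.04,
    "lambda_l1": 1.0,          # "l1 reg: ~1"
    "lambda_l2": 8.0,          # "l2 reg: ~8"
    "feature_fraction": 0.6,   # "colsample"
    "bagging_fraction": 0.6,   # "subsample"
    "bagging_freq": 1,         # assumption: LightGBM needs a non-zero value to enable bagging
}


def downsample_negatives(rows, rate=0.3, seed=0):
    """Keep every positive row and roughly a `rate` share of the negative rows."""
    rng = random.Random(seed)
    return [row for row in rows if row["label"] == 1 or rng.random() < rate]


def mean_reciprocal_rank(labels, scores, group_sizes, k=100):
    """MRR@k over consecutive groups (one group per user, as in LightGBM's group layout)."""
    total, start = 0.0, 0
    for size in group_sizes:
        order = sorted(range(start, start + size), key=lambda i: scores[i], reverse=True)
        for position, i in enumerate(order[:k], start=1):
            if labels[i] == 1:
                total += 1.0 / position
                break
        start += size
    return total / len(group_sizes) if group_sizes else 0.0


rows = [{"label": 1}] * 10 + [{"label": 0}] * 1000
kept = downsample_negatives(rows, rate=0.3, seed=0)
mrr = mean_reciprocal_rank(
    labels=[0, 1, 0, 1, 0],
    scores=[0.9, 0.8, 0.1, 0.7, 0.2],
    group_sizes=[3, 2],
)
```

In the toy call the first user's positive item is ranked second and the second user's positive is ranked first, so the MRR is (0.5 + 1.0) / 2.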
Final ensemble
The final submission is generated by blending 3 submission files with an inverse rank blend (see blend.py for an example).
Features and LightGBM parameters were pretty much the same between all three models.
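For readers unfamiliar with the idea, here is one common form of inverse rank blending: every submission gives an item a score of weight / rank, and the scores are summed across submissions. This is a sketch under that assumption; the actual logic in blend.py may differ, and the function name, the input format and the `weights` and `top_k` arguments are hypothetical.

```python
from collections import defaultdict


def inverse_rank_blend(submissions, weights=None, top_k=100):
    """Blend ranked lists: each submission adds weight / rank to an item's score.

    submissions: list of {user_id: [items ordered best-first]}
    """
    weights = weights or [1.0] * len(submissions)
    users = set().union(*(sub.keys() for sub in submissions))
    blended = {}
    for user in users:
        scores = defaultdict(float)
        for sub, weight in zip(submissions, weights):
            for rank, item in enumerate(sub.get(user, []), start=1):
                scores[item] += weight / rank
        # Sort by blended score; break ties by item id for determinism.
        ranked = sorted(scores, key=lambda item: (-scores[item], str(item)))
        blended[user] = ranked[:top_k]
    return blended


subs = [
    {"u": ["a", "b", "c"]},
    {"u": ["b", "a", "d"]},
    {"u": ["b", "c", "a"]},
]
blend = inverse_rank_blend(subs)
```

Items that sit near the top of several lists accumulate the largest scores, so `b` (ranked 2, 1, 1) ends up ahead of `a` (ranked 1, 2, 3).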
first (0.0849 lb, 0.0845 cv, 0.49 recall)
- next-item co-occurrence candidates for ranks 0 1 2 3 4 5 (300 per rank)
- default implicit ALS, 300 items
- last item similar candidates, 300 items
- 200 popular items
- 100 last artist top items
- LEFT JOIN CANDIDATES :D
second (0.0854 lb, 0.0852 cv, 0.62 recall)
- next-item co-occurrence candidates for ranks 0 1 (300 per rank)
- 1500 "smart co-occurrence" candidates (cooc calculated in a ±7 range, using the last 100 items in history)
- default implicit ALS, 300 items
- last item similar candidates, 300 items
- 200 popular items
third (0.08608 lb, 0.0856 cv, 0.64 recall)
- next-item co-occurrence candidates for ranks 0 1 (300 per rank)
- 1500 "smart co-occurrence" candidates (cooc calculated in a ±7 range, using the last 16 items in history)
- default implicit ALS, 300 items
- last item similar candidates, 300 items
- 200 popular items
Bonus - DVC pipeline flow chart
flowchart LR
node1["calculate_als_candidates@test"]
node2["calculate_als_candidates@val"]
node3["calculate_artist_candidates@test"]
node4["calculate_artist_candidates@val"]
node5["calculate_cooc_candidates@test"]
node6["calculate_cooc_candidates@val"]
node7["calculate_cooc_smart_candidates@test"]
node8["calculate_cooc_smart_candidates@val"]
node9["calculate_cooc_stats"]
node10["calculate_cooc_stats_for_smart"]
node11["calculate_popular_candidates@test"]
node12["calculate_popular_candidates@val"]
node13["calculate_similar_candidates@test"]
node14["calculate_similar_candidates@val"]
node15["create_artist_features"]
node16["create_item_features"]
node17["create_submission"]
node18["create_submission_cv"]
node19["create_user_artist_features@test"]
node20["create_user_artist_features@val"]
node21["create_user_features@test"]
node22["create_user_features@val"]
node23["create_user_history_als_features@test"]
node24["create_user_history_als_features@val"]
node25["create_user_history_artist_features@test"]
node26["create_user_history_artist_features@val"]
node27["create_user_history_cooc_features@test"]
node28["create_user_history_cooc_features@val"]
node29["create_user_history_features@test"]
node30["create_user_history_features@val"]
node31["create_user_history_similarity_features@test"]
node32["create_user_history_similarity_features@val"]
node33["merge_candidates@test"]
node34["merge_candidates@val"]
node35["merge_candidates_and_features@test"]
node36["merge_candidates_and_features@val"]
node37["prepare_data"]
node38["split_test_by_chunks"]
node39["train_als_candidates"]
node40["train_cooc_candidates"]
node41["train_lightgbm"]
node42["train_lightgbm_cv"]
node43["train_popular_candidates"]
node44["train_similar_candidates"]
node1-->node33
node2-->node34
node5-->node33
node6-->node34
node7-->node33
node8-->node34
node9-->node27
node9-->node28
node10-->node7
node10-->node8
node11-->node33
node12-->node34
node13-->node33
node14-->node34
node15-->node35
node15-->node36
node16-->node35
node16-->node36
node19-->node35
node20-->node36
node21-->node35
node22-->node36
node23-->node35
node24-->node36
node27-->node35
node28-->node36
node29-->node35
node30-->node36
node31-->node35
node32-->node36
node33-->node23
node33-->node25
node33-->node27
node33-->node31
node33-->node35
node34-->node24
node34-->node26
node34-->node28
node34-->node32
node34-->node36
node35-->node17
node35-->node38
node36-->node41
node36-->node42
node37-->node1
node37-->node2
node37-->node3
node37-->node4
node37-->node5
node37-->node6
node37-->node7
node37-->node8
node37-->node9
node37-->node10
node37-->node11
node37-->node12
node37-->node13
node37-->node14
node37-->node15
node37-->node16
node37-->node19
node37-->node20
node37-->node21
node37-->node22
node37-->node23
node37-->node24
node37-->node25
node37-->node26
node37-->node27
node37-->node28
node37-->node29
node37-->node30
node37-->node31
node37-->node32
node37-->node39
node37-->node40
node37-->node41
node37-->node43
node37-->node44
node38-->node18
node39-->node1
node39-->node2
node39-->node23
node39-->node24
node40-->node5
node40-->node6
node41-->node17
node42-->node18
node43-->node11
node43-->node12
node44-->node13
node44-->node14
node44-->node31
node44-->node32