Yandex Cup 2022: Like Prediction, 2nd place solution

This solution uses a two-stage recommender system: candidate selection with several methods, followed by ranking with GBDT.

Environment and running

if you want to use cudf, install the conda distribution from https://rapids.ai/start.html
then, on top of it, run pip install -r requirements.txt
running with USE_CUDF=0 may be problematic, because some places use cudf-only methods like .to_pandas()
to run the pipeline, put the unzipped files into data and run dvc repro
to run only training and CV, use dvc repro train_lightgbm_cv
you can also use dvc exp run with different params, e.g. dvc exp run -S train.working_dir=data/processed/sample runs the pipeline on a small sample of the data

Hardware

All experiments were run on a rig with 512GB RAM and an A100 GPU. The most memory-intensive step is model training, which takes ~250GB RAM at peak. The GPU is only needed for fast calculation of co-occurrence features with cudf; in principle pandas can be used instead (set env USE_CUDF=0, subject to the caveat above). The full pipeline with inference takes ~8 hours when the stages are run sequentially on GPU.

Candidate selection

Next-item co-occurrence

calculate a dictionary {item: {next_item: count}} over all consecutive pairs in train & test data
get candidates as the most common next items in the dictionary, using the last item, second-to-last item, etc. as keys
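
A minimal sketch of the two steps above, in plain Python rather than the cudf/pandas code the pipeline uses. The function names, the `offset` parameter (0 = last item, 1 = second-to-last, and so on), and the toy data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_next_item_counts(histories):
    """Build {item: {next_item: count}} from consecutive pairs in each history."""
    counts = defaultdict(Counter)
    for history in histories:
        for item, next_item in zip(history, history[1:]):
            counts[item][next_item] += 1
    return counts


def next_item_candidates(counts, history, offset=0, top_k=300):
    """Most common followers of the item `offset` steps back from the end."""
    if offset >= len(history):
        return []
    key = history[-1 - offset]
    followers = counts.get(key, Counter())
    return [item for item, _ in followers.most_common(top_k)]


# Toy data: three user histories, ordered by time.
histories = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
counts = build_next_item_counts(histories)

# The user's last item is 2, which was followed by 3 twice and by 4 once.
candidates = next_item_candidates(counts, [7, 2], offset=0, top_k=2)
print(candidates)
```

Calling the function once per offset gives a separate candidate list for each history position.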

Smart co-occurrence

get candidates by aggregating (summing) co-occurrence counts with the last N history items
one thing I missed during the competition and checked afterwards: adding a weight based on the rank of the action in history. Using 1 / (rank + 1) as the weight boosts recall by 10% and precision by 30% compared to an evenly weighted sum.
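
The weighted aggregation can be sketched as follows. This is an illustration and not the repository's code. It assumes that rank 0 is the most recent action, that co-occurrence is counted inside a symmetric positional window, and that items already in the user's history are dropped from the candidates. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_window_cooc(histories, window=7):
    """Count pairs of items that appear within `window` positions of each other."""
    cooc = defaultdict(Counter)
    for history in histories:
        for i, item in enumerate(history):
            lo = max(0, i - window)
            hi = min(len(history), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[item][history[j]] += 1
    return cooc


def smart_cooc_candidates(cooc, history, last_n=16, top_k=1500, use_rank_weight=True):
    """Sum co-occurrence counts over the last N items, optionally rank-weighted."""
    scores = Counter()
    seen = set(history)  # assumption: already liked items are not recommended
    recent = history[-last_n:][::-1]  # assumption: rank 0 is the most recent item
    for rank, item in enumerate(recent):
        weight = 1.0 / (rank + 1) if use_rank_weight else 1.0
        for other, count in cooc.get(item, {}).items():
            if other not in seen:
                scores[other] += weight * count
    return scores.most_common(top_k)


histories = [[1, 2, 3], [2, 3, 4], [1, 5]]
cooc = build_window_cooc(histories, window=1)

weighted = smart_cooc_candidates(cooc, [1, 2], use_rank_weight=True)
unweighted = smart_cooc_candidates(cooc, [1, 2], use_rank_weight=False)
print(weighted)
print(unweighted)
```

With the weight enabled, item 5 (which co-occurs only with the older history item 1) contributes half as much as it does in the evenly weighted sum.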

Implicit BM25 (Item2Item)

train an item-item recommender, take items similar to the last item as candidates
other types of I2I from implicit also work, but BM25 is slightly better in terms of recall

Implicit ALS

use the implicit ALS model for candidates. I used recalculate_user=True, but using real user factors could be a bit better

Last artist items

recommend top tracks of user's last liked artist
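
A small sketch of this candidate source, assuming that "top tracks" means the artist's most liked tracks across all histories and that an item-to-artist mapping is available. The function names and the toy data are made up for illustration.

```python
from collections import Counter, defaultdict


def build_artist_top_items(histories, item_to_artist):
    """Count likes per track, grouped by artist."""
    likes = defaultdict(Counter)
    for history in histories:
        for item in history:
            likes[item_to_artist[item]][item] += 1
    return likes


def last_artist_candidates(artist_likes, item_to_artist, history, top_k=100):
    """Top tracks of the artist of the user's most recent like."""
    if not history:
        return []
    artist = item_to_artist[history[-1]]
    top = artist_likes.get(artist, Counter())
    return [item for item, _ in top.most_common(top_k)]


item_to_artist = {1: "a", 2: "a", 3: "b", 4: "a"}
histories = [[1, 2, 3], [2, 4], [2, 1]]
artist_likes = build_artist_top_items(histories, item_to_artist)

# The user's last like is track 4 by artist "a".
candidates = last_artist_candidates(artist_likes, item_to_artist, [3, 4], top_k=2)
print(candidates)
```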

Features

score and rank from each candidate engine
co-occurrence aggregated stats (mean, max, std, min)
ALS similarity aggregated stats
I2I similarity aggregated stats
item/artist statistics with different offsets (last 10, 50 actions, etc.)
user features: number of likes, unique artists, likes per artist
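
As an illustration of the aggregated stats features, the sketch below computes mean, max, min and std of the co-occurrence counts between one candidate and the user's recent history items. The feature names, the `last_n` cutoff and the use of the population standard deviation are assumptions. The actual pipeline computes these with cudf/pandas and may differ in detail.

```python
import statistics


def cooc_agg_features(cooc, history, candidate, last_n=10):
    """Aggregate co-occurrence counts between a candidate and recent history items."""
    values = [cooc.get(item, {}).get(candidate, 0) for item in history[-last_n:]]
    if not values:
        return {"cooc_mean": 0.0, "cooc_max": 0, "cooc_min": 0, "cooc_std": 0.0}
    return {
        "cooc_mean": statistics.fmean(values),
        "cooc_max": max(values),
        "cooc_min": min(values),
        # Population std (ddof=0); pandas defaults to the sample std (ddof=1).
        "cooc_std": statistics.pstdev(values),
    }


# Toy co-occurrence dictionary: {item: {other_item: count}}.
cooc = {1: {9: 4}, 2: {9: 2}, 3: {}}
features = cooc_agg_features(cooc, [1, 2, 3], candidate=9)
print(features)
```

The same pattern applies to the ALS and I2I similarity features, with a similarity score in place of the raw count.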

Ranker

tl;dr - LightGBM with lambdarank objective. Some things to note:

3-fold CV with averaged predictions
downsample negative items with rate 0.3 (e.g. we keep 300k negatives out of 1M)
use custom numba MRR implementation for early stopping
100 early stopping rounds, 600 iterations on average
hyperparams were tuned once on a small subset of data, almost at the beginning; changing any of them later did not help
learning rate: 0.04
l1 reg: ~1
l2 reg: ~8
colsample: 0.6
subsample: 0.6
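
For reference, the early-stopping metric can be sketched in plain Python. This is a simplified stand-in for the numba implementation mentioned above and not the actual code. It assumes rows are grouped by user in contiguous blocks (as in a ranking dataset with group sizes), and the cutoff `k` is a hypothetical parameter.

```python
def mean_reciprocal_rank(scores, labels, group_sizes, k=100):
    """MRR over contiguous per-user groups: 1 / rank of the first relevant item."""
    total = 0.0
    start = 0
    for size in group_sizes:
        group_scores = scores[start:start + size]
        group_labels = labels[start:start + size]
        # Positions within the group, sorted by predicted score (highest first).
        order = sorted(range(size), key=lambda i: group_scores[i], reverse=True)
        for rank, idx in enumerate(order[:k], start=1):
            if group_labels[idx] > 0:
                total += 1.0 / rank
                break
        start += size
    return total / len(group_sizes) if group_sizes else 0.0


# Two users: the first has 3 candidates, the second has 2.
scores = [0.1, 0.9, 0.5, 0.3, 0.2]
labels = [0, 0, 1, 1, 0]
mrr = mean_reciprocal_rank(scores, labels, group_sizes=[3, 2])
print(mrr)
```

In the example the first user's relevant item is ranked second and the second user's is ranked first, so the metric is (1/2 + 1) / 2.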

Final ensemble

Final submission is generated by blending 3 submission files with an inverse rank blend (see blend.py for an example).
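
One common form of inverse rank blending is sketched below, assuming each submission maps a user to a ranked list of items and each item scores the sum of 1 / rank over the submissions. This may differ from what blend.py actually does. The function name, the optional weights and the toy submissions are illustrative assumptions.

```python
from collections import defaultdict


def inverse_rank_blend(submissions, weights=None, top_k=100):
    """Blend ranked lists: score(item) = sum over submissions of weight / rank."""
    if weights is None:
        weights = [1.0] * len(submissions)
    users = set().union(*[sub.keys() for sub in submissions])
    blended = {}
    for user in users:
        scores = defaultdict(float)
        for sub, weight in zip(submissions, weights):
            for rank, item in enumerate(sub.get(user, []), start=1):
                scores[item] += weight / rank
        # Stable sort: ties keep the order in which items were first seen.
        ranked = sorted(scores, key=scores.get, reverse=True)
        blended[user] = ranked[:top_k]
    return blended


sub_a = {"u1": [10, 20, 30]}
sub_b = {"u1": [20, 10, 40]}
sub_c = {"u1": [20, 30, 10]}
blended = inverse_rank_blend([sub_a, sub_b, sub_c])
print(blended["u1"])
```

Item 20 is ranked first in two of the three lists, so it ends up on top of the blended list.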

Features and LightGBM parameters were pretty much the same between all three models.

first (0.0849 lb, 0.0845 cv, 0.49 recall)

next-item co-occurrence candidates for ranks 0, 1, 2, 3, 4, 5 (300 per rank)
default implicit ALS, 300 items
last item similar candidates, 300 items
200 popular items
100 last artist top items
LEFT JOIN CANDIDATES :D

second (0.0854 lb, 0.0852 cv, 0.62 recall)

next-item co-occurrence candidates for ranks 0, 1 (300 per rank)
1500 "smart co-occurrence" candidates (cooc calculated in a ±7 range, using the 100 last items in history)
default implicit ALS, 300 items
last item similar candidates, 300 items
200 popular items

third (0.08608 lb, 0.0856 cv, 0.64 recall)

next-item co-occurrence candidates for ranks 0, 1 (300 per rank)
1500 "smart co-occurrence" candidates (cooc calculated in a ±7 range, using the 16 last items in history)
default implicit ALS, 300 items
last item similar candidates, 300 items
200 popular items

Bonus - DVC pipeline flow chart

flowchart LR
        node1["calculate_als_candidates@test"]
        node2["calculate_als_candidates@val"]
        node3["calculate_artist_candidates@test"]
        node4["calculate_artist_candidates@val"]
        node5["calculate_cooc_candidates@test"]
        node6["calculate_cooc_candidates@val"]
        node7["calculate_cooc_smart_candidates@test"]
        node8["calculate_cooc_smart_candidates@val"]
        node9["calculate_cooc_stats"]
        node10["calculate_cooc_stats_for_smart"]
        node11["calculate_popular_candidates@test"]
        node12["calculate_popular_candidates@val"]
        node13["calculate_similar_candidates@test"]
        node14["calculate_similar_candidates@val"]
        node15["create_artist_features"]
        node16["create_item_features"]
        node17["create_submission"]
        node18["create_submission_cv"]
        node19["create_user_artist_features@test"]
        node20["create_user_artist_features@val"]
        node21["create_user_features@test"]
        node22["create_user_features@val"]
        node23["create_user_history_als_features@test"]
        node24["create_user_history_als_features@val"]
        node25["create_user_history_artist_features@test"]
        node26["create_user_history_artist_features@val"]
        node27["create_user_history_cooc_features@test"]
        node28["create_user_history_cooc_features@val"]
        node29["create_user_history_features@test"]
        node30["create_user_history_features@val"]
        node31["create_user_history_similarity_features@test"]
        node32["create_user_history_similarity_features@val"]
        node33["merge_candidates@test"]
        node34["merge_candidates@val"]
        node35["merge_candidates_and_features@test"]
        node36["merge_candidates_and_features@val"]
        node37["prepare_data"]
        node38["split_test_by_chunks"]
        node39["train_als_candidates"]
        node40["train_cooc_candidates"]
        node41["train_lightgbm"]
        node42["train_lightgbm_cv"]
        node43["train_popular_candidates"]
        node44["train_similar_candidates"]
        node1-->node33
        node2-->node34
        node5-->node33
        node6-->node34
        node7-->node33
        node8-->node34
        node9-->node27
        node9-->node28
        node10-->node7
        node10-->node8
        node11-->node33
        node12-->node34
        node13-->node33
        node14-->node34
        node15-->node35
        node15-->node36
        node16-->node35
        node16-->node36
        node19-->node35
        node20-->node36
        node21-->node35
        node22-->node36
        node23-->node35
        node24-->node36
        node27-->node35
        node28-->node36
        node29-->node35
        node30-->node36
        node31-->node35
        node32-->node36
        node33-->node23
        node33-->node25
        node33-->node27
        node33-->node31
        node33-->node35
        node34-->node24
        node34-->node26
        node34-->node28
        node34-->node32
        node34-->node36
        node35-->node17
        node35-->node38
        node36-->node41
        node36-->node42
        node37-->node1
        node37-->node2
        node37-->node3
        node37-->node4
        node37-->node5
        node37-->node6
        node37-->node7
        node37-->node8
        node37-->node9
        node37-->node10
        node37-->node11
        node37-->node12
        node37-->node13
        node37-->node14
        node37-->node15
        node37-->node16
        node37-->node19
        node37-->node20
        node37-->node21
        node37-->node22
        node37-->node23
        node37-->node24
        node37-->node25
        node37-->node26
        node37-->node27
        node37-->node28
        node37-->node29
        node37-->node30
        node37-->node31
        node37-->node32
        node37-->node39
        node37-->node40
        node37-->node41
        node37-->node43
        node37-->node44
        node38-->node18
        node39-->node1
        node39-->node2
        node39-->node23
        node39-->node24
        node40-->node5
        node40-->node6
        node41-->node17
        node42-->node18
        node43-->node11
        node43-->node12
        node44-->node13
        node44-->node14
        node44-->node31
        node44-->node32