SRS Benchmark

Introduction

Spaced repetition algorithms are computer programs designed to help people schedule reviews of flashcards. A good spaced repetition algorithm helps you remember things more efficiently. Instead of cramming all at once, it distributes your reviews over time. To make this efficient, these algorithms try to understand how your memory works. They aim to predict when you're likely to forget something, so they can schedule a review accordingly.

This benchmark is a tool for assessing the predictive accuracy of spaced repetition algorithms: many algorithms are evaluated to find out which ones predict recall most accurately.

Dataset

The dataset for the SRS benchmark comes from 10 thousand users of Anki, a flashcard app, and contains information about ~727 million flashcard reviews in total. The full dataset is hosted on Hugging Face Datasets: open-spaced-repetition/anki-revlogs-10k. An earlier, larger collection of ~1.7 billion reviews from 20 thousand users is also available as open-spaced-repetition/FSRS-Anki-20k, but the results below are based on the 10k dataset.
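
If you want to inspect the data yourself, the snippet below is a minimal sketch of downloading the 10k dataset with the huggingface_hub library; the internal file layout of the repository is not documented here, so check the downloaded directory before building anything on top of it.

```python
# Minimal sketch: fetch the benchmark dataset from Hugging Face.
# Requires the huggingface_hub package (pip install huggingface_hub).
from huggingface_hub import snapshot_download

# Downloads (or reuses a cached copy of) the dataset repository and
# returns the path of the local directory that holds its files.
local_dir = snapshot_download(
    repo_id="open-spaced-repetition/anki-revlogs-10k",
    repo_type="dataset",
)
print(local_dir)
```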

Evaluation

Data Split

In the SRS benchmark, we use a tool called TimeSeriesSplit. This is part of the sklearn library used for machine learning. The tool helps us split the data by time: older reviews are used for training and newer reviews for testing. That way, we don't accidentally cheat by giving the algorithm future information it shouldn't have. In practice, we use past study sessions to predict future ones. This makes TimeSeriesSplit a good fit for our benchmark.

Note: with TimeSeriesSplit, the earliest chunk of data never appears in a test set; it serves only as training data, so the algorithm is never evaluated on reviews it was trained on.
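
Here is a rough illustration (not the benchmark's actual code) of how TimeSeriesSplit produces chronological train/test splits; note that the earliest reviews never end up in a test set.

```python
# Rough illustration of chronological splitting with sklearn's TimeSeriesSplit.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

reviews = np.arange(10)  # stand-in for 10 reviews, already sorted by time
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, test_idx) in enumerate(tscv.split(reviews)):
    # Every training set ends before its test set begins, so the model never
    # sees "future" reviews; the first chunk (indices 0-1 here) is only ever
    # used for training.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```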

Metrics

We use three metrics in the SRS benchmark to evaluate how well these algorithms work: log loss, AUC, and a custom RMSE that we call RMSE (bins).
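
As a rough sketch (the benchmark's real implementation is more involved, especially the binning used for RMSE (bins)), the three metrics can be computed along these lines:

```python
# Hedged sketch of the three metrics on toy data. The real benchmark uses a
# more elaborate binning scheme for RMSE (bins); here the predictions are
# simply grouped into ten equal-width bins.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])                   # 1 = recalled, 0 = forgot
y_pred = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.7, 0.2, 0.95])  # predicted recall probability

print("Log loss:", log_loss(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_pred))

# Simplified RMSE (bins): compare the mean prediction with the observed recall
# rate in each bin, then take a review-count-weighted root mean square.
bins = np.minimum((y_pred * 10).astype(int), 9)
sq_err, weights = [], []
for b in np.unique(bins):
    mask = bins == b
    sq_err.append((y_pred[mask].mean() - y_true[mask].mean()) ** 2)
    weights.append(mask.sum())
print("RMSE (bins), simplified:", np.sqrt(np.average(sq_err, weights=weights)))
```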

Algorithms

If an algorithm has "-short" at the end of its name, it means that it uses data from same-day reviews as well.

For further information regarding the FSRS algorithm, please refer to the following wiki page: The Algorithm.

Result

Total number of users: 9,999.

Total number of reviews for evaluation: 349,923,850. Same-day reviews are excluded except when optimizing FSRS-5 and algorithms that have "-short" at the end of their names. Each algorithm uses only one review per day (the first, chronologically). Some reviews are filtered out, for example, the revlog entries created by changing the due date manually or reviewing cards in a filtered deck with "Reschedule cards based on my answers in this deck" disabled. Finally, an outlier filter is applied. These are the reasons why the number of reviews used for evaluation is significantly lower than the figure of 727 million mentioned earlier.
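
For illustration only, a sketch of the "one review per day" rule is given below; the column names (card_id, review_time, rating) are assumptions made for this example, not the dataset's actual schema.

```python
# Illustrative sketch: keep only the first review of each card on each day.
# Column names are assumptions for this example, not the dataset's schema.
import pandas as pd

revlog = pd.DataFrame({
    "card_id": [1, 1, 1, 2, 2],
    "review_time": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 21:00",   # same-day pair for card 1
        "2024-01-03 08:00", "2024-01-01 10:00", "2024-01-05 12:00",
    ]),
    "rating": [3, 4, 3, 2, 3],
})

revlog["review_date"] = revlog["review_time"].dt.date
first_per_day = (
    revlog.sort_values("review_time")
          .groupby(["card_id", "review_date"], as_index=False)
          .first()
)
print(first_per_day)
```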

The following tables present the means and the 99% confidence intervals. The best result is highlighted in bold. The "Parameters" column shows the number of optimizable (trainable) parameters. If a parameter is a constant, it is not included.

Weighted by the number of reviews

| Model | Parameters | Log Loss | RMSE (bins) | AUC |
| --- | --- | --- | --- | --- |
| GRU-P-short | 297 | **0.320±0.0080** | **0.042±0.0013** | **0.710±0.0047** |
| GRU-P | 297 | 0.325±0.0081 | 0.043±0.0013 | 0.699±0.0046 |
| FSRS-5 preset | 19 | 0.328±0.0083 | 0.051±0.0015 | 0.702±0.0043 |
| FSRS-5 | 19 | 0.328±0.0082 | 0.052±0.0015 | 0.701±0.0044 |
| FSRS-rs | 19 | 0.328±0.0082 | 0.052±0.0015 | 0.700±0.0043 |
| FSRS-4.5 | 17 | 0.332±0.0083 | 0.054±0.0016 | 0.692±0.0041 |
| FSRS-5 binary | 15 | 0.334±0.0083 | 0.056±0.0016 | 0.679±0.0047 |
| FSRS-5 deck | 19 | 0.336±0.0085 | 0.056±0.0018 | 0.692±0.0044 |
| FSRS v4 | 17 | 0.338±0.0086 | 0.058±0.0017 | 0.689±0.0043 |
| DASH | 9 | 0.340±0.0086 | 0.063±0.0017 | 0.639±0.0046 |
| GRU | 39 | 0.343±0.0088 | 0.063±0.0017 | 0.673±0.0039 |
| DASH[MCM] | 9 | 0.340±0.0085 | 0.064±0.0018 | 0.640±0.0051 |
| DASH-short | 9 | 0.339±0.0084 | 0.066±0.0019 | 0.636±0.0050 |
| DASH[ACT-R] | 5 | 0.343±0.0087 | 0.067±0.0019 | 0.629±0.0049 |
| FSRS-5 pretrain | 4 | 0.344±0.0086 | 0.071±0.0020 | 0.690±0.0040 |
| FSRS v3 | 13 | 0.371±0.0099 | 0.073±0.0021 | 0.667±0.0047 |
| FSRS-5 default param. | 0 | 0.353±0.0087 | 0.081±0.0023 | 0.687±0.0039 |
| NN-17 | 39 | 0.38±0.027 | 0.081±0.0038 | 0.611±0.0043 |
| ACT-R | 5 | 0.362±0.0089 | 0.086±0.0024 | 0.534±0.0054 |
| AVG | 0 | 0.363±0.0090 | 0.088±0.0025 | 0.508±0.0046 |
| HLR | 3 | 0.41±0.012 | 0.105±0.0030 | 0.633±0.0050 |
| HLR-short | 3 | 0.44±0.013 | 0.116±0.0036 | 0.615±0.0062 |
| SM2-trainable | 6 | 0.44±0.012 | 0.119±0.0033 | 0.599±0.0050 |
| SM-2-short | 0 | 0.51±0.015 | 0.128±0.0038 | 0.593±0.0064 |
| SM-2 | 0 | 0.55±0.017 | 0.148±0.0041 | 0.600±0.0051 |
| Ebisu-v2 | 0 | 0.46±0.012 | 0.158±0.0038 | 0.594±0.0050 |
| Transformer | 127 | 0.45±0.012 | 0.166±0.0049 | 0.519±0.0065 |

Unweighted

| Model | Parameters | Log Loss | RMSE (bins) | AUC |
| --- | --- | --- | --- | --- |
| GRU-P-short | 297 | **0.346±0.0042** | **0.062±0.0011** | **0.699±0.0026** |
| GRU-P | 297 | 0.352±0.0042 | 0.063±0.0011 | 0.687±0.0025 |
| FSRS-rs | 19 | 0.356±0.0045 | 0.074±0.0012 | 0.698±0.0023 |
| FSRS-5 | 19 | 0.357±0.0043 | 0.074±0.0012 | **0.699±0.0023** |
| FSRS-5 preset | 19 | 0.358±0.0045 | 0.074±0.0012 | **0.699±0.0023** |
| FSRS-4.5 | 17 | 0.362±0.0045 | 0.076±0.0013 | 0.689±0.0023 |
| FSRS-5 binary | 15 | 0.367±0.0045 | 0.081±0.0014 | 0.671±0.0025 |
| FSRS-5 deck | 19 | 0.368±0.0047 | 0.081±0.0014 | 0.694±0.0023 |
| DASH | 9 | 0.368±0.0045 | 0.084±0.0013 | 0.631±0.0027 |
| FSRS v4 | 17 | 0.373±0.0048 | 0.084±0.0014 | 0.685±0.0023 |
| DASH-short | 9 | 0.368±0.0045 | 0.086±0.0014 | 0.622±0.0029 |
| DASH[MCM] | 9 | 0.369±0.0044 | 0.086±0.0014 | 0.634±0.0026 |
| GRU | 39 | 0.375±0.0047 | 0.086±0.0014 | 0.668±0.0023 |
| FSRS-5 pretrain | 4 | 0.369±0.0046 | 0.088±0.0013 | 0.695±0.0022 |
| DASH[ACT-R] | 5 | 0.373±0.0047 | 0.089±0.0016 | 0.624±0.0027 |
| NN-17 | 39 | 0.398±0.0049 | 0.101±0.0013 | 0.624±0.0023 |
| FSRS-5 default param. | 0 | 0.382±0.0047 | 0.102±0.0015 | 0.693±0.0022 |
| AVG | 0 | 0.394±0.0050 | 0.103±0.0016 | 0.500±0.0026 |
| ACT-R | 5 | 0.403±0.0055 | 0.107±0.0017 | 0.522±0.0024 |
| FSRS v3 | 13 | 0.436±0.0067 | 0.110±0.0020 | 0.661±0.0024 |
| HLR | 3 | 0.469±0.0073 | 0.128±0.0019 | 0.637±0.0026 |
| HLR-short | 3 | 0.493±0.0079 | 0.140±0.0021 | 0.611±0.0029 |
| Ebisu-v2 | 0 | 0.499±0.0078 | 0.163±0.0021 | 0.605±0.0026 |
| Transformer | 127 | 0.468±0.0059 | 0.167±0.0022 | 0.531±0.0030 |
| SM2-trainable | 6 | 0.58±0.012 | 0.170±0.0028 | 0.597±0.0025 |
| SM-2-short | 0 | 0.65±0.015 | 0.170±0.0028 | 0.590±0.0027 |
| SM-2 | 0 | 0.72±0.017 | 0.203±0.0030 | 0.603±0.0025 |

Averages weighted by the number of reviews are more representative of "best case" performance when plenty of data is available. Since almost all algorithms perform better when there's a lot of data to learn from, weighting by n(reviews) biases the average towards lower values.

Unweighted averages are more representative of "average case" performance. In reality, not every user will have hundreds of thousands of reviews, so the algorithm won't always be able to reach its full potential.
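
For example, with made-up numbers for three users, weighting by review counts pulls the average toward the users with the largest collections:

```python
# Made-up numbers: one algorithm's log loss for three users of very different sizes.
import numpy as np

log_losses = np.array([0.30, 0.40, 0.50])
n_reviews = np.array([100_000, 5_000, 500])

print("Unweighted mean:", log_losses.mean())                          # 0.40
print("Weighted mean:  ", np.average(log_losses, weights=n_reviews))  # ~0.31
```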

Superiority

The metrics presented above can be difficult to interpret. In order to make it easier to understand how algorithms perform relative to each other, the image below shows the percentage of users for whom algorithm A (row) has a lower RMSE than algorithm B (column). For example, GRU-P-short has a 95.9% superiority over the Transformer, meaning that for 95.9% of all collections in this benchmark, GRU-P-short can estimate the probability of recall more accurately than the Transformer. This is based on 9,999 collections.

(Figure: superiority matrix, 9,999 collections)

You may have noticed that FSRS-5 has a 99.0% superiority over SM-2, meaning that for 99.0% of users, RMSE will be lower with FSRS-5 than with SM-2. But please remember that SM-2 wasn’t designed to predict probabilities, and the only reason it does that in this benchmark is because extra formulas were added to it.
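
The percentages in the matrix come down to simple counting; a minimal sketch with hypothetical per-user RMSE arrays looks like this:

```python
# Minimal sketch of one "superiority" cell: the share of users for whom
# algorithm A reaches a lower RMSE than algorithm B on the same collection.
# The arrays are hypothetical and aligned by user.
import numpy as np

rmse_a = np.array([0.040, 0.055, 0.030, 0.080])
rmse_b = np.array([0.050, 0.050, 0.045, 0.120])

superiority = np.mean(rmse_a < rmse_b) * 100
print(f"A is more accurate than B for {superiority:.1f}% of users")
```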

Statistical significance

The figures below show two different measures of effect size comparing the RMSE between all pairs of algorithms:

  1. Wilcoxon signed-rank test r-values (effect sizes)
  2. Paired t-test Cohen's d values (effect sizes)

For both visualizations, the colors indicate the direction of the difference and the magnitude of the effect size.

The Wilcoxon test is non-parametric and considers both the sign and rank of differences between pairs, while the t-test assumes normality and provides Cohen's d as a standardized measure of the difference between means. Both tests are paired, comparing algorithms' performance on the same collections, but do not account for the varying number of reviews across collections. Therefore, while the test results are reliable for qualitative analysis, caution should be exercised when interpreting the specific magnitude of effects.

(Figures: Wilcoxon signed-rank test effect sizes, 9,999 collections; paired t-test Cohen's d, 9,999 collections)

You may have noticed that the two tests don't always agree on which algorithms are better or worse. This is because the Wilcoxon test only considers the sign and rank of differences, while the t-test also considers the magnitude of differences.
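
As a rough sketch of both tests on hypothetical per-user RMSE values (this is not the benchmark's code, and the Wilcoxon effect size below uses the plain normal approximation without tie corrections):

```python
# Rough sketch: paired significance tests on per-user RMSE for two algorithms.
import numpy as np
from scipy import stats

rmse_a = np.array([0.040, 0.055, 0.030, 0.080, 0.060])
rmse_b = np.array([0.050, 0.050, 0.045, 0.120, 0.072])
diff = rmse_a - rmse_b
n = len(diff)

# Wilcoxon signed-rank test; r = |Z| / sqrt(N) is a common effect-size estimate
# (Z from the normal approximation, ignoring tie/zero corrections).
wilcoxon_res = stats.wilcoxon(rmse_a, rmse_b)
z = (wilcoxon_res.statistic - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
r = abs(z) / np.sqrt(n)

# Paired t-test; Cohen's d for paired samples = mean difference / SD of differences.
ttest_res = stats.ttest_rel(rmse_a, rmse_b)
cohens_d = diff.mean() / diff.std(ddof=1)

print(f"Wilcoxon: p={wilcoxon_res.pvalue:.3f}, r={r:.2f}")
print(f"t-test:   p={ttest_res.pvalue:.3f}, d={cohens_d:.2f}")
```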

Default Parameters

FSRS-5:

0.40255, 1.18385, 3.173, 15.69105,
7.1949, 0.5345, 1.4604, 0.0046,
1.54575, 0.1192, 1.01925,
1.9395, 0.11, 0.29605, 2.2698,
0.2315, 2.9898,
0.51655, 0.6621
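
As a hedged illustration (not the benchmark's code), the default parameters can be plugged into the FSRS power forgetting curve; the DECAY and FACTOR constants below follow the published FSRS-4.5/FSRS-5 formula, and the first four parameters are the initial stability values for the Again/Hard/Good/Easy ratings.

```python
# Hedged illustration: predict retrievability with the FSRS-5 default parameters.
DEFAULT_PARAMETERS = [
    0.40255, 1.18385, 3.173, 15.69105,
    7.1949, 0.5345, 1.4604, 0.0046,
    1.54575, 0.1192, 1.01925,
    1.9395, 0.11, 0.29605, 2.2698,
    0.2315, 2.9898,
    0.51655, 0.6621,
]

DECAY = -0.5
FACTOR = 19 / 81  # chosen so that retrievability equals 0.9 when elapsed time equals stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted probability of recall after `elapsed_days`, given a memory stability in days."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


# Parameters 0-3 are the initial stabilities after the first rating (Again/Hard/Good/Easy).
s0_good = DEFAULT_PARAMETERS[2]
print(retrievability(elapsed_days=s0_good, stability=s0_good))  # 0.9
```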

Comparisons with SuperMemo 15/16/17

Please refer to the following repositories:

How to run the benchmark

Requirements

Dataset (tiny): https://github.com/open-spaced-repetition/fsrs-benchmark/issues/28#issuecomment-1876196288

Dependencies:

pip install -r requirements.txt

Commands

FSRS-5:

python script.py

FSRS-5 with default parameters:

python script.py --dry

FSRS-5 with only the first 4 parameters optimized:

python script.py --pretrain

FSRS-rs:

It requires fsrs_rs_python to be installed.

pip install fsrs_rs_python

Then run the following command:

python script.py --rust

Dev model in fsrs-optimizer:

python script.py --dev

Please place the fsrs-optimizer repository in the same directory as this repository.

Set the number of processes:

python script.py --processes 4

Save the raw predictions:

python script.py --raw

Save the detailed results:

python script.py --file

Save the analysis charts:

python script.py --plot

Benchmark FSRSv4/FSRSv3/HLR/LSTM/SM2:

python other.py --model FSRSv4

Please change the --model argument to FSRSv3, HLR, GRU, or SM2 to run the corresponding model.