Chess reinforcement learning by AlphaGo Zero methods.

This project is based on two main resources:

  1. DeepMind's Oct 19th publication: Mastering the Game of Go without Human Knowledge.
  2. The **great** Reversi implementation of DeepMind's ideas that @mokemokechicken developed in his repo: https://github.com/mokemokechicken/reversi-alpha-zero

Note: **This project is still under construction!**

News

DeepMind just released a new version of their AlphaGo Zero idea (now named AlphaZero), in which they master chess from scratch: https://arxiv.org/pdf/1712.01815.pdf. In fact, AlphaZero outperformed Stockfish at chess after just 4 hours (300k steps). Wow!

That paper brings new ideas we have to take into account for this project. It seems, for example, that two input planes for the model are not enough.
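
For context, AlphaZero feeds the network a much richer stack of planes (piece positions per colour, move history, castling rights, move counters, and so on). As a rough, hypothetical illustration of what a multi-plane encoder could look like (this is not the project's actual input representation), here is a minimal sketch using python-chess and NumPy:

```python
# Minimal sketch of a multi-plane board encoder (illustrative only; the
# function name and plane layout are assumptions, not this repo's code).
import chess
import numpy as np

def board_to_planes(board: chess.Board) -> np.ndarray:
    """Encode a position as 12 binary piece planes plus a side-to-move plane."""
    planes = np.zeros((13, 8, 8), dtype=np.float32)
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is None:
            continue
        # Planes 0-5: white P, N, B, R, Q, K; planes 6-11: the black pieces.
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        planes[plane, chess.square_rank(square), chess.square_file(square)] = 1.0
    planes[12, :, :] = 1.0 if board.turn == chess.WHITE else 0.0
    return planes

print(board_to_planes(chess.Board()).shape)  # (13, 8, 8)
```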

Environment

First "good" results

Using the new supervised learning step I created, I've been able to train a model to the point that it seems to be learning chess openings. It also seems the model is starting to avoid losing pieces naively.

Here you can see an example of a game I played against this model (AI plays black):

(screenshot: partida1)

Here we have a game from a model trained by @bame55 (AI plays white):

(screenshot: partida3)

This model plays this way after only 5 epoch iterations of the 'opt' worker; the 'eval' worker replaced the best model 4 times out of 5. At this moment the loss of the 'opt' worker is 5.1 (and it still seems to be converging well).

As I don't have a GPU, I had to evaluate ('eval') using only "self.simulation_num_per_move = 10" and only 10 play-data files for the 'opt' worker. I'm pretty sure that if anyone runs this on a good GPU with a more powerful configuration, the results after complete convergence would be really good.

New Supervised Learning Training Pipeline

I've added a new supervised learning (SL) pipeline step that uses human game files ("PGN" files, which you can find on the internet) as a play-data generator. This SL step was also used in the first, original version of AlphaGo, and maybe chess is a complex enough game that we have to pre-train the policy model before starting the self-play process (i.e., maybe chess is too complicated to learn from self-play alone).
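
To make the idea concrete, here is a hedged sketch (not the actual "sl" worker) of how PGN files can be turned into (position, move) pairs with python-chess; the real worker may store its play data in a different format:

```python
# Hedged sketch: extract (position, move) pairs from a PGN file with
# python-chess. This is illustrative; the repo's "sl" worker may differ.
import chess.pgn

def pgn_to_pairs(pgn_path):
    """Yield (FEN string, UCI move) pairs for every position of every game."""
    with open(pgn_path) as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:          # end of file
                break
            board = game.board()
            for move in game.mainline_moves():
                yield board.fen(), move.uci()
                board.push(move)

# Hypothetical usage with a PGN file placed in data/play_data:
# pairs = list(pgn_to_pairs("data/play_data/some_games.pgn"))
```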

Using the new SL process is as simple as running the new "sl" worker at the beginning instead of the "self" worker. Once the model has converged well enough on the SL play-data, we just stop the "sl" worker and start the "self" worker, so the model keeps improving from self-play data.

If you want to use this new SL step, you will have to download big PGN files (chess game databases) from the internet and put them into the "data/play_data" folder.

Supervised Learning

python src/chess_zero/run.py sl

New Distributed Training Pipeline

Now it's possible to train the model in a distributed way. The only thing needed is to use the new "--type distributed" parameter.

So, in order to contribute to the distributed team you just need to run the three workers locally like this:

python src/chess_zero/run.py self --type distributed (or python src/chess_zero/run.py sl --type distributed)
python src/chess_zero/run.py opt --type distributed
python src/chess_zero/run.py eval --type distributed

Modules

Reinforcement Learning

This AlphaGo Zero implementation consists of three workers: "self" (Self-Play, which generates play data with the BestModel), "opt" (Trainer, which trains the next-generation model on that play data) and "eval" (Evaluator, which pits the next-generation model against the BestModel).
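
As a hedged sketch of how the documented commands might be dispatched (the actual src/chess_zero/run.py may be organised differently):

```python
# Hedged sketch of a run.py-style entry point for the documented workers.
# The real src/chess_zero/run.py may differ; this only mirrors the CLI above.
import argparse

def main():
    parser = argparse.ArgumentParser(description="chess-zero worker launcher (sketch)")
    parser.add_argument("worker", choices=["self", "sl", "opt", "eval"])
    parser.add_argument("--type", default="normal",
                        help="configuration to use, e.g. mini or distributed")
    args = parser.parse_args()

    if args.worker == "self":
        print("self-play: generate play data with the BestModel")
    elif args.worker == "sl":
        print("supervised learning: generate play data from PGN files")
    elif args.worker == "opt":
        print("trainer: train the next-generation model on play data")
    else:
        print("evaluator: compare the next-generation model with the BestModel")

if __name__ == "__main__":
    main()
```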

Evaluation

For evaluation, you can play chess with the BestModel.

GUI

To set up ChessZero with a GUI, point the GUI to C0uci.bat (or rename it to .sh). For example, this is a screenshot of the random model playing through Arena's self-play feature:

(screenshot: capture)
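
Arena talks to engines through the UCI protocol. As a rough illustration (this is not the actual C0uci script), a minimal UCI loop that a GUI can detect looks like this:

```python
# Minimal, illustrative UCI loop (not the repo's actual C0uci entry point):
# just enough of the protocol for a GUI such as Arena to register the engine.
import sys

def uci_loop(pick_move):
    """pick_move(position_cmd) must return a move in UCI notation, e.g. 'e2e4'."""
    position_cmd = "position startpos"
    for line in sys.stdin:
        cmd = line.strip()
        if cmd == "uci":
            print("id name ChessZero-sketch")
            print("uciok", flush=True)
        elif cmd == "isready":
            print("readyok", flush=True)
        elif cmd.startswith("position"):
            position_cmd = cmd                      # remember the last position
        elif cmd.startswith("go"):
            print("bestmove " + pick_move(position_cmd), flush=True)
        elif cmd == "quit":
            break
```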

Data

If you want to train the model from the beginning, delete the saved model files and the generated play data (the "data/play_data" folder) under the data directory.

How to use

Setup

install libraries

pip install -r requirements.txt

If you want to use a GPU, also install:

pip install tensorflow-gpu

set environment variables

Create a .env file and write this in it:

KERAS_BACKEND=tensorflow
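
Assuming the project reads this file with python-dotenv (an assumption on my part), the variable gets picked up before Keras is imported, roughly like this:

```python
# Hedged sketch: load .env with python-dotenv (an assumption about how this
# project reads it) so KERAS_BACKEND is set before Keras is imported.
import os
from dotenv import load_dotenv

load_dotenv()                               # reads KERAS_BACKEND=tensorflow from .env
print(os.environ.get("KERAS_BACKEND"))      # -> "tensorflow"

import keras                                # Keras now picks the TensorFlow backend
```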

Basic Usage

To train the model, execute Self-Play, Trainer and Evaluator.

Self-Play

python src/chess_zero/run.py self

When executed, Self-Play will start using the BestModel. If the BestModel does not exist, a new random model will be created and become the BestModel.
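
The load-or-create behaviour described above can be sketched like this (the helper names and file path are hypothetical, not the repo's API):

```python
# Hedged sketch of the described behaviour: load the BestModel if it exists,
# otherwise create a random model and promote it. Names/paths are hypothetical.
import os

BEST_MODEL_WEIGHTS = "data/model/best_weights.h5"   # assumed location

def load_or_create_best(build_random_model, load_weights, save_weights):
    if os.path.exists(BEST_MODEL_WEIGHTS):
        return load_weights(BEST_MODEL_WEIGHTS)
    model = build_random_model()                 # freshly initialised network
    save_weights(model, BEST_MODEL_WEIGHTS)      # the random model becomes BestModel
    return model
```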

options

Trainer

python src/chess_zero/run.py opt

When executed, training will start. The base model is loaded from the latest saved next-generation model; if none exists, the BestModel is used. The trained model is saved every 2000 steps (mini-batches) after each epoch.
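
In pseudocode form, the described checkpointing behaviour looks roughly like this (the helper names are hypothetical):

```python
# Hedged sketch of the described trainer behaviour: start from the latest
# next-generation model (falling back to BestModel) and save every 2000 steps.
SAVE_EVERY_STEPS = 2000

def training_loop(load_latest_next_gen, load_best, save_next_gen, train_one_step):
    model = load_latest_next_gen() or load_best()   # fall back to the BestModel
    step = 0
    while True:                                     # runs until the worker is stopped
        train_one_step(model)                       # one mini-batch update
        step += 1
        if step % SAVE_EVERY_STEPS == 0:
            save_next_gen(model)                    # write a next-generation checkpoint
```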

options

Evaluator

python src/chess_zero/run.py eval

When executed, evaluation will start. It evaluates the BestModel and the latest next-generation model by playing about 200 games. If the next-generation model wins, it becomes the new BestModel.
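
The promotion rule can be sketched as follows; the 0.55 win-rate threshold is an assumption for illustration, not necessarily the value used in this repo:

```python
# Hedged sketch of the described evaluation rule: play games between the
# next-generation model and the BestModel and promote the challenger if it
# wins often enough. The 0.55 threshold here is an assumed example value.
GAME_NUM = 200

def challenger_becomes_best(play_game, challenger, best, threshold=0.55):
    # play_game(a, b) is assumed to return 1 if `a` wins, else 0.
    wins = sum(play_game(challenger, best) for _ in range(GAME_NUM))
    return wins / GAME_NUM >= threshold   # True -> challenger is the new BestModel
```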

options

Tips and Memo

GPU Memory

Usually a lack of memory causes warnings, not errors. If an error happens, try changing vram_frac in src/configs/mini.py:

self.vram_frac = 1.0

A smaller batch_size will reduce the memory usage of opt. Try changing TrainerConfig#batch_size in NormalConfig.
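
For reference, vram_frac presumably limits the share of GPU memory TensorFlow grabs; with a Keras-on-TensorFlow-1.x stack this is typically applied like the sketch below (which may not be exactly how this repo wires it up):

```python
# Hedged sketch: how a vram_frac-style setting is usually applied with Keras
# on TensorFlow 1.x. This may not match the repo's exact implementation.
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

vram_frac = 0.8                                            # e.g. use at most 80% of VRAM
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = vram_frac
set_session(tf.Session(config=config))                     # all Keras ops use this session
```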

Model Performance

The following table records the best models.

| Best model generation | Winning percentage vs. best model | Time spent (hours) | Note |
|---|---|---|---|
| 1 | - | - | |