Reinforcement Learning with Open Spiel (Part 2)


In my previous post, I implemented a 3D variant of the Connect Four game in Python for the Open Spiel framework. It is now time to train some AI on the game and see if it is actually any good against a human player.

Reinforcement Learning

If you have never heard of reinforcement learning before, here is a loose definition. Reinforcement learning is a machine learning technique by which an agent learns to perform a task from the experience it acquires while attempting that task. Depending on the outcome, a reward (or a penalty) is awarded to the agent. At first, the actions taken by the agent are almost random, but as it gains experience, it gradually learns which actions yield higher rewards in which situations. With sufficient training, the agent becomes capable of taking the action that maximizes its reward.

In our case, the reward is fairly simple to define:

  • +1 if the game is won
  • -1 if the game is lost

There are 16 actions the agent can choose from, one for each pole on which tokens can be placed. Of course, once four tokens have been placed on a particular pole, no more tokens can be placed there and the corresponding action is no longer available. If you recall the previous post, this mechanism is implemented in the _legal_actions method of the State class.
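
For reference, the idea behind that method boils down to checking whether each pole still has room. The snippet below is a simplified reconstruction rather than the exact code from the previous post; it assumes the board is stored as a 4x4x4 array indexed as board[height][row][column], with 0 marking an empty cell.

def _legal_actions(self, player):
    """Poles (0 to 15) that can still receive a token."""
    actions = []
    for pole in range(16):
        row, col = divmod(pole, 4)
        # A pole remains playable as long as its top cell (height 3) is empty.
        if self.board[3][row][col] == 0:
            actions.append(pole)
    return actions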

The training process

There are many reinforcement learning techniques out there, and Open Spiel provides generic implementations of many of them.

For our purposes, we will use the AlphaZero algorithm to train a neural network to predict which action is the best given a certain situation. Technically, we will train two neural networks: one for the first player and one for the second player. A generic training program is provided: alpha_zero.py.

To be able to use games implemented in Python with this program, the following import needs to be added:

diff --git a/open_spiel/python/examples/alpha_zero.py b/open_spiel/python/examples/alpha_zero.py
index 2a1b4769..87cd2834 100644
--- a/open_spiel/python/examples/alpha_zero.py
+++ b/open_spiel/python/examples/alpha_zero.py
@@ -17,6 +17,7 @@
 from absl import app
 from absl import flags
 
+from open_spiel.python import games
 from open_spiel.python.algorithms.alpha_zero import alpha_zero
 from open_spiel.python.algorithms.alpha_zero import model as model_lib
 from open_spiel.python.utils import spawn

A quick look at the options available

There are a number of options that can be specified for the alpha_zero.py program. Here we specify our game through the string we used to register it. We also specify a path to which the parameters of the neural networks will be saved. Finally, the type of model to use can be specified with the --nn_model option.

If we take a look at the list of options for the alpha_zero.py program, we obtain the following information concerning the neural network:

$ python open_spiel/python/examples/alpha_zero.py --help
...
  --nn_depth: How deep should the network be.
    (default: '10')
    (an integer)
  --nn_model: <mlp|conv2d|resnet>: What type of model should be used?.
    (default: 'resnet')
  --nn_width: How wide should the network be.
    (default: '128')
    (an integer)
...

The nn_depth and nn_width options define the size of the neural network, while the nn_model option influences its architecture.

  • mlp constructs a multi-layer perceptron, with all the nodes of one layer connected to all the nodes of the next layer.
  • conv2d constructs a neural network using 2D convolutions. This option is not compatible with our game, since the shape of the input (the shape of the board in this case) is three-dimensional.
  • resnet constructs a residual network, which also relies on convolutional layers and therefore has the same limitation.

For our purposes, we will use the mlp option.
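
To give an idea of what the mlp model computes, here is a purely illustrative numpy sketch with random, untrained weights (not the actual Open Spiel implementation): the 4x4x4 board is flattened into a vector of 64 values, passed through nn_depth fully connected layers of nn_width nodes each, and finally split into a policy head (one output per pole) and a value head estimating the expected outcome of the game.

import numpy as np

def mlp_policy_value(board, depth=10, width=128, num_actions=16, seed=0):
    """Toy forward pass of an MLP with a policy head and a value head."""
    rng = np.random.default_rng(seed)            # random, untrained weights
    h = np.asarray(board, dtype=float).ravel()   # flatten the 4x4x4 board -> 64 values
    for _ in range(depth):                       # "depth" fully connected layers
        w = rng.normal(0.0, 0.1, (h.size, width))
        b = rng.normal(0.0, 0.1, width)
        h = np.maximum(0.0, h @ w + b)           # ReLU activation
    policy_logits = h @ rng.normal(0.0, 0.1, (width, num_actions))
    value = float(np.tanh(h @ rng.normal(0.0, 0.1, width)))  # scalar in [-1, 1]
    policy = np.exp(policy_logits - policy_logits.max())
    return policy / policy.sum(), value          # action probabilities, position value

probs, value = mlp_policy_value(np.zeros((4, 4, 4)))  # evaluate an empty board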

Training

The command I chose to launch the learning process is the following:

python open_spiel/python/examples/alpha_zero.py \
  --game="python_connect_four_3d" \
  --path="${HOME}/TrainedModels/ConnectFour3D" \
  --nn_model=mlp

The training process will then update a “checkpoint” every 30 minutes or so in the directory specified through the --path option. In my case that resolves to directory /home/patrick/TrainedModels/ConnectFour3D. Let’s take a look at the contents of this directory.

After about 40 minutes of training, the checkpoint is updated for the first time. Below is the output of the program roughly an hour after it was started (warning messages were removed for brevity).

patrick@host:~/Projects/open_spiel$ python open_spiel/python/examples/alpha_zero.py --game="python_connect
_four_3d" --path="/home/patrick/TrainedModels/ConnectFour3D" --nn_model=mlp
...
Starting game python_connect_four_3d
Writing logs and checkpoints to: /home/patrick/TrainedModels/ConnectFour3D
Model type: mlp(128, 10)
actor-0 started
actor-1 started
learner started
[2025-02-23 20:36:28.493] Initializing model
evaluator-0 started
...
[2025-02-23 20:36:29.017] Model type: mlp(128, 10)
[2025-02-23 20:36:29.017] Model size: 208529 variables
[2025-02-23 20:36:29.087] Initial checkpoint: /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
I0223 20:36:36.233901 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
I0223 20:36:38.082213 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
I0223 20:36:44.087164 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint-0
[2025-02-23 21:12:18.550] Step: 1
[2025-02-23 21:12:18.550] Collected 21858 states from 643 games, 9.9 states/s. 4.9 states/(s*actor), game length: 34.0
[2025-02-23 21:12:18.550] Buffer size: 21858. States seen: 21858
[2025-02-23 21:12:19.116] Losses(total: 3.639, policy: 2.628, value: 0.938, l2: 0.074)
[2025-02-23 21:12:19.116] Checkpoint saved: /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
[2025-02-23 21:12:19.117]
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
I0223 21:12:20.853002 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
I0223 21:12:25.912406 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
I0223 21:28:55.785191 131313476123520 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1

And looking in the directory where the models and the logs are saved, we can see the first checkpoint created at 20:36 (files checkpoint-0.*), and the files created as part of the later checkpoint at 21:12 (files checkpoint--1.*).

patrick@host:~/TrainedModels/ConnectFour3D$ ls -l
total 5960
-rw-rw-r-- 1 patrick patrick     252 Feb 23 21:12 checkpoint
-rw-rw-r-- 1 patrick patrick 2502356 Feb 23 20:36 checkpoint-0.data-00000-of-00001
-rw-rw-r-- 1 patrick patrick    2967 Feb 23 20:36 checkpoint-0.index
-rw-rw-r-- 1 patrick patrick  329309 Feb 23 20:36 checkpoint-0.meta
-rw-rw-r-- 1 patrick patrick 2502356 Feb 23 21:12 checkpoint--1.data-00000-of-00001
-rw-rw-r-- 1 patrick patrick    2967 Feb 23 21:12 checkpoint--1.index
-rw-rw-r-- 1 patrick patrick  329309 Feb 23 21:12 checkpoint--1.meta
-rw-rw-r-- 1 patrick patrick     649 Feb 23 20:36 config.json
-rw-rw-r-- 1 patrick patrick    2837 Feb 23 21:12 learner.jsonl
-rw-rw-r-- 1 patrick patrick  193830 Feb 23 21:31 log-actor-0.txt
-rw-rw-r-- 1 patrick patrick  192644 Feb 23 21:31 log-actor-1.txt
-rw-rw-r-- 1 patrick patrick    5190 Feb 23 21:28 log-evaluator-0.txt
-rw-rw-r-- 1 patrick patrick     725 Feb 23 21:12 log-learner.txt

With the AlphaZero program, the AI plays against itself and gradually improves its decision making, and it should simply keep improving its performance over time. We can just leave it running overnight and stop it the next morning with Ctrl-C.

Note: I do not know why the checkpoint--1 has two dashes instead of just one like checkpoint-0. Either way, the files for checkpoint--1 will be periodically overwritten as the training program saves the current state of the model.

Some observations

Every time a checkpoint is saved, a few basic statistics are indicated on the terminal. Among these is the game length. Note that the maximum game length is 64 moves.

Step   Game length
   1   34.0
   2   26.0
   3   19.9
   4   18.6
   5   17.1
   6   16.2
   7   17.1
   8   17.0
   9   16.9
  10   17.1
  11   15.7
  12   16.5
  13   15.8
  14   16.6
  15   16.0
  16   17.0
  17   18.2
  18   18.1
  19   18.2
  20   17.7
  21   17.8
  22   18.4
  23   17.3
  24   17.2
  25   17.6
  26   16.8
  27   17.0

At first, the games are quite long. This is because the agents’ behavior is initially random, leading to long games as neither player “knows” how to win yet. But as they learn, the games get much shorter, since the agents reach a victory more quickly. I would have expected the games to then get longer again as the agents become more proficient at the game, but alas this did not happen within the more than 12 hours I ran the algorithm for.
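
For the record, these per-step statistics also end up in the learner.jsonl file listed above, so they can be extracted without scraping the terminal. A small script along the following lines should do; the key names are an assumption on my part, so print one record of your own learner.jsonl and adjust them if needed.

import json

# Path used in this post; adjust to your own --path value.
LOG = "/home/patrick/TrainedModels/ConnectFour3D/learner.jsonl"

with open(LOG) as f:
    for line in f:
        record = json.loads(line)
        # Assumed key names: inspect record.keys() and adapt if they differ.
        print(record.get("step"), record.get("game_length"))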

First games against an AI

Now that we have a neural network capable of deciding which move is best in a given situation, we can test it and check how effective it is. To do this, we are going to use the same program as in my previous post: open_spiel/python/examples/mcts.py.

Last time, we used this program to launch player versus player games, or games of random agents against each other. This time, we are going to specify the agent trained with AlphaZero as one of the players.

Random opponent

As a first experiment, let’s pit our trained AI against a random opponent. The command needed is the following:

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="random" --player2="az" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1"

The choices made by the random agent and by the agent trained with AlphaZero are then shown on the terminal.

patrick@host:~/Projects/open_spiel$ python open_spiel/python/examples/mcts.py \
  --game=python_connect_four_3d \
  --player1=random --player2=az \
  --az_path=/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1 
...
INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
I0224 11:09:24.782809 136523931904896 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
Initial state:
....
....
....
....

....
....
....
....

....
....
....
....

....
....
....
....


Player 0 sampled action: 14(0,3,2)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

....
....
....
..o.


Player 1 sampled action: 10(0,2,2)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

....
....
..x.
..o.


Player 0 sampled action: 13(0,3,1)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

....
....
..x.
.oo.


Player 1 sampled action: 5(0,1,1)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

....
.x..
..x.
.oo.


Player 0 sampled action: 1(0,0,1)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

.o..
.x..
..x.
.oo.


Player 1 sampled action: 6(0,1,2)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

.o..
.xx.
..x.
.oo.


Player 0 sampled action: 8(0,2,0)
Next state:
....
....
....
....

....
....
....
....

....
....
....
....

.o..
.xx.
o.x.
.oo.


Player 1 sampled action: 5(1,1,1)
Next state:
....
....
....
....

....
....
....
....

....
.x..
....
....

.o..
.xx.
o.x.
.oo.


Player 0 sampled action: 14(1,3,2)
Next state:
....
....
....
....

....
....
....
....

....
.x..
....
..o.

.o..
.xx.
o.x.
.oo.


Player 1 sampled action: 5(2,1,1)
Next state:
....
....
....
....

....
.x..
....
....

....
.x..
....
..o.

.o..
.xx.
o.x.
.oo.


Player 0 sampled action: 2(0,0,2)
Next state:
....
....
....
....

....
.x..
....
....

....
.x..
....
..o.

.oo.
.xx.
o.x.
.oo.


Player 1 sampled action: 5(3,1,1)
Next state:
....
.x..
....
....

....
.x..
....
....

....
.x..
....
..o.

.oo.
.xx.
o.x.
.oo.


Returns: -1.0 1.0 , Game actions: 14(0,3,2) 10(0,2,2) 13(0,3,1) 5(0,1,1) 1(0,0,1) 6(0,1,2) 8(0,2,0) 5(1,1,1) 14(1,3,2) 5(2,1,1) 2(0,0,2) 5(3,1,1)
Number of games played: 1
Number of distinct games played: 1
Players: random az
Overall wins [0, 1]
Overall returns [-1.0, 1.0]

The game ends in a victory for the agent trained by AlphaZero in just 12 turns.

The game above had the AlphaZero agent play as the second player, but the roles can of course be swapped by adjusting the command:

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="az" --player2="random" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1"

We can also ask the mcts.py program to run many games to check how well the agent performs over a larger sample, reducing the terminal output to the strict minimum.

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="az" --player2="random" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1" \
  --quiet --num_games=20

The output is then the following:

INFO:tensorflow:Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
I0224 11:20:27.679396 132263441468288 saver.py:1417] Restoring parameters from /home/patrick/TrainedModels/ConnectFour3D/checkpoint--1
Returns: 1.0 -1.0 , Game actions: 10(0,2,2) 10(1,2,2) 9(0,2,1) 15(0,3,3) 11(0,2,3) 10(2,2,2) 8(0,2,0)
Returns: 1.0 -1.0 , Game actions: 10(0,2,2) 14(0,3,2) 5(0,1,1) 13(0,3,1) 15(0,3,3) 12(0,3,0) 0(0,0,0)
...
Returns: 1.0 -1.0 , Game actions: 10(0,2,2) 11(0,2,3) 5(0,1,1) 3(0,0,3) 5(1,1,1) 7(0,1,3) 15(0,3,3) 6(0,1,2) 0(0,0,0)
Returns: 1.0 -1.0 , Game actions: 10(0,2,2) 0(0,0,0) 10(1,2,2) 2(0,0,2) 10(2,2,2) 10(3,2,2) 3(0,0,3) 6(0,1,2) 9(0,2,1) 15(0,3,3) 9(1,2,1) 4(0,1,0) 8(0,2,0) 0(1,0,0) 11(0,2,3)
Number of games played: 20
Number of distinct games played: 20
Players: az random
Overall wins [20, 0]
Overall returns [20.0, -20.0]

For each game, the moves taken by each player are displayed on a single line with the “returns” of the game. The agent trained with AlphaZero wins systematically.

Notice also that the AlphaZero agent always picks the same opening move: 10(0,2,2). This is because, as a result of the training process, the “best” possible move when confronted with an empty board is judged to be on one of the center poles. If we want to force the first player to start with a random action, we can add the --random_first option to the program. Despite the random start, the AlphaZero agent still wins all games against the random agent.
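
For reference, the command with a random opening looks something like this:

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="az" --player2="random" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1" \
  --quiet --num_games=20 --random_first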

Against itself

We can also make the agent play against itself by selecting it for both the --player1 and --player2 options. By doing so, we can see which player is more likely to win depending on the first move.
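
Such a run can be launched with a command along these lines, using --random_first so that the opening varies between games and a --num_games value simply chosen large enough for every opening to have a good chance of appearing:

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="az" --player2="az" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1" \
  --quiet --random_first --num_games=200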

First, due to the symmetries in the game, there are only 3 fundamental kinds of move available to the first player:

  • corner move: 0(0,0,0), 3(0,0,3), 12(0,3,0), and 15(0,3,3)
  • center move: 5(0,1,1), 6(0,1,2), 9(0,2,1) and 10(0,2,2)
  • edge move: 1(0,0,1), 2(0,0,2), 4(0,1,0), 7(0,1,3), 8(0,2,0), 11(0,2,3), 13(0,3,1), 14(0,3,2)
Types of first moves, "corner" in green, "edge" in yellow, and "center" in blue.

Since the agents’ decisions are fully deterministic given a state of the board, we only need to launch enough games for every possible start to appear at least once. After enough games have been played, we obtain the 16 “fundamentally different” games. Here they are, grouped by type of initial move:

Corner move first:

Returns: -1.0 1.0 , Game actions: 0(0,0,0) 10(0,2,2) 5(0,1,1) 6(0,1,2) 7(0,1,3) 10(1,2,2) 3(0,0,3) 10(2,2,2) 10(3,2,2) 2(0,0,2) 14(0,3,2) 13(0,3,1) 15(0,3,3) 11(0,2,3) 9(0,2,1) 14(1,3,2) 9(1,2,1) 2(1,0,2) 6(1,1,2) 3(1,0,3) 9(2,2,1) 9(3,2,1) 8(0,2,0) 2(2,0,2) 2(3,0,2) 14(2,3,2) 6(2,1,2) 14(3,3,2) 5(1,1,1) 5(2,1,1) 4(0,1,0) 12(0,3,0) 13(1,3,1) 3(2,0,3) 12(1,3,0) 3(3,0,3) 6(3,1,2) 11(1,2,3) 11(2,2,3) 11(3,2,3) 12(2,3,0) 12(3,3,0) 8(1,2,0) 7(1,1,3) 15(1,3,3) 15(2,3,3) 7(2,1,3) 7(3,1,3) 15(3,3,3) 8(2,2,0) 4(1,1,0) 0(1,0,0) 0(2,0,0) 0(3,0,0) 8(3,2,0) 4(2,1,0) 13(2,3,1) 13(3,3,1) 4(3,1,0) 5(3,1,1) 1(0,0,1) 1(1,0,1)
Returns: 1.0 -1.0 , Game actions: 3(0,0,3) 5(0,1,1) 10(0,2,2) 5(1,1,1) 6(0,1,2) 5(2,1,1) 5(3,1,1) 6(1,1,2) 9(0,2,1) 6(2,1,2) 12(0,3,0)
Returns: -1.0 1.0 , Game actions: 12(0,3,0) 10(0,2,2) 5(0,1,1) 10(1,2,2) 7(0,1,3) 10(2,2,2) 10(3,2,2) 9(0,2,1) 3(0,0,3) 0(0,0,0) 5(1,1,1) 2(0,0,2) 4(0,1,0) 6(0,1,2) 14(0,3,2) 11(0,2,3) 8(0,2,0) 14(1,3,2) 15(0,3,3) 13(0,3,1) 9(1,2,1) 6(1,1,2) 2(1,0,2) 14(2,3,2) 14(3,3,2) 6(2,1,2) 2(2,0,2) 6(3,1,2)
Returns: 1.0 -1.0 , Game actions: 15(0,3,3) 10(0,2,2) 5(0,1,1) 9(0,2,1) 3(0,0,3) 11(0,2,3) 8(0,2,0) 5(1,1,1) 10(1,2,2) 0(0,0,0) 10(2,2,2) 0(1,0,0) 9(1,2,1) 0(2,0,0) 0(3,0,0) 5(2,1,1) 10(3,2,2) 11(1,2,3) 11(2,2,3) 11(3,2,3) 13(0,3,1) 6(0,1,2) 2(0,0,2) 15(1,3,3) 4(0,1,0) 15(2,3,3) 8(1,2,0) 8(2,2,0) 4(1,1,0) 1(0,0,1) 4(2,1,0) 4(3,1,0) 12(0,3,0)

Edge move first:

Returns: -1.0 1.0 , Game actions: 1(0,0,1) 10(0,2,2) 5(0,1,1) 10(1,2,2) 3(0,0,3) 10(2,2,2) 10(3,2,2) 6(0,1,2) 7(0,1,3) 2(0,0,2) 14(0,3,2) 9(0,2,1) 15(0,3,3) 11(0,2,3) 8(0,2,0) 5(1,1,1) 1(1,0,1) 11(1,2,3) 1(2,0,1) 1(3,0,1) 11(2,2,3) 9(1,2,1) 8(1,2,0) 9(2,2,1) 9(3,2,1) 8(2,2,0) 8(3,2,0) 11(3,2,3) 13(0,3,1) 12(0,3,0) 0(0,0,0) 15(1,3,3) 0(1,0,0) 3(1,0,3) 7(1,1,3) 12(1,3,0) 6(1,1,2) 12(2,3,0) 12(3,3,0) 6(2,1,2) 4(0,1,0) 3(2,0,3)
Returns: 1.0 -1.0 , Game actions: 2(0,0,2) 10(0,2,2) 5(0,1,1) 9(0,2,1) 3(0,0,3) 5(1,1,1) 15(0,3,3) 0(0,0,0) 13(0,3,1) 11(0,2,3) 8(0,2,0) 11(1,2,3) 3(1,0,3) 10(1,2,2) 12(0,3,0) 14(0,3,2) 4(0,1,0) 10(2,2,2) 10(3,2,2) 11(2,2,3) 11(3,2,3) 3(2,0,3) 8(1,2,0) 15(1,3,3) 0(1,0,0) 9(1,2,1) 8(2,2,0) 8(3,2,0) 9(2,2,1) 15(2,3,3) 15(3,3,3) 14(1,3,2) 4(1,1,0) 12(1,3,0) 13(1,3,1) 14(2,3,2) 14(3,3,2) 5(2,1,1) 0(2,0,0) 12(2,3,0) 13(2,3,1) 13(3,3,1) 4(2,1,0) 7(0,1,3) 4(3,1,0)
Returns: -1.0 1.0 , Game actions: 4(0,1,0) 10(0,2,2) 5(0,1,1) 9(0,2,1) 5(1,1,1) 5(2,1,1) 15(0,3,3) 9(1,2,1) 3(0,0,3) 9(2,2,1) 9(3,2,1) 11(0,2,3) 8(0,2,0) 0(0,0,0) 11(1,2,3) 10(1,2,2) 11(2,2,3) 10(2,2,2) 10(3,2,2) 3(1,0,3) 13(0,3,1) 8(1,2,0) 2(0,0,2) 8(2,2,0) 8(3,2,0) 11(3,2,3) 0(1,0,0) 12(0,3,0) 13(1,3,1) 12(1,3,0) 13(2,3,1) 13(3,3,1) 2(1,0,2) 12(2,3,0) 12(3,3,0) 6(0,1,2) 6(1,1,2) 6(2,1,2) 1(0,0,1) 3(2,0,3)
Returns: 1.0 -1.0 , Game actions: 7(0,1,3) 5(0,1,1) 10(0,2,2) 5(1,1,1) 9(0,2,1) 6(0,1,2) 5(2,1,1) 0(0,0,0) 10(1,2,2) 3(0,0,3) 6(1,1,2) 1(0,0,1) 2(0,0,2) 5(3,1,1) 11(0,2,3) 8(0,2,0) 10(2,2,2) 10(3,2,2) 8(1,2,0) 4(0,1,0) 12(0,3,0) 8(2,2,0) 9(1,2,1) 11(1,2,3) 8(3,2,0) 9(2,2,1) 0(1,0,0) 0(2,0,0) 6(2,1,2) 2(1,0,2) 2(2,0,2) 12(1,3,0) 2(3,0,2) 14(0,3,2) 6(3,1,2) 12(2,3,0) 12(3,3,0) 9(3,2,1) 13(0,3,1) 0(3,0,0) 13(1,3,1) 1(1,0,1) 1(2,0,1) 1(3,0,1) 14(1,3,2) 3(1,0,3) 14(2,3,2)
Returns: 1.0 -1.0 , Game actions: 8(0,2,0) 5(0,1,1) 10(0,2,2) 9(0,2,1) 5(1,1,1) 4(0,1,0) 6(0,1,2) 1(0,0,1) 13(0,3,1) 6(1,1,2) 0(0,0,0) 9(1,2,1) 9(2,2,1) 5(2,1,1) 9(3,2,1) 12(0,3,0) 12(1,3,0) 13(1,3,1) 13(2,3,1) 13(3,3,1) 12(2,3,0) 12(3,3,0) 6(2,1,2) 0(1,0,0) 3(0,0,3) 0(2,0,0) 0(3,0,0) 10(1,2,2) 8(1,2,0) 6(3,1,2) 10(2,2,2) 10(3,2,2) 1(1,0,1) 4(1,1,0) 2(0,0,2) 14(0,3,2) 2(1,0,2) 1(2,0,1) 8(2,2,0) 8(3,2,0) 2(2,0,2) 2(3,0,2) 3(1,0,3) 14(1,3,2) 3(2,0,3)
Returns: -1.0 1.0 , Game actions: 11(0,2,3) 10(0,2,2) 5(0,1,1) 9(0,2,1) 5(1,1,1) 5(2,1,1) 5(3,1,1) 10(1,2,2) 10(2,2,2) 0(0,0,0) 7(0,1,3) 0(1,0,0) 4(0,1,0) 6(0,1,2) 15(0,3,3) 3(0,0,3) 12(0,3,0) 2(0,0,2) 1(0,0,1) 14(0,3,2)
Returns: 1.0 -1.0 , Game actions: 13(0,3,1) 5(0,1,1) 10(0,2,2) 6(0,1,2) 6(1,1,2) 5(1,1,1) 9(0,2,1) 5(2,1,1) 5(3,1,1) 6(2,1,2) 4(0,1,0) 0(0,0,0) 10(1,2,2) 0(1,0,0) 7(0,1,3) 0(2,0,0) 0(3,0,0) 8(0,2,0) 9(1,2,1) 8(1,2,0) 10(2,2,2) 10(3,2,2) 11(0,2,3) 8(2,2,0) 8(3,2,0) 9(2,2,1) 4(1,1,0) 3(0,0,3) 2(0,0,2) 3(1,0,3) 4(2,1,0) 4(3,1,0) 3(2,0,3) 2(1,0,2) 11(1,2,3) 1(0,0,1) 1(1,0,1) 13(1,3,1) 11(2,2,3) 11(3,2,3) 1(2,0,1) 13(2,3,1) 13(3,3,1) 9(3,2,1) 6(3,1,2) 2(2,0,2) 2(3,0,2) 12(0,3,0) 7(1,1,3) 12(1,3,0) 12(2,3,0) 15(0,3,3) 7(2,1,3) 7(3,1,3) 12(3,3,0) 14(0,3,2) 3(3,0,3) 1(3,0,1) 14(1,3,2) 15(1,3,3) 15(2,3,3)
Returns: 1.0 -1.0 , Game actions: 14(0,3,2) 10(0,2,2) 5(0,1,1) 6(0,1,2) 7(0,1,3) 10(1,2,2) 3(0,0,3) 10(2,2,2) 10(3,2,2) 7(1,1,3) 7(2,1,3) 3(1,0,3) 9(0,2,1) 8(0,2,0) 15(0,3,3) 8(1,2,0) 11(0,2,3)

Center move first:

Returns: 1.0 -1.0 , Game actions: 5(0,1,1) 10(0,2,2) 3(0,0,3) 9(0,2,1) 5(1,1,1) 5(2,1,1) 15(0,3,3) 11(0,2,3) 8(0,2,0) 0(0,0,0) 6(0,1,2) 0(1,0,0) 9(1,2,1) 0(2,0,0) 0(3,0,0) 10(1,2,2) 4(0,1,0) 11(1,2,3) 7(0,1,3)
Returns: 1.0 -1.0 , Game actions: 6(0,1,2) 10(0,2,2) 5(0,1,1) 9(0,2,1) 5(1,1,1) 5(2,1,1) 15(0,3,3) 13(0,3,1) 5(3,1,1) 11(0,2,3) 8(0,2,0) 0(0,0,0) 9(1,2,1) 0(1,0,0) 9(2,2,1) 0(2,0,0) 0(3,0,0) 2(0,0,2) 10(1,2,2) 1(0,0,1) 3(0,0,3) 2(1,0,2) 11(1,2,3) 8(1,2,0) 2(2,0,2) 1(1,0,1) 3(1,0,3) 3(2,0,3) 11(2,2,3) 1(2,0,1) 1(3,0,1) 8(2,2,0) 3(3,0,3) 2(3,0,2) 11(3,2,3) 10(2,2,2) 4(0,1,0) 7(0,1,3) 6(1,1,2) 15(1,3,3) 15(2,3,3) 6(2,1,2) 15(3,3,3) 14(0,3,2) 10(3,2,2)
Returns: 1.0 -1.0 , Game actions: 9(0,2,1) 10(0,2,2) 5(0,1,1) 6(0,1,2) 3(0,0,3) 14(0,3,2) 2(0,0,2) 10(1,2,2) 15(0,3,3) 10(2,2,2) 10(3,2,2) 13(0,3,1) 7(0,1,3) 0(0,0,0) 11(0,2,3)
Returns: 1.0 -1.0 , Game actions: 10(0,2,2) 9(0,2,1) 3(0,0,3) 5(0,1,1) 6(0,1,2) 1(0,0,1) 13(0,3,1) 5(1,1,1) 5(2,1,1) 7(0,1,3) 6(1,1,2) 7(1,1,3) 7(2,1,3) 6(2,1,2) 15(0,3,3) 4(0,1,0) 7(3,1,3) 0(0,0,0) 10(1,2,2) 8(0,2,0) 12(0,3,0) 15(1,3,3) 14(0,3,2)

Here we can see some inconsistencies. Games that start with a “corner” move (0, 3, 12, or 15) should all end with the same outcome. The same goes for the “edge” moves (1, 2, 4, 7, 8, 11, 13, 14) and for the “center” moves (5, 6, 9, or 10). But that is not what we see.

All four games that start with a center move end in a win for the first player. For the games that start with edge moves and corner moves, the win/loss distribution between the first and second player is respectively 5/3 and 2/2.

Perhaps this situation could be resolved by also feeding the learning mechanism the mirrored and rotated versions of the games that are played.
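
As a rough illustration of the idea, the sketch below generates the eight symmetric variants (rotations and mirror images of the 4x4 grid of poles) of a sequence of pole indices; each self-play game could then be added to the training buffer eight times instead of once. This is only a sketch of the data-augmentation idea, not something alpha_zero.py does out of the box.

def transform_pole(pole, rotations, mirror):
    """Map a pole index (0-15) through a mirror and a number of 90-degree rotations."""
    row, col = divmod(pole, 4)
    if mirror:                      # reflect the grid left-right
        col = 3 - col
    for _ in range(rotations):      # rotate the grid by 90 degrees
        row, col = col, 3 - row
    return row * 4 + col

def symmetric_games(actions):
    """The 8 symmetric versions of a game, given as a list of pole indices."""
    return [[transform_pole(a, rotations, mirror) for a in actions]
            for mirror in (False, True) for rotations in range(4)]

# Example: the opening the trained agent always picks maps onto the four center poles.
print(symmetric_games([10]))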

Against me

To launch a game of this AI against a human player, we simply need to specify a human player in the options given to the mcts.py program.

python open_spiel/python/examples/mcts.py \
  --game="python_connect_four_3d" \
  --player1="az" --player2="human" \
  --az_path="/home/patrick/TrainedModels/ConnectFour3D/checkpoint--1"

We can then enter which move to make, wait a brief moment for the AI to make its move, and continue playing. Although I could observe that the AI was trained well enough to block obvious traps, I found it difficult to play the game from the terminal. Indeed, it is a little hard to “see” all the possible winning combinations, especially those that span different levels of the board. This resulted in a number of embarrassing defeats where I would miss an alignment from my AI opponent and fail to block it. I unfortunately do not have a physical replica of the game at hand, so I was left with a single option: confront the master himself.

The final boss

To be able to confront the master with an AI, I had to make it somewhat portable. So I reinstalled Open Spiel on my laptop and left it training for a couple of hours before making my way to the bar. The master kindly agreed to play against this AI. What follows is the list of moves made by each player (the master plays first with o, the AI plays second with x).

 3(0,0,3)
 0(0,0,0)
12(0,3,0)
10(0,2,2)
10(1,2,2)
11(0,2,3)
15(0,3,3)
 3(1,0,3)
 8(0,2,0)
13(0,3,1)
13(1,3,1)
 9(0,2,1)
 9(1,2,1)
 3(2,0,3)
 3(3,0,3)
 6(0,1,2)
14(0,3,2)
 9(2,2,1)
10(2,2,2)
11(1,2,3)
11(2,2,3)
11(3,2,3)
 6(1,1,2)
 6(2,1,2)
14(1,3,2)
 8(1,2,0)
 2(0,0,2)
 2(1,0,2)
15(1,3,3)
12(1,3,0)
12(2,3,0)
 8(2,2,0)
12(3,3,0)
13(2,3,1)
 0(1,0,0)
 7(0,1,3)
 5(0,1,1)
 5(1,1,1)
 5(2,1,1)
 4(0,1,0)
 7(1,1,3)
 7(2,1,3)
 7(3,1,3)
 6(3,1,2)
 0(2,0,0)

Confusing? Here is the state of the board after these first 45 moves (the AI, x, is due to play next):

...o
..xo
...x
o...

o..x
.oxx
xxoo
ox..

o.xx
.xoo
xoox
xooo

x.oo
xoxx
oxxx
oxoo

It is a little tricky to see at first glance, but the AI is in a bind here. Indeed, there are two ways in which the master can win regardless of the next action taken:

  • On the third level of the board, the diagonal only needs a third token on pole 15 to be completed
  • On the right-most side of the board, there is a rising diagonal which only needs a fourth token on pole 15 to be completed

As a result, if the AI plays 15(2,3,3), our human competitor can win by playing 15(3,3,3) on the next move. If the AI plays something else, our human competitor can win by playing 15(2,3,3).
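
Out of curiosity, this double threat can be checked programmatically by replaying the 45 moves above with the game implementation from part 1 and looking one move ahead for each possible reply. Here is a sketch, assuming the game is registered as python_connect_four_3d and that importing open_spiel.python.games makes it visible to pyspiel, as in the training script:

import pyspiel
from open_spiel.python import games  # the import registers the Python games

# The 45 moves listed above (pole indices only, in order of play).
MOVES = [3, 0, 12, 10, 10, 11, 15, 3, 8, 13, 13, 9, 9, 3, 3,
         6, 14, 9, 10, 11, 11, 11, 6, 6, 14, 8, 2, 2, 15, 12,
         12, 8, 12, 13, 0, 7, 5, 5, 5, 4, 7, 7, 7, 6, 0]

game = pyspiel.load_game("python_connect_four_3d")
state = game.new_initial_state()
for action in MOVES:
    state.apply_action(action)

# The AI (second player, x) is to move. For every reply it could make,
# check whether the master then has an immediate winning answer.
for ai_action in state.legal_actions():
    after_ai = state.clone()
    after_ai.apply_action(ai_action)
    master_wins = False
    for reply in after_ai.legal_actions():
        leaf = after_ai.clone()
        leaf.apply_action(reply)
        if leaf.is_terminal() and leaf.returns()[0] > 0:  # the master is player 0
            master_wins = True
            break
    print(ai_action, "loses to an immediate reply" if master_wins else "is not refuted immediately")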

The next move made by the AI is 2(2,0,2), followed by the winning move 15(2,3,3), which creates an alignment in the diagonal on the third level of the board:

...o
..xo
...x
o...

o.xx
.oxx
xxoo
ox.o

o.xx
.xoo
xoox
xooo

x.oo
xoxx
oxxx
oxoo
End of the game: the master (shown in black) wins

Summary

We trained an AI using the AlphaZero algorithm on the 3D variant of the Connect Four game. While the AI is sufficiently trained to avoid obvious traps, it is not proficient enough to systematically win against an experienced human player.

We also saw that the results of the games in which the AI plays against itself were not consistent with the symmetries of the game at hand. Further training that exploits these symmetries to generate more training data from the played games could help remedy this situation. Also, since the victory conditions depend strongly on the presence of tokens in a cell’s immediate vicinity, convolutional neural networks might be more efficient at exploiting the locality of such information.



