Questioning your values
One of the things that separates professionals from strong amateurs is their ability to look at even a complex board position and tell who is ahead.
This question of the ‘value’ of a board position has been a hard problem in computer Go since its inception, and DeepMind’s solution to it is the main thing separating its program from other Go AIs.
Deterministic, zero-sum games of perfect information (like Go) do have an objective value function over all board positions, but Go has far too many possible positions for this exact value ever to be calculated.
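To make the idea of an objective value function concrete, here is a minimal sketch on a game small enough to solve exactly. The game is a stand-in chosen for illustration, not anything from AlphaGo: players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. The function name `value` and the +1/-1 convention are assumptions of this sketch.

```python
from functools import lru_cache

MAX_TAKE = 3  # toy rule (assumption): a move removes 1, 2 or 3 stones


@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Exact value for the player to move: +1 = forced win, -1 = forced loss."""
    if stones == 0:
        # The opponent just took the last stone, so the player to move has lost.
        return -1
    # Zero-sum: a position is worth the negation of what it is worth
    # to the opponent after our best move.
    return max(-value(stones - take)
               for take in range(1, MAX_TAKE + 1) if take <= stones)


if __name__ == "__main__":
    for n in range(9):
        print(n, value(n))
```

This exhaustive search works here because the pile only has a handful of states. The same recursion is well defined for Go, but the number of positions makes it hopeless to run, which is why an approximation is needed.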
AlphaGo uses a neural network model to approximate the value function, and this model was created in three steps, building two other models along the way:
- A ‘policy network’ (i.e. a model giving a probability distribution over possible moves) built using ‘supervised learning’ (SL – where we get the model to make a prediction, then we give it the answer and it adjusts the model to ‘learn’ from the answer) to predict a human’s move, given a board position.
AlphaGo’s supervised learning policy network correctly predicted the human move 57% of the time when trained on 160,000 KGS games played by 6–9 dan players, comprising a total of 30 million board positions.
- Another policy network, built by ‘reinforcement learning’ (RL) – taking the supervised learning network, getting it to play against earlier versions of itself, and having it learn from the game outcomes to predict the move most likely to result in a victory.
The reinforcement learning policy network played 1.28 million games against different versions of itself, resulting in a very strong policy network for selecting moves.
- Finally, the ‘value network’, which was trained by supervised regression on board positions and outcomes generated using the SL and RL networks, and which predicts the expected value (i.e. probability of a victory) of a board position.
To do this, AlphaGo generated 30 million games, each time picking a random move number n, playing the first n-1 moves with the SL network, then selecting a random legal move, and then using the RL network to select all remaining moves until the game ends and a value (i.e. win/lose) is known.
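The value network was then trained on just one board position from each game – the one reached immediately after the random move, where the RL network takes over – to minimize the error in predicted value. Taking a single position per game keeps the training examples from being strongly correlated with one another.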
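The data-generation recipe for the value network can be sketched in a few lines. This is a toy illustration under loud assumptions, not DeepMind's code: the game is a simple take-1-to-3-stones pile game, `sl_policy` and `rl_policy` are hypothetical stand-ins for the two policy networks, the range for n is arbitrary, and a lookup table fitted by averaging replaces the neural network. The sketch records the position the RL policy first moves from, which is this sketch's reading of the step.

```python
import random

MAX_TAKE = 3  # toy game (assumption): take 1-3 stones, last stone wins


def legal_moves(stones):
    return [t for t in range(1, MAX_TAKE + 1) if t <= stones]


def sl_policy(stones, rng):
    """Hypothetical stand-in for the SL network: a uniform random move."""
    return rng.choice(legal_moves(stones))


def rl_policy(stones, rng):
    """Hypothetical stand-in for the RL network: leave a multiple of 4 if possible."""
    good = [t for t in legal_moves(stones) if (stones - t) % (MAX_TAKE + 1) == 0]
    return good[0] if good else rng.choice(legal_moves(stones))


def generate_example(start_stones, rng):
    """Play one game; return one (position, outcome) pair from it."""
    stones = start_stones
    # Range chosen so the game cannot end before the RL phase (assumption).
    n = rng.randint(1, start_stones // (MAX_TAKE + 1))
    for _ in range(n - 1):                       # moves 1 .. n-1: SL policy
        stones -= sl_policy(stones, rng)
    stones -= rng.choice(legal_moves(stones))    # move n: random legal move
    recorded = stones                            # the single position kept
    player = 0                                   # 0 = side to move at `recorded`
    while True:                                  # remaining moves: RL policy
        stones -= rl_policy(stones, rng)
        if stones == 0:
            break                                # `player` took the last stone
        player = 1 - player
    outcome = 1 if player == 0 else -1           # from the recorded side's view
    return recorded, outcome


def fit_value_table(examples):
    """Squared-error fit with one parameter per position: the mean outcome."""
    totals, counts = {}, {}
    for pos, z in examples:
        totals[pos] = totals.get(pos, 0) + z
        counts[pos] = counts.get(pos, 0) + 1
    return {pos: totals[pos] / counts[pos] for pos in totals}


if __name__ == "__main__":
    rng = random.Random(0)
    examples = [generate_example(21, rng) for _ in range(2000)]
    for pos, v in sorted(fit_value_table(examples).items()):
        print(pos, v)
```

The random move at step n is what pushes the recorded positions away from the ones the SL policy would naturally reach, so the fitted values cover a broader set of situations. In the real system the table is replaced by a network that has to generalise across positions it has never seen.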
This complex process resulted in a value function that is closer to the ‘real’ value function for Go than anyone has ever achieved before.
In fact, using the value network alone, AlphaGo beat all other Go AIs!