Development process

Introduction

The combination of large, talented, corporate-backed research teams and furtive publishing creates a "competitive moat" that others must cross when replicating and extending results. Developing software and gathering data to catch up to existing results can be complex, especially when high-performance systems or specialized applications are required, and when details are missing or ambiguously specified. Perhaps more importantly, failed ideas are usually not presented, so others must wade through the same initially promising layers of low- and medium-hanging fruit before reaching truly new idea space.

Community projects with public discussions and work tracking provide a good counterexample. I would like to take a similar approach here by listing experiments, bugs, failures, and ideas for future work, in the hope that I can save others time or frustration. Technical documentation, code and data are also available.

Learnings

It was enlightening to discover just how much time is required on the research side, trying many ideas before finally finding one that works. Reinforcement learning makes this especially time-consuming and expensive because of the raw computation required for end-to-end experiments. Improving a process can require running it many times. It was naïve of me to try to speed up training when I initially lacked the resources to complete even a quarter of it.

The contrast between machine learning systems and general software engineering also became very clear when validating behavior and results. You can usually look at a partially developed website or application and say, "this thing is working correctly", or "these three things are broken". With ChessCoach, there was often no hard, expected result for components such as neural network evaluations, training processes, and search logic. Results can be underwhelming because more training is needed, because other components need to catch up, or because of a bug. This is even more of a problem with fewer eyes on the project. I underestimated the subtlety of the bugs that could emerge, and the strict validation necessary to maintain confidence in results and trust in additional experiments.
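
Even without hard expected results, cheap invariant checks helped maintain that confidence. As a minimal sketch of the kind of check I mean (the function name, shapes, and thresholds here are hypothetical, not ChessCoach's actual interfaces):

```python
# Illustrative invariant checks only: names, shapes, and thresholds are
# assumptions, not ChessCoach's actual interfaces.
import numpy as np

def sanity_check_outputs(value, policy, legal_mask):
    # The value head should be a bounded scalar evaluation.
    assert -1.0 <= value <= 1.0, f"value out of range: {value}"

    # The policy head should be a probability distribution over moves.
    assert np.all(policy >= 0.0), "negative policy probability"
    assert abs(policy.sum() - 1.0) < 1e-3, f"policy sums to {policy.sum()}"

    # Once trained, nearly all probability mass should sit on legal moves.
    legal_mass = policy[legal_mask].sum()
    assert legal_mass > 0.99, f"only {legal_mass:.3f} mass on legal moves"
```

Checks like these cannot say that results are good, only that they are not obviously broken, which is exactly the gap described above.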

Timeline

I started preparation in March 2020. Beyond reading some articles, watching the AlphaGo match against Lee Se-dol, and taking an undergraduate subject years ago that covered MNIST handwriting recognition, the machine learning side was going to be new to me. Research papers were extremely useful, spanning earlier neural network-based engines, autoencoders, commentary, criticality, introspection, and the AlphaGo family. I highly recommend the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Géron, 2019). Some online courses seemed valuable but were a little too low-level and slow-going for the pace I was aiming for.

In mid-March 2020, I started development, using the Zeta36/chess-alpha-zero project (Zeta36, 2017) as a reference for TensorFlow 2 (Keras) model code, and the AlphaZero pseudocode (Silver et al., 2018) as a reference for self-play and training code. A little over a year of development time was required, running experiments and programming in parallel. Not much code was needed in the end. Most of the time was spent discovering (most likely rediscovering) what works and what does not, and scaling self-play infrastructure up and out. Work tracking was via TODO.txt, and I found that keeping a detailed development journal was also invaluable.
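
As a rough illustration of that starting point, a minimal AlphaZero-style residual tower in TensorFlow 2 (Keras) looks like the sketch below. The input planes and policy size match the AlphaZero paper's chess encoding; the block and filter counts are placeholders rather than ChessCoach's actual configuration.

```python
# A minimal AlphaZero-style residual tower in TensorFlow 2 (Keras).
# Block/filter counts and head shapes are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))

def build_model(blocks=19, filters=256, planes=119, policy_size=73 * 64):
    board = tf.keras.Input(shape=(8, 8, planes))
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(board)
    x = layers.ReLU()(layers.BatchNormalization()(x))
    for _ in range(blocks):
        x = residual_block(x, filters)
    # Policy head: logits over the 73 x 8 x 8 move encoding.
    policy = layers.Dense(policy_size)(
        layers.Flatten()(layers.Conv2D(2, 1)(x)))
    # Value head: scalar evaluation in [-1, 1].
    value = layers.Dense(1, activation="tanh")(
        layers.Flatten()(layers.Conv2D(1, 1)(x)))
    return tf.keras.Model(board, [policy, value])
```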

Major bugs

Throughout most of the project, nowhere near enough computational power and data were available for ChessCoach to progress meaningfully. It was not always clear whether it was working correctly but needed more training, or should be working better but had a bug. This meant that some particularly egregious bugs existed for long periods, even while the engine grew much stronger than I had initially planned.

Failures

Neural network architecture

Many different neural network architectures and parameterizations were tried in the hope of improving training and inference speed on consumer GPUs without significantly reducing comprehension. Unfortunately, only Inception-based architectures matched the baseline's performance during supervised training, and only with higher complexity. Attention-based architectures were usually slower and no better in quality. Mobile-based architectures, while aiming to be more efficient, ended up less so: their lower memory efficiency limited batch sizes within the constraints of consumer GPU memory.

Neural network architecture failures:

Training

Some training experiments simply failed to produce results. Others were more dangerous, providing short-term benefits in training convergence, but ending up neutral or harmful as the schedule reached 100,000 AlphaZero steps and beyond. This is something to be wary of in partial replications of AlphaZero-style reinforcement learning.

Some techniques that help in Go appear to hurt convergence in chess, either because of flipped perspective between moves, or tactical non-smoothness.
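
To make "flipped perspective" concrete: in AlphaZero-style training, the value target is expressed from the side to move, so a single game result alternates sign every ply. A minimal sketch (function and variable names are illustrative):

```python
# Side-to-move value targets for one game: the sign flips every ply.
# Function and variable names are illustrative.
def value_targets(result_for_white, ply_count):
    """result_for_white: 1.0 win, 0.0 draw, -1.0 loss, from White's view."""
    # Even plies have White to move, so the target keeps White's sign there.
    return [result_for_white if ply % 2 == 0 else -result_for_white
            for ply in range(ply_count)]
```

A White win thus produces the alternating sequence 1, -1, 1, -1, ..., which is worth keeping in mind when porting techniques that assume smoothly varying value targets.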

I was surprised to see no benefit from Squeeze-and-Excitation-style techniques after three attempts throughout the project, in contrast to Lc0. It may have been flawed implementation or parameterization on my part, or the non-stationary policy plane mappings in place until 2021/08/08.
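
For reference, a Squeeze-and-Excitation block of the kind attempted looks roughly like this in Keras. The reduction ratio and the placement within the residual block are assumptions on my part, and Lc0's variant differs in detail:

```python
# A generic Squeeze-and-Excitation block in Keras. The reduction ratio
# and placement within the residual block are assumptions.
from tensorflow.keras import layers

def squeeze_excitation(x, filters, ratio=8):
    # Squeeze: global spatial average per channel.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck MLP producing per-channel scales in (0, 1).
    s = layers.Dense(filters // ratio, activation="relu")(s)
    s = layers.Dense(filters, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, filters))(s)
    # Scale: reweight the original feature maps channel-wise.
    return layers.Multiply()([x, s])
```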

The concept of trained criticality morphed into training-free SBLE-PUCT and other tournament play techniques.

Training failures:

Self-play and tournament play

Most self-play and tournament play failures centered on child selection during PUCT and simply showed no benefit. However, evaluation is quite sensitive to the specific implementation and parameterization, and I believe that some of these ideas could work with more time and effort.
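
For context, child selection here means the PUCT rule from the AlphaZero pseudocode (Silver et al., 2018). A minimal sketch with the published exploration constants and an illustrative node structure, as a baseline against which these variations were tried:

```python
# PUCT child selection in the style of the AlphaZero pseudocode
# (Silver et al., 2018). The Node structure is illustrative.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def value(self):
        # Mean subtree value; zero for unvisited children.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

PB_C_BASE = 19652
PB_C_INIT = 1.25

def select_child(parent):
    # The exploration rate grows slowly with the parent's visit count.
    pb_c = math.log((parent.visit_count + PB_C_BASE + 1) / PB_C_BASE) + PB_C_INIT

    def score(child):
        exploration = (pb_c * child.prior *
                       math.sqrt(parent.visit_count) / (1 + child.visit_count))
        return child.value() + exploration

    # Returns the (move, child) pair maximizing Q + U.
    return max(parent.children.items(), key=lambda item: score(item[1]))
```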

Before the network was fully trained, manually investigating individual Arasan21 positions was helpful, with successful techniques such as SBLE-PUCT able to redirect "wasted" simulations into broader exploration. Once policies were more accurate, it became more beneficial to manually investigate tournament blunders against Stockfish 13, discovering more varied problems and applying more targeted solutions.

Self-play and tournament play failures:

Future work

Although I have no current plans for further development, it may be helpful to list ideas I was considering as part of the original scope.