Technical explanation


I have already covered many technical details in the high-level explanation, so I will try to make this document more code focused. If you have not read the High-level explanation, feel free to refer back to individual sections as needed. The next section, Software design rationale, explains some of the design choices made in the codebase, but may not be too interesting to you. Instead, you can jump straight to the sections on feature areas, starting with Chess position management.

Software design rationale

I try to take a practical approach with software design, focusing on concrete examples of how design choices can help individuals or teams while trying to avoid dogma. Some choices in ChessCoach may seem questionable at surface-level and would confuse the freshly graduated version of myself. I'll start with a really specific example.

It is important in code commentary to focus not just on the "why", but the "what". Constantly switching between high level and low level, architecture and minutia, is a big part of what makes programming mentally taxing. Anything that allows others and future-you to stay above the low-level details while finding your way somewhere else is immensely helpful. Critics often say, "the code should be self-documenting" – sure, but that only goes so far. We speak in natural language because it is a better interface to our brains than code. Critics often ask, "what if you change the code and the comment goes stale?" Well, what if we only fix the reported repro of a bug, or skip writing unit tests, or debug in production? Skilled work requires care and proactivity. Providing information and guidance rather than micromanaging with guard rails usually wins out.

Code is read many more times than it is written, and investments into readability are worthwhile even if they feel wrong to the part of our brain that craves structure and hands-off reliability. It is common to see a trivial example saying, "see, this comment just repeats exactly what the code says" – and sure, I agree that this is often useless – but extrapolating this to "comment why, not what" is harmful and lazy. We don't question the value of highways when surface streets exist. We don't question the value of exit markers and street signs when all of that info is on Google Maps. The mental tax in scanning other people's code, with varying subject matter, approaches, languages and styles, is underestimated. Redundancy is all around us, and often very useful.

This principle of acceptable redundancy applies in other areas. It is okay to copy and paste code if the redundancy is coincidental or temporary. It is okay to write your own code instead of depending on someone else's with extra features and no test coverage. It is okay to store versions of the truth in two different places if the alternative would be worse. It is okay to wait to consolidate logic or automate systems until enough examples exist to show the right way to do so.

Short, bloggable principles that assert themselves universally give us a cozy feeling that we're smart and doing the right thing. Explanations are often logical in isolation, or at least persuasive. However, we need to consider costs and benefits to real people in concrete scenarios. Depending on the workplace and environment, this can take courage.

Beyond the locus of redundancy, there are other instances of programming dogma worth examining. It is good to see questioning of object-oriented design in recent times, especially when these reevaluations are themselves pragmatic and self-aware. John Carmack (2007) explains the benefits of larger, cohesive blocks of code as opposed to extracting small pieces in the fashion of self-documentation and hypothetical reuse. Jonathan Blow (2019a) (2019b) (2013) takes a thoughtful, grounded approach when discussing many problematic aspects of modern software development.

With ChessCoach, I have tried to follow these principles where they make sense:

I'm satisfied with the way some of the design turned out. The system of flexibly arranged training stages saved time when running many experiments. Implementing the bare minimum of coroutines through three or four in-band function returns served a year of development without messing with libraries. A few key object abstractions such as chess games and worker groups helped with mental mapping. It was nice ending up with a single threading model to handle self-play and tournament play. Self-play and training performance hit the necessary targets. Unit tests give reasonable coverage.

I'm dissatisfied with a larger number of items. The code is nowhere near the clean and minimal aesthetic you often see in classical engines. Configuration is somewhat of a disaster, as while some aspects such as automatic optimization/option mapping and parsing policies were convenient, parsing and storing in two languages to avoid an API boundary probably ended up much worse. While it initially made sense to parse PGNs internally because going from an intermediate format to a training format is about the same complexity, by the time variations and commentary were required, an external tool may have been better. SelfPlay.cpp ended up too big even by my standards: while a lot of it belongs together, some of it is there just to work around link-time optimization failures, and some could be split out with enough care. However, I think a crude self-play/tournament or search/game split would end up counterproductive. Code calling in to Python contains a lot of boilerplate that convenience utilities could take care of, although I try to avoid templating. Systems like parameter optimization and strength testing sometimes use code hosting and sometimes use process hosting to work around hardware ownership issues. CPU and memory optimizations are incomplete and inconsistent. Tracing and debugging infrastructure you may be accustomed to in a large company is non-existent. Package management is the bare minimum required to compile. Builds link in too much and do not always follow the spirit of the framework. On reflection, however, I believe that many dissatisfying design areas still gave the project the right amount of support on the way to the finish line without veering into over-engineering.

Chess position management

Stockfish code is used to manage the low-level technical details of chess positions, rules and moves. The ChessCoach Game[1] class wraps the Stockfish Position[2] and StateInfo[3] classes. It also takes care of generating and processing input and output data for the neural network, since this data is closely tied to the low-level details of the position, including the Position's Bitboards[4]. Since the neural network depends on the previous seven positions in the form of history planes, the Game class implements an efficient way to walk back through them[5]. Classical engines are not normally required to reapply moves, so small modifications are made in Stockfish code[6] to make it more efficient.

ChessCoach also often needs to branch a scratch game off the real game and run through hypothetical moves; for example, when running self-play simulations. Stockfish positions use a list of StateInfo[7] objects to manage rules such as threefold repetition, 50-move draws, and en passant, especially when rolling back a move during search. The Game class avoids duplicating these lists in scratch games and instead creates new "branches" that point back to a main "trunk"[8], using StateInfo's built-in pointer[9].

Because StateInfos have a fixed size they are pooled to help with performance and fragmentation. Nodes were also pooled previously, but this was reverted to allow for multi-threaded search. The pooling[10] is very primitive and offers little benefit over built-in allocators. More complicated schemes with better threaded deallocation could help, especially through a more contextually aware arena setup; for example, coupling StateInfo and Node lifetimes to games'.

Neural network training

As training the neural network takes days to weeks, it is important to be able to pause and resume, and recover from network and hardware problems. Training teacher and student networks, and primary and commentary models also requires flexibility. ChessCoach uses a system of TrainingStages[1] that are rotated through a series of checkpoints[2]. Resuming is based on on-disk evidence of stage completion[3] and just requires the ChessCoachTrain C++ binary to be re-run.

ChessCoach uses the AlphaZero training schedule when running on a v3-8 Tensor Processing Unit (TPU), but adjusts to scale to different hardware. Parameters in config.toml[4] define the exact schedule.

TensorFlow 2 is used to define the models in functional form[5], and checkpoints are saved and loaded as weights[6] rather than as serialized SavedModels.

Training uses different distributed Strategies to abstract away hardware differences[7]: TPUStrategy, MirroredStrategy and the default. During training, models are initialized and compiled under the strategy scope[8] so that variables can be distributed to multiple devices as necessary.

TensorFlow 2/Keras's[9] API with steps_per_execution[10] gives excellent training speed, but requires a few tricks. is quite opinionated in how a schedule is set up, with validation occurring after each epoch, so ChessCoach maps epochs to the validation interval[11] and hooks callbacks for console and TensorBoard logging[12] accordingly.

Losses and metrics are fairly straightforward, although policy losses need to be flattened across rather than within planes[13]. The MCTS value auxiliary target is included at lower loss_weight[14].

Because the full model includes heads such as the commentary encoder[15], a functional subset is taken for training[16] which interferes with saving and loading of optimizer state, including the number of training steps taken so far. Therefore, this information is manually added back before training[17].

Knowledge distillation uses a StudentModel subclass which overrides the default train_step and test_step on each replica by collecting a soft prediction from the teacher model[18] and incorporating it into loss calculations[19].

Validation is unorthodox. Rather than a training/validation split of the self-play data, which creates a chicken-and-egg problem in evaluating general chess understanding, ChessCoach uses a stationary supervised dataset for validation: the validation portion of the CCRL Dataset published by Lc0 (2018), based on Computer Chess Ratings Lists (CCRL) data. Only a single, random batch is used for validation each interval[20], so the results are volatile but over time give a visualization of the distribution in TensorBoard.

Strength testing is also performed every few hours during training[21], estimating an Elo rating using the Strategic Test Suite (STS) (Swaminathan, 2008; and Mosca, 2015), a set of 1,500 positions with varying points for the top four moves. The test takes five minutes to run, and results are logged to TensorBoard.

The learning rate is reduced initially for a number of warmup steps[22]. Empirically, this ensures that the value head of the teacher network converges without requiring the number of bottleneck convolutional filters[23] to be expanded from 1, as in AlphaZero, to 8 or 32 as in projects such as KataGo and Lc0. Warmup reduces the magnitude of early gradient updates, especially when momentum is involved. Large updates can be harmful while the network is still deploying early structure.

Stochastic weight averaging (SWA) is implemented by contributing each saved checkpoint to a decaying average[24], and saving this average[25] as a separate checkpoint for inference. Just before saving the SWA model checkpoint, its batch normalization layers need to have their moving averages retrained[26]. Since these use an independent momentum parameter, this retraining can be performed using a learning rate of zero.

Neural network inference

Inference is using the neural network's full predictive powers, trusting in its training and ability to generalize on unseen inputs. It is also known as prediction, or evaluation in the high-level explanation. The network can generalize better and operate more efficiently during inference as layers like batch normalization and dropout focus on accuracy rather than training improvement; losses and gradients don't need to be calculated and remembered; parts of the calculation graph can be fused and optimized; and GPU/TPU devices don't have to synchronize with each other (although some workloads may still benefit from them collaborating).

ChessCoach has N threads calling in to Python to predict batches of M positions each; for example, 8 × 256 on a v3-8 TPU. Each thread chooses a logical device from TensorFlow's GPU or TPU listing in round-robin fashion[1] and uses it as a tf.device context[2]. The model is called as a @tf.function to gain graph-mode speed-up[3], and data is passed back to C++.

Models are constructed and weights are loaded lazily[4] to speed initialization and save memory. Construction is locked globally[5] (empirically required by TensorFlow) and loading/assignment is locked per-device because CPython is still preemptive despite the GIL. As finding the best model weights to load can be complicated, utilities help with decision-making[6] and path comparison[7]. SWA models are preferred for prediction[8] because of their improved generalization. Self-play threads continually check for new training checkpoints[9] in order to update inference model weights.

Performance would likely be improved instead using the TensorFlow C++ API, with corresponding sacrifices to portability and readability. Currently, TensorFlow does not distribute their C++ API on Linux or Windows, citing a lack of bandwidth and asking for community development. For example, a library such as GoogleTest can be linked against after installing the .h/.so via libgtest-dev on Linux, or downloading the .h/.lib/.dll on Windows via direct download or a package manager like NuGet or vcpkg. In contrast, TensorFlow must be built from source in the Bazel build tool with special flags, and linking projects must also use Bazel or try to narrow down exactly which artifacts are required in each version in order to transplant to another tool. Thankfully, the Python distribution is universal and consistent, so that is used instead.

Commentary training and inference

The full commentary model[1] includes the tokenizer, the primary ChessCoach model as the encoder, and the transformer decoder. The decoder uses mostly-default transformer parameters[2]. An issue exists where TensorFlow cannot load weights for the commentary model because it cannot determine its shape. Therefore, the model is first primed on fake data[3].

The commentary decoder uses transformer code taken from the TensorFlow Model Garden, via the Python package tf-models-official. Some code references the installed package, and some code is copied into ChessCoach and extended[4]. Embedding, encoding and masking are removed for inputs, and cross-attention masking is trivialized.

The tokenizer maps between English words and sub-words, and integer token IDs. ChessCoach uses a vocabulary size of 8000 and trains a Sentencepiece tokenizer[5] over the raw commentary corpus. The TensorFlow Text library hosts the tokenizer at training and inference time, most importantly because the data pipeline needs to run in graph mode[6] (see Data pipelines and storage for more detail). However, a bug exists for detokenization on TPU during inference, so this is moved to CPU[7].

Data is split into 95% training and 5% validation by the ChessCoachPgnToGames[8][9] tool, given the scarcity.

As with the primary model, commentary training uses a approach[10] with steps_per_execution, with more tweaks. Default code would weight each training paragraph equally when calculating losses and backpropagating gradients. However, it is better instead to weight each word or sub-word equally. Some transformer code uses a custom training loop to achieve this in multi-GPU and TPU scenarios. Instead, ChessCoach applies sample weights in the data pipeline[11][12]. Padded cross-entropy loss is used with label smoothing of 10%[13].

Optimization is also different for commentary. The Adam optimizer is commonly used for transformers instead of the stochastic gradient descent (SGD)-with-momentum combination that works well for convolutional models. ChessCoach uses Adam with a triangular cyclical learning rate (CLR) (Lee, Liu & Peng, 2020) after range testing, with large batch size potential[14]. A v3-8 TPU allows for a schedule of 50,000 training batches of 4,096 positions each (configured as 400,000 batches of 512 positions each but divided and multiplied by the 8 devices). The overall schedule is approximately 233 epochs (the number of times the full training data is covered). The step size (half of a triangle/cycle) is approximately 11.65 epochs, with 10 cycles over the full schedule. Training finishes at the smallest learning rate. No explicit warmup is required because the learning rate already cycles up from the minimum for many epochs. Range testing pointed to a minimum Adam learning rate of 10-5 and a maximum of 10-3. These parameters can be adjusted in config.toml[15]. Training on GPUs is also possible but slow by comparison, with a maximum measured batch size of 64 per NVIDIA GeForce GTX 1080 and 128 per NVIDIA V100. Range testing needs to be re-run for each new hardware configuration.

Validation is a little more orthodox for commentary and covers the full validation dataset. However, it is run every validation interval[16] of 250 batches, for a batch size of 4,096, rather than every epoch.

Commentary inference uses the TensorFlow Model Garden's sampling code for top-p and temperature[17]. ChessCoach applies a shim to work around a bug in top-p sampling on TPU[18]. Dropout of 20%[2] is used to balance generalization with specificity.

Commentary quality is subjectively evaluated using a small, hand-built test suite[19]. The ChessCoachGui tool has a Suite button that generates five comments per test position[20].


ChessCoach self-play code began with the AlphaZero pseudocode in Python, using a popular Python chess library for position management. However, playing out games[1] took on the order of minutes rather than seconds, so everything had to be ported to C++[2].

Many details are missed in the AlphaZero pseudocode. Because valuations are stored on child nodes rather than edges in a graph-theoretical sense, they need to be viewed from their parent's perspective for the purpose of move selection etc., so neural and terminal evaluations need to be flipped[3]. Consistent mapping between the (-1, 1) neural network value (tanh) range and [0, 1] probability domain is required[4]. Dirichlet noise as a gamma distribution needs to be normalized to sum to 1.0 before mixing into the prior distribution[5]. The implementation of queen-knight and underpromotion policy plane mappings[6] is unspecified. Search tree reuse is not implemented[7]. Wording in the paper around temperature and sampling is ambiguous. Sampling during self-play is not softmax sampling as pseudocode names it[8]. Move diversity vs. Stockfish during tournament play is also underspecified, and missing from the pseudocode[9].

In ChessCoach, each self-play thread runs a loop[10] that waits for work, then plays games by alternating CPU and GPU work before finally saving and chunking the game data for training. The CPU piece uses a coroutine-like mechanism for each game in the batch, potentially looping over the full 800 simulations, but jumping out for each game as soon as a neural network prediction is required. The null/NaN/SelfPlayState returns to jump out[11][12][13] and SelfPlayState to resume[14][15] may look messy, but I just hope that this system is clear in its intentions and behavior, and simple architecturally.

For each tree search simulation, SBLE-PUCT[16] is used to select a search path down the tree[17] until reaching a terminal node or unexplored node, applying moves to a branched-off scratch game. The leaf node may be permanently or temporarily terminal[18], or its evaluation may already be cached[19]. Otherwise, the input planes for the position need to be generated[20] and emplaced (directly constructed in place) in this game's slot[21]. For non-terminal leaves, after waiting for a network prediction if necessary, the node is expanded[22], setting up child nodes with priors. When receiving a raw network prediction, logits are converted to a probability distribution via softmax[23], then, if possible, the prediction is cached[24]. Finally, regardless of category, the node's valuation and visit are backpropagated[25] up the search path, along with mate-proving.

When initially expanding the root node before the first of 800 simulations, Dirichlet noise needs to be applied. However, because of tree reuse, the node may already be expanded. As an additional complication, if the new root was already in the reused tree, but was previously seen as a temporary draw via twofold repetition checking[26], it may not be expanded. It can be hazardous trying to correctly add noise and perform related tasks because of these special cases. ChessCoach centralizes the logic[27] and calls into it as appropriate for self-play and tournament play.

After simulations are finished, statistics are collected[28] and a move can be selected[29] and played in the real game. As in the AlphaZero paper, for the first 30 plies, moves are sampled in proportion to visit count, with each immediate child of the root having received some of the 800 visits. Sampling helps cover more of the potential game space of chess. After this, the best move is always selected. This is usually the most-visited move, as in the AlphaZero paper, but faster checkmates and slower opponent mates take priority[30]. Selecting the best move helps with game result attribution.

In terms of data structures, the SelfPlayGame[31] class extends the Game class and adds neural network input/output emplacement, statistics tracking and node management for tree reuse. The Node[32] class builds a search tree through its children field. Nodes drive most of the memory consumption for self-play and tournament play, and are much more demanding than in classical engines, which can search deeply and then unwind. High-memory or extended-memory configurations are required on cloud platforms relative to CPU count. ChessCoach nodes were shrunk down from 64 to 48 bytes[33], expanded back to 64 bytes[34] with debug info and an unoptimized tablebase implementation, then shrunk to 32 bytes[35] with prior quantization (matching the prediction cache) and bit packing. Node size reduction helps tree search not just in memory envelope but also in search speed, through copying and locality improvements. Further improvements are possible; for example, minimizing memory consumption for leaf nodes, implemented by Lc0.

Child selection via PUCT is also an important optimization target that has received minimal attention[36] in ChessCoach. The Ceres project (Elliott, 2020) pays particular attention to vectorizing PUCT calculations, and this type of work could help self-play and tournament throughput.

Checking for draws by repetition is a notable detail that has spawned long-lived bugs and discussions in other projects. In chess, a draw by threefold repetition can be claimed by a player when the same position has been reached for the third time in a game, where the position includes the side to move and castling and en passant rights. Often when playing over the board, this type of draw needs to be actively claimed, but computer platforms usually rule the draw automatically, and this especially makes sense when no human players are involved.

When an engine is searching, an optimization can be made. If a twofold repetition is encountered, this can be considered a draw early. If no other move is better than repeating, then the threefold repetition is inevitable with perfect play. However, this introduces a trap. Let's say that after playing a particular move, the resulting position would be losing but still the best option, so the engine enters the position. The opponent blunders and now a variety of drawing moves are possible. The engine thinks that returning to that previous position as a twofold repetition is just as good a draw as any other. However, it is not actually a draw yet in the real game. Now, the opponent has had more time to think and makes the winning move. Other variations are possible involving sub-optimal and then better play from the searching engine instead.

To fix this problem, a change is necessary: only twofold repetitions with both parts encountered strictly after the search root can be considered a draw, since both were opted into rather than destined. Once the first instance is actually played on the board, the second instance is no longer considered a draw. ChessCoach implements this optimization safely by always treating twofold repetition as a temporary draw and re-checking each visit[37]. In contrast, threefold repetitions are cached and do not need to be checked again, even when the search tree is reused for future moves. Also, since ChessCoach has already generated legal moves, Stockfish draw-checking code is customized to omit this step[38].

Distributed training and self-play

Generating 44 million AlphaZero-like self-play games requires mammoth computing resources. Even running non-stop for a year, a consumer GPU can only hope to generate 10-20 million.

Through Google's TPU Research Cloud (TRC) I was able to access 4 v2-8 TPUs initially and approximately 50 v3-8 TPUs by the end of the project. Initially, I used Google Kubernetes Engine (GKE) to manage a small number of older-style TPUs with corresponding compute VMs. After this I was invited to an alpha of the newer-style Cloud TPU VMs, which do not require a separate compute VM. Performance is potentially much higher with code running locally. However, Kubernetes was not yet supported, so I set up custom automation via SSH session management.

ChessCoach relies on simple building blocks for cluster-based training and self-play. Each machine runs the ChessCoachTrain binary and operates under the training (train) role and/or self-play (play) role[1]. At most one machine acts as a trainer. In smaller clusters the trainer also self-plays. Coordination is solely via Google Cloud Storage, under a gs:// bucket. The trainer looks for new data periodically and uses this to train the neural network and save checkpoints to storage. Self-play roles look for new checkpoints periodically and use these to self-play. Whenever a machine has saved 2000 games locally it chunks them and uploads the chunk to storage.

Machines without the training role could pause whenever they know training has started and resume once they see the new checkpoint. However, this wastes time, as training data that is a little less fresh is usually better than nothing, and the window size of 1 million positions[2] smears this further. The exception is when using uniform predictions before the first checkpoint is available[3]. Machines could also clear any games in progress and start from scratch using new weights. However, because the N × M games in progress are overlapping randomly after a point, this effectively throws away (N × M)/2 full games. As more self-playing roles are added to the cluster, checkpoints arrive faster relative to the N × M and this wastage worsens. Instead, it is better to just continue the existing games using the new network[4] in Python, again favoring bulk volume. C++ code remains mostly unaware of the changeover, but does get signaled to clear the prediction cache[5].

Machines without the self-play role[6] wait until enough data is available[7] before training the next checkpoint: they do not try to self-play. This is important for large clusters because training may be happening so often that self-play games that started many checkpoints ago on the training role are still trying to finish. I am not confident of the cut-off point, but it is probably safe for the trainer to also self-play with 25 or fewer v3-8 TPUs, using the default schedule.

Older-style TPUs are managed via GKE using cluster-*.sh scripts and cluster-*.yaml deployment specifications. A service account is set up using[8] as a TPU Admin and Storage Admin, and a key.json is generated and downloaded to the project root. The script[9] is used to build a docker image and upload it to the Google Container Registry (GCR). The files cluster-train-deployment.yaml[10] and cluster-play-deployment.yaml[11] are updated to reference the latest docker image. The script[12] sets up the cluster, by default using preemptible n2-highmem-4 VMs for the TPUs, and including a single n1-standard-1 VM in a critical node pool for DNS[13]. The script[14] applies the deployment specifications, creating or updating deployments as necessary using a Recreate strategy to match the latest details.

Newer-style Cloud TPU VMs, including pods, are managed using the[15] script, which runs in what-if mode until the --confirm flag is passed. The service account and key.json are set up as before (the same ones can be reused) and is run as before. Quota information (extrinsically or intrinsically imposed) and docker run commands are updated at the top of Command-line arguments drive the top-level command used and the TPUs and deployments managed. At the time of writing, special care is needed to run inside docker on a Cloud TPU VM[16][17]. Docker needs to run privileged, or the /dev/accel* devices need to be mounted; /lib/ needs to be mounted; /usr/share/tpu needs to be mounted; and wheel and /usr/share/tpu/tf_nightly*.whl need to be installed in the docker CMD before the intended entry point. In future, Cloud TPU VMs will use versioned TensorFlow releases, and this installation can be cached in the docker image. Additionally, private binaries are currently required to support TensorFlow Text and other libraries with custom ops, although I do not expect this to be the case for long. The script is minimum-effort and is resilient to SSH disconnects and ChessCoach/TensorFlow crashes, but fragile to harder hangs or TPUs being pulled away. It continuously pipes the output of each remote container to a file like AlphaManager/20210712-210214/alpha-1.log in local ChessCoach storage.

Tournament play

For tournament play, ChessCoach implements the Universal Chess Interface (UCI), a text-based protocol operating through standard input and output streams. The protocol is partially stateless in that position and time control information is always given in its entirety. It is partially stateful in using the previous position specification for searches and implicitly reusing search state between moves for what is loosely the same game. It is important to always process standard input commands, even while searching.

Through UCI, ChessCoach can be instructed to search a position infinitely[1]; to search until some condition is met based on nodes/mate/time; or to search using as much time as it chooses, given some kind of time control for the match. Time control is specified as overall search time per player for the game; additional search time per move; and in another mode, the number of moves for which the original time control repeats[2]. For the most part, ChessCoach emulates AlphaZero's trivial time control strategy and for each move searches for 1/32nd of the total remaining time[3].

When searching a position, ChessCoach first tries to reuse the existing search tree and prediction cache[4], pruning[5] down the tree following any additional moves applied. The N × M virtual games become shadows sharing a single search tree[6]. The WorkerGroup[7] class is used to set up the search threads initially[8], and the WorkCoordinator[9] class helps with starting, stopping and waiting for multi-threaded tasks like UCI searches. The job of the search threads is to collaboratively build out a search tree such that the best move at the root finishes with the most visits[10]. One of the worker threads is called the primary and does additional housekeeping[11] such as checking time control, printing the principal variation, updating the debug GUI when enabled, reporting the best move, and cleaning up.

Threads safely work on the same search tree by using std::atomic* fields[12]. A good example is when contributing a valuation to a node when backpropagating, using an exponentially weighted moving average (EWMA)[13]. EWMA helps the search adapt more quickly when making new tactical discoveries, but care is needed to avoid oscillations and excessive forgetfulness. The backpropagation weight involves a simple atomic increment. The calculation of the average is more complex, so a compare-and-swap instruction is attempted until it succeeds. If another thread swoops in and updates the value first, the calculation needs to be rebased before it is reattempted. In both cases, std::memory_order_relaxed is used because there is no need for ordering guarantees between either of these fields and any other field.

The use of different atomic primitives is situational: fetch_add, fetch_sub, compare_exchange_weak, compare_exchange_strong and exchange are spread throughout SelfPlay.cpp. Acquire/release semantics are only required when one field acts as a ready signal for another payload field. The two specific situations are (a) SearchState::principalVariationChanged[14] signaling Node::bestMove, and (b) Node::expansion signaling Node::children. However, acquire/release is mostly automatic for x64 and more important for ARM, which is not a target platform currently.

Only one thread is allowed to expand any particular node, by first getting a neural network evaluation, then allocating memory for the node's children and setting them up with priors. Because this is expensive and involves allocation, it is better to treat this like a lock rather than letting threads destructively race and cleaning up the fallout. Threads use compare-and-swap instructions on the Node::expansion field when attempting to claim ownership. If a thread has already committed to a node for a particular simulation and loses this race, the simulation just gives up for the current round of CPU work. This is called a failed node, visible with the debug on UCI command. However, if the thread is still selecting children and can see in advance that another thread holds expansion permission for a particular child, it treats that child as blocked, and another is selected. If a blocked child was going to be best, then backpropagation weight is cleared, breaking the chain upwards. This helps prevent flooding problems in small sub-trees during multi-threaded search.

Information freshness is important for tournament play, increasing the effective "sequentiality" of PUCT search. With parallel shadows sharing a tree, instead of running each simulation as an isolated coroutine, it helps to let them all "cash in" their results[15] after a network prediction before letting them all choose new search paths.

While the search tree is still small, only one thread is allowed to work, with limited parallelism[16]. This is called slowstart and prevents too many simulations from flooding in until decision metrics can be calculated more accurately, and until sub-optimal choices via virtual exploration and loss are more viable. The feature most visibly helps with Strategy Test Suite (STS) scores for short, 200 millisecond searches, but likely helps in bullet time controls and in setting up longer searches.

Tournament play techniques

A variety of techniques help improve tournament strength.

When making AZ-PUCT and SBLE-PUCT calculations for all children of a parent, a PuctContext[1] is first constructed to pre-compute common terms. A selection algorithm like introselect[2] is used to eliminate linear exploration candidates after calculating the appropriate cut-off[3], based on search progress and sub-tree node count relative to the root.

Mate-proving propagates forced wins and losses up the search tree[4], adding certainty to value[5] for child selection during tree search, and adding certainty to move selection[6] when making a decision. If a choice of moves is available and one of them leads to a guaranteed win, then that move can be played, and the opponent's move leading to this position is a guaranteed loss. If all move choices lead to a guaranteed loss, then one of them must be played and the opponent's move leading to this position is a guaranteed win. Classical engines usually generalize this concept, using exact, upper and lower bounds in conjunction with alpha-beta pruning and adding in draw propagation and endgame tablebase bounding. ChessCoach's implementation is special-cased and primitive by comparison.

Faster mate incentives are layered on top of mate-proving and encourage more visits to faster mates[7]. Helping known wins compete with other exploration incentives encourages tree shape to better match move quality and helps valuation above the proof to better reflect prospects below. Additionally, an engine's mate-in-5 is not necessarily exact but means mate in at most 5 moves. Encouraging visits to more promising mates helps to refine the valuation of the best possible move.

While slower mate disincentives are not necessary to differentiate guaranteed losses, because the search already scrambles for value in losing situations, it does help to clear exploration incentives for known losses by clearing priors[8][9]. This reduces value dilution when a tactical surprise runs counter to the engine's expectations from training.

First-play urgency (FPU) describes the value estimate of nodes that have never been visited. While increasing this parameter can correspondingly increase search breadth, this is less effective further from the root because child selection decisions are made hierarchically. FPU is never relevant for a child if its parent is never visited. So, rather than driving some kind of global depth vs. breadth, FPU reflects confidence in training and drives dawdling vs. focusing surrounding the principal variation. AlphaZero made an initially astonishing decision in setting this to zero, often giving each first-visited child many more visits before considering an alternative. However, this becomes increasingly viable as training quality improves and achieves more depth in a more intuitive type of search.

ChessCoach sets FPU to a win at the root only[10]. This ensures that every move is least considered; for example, a mate-in-one would not be missed. While this would often happen quickly anyway during tournament play, root-win-FPU also helps reduce the sharpness of policy training targets in self-play data, which can hurt the feedback cycle. AlphaZero benefited from smoothing when making 8 simulations in parallel during self-play, and other projects have applied temperature of 1.03 at the root to a similar end.

In a more complex technique, when encountering a guaranteed draw during search, ChessCoach sets the FPU of unvisited siblings of the draw to the current root valuation, flipped to the correct perspective[11]. The initial visit is also forgiven rather than backpropagated further. This helps avoid value dilution in positions where one player is clearly winning, but many draws by repetition are possible; for example, in an open position with queens. Such draws are too difficult for the neural network to predict, so their priors are too high, and it usually takes many visits to the draws until exploration incentives for alternative moves increase sufficiently. Wins and losses do not carry this same danger, as they are correctly incentivized vs. the default zero FPU of siblings. Draw-sibling-FPU recognizes that when a position is winning, non-drawing moves are usually better than drawing moves and are worth considering immediately. This technique especially helps with many simulations flooding into small sub-trees.

More general, explicit flood-prevention across the search tree, in a manner akin to expansion-blocking, did not improve strength. This may be because of the existing effectiveness of slowstart, loss-prior-clearing and draw-sibling-FPU against flooding. However, classification and tuning for this topic is sensitive and difficult, and I am sure that effective techniques and improvements could be dropped in.

Endgame tablebases allow for perfect play when at most 6 or 7 pieces including pawns and kings remain on the board. Checkmates and draws are proved backwards from the end of all possible games with this many pieces or fewer, and the results are compressed (although the storage requirements for 7-piece tables are extreme). Syzygy is a tablebase format that focuses on distance-to-zero (DTZ) rather than distance-to-mate (DTM), which gives knowledge of cursed wins and blessed losses that are ruined or saved by the 50-move rule. However, this can result in a slower and more unnatural path to victory. ChessCoach uses a modified version of Stockfish's Syzygy probing code[12], contributed to Stockfish by de Man (2013), during tournament play. Syzygy compression is much higher than that of previous tablebase formats, at the expense of increased probing complexity.

There are two different ways in which Syzygy probing is used. When the root position to be searched has sufficiently few pieces and lies within the tablebase, root probing[13] is performed, which perfectly categorizes each move into wins, cursed wins, draws, blessed losses and losses. These categories take highest priority[14] during final move selection. However, the search is still performed in order to differentiate within these categories. Since no distance-to-mate is available it can help to prove this during the search.

When the root position is not in the tablebase, positions in the tree may still be after captures. When meeting some additional conditions, positions are win/draw/loss (WDL)-probed[15] and given a bounded value. ChessCoach treats this as another special case of the generalized bounding of classical engines and adjusts any value backpropagating through the node to this bound. Both types of probing are reported to the UCI protocol as tbhits[16].

Neural network-based engines have a difficult time making progress in endgames because they blend in overly optimistic valuations for unresolved positions at shallow search depths. Looking further down lines, these engines see more concrete outcomes that are valued more realistically, and thus prefer to shuffle pieces around and stay in their "optimistic bubble". This is often dangerous because it can allow the opponent's king to come into play, or result in a draw by the 50-move rule in an otherwise winning position. By contrast, classical engines only look at maximum depth valuations, and often use heuristics to adjust valuations according to specific material combinations in endgames (for example, labeling opposite colored bishops as drawish) or as a 50-move draw is approached.

ChessCoach takes more of classical approach in endgames in order to make progress towards a win, by using minimax to choose a move[17]. After PUCT search has completed, the best line is chosen[18] according to leaf valuation, rather than considering visits driven by average valuations. Thresholds[19] determine when sufficient visits have been made to trust sub-tree valuations. In general, valuations are also scaled towards a draw as a 50-move draw approaches[20], less so in the middlegame and more so in the endgame. Valuation scaling risks optimistic shuffling in the search tree, reducing depth, but it may be better to do this here than over the board.

It makes intuitive sense in endgames to also use minimax to drive the search. However, a naïve approach does much more harm than good. It is possible that a comprehensive implementation such as that used in the Allie engine (Treat, 2019) improves overall strength in ChessCoach. Engines such as KataGo and Lc0 have also blended utility scores such as game score in Go, or estimated moves remaining in chess, into position valuation in order to incentivize progress, and this could also be of benefit for ChessCoach.

AlphaZero found that when opponents used an opening book that encouraged draws, introducing intentional move diversity dramatically increased win rate, at the expense of some losses. With admittedly low statistical significance, applying this technique generally to ChessCoach[21] did improve win rate. However, sampling temperature was necessarily much lower because of linear exploration. The value delta threshold was also reduced, and fewer opening moves were diversified[22]. Caution is required with parameterization. Move diversity based on a value threshold becomes more dangerous as exploration is decreased, as PUCT is happier to leave runner-up moves close in value but under-exploited, potentially missing tactical traps and other downsides.

Tournament play is equivalently referred to as search, UCI and TryHard in code and comments, as distinct from self-play.

Configuration management

Configuration is defined in config.toml[1]. Unfortunately, it is parsed both by the toml11 library in C++ and the toml library in Python. Config.cpp[2] and[3] expose configuration to other code.

Configuration in C++ is designed to allow fast access during self-play and tournament play. Whenever a keyed lookup or update is performed, the parse is repeated using an appropriate policy[4].

Multiple networks can be defined in config.toml and override base training and self-play configuration. At the top level, the network to use is specified by name, and the role or roles are chosen. Miscellaneous configuration is defined at the bottom.

UCI options are defined in config.toml[5] and automatically override configuration items with the same name[6] when set, but only in C++. Manual propagation to Python is required, and this is done for network_weights[7]. Name matching is also used to get existing values when listing options[8].

During EPD-based parameter optimization, parameters defined in config.toml[9] again automatically override configuration items with the same name[10] in C++ only.

During tournament-based parameter optimization, the parameters defined in config.toml[9] are instead applied through the UCI option interface[6] in a separate ChessCoachUci process via cutechess-cli.

The path to Syzygy endgame tablebases stored in config.toml is treated as a special case and rooted to the data root. At startup, ChessCoach checks whether these files need to be copied from cloud storage to local storage[11].

Prediction cache, transposition table

The prediction cache stores recent neural network predictions in the form of a multi-threaded key/value store. When encountering a position that is already in the cache, tree search can immediately complete the current simulation and move on to another one rather than waiting[1].

During self-play, the prediction cache key is Stockfish's Zobrist hash[2] and a representation of whether the position has repeated, for the current position and the last 7 positions[3], matching the history planes that would otherwise be used to make a neural network prediction[4]. The no-progress count for the 50-move rule is also mixed in.

During tournament play, the cache is used as a transposition table, so the key is just the current position's hash, a representation of whether the position has repeated, and a bucketed version of the no-progress count[5].

With default settings, the cache is sized at 8 GiB with 8 1-GiB tables. Tables are filled with 1024-KiB chunks, each containing 8 cache entries[6]. A cache entry holds the key and the neural network's value and policy prediction. The policy is quantized[7] from 32-bit floating point to 16-bit integer format and even then can only fit data for 56 moves: positions with more moves do not use the cache. An age field also helps preserve the most recently used entries when evicting to make room for a new entry[8]. The tables are allocated using large pages[9] when available, to help with performance. Large pages require locking permission and special initialization[10] on Windows.

Because the cache is read and written to by multiple threads without locks, data validation[11] is used to protect against type-1 errors (key collisions), type-2 errors (index collisions) and spliced data. When extracting a policy from the cache, the probabilities are summed and rejected if the sum is not close to 1.0.

Data pipelines and storage

During self-play and tournament play, the Game class generates input planes[1], and these are used immediately for neural network predictions. Training data requires input planes, policy output planes, value outputs, and MCTS value outputs. However, this training data is often at rest, so it makes sense to compress it.

Each prediction requires 109 input planes: 13 piece and repetition planes for the current position and each of the last 7 positions, plus 5 auxiliary planes for castling rights and the 50-move rule. The piece and repetition planes overlap heavily though, because of the history aspect, so each new position only introduces 18 new planes: the 13 fresh piece and repetition planes plus the 5 fresh auxiliary planes. By only including the fresh data for each position, using 8 bytes per packed plane, input planes can be reduced from 872 bytes to 144 bytes.

Policy output planes are stored as training targets and require 73 planes per position. These include arbitrary float logits and cannot be packed, instead taking up 256 bytes per plane for a total of 18,688 bytes uncompressed. However, most of the 4,672 conceivable moves are not legal in a given position, so assuming an average of 35 legal moves, these sparse planes can be compressed to 280 bytes plus overhead using a scatter/gather scheme with 4-byte indices and 4-byte values.

The value target is just the perspective-flipped game result and can be compressed from 4 bytes per position to 4 bytes per game. The MCTS value target does not offer trivial compression opportunities. In total, for a game with 135 positions and 35 legal moves per position, training data can be compressed from 19,568 bytes to approximately 428 bytes of payload. This scales to 105.7 TiB vs. 2.3 TiB of payload for 44 million games, again estimating 135 positions and 35 legal moves. Further compression is possible using ZLIB.

The Game class implements compressed and uncompressed generation of input and output planes in C++[2]. While planes are usually decompressed in the data pipeline in Python, the Game class also decompresses policy planes to support the ChessCoachGui tool and generates uncompressed policy planes for test purposes. A variety of move inferences and guesses[3] also support the ChessCoachGui tool.

ChessCoach's data pipeline feeds neural network training and is implemented in Python using This API offers Datasets, which operate in graph mode using TensorFlow ops rather than Python code. Because of this, they can be a little picky loading data and prefer tf.train.Example records in TFRecord format. TFRecord[4] files consist of a sequence of records, each with a length, payload, and checksums. In this case, the payloads are protobufs matching the tf.train.Example[5] specification, consisting of tf.train.Features with differently typed lists.

Unfortunately, using the TensorFlow API to write these records from Python is too slow to be viable; therefore, ChessCoach reads and writes TFRecords from C++ 100 to 1,000 times as fast. The protoc utility[6][7] is used to generate code for ChessCoach.proto, which implements the tf.train.Example specification. The protobuf library is linked statically to avoid conflicting with the copy that TensorFlow links dynamically. The Storage class implements serialization[8] and disk streaming[9]. Each game is written as a single tf.train.Example in a single TFRecord, using zlib compression, and chunks of 2,000 games are just decompressed, concatenated and re-compressed[10].

Commentary is also saved in this format[11], although because variations are branching rather than linear, and not every position has a comment, input planes cannot be compressed. A CommentarySaveContext[12] class is used to implement a training/validation split at 95%/5%[13]. Preprocessing is also performed[14] to eliminate most non-English comments and improve quality.

One key detail with input planes is that early positions in a game do not have 7 previous positions for history planes. Lc0 found that synthesizing data rather than zeroing the planes gave the network a more consistent distribution to operate with, and this technique helped ChessCoach slightly. When too little history is available, the earliest position is saturated backwards, flipping perspective between moves, using the HistoryWalker[15][16] utility.

After the primary model's compressed data has been loaded by the data pipeline, it needs to be decompressed using the TensorFlow API. Input planes are reconstructed by mapping over slices of the game's compressed input planes tensor[17]. This again requires synthesis of history via saturation for early positions, but with a simpler implementation because all training games begin from the starting position. Policy outputs are decompressed by constructing ragged tensors from the indices and values, and then scattering them[18]. Game results are just stretched out with alternating flips[19].

Shuffling training examples over a 1 million sized window adds challenge. Most data read is discarded[20][21] as a time/space trade-off. A large shuffle buffer is used on Cloud TPU VMs, although this probably needs to be slightly reduced on older-style TPUs, and heavily reduced on consumer desktops. Memory consumption is high because decompressed examples are stored in the buffer. A scheme for storing position pointers for post-shuffle decompression may be possible in some way. However, by targeting low correlation you are effectively wanting one position per game, so a pointer to the whole thing seems to end up worse. TensorBoard data over generations of ChessCoach training runs does justify the high window size requirement.

Training data in the same TFRecord format is also generated using the ChessCoachPgnToGames tool over the CCRL Dataset published by Lc0 (2018), based on Computer Chess Ratings Lists (CCRL) data. PGN parsing is implemented in Pgn.cpp[22]. The validation portion is used as described in Neural network training. The training portion of the dataset can be used for supervised training[23], rather than using self-play data. This results in a network with an estimated 2650 Elo rating; however, the network ends up with huge gaps in its knowledge. Because it never sees something like a Qxh7# mate-in-one or large mid-game material imbalances in the data, it emulates high-level play and makes threats, but does not necessarily follow through when allowed. Equivalently, its search tree does not necessarily see opponents following through, and with AZ-PUCT it gets checkmated easily.

C++ and Python interoperation

Python is most commonly run using the CPython reference implementation, whose straightforward C interoperability has contributed to Python's success. CPython supports C/C++ hosting Python code and Python hosting C/C++ code, in a macro sense and in callback form.

In ChessCoach, C++/Python interoperation always begins with a C++ binary for consistent initialization. C++ calls in to Python via the PythonNetwork[1] class, and in some scenarios, Python calls in to C++ via the PythonModule[2] class, after an import chesscoach statement.

Initialization has some quirks, especially on Windows, and is handled in the ChessCoach[3][4] class.

It is necessary to hold the global interpreter lock (GIL) when calling in to Python code, and it can help performance to give it up while in C++ code. PythonContext[5] and NonPythonContext classes are used for convenient GIL management.

Numpy multi-dimensional arrays are usually used to marshal data across the API boundary. NO_IMPORT_ARRAY is defined to include numpy in multiple translation units[6] (.cpp files), but undefined in ChessCoach.cpp[7].

When returning TensorFlow data to C++, the np.array(memoryview(X)) construct is used to avoid an additional copy after readback.

Debug GUI

The debug GUI is a web page using chessboardjs, hosted locally in Python with WebSockets for callbacks, and launched via the ChessCoachGui and ChessCoachUci C++ binaries. The GUI operates in pull and push modes[1], with pull via ChessCoachGui, allowing the user to browse training data, and push during UCI search, updating on the configured interval[2].

The GUI itself is quite trivial: it shows a data table and a chess board. The board highlights moves based on the posterior probability distribution after tree search[3]. In push/UCI mode, dragging the slider moves the data/highlighting through the history of the search, and clicking on moves drills into search lines[4][5].


One of the original goals of the project was to be able to viably train and run ChessCoach on a desktop machine with a single GPU. While I was not able to achieve this, I did focus on supporting ChessCoach on a single machine and cluster; local and cloud storage; single-GPU, multi-GPU and TPU; and Linux and Windows.

Single machine and cluster equivalence is implemented by allowing machines to operate using train, play or train|play roles[1]. Training and self-play roles do not care whether their storage is local or cloud-hosted and check periodically for updates in either case when they are up in top-level training stage management code[2].

Local and cloud storage are treated equally using the abstraction in Python and by deciding on a data root at startup time[3][4]. All relative dynamic storage paths are rooted to the data root. Because all cloud storage access is via Python, C++ code sometimes needs to call in awkwardly[5]. C++ and Python can also choose to use local storage explicitly[6][7], even in parallel; for example, for game data before chunking, and parameter optimization results.

ChessCoach enters TPU mode when it can successfully initialize a TPUStrategy at startup[8], either for newer-style Cloud TPU VMs, or failing that, older-style TPUs. This determines whether TensorFlow's listing of logical TPU devices or logical GPU devices is used. In TPU mode, the TPUStrategy is used for training; otherwise, when there is more than one GPU available, a MirroredStrategy is used for training[9]; otherwise, the default strategy is used for training. In any case, distributed strategies are not used for inference; instead, the relevant logical devices are used as tf.device contexts[10]. Although distributed strategies abstract away many differences, the training schedule still needs to be adjusted. The per-replica batch size in config.toml[11] is multiplied by the device count to give the global batch size[12], and this is used to initialize data pipelines[13]. needs to know how many global batches to run, so various step intervals are divided by the device count[14]. Learning rate schedules also need to think globally[15]. Commentary training scales similarly to the primary model; however, only the primary model multiplies the learning rate by the device count.

Cross-platform C++17 facilities were used where available, but some platform-specific code is abstracted via the Platform[16] class, and branched via the CHESSCOACH_WINDOWS[17] definition in a few cases. Build infrastructure supports Linux and Windows.

Strength testing

Suites of test positions have been used for decades by human players and more recently, for chess engine development. Win at Chess (WAC) (Reinfeld, 1958) is a famous example. The Chess Programming Wiki maintains an extensive list.

Test suites are often used to validate the behavior of chess engines and to improve their play through tuning of search logic, static evaluation functions, and parameters for pruning and optimization techniques.

ChessCoach uses two main test suites: the Strategic Test Suite (STS) (Corbit & Swaminathan, 2008-2014; and Mosca, 2015) for positional evaluation and holistic strength estimation, and the Arasan21 suite (Dart, 2019) for tactical evaluation. These suites use the Extended Position Description (EPD) format, which is somewhat like Forsyth–Edwards Notation (FEN) but expandable with operations and comments. EPD parsing is implemented in Epd.cpp[1].

STS[2] consists of 1,500 positions in 10 categories with points offered for the best four moves in each position; for example, 10, 3, 3 and 2 for the first position. The final score out of 15,000 can be passed through a linear function[3][4] to estimate Elo rating, based on logic in the STS-Rating tool (Mosca, 2019). This estimation is surprisingly accurate relative to other test suites and methods, perhaps because other suites often assign a single best move and have a more narrow, tactical focus. Each position is searched for 200 milliseconds, for a total of 5 minutes overall.

The Arasan21 suite[5] consists of 199 positions with difficult tactical problems that usually take multiple seconds of search time for even the best engines to solve. Some positions specify a move to avoid rather than a best move. ChessCoach calculates a score out of 199 points using 10 seconds of search time per position, for a total of approximately 33 minutes overall.

As mentioned in Neural network training, STS is run during training.

Standalone testing of any EPD suite can be run using the ChessCoachStrengthTest[6] tool. This tests ChessCoach slightly more efficiently than using external tools, and extracts some additional metrics. However, it is equally possible to run external tools in conjunction with the ChessCoachUci interface.

Parameter optimization

In ChessCoach, parameter optimization is the process of tuning a variety of parameters that affect playing strength, to find a global optimum via Bayesian optimization. In a machine learning context, this is called hyperparameter optimization because network weights are themselves parameters, and it aims to improve training speed and effectiveness.

Generally, optimization evaluates the domain effectiveness of various combinations of parameters, outputting either the best combination found, or the expected best combination via modeling. Various strategies are used for data acquisition, deciding on parameter combinations to try by using simple ideas such as grid search and random search, and more complex strategies involving exploration/exploitation trade-offs.

Bayesian optimization usually uses Gaussian Processes over the multivariate parameter space to predict global optima and construct acquisition functions for further evaluations. ChessCoach uses the Scikit-Optimize (skopt) implementation with the Expected Improvement (EI) acquisition function[1]. More advanced techniques are possible, and this configuration underweighted exploration, especially when distributions were noisy or skewed. However, it was the most stable and deterministic in getting to results.

ChessCoach has two optimization modes[2]. The first is EPD mode, which runs test suites such as Arasan21 and uses a nodes required metric. The second is tournament mode, which runs mini-tournaments against a baseline player such as ChessCoach with default parameters, or Stockfish, and uses a relative Elo metric. Both metrics are normalized; however, nodes required can form a very one-sided distribution, which can confuse modeling. In contrast, Elo is quite well behaved as long as enough games are run per mini-tournament.

In EPD mode, the nodes required[3] metric represents the number of simulations required to find the best move in each position without switching away from it. If the best move is not found, a failure nodes value is returned instead, and this can be tuned to favor absolute correctness vs. solution speed.

In tournament mode, the time control[4] can be varied to favor shorter or longer games. Some parameters are only meaningful with sufficient search time. However, using longer games for optimization can be very expensive computationally. Even with distributed community resources available, Fishtest uses only 60+0.6 time control. ChessCoach uses cutechess-cli (Pihlajisto & Jonsson) to run tournaments[5] and bayeselo (Coulom) to calculate relative Elo[6]. An extremely hastily built, distributed sub-mode is possible, spreading over a cluster in order to play out mini-tournaments more quickly. While the initial focus is on evaluation turnaround time[7], with enough machines available, multiple evaluations can be made in parallel[8]. Distributed optimization relies on the[9] and[10] scripts.

In both modes, partial dependence plots[11] are periodically emitted in .png format along with actual/expected optima, and a clean log file is written. The log file can be used to resume[12][13] parameter optimization if it is interrupted.

GPUs can be shared by TensorFlow processes, and TensorFlow can be instructed to grow memory use on GPUs only as needed[14]. In contrast, Cloud TPU VM chips can only be held by a single process, complicating the technical design. When running tournament optimization, a separate path into Python code is used to avoid initializing TensorFlow[15]. During distributed optimization, each host can only be used for one ChessCoach instance[16], although Stockfish can be used[17] instead of baseline ChessCoach to halve the number of hosts required (or just to optimize to beat Stockfish). The proxy docker image binds to separate ports for these two engines[18].

Because TPU initialization is very slow for every new process, taking up to 30 seconds, cutechess-cli times out by default. These timeouts are hardcoded, and the Qt framework overly complicates building, so the binary was modified directly by searching for the existing constants and changing the ping timeout from 15 seconds to 60 seconds[19].


ChessCoach can run as a bot on the Lichess platform, with an official bot set up at

ChessCoach bot code[1] is based on the official implementation at, relying on the Lichess API described at However, code for flexible management of various engines and their configuration has been removed, and code specific to ChessCoach has been added; for example, for posting commentary during play.

The script is treated as a combination code/config file. Since the script is not compiled before deployment, no other code cares about its configuration, and its configuration is often very narrowly relevant, it makes sense to integrate configuration with usage locally.

The bot plays one game at a time, accepting rated and casual challenges, but preferring rated[2]. Material odds are accepted[3], as long as the challenger's rating is sufficiently low for the type of piece(s) missing. The bot generates and posts commentary in spectator chat every 30 seconds[4]. It also ponders, thinking on the opponent's time.

If idle for long enough, the bot sends challenges to other bots. A list of bots with a rating of 1800 or higher is gathered (bot ratings are compressed downwards on Lichess), and an opponent is randomly sampled[5]. Time control is also randomly sampled[6].

The ChessCoach engine runs in the same process as the bot code[7][8] to simplify process and threading management, pondering, and data marshaling, rather than using the UCI interface.

A single thread runs the main bot loop[9], listening to a single event queue[10]. This thread reacts to various events and also performs periodic upkeep[11]. Other threads watch the control stream[12], watch the game stream[13], perform the search in C++[14][15], and post commentary asynchronously[16]. Each of these secondary threads reports back by posting events to the central queue.

Chess-specific logic is left to C++ as usual, so the raw position, move and time control details are passed in to the BotSearch API method. BotSearch then determines whether to search or ponder[17], and generates commentary if enough time remains[18]. When it is the bot's turn to move, the primary search thread calls back into Python directly to report the best move[19]. Pondering does not use any UCI-specific logic: just a time-limited think on the opponent's position[20], which helps extend the search tree.

Time management is a particular concern for bot play because it is a real-time environment, with one player's clock always ticking. Rather than being allowed "free time" to prepare a position or deallocate memory via the isready UCI command, the bot must fit all of this into its thinking time. Additionally, there is network overhead and API throttling to worry about[21]. ChessCoach has trouble especially with search tree deallocation time because of its naïve Node array allocation[22] in support of multi-threaded tree search. This limits potential ponder time[23] during the opponent's turn because the cleanup eats into ChessCoach's turn, and time is always more efficiently spent searching one ply deeper.


The utility is used to download publicly available chess games with commentary. It is intended for use with the ScrapingBee service, although this can be swapped[1] for another service, or for local handling of the technical details.

It is best to disable path length limitations on Windows before use, to avoid problems with the Base32 encoding.

HTML and PGN content is hierarchically cached on disk[2] to save on service costs, using the Base32-encoded URL as a key. However, some content cannot be cached because it is paginated and orders the most recent content first.

Once games have been downloaded, they can be combined and split into a smaller number of PGN files[3], ready to be passed to the ChessCoachPgnToGames utility using the --commentary flag[4] in order to generate commentary training data.

Unit tests

Unit tests are implemented using the GoogleTest framework. Some significant or interesting examples are:


Linux and Windows use different build infrastructure, relying on Meson/Ninja/GCC[1] and MSBuild[2] respectively. ChessCoach was primarily developed using Visual Studio and Visual Studio Code on Windows, so up-to-date solution[3] and project files are in place for Visual Studio. Python installation is more awkward on Windows, so virtual environment support is optional but more built-in for this platform[4].

Building first requires dependency setup[5][6].

On Linux, protobuf 3.13.0 needs to be built and installed from source[7]. Ubuntu 20.04 (Focal) and Debian 10 (Buster) are targeted, although Ubuntu 18.04 (Bionic) support can be found in (now and dockerfiles by syncing code back. TensorFlow setup varies between older-style TPUs and newer-style Cloud TPU VMs (see Installation in the README).

On Windows a CHESSCOACH_PYTHONHOME[8] environment variable is set up to use when building[9], and when initializing Python[10].

Stockfish is built in the 64-bit BMI2 configuration (USE_PEXT) but with prefetching disabled (NO_PREFETCH) as Stockfish search and transposition tables are not used[11][12].

Distributed training and self-play covers building and running docker images.

Third-party libraries

The following third-party C++ libraries are used:

The following third-party Python and JavaScript libraries are used:

The following third-party data is used:

Modifications and extensions to third-party libraries have been made: