Does spending more time on a chess move increase its quality? Analyzing 200k online games

/images/covers/chess-games-cover.jpg

Chess is as much about finding the best move as it is about managing your available time, but does going ‘into the tank’ actually result in a better move? In this article I try to make sense of hundreds of thousands of moves, and learn a thing or two about continuous variables in the process.


Dataset & Domain Understanding

Analysis will be done using Python 3.12, specifically the Pandas and NumPy packages for data cleaning and analysis, along with Matplotlib and Seaborn for visualizations. These are the tools I started with when learning data analysis, although I’m very interested in trying out alternatives like Polars or Plotly, which I plan to write about in the near future.

The dataset comes from Chessdigits.com, which mined and converted 200,000 online games. They were played on Lichess, the second most popular place to play chess on the internet, in 2019. Each row represents one game. The two most important columns for this analysis are:

  • Eval_ply_x: the computer evaluation of the position after the move at ply x
  • Clock_ply_x: the moving player’s remaining clock time after the move at ply x

‘ply’ in chess means half of a move: so when white starts, their move is ply 1, black’s response is ply 2, white’s next move is ply 3, and so on.

Both of these columns go up to 200 ply, but most games have a lot of missing values towards the end of that range. The computer evaluation is based on the best move according to a chess engine like Stockfish, and as we’ll see later on, it’s not always a perfect way to judge which player has the better or easier-to-play position.

When a computer evaluation goes below zero, black has the better position; when it is positive, white does.
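Both conventions are easy to mix up, so here is a quick sketch of them as tiny helper functions (the function names are my own, not from the dataset):

```python
def ply_to_move(ply):
    """Convert a 1-indexed ply to its full-move number (plies 1 and 2 are move 1)."""
    return (ply + 1) // 2

def ply_color(ply):
    """White moves on odd plies, black on even plies."""
    return 'white' if ply % 2 == 1 else 'black'

def leader(evaluation):
    """Interpret the sign of a computer evaluation."""
    if evaluation > 0:
        return 'white'
    if evaluation < 0:
        return 'black'
    return 'equal'

print(ply_to_move(3), ply_color(3))   # 2 white
print(leader(-0.5))                   # black
```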

Other important columns include:

  • Opening: the common name for the first couple of moves made in the game
  • [White/Black]Rating: the Elo rating of each player, a historically tried and true measure of their relative strength.
  • TimeControl: in the form ‘N+K’, where N is the number of minutes each player starts with, and K is their increment
  • Event: a categorical grouping of the time control, into UltraBullet, Bullet, Blitz, Rapid and Classical.

Increment is the amount of time added to a player’s clock after each move they make.
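Pulling the base time and increment out of the ‘N+K’ string is a one-liner; a minimal sketch (the helper name is mine):

```python
def parse_time_control(tc):
    """Split an 'N+K' time-control string into (base time, increment) as ints."""
    base, increment = tc.split('+')
    return int(base), int(increment)

print(parse_time_control('5+3'))   # (5, 3)
```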

Exploratory Data Analysis

The first step in understanding the dataset is to calculate some descriptive statistics. Below are histograms of the most important features that describe the games, such as the rating of the players, the length of the game, and more. Originally the code block below contained five lines of brute-force ‘code’, where I wrote out the lists containing the rating ranges and labels manually after looking through the data, before realising that this approach kind of defeats the purpose of programming. So I wrote a function that creates the rating ranges before plotting, ensuring the graph will still look correct even after the data cleaning process. The result is quite a few lines longer and took more time to write, but it is also a lot more reusable and taught me a lesson: a tradeoff I am more than willing to make.

Click here to see the code

def get_rating_ranges(df, bin_size=100):
    # Get min and max ratings among both colors
    min_rating = min(df['WhiteElo'].min(), df['BlackElo'].min())
    max_rating = max(df['WhiteElo'].max(), df['BlackElo'].max())
    
    # Round min down and max up to the nearest bin_size
    min_rating = (min_rating // bin_size) * bin_size
    max_rating = ((max_rating // bin_size) + 1) * bin_size
    
    # Create bins
    rating_bins = list(range(min_rating, max_rating + bin_size, bin_size))
    
    # Create labels
    rating_labels = [f'{bin_start} - {bin_start + bin_size - 1}'
                     for bin_start in rating_bins[:-1]]
    rating_labels[-1] = f'{rating_bins[-2]}+'  # Mark the last bin as open-ended
    
    # Create rating ranges (include_lowest so the minimum rating isn't dropped)
    df['WhiteRatingRange'] = pd.cut(df['WhiteElo'],
                                    bins=rating_bins,
                                    labels=rating_labels,
                                    include_lowest=True)
    df['BlackRatingRange'] = pd.cut(df['BlackElo'],
                                    bins=rating_bins,
                                    labels=rating_labels,
                                    include_lowest=True)
    
    # Combine ranges
    combined_rating_range = pd.concat([df['WhiteRatingRange'],
                                       df['BlackRatingRange']],
                                      ignore_index=True)
    
    return rating_labels, combined_rating_range
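To sanity-check the binning logic, here is the same min/max rounding and pd.cut approach run standalone on a tiny hand-made frame (the toy ratings are made up):

```python
import pandas as pd

toy = pd.DataFrame({'WhiteElo': [1432, 1875], 'BlackElo': [1450, 1990]})
bin_size = 100

# Round the overall min down and max up to the nearest bin_size
lo = (min(toy['WhiteElo'].min(), toy['BlackElo'].min()) // bin_size) * bin_size
hi = ((max(toy['WhiteElo'].max(), toy['BlackElo'].max()) // bin_size) + 1) * bin_size

bins = list(range(lo, hi + bin_size, bin_size))
labels = [f'{b} - {b + bin_size - 1}' for b in bins[:-1]]
ranges = pd.cut(toy['WhiteElo'], bins=bins, labels=labels, include_lowest=True)

print(bins[0], bins[-1])   # 1400 2000
print(ranges.tolist())     # ['1400 - 1499', '1800 - 1899']
```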

Feeling good about this refactor, I decided to write a similar function for the game length. The time control and increment will only be added to the dataframe once and thus don’t need to be included in a function.

df['GameMode'] = df['Event'].str.split().str[1]
df['Increment'] = df['TimeControl'].str.split('+').str[1].astype('int')
df['IncrementCategory'] = df['Increment'].astype('category')

def get_game_length(df):
    move_columns = [col for col in df.columns if col.startswith('Move_ply_')]
    game_length_df = df[['Index']].copy()
    # Find the first missing ply per game, then convert plies to full moves
    first_missing_ply = (df[move_columns].isna().idxmax(axis=1)
                         .str.extract(r'(\d+)')[0].astype(int))
    game_length_df['game_length'] = (first_missing_ply + 1) // 2
    
    return game_length_df
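The idxmax-over-isna trick is the core of that function, and it is worth seeing in isolation. A toy sketch on two made-up games (slightly simplified: here I count completed full moves from the last non-missing ply):

```python
import pandas as pd
import numpy as np

# Two toy games: one ends after ply 5 (move 3), one after ply 1 (move 1)
toy = pd.DataFrame({
    'Move_ply_1': ['e4', 'd4'],
    'Move_ply_2': ['e5', np.nan],
    'Move_ply_3': ['Nf3', np.nan],
    'Move_ply_4': ['Nc6', np.nan],
    'Move_ply_5': ['Bb5', np.nan],
    'Move_ply_6': [np.nan, np.nan],
})

first_missing = toy.isna().idxmax(axis=1)                 # first NaN column per game
last_ply = first_missing.str.extract(r'(\d+)')[0].astype(int) - 1
game_length = (last_ply + 1) // 2                         # plies -> full moves

print(game_length.tolist())   # [3, 1]
```

One caveat worth a comment in the real pipeline: idxmax returns the first column when a row contains no NaN at all, so a game that fills every ply column would need special handling.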

Finally, I added a function to plot the different histograms into one plot.

def display_distribution_statistics(df, file_name=None):
    # Set the plot aesthetics
    sns.set_style('darkgrid')
    import matplotlib as mpl
    mpl.rcParams['font.family'] = 'DejaVu Sans'
    # Run all data creation functions
    rating_labels, combined_rating_range = get_rating_ranges(df, bin_size=100)
    game_length_df = get_game_length(df)
    # Create the plot
    fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(3, 2, figsize=(14, 14))
    fig.suptitle("Descriptive statistics", fontsize=25)
    # Rating distribution (ax1)
    sns.histplot(ax=ax1, data=combined_rating_range, kde=True, kde_kws={'bw_adjust': 4})
    ax1.set_title('Rating', fontsize=20)
    ax1.set_xticks(range(len(rating_labels)))  # fix tick positions before relabeling
    ax1.set_xticklabels(rating_labels, rotation=60, ha='right', rotation_mode='anchor')
    ax1.set_ylabel('')
    ax1.set_axisbelow(True)
    # Game terminations (ax2)
    sns.histplot(ax=ax2, data=df, x='Termination')
    ax2.set_title('Game termination', fontsize=20)
    ax2.set_xlabel('')
    ax2.set_ylabel('')
    # Game length (ax3)
    sns.histplot(ax=ax3, data=game_length_df, x='game_length', binwidth=5, kde=True)
    ax3.set_title('Game length', fontsize=20)
    ax3.set_xlabel('')
    ax3.set_ylabel('')
    # Game mode (ax4)
    sns.histplot(ax=ax4, data=df, x='GameMode')
    ax4.set_title('Game mode', fontsize=20)
    ax4.set_xlabel('')
    ax4.set_ylabel('')
    # Game result (ax5)
    sns.histplot(ax=ax5, data=df, x='Result')
    ax5.set_title('Game result', fontsize=20)
    ax5.set_xlabel('')
    ax5.set_ylabel('')
    # Increment counts (ax6)
    increment_order = df['Increment'].value_counts(ascending=False).index
    sns.countplot(ax=ax6, data=df, x='Increment', order=increment_order)
    ax6.set_title('Increment', fontsize=20)
    ax6.set_xlabel('')
    ax6.set_ylabel('')

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    if file_name:
        plt.savefig(file_name, dpi=300, bbox_inches='tight', transparent=True)
    plt.show()

EDA subgraphs

From this, we see there are some anomalies that can be removed to get a clearer picture. Rules infractions and abandoned games can go, as well as games where the result is inconclusive (‘*’ in the result column). I considered removing draws as well, since they make up only a small share of games, but their inclusion doesn’t hurt our results, so I decided to leave them in. I also removed Bullet and UltraBullet games, since they are so short that no meaningful fluctuation in time spent can be observed.

Increment has a lot of unusual values that are rarely used. This is probably due to people creating custom game challenges with obscure time controls. We can remove the odd ones out to get a clearer picture of the increments commonly found in online chess games.

99% of games are 75 moves or shorter, which means we can remove the last 50 ply (columns 151 through 200) without losing much data, significantly speeding up the analysis and reducing the skewness of those distributions. We cut off some games before their conclusion, but since we’re analyzing on a per-move basis, this shouldn’t pose too much of a problem.
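That 99% figure is straightforward to verify: take the mean of a boolean mask over the game lengths. A sketch on hypothetical lengths (the real check would run on game_length_df):

```python
import pandas as pd

# Hypothetical game lengths in full moves
lengths = pd.Series([30, 35, 40, 42, 55, 60, 61, 70, 74, 120])

# Fraction of games that are 75 moves or shorter
share_under_75 = (lengths <= 75).mean()
print(round(share_under_75, 2))   # 0.9
```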

Finally, to reduce the skew of the rating distribution and make the results more applicable to the general population, we will only look at games with a maximum rating of 2500 for either player, and a rating difference of no more than 200 between them. This is to remove any confounding factors such as the stress / complacency of playing against a higher or lower rated player, respectively.

It’s possible that Blitz games prove to be too short as well, so we might revisit this at a later moment. But for now, let’s clean some data!

Data cleaning

df_clean = df[~df['Termination'].isin(["Abandoned", "Rules infraction"])]
df_clean = df_clean[~df_clean['GameMode'].isin(['Bullet', 'UltraBullet'])]
df_clean = df_clean[(df_clean['WhiteElo'] <= 2500) & (df_clean['BlackElo'] <= 2500)]
df_clean = df_clean[df_clean['Increment'].isin([0, 15, 3, 2, 1, 5, 10])]
df_clean = df_clean[(df_clean['WhiteRatingDiff'] <= 200) & (df_clean['WhiteRatingDiff'] >= -200)]

for col_prefix in ['Eval_ply_', 'Move_ply_', 'Clock_ply_']:
    columns_to_drop = [f'{col_prefix}{i}' for i in range(151, 201)]
    df_clean = df_clean.drop(columns=columns_to_drop)

Data cleaning results

Removed    17 Games with irregular Termination
Removed 43253 Bullet & UltraBullet games
Removed   310 Games with a player rated above 2500
Removed  3503 Games with non-standard increments
Removed  4417 Games with a RatingDiff above 200
Removed    50 Columns for the last ply

Removed 47997 games in total
Remaining games: 152003 from 200000 (76.00%)
Old shape: (200000, 628)
New shape: (152003, 478)

These relatively simple filters make the dataset more normalized and remove outliers on key features. Now we can plot the descriptive stats again to see if things improved:



Those graphs already look a lot cleaner. Rating distribution skewness went from 0.24 to 0.14, and game length skewness from 1.17 to 0.68. This makes the distribution measurably more symmetrical. Removing the odd increments also decreases the influence that rare custom time controls have on the data.

Extracting Time and Evaluation Difference

To be able to answer our research question, we need to make a dataframe containing the time spent on each move, together with the evaluation difference between that move and the next one. An important factor to include is the increment the player gets, as it will be added to the clock after each move and thus isn’t actually time spent on making the move itself.

First, we make filtered dataframes containing only the relevant columns. Then we combine those, before finally starting on the bulk of the data processing for this project. The final pipeline is the result of a lot of trial and error, pushing my Python skills to their limits.

I’ve tried to explain my thought process as much as possible, but if you just want to go to the results, you can click the button below to collapse the code.

Show / hide the code

# Extract the Eval_ply_x columns
eval_columns = [f'Eval_ply_{i}' for i in range(1, 151)]
df_eval = df_clean[eval_columns]

# Extract the Clock_ply_x columns and convert to total seconds
clock_columns = [f'Clock_ply_{i}' for i in range(1, 151)]
df_time = df_clean[clock_columns].apply(pd.to_timedelta)
df_time = df_time.map(lambda x: x.total_seconds() if pd.notna(x) else x)

# Combine both DataFrames and add some relevant columns
df_combined = pd.concat([df_eval, df_time], axis=1)
columns_to_add = ['Index', 'GameMode', 'Increment', 'IncrementCategory']
df_combined[columns_to_add] = df_clean[columns_to_add]
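The clock conversion above is worth a standalone look. Assuming the clock columns hold ‘H:MM:SS’-style strings (an assumption about the dataset's format), pd.to_timedelta turns them into Timedeltas, which then map to plain seconds:

```python
import pandas as pd

# Hypothetical clock strings, including a missing value
clocks = pd.Series(['0:05:00', '0:04:37', None])

seconds = pd.to_timedelta(clocks).map(
    lambda x: x.total_seconds() if pd.notna(x) else x)

print(seconds.iloc[0], seconds.iloc[1])   # 300.0 277.0
```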

When the computer sees a forced mate sequence, it gives the number of moves until mate with a ‘#’ in front of it. Later on, in the limitations, I will talk more about how this poses some problems for our analysis. Since there is no way to convert this into a numerical value without choosing an arbitrary number, our function will skip over moves that contain a forced mate. To do this, we convert these to None values, and then skip over them later on in the function.

def eval_conversion(eval_value):
    if pd.isnull(eval_value):
        return None
    eval_str = str(eval_value)
    if '#' in eval_str:      
        return None
    try:                     
        return float(eval_str)          
    except (ValueError, TypeError):
        return None
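A quick sanity check of the conversion, with the function repeated so the snippet runs standalone:

```python
import pandas as pd

def eval_conversion(eval_value):
    """Return a float evaluation, or None for missing values and forced mates."""
    if pd.isnull(eval_value):
        return None
    eval_str = str(eval_value)
    if '#' in eval_str:   # forced-mate notation, e.g. '#3'
        return None
    try:
        return float(eval_str)
    except (ValueError, TypeError):
        return None

print(eval_conversion('0.35'))   # 0.35
print(eval_conversion('#3'))     # None
```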

We need to loop over each game, and within each game over each move. The move data will be appended to a list at the end of each loop.

all_moves = []
max_ply = 150  # evaluations and clocks were kept up to ply 150

for idx, row in df_combined.iterrows():
    # Make sure we always have a move to compare to (+1 because range is non-inclusive)
    for move_num in range(1, max_ply - 2 + 1):
        # Determine the color to see if we need to multiply the evaluation by -1
        color = 'white' if move_num % 2 == 1 else 'black'
        # Column names for the current and next move
        curr_time_col = f'Clock_ply_{move_num}'
        next_time_col = f'Clock_ply_{move_num + 2}'
        curr_eval_col = f'Eval_ply_{move_num}'
        next_eval_col = f'Eval_ply_{move_num + 2}'

We do all the necessary checks to see if all values are valid, before applying the eval_conversion function we wrote earlier.

# Check that all columns exist
if not all(col in df_combined.columns for col in
           [curr_time_col, next_time_col, curr_eval_col, next_eval_col]):
    continue
# Convert evaluations to numeric (or None)
curr_eval = eval_conversion(row[curr_eval_col])
next_eval = eval_conversion(row[next_eval_col])
# Skip the current move if either evaluation is None
if curr_eval is None or next_eval is None:
    continue
# Check for valid clock times
if pd.isnull(row[curr_time_col]) or pd.isnull(row[next_time_col]):
    continue

Finally, we can calculate the time spent on each move, and the change in computer evaluation. The phase of the game the move was made in is converted, and then a dictionary with all the data for that move is made and appended to the move list.

# Calculate time spent and eval change
time_spent = row[curr_time_col] - row[next_time_col] + row['Increment']
eval_change = next_eval - curr_eval
# Determine the rough phase of the game the move was made in
# (ideally this helper is defined once, outside the loop)
def get_game_phase(move_num):
    if move_num <= 10:
        return 'opening'
    elif move_num <= 20:
        return 'early_middle'
    elif move_num <= 40:
        return 'late_middle'
    else:
        return 'endgame'
game_phase = get_game_phase(move_num)
# Store move data
move_data = {
    'game_id': idx,
    'move_number': (move_num + 1) // 2,
    'color': color,
    'time_spent': time_spent,
    'eval_change': round(eval_change if color == 'white' else -eval_change, 2),
    'game_phase': game_phase,
    'increment': row['Increment'],
    'game_mode': row['GameMode']
}
# Append this move's data to the main list
all_moves.append(move_data)
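Once the loop finishes, the list of dicts converts straight into a per-move DataFrame. A sketch with hand-made records in the same shape the loop produces:

```python
import pandas as pd

# Hand-made move records matching the dict layout built in the loop
all_moves = [
    {'game_id': 0, 'move_number': 1, 'color': 'white',
     'time_spent': 2.0, 'eval_change': -0.1, 'game_phase': 'opening',
     'increment': 0, 'game_mode': 'Blitz'},
    {'game_id': 0, 'move_number': 1, 'color': 'black',
     'time_spent': 35.0, 'eval_change': -1.4, 'game_phase': 'opening',
     'increment': 0, 'game_mode': 'Blitz'},
]
moves_df = pd.DataFrame(all_moves)

print(moves_df.shape)   # (2, 8)
```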

Results

Now that we have our DataFrame on a per move basis, a second round of data examination and cleaning begins. First, let’s look at the boxplots for evaluation change and time spent to check for any outliers. The scatterplot also helps show how the data is distributed.

There are some unexpected things happening here:

Some moves seem to take negative time to make, even after taking the increment into account. While it could potentially be an error in the code, this is likely due to a feature on Lichess where your opponent can give you extra time on your clock, which shows up as negative time spent on that move. Some of the extremely large time-spent values can also be due to this feature, as it allows a player to have much more time on their clock than the format should allow. With some exceptions, most of the large time-spent values come from classical games, which can last for hours, so that makes sense.

This is also where we see a problem arise with the interpretability of computer evaluations. During the processing, we omitted any move where the current or next position contained a forced mate evaluation, as those cannot easily be converted into a numerical value. Mate in one, two or three moves can be conceptualized as relatively easy to find, but if the computer sees a forced mate in 20 moves, and every other sequence of moves leads to a much worse position, the position will be much worse in practical terms. A chess engine has no way to differentiate those, as the evaluation of a position is based on the best move in that position. Similarly, when a position is really good but the computer can’t find a sequence leading to forced mate, it can produce some extremely high evaluations.

Good to know: chess evaluation is expressed in units of a pawn, mapping back to the traditional value of the pieces. So if the evaluation is +9, this roughly means white is up the equivalent of a full queen.

This leads to some questions about interpretability:

  • If the evaluation is already +20 for your opponent, and your move takes it to +50, did the position get worse in human terms?
  • If the inverse happens, and you manage to ‘only’ be down the equivalent of two queens instead of four, did your chances to win really improve?

In other words, computer evaluation cannot be treated as a continuous scale on which to evaluate a position. Go past a threshold of, say, 10, and subsequent increases matter less and less in practical human terms. Comparison with the second-best computer move would provide some insight into how critical finding a specific move is, but that isn’t available in this dataset. For more analysis of the relationship between computer evaluation and winning chances, check out this article by the creator of the dataset.

What’s next?

Originally, and perhaps naively, I was planning to just run regression and correlation analyses on the extracted data. But after delving into the results, it seems very hard to obtain useful insights from evaluation change as an absolute numerical value; its meaning and relevance fluctuate too drastically based on how close to zero it is.

A more reasonable approach would be to turn evaluation changes into categories. Analyzing the data like this removes the scale problem of evaluations and focuses on the question at hand: whether spending more time leads to fewer mistakes.

To address this, we can categorize evaluation changes into three main types:

  • Blunder: A move that drastically worsens the evaluation, turning a balanced or winning position into a losing one.
  • Mistake: A move that worsens the evaluation, but not as severely as a blunder, reducing a winning advantage to a smaller one.
  • Inaccuracy: A move that slightly worsens the evaluation, but doesn’t change the overall assessment of the position.

By grouping moves this way, we can analyze whether spending more time reduces the frequency of blunders and mistakes, rather than focusing on raw evaluation numbers. This approach aligns better with how chess players and coaches assess move quality in practical terms.
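A minimal sketch of that categorization, assuming eval_change is from the mover’s perspective (negative means their own position got worse); the thresholds are illustrative placeholders, not established definitions:

```python
import pandas as pd

def classify_eval_change(eval_change):
    """Bucket an evaluation change into blunder/mistake/inaccuracy/ok.

    Thresholds (in pawn units) are illustrative only.
    """
    if eval_change <= -2.0:
        return 'blunder'
    if eval_change <= -1.0:
        return 'mistake'
    if eval_change <= -0.5:
        return 'inaccuracy'
    return 'ok'

changes = pd.Series([-3.2, -1.5, -0.6, 0.1])
print(changes.map(classify_eval_change).tolist())
# ['blunder', 'mistake', 'inaccuracy', 'ok']
```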



For those of you who have read this post all the way through, thank you so much! Your attention and support mean a lot to me. As a parting gift, I made a nice graph of the top 20 most commonly played openings in the dataset. Check it out below!

