
Thursday, March 19, 2015

Machine learning to rapidly search for the correct bitcoin block header nonce

Using machine learning, is it possible to predict the bitcoin nonce that will satisfy the difficulty target condition?

In this blog, I will attempt to do this. It is a work in progress, but the results are promising.


First, let's begin with a VERY brief introduction to the relevant portions of bitcoin and bitcoin mining to help understand the rest of the blog. Bitcoin is an online payment system, which operates in a completely decentralized manner. Bitcoins, the currency of the bitcoin system, are exchanged for goods and services. A good analogy to bitcoins in the conventional financial world is a cheque. Cheques can be used as payment for goods and services. However, the cheque has to be verified to ensure that the payor has sufficient funds to cover the cheque, a process that is performed by the banks. In the bitcoin system, the process of transaction verification (i.e., bitcoin mining) is performed by the bitcoin community (the equivalent of conventional banks, but completely decentralized).

Bitcoin mining, or transaction verification, is performed in chunks called "blocks". Successful verification of a block is rewarded by the bitcoin system with 25 bitcoins paid to the first person providing proof of verification. Each block is formed by collating all the transactions since the previous block, constructing a block header from these transactions, and then altering this header methodically until the SHA256 double hash of this header is less than the currently set "difficulty" target. Ken Shirriff's excellent blog post explains the specifics of this process well and is worth reading carefully.

For the purposes of this blog, here are the takeaways:
- if a bitcoin block is verified, the verifier gets paid 25 bitcoins (as of this writing)
- the SHA256 double hash of the block header has to be below a predefined difficulty target for the verification to be deemed successful
- bitcoin miners verify blocks and meet the difficulty threshold to get paid

Assumptions (or why bitcoin mining is deemed to be hard)

SHA256 is a one-way hash function. The only way to know if the hash is below the difficulty target set by the bitcoin network is to alter the input one candidate at a time and compare the resulting hash to the target - i.e., by brute force. Since the search space is very large (the nonce alone has 2^32 possible values), a lot of computational power is necessary to be successful at this task.

The current approach to mining

The block header is an 80-byte string made of:

1. version (4 bytes)
2. previous block's hash (32 bytes)
3. Merkle root (32 bytes)
4. timestamp (4 bytes)
5. bits (4 bytes)
6. nonce (4 bytes)

Of these, miners systematically alter the nonce (and the timestamp, with some constraints), compute the double SHA256 hash for each change, and check if the result is below the target. The Merkle root can also be changed (by changing some of the coinbase transaction fields).
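
The brute-force loop described above can be sketched in a few lines of Python 3. This is a minimal illustration, not production mining code: the function names are my own, the header field values are made up, and the target is set far easier than any real bitcoin target so that the loop finishes quickly.

```python
import hashlib
import struct


def bits_to_target(bits):
    """Expand the compact 4-byte 'bits' field into the full 256-bit target."""
    exponent = bits >> 24
    mantissa = bits & 0x00FFFFFF
    return mantissa << (8 * (exponent - 3))


def pack_header(version, prev_hash_hex, merkle_root_hex, timestamp, bits, nonce):
    """Serialize the six header fields into the 80-byte little-endian layout."""
    return (
        struct.pack("<L", version)
        + bytes.fromhex(prev_hash_hex)[::-1]      # hashes are stored byte-reversed
        + bytes.fromhex(merkle_root_hex)[::-1]
        + struct.pack("<LLL", timestamp, bits, nonce)
    )


def header_hash_as_int(header):
    """Double SHA256 of the header, read as a little-endian integer."""
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


def brute_force_nonce(version, prev_hash_hex, merkle_root_hex, timestamp, bits,
                      target, max_nonce=2**32):
    """Try nonces in increasing order; return the first whose hash is below target."""
    for nonce in range(max_nonce):
        header = pack_header(version, prev_hash_hex, merkle_root_hex,
                             timestamp, bits, nonce)
        if header_hash_as_int(header) < target:
            return nonce
    return None


# Made-up header fields and a deliberately easy target (top 12 bits zero),
# so a solution turns up after a few thousand hashes on average.
toy_fields = (3, "00" * 32, "11" * 32, 1426723200, 0x1D00FFFF)
toy_target = 1 << (256 - 12)
found = brute_force_nonce(*toy_fields, target=toy_target)
print("nonce found:", found)
```

With a real target (for example `bits_to_target(0x1D00FFFF)`, which requires roughly 32 leading zero bits), the same loop needs billions of hashes on average, which is why mining is done on specialised hardware.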

The proposed machine learning approach

The idea is to take the previously solved block headers and learn where the nonce might be. To do this, take each solved header and build the training data table such as this:

| ver | prev_hash | merkle_root | time | bits | candidate nonce | label|
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 00000001        | 0    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 00010001        | 0    |
| .   | ...       | ...         | ..   | ..   | .               | .    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 10000001        | 1    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 40000001        | 1    |

The "candidate nonce" and "label" need explaining. The candidate nonces are numbers less than and greater than the true nonce - i.e., candidates that we are testing. The label is encoded as 0 if the candidate nonce is less than the true nonce, 1 if greater.

A machine learning classifier can be trained on this data. Once trained, the classifier will predict 0 if the true nonce of an unknown header is greater than the candidate nonce, and 1 if less. So for any given candidate nonce, this classifier will say if the true nonce is lower or higher.
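
One way such a lower/higher answer could be used is a bisection over the nonce range. The sketch below is my own illustration of that idea rather than code from this post: `predict_label` is a hypothetical stand-in for the trained classifier, and treating a candidate equal to the true nonce as label 1 is an assumption (the training data never contains that case).

```python
def bisect_nonce(predict_label, lo=0, hi=2**32 - 1):
    """Narrow [lo, hi] with an oracle that uses the same 0/1 labels as the table.

    predict_label(candidate) is assumed to return 0 when the candidate is
    below the true nonce and 1 otherwise (hypothetical interface).
    Returns the single remaining value and the number of oracle calls.
    """
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if predict_label(mid) == 0:
            lo = mid + 1      # candidate is below the true nonce: search above it
        else:
            hi = mid          # true nonce is at or below the candidate
    return lo, queries


def make_perfect_oracle(true_nonce):
    """Stand-in for the classifier that always answers correctly."""
    def predict_label(candidate):
        return 0 if candidate < true_nonce else 1
    return predict_label


guess, calls = bisect_nonce(make_perfect_oracle(2083236893))
print(guess, calls)
```

With a perfect oracle the search needs at most 32 queries. A real classifier will sometimes answer wrongly, and a single wrong answer sends the bisection into the wrong half, so any nonce found this way would still have to be checked by hashing the header.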

Data for 49800 solved bitcoin headers is available here (Python pickle format; see the code below).

Python code implementing the above ideas

The necessary imports

import hashlib
import struct
import pandas as pd
import random
import os.path
import pickle
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import json
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.legend_handler import HandlerLine2D

Helper functions with some comments

# Helper functions with some commenting
# load block header data
# Note: the data has been preprocessed and has been pickled
def load_data(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

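# Takes the true nonce, returns a random candidate nonce and a label
#   indicating whether the candidate is below (0) or above (1) the true nonce
# Used to make the training labels
def get_nonce(n):
    max_nonce = 2**32 - 1   # largest 32-bit nonce; randint includes both ends
    r = random.randint(0, max_nonce)
    while r == n:           # a candidate equal to the true nonce has no label
        r = random.randint(0, max_nonce)
    label = 0 if r < n else 1
    return (r, label)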

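# Makes a training set from the block header data by adding 
#   random nonce candidates and label
# NOTE: only the 'ver' field survives in this listing. The 148 feature
#   columns used further down imply that the other header fields were
#   appended to `header` as well; their dictionary keys are not shown here.
def make_df(data_dict):
    ml_df = []
    hex_int_dict = {'0':0, '1':1, '2':2, '3':3, '4':4, '5':5, 
                    '6':6, '7':7, '8':8, '9':9, 'a': 10, 
                    'b': 11, 'c':12, 'd':13, 'e':14, 'f':15}
    for i in data_dict:
        header = [str(i['ver'])]
        nonce = int(i['nonce'])
        # one integer feature per hex digit of the header fields
        digits = [hex_int_dict[c] for c in ''.join(header)]

        # 150 random candidate nonces per solved header
        for n in range(150):
            rand_test_nonce, label = get_nonce(nonce)
            row = list(digits)
            row.extend([rand_test_nonce, label, nonce])
            ml_df.append(row)   # without this the function returns an empty list
    return ml_df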

# The machine learning routine - uses a Random Forest classifier
# Runs 20 jobs - change this as necessary based on number of cores
# Warning - this takes time and memory to build the model
#     About 5 minutes on a 32 core 32GB machine
# Decrease sample size to reduce memory usage. Accuracy does not seem to suffer.
def train_randomforest(X_train, Y_train):
    clf = RandomForestClassifier(n_jobs=20)
    clf = clf.fit(X_train, Y_train)
    return clf

Load 20000 block headers to train on and a further 4000 for testing

t = load_data('bitcoinheaders.pkl')
# use only 20000 + 4000 rows instead of the ~49000 - makes the process faster
train = t[0:20000]
test = t[21000:25000]

# Convert it to dataframe in a format that sklearn can use
ml_df = pd.DataFrame(make_df(train))
ml_df_test = pd.DataFrame(make_df(test))

# Split the raw data into test and training sets
X = ml_df.columns[0:148]
Y = ml_df.columns[148]
X_train = ml_df[X]
Y_train = ml_df[Y]
X_test = ml_df_test[X]
Y_test = ml_df_test[Y]

   0  1  2  3  4  5  6  7  8  9  ...  138  139  140  141  142  143  144  145  146         147
0  1  0  0  0  0  0  0  0  0  3  ...    9    1   13    0    0   15   15   15   15   612623580
1  1  0  0  0  0  0  0  0  0  3  ...    9    1   13    0    0   15   15   15   15  1273942402
2  1  0  0  0  0  0  0  0  0  3  ...    9    1   13    0    0   15   15   15   15  2892576921
3  1  0  0  0  0  0  0  0  0  3  ...    9    1   13    0    0   15   15   15   15  2279745217
4  1  0  0  0  0  0  0  0  0  3  ...    9    1   13    0    0   15   15   15   15  1415422173

5 rows × 148 columns

Train a classifier

# Train the classifier on the training data
clf = train_randomforest(X_train, Y_train)

# Print the training set accuracy
print("Training set accuracy : " + str(clf.score(X_train, Y_train)))

# Print the test set accuracy
print("Test set accuracy : " + str(clf.score(X_test, Y_test)))

Training set accuracy : 0.999698333333
Test set accuracy : 0.783788333333
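
A useful reference point for these scores is a rule that ignores the header altogether and looks only at where the candidate falls in the 32-bit range. The simulation below is a sketch under an assumption of my own, namely that true nonces are uniformly distributed over 0 to 2^32 - 1 (real nonces need not be). The rule predicts label 1 whenever the candidate lies in the upper half of the range; under that assumption it is right about three times out of four.

```python
import random


def threshold_baseline_accuracy(n_samples=200000, seed=0):
    """Accuracy of 'predict 1 if the candidate is in the upper half of the range'.

    True nonces and candidates are both drawn uniformly at random, which is an
    assumption of this sketch, not a measured property of real block headers.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        true_nonce = rng.randrange(2**32)
        candidate = rng.randrange(2**32)
        while candidate == true_nonce:       # same exclusion as get_nonce
            candidate = rng.randrange(2**32)
        label = 0 if candidate < true_nonce else 1
        guess = 0 if candidate < 2**31 else 1
        correct += (guess == label)
    return correct / n_samples


baseline = threshold_baseline_accuracy()
print("header-blind baseline accuracy:", round(baseline, 3))
```

Comparing the test set accuracy against this figure, rather than against 50%, shows how much of the score comes from the header digits and how much from the candidate nonce alone.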

Graph the results

# Graph a prediction in relation to the true nonce
# rerun this cell to get another random selection
unique_nonces = ml_df[149].unique()
# randint includes its upper bound, so subtract 1 to stay inside the array
true_nonce = unique_nonces[random.randint(0, unique_nonces.shape[0] - 1)]
df_to_graph = ml_df.loc[ml_df[149] == true_nonce]
p = clf.predict(df_to_graph[X])   # hard 0/1 class labels, one per candidate

plt.scatter(df_to_graph[147], p, label='Candidate nonces')
plt.scatter(true_nonce, 0.5, c='red', label='True nonce for reference')
plt.xlabel('Nonce - 0 to 2^32')
plt.ylabel('Probability(true nonce < selected nonce)' )
plt.legend()

Feature importance

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.00505718,
        0.00411218,  0.00443182,  0.0042087 ,  0.0042018 ,  0.00401395,
        0.00403606,  0.00432378,  0.0043062 ,  0.00433245,  0.00423386,
        0.00406486,  0.00432232,  0.00410184,  0.00426883,  0.00427783,
        0.00415867,  0.00421704,  0.00432546,  0.00423495,  0.00417807,
        0.00435239,  0.00399234,  0.00408457,  0.00419569,  0.0042272 ,
        0.0041832 ,  0.00421131,  0.00436315,  0.00405665,  0.00407534,
        0.00405407,  0.00444397,  0.00426483,  0.00413364,  0.00441375,
        0.00413599,  0.0041483 ,  0.00430871,  0.00402952,  0.00445774,
        0.00407513,  0.00440545,  0.00401896,  0.0042067 ,  0.004515  ,
        0.00434291,  0.00427326,  0.00433099,  0.00432221,  0.00435674,
        0.00394147,  0.00419859,  0.00431308,  0.00405089,  0.00422883,
        0.00413344,  0.00419294,  0.00418918,  0.00421018,  0.00428869,
        0.00417185,  0.0039788 ,  0.00421298,  0.00437247,  0.00417522,
        0.004153  ,  0.00438604,  0.00432773,  0.0040969 ,  0.00437398,
        0.00419273,  0.00422763,  0.00438585,  0.00424765,  0.00435936,
        0.00417301,  0.00427413,  0.00409838,  0.00414348,  0.00424256,
        0.00421371,  0.00411336,  0.00425224,  0.00424889,  0.00431411,
        0.00396213,  0.00436847,  0.00416592,  0.00435117,  0.00420203,
        0.00440759,  0.00433886,  0.0042691 ,  0.00420313,  0.0044612 ,
        0.00468124,  0.00426538,  0.00428175,  0.00435905,  0.00416996,
        0.00421363,  0.00442613,  0.00445018,  0.00441981,  0.00429876,
        0.00421206,  0.00432622,  0.00442857,  0.00418831,  0.0042089 ,
        0.00401741,  0.00426253,  0.00422956,  0.00425827,  0.00441802,
        0.00439529,  0.00434705,  0.00417416,  0.00424235,  0.        ,
        0.        ,  0.02489109,  0.00509744,  0.00338871,  0.00344143,
        0.00355746,  0.00329374,  0.00357129,  0.00355075,  0.        ,
        0.00323213,  0.0033215 ,  0.00279406,  0.0077297 ,  0.01157081,
        0.00901339,  0.01932312,  0.38234826])

Each byte in the header contributes equally (about 0.004) to the prediction, except for the candidate nonce which contributes a lot more. Makes sense.

Concluding remarks

As can be seen from the above test set accuracy (about 78%) and the graph, the classifier does a good job of telling on which side of a candidate the true nonce lies. Also, "feature importance" shows that most of the bytes in the header contribute equally to the prediction - an observation that makes intuitive sense. There is a tremendous amount of interaction between the input variables (bytes) in the SHA hashing algorithms. The candidate nonce contributes a lot more, which also makes intuitive sense.

In this blog, I have trained a classifier to search for the nonce. A classifier can obviously also be trained to search for the correct timestamp, or a combination of the two, by changing the data encoding. The next step is to use the classifier on live data from bitcoind and try to mine the next block.

A pot of bitcoins awaits at the end of the random forest!

Thursday, January 29, 2015

The intrinsic value of chess pieces inferred from an analysis of 4.6 million boards

Is it possible to empirically infer the value of each chess piece based on an analysis of its participation in many winning/losing/drawing games?

Classical chess theory recommends relative values for each chess piece - 1 for pawns, 3 for knights and bishops etc. Unfortunately however, these values are based on the opinion of experts (chess grandmasters). One way to infer "true" values for pieces rather than values based on "expert opinion" is to statistically study and "average" the contribution of each piece to win/loss over many games. A practical way to do this is to train a machine learning model with board positions as features and the result of the game that the position came from (win/loss/draw) as target. We can then query the model for the importance of each piece. What follows is an attempt to do this using python sklearn.


The raw data for this effort comes from scraping data from here, (processed and mirrored here). A total of 84302 games found in this set were used for the analysis.


The following processing steps were performed for each game (processed data ready for use here):

1. Each game was parsed to get a sequence of board positions, one for each move

2. For each board position, crafty (the chess engine) was used to calculate a score

3. The final, fully processed, dataframe looks like this:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

chess_df = pd.read_csv('processed_chess_data.csv.gz', 
                       sep=' ', 

     r    n    b    q    k  b.1  n.1  r.1    p  p.1  ...   K  B.1  N.1  R.1  turn  count  w_win  b_win  draw   score
0    0    0    0    0    0    0    0    0    0    0  ...  38    0    0    0     0      1      0      0     1    0.02
1    0    0    0    0    0    0    0    0    0    0  ...  12    0    0    0     0      1      1      0     0   24.24
2    0    0    0    0    0    0    0    0    0    0  ...  37    0    0    0     0      1      0      1     0  -12.68
3    0    0    0    0    0    0    0    0    0    0  ...  37    0    0    0     1      1      0      1     0  -12.84
4    0    0    0    0    0    0    0    0    0    0  ...  30   50    0    0     1      1      0      0     1    1.57

5 rows × 38 columns

(4674871, 38)

The dataframe format


1. Column names are chess pieces: lower case for black pieces (e.g., n = black knight), upper case for white (R = white rook). When there is more than one piece of a kind, each piece is numbered (e.g., p.6 is the 7th black pawn, B.1 is the second white bishop).

2. 'turn' = whose turn it is to move (0=white, 1=black)

3. 'count' = number of times this board position was encountered in the dataset

4. 'w_win' = number of times white won when this board was encountered

5. 'b_win' = number of times black won when this board was encountered

6. 'draw' = number of draws when this board was encountered

7. 'score' = the score crafty assigned to this board

8. Each row represents one board position along with the extra data noted above

9. The cell value for a piece is the number of the chess board square on which that piece is found. The squares are numbered from the top left (a8 = 0) to the bottom right (h1 = 63) as a linear array (see figure below)

For example

chess_df.loc[0, 'K']  # returns 38.0

i.e., in board number 0, the white king ('K') is on square 38 (g4)
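
The square numbering can be converted to and from algebraic notation with a little arithmetic. The two helper functions below are my own naming and are not part of the original analysis; they only restate the a8 = 0 to h1 = 63 convention described above.

```python
FILES = "abcdefgh"


def square_to_name(square):
    """Convert a 0-63 square index (a8 = 0, h1 = 63) to algebraic notation."""
    if not 0 <= square <= 63:
        raise ValueError("square index must be between 0 and 63")
    row, col = divmod(square, 8)       # row 0 is the 8th rank, col 0 is the a-file
    return FILES[col] + str(8 - row)


def name_to_square(name):
    """Convert algebraic notation such as 'g4' back to the 0-63 index."""
    col = FILES.index(name[0])
    row = 8 - int(name[1])
    return row * 8 + col


print(square_to_name(38))
print(name_to_square("g4"))
```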

The board numbering is shown in the figure below

# Code for the chess board modified from 

from matplotlib.table import Table   # Table is used below but was never imported

def checkerboard_table(data, fmt='{:.2f}', bkg_colors=['grey', 'white']):
    fig, ax = plt.subplots()
    tb = Table(ax, bbox=[0,0,1,1])

    nrows, ncols = data.shape
    width, height = 1.0/ncols, 1.0/nrows

    # Add cells
    for (i,j), val in np.ndenumerate(data):
        # Index either the first or second item of bkg_colors based on
        # a checker board pattern
        idx = [j % 2, (j + 1) % 2][i % 2]
        color = bkg_colors[idx]

        tb.add_cell(i, j, width, height, text=str(val), 
                    loc='center', facecolor=color)

    # Row Labels (rank numbers, 8 at the top)...
    for i, label in enumerate(data.index):
        tb.add_cell(i, -1, width, height, text=str(8-label), loc='right', 
                    edgecolor='none', facecolor='none')
    # Column Labels...
    for j, label in enumerate(data.columns):
        tb.add_cell(-1, j, width, height/2, text=label, loc='center', 
                           edgecolor='none', facecolor='none')
    ax.add_table(tb)   # attach the table to the axes so that it is drawn
    return fig

data = pd.DataFrame(np.arange(0,64).reshape(8,8), 

Training a decision tree regression model on board score

We will first train a decision tree model. To start with, we will train it on the board evaluation 'score' provided by crafty, rather than win/loss.

from sklearn.tree import DecisionTreeRegressor

features = chess_df.columns[:33]   # the 32 pieces and 'turn'
target = 'score'      # the board 'score' provided by crafty
model1 = DecisionTreeRegressor(max_depth=70).fit(chess_df[features], chess_df[target])
model1.score(chess_df[features], chess_df[target])

The model fits the data well; actually overfits the data (R^2 of 0.999). For our purposes though, overfitting is not a concern. We only want to infer variable importance from this data. If generalization is necessary, we could tune this model, guided by cross-validation.

Now, we retrieve feature importance, i.e., the importance of the various pieces, by accessing the 'feature_importances_' attribute of the model (scaled so that the first black pawn, 'p' at index 8, has a value of 1):

pd.DataFrame(list(zip(features, model1.feature_importances_/model1.feature_importances_[8])))
0 1
0 r 1.980954
1 n 2.786199
2 b 2.234911
3 q 3.823624
4 k 1.388084
5 b.1 1.796774
6 n.1 1.742865
7 r.1 3.522390
8 p 1.000000
9 p.1 1.406889
10 p.2 1.038084
11 p.3 1.377806
12 p.4 1.297784
13 p.5 0.858814
14 p.6 0.454960
15 p.7 0.104892
16 P 1.811296
17 P.1 1.142458
18 P.2 1.101726
19 P.3 0.951071
20 P.4 0.873358
21 P.5 0.692127
22 P.6 0.359597
23 P.7 0.059979
24 R 2.625228
25 N 2.832997
26 B 2.432010
27 Q 7.343026
28 K 1.257644
29 B.1 1.066833
30 N.1 1.279959
31 R.1 1.811157
32 turn 1.937530

The average importance of the pawns is less than that of the other pieces, which in turn is less than that of the queen.

The average importance of the black pieces is less than that of the white pieces.

Overall, the actual values are in line with what classical chess theory predicts.
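
The per-type averages behind these observations can be checked by collapsing the numbered columns ('p', 'p.1', ...) into one group per piece type. The sketch below is a plain-Python illustration using the black-piece values copied from the table above; the helper name is my own.

```python
from collections import defaultdict

# Scaled importances for the black pieces, copied from the model1 table above
black_importance = {
    "r": 1.980954, "n": 2.786199, "b": 2.234911, "q": 3.823624,
    "k": 1.388084, "b.1": 1.796774, "n.1": 1.742865, "r.1": 3.522390,
    "p": 1.000000, "p.1": 1.406889, "p.2": 1.038084, "p.3": 1.377806,
    "p.4": 1.297784, "p.5": 0.858814, "p.6": 0.454960, "p.7": 0.104892,
}


def mean_by_piece_type(importance):
    """Average the importances of columns that refer to the same piece type."""
    groups = defaultdict(list)
    for column, value in importance.items():
        piece_type = column.split(".")[0]     # 'p.3' -> 'p', 'r.1' -> 'r'
        groups[piece_type].append(value)
    return {piece: sum(vals) / len(vals) for piece, vals in groups.items()}


means = mean_by_piece_type(black_importance)
for piece, mean in sorted(means.items(), key=lambda kv: kv[1]):
    print(piece, round(mean, 3))
```

The same function can be applied to the white-piece values to compare the two sides.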

Training a decision tree classification model on 'win/loss'

Note that the previous model was trained on board evaluation scores provided by crafty, a chess engine which uses piece weights in calculating the board evaluation score. It was not trained on win/loss (the "truth"). The results obtained are probably a reflection of this fact. So, this time, we shall train the model on the ground truth - using "white wins" (column 'w_win' in our dataframe).

We will first process the w_win column by converting it to "white wins as a fraction of total", then binarizing the result - if a board was won by white over 50% of the time, label it as a win for white. Then, we will train a decision tree classifier.

chess_df['white_win_binary'] = chess_df['w_win']/chess_df['count'] > 0.5

from sklearn.tree import DecisionTreeClassifier

features = chess_df.columns[:33]   # the 32 pieces and 'turn'
target = 'white_win_binary'  
model2 = DecisionTreeClassifier(max_depth=70).fit(chess_df[features], chess_df[target])
model2.score(chess_df[features], chess_df[target])
predicted_classes = model2.predict(chess_df[features])
pd.crosstab(chess_df[target], predicted_classes, rownames=['White win'], colnames=['Predicted'])

Predicted    False     True
White win
False      2989035        2
True             9  1685825

This model overfits too. The 'feature_importances_' values for the pieces, however, are completely different.

pd.DataFrame(list(zip(features, model2.feature_importances_/np.mean(model2.feature_importances_[17:23]))))
0 1
0 r 1.183964
1 n 1.443030
2 b 1.517433
3 q 1.198660
4 k 1.164395
5 b.1 0.812256
6 n.1 0.662847
7 r.1 0.982471
8 p 1.241384
9 p.1 1.256570
10 p.2 1.254353
11 p.3 1.320920
12 p.4 1.191160
13 p.5 1.091353
14 p.6 0.727144
15 p.7 0.218774
16 P 1.882447
17 P.1 1.601374
18 P.2 1.346524
19 P.3 1.148129
20 P.4 0.929727
21 P.5 0.627370
22 P.6 0.346875
23 P.7 0.117512
24 R 1.511778
25 N 1.339097
26 B 1.590355
27 Q 1.255038
28 K 1.109622
29 B.1 0.780831
30 N.1 0.654167
31 R.1 0.729700
32 turn 0.069270

These are values aggregated over many different chess boards. Many of the boards appear only once in the dataset, and not all pieces are present in all the boards.

# Fraction of chess boards in which each of the pieces are present
for i in features:
    print("{}\t:\t{}".format(i, float((chess_df[i] > 0).sum())/float(chess_df.shape[0])))
r  : 0.527911251455
n  : 0.675296708722
b  : 0.731863831109
q  : 0.638411626759
k  : 0.997306449739
b.1  : 0.324361891483
n.1  : 0.269364865897
r.1  : 0.613436606058
p  : 0.991219650767
p.1  : 0.9688089789
p.2  : 0.925571422185
p.3  : 0.853068459001
p.4  : 0.739480297959
p.5  : 0.571836527682
p.6  : 0.322804415352
p.7  : 0.0843967673119
P  : 0.992084273555
P.1  : 0.971450763026
P.2  : 0.928900711913
P.3  : 0.858552888411
P.4  : 0.74619000182
P.5  : 0.576605215417
P.6  : 0.322984313364
P.7  : 0.0817641813004
R  : 0.874121446346
N  : 0.663003535285
B  : 0.736831026995
Q  : 0.644438531031
K  : 0.939267842899
B.1  : 0.337640546659
N.1  : 0.257087307864
R.1  : 0.48930633594
turn  : 0.499698708264


Chess engines (crafty) assign values for the various pieces in line with classical chess theory. However, when pieces are evaluated based on win/loss, the valuations are dramatically different from classical chess theory. One explanation for this is that piece values are dynamic and depend on other factors, such as the stage of the game, tactical opportunities, etc. It would be interesting to analyze this dataset further, stratifying it by the number of pieces, location of some pieces, different piece combinations (e.g., Q vs. r & r.1) etc.