twitter

Showing posts with label random forest. Show all posts
Showing posts with label random forest. Show all posts

Thursday, March 19, 2015

Machine learning to rapidly search for the correct bitcoin block header nonce

Using machine learning, is it possible to predict the bitcoin nonce that will satisfy the difficulty target condition?

In this blog, I will attempt to do this. It is a work in progress, but the results are promising.

Background

First, lets begin with a VERY brief introduction to the relevent portions of bitcoin and bitcoin mining to help understand the rest of the blog. Bitcoin is an online paymeny system, which operates in a completely decentralized manner. Bitcoins, the currency of the bitcoin system, are exchanged for goods and services. A good analogy to bitcoins in the conventional finalcial world is a cheque. Cheques can be used as payment for goods and services. However, the cheque has to be verified to ensure that the payor has sufficient funds to cover the cheque, a process that is performed by the banks. In the bitcoin system, the process of transaction verification (i.e., bitcoin mining) is performed by the bitcoin community (the equivelent of conventional banks, but completely decentralized).

Bitcoin mining, or transaction verification, is performed in chunks called "blocks". Successful verification of a block is rewarded by the bitcoin system with 25 bitcoins paid to the first person providing proof of verification. Each block is formed by collating all the transactions since the previous block, constructing a block header from these transactions, and then altering this header methodically until the SHA256 double hash of this header is less than the currently set "difficulty" target. Ken Shirriff's excellent blog post (http://www.righto.com/2014/02/bitcoin-mining-hard-way-algorithms.html) explains the specifics of this process well and is worth reading carefully.

For the purposes of this blog, here are the take-away: - if a bitcoin block is verified, the verifier gets paid 25 bitcoins (as of this writing) - the SHA256 double hash of the block header has to be below a predefined difficulty target for the verification to be deemed successful - bitcoin miner verify blocks and meet the difficulty threshold to get paid

Assumptions (or why bitcoin mining is deemed to be hard)

SHA256 hashing function is a one-way hash function. The only way to know if the hash is below the difficulty target set by the bitcoin network is to alter the input one by one and compare the resulting hash to the target - i.e., by brute force. Since the size of the search space is very very large, a lot of computational power is necessary to be successful at this task.

The current approach to mining

The block header is an 80 byte string made of 1. version (4 bytes) 2. previous block's hash (32 bytes) 3. Merkle root (32 bytes) 4. timestamp (4 bytes) 5. bits (4 bytes) 6. nonce (4 bytes)

Of these, miners systematically alter the nonce (and timestamp with some constraints), compute the double SHA256 hash for each change, and check if the result is below the target. Merkle root can also be changed (by changing some of the coinbase transaction fields).

The proposed machine learning approach

The idea is to take the previously solved block headers and learn where the nonce might be. To do this, take each solved header and build the training data table such as this:

-----------------------------------------------------------------------  
| ver | prev_hash | merkle_root | time | bits | candidate nonce | label|
|----------------------------------------------------------------------|  
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 00000001        | 0    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 00010001        | 0    |
| .   | ...       | ...         | ..   | ..   | .               | .    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 10000001        | 1    |
| 3   | 46ae7...  | 8732ad...   | 843..| 1d.. | 40000001        | 1    |
-----------------------------------------------------------------------       

The "candidate nonce" and "label" need explaining. The candidate nonces are numbers less than and greater than the true nonce - i.e., candidates that we are testing. The label is encoded as 0 if the candidate nonce is less than the true nonce, 1 if greater.

A machine learning classifier can be trained on this data. Once trained, the classifier will predict 0 if the true nonce of an unknown header is greater than the candidate nonce, and 1 if less. So for any given candidate nonce, this classifier will say if the true nonce is lower or higher.

Data for 49800 solved bitcoin headers is available here (python pickle format. See code below)

Python code implementing the above ideas

The necessary imports

import hashlib
import struct
import pandas as pd
import random
import os.path
import pickle
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
import json
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.legend_handler import HandlerLine2D

Helper functions with some comments

# Helper functions with some commenting
#############################################################
# load block header data
# Note: the data has been preprocessed and has been pickled
def load_data(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Takes true nonce, returns random nonce value and label indicating 
#   above/below true nonce
# Used to make the training labels
def get_nonce(n):
    r = random.randint(0, 4294967296)
    while r == n:
        r = random.randint(0, 4294967296)
    if r < n:
        l = 0
    elif r > n:
        l = 1
    return (r, l)

# Makes a training set from the block header data by adding 
#   random nonce candidates and label
def make_df(data_dict):
    ml_df = []
    row = []
    hex_int_dict = {'0':0, '1':1, '2':2, '3':3, '4':4, '5':5, 
                    '6':6, '7':7, '8':8, '9':9, 'a': 10, 
                    'b': 11, 'c':12, 'd':13, 'e':14, 'f':15}
    
    for i in data_dict:
        try:
            header = [str(i['ver'])]
            header.extend(list(i['prev_block']))
            header.extend(list(i['mrkl_root']))
            header.extend(list(str(i['time']).zfill(10))) 
            header.extend(list(str(i['bits'])))
            nonce = int(i['nonce'])

            for n in range(150):
                rand_test_nonce, label = get_nonce(nonce)
                row = list(header)
                row = [hex_int_dict[r] for r in row]
                row.extend([rand_test_nonce, label, nonce])
                ml_df.append(row)
        except:
            continue
    return ml_df

# The machine learning routine - uses a Random Forest classifier
# Runs 20 jobs - change this as necessary based on number of cores
# Warning - this takes time and memory to build the model
#     About 5 minutes on a 32 core 32GB machine
# Decrease sample size to reduce memory usage. Accuracy does not seem to suffer.
def train_randomforest(X_train, Y_train):
    clf = RandomForestClassifier(n_jobs=20)
    clf = clf.fit(X_train, Y_train)
    return clf

Load 20000 block headers to train on and a further 4000 for testing

t = load_data('bitcoinheaders.pkl')
# use only 20000 + 4000 rows instead of the ~49000 - makes the process faster
train = t[0:20000]
test = t[21000:25000]

# Convert it to dataframe in a format that sklearn can use
ml_df = pd.DataFrame(make_df(train))
ml_df_test = pd.DataFrame(make_df(test))

# Split the raw data into test and training sets
X = ml_df.columns[0:148]
Y = ml_df.columns[148]
X_train = ml_df[X]
Y_train = ml_df[Y]
X_test = ml_df_test[X]
Y_test = ml_df_test[Y]
pd.DataFrame(X_train).head()
0 1 2 3 4 5 6 7 8 9 ... 138 139 140 141 142 143 144 145 146 147
0 1 0 0 0 0 0 0 0 0 3 ... 9 1 13 0 0 15 15 15 15 612623580
1 1 0 0 0 0 0 0 0 0 3 ... 9 1 13 0 0 15 15 15 15 1273942402
2 1 0 0 0 0 0 0 0 0 3 ... 9 1 13 0 0 15 15 15 15 2892576921
3 1 0 0 0 0 0 0 0 0 3 ... 9 1 13 0 0 15 15 15 15 2279745217
4 1 0 0 0 0 0 0 0 0 3 ... 9 1 13 0 0 15 15 15 15 1415422173

5 rows × 148 columns

Train a classifier

# Train the classifier on the training data
clf = train_randomforest(X_train, Y_train)

# Print the training set accuracy
print "Training set accuracy : " + str(clf.score(X_train, Y_train))

# Print the test set accuracy
print "Test set accuracy : " + str(clf.score(X_test, Y_test))
Training set accuracy : 0.999698333333
Test set accuracy : 0.783788333333

Graph the results

# Graph a prediction in relation to the true nonce
# rerun this cell to get another random selection
unique_nonces = ml_df[149].unique()
true_nonce = unique_nonces[random.randint(0, unique_nonces.shape[0])]
df_to_graph = ml_df.loc[ml_df[149] == true_nonce]
p = clf.predict(df_to_graph[X])

plt.scatter(df_to_graph[147], p, label='Candidate nonces')
plt.scatter(true_nonce, 0.5, c='red', label='True nonce for reference')
plt.xlabel('Nonce - 0 to 2^32')
plt.ylabel('Probability(true nonce < selected nonce)' )
plt.legend(loc=2)
<matplotlib.legend.Legend at 0x7f886f8f98d0>

Feature importance

clf.feature_importances_
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.00505718,
        0.00411218,  0.00443182,  0.0042087 ,  0.0042018 ,  0.00401395,
        0.00403606,  0.00432378,  0.0043062 ,  0.00433245,  0.00423386,
        0.00406486,  0.00432232,  0.00410184,  0.00426883,  0.00427783,
        0.00415867,  0.00421704,  0.00432546,  0.00423495,  0.00417807,
        0.00435239,  0.00399234,  0.00408457,  0.00419569,  0.0042272 ,
        0.0041832 ,  0.00421131,  0.00436315,  0.00405665,  0.00407534,
        0.00405407,  0.00444397,  0.00426483,  0.00413364,  0.00441375,
        0.00413599,  0.0041483 ,  0.00430871,  0.00402952,  0.00445774,
        0.00407513,  0.00440545,  0.00401896,  0.0042067 ,  0.004515  ,
        0.00434291,  0.00427326,  0.00433099,  0.00432221,  0.00435674,
        0.00394147,  0.00419859,  0.00431308,  0.00405089,  0.00422883,
        0.00413344,  0.00419294,  0.00418918,  0.00421018,  0.00428869,
        0.00417185,  0.0039788 ,  0.00421298,  0.00437247,  0.00417522,
        0.004153  ,  0.00438604,  0.00432773,  0.0040969 ,  0.00437398,
        0.00419273,  0.00422763,  0.00438585,  0.00424765,  0.00435936,
        0.00417301,  0.00427413,  0.00409838,  0.00414348,  0.00424256,
        0.00421371,  0.00411336,  0.00425224,  0.00424889,  0.00431411,
        0.00396213,  0.00436847,  0.00416592,  0.00435117,  0.00420203,
        0.00440759,  0.00433886,  0.0042691 ,  0.00420313,  0.0044612 ,
        0.00468124,  0.00426538,  0.00428175,  0.00435905,  0.00416996,
        0.00421363,  0.00442613,  0.00445018,  0.00441981,  0.00429876,
        0.00421206,  0.00432622,  0.00442857,  0.00418831,  0.0042089 ,
        0.00401741,  0.00426253,  0.00422956,  0.00425827,  0.00441802,
        0.00439529,  0.00434705,  0.00417416,  0.00424235,  0.        ,
        0.        ,  0.02489109,  0.00509744,  0.00338871,  0.00344143,
        0.00355746,  0.00329374,  0.00357129,  0.00355075,  0.        ,
        0.00323213,  0.0033215 ,  0.00279406,  0.0077297 ,  0.01157081,
        0.00901339,  0.01932312,  0.38234826])

Each byte in the header contributes equally (about 0.004) to the prediction, except for the candidate nonce which contributes a lot more. Makes sense.

Concluding remarks

As can be seen from the above test set accuracy and the graph, the classifier does a very good job of identifying the nonce. Also, "feature importance" shows that most of the bytes in the header contribute equally to the prediction - an observation that makes intuitive sense. There is a tremendous amount of interraction between the input variables (bytes) in the SHA hashing algorithms. The candidate nonce contributes a lot more, which also makes intuitive sense.

In this blog, I have trained a classifier to search for the nonce. A classifier can obviously also be trained to search for the correct timestamp, or a combination of the two, by changing the data encoding. The next step is to use the classifier on live data from bitcoind and try to mine the next block.

A pot of bitcoins await at the end of the random forest!

Thursday, January 22, 2015

A heuristic algorithm for factoring larger numbers into their composite primes using machine learning

Many cryptographic systems depend upon the difficulty in factorization of a large composite number into it's primes. There exists no polynomial time algorithm for this problem. Is it possible to solve this problem using statistical learning?

For the purposes of illustration, consider this simple example: 7 x 23 = 161

The task is to factorize 161. It is evident that one of the primes will be less than sqrt(161) and the other greater. Could an algorithm be developed that would, for any number in the range 1 to sqrt(161), give the probability of the true prime factor being above it. In this example, if 5 is given, the algorithm will return a probability close to 1 (indicating the prime factor is higher than 5); if 13 is given, it will return a probability close to 0.

What follows is an attempt to train a simple Random Forest Classifier using a dataset created from the first 50 million primes (found here). The data is available here. The Makefile should rebuild the prime number list.

The necessary imports:

import random
import math
import gzip
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.metrics import roc_auc_score

Functions for loading the prime numbers and making a training set:

# Load the list of prime numbers from file (available at the address above)
def load_prime_numbers():
    PrimeNumberList =  gzip.open('primes.txt.gz', 'rb').read().splitlines()
    PrimeNumberList = [long(p) for p in PrimeNumberList]
    return PrimeNumberList

def build_primes_and_product(PrimeNumberList, SampleSize = None):
    #builds a test set of size SampleSize from the PrimeNumberList
    #Returns a list of [[prime_number_1, primt_number_2, product]...]
    PrimesAndProduct = []
    if SampleSize == None:
        SampleSize = 100000
    for i in range(SampleSize):
        a = PrimeNumberList[random.randint(0, (SampleSize-1))]
        b = PrimeNumberList[random.randint(0, (SampleSize-1))]
        if a > b:
            a, b = b, a
        PrimesAndProduct.append([a, b, a*b])
    return PrimesAndProduct

def build_test_data_set(PrimesAndProduct, TestSetSize = None):
    #For each product and smaller of the primes,
    #return (product, random number less/greater than smaller of the primes, 
    # 0 for smaller or 1 for greater)
    if TestSetSize == None:
        TestSetSize = 1000000
    TrainingSet = []
    for i in PrimesAndProduct:
        prime1, prime2, product = i
        SqrtOfProduct = long(product**0.5)
        FlipCoin = random.randint(0,1)
        if FlipCoin == 0:
            ChosenNumber = long(random.randrange(1, prime1))
            target = 0
        else: 
            ChosenNumber = long(random.randrange(prime1-1, SqrtOfProduct))
            target = 1
        TrainingSet.append([product, 
                            ChosenNumber, 
                            target,
                            prime1])
    return TrainingSet
def extract_features(Sample):
    #make the training data
    #take a row of data from PrimesAndProduct,
    #return a list of features and target 
    FeatureList = []
    for i in Sample:
        product, chosen, target, prime1 = i
        product_split = list(str(product).zfill(20))
        product_split.extend(list(str(chosen).zfill(20)))
        product_split = [int(h) for h in product_split]
        product_split.extend([target, prime1, chosen])
        FeatureList.append(product_split)
    return FeatureList

Functions to fit various classifiers and return the fit model

def learn_cross_validation(features, target):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation
    clf = RandomForestClassifier(n_estimators = 400, n_jobs=20)
    scores = cross_validation.cross_val_score(clf, features, target, cv=5)
    print scores
    
def learn_RF(features, target):
    from sklearn.ensemble import RandomForestClassifier
    clr = RandomForestClassifier(n_estimators = 400, n_jobs=20)
    clr.fit(features, target)
    return clr

def learn_LR(features, target):
    from sklearn.linear_model import LogisticRegression
    clr = LogisticRegression()
    clr.fit_transform(features, target)
    return clr

Make a test set with evenly spaced out numbers between 1 and sqrt(composite) along with target

def meaningful_test(product, prime1, TestSetSize):
    if TestSetSize == None:
        TestSetSize = 10000
    TrainingSet = []
    for i in range(1, TestSetSize):
        SqrtOfProduct = long(product**0.5) * 1.2
        ChosenNumber = long(SqrtOfProduct * i) / long(TestSetSize)
        if ChosenNumber < prime1:
            target = 0
        else:
            target = 1
        TrainingSet.append([product, 
                            ChosenNumber, 
                            target,
                            prime1])
    return TrainingSet    

Build training set

p = load_prime_numbers()
pnp = build_primes_and_product(p, 100000)
r=build_test_data_set(pnp)
test = pd.DataFrame(extract_features(r))

Train a model

model = learn_RF(test.loc[:,0:39], test.loc[:,40])
#model = learn_LR(test.loc[:,0:39], test.loc[:,40])

ROC_AUC of the training set - has trained (very ?) well

h = model.predict(test.loc[:,0:39])
print roc_auc_score(test.loc[:,40], h)
pd.crosstab(h, test.loc[:,40])
1.0

Truth 0 1
Predicted
0 50184 0
1 0 49816

ROC_AUC on a new test set

new_test_data = pd.DataFrame(extract_features(build_test_data_set((build_primes_and_product(p, 10000)))))
h = model.predict(new_test_data.loc[:,0:39])
print roc_auc_score(new_test_data.loc[:,40], h)
pd.crosstab(h, new_test_data.loc[:,40])
0.791999857273

Truth 0 1
Predicted
0 3499 550
1 1537 4414

ROC_AUC on a meaningful test set. The graph displays the predicted transition in probability in relation to the prime factor

rand_prime1, rand_prime2, rand_product = pnp[random.randint(0, len(pnp))]
meaningful_test_data = pd.DataFrame(extract_features(meaningful_test(rand_product, rand_prime1, 10000)))
h = model.predict(meaningful_test_data.loc[:,0:39])
print rand_product, rand_prime1
print roc_auc_score(meaningful_test_data.loc[:,40], h)
plt.scatter(meaningful_test_data[42], h)
plt.scatter(rand_prime1, 0.5, c='r')
pd.crosstab(h, meaningful_test_data.loc[:,40])
489359706679 450403
0.98438286485

Truth 0 1
Predicted
0 5253 48
1 112 4586

Graphical representation: