## Thursday, January 29, 2015

### Is it possible to emperically infer the value of each chess piece based on an analysis of it's participation in many winning/losing/drawing games?

Classical chess theory recommends relative values for each chess piece - 1 for pawns, 3 for knights and bishops etc. Unfortunately however, these values are based on the opinion of experts (chess grandmasters). One way to infer "true" values for pieces rather than values based on "expert opinion" is to statistically study and "average" the contribution of each piece to win/loss over many games. A practical way to do this is to train a machine learning model with board positions as features and the result of the game that the position came from (win/loss/draw) as target. We can then query the model for the importance of each piece. What follows is an attempt to do this using python sklearn.

### Data

The raw data for this effort comes from scraping data from here, (processed and mirrored here). A total of 84302 games found in this set were used for the analysis.

### Preprocessing

The following processing steps were performed for each game (processed data ready for use here):

1. Each game was parsed to get a sequence of board positions, one for each move

2. For each board position, crafty (the chess engine) was used to calculate a score

3. The final, fully processed, dataframe looks like this:

```import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

chess_df = pd.read_csv('processed_chess_data.csv.gz',
sep=' ',
compression='gzip')
chess_df.head(5)
```
r n b q k b.1 n.1 r.1 p p.1 ... K B.1 N.1 R.1 turn count w_win b_win draw score
0 0 0 0 0 0 0 0 0 0 0 ... 38 0 0 0 0 1 0 0 1 0.02
1 0 0 0 0 0 0 0 0 0 0 ... 12 0 0 0 0 1 1 0 0 24.24
2 0 0 0 0 0 0 0 0 0 0 ... 37 0 0 0 0 1 0 1 0 -12.68
3 0 0 0 0 0 0 0 0 0 0 ... 37 0 0 0 1 1 0 1 0 -12.84
4 0 0 0 0 0 0 0 0 0 0 ... 30 50 0 0 1 1 0 0 1 1.57

5 rows — 38 columns

```chess_df.shape
```
```(4674871, 38)
```

The dataframe format

Columns:

1. Column names are chess pieces Lower case for black pieces (e.g., n = black knight), upper case for white (R = white rook) When there is more than one piece, each piece is numbered (e.g., p.6 is the 7th black pawn, B.1 is the second white bishop)

2. 'turn' = who's turn is it to move (0=white, 1=black)

3. 'count' = number of times this board position was encountered in the dataset

4. 'w_win' = number of times white won when this board was encountered

5. 'b_win' = number of times black won when this board was encountered

6. 'draw' = number of draws when this board was encountered 7. 'score' = the score crafty assigned to this board

Rows:

1. Each row represents one board position along with the extra data noted above

Cells:

1. The cell value is the number of the chess board square on which that piece is found. The squares are numbered from the top left (a8 = 0) to the bottom right (h1 = 63) as a linear array (see figure below)

For example

``chess_df.loc[0, 'K']  # returns 38.0``

i.e., in board number 0, the white king ('K') is on square 38 (g4)

The board numbering is shown in the figure below

```# Code for the chess board modified from
# http://stackoverflow.com/questions/10194482/custom-matplotlib-plot-chess-board-like-table-with-colored-cells

def checkerboard_table(data, fmt='{:.2f}', bkg_colors=['grey', 'white']):
fig, ax = plt.subplots()
ax.set_axis_off()
tb = Table(ax, bbox=[0,0,1,1])

nrows, ncols = data.shape
width, height = 1.0/ncols, 1.0/nrows

# Add cells
for (i,j), val in np.ndenumerate(data):
# Index either the first or second item of bkg_colors based on
# a checker board pattern
idx = [j % 2, (j + 1) % 2][i % 2]
color = bkg_colors[idx]

tb.add_cell(i, j, width, height, text=str(val),
loc='center', facecolor=color)

# Row Labels...
for i, label in enumerate(data.index):
tb.add_cell(i, -1, width, height, text=(8-label), loc='right',
edgecolor='none', facecolor='none')
# Column Labels...
for j, label in enumerate(data.columns):
tb.add_cell(-1, j, width, height/2, text=label, loc='center',
edgecolor='none', facecolor='none')
ax.add_table(tb)
return fig

data = pd.DataFrame(np.arange(0,64).reshape(8,8),
columns=['a','b','b','d','e','f','g','h'])
checkerboard_table(data)
plt.show()
```

### Training a decision tree regression model on board score

We will first train a decision tree model. To start with, we will train it on the board evaluation 'score' provided by crafty, rather than win/loss.

```from sklearn.tree import DecisionTreeRegressor

features = chess_df.columns[:33]   # the 32 pieces and 'turn'
target = 'score'      # the board 'score' provided by crafty
model1 = DecisionTreeRegressor(max_depth=70)
model1.fit(chess_df[features], chess_df[target])
model1.score(chess_df[features], chess_df[target])
```
```0.9999812089822121
```

The model fits the data well; actually overfits the data (R^2 of 0.999). For our purposes though, overfitting is not a concern. We only want to infer variable importance from this data. If generalization is necessary, we could tune this model, guided by cross-validation.

Now, we retrieve feature importance, i.e., the importance of the various pieces, by accessing the 'feature_importances_' attribute of the model (scaled to the lowest value piece):

```pd.DataFrame(zip(features, model1.feature_importances_/model1.feature_importances_[8]))
```
0 1
0 r 1.980954
1 n 2.786199
2 b 2.234911
3 q 3.823624
4 k 1.388084
5 b.1 1.796774
6 n.1 1.742865
7 r.1 3.522390
8 p 1.000000
9 p.1 1.406889
10 p.2 1.038084
11 p.3 1.377806
12 p.4 1.297784
13 p.5 0.858814
14 p.6 0.454960
15 p.7 0.104892
16 P 1.811296
17 P.1 1.142458
18 P.2 1.101726
19 P.3 0.951071
20 P.4 0.873358
21 P.5 0.692127
22 P.6 0.359597
23 P.7 0.059979
24 R 2.625228
25 N 2.832997
26 B 2.432010
27 Q 7.343026
28 K 1.257644
29 B.1 1.066833
30 N.1 1.279959
31 R.1 1.811157
32 turn 1.937530

The average scores for the pawns is less than the pieces, which is less than the queen.

The average importance of the black pieces is less then the white pieces.

Overall, the actual values are in line with what classical chess theory predicts.

### Training a decision tree classification model on 'win/loss'

Note that the previous model was trained on board evaluation scores provided by crafty, a chess engine which uses piece weights in calculating the board evaluation score. It has not trained on win/loss (the '"truth"). The results obtained are probably a reflection of this fact. So, this time, we shall train the model on the ground truth - using "white wins" (column 'w_win' in our dataframe).

We will first process the w_win column by converting it to "white wins as a fraction of total", then binarizing the result - if a board was won by white over 50% of the time, label it as a win for white. Then, we will train a decision tree classifier.

```chess_df['white_win_binary'] = chess_df['w_win']/chess_df['count'] > 0.5

from sklearn.tree import DecisionTreeClassifier

features = chess_df.columns[:33]   # the 32 pieces and 'turn'
target = 'white_win_binary'
model2 = DecisionTreeClassifier(max_depth=70)
model2.fit(chess_df[features], chess_df[target])
model2.score(chess_df[features], chess_df[target])
```
```0.99999764699389571
```
```predicted_classes = model2.predict(chess_df[features])
pd.crosstab(chess_df[target], predicted_classes, rownames=['White win'], colnames=['Predicted'])
```
Predicted False True
White win
False 2989035 2
True 9 1685825

This model overfits too. The 'feature_importance_' values for the pieces, however, are completely different.

```pd.DataFrame(zip(features, model2.feature_importances_/np.mean(model2.feature_importances_[17:23])))
```
0 1
0 r 1.183964
1 n 1.443030
2 b 1.517433
3 q 1.198660
4 k 1.164395
5 b.1 0.812256
6 n.1 0.662847
7 r.1 0.982471
8 p 1.241384
9 p.1 1.256570
10 p.2 1.254353
11 p.3 1.320920
12 p.4 1.191160
13 p.5 1.091353
14 p.6 0.727144
15 p.7 0.218774
16 P 1.882447
17 P.1 1.601374
18 P.2 1.346524
19 P.3 1.148129
20 P.4 0.929727
21 P.5 0.627370
22 P.6 0.346875
23 P.7 0.117512
24 R 1.511778
25 N 1.339097
26 B 1.590355
27 Q 1.255038
28 K 1.109622
29 B.1 0.780831
30 N.1 0.654167
31 R.1 0.729700
32 turn 0.069270

These are values aggregated over many different chess boards. Many of the boards in the dataset are only seen once in the dataset. Not all pieces are present in all the boards.

```# Fraction of chess boards in which each of the pieces are present
for i in features:
print i, "\t:\t",float((chess_df[i] > 0).sum())/float(chess_df.shape[0])
```
```r  : 0.527911251455
n  : 0.675296708722
b  : 0.731863831109
q  : 0.638411626759
k  : 0.997306449739
b.1  : 0.324361891483
n.1  : 0.269364865897
r.1  : 0.613436606058
p  : 0.991219650767
p.1  : 0.9688089789
p.2  : 0.925571422185
p.3  : 0.853068459001
p.4  : 0.739480297959
p.5  : 0.571836527682
p.6  : 0.322804415352
p.7  : 0.0843967673119
P  : 0.992084273555
P.1  : 0.971450763026
P.2  : 0.928900711913
P.3  : 0.858552888411
P.4  : 0.74619000182
P.5  : 0.576605215417
P.6  : 0.322984313364
P.7  : 0.0817641813004
R  : 0.874121446346
N  : 0.663003535285
B  : 0.736831026995
Q  : 0.644438531031
K  : 0.939267842899
B.1  : 0.337640546659
N.1  : 0.257087307864
R.1  : 0.48930633594
turn  : 0.499698708264

```

### Conclusions

Chess engines (crafty) assign values for the various pieces in line with classical chess theory. However, when peices are evaluated based on win/loss, the valuations are dramatically different from classical chess theory. One explaination for this is that piece values are dynamic and depend on other factors, such as the stage of the game, tactical opportunities, etc. It would be interesting to anayze this dataset further, stratifying it by the number of pieces, location of some pieces, different piece combinations (e.g., Q vs. r & r.1) etc.

## Thursday, January 22, 2015

### A heuristic algorithm for factoring larger numbers into their composite primes using machine learning

Many cryptographic systems depend upon the difficulty in factorization of a large composite number into it's primes. There exists no polynomial time algorithm for this problem. Is it possible to solve this problem using statistical learning?

For the purposes of illustration, consider this simple example: 7 x 23 = 161

The task is to factorize 161. It is evident that one of the primes will be less than sqrt(161) and the other greater. Could an algorithm be developed that would, for any number in the range 1 to sqrt(161), give the probability of the true prime factor being above it. In this example, if 5 is given, the algorithm will return a probability close to 1 (indicating the prime factor is higher than 5); if 13 is given, it will return a probability close to 0.

What follows is an attempt to train a simple Random Forest Classifier using a dataset created from the first 50 million primes (found here). The data is available here. The Makefile should rebuild the prime number list.

The necessary imports:

```import random
import math
import gzip
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.metrics import roc_auc_score
```

Functions for loading the prime numbers and making a training set:

```# Load the list of prime numbers from file (available at the address above)
def load_prime_numbers():
PrimeNumberList =  gzip.open('primes.txt.gz', 'rb').read().splitlines()
PrimeNumberList = [long(p) for p in PrimeNumberList]
return PrimeNumberList

def build_primes_and_product(PrimeNumberList, SampleSize = None):
#builds a test set of size SampleSize from the PrimeNumberList
#Returns a list of [[prime_number_1, primt_number_2, product]...]
PrimesAndProduct = []
if SampleSize == None:
SampleSize = 100000
for i in range(SampleSize):
a = PrimeNumberList[random.randint(0, (SampleSize-1))]
b = PrimeNumberList[random.randint(0, (SampleSize-1))]
if a > b:
a, b = b, a
PrimesAndProduct.append([a, b, a*b])
return PrimesAndProduct

def build_test_data_set(PrimesAndProduct, TestSetSize = None):
#For each product and smaller of the primes,
#return (product, random number less/greater than smaller of the primes,
# 0 for smaller or 1 for greater)
if TestSetSize == None:
TestSetSize = 1000000
TrainingSet = []
for i in PrimesAndProduct:
prime1, prime2, product = i
SqrtOfProduct = long(product**0.5)
FlipCoin = random.randint(0,1)
if FlipCoin == 0:
ChosenNumber = long(random.randrange(1, prime1))
target = 0
else:
ChosenNumber = long(random.randrange(prime1-1, SqrtOfProduct))
target = 1
TrainingSet.append([product,
ChosenNumber,
target,
prime1])
return TrainingSet
```
```def extract_features(Sample):
#make the training data
#take a row of data from PrimesAndProduct,
#return a list of features and target
FeatureList = []
for i in Sample:
product, chosen, target, prime1 = i
product_split = list(str(product).zfill(20))
product_split.extend(list(str(chosen).zfill(20)))
product_split = [int(h) for h in product_split]
product_split.extend([target, prime1, chosen])
FeatureList.append(product_split)
return FeatureList
```

Functions to fit various classifiers and return the fit model

```def learn_cross_validation(features, target):
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
clf = RandomForestClassifier(n_estimators = 400, n_jobs=20)
scores = cross_validation.cross_val_score(clf, features, target, cv=5)
print scores

def learn_RF(features, target):
from sklearn.ensemble import RandomForestClassifier
clr = RandomForestClassifier(n_estimators = 400, n_jobs=20)
clr.fit(features, target)
return clr

def learn_LR(features, target):
from sklearn.linear_model import LogisticRegression
clr = LogisticRegression()
clr.fit_transform(features, target)
return clr
```

Make a test set with evenly spaced out numbers between 1 and sqrt(composite) along with target

```def meaningful_test(product, prime1, TestSetSize):
if TestSetSize == None:
TestSetSize = 10000
TrainingSet = []
for i in range(1, TestSetSize):
SqrtOfProduct = long(product**0.5) * 1.2
ChosenNumber = long(SqrtOfProduct * i) / long(TestSetSize)
if ChosenNumber < prime1:
target = 0
else:
target = 1
TrainingSet.append([product,
ChosenNumber,
target,
prime1])
return TrainingSet
```

Build training set

```p = load_prime_numbers()
pnp = build_primes_and_product(p, 100000)
r=build_test_data_set(pnp)
test = pd.DataFrame(extract_features(r))
```

Train a model

```model = learn_RF(test.loc[:,0:39], test.loc[:,40])
#model = learn_LR(test.loc[:,0:39], test.loc[:,40])
```

ROC_AUC of the training set - has trained (very ?) well

```h = model.predict(test.loc[:,0:39])
print roc_auc_score(test.loc[:,40], h)
pd.crosstab(h, test.loc[:,40])
```
```1.0

```
Truth 0 1
Predicted
0 50184 0
1 0 49816

ROC_AUC on a new test set

```new_test_data = pd.DataFrame(extract_features(build_test_data_set((build_primes_and_product(p, 10000)))))
h = model.predict(new_test_data.loc[:,0:39])
print roc_auc_score(new_test_data.loc[:,40], h)
pd.crosstab(h, new_test_data.loc[:,40])
```
```0.791999857273

```
Truth 0 1
Predicted
0 3499 550
1 1537 4414

ROC_AUC on a meaningful test set. The graph displays the predicted transition in probability in relation to the prime factor

```rand_prime1, rand_prime2, rand_product = pnp[random.randint(0, len(pnp))]
meaningful_test_data = pd.DataFrame(extract_features(meaningful_test(rand_product, rand_prime1, 10000)))
h = model.predict(meaningful_test_data.loc[:,0:39])
print rand_product, rand_prime1
print roc_auc_score(meaningful_test_data.loc[:,40], h)
plt.scatter(meaningful_test_data[42], h)
plt.scatter(rand_prime1, 0.5, c='r')
pd.crosstab(h, meaningful_test_data.loc[:,40])
```
```489359706679 450403
0.98438286485

```
Truth 0 1
Predicted
0 5253 48
1 112 4586

Graphical representation: