Fast Cross Validation

2 minute read

Published:

I was working on the drafttopic project for Wikipedia, for which I was training and cross-validating models on the drafttopic dataset. It contains about 93,000 observations, each with multiple target labels, so it is a multilabel classification problem. In my first few runs I realized that training and cross validation weren't finishing even after a full day. The number of estimators was close to 100 and max_depth was 4; with those parameters it shouldn't have taken a day to train or cross-validate. I decided to profile the cross validation, which surfaced some interesting results.
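The report below is the standard `cProfile`/`pstats` output sorted by cumulative time. A minimal sketch of how to produce such a report (the `cross_validate()` function here is just a stand-in for the real revscoring entry point):

```python
import cProfile
import io
import pstats

def cross_validate():
    # stand-in workload for the real cross-validation entry point
    sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
cross_validate()
profiler.disable()

# print the 50 most expensive calls, ordered by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(50)
print(stream.getvalue())
```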

   Ordered by: cumulative time
   List reduced from 260 to 50 due to restriction <50>                          

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 30453.783 30453.783 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
        1    1.160    1.160 30453.733 30453.733 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
        1    0.912    0.912 29308.131 29308.131 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
    11148  171.564    0.015 29307.219    2.629 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
    33442 1665.965    0.050 29019.662    0.868 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
    33443   58.356    0.002 28459.225    0.851 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)

It clearly shows what's going on: the method taking the most time was predict_proba of the underlying sklearn estimator. Upon further analysis, I found that this method was being called by the score method of revscoring:

def _cross_score():
...
	return [(model.score(feature_values), label)
			for feature_values, label in test_set]

Those two lines cleared the whole picture: the method was scoring instances one by one instead of scoring them together in a batch. It comes down to matrix multiplication and the underlying numpy optimizations. To put it simply, suppose A1, A2, A3, ... are feature vectors and K is the coefficient matrix each one is multiplied with to get a prediction.

time(A1.K+A2.K+A3.K...) > time((A1+A2+A3...).K)

That is, multiplying each feature vector with the coefficient matrix separately and then aggregating the results is slower than concatenating all the feature vectors into a feature matrix and multiplying it with the coefficient matrix once to get the prediction vector. This didn't seem like an obvious thing while investigating, but when I applied the fix to revscoring it improved the speed by as much as 5 times! I simply changed the above code to:
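To see the effect outside revscoring, here is a small self-contained benchmark (the model and data are made up for illustration, not the real drafttopic setup) comparing per-row calls to predict_proba against one batched call:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy stand-in for the real drafttopic dataset and model
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=4,
                               random_state=0).fit(X, y)

# score each observation individually, as _cross_score originally did
start = time.perf_counter()
one_by_one = np.vstack([model.predict_proba(row.reshape(1, -1)) for row in X])
t_single = time.perf_counter() - start

# score all observations in a single batched call
start = time.perf_counter()
batched = model.predict_proba(X)
t_batch = time.perf_counter() - start

print(f"per-row: {t_single:.3f}s  batched: {t_batch:.3f}s")
```

The two approaches produce identical probabilities; only the batched one lets numpy do the heavy lifting in a few large operations instead of thousands of tiny ones.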

def _cross_score():
...
	feature_values, labels = map(list, zip(*test_set))
	docs = model.score_many(feature_values)
	return list(zip(docs, labels))

and implemented a new score_many method that aggregates the feature vectors and then calls predict_proba once. It was a small fix to revscoring, but it brought cross validation times down to reasonable limits, and it felt great when it finally got merged!
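For the curious, a minimal sketch of the idea behind score_many (the class and internals here are illustrative, not the real revscoring API): stack all feature vectors into one matrix so a single predict_proba call handles the whole batch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class Model:
    """Illustrative wrapper; the real revscoring Model differs."""

    def __init__(self, estimator):
        self.estimator = estimator

    def score(self, feature_values):
        # old path: one observation per call, kept for compatibility
        return self.score_many([feature_values])[0]

    def score_many(self, feature_values_list):
        # new path: build one feature matrix and make a single
        # batched predict_proba call for all observations
        X = np.asarray(feature_values_list)
        probas = self.estimator.predict_proba(X)
        return [dict(zip(self.estimator.classes_, row)) for row in probas]
```

With this in place, _cross_score can transpose the test set once and hand all feature vectors to score_many in one go, which is exactly where the speedup comes from.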