# Fast Cross Validation

I was working on the drafttopic project for Wikipedia, training and cross-validating a model on the drafttopic dataset. The dataset has 93,000 observations, each with multiple target labels, so this is a case of multilabel classification. After a few initial runs I realized that training and cross-validation weren’t finishing even when given a full day. The number of estimators was close to 100 and max_depth was 4; with these parameters it shouldn’t have taken a day to train or cross-validate. I decided to profile the cross-validation, which turned up some interesting results.

    Ordered by: cumulative time
    List reduced from 260 to 50 due to restriction <50>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000 30453.783 30453.783 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
         1    1.160    1.160 30453.733 30453.733 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
         1    0.912    0.912 29308.131 29308.131 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
     11148  171.564    0.015 29307.219    2.629 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
     33442 1665.965    0.050 29019.662    0.868 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
     33443   58.356    0.002 28459.225    0.851 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)


The profile makes it clear what is going on. The method taking the most time was predict_proba of the underlying sklearn model: it was called 33,442 times and accounts for about 29,020 of the 30,454 seconds of cumulative time. Upon further analysis, I found that it was being called from the score method of revscoring.
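
A report in this format can be produced with Python’s built-in cProfile and pstats modules. The following is a minimal sketch of that, not the exact commands used for the run above; cross_validate_stub and score_stub are hypothetical stand-ins for the real revscoring calls.

```python
import cProfile
import io
import pstats


def score_stub(x):
    """Hypothetical per-observation scoring function."""
    return x * 2


def cross_validate_stub(n=1000):
    """Hypothetical stand-in for the real cross-validation call."""
    return sum(score_stub(i) for i in range(n))


def profile_call(func, *args, top=50):
    """Run func under cProfile; return its result and a text report
    sorted by cumulative time, limited to the first `top` entries."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(cross_validate_stub, 1000)
print(report)
```

Sorting by cumulative time puts the outermost calls first, which makes it easy to follow the call chain down to the function where the time is actually spent.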

    def _cross_score():
        ...
        return [(model.score(feature_values), label)
                for feature_values, label in test_set]


The two lines of the return statement above made the whole picture clear. The method was scoring instances one by one instead of scoring them together in a batch. The cost of doing that is easiest to see with matrix multiplication and numpy’s underlying optimizations. To put it simply, suppose A1, A2, A3, ... are feature vectors and K is the coefficient matrix by which each is multiplied to get its prediction.

    time([A1.K; A2.K; A3.K; ...]) > time([A1; A2; A3; ...].K)
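
The inequality can be checked with a few lines of numpy. This is only an illustrative sketch: the array sizes are arbitrary assumptions, and a plain matrix product stands in for the model’s prediction step. The actual timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_outputs = 2000, 50, 10   # arbitrary sizes for illustration
A = rng.normal(size=(n_rows, n_features))      # each row is a feature vector A1, A2, ...
K = rng.normal(size=(n_features, n_outputs))   # coefficient matrix


def one_by_one(A, K):
    # One small product per feature vector, stacked afterwards.
    return np.vstack([row @ K for row in A])


def batched(A, K):
    # A single product over the whole feature matrix.
    return A @ K


# Both approaches give the same predictions.
assert np.allclose(one_by_one(A, K), batched(A, K))

slow = timeit.timeit(lambda: one_by_one(A, K), number=5)
fast = timeit.timeit(lambda: batched(A, K), number=5)
print(f"one by one: {slow:.4f}s, batched: {fast:.4f}s")
```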


In other words, multiplying each feature vector by the coefficient matrix and then aggregating the results is slower than building a feature matrix by concatenating all the feature vectors and multiplying it by the coefficient matrix once to get the prediction vector. This didn’t seem obvious while I was investigating, but applying the fix to revscoring improved the speed by as much as 5 times! I simply changed the code above to the following:

    def _cross_score():
        ...
        feature_values, labels = map(list, zip(*test_set))
        docs = model.score_many(feature_values)
        return list(zip(docs, labels))


and implemented the new score_many method, which aggregates the feature vectors and then calls predict_proba on the whole batch. This was a small fix to revscoring, but it sped up cross_validation and brought the running time down to reasonable limits, so it felt great when it finally got merged!
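
To make the difference concrete, here is a toy sketch of the two scoring strategies. ToyEstimator and ToyModel are hypothetical classes written for this illustration and are not revscoring’s actual implementation; the estimator simply counts how many times predict_proba is called.

```python
class ToyEstimator:
    """Hypothetical stand-in for an sklearn-style estimator; counts batch calls."""

    def __init__(self):
        self.predict_proba_calls = 0

    def predict_proba(self, rows):
        # One call handles any number of rows.
        self.predict_proba_calls += 1
        return [[1.0 - row[0], row[0]] for row in rows]


class ToyModel:
    """Hypothetical wrapper exposing per-row and batched scoring."""

    def __init__(self, estimator):
        self.estimator = estimator

    def score(self, feature_values):
        # Wraps a single observation in a one-row batch.
        return self.estimator.predict_proba([feature_values])[0]

    def score_many(self, feature_values_list):
        # Sends all observations to the estimator in one call.
        return self.estimator.predict_proba(list(feature_values_list))


def cross_score_one_by_one(model, test_set):
    return [(model.score(fv), label) for fv, label in test_set]


def cross_score_batched(model, test_set):
    feature_values, labels = map(list, zip(*test_set))
    docs = model.score_many(feature_values)
    return list(zip(docs, labels))


test_set = [([0.25], "a"), ([0.5], "b"), ([0.75], "c")]

slow_model = ToyModel(ToyEstimator())
fast_model = ToyModel(ToyEstimator())
slow = cross_score_one_by_one(slow_model, test_set)
fast = cross_score_batched(fast_model, test_set)

print(slow == fast)                                   # True
print(slow_model.estimator.predict_proba_calls,       # 3
      fast_model.estimator.predict_proba_calls)       # 1
```

Both versions return the same (score, label) pairs, but the batched one reaches the estimator once instead of once per observation. Note that unpacking zip(*test_set) into two names fails on an empty test set, so a real implementation may want to handle that case separately.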
