I was working on the drafttopic project for Wikipedia, for which I was training and cross-validating models on the drafttopic dataset. It is a dataset with 93,000 observations, each having multiple target labels, so this is a case of multilabel classification. In my first few runs I realized that training and cross-validation weren't finishing even when given a full day. The number of estimators was close to 100 and max_depth was 4; with these parameters it shouldn't have taken a day to train or cross-validate. So I decided to profile the cross-validation, which turned up interesting results.
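For reference, the trace below was collected with Python's cProfile; a minimal driver for producing such a profile looks roughly like this (the cross_validate call here is just a stand-in for whatever entry point you are measuring):

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    # model and train_set come from the surrounding training script;
    # assumed entry point being measured.
    model.cross_validate(train_set)
    profiler.disable()

    # Print the 50 most expensive calls by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(50)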
    Ordered by: cumulative time
    List reduced from 260 to 50 due to restriction <50>

    ncalls  tottime  percall   cumtime   percall  filename:lineno(function)
         1    0.000    0.000 30453.783 30453.783  /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
         1    1.160    1.160 30453.733 30453.733  /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
         1    0.912    0.912 29308.131 29308.131  /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
     11148  171.564    0.015 29307.219     2.629  /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
     33442 1665.965    0.050 29019.662     0.868  /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
     33443   58.356    0.002 28459.225     0.851  /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)
This clearly shows what's going on: the method taking the most time was predict_proba of the underlying sklearn estimator. On further analysis, I found that this method was being called by the score method of revscoring:
    def _cross_score():
        ...
        return [(model.score(feature_values), label)
                for feature_values, label in test_set]
These lines made the whole picture clear: the method was scoring instances one by one instead of scoring them together in a batch. It comes down to a very simple property of matrix multiplication and the underlying numpy optimizations. To put it simply, suppose A1, A2, A3, ... are feature vectors and K is the coefficient matrix each of them is multiplied with to get a prediction. Then, roughly:
    time(A1·K) + time(A2·K) + time(A3·K) + ... > time([A1; A2; A3; ...]·K)
That is, multiplying each feature vector with the coefficient matrix separately and then aggregating the results is slower than first concatenating all feature vectors into a feature matrix and multiplying it with the coefficient matrix once to get the whole prediction vector. This didn't seem like a very obvious thing while investigating, but when I applied the fix, revscoring's cross-validation got as much as 5 times faster!
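The effect is easy to reproduce outside of revscoring. Here is a small timing sketch; the shapes are made up for illustration, not the real drafttopic dimensions:

    import time
    import numpy as np

    K = np.random.rand(100, 10)                            # coefficient matrix
    vectors = [np.random.rand(100) for _ in range(10000)]  # feature vectors

    # One matrix product per feature vector.
    start = time.perf_counter()
    one_by_one = [np.dot(v, K) for v in vectors]
    print("one by one:", time.perf_counter() - start)

    # Stack the vectors into a single matrix and multiply once.
    start = time.perf_counter()
    batched = np.dot(np.vstack(vectors), K)
    print("batched:", time.perf_counter() - start)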
I simply changed the above code to the following:

    def _cross_score():
        ...
        feature_values, labels = map(list, zip(*test_set))
        docs = model.score_many(feature_values)
        return list(zip(docs, labels))
and implemented a new score_many method that aggregates the feature vectors and then calls predict_proba on them as a single batch. This was a small fix to revscoring, but one that brought cross-validation times down to reasonable limits, and it felt great when it finally got merged!
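For the curious, the batched scorer can be sketched roughly as below. This is only an assumed shape, not the actual revscoring method, which does more work to turn probabilities back into per-label score documents:

    import numpy as np

    def score_many(self, feature_values_batch):
        # Stack all feature vectors into one matrix so that
        # predict_proba runs once over the whole batch instead
        # of once per observation.
        X = np.vstack(feature_values_batch)
        probabilities = self.estimator.predict_proba(X)
        # One score document per input row (hypothetical format).
        return [{"probability": list(row)} for row in probabilities]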