# Posts by Tags

## End of GSoC 2015

Published:

Aug 28th, 2015 marked the end of my Google Summer of Code internship at the Wikimedia Foundation. My project targeted the banner expedition of English Wikivoyage. Here’s a small walkthrough of the project, what led to it, and how it turned out.

## Background:

English Wikivoyage and some other Wikivoyage projects had developed a culture of rendering page-wide banners on articles to enhance their overall appeal by portraying an attractive location of the place in the banner. The banners were originally rendered through a MediaWiki template, which however had some issues:

1. Banners were not responsive. Since the banners were 2100px × 300px, downloading such a large image on small screens was a significant overhead.
2. The Table of Contents was displayed linearly inside the banner, and did not support multiple levels.
3. Banners on mobile devices clashed with the page heading.
4. Due to the 7:1 aspect ratio, banners on small screens became too small to leave a visual impact on the reader.

## The project

Therefore, my project involved developing a MediaWiki extension to resolve the above issues and package the feature as a single unit, so that its usage could be extended uniformly across all wikis if needed. I worked under the mentorship of Jon Robson and Nicolas Raoul. We formed a great team and I really enjoyed working with them! We had regular meetings, and I was never blocked on any issue for long. Their ideas and suggestions helped a lot in refining both the code and the aesthetics.

Here’s the final demo of what the extension does:

## Phases of the project (in brief)

### Community Bonding Period:

This is a common period for all GSoC projects in which the intern gets more familiar with their organization and mentors. I was already in touch with my mentors before this period, so it involved taking the first steps towards building the extension:

• Bootstrapping the extension from a boilerplate extension template.
• Setting up basic MediaWiki hooks to insert banners by default on all pages.
• Starting development of a parser function to insert custom page banners on articles.

### Coding phase:

This phase marked the beginning of actual coding.

• The parser function was completed.
• The feature to fetch a banner from Wikidata was implemented.
• Banners were made to display on preview.
• PHP unit tests were added to cover server-side code.
• The existing code was refactored to allow better testing.
• Issues with styling and a flash of unstyled content were resolved.
• The “srcset” attribute was used to enable responsiveness.
• A module to focus specific areas of the banner on small screens was added.
• The extension has been deployed on English Wikivoyage and is in use there.

The above is a brief overview; the complete report can be found on Phabricator.

### English Wikivoyage Community support:

The English Wikivoyage community is great and very open to new ideas. A lot of discussions took place in the Traveller’s Pub, which led to the refinement of the extension. The support they extended in making this a success, from feedback to criticism to pointing out bugs, has been invaluable!

### Challenges and Takeaways:

• Maintaining testable code was a challenge, and I learnt the need for good test cases, as they allow preemptive bug monitoring.
• Ensuring community involvement and making suitable changes to satisfy the beneficiary community’s needs was an important lesson.
• Writing code that conforms to standards while keeping it self-explanatory.
• Weekly meetings with my mentors helped keep everything on track, and things never got too far out of hand.

## A cool new feature!

Due to banner cropping on smaller devices, a new feature was added to focus the banner on specific areas. This can have a huge impact on banner creation, as it is no longer mandatory to create banners of a specific size: just pick a suitable image and set the focus!

### Final Words

The developed extension’s documentation is at Extension:WikidataPageBanner.

It was a great experience to work with an organization that aims to bring free knowledge to everyone. The people are great! For anyone thinking of contributing, I’d say: just go for it!

## Fast Cross Validation

Published:

I was working on the drafttopic project for Wikipedia, for which I was training and cross-validating on the drafttopic dataset. It is a dataset with 93000 observations, each having multiple target labels, so it’s a case of multilabel classification. After my first few runs I realized that training and cross-validation weren’t finishing even given a full day. The number of estimators was close to 100 and max_depth was 4; for these parameters it shouldn’t have taken a day to train or cross-validate. I decided to profile the cross-validation, which turned up interesting results.

   Ordered by: cumulative time
List reduced from 260 to 50 due to restriction <50>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.000    0.000 30453.783 30453.783 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
1    1.160    1.160 30453.733 30453.733 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
1    0.912    0.912 29308.131 29308.131 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
11148  171.564    0.015 29307.219    2.629 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
33442 1665.965    0.050 29019.662    0.868 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
33443   58.356    0.002 28459.225    0.851 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)


This clearly shows what’s going on: the method taking the most time was predict_proba from the underlying sklearn. On further analysis, I found that this method was being called by the score method of revscoring.

def _cross_score():
    ...
    return [(model.score(feature_values), label)
            for feature_values, label in test_set]


The above two lines cleared up the whole picture: the method was scoring instances one by one instead of scoring them together in a batch. This comes down to matrix multiplication and the underlying numpy optimizations. To put it simply, consider feature vectors A1, A2, A3 and a coefficient matrix K with which each is multiplied to get a prediction:

time(A1.K+A2.K+A3.K...) > time((A1+A2+A3...).K)


That is, multiplying each feature vector with the coefficient matrix and then aggregating the results is slower than concatenating all the feature vectors into one feature matrix and multiplying that with the coefficient matrix to get the prediction vector. This didn’t seem obvious while investigating, but when I applied the fix to revscoring it improved the speed by as much as 5 times! I simply changed the above code to the below:

def _cross_score():
    ...
    feature_values, labels = map(list, zip(*test_set))
    docs = model.score_many(feature_values)
    return list(zip(docs, labels))


and implemented a new score_many method that aggregates the feature vectors and then calls predict_proba once. This was a small fix to revscoring, but one that improved cross-validation and brought the times down to reasonable limits, so it felt great when it finally got merged!
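The effect of batching is easy to reproduce with plain NumPy. The sketch below uses arbitrary shapes and a random matrix `K` standing in for the model’s coefficients (these are illustrative values, not revscoring internals):

```python
import time

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((2000, 50))  # 2000 feature vectors of length 50
K = rng.random((50, 10))           # stand-in coefficient matrix

# Scoring one instance at a time, as the old _cross_score effectively did.
start = time.perf_counter()
one_by_one = np.array([f @ K for f in features])
loop_time = time.perf_counter() - start

# Scoring the whole batch with a single matrix multiplication.
start = time.perf_counter()
batched = features @ K
batch_time = time.perf_counter() - start

# Identical predictions; the batched call is much faster.
assert np.allclose(one_by_one, batched)
print(f"one-by-one: {loop_time:.4f}s  batched: {batch_time:.4f}s")
```

The single `features @ K` call lets numpy dispatch one optimized BLAS operation instead of paying Python-loop and dispatch overhead per row.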

## Wikipedia - Sentiment in Wikipedia Drafts and Edits

Published:

My exciting work with the WMF continues, and this is a brief account of it, written as I got some time to breathe between my other commitments. This post is about a sentiment feature I implemented using SentiWordNet on Wikipedia drafts/edits, for classifying an edit as damaging or not. Drafts and edits appear together because this work started with the draftquality research but is now showing promise for editquality.

## Background

Wikipedia has a webservice called ORES that reviews edits made on Wikipedia and scores them based on the damaging, goodfaith, and reverted models at the backend. These models are built using machine-learning techniques on training data acquired by having Wikipedia editors label edits. Each Wikipedia language has its own models based on its wiki and language.

The revscoring library does the heavy lifting when it comes to building the models. For those new to it, it is a very convenient piece of software written to ease the development of machine-learning models around Wikipedia by making fetching of revisions and their scoring super easy through command-line arguments. It also provides a framework for defining new features to extract from edits, and it already has an extensive list of features it extracts from articles.

## Sentiment feature on drafts and edits

A damaging edit or draft on Wikipedia, according to the damaging criteria, is one which harms the content to which it is added, or the encyclopedia in general. My hypothesis was that whether the overall sentiment of an edit or article is positive or negative could play a role in determining its damaging nature.

Edits are reverted, while new drafts are deleted based on the speedy deletion criteria. Hence, starting from feature engineering in the revscoring library, I set out to implement a feature providing the overall sentiment of an edit. My research is recorded in the Phabricator task, but here are a few interesting bits and pieces. (The terminology of spam, vandalism, and attack is explained in a previous post.)

A quick look at the image below shows that the spam class, created by fake editors to promote things, clearly spikes in positive sentiment:

### Where do these scores come from?

These polarity scores are generated using SentiWordNet. In reality, there are tons of resources out there for getting the polarity of sentences or documents, but here I had to adhere to certain constraints:

• The code should be open source.
• The scoring should be fast, typically in milliseconds, otherwise ORES would time out.
• It should be easily portable or usable in Python.

SentiWordNet seemed to be the best fit. It’s just a database of common words and their corresponding polarities per word sense from WordNet.

### Attempt 1

I did the simple thing of aggregating the positive and negative scores of each word after doing its WSD using simple Lesk. To my despair, the scoring was taking seconds per draft: a strict NO.

### Attempt 2

I had to somehow prune the scoring time while retaining the viability of my hypothesis, so I decided to sacrifice accuracy for time. I had been using a library from kevincobain called sentiment_classifier, which does WSD and then uses the sense to query the polarity. Profiling the method revealed:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1125/1    0.076    0.000  171.370  171.370 {built-in method builtins.exec}
1    0.000    0.000  171.370  171.370 emot.py:1(<module>)
1    0.000    0.000  168.748  168.748 emot.py:19(get_polarity)
2    0.000    0.000  168.740   84.370 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:234(polarity_scores)
2    0.006    0.003  168.740   84.370 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:202(classify)
1084    1.248    0.001  168.734    0.156 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:173(disambiguateWordSenses)
731495    0.795    0.000  152.014    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:1547(path_similarity)
731495    2.020    0.000  151.219    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:728(path_similarity)
731495    7.519    0.000  119.911    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:658(shortest_path_distance)
1447733   34.959    0.000  102.647    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:634(_shortest_hypernym_paths)


Clearly, the code was spending most of the time in disambiguateWordSenses and path_similarity which are necessary for WSD.

No WSD: hence I decided to skip WSD completely and go with the sentiment of the most dominant sense, which is the sense of a word at index 0 in nltk. This is a crude approximation, but one that worked! The results somewhat retained the shape of the graph above, and scoring took only milliseconds.

Benchmarking on the complete dataset is ongoing, but this looks like a workable balance between computational capacity and the need for accuracy. (I’m not allowed to touch the deleted dataset because of confidentiality restrictions.)

### Sentiment on Edits

However, I could still apply this feature to the editquality research and see the stats there.

After generating the tuning reports, the Gradient Boosting classifier (the best one) gained almost 2%, while BernoulliNB jumped from 66% to 77%, showing some promise.

## Wikipedia - Article Draft Quality

Published:

I’m quite excited to write about my most recent volunteer work, “Automatic Classification of Article Draft Quality”, which I’m doing in collaboration with the Wikimedia Foundation’s AI team. The team helps develop machine-learning models to counter vandalism on Wikipedia and researches ways to improve editors’ and readers’ experience. Too many interesting things to pass up! For any technical details and updates, refer to the wiki page.

## Background

Wikipedia is a very large collaborative encyclopedia, and a very strict policy is followed for each and every edit to its articles. A team of volunteers works relentlessly to provide oversight and control quality through collaboration. An important aspect of this quality control is page deletion when deemed necessary, and deletion has a strict set of guidelines.

Apart from systematic deletion, Wikipedia also has criteria for speedy deletion, where administrators have broad consensus to bypass deletion discussion at their discretion and immediately delete Wikipedia pages. The page on speedy deletion comprehensively defines the aspects under which these deletions are carried out, but a few are noteworthy:

• G10: Pages that disparage, threaten, intimidate, or harass their subject or some other entity, and serve no other purpose
• G11: Unambiguous advertising or promotion
• A7: No indication of importance (people, animals, organizations, web content, events)

It’s interesting to note that these are the properties most frequently cited when deleting pages under the categories “vandalism”, “spam”, and “attack”.

## The Problem

Since administrators have overriding authority to delete articles under the above-mentioned clauses, it often happens that drafts created by new editors get deleted because of their inexperience with Wikipedia standards, and Wikipedia loses such editors out of sheer demotivation.

## The Solution

The project Automatic Classification of Article Draft Quality addresses precisely this aspect of article deletion by providing a Machine Learning model to suggest draft categories to administrators so that both the work of administrators is reduced and new editors get some time to improve their work.

I’m quite excited to be part of such a project! More updates here, or on the wiki.

## First

Published:

This is the very first post in line for this blog. Hopefully I shall continue to add more to it as snapshots and reminders of my life.

## Viterbi and Numerical Optimizations

Published:

I’m quite excited to write about my recent set of investigations into a seemingly simple mini-project. Recently I wrote some code for machine translation from English to Hindi using HMM Viterbi decoding. First I did word alignments using the GIZA++ tool on a parallel corpus; after aligning the words in both languages, we get probability scores like the ones below. Please note that this is not exactly a post on machine translation, as this is a very naive way to do it, but a discussion on reusing HMM POS-tagger code for a crude MT project and applying basic optimizations to make it fast.

| Src lang index | Dst lang index | Probability |
|---|---|---|
| 0 | 6 | 0.0105564 |
| 0 | 9 | 0.0888391 |
| 0 | 11 | 0.123789 |
| 0 | 12 | 8.75807e-06 |
| 0 | 14 | 0.000144811 |
| 0 | 17 | 3.12898e-05 |

Here the integers represent the index for each word in the dictionary created from the corpus.

The machine translation model is analogous to an HMM POS-tagger, where words in one language can be treated as hidden states transitioning from one to another (as defined by the language model), and each word (state) generates a word in the target language with the emission probability given by the alignments above.

The noisy channel model:

H* = argmax(over H) P(E|H) * P(H)


i.e., the best Hindi sentence for a given English sentence is the one maximizing the product of the probability of the English sentence given the Hindi one and the probability of the Hindi sentence itself, maximized over Hindi sentences.

## Viterbi Decoding

Once this analogy between POS-tagging and machine translation was set up, I simply ported the POS-tagger code to machine translation, but there were surprises in store. I ran the algorithm, and it took ages to translate one sentence. Let’s see what was wrong:

#### Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]


#### Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j=1..N of
            [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
        BACKPTR(i,t) = the index j that gives the Max above


Complexity: O(T x N x N)

POS-Tagger: Hidden states = tags = 20(say), words = 5000, Complexity = 5 x 10^8

Machine Translation: Hidden states = Hindi words = 94125~10^5, English words = 72167~10^5, Complexity ~ 10^10

Clearly, because the number of hidden states in machine translation is on the order of the vocabulary size, the complexity is several orders of magnitude higher.
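The iteration above translates almost directly into Python. Here is a compact NumPy version of the decoder on a toy HMM (a generic sketch, not the actual MT code; all matrix values are made up for illustration):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for an observation sequence.

    obs: observed symbol indices (length T)
    pi:  initial state distribution, shape (N,)
    A:   transition matrix, A[j, i] = P(state i | state j), shape (N, N)
    B:   emission matrix, B[i, k] = P(symbol k | state i), shape (N, M)
    """
    T, N = len(obs), len(pi)
    seqscore = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)
    seqscore[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[j, i] = SEQSCORE(j, t-1) * P(Sj -> Si)
        scores = seqscore[:, t - 1, None] * A
        backptr[:, t] = np.argmax(scores, axis=0)
        seqscore[:, t] = scores[backptr[:, t], np.arange(N)] * B[:, obs[t]]
    # Trace the best path back through the backpointers.
    path = [int(np.argmax(seqscore[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1]

# Toy 2-state HMM: states tend to persist and to emit their own symbol.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print(viterbi([0, 0, 1, 1], np.array([0.5, 0.5]), A, B))
```

The two nested loops over N collapse into the `scores` matrix, but the O(T × N × N) work is still all there, which is exactly what blows up when N is the vocabulary size.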

## Attempt one: Base approach

To start with, I had a transition matrix containing transitions from every word to every other word, which I readily indexed as transition[w_idx1][w_idx2]. This was the naivest approach, and it did not even meet memory constraints, making my laptop hang.

To solve this, I decided to use a scipy sparse matrix, and after some research, given the way my transition matrix was structured, I went with the CSR sparse matrix format.
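For illustration, a CSR matrix stores only the non-zero transitions while still supporting row and element indexing (toy values, not the real transition table):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero dense transition matrix wastes memory; CSR keeps only
# the non-zero entries plus their row/column bookkeeping.
dense = np.array([[0.0, 0.7, 0.0],
                  [0.0, 0.0, 0.2],
                  [0.5, 0.0, 0.0]])
transition = csr_matrix(dense)

print(transition.nnz)        # 3 stored values instead of 9
print(transition[0, 1])      # plain indexing still works
print(transition.getrow(0))  # a 1x3 sparse row
```

For a vocabulary-sized matrix where each word transitions to only a handful of others, this is the difference between fitting in RAM and not.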

## Attempt two

After a discussion with one of my friends, it came to my mind that the above algorithm naively checks transitions from every state to every other state, whereas in a real language one word leads to only a handful of other words. I was unnecessarily iterating over null transitions!

Let’s take a look at the code:

# 't' is the current word position
for idx1, i in enumerate(transition[0]):
    maxScore = seqscore[idx1][t]
    for idx2, j in enumerate(transition[0]):
        score = (seqscore[idx2][t - 1] * transition[idx2][idx1]
                 * emission[wordindex][idx1])
        if score > maxScore:
            maxScore = score


Clearly the transition evaluation needed some love. I googled for ways to get only the non-zero entries of a row in the transition matrix and found scipy’s find, so let’s see:

tr_row = transition.getrow(idx1)
...
rs, indices, vals = find(tr_row)
for idx2, v2 in zip(indices, vals):
    ...


I did end up with some nasty code in the inner loop, but that’s what it was: I had to trade off an easy-to-understand implementation for speed.

As a result, I got an instant improvement in speed! The search method, which earlier took ages for one sentence, now took 5 minutes. However, I was still not satisfied, so I decided to dig further, this time using tech ;)

## Attempt three

I remembered that Python has a profiling library, so I decided to do some actual profiling.

Using

python -m cProfile [-o output_file] [-s sort_order] myscript.py


I could save my stats to a file, which I later retrieved as:

import pstats
p = pstats.Stats('stats')
p.sort_stats('cumulative').print_stats(50)


The result, after ordering by cumulative time:

   Ordered by: cumulative time
List reduced from 2002 to 50 due to restriction <50>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
557/1    0.040    0.000  197.171  197.171 {built-in method builtins.exec}
1    0.005    0.005  197.171  197.171 mt.py:1(<module>)
1    0.000    0.000  196.803  196.803 mt.py:122(main)
1    0.031    0.031  196.803  196.803 mt.py:100(train)
1   17.758   17.758  182.292  182.292 mt.py:51(search)
564750    4.789    0.000  112.658    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/extract.py:14(find)
1129511/564761    7.526    0.000   86.061    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/coo.py:118(__init__)
564757    5.523    0.000   53.782    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/compressed.py:905(tocoo)
1129511   20.949    0.000   52.660    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/coo.py:212(_check)
564757    0.943    0.000   51.593    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/csr.py:356(getrow)
564757    3.779    0.000   50.650    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/csr.py:411(_get_submatrix)
564770/564764    4.156    0.000   40.853    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/compressed.py:24(__init__)



From the above data, it is clear that scipy/sparse/coo.py accounted for the largest number of individual calls. After some digging, I found that although I was reducing the number of transitions by evaluating only the non-zero ones, scipy’s getrow method for fetching individual rows of a sparse matrix is itself quite intensive.
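One possible follow-up (my suggestion, not the fix from the post) is to read rows straight out of the CSR arrays, which skips constructing a new sparse matrix object on every getrow call:

```python
import numpy as np
from scipy.sparse import csr_matrix

def row_nonzeros(m, i):
    """Column indices and values of row i, read directly from CSR storage."""
    # indptr[i]:indptr[i+1] is the slice of row i in the indices/data arrays.
    start, end = m.indptr[i], m.indptr[i + 1]
    return m.indices[start:end], m.data[start:end]

transition = csr_matrix(np.array([[0.0, 0.7, 0.0],
                                  [0.0, 0.0, 0.2],
                                  [0.5, 0.0, 0.0]]))
cols, vals = row_nonzeros(transition, 0)
print(cols.tolist(), vals.tolist())  # [1] [0.7]
```

This gives the same (column, value) pairs that find() produces for a single row, without the per-call COO conversion that dominated the profile.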

## Wrap-up

Whoa, so much for a simple machine translation task! But it was a good exercise that made me familiar with sparse matrices, actual memory- and time-level optimizations at the data-structure level, and profiling!


## Topics suggestion to new drafts on Wikipedia

Published:

How about an AI service that could classify new drafts on Wikipedia into some predefined categories? Wouldn’t it make editors’ lives easier when improving articles? The project “Automatic suggestion of topics to new drafts” strives to address exactly this. I won’t go much into the tech specs of the project, which I’ve explained in detail on my research page; rather, I’ll try to provide the high-level motivation for the project and what it entails.

The problem starts with the current state of Wikipedia: too many people creating new articles and too few to review them. The current review process, which has been in place for a long time, is that once a page is created, new page patrollers review it for offensive material or violations of Wikipedia policy. They also help improve the article by suggesting improvement tags or adding WikiProjects.

Now, WikiProjects on Wikipedia are nothing but dedicated topical namespaces where articles on a similar topic are grouped. For example, WikiProject:Biography contains all articles related to biographies of persons. This dedicated grouping helps subject experts pay attention to articles in their field and improve them.

Therefore, at the core of improving new drafts on Wikipedia as quickly as possible lies the efficiency with which we can tag new drafts with WikiProject topics. The project aims to provide exactly such efficiency by building a machine-learning model that suggests WikiProjects for new articles. For example, someone might be trying to write a biography of a recent celebrity without being able to meet the required standards for the article. If our model can predict early on that the draft belongs to WikiProject:Biography, biography subject experts would notice the article and could provide useful suggestions.

## Wikipedia - Sentiment in Wikipedia Drafts and Edits

Published:

My exciting work with WMF is continuing and this is a brief account of that as I got some time to breathe between my other commitments. This post is about a sentiment feature I implemented using SentiWordnet on Wikipedia drafts/edits for the purpose of classification of an edit into damaging or not. The reason for using drafts and edits together is because this started with draftquality research but is now showing promises for editquality.

## Background

Wikipedia has a webservice called ORES that reviews edits made on Wikipedia and gives a score to them on based on damaging, goodfaith and reverted models at the backend. These models are curated using Machine Learning techniques on trained data acquired by labeling of edits by Wikipedia editors. Each Wikipedia language has its on models based on their wiki and language.

The revscoring library is the one doing the heavy lifting when it comes to building the models. For those new to the it, it is a very convenient piece of software written to ease development of Machine Learning models around Wikipedia by making featching of revisions and their scoring super easy through command line arguments. It also provides a framework to define new features to extract from edtis and already has an extensive list of features that it extracts from articles.

## Sentiment feature on drafts and edits

A damaging edit or draft on Wikipedia, according to the damaging criteria, is one which harms the content to which it is added or the encyclopedia in general. I came up with the hypothesis that the overall sentiment of an edit or article, positive or negative, could play a role in determining the damaging nature of an edit.

Edits are reverted, while new drafts are deleted based on the speedy deletion criteria. Hence, starting from feature engineering on the revscoring library, I set out to implement a feature that provides an overall sentiment of the edit. My research is recorded at the phab task, but here are a few interesting bits and pieces. The terminology of spam, vandalism and attack is explained in a previous post.

A quick look at the image below shows that the spam class, created by fake editors to promote stuff, exhibits a clear spike in positive sentiment:

### Where do these scores come from?

These polarity scores are generated using SentiWordNet. There are tons of resources out there for getting the polarity of sentences or documents, but here I had to adhere to certain constraints:

• The code should be open source.
• The scoring should be fast, typically in milliseconds, otherwise ORES would time out.
• It should be easily portable or usable in Python.

SentiWordNet seemed to be the best fit. It's essentially a database of common words and their corresponding polarities per word sense from WordNet.

### Attempt 1

I did the simple thing of aggregating the positive and negative score of each word after doing its WSD using simplified Lesk. To my despair, the scoring was taking seconds per draft! A strict no.
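To illustrate the idea (this is not the actual sentiment_classifier code), here is a minimal sketch of Attempt 1: a simplified-Lesk disambiguation step, picking the sense whose gloss overlaps most with the context, followed by a polarity lookup. The glosses and polarity scores below are invented for the example:

```python
# Toy sketch of Attempt 1: disambiguate a word with simplified Lesk,
# then read off that sense's polarity. Glosses and scores are invented.

# word -> list of (gloss words, (positive score, negative score))
SENSES = {
    "bank": [
        ({"money", "deposit", "loan"}, (0.125, 0.0)),    # financial sense
        ({"river", "slope", "water"}, (0.0, 0.0)),       # riverbank sense
    ],
}

def simple_lesk(word, context):
    """Return the (pos, neg) polarity of the sense whose gloss has the
    largest overlap with the surrounding context words."""
    best = max(SENSES[word], key=lambda sense: len(sense[0] & context))
    return best[1]

polarity = simple_lesk("bank", {"deposit", "money", "account"})
```

The overlap computation is cheap here, but the real WSD in the library walks WordNet similarity paths for every word pair, which is where the seconds-per-draft cost came from.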

### Attempt 2

I had to somehow prune the scoring time while retaining the viability of my hypothesis, so I decided to sacrifice accuracy for time. I was earlier using a library from kevincobain called sentiment_classifier that does WSD and then uses the sense to query the polarity. Profiling the method revealed:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1125/1    0.076    0.000  171.370  171.370 {built-in method builtins.exec}
1    0.000    0.000  171.370  171.370 emot.py:1(<module>)
1    0.000    0.000  168.748  168.748 emot.py:19(get_polarity)
2    0.000    0.000  168.740   84.370 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:234(polarity_scores)
2    0.006    0.003  168.740   84.370 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:202(classify)
1084    1.248    0.001  168.734    0.156 /home/sumit/venv/lib/python3.6/site-packages/sentiment_classifier-0.7-py3.6.egg/senti_classifier/senti_classifier.py:173(disambiguateWordSenses)
731495    0.795    0.000  152.014    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:1547(path_similarity)
731495    2.020    0.000  151.219    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:728(path_similarity)
731495    7.519    0.000  119.911    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:658(shortest_path_distance)
1447733   34.959    0.000  102.647    0.000 /home/sumit/venv/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py:634(_shortest_hypernym_paths)


Clearly, the code was spending most of the time in disambiguateWordSenses and path_similarity which are necessary for WSD.

No WSD: I decided to completely skip WSD and go ahead with the sentiment of the most dominant sense, which is the sense of a word with index 0 in nltk. This is a crude approximation, but it worked! The results roughly retained the shape of the graph above, and the scoring took only milliseconds.
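A minimal sketch of this first-sense shortcut, with an invented mini-lexicon standing in for SentiWordNet:

```python
# Toy sketch of Attempt 2: skip WSD entirely, take each word's most
# dominant sense (index 0) and sum polarities over the text.
# The lexicon here is invented; the real scores come from SentiWordNet.

# word -> list of (positive, negative) scores, dominant sense first
LEXICON = {
    "great": [(0.75, 0.0), (0.25, 0.0)],
    "scam":  [(0.0, 0.625)],
}

def text_polarity(tokens):
    """Aggregate (positive, negative) scores over a token list."""
    pos = neg = 0.0
    for tok in tokens:
        senses = LEXICON.get(tok)
        if senses:                 # unknown words contribute nothing
            p, n = senses[0]       # dominant sense only, no WSD
            pos += p
            neg += n
    return pos, neg

scores = text_polarity(["great", "offer", "scam"])
```

Dropping the disambiguation step turns each word into a single dictionary lookup, which is why the scoring falls from seconds to milliseconds.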

Benchmarking on the complete dataset is ongoing, but this already shows a reasonable balance between computational cost and the need for accuracy. I'm not allowed to touch the deleted dataset because of confidentiality restrictions.

### Sentiment on Edits

However, I could still apply this feature to the editquality research and see the stats there.

After generating the tuning reports, the Gradient Boosting classifier (the best one) gained almost 2%, while BernoulliNB jumped from 66% to 77%, showing some promise.

## Viterbi and Numerical Optimizations

Published:

I’m quite excited to write about my recent set of investigations on a seemingly simple mini-project. Recently I wrote a small program for Machine Translation from English to Hindi using HMM Viterbi decoding. First I did word alignments using the GIZA++ tool on a parallel corpus; after aligning the words in both languages, we get probability scores like the ones below. Please note that this is not exactly a post on Machine Translation, as this is a very naive way to do that, but a discussion on reusing HMM POS-Tagger code for a crude MT project and applying basic optimizations to make it fast.

| Src lang index | Dst lang index | Probability |
| --- | --- | --- |
| 0 | 6 | 0.0105564 |
| 0 | 9 | 0.0888391 |
| 0 | 11 | 0.123789 |
| 0 | 12 | 8.75807e-06 |
| 0 | 14 | 0.000144811 |
| 0 | 17 | 3.12898e-05 |

Here the integers represent the index for each word in the dictionary created from the corpus.

The Machine Translation model is analogous to an HMM POS-Tagger, where words in one language can be treated as hidden states transitioning from one to another (as defined by the language model) and each word (state) generating a word in the target language with an emission probability given by the alignments above.

The noisy channel model:

H* = argmax over H of [ P(E|H) * P(H) ]


i.e., the best Hindi sentence given an English sentence is the one maximizing the product of the probability of the English sentence given the Hindi one and the probability of the Hindi sentence itself, maximized over Hindi sentences.
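With invented toy probabilities, the decision rule is just an argmax over candidate hidden sentences:

```python
# Toy illustration of the noisy-channel decision rule: pick the
# hidden-language sentence H maximizing P(E|H) * P(H).
# All candidates and probabilities below are invented for the example.

candidates = {          # H -> (P(E|H), P(H))
    "H1": (0.20, 0.10),
    "H2": (0.05, 0.60),
    "H3": (0.30, 0.02),
}

# argmax over the product of channel and language model probabilities
best = max(candidates, key=lambda h: candidates[h][0] * candidates[h][1])
```

Note how H2 wins despite a weak channel score, because the language model strongly prefers it; this is exactly the tradeoff the noisy channel formulation encodes.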

## Viterbi Decoding

Once this analogy between POS-Tagging and Machine Translation is set up, I simply ported the POS-Tagger code for Machine Translation, but there were surprises to be had. I ran the algorithm and it would take ages to translate one sentence. Let's see what was wrong:

#### Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]


#### Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j=1..N [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
        BACKPTR(i,t) = the index j that gives the Max above


Complexity: O(T x N x N)

POS-Tagger: hidden states = tags = 20 (say), words = 5000, Complexity = 5 x 10^8

Machine Translation: hidden states = Hindi words = 94125 ~ 10^5, English words = 72167 ~ 10^5, Complexity ~ 10^10

Clearly, because the number of hidden states in Machine Translation is on the order of the number of words, our complexity is that many orders of magnitude higher.
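For reference, the pseudocode above translates into plain Python roughly as follows. This is a dense-table sketch with toy-sized inputs, not the actual project code:

```python
def viterbi(transition, emission, obs):
    """Decode the most likely hidden-state sequence.
    transition[j][i] = P(S_j -> S_i); emission[i][w] = P(state i emits
    word w); obs = observed word indices. A direct translation of the
    O(T * N * N) pseudocode, with the first state fixed to S1."""
    N, T = len(transition), len(obs)
    seqscore = [[0.0] * T for _ in range(N)]
    backptr = [[0] * T for _ in range(N)]
    seqscore[0][0] = 1.0                     # first state is S1
    for t in range(1, T):
        for i in range(N):
            best_j = max(range(N),
                         key=lambda j: seqscore[j][t - 1] * transition[j][i])
            seqscore[i][t] = (seqscore[best_j][t - 1]
                              * transition[best_j][i]
                              * emission[i][obs[t]])
            backptr[i][t] = best_j
    # recover the best path by following back-pointers from the end
    state = max(range(N), key=lambda i: seqscore[i][T - 1])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = backptr[state][t]
        path.append(state)
    return path[::-1]
```

With these naive dense tables, every sentence pays the full T × N × N cost, which is exactly what blows up when N is the vocabulary size rather than a small tag set.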

## Attempt one: Base approach

To start with, I had a transition matrix that contained transitions from every word to every other word, which I readily indexed using transition[w_idx1][w_idx2]. This was the naivest approach, and it did not even meet memory constraints, causing my laptop to hang.

To solve this, I decided to use a scipy sparse matrix. After some research, given the way my transition matrix was structured, I decided to go ahead with the CSR sparse matrix format.
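As a small illustration of the format (a toy 4-word vocabulary here; the real matrix was on the order of 10^5 × 10^5):

```python
# Building a word-to-word transition table as a CSR sparse matrix
# instead of a dense N x N array. Indices and probabilities are toy
# values for illustration.
from scipy.sparse import csr_matrix

rows = [0, 0, 1, 2]          # source word indices
cols = [1, 2, 3, 3]          # destination word indices
vals = [0.6, 0.4, 1.0, 1.0]  # transition probabilities

transition = csr_matrix((vals, (rows, cols)), shape=(4, 4))

# indexing works like a dense matrix, but zeros are never stored
p = transition[0, 2]   # 0.4
```

Only the four nonzero entries occupy memory, which is what makes a vocabulary-sized transition table feasible at all.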

## Attempt two

After a discussion with one of my friends, it came to my mind that the above algorithm naively checks transitions from every state to every other state, whereas in a real language one word leads to only a handful of other words. I was unnecessarily iterating over null transitions!

Let's take a look at the code:

# 't' is the current word position
for idx1, i in enumerate(transition[0]):
    maxScore = seqscore[0][t]
    for idx2, j in enumerate(transition[0]):
        score = seqscore[j][t-1] * transition[idx2][idx1] * emission[wordindex][idx1]


Clearly the transition evaluation needed some love. I googled for ways to get only the nonzero entries of a row in a sparse matrix and found scipy.sparse.find, so let's see:

tr_row = transition.getrow(idx1)
.
.
.
rs, indices, vals = find(tr_row)
for idx2, v2 in zip(indices, vals):


I did end up with some nasty code in the inner loop, but that's how it was: I had to trade an easy-to-understand implementation for speed.
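For clarity, here is what scipy.sparse.find returns on a single row, as a small standalone example:

```python
from scipy.sparse import csr_matrix, find

# A single CSR row with two nonzero transition probabilities.
row = csr_matrix([[0.0, 0.7, 0.0, 0.3]])

# find() returns parallel arrays of row indices, column indices and
# values for the nonzero entries only, so the inner loop touches just
# the real transitions instead of the whole vocabulary.
rs, indices, vals = find(row)
```

Zipping `indices` with `vals` then gives exactly the `(idx2, v2)` pairs iterated over in the snippet above.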

As a result, I got an instant improvement in speed! The search method, which earlier took ages for one sentence, now took 5 minutes. However, I was still not satisfied, so I decided to dig further, this time using proper tooling ;)

## Attempt three

I remembered that Python has a profiling library, so I decided to do some actual profiling.

Using

python -m cProfile [-o output_file] [-s sort_order] myscript.py


I could save my stats to a file, which I later retrieved as:

import pstats
p = pstats.Stats('stats')
p.sort_stats('cumulative').print_stats(50)


The result, after ordering by cumulative time:

   Ordered by: cumulative time
List reduced from 2002 to 50 due to restriction <50>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
557/1    0.040    0.000  197.171  197.171 {built-in method builtins.exec}
1    0.005    0.005  197.171  197.171 mt.py:1(<module>)
1    0.000    0.000  196.803  196.803 mt.py:122(main)
1    0.031    0.031  196.803  196.803 mt.py:100(train)
1   17.758   17.758  182.292  182.292 mt.py:51(search)
564750    4.789    0.000  112.658    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/extract.py:14(find)
1129511/564761    7.526    0.000   86.061    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/coo.py:118(__init__)
564757    5.523    0.000   53.782    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/compressed.py:905(tocoo)
1129511   20.949    0.000   52.660    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/coo.py:212(_check)
564757    0.943    0.000   51.593    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/csr.py:356(getrow)
564757    3.779    0.000   50.650    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/csr.py:411(_get_submatrix)
564770/564764    4.156    0.000   40.853    0.000 /usr/lib/python3.6/site-packages/scipy/sparse/compressed.py:24(__init__)



From the above data, it is clear that 'scipy/sparse/coo.py' was accounting for the largest number of individual calls. After some digging, I found that although I was reducing the number of transitions by only evaluating non-zero transitions, the getrow method from scipy for fetching individual rows of a sparse matrix is in itself quite intensive.

## Wrap-up

Whoa, so much for a simple Machine Translation task! But it was a good exercise that made me familiar with sparse matrices, actual memory and time optimizations at the data structure level, and profiling!

## Wikipedia - Article Draft Quality

Published:

I’m quite excited to write about my most recent volunteer work, “Automatic Classification of Article Draft Quality”, which I’m doing in collaboration with the Wikimedia Foundation’s AI team, which develops Machine Learning models to counter vandalism on Wikipedia and researches ways to improve editors’ and readers’ experience. Well, too many interesting things to pass up! For any technical details and updates, refer to the wiki page.

## Background

Wikipedia is a very large collaborative encyclopedia, and a very strict policy is followed for each and every edit of its articles. As a result, a team of volunteers works relentlessly on oversight, controlling quality through collaboration. An important aspect of this quality control is page deletion when deemed necessary, and deletion has a strict set of guidelines.

Apart from systematic deletion, Wikipedia also has criteria for speedy deletion, where administrators have broad consensus to bypass deletion discussion at their discretion and immediately delete Wikipedia pages. The page on speedy deletion comprehensively defines the circumstances under which these deletions may be carried out, but a few are noteworthy:

• G10: Pages that disparage, threaten, intimidate, or harass their subject or some other entity, and serve no other purpose
• G11: Unambiguous advertising or promotion
• A7: No indication of importance (people, animals, organizations, web content, events)

It's interesting to note that these are the properties most frequently cited when deleting pages under the categories “vandalism”, “spam”, and “attack”.

## The Problem

Since administrators have overriding authority to delete articles under the above-mentioned clauses, it often happens that drafts created by new editors get deleted because of their inexperience with Wikipedia standards, and Wikipedia loses such editors out of sheer demotivation.

## The Solution

The project Automatic Classification of Article Draft Quality addresses precisely this aspect of article deletion by providing a Machine Learning model to suggest draft categories to administrators, so that the workload of administrators is reduced and new editors get some time to improve their work.

I’m quite excited to be part of such a project! More updates here, or on the wiki.

## UG-Research-Project

Published:

Many times I have had to explain my seemingly “complicated” Undergraduate research project, or so it may seem because of the relative novelty of the topic. So I am finally taking the time to pen this down once and for all, as a reference for myself or anyone else in the future, with the aim of comprehensively describing my topic and the research conducted as part of my UG Thesis project at IIT Patna, 2016-17. For a concise overview of my project, head over to the poster here.

## Background

The title of my UG Research Project is “Entity Relation Ranking from Unstructured Text”. Let's first see what entity relations are.

### Entity Relations

These are the fundamental building blocks of structured knowledge bases. Any knowledge base has data in the form of entities, which may be representations of any real-world object or concept. These are linked to each other by subject-predicate-object triples, where the predicate is some property linking the subject and the object, each of which is an entity in the knowledge base. For example, cat is an entity and animal is an entity. “Cat is an animal” is a relation which links cat and animal through the is relation.

Now, my project was about ranking these “entity relations”, which means ranking among several subject-predicate-object triples that share a common subject. It's best illustrated through a snapshot of the dataset I used.

Here, each relation is a Person - Profession relation, to which my project is restricted. Clearly, each person can have several different professions, and the goal is to rank these relations based on the relevance of the profession to the person it is associated with.

E.g. William Shakespeare makes more sense with Playwright than with Lyricist.

The quantitative definition of “relevance” is taken to be the amount of information about the profession present in Wikipedia related to that person.

## Dataset

The dataset used was from the WSDM Cup 2017 competition. I only used the profession entities for my project. Each person-profession pair had a label between 0-7, with 7 being most relevant, and similar labels had to be predicted for test samples based on the relevancy information.

## Approach

During the course of my project I undertook two approaches, one feature based, and another deep learning based which was a novel proposal from my side.

### Feature Based Method

#### Query Expansion:

This was done to get more context out of individual profession words. E.g. for ‘actor’, words like ‘acting’, ‘acted’, ‘starred’, ‘cast’, ‘played’ were extracted. This was done for each of the 200 professions from the dataset using a Logistic Regression classifier where the positive samples were Wikipedia articles of persons who only worked in that profession and negative samples were Wikipedia articles of persons who never worked in that profession. Then after training, the above words would come up as top features.
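The mechanics of this step can be sketched with a tiny hand-rolled logistic regression. The four-word vocabulary and toy documents below are invented for the example; the real setup trained on full Wikipedia articles per profession:

```python
import math

# Toy sketch of query expansion: train an "actor vs. not-actor"
# logistic regression on bag-of-words counts, then read the
# top-weighted words off as the expanded query.
VOCAB = ["acting", "starred", "election", "party"]
docs = [([2, 1, 0, 0], 1),   # biographies of actors (positive samples)
        ([1, 2, 0, 0], 1),
        ([0, 0, 2, 1], 0),   # biographies of non-actors (negative)
        ([0, 0, 1, 2], 0)]

w = [0.0] * len(VOCAB)
for _ in range(200):                       # plain gradient descent
    for x, y in docs:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        w = [wi + 0.1 * (y - p) * xi for wi, xi in zip(w, x)]

# words sorted by learned weight: the top ones form the expanded query
expansion = sorted(zip(VOCAB, w), key=lambda t: -t[1])[:2]
```

Words that co-occur with the positive class get large positive weights, so reading off the top features yields the 'acting', 'starred', ... style expansions described above.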

#### Features:

Using the above expanded query for each profession, a set of features was formulated on the Wikipedia article of the person for the person-profession pair.

#### Ranking:

These sets of features were then fed to an SVM or Random Forest classifier, and ranking was done by following a hierarchical classification strategy, i.e., first classifying at the top level as 0 or 1, then at the next level as 0 or 1, and then at the next level as 0 or 1. This way we get a tree with eight leaves corresponding to the 8 different labels between 0-7. The results are in the poster.
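The three-level scheme amounts to composing three binary decisions into a 3-bit label. In this sketch, stub threshold functions stand in for the trained SVM/Random Forest at each level:

```python
def hierarchical_label(features, classifiers):
    """Walk down the binary tree: each classifier decides one bit of
    the final label, most significant level first. Returns 0..7."""
    label = 0
    for clf in classifiers:
        label = (label << 1) | (1 if clf(features) else 0)
    return label

# Stub classifiers standing in for the trained models at each level;
# the single "score" feature is a placeholder for the real feature set.
clfs = [lambda f: f["score"] >= 4,
        lambda f: f["score"] % 4 >= 2,
        lambda f: f["score"] % 2 >= 1]

label = hierarchical_label({"score": 5}, clfs)   # -> 5
```

Because each level halves the remaining label range, three binary classifiers are enough to distinguish all eight relevance labels.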

### Deep Learning Based Method

For this approach, the above feature selection strategy was replaced by a CNN, which classifies a profession as relevant or non-relevant with respect to a person based on a set of features automatically identified by the network. A single classifier was used across all professions, unlike in the above case, and the sample selection strategy for each profession was similar to the above.

CNN Statistics:

| Parameter | Value |
| --- | --- |
| Samples | 22638 |
| Hidden Layers | 2 |
| Hidden Neurons | 512, 64 |
| Embedding Dimension | 300 |
| Precision | 82% |

## Future Work

A lot of work could be done in the domain of “ranking structured knowledge base entities from unstructured text”. I've shown the viability of the deep learning method for the problem, but not the final rankings produced by the CNN, as due to time constraints I had to stop at the stage of predicting a single profession of a person as relevant or non-relevant. This could be taken up as an interesting research problem.

## End of GSoC 2015

Published:

Aug 28th, 2015 marked the end of my Google Summer of Code internship at the Wikimedia Foundation. My project targeted the banner expedition of English Wikivoyage. Here’s a small walkthrough of my project, what led to it, and how it turned out.

## Background:

English Wikivoyage and some other Wikivoyage projects had developed a culture of rendering page-wide banners on articles to enhance the overall appeal of the pages by portraying some attractive location of the place in the banner. The banners were originally rendered through a MediaWiki template, which however had some issues:

1. Banners were not responsive. Since the banners were 2100px × 300px, downloading a large banner on small screens became a significant overhead.
2. The Table of Contents was displayed in a linear fashion inside the banner, which did not support multiple levels.
3. Banners on mobile devices clashed with the page heading.
4. Due to the 7:1 aspect ratio, banners on small screens became too small to leave a visual impact on the reader.

## The project

Therefore, my project involved developing a MediaWiki extension to resolve the above issues and pack the feature into a single unit so that its usage could be extended uniformly across all wikis, if needed. I worked under the mentorship of Jon Robson and Nicolas Raoul. We formed a great team and I really enjoyed working with them! We had regular meetings and I was never blocked on any issue for long. Their ideas and suggestions helped a lot in refining both the code and the aesthetics.

Here’s the final demo of what the extension does:

## Phases of the project (in brief)

### Community Bonding Period:

This is a common period for all GSoC projects, in which the intern is expected to get more familiar with their organization and mentors. I was already in touch with my mentors before this period, so for me it involved taking the first steps towards building the extension. It involved:

• Bootstrapping the extension from a boilerplate extension template.
• Setting up basic Mediawiki Hooks to insert banners by default on all pages.
• Starting the development of a parser function to insert custom page banners on articles.

### Coding phase:

This phase marked the beginning of actual coding.

• The parser function was completed.
• The feature to fetch a banner from Wikidata was implemented.
• The banners were made to display on preview.
• PHP unit tests were added to cover server-side code.
• The existing code was refactored to allow better testing.
• Issues of styling and flash of unstyled content were resolved.
• The srcset attribute was used to achieve responsiveness.
• A module to focus specific areas of the banner on small screens was added.
• The extension has been deployed on English Wikivoyage and is in use there.

The above is a brief overview; the complete report can be found on Phabricator.

### English Wikivoyage Community support:

The English Wikivoyage community is great and very open to new ideas. A lot of discussions took place at the Traveller’s Pub, which led to the refinement of the extension. Their support in making this a success, from feedback to criticism to pointing out bugs, has been invaluable!

### Challenges and Takeaways:

• Maintaining testable code was a challenge from which I learnt the need for good test cases, as they allow preemptive bug monitoring.
• Ensuring community involvement and making suitable changes to satisfy the needs of the beneficiary community was an important aspect that I learnt.
• Writing code that conforms to standards while keeping it self-explanatory.
• Weekly meetings with my mentors helped me keep everything on track, and things never got too far out of hand.

## A cool new feature!

Due to banner cropping on smaller devices, a new feature has been added that focuses the banner on specific areas. This can have a huge impact on banner creation, as it is no longer mandatory to create banners of a specific size. Just pick a suitable image and focus it!

### Final Words

The developed extension’s documentation is at Extension:WikidataPageBanner.

It was a great experience to work with such an organization which aims to bring free knowledge to everyone. The people are great! For anyone thinking of contributing to it, I’d say, Just Go for it!
