I’m quite excited to write about my most recent volunteer work, “Automatic Classification of Article Draft Quality”, which I’m doing in collaboration with the Wikimedia Foundation’s AI team. The team develops Machine Learning models to counter vandalism on Wikipedia and researches ways to improve the editor and reader experience. Too many interesting things to turn around! For technical details and updates, refer to the wiki page.
Background
Wikipedia is a very large collaborative encyclopedia, and a very strict policy is followed for each and every edit to its articles. A team of volunteers works relentlessly to oversee and control quality through collaboration. An important aspect of this quality control is page deletion when deemed necessary, and deletion has a strict set of guidelines.
Apart from systematic deletion, Wikipedia also has criteria for speedy deletion, where administrators have broad consensus to bypass deletion discussion at their discretion and immediately delete Wikipedia pages. The page on speedy deletion comprehensively defines the circumstances under which these deletions are to be carried out, but a few are noteworthy:
G10: Pages that disparage, threaten, intimidate, or harass their subject or some other entity, and serve no other purpose
G11: Unambiguous advertising or promotion
A7: No indication of importance (people, animals, organizations, web content, events)
It’s interesting to note that these are the criteria most frequently cited when deleting pages under the categories “vandalism”, “spam”, and “attack”.
The Problem
Since administrators have overriding authority to delete articles under the above-mentioned clauses, it often happens that drafts created by new editors get deleted because of their inexperience with Wikipedia standards, and Wikipedia then loses such editors out of sheer demotivation.
The Solution
The project Automatic Classification of Article Draft Quality addresses precisely this aspect of article deletion by providing a Machine Learning model that suggests draft categories to administrators, so that the administrators’ workload is reduced and new editors get some time to improve their work.
I’m quite excited to be part of such a project! More updates here, or on the wiki.
I was working on the drafttopic project for Wikipedia, for which I was training and cross-validating on the drafttopic dataset. It is a dataset with 93,000 observations, each having multiple target labels, so it is a case of multilabel classification. In my first few runs I realized that training and cross-validation wasn’t finishing even when given a full day. The number of estimators was close to 100 and max_depth was 4; for these parameters it shouldn’t have taken a day to train or cross-validate. I decided to profile the cross-validation, which turned up interesting results.
Ordered by: cumulative time
List reduced from 260 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 30453.783 30453.783 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:209(cross_validate)
1 1.160 1.160 30453.733 30453.733 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:242(_cross_score)
1 0.912 0.912 29308.131 29308.131 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/model.py:249(<listcomp>)
11148 171.564 0.015 29307.219 2.629 /home/codezee/ai/venv/lib/python3.4/site-packages/revscoring-2.0.11-py3.4.egg/revscoring/scoring/models/sklearn.py:159(score)
33442 1665.965 0.050 29019.662 0.868 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/ensemble/forest.py:514(predict_proba)
33443 58.356 0.002 28459.225 0.851 /home/codezee/ai/venv/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py:759(__call__)
It clearly shows what’s going on. The method taking the most time was predict_proba of the underlying sklearn estimator. Upon further analysis, I found that this method was being called by the score method of revscoring.
def _cross_score():
    ...
    return [(model.score(feature_values), label)    # one score() call per test instance
            for feature_values, label in test_set]
The above two lines cleared the whole picture: the method was scoring instances one by one instead of scoring them together in a batch. It is a very simple matter of matrix multiplication and the underlying numpy optimizations. To put it simply, consider A1, A2, A3, ... as feature vectors and K as the coefficient matrix each has to be multiplied with to get the prediction.
time(A1.K) + time(A2.K) + time(A3.K) + ... > time([A1; A2; A3; ...].K)
That is, multiplying each feature vector with the coefficient matrix separately and then aggregating the results takes longer than stacking all the feature vectors into a feature matrix and multiplying it with the coefficient matrix once to get the whole prediction vector. This didn’t seem like an obvious thing while investigating, but when I applied the fix, revscoring’s cross-validation sped up by as much as 5 times! I changed the per-instance scoring above to a batched call,
implementing a new score_many method that aggregates the feature vectors and then calls predict_proba once on the whole batch. This was a small fix to revscoring, but one that brought cross-validation times down to reasonable limits, so it felt great when it finally got merged!
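To get a feel for why batching matters, here is a small self-contained benchmark on synthetic data. It is not the revscoring code, just an illustration using the n_estimators and max_depth values mentioned above:

import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the feature matrix and labels
X = np.random.rand(2000, 20)
y = np.random.randint(0, 2, 2000)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=4).fit(X, y)

# One predict_proba call per instance (what the old list comprehension did)
start = time.time()
_ = [clf.predict_proba(row.reshape(1, -1)) for row in X]
print("one by one:", time.time() - start)

# A single predict_proba call on the stacked matrix (the batched approach)
start = time.time()
_ = clf.predict_proba(X)
print("batched:   ", time.time() - start)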
How about an AI service that could classify new drafts on Wikipedia into some predefined categories? Wouldn’t it make editors’ lives easier when improving articles? The project “Automatic suggestion of topics to new drafts” strives to address exactly this. I won’t go much into the tech specs of the project, which I’ve explained in detail on my research page; rather, I’ll try to provide the high-level motivation for the project and what it entails.
The problem starts with the current state of Wikipedia: too many people creating new articles and too few to review them. The long-standing review process is that once a page is created, new page patrollers review it for offensive material or violations of Wikipedia policy. They also help improve the article by suggesting tags for improvement or adding WikiProjects.
Now, WikiProjects on Wikipedia are dedicated topical namespaces where articles on a similar topic are grouped. For example, WikiProject: Biography contains all articles related to biographies of persons. This dedicated grouping helps subject experts pay attention to articles in their field and improve them.
Therefore, at the core of improving new drafts on Wikipedia as quickly as possible lies the efficacy with which we can tag these new drafts with WikiProject topics. The above project aims to provide exactly such efficiency by building a Machine Learning model that suggests WikiProjects for new articles. For example, someone might be trying to write a biography of a recent celebrity but be unable to meet the required standards for the article. If our model can predict early on that the draft belongs to WikiProject:Biography, the article would get the attention of biography subject experts, who could provide useful suggestions.
My exciting work with WMF is continuing, and this is a brief account of it now that I have some time to breathe between my other commitments. This post is about a sentiment feature I implemented using SentiWordNet on Wikipedia drafts/edits, for the purpose of classifying an edit as damaging or not. The reason for treating drafts and edits together is that this work started with the draftquality research but is now showing promise for editquality.
Background
Wikipedia has a web service called ORES that reviews edits made on Wikipedia and scores them based on the damaging, goodfaith and reverted models at the backend. These models are built using Machine Learning techniques on training data acquired by having Wikipedia editors label edits. Each Wikipedia language has its own models based on its wiki and language.
The revscoring library does the heavy lifting when it comes to building the models. For those new to it, it is a very convenient piece of software written to ease the development of Machine Learning models around Wikipedia by making the fetching of revisions and their scoring super easy through command line arguments. It also provides a framework for defining new features to extract from edits and already has an extensive list of features that it extracts from articles.
Sentiment feature on drafts and edits
According to the damaging criteria, a damaging edit or draft on Wikipedia is one which harms the content to which it is added, or the encyclopedia in general. My hypothesis was that whether the overall sentiment of an edit or article is positive or negative could play a role in determining the damaging nature of an edit.
Edits are reverted, while new drafts are deleted based on the speedy deletion criteria. Hence, starting from feature engineering on the revscoring library, I set out to implement a feature providing the overall sentiment of an edit. My research is recorded at the phab task, but here are a few interesting bits and pieces. (The terminology of spam, vandalism and attack is explained in a previous post.)
Advertising is SPAM!
A quick look at the image below shows that the spam class, created by fake editors to promote stuff, has a clear spike in positive sentiment:
Where do these scores come from?
These polarity scores are generated using SentiWordNet. There are tons of resources out there for getting the polarity of sentences or documents, but here I had to adhere to certain constraints:
The code should be open source
The scoring should be fast, typically in milliseconds, otherwise ORES would time out
It should be easily portable or usable in Python.
SentiWordNet seemed to be the best fit. It’s just a database of common words and their corresponding polarities per word sense from WordNet.
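To give a feel for the database, here is a minimal query through nltk’s SentiWordNet interface (the word choice is arbitrary, and the wordnet and sentiwordnet corpora need to be downloaded once):

# Needs: pip install nltk, plus nltk.download('wordnet') and
# nltk.download('sentiwordnet') run once beforehand.
from nltk.corpus import sentiwordnet as swn

# Each WordNet sense of a word carries positive/negative/objective scores
for sense in swn.senti_synsets('cheap'):
    print(sense, sense.pos_score(), sense.neg_score(), sense.obj_score())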
Attempt 1
I started simple: aggregate the positive and negative scores of each word after doing its WSD using simple-lesk. To my despair, the scoring was taking seconds per draft! A strict no.
Attempt 2
I had to somehow prune the scoring time while retaining the viability of my hypothesis, so I decided to sacrifice accuracy for time. I had been using a library from kevincobain called sentiment_classifier, which does WSD and then uses the sense to query the polarity. Profiling the method revealed:
Clearly, the code was spending most of the time in disambiguateWordSenses and path_similarity which are necessary for WSD.
No WSD: I decided to completely skip WSD and go with the sentiment of the most dominant sense, which is the sense with index 0 in nltk. This is a crude approximation, but it worked! The results somewhat retained the shape of the graph above, and scoring took only milliseconds.
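Roughly, the scoring then boils down to something like the sketch below. This is illustrative only (made-up function name, plain tokenization); the real feature lives inside the revscoring feature framework. It needs nltk plus the punkt, wordnet and sentiwordnet data packages.

import nltk
from nltk.corpus import sentiwordnet as swn

def sentiment_scores(text):
    pos, neg = 0.0, 0.0
    for word in nltk.word_tokenize(text.lower()):
        senses = list(swn.senti_synsets(word))
        if not senses:            # word unknown to SentiWordNet
            continue
        dominant = senses[0]      # sense with index 0, no disambiguation
        pos += dominant.pos_score()
        neg += dominant.neg_score()
    return pos, neg

print(sentiment_scores("Buy this brilliant, award winning product now"))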
Benchmarking on the complete dataset is still going on, but this looks like a reasonable balance between computational cost and the need for accuracy. (I’m not allowed to touch the deleted dataset because of confidentiality restrictions.)
Sentiment on Edits
However, I could still apply this feature to the editquality research and see the stats there.
After generating the tuning reports, the Gradient Boosting classifier (the best one) showed a jump of almost 2%, while BernoulliNB jumped from 66% to 77%, showing some promise.
I’m quite excited to write about my recent set of investigations on a seemingly simple mini-project. Recently I wrote a small piece of code for Machine Translation from English to Hindi using HMM Viterbi decoding. First I did word alignments with the GIZA++ tool on a parallel corpus; after aligning the words in both languages, we get probability scores like the ones below. Please note that this is not exactly a post on Machine Translation (this is a very naive way to do it), but rather a discussion of reusing HMM POS-Tagger code for a crude MT project and applying basic optimizations to make it fast.
Src lang index    Dst lang index    Probability
0                 6                 0.0105564
0                 9                 0.0888391
0                 11                0.123789
0                 12                8.75807e-06
0                 14                0.000144811
0                 17                3.12898e-05
Here the integers represent the index for each word in the dictionary created from the corpus.
The Machine Translation model is analogous to an HMM POS-Tagger, where words in one language can be treated as hidden states transitioning from one to another (as defined by the language model), and each word (state) generates a word in the target language with an emission probability given by the alignments above.
The noisy channel model:
H* = argmax over H of P(H|E) = argmax over H of P(E|H) * P(H)
i.e., the best Hindi sentence H* for an English sentence E is the one that maximizes the product of P(E|H), the probability of the English sentence given the Hindi one (from the alignments), and P(H), the probability of the Hindi sentence itself (the language model).
Viterbi Decoding
Once this analogy between POS-Tagging and Machine Translation is set up, I simply ported the POS-Tagger code to Machine Translation, but there were surprises to be had. I ran the algorithm and it would take ages to translate one sentence. Let’s see what was wrong:
Initialization
    SEQSCORE(1,1) = 1.0
    BACKPTR(1,1) = 0
    For i = 2 to N do
        SEQSCORE(i,1) = 0.0
    [expressing the fact that the first state is S1]
Iteration
    For t = 2 to T do
        For i = 1 to N do
            SEQSCORE(i,t) = Max over j=1..N [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
            BACKPTR(i,t)  = the index j that gives the Max above
Complexity: O(T x N x N)
POS-Tagger: Hidden states = tags = 20 (say), words = 5000, Complexity = 5 x 10^8
Machine Translation: Hidden states = Hindi words = 94125 ~ 10^5, English words = 72167 ~ 10^5, Complexity ~ 10^10
Clearly, because the number of hidden states in Machine Translation is equal to the number of words in the vocabulary, the complexity is orders of magnitude higher.
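Before looking at the fixes, here is roughly what a straightforward dense implementation of this recurrence looks like in Python. It is only a sketch with illustrative names (viterbi, obs, init), using ordinary numpy arrays for the transition and emission matrices rather than my actual code:

# transition[j, i] : P(state i | state j),               shape (N, N)
# emission[i, k]   : P(target word k | hidden state i),  shape (N, V)
# obs              : indices of the observed sentence,   length T
import numpy as np

def viterbi(obs, transition, emission, init):
    N, T = transition.shape[0], len(obs)
    seqscore = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)
    seqscore[:, 0] = init * emission[:, obs[0]]           # initialization
    for t in range(1, T):                                 # iteration
        # cand[j, i] = SEQSCORE(j, t-1) * P(Sj -> Si)
        cand = seqscore[:, t - 1, None] * transition
        backptr[:, t] = cand.argmax(axis=0)               # best previous state j
        seqscore[:, t] = cand.max(axis=0) * emission[:, obs[t]]
    # follow the back pointers from the best final state
    path = [int(seqscore[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1]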
Can we do better?
Attempt one: Base approach
To start with, I had a transition matrix that contained transitions from every word to every other word, which I indexed as transition[w_idx1][w_idx2]. This was the naivest approach, and it did not even meet memory constraints, causing my laptop to hang.
To solve this, I decided to use a scipy sparse matrix, and after some research, given the way my transition matrix was structured, I decided to go ahead with a CSR sparse matrix.
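For illustration, here is roughly how such a matrix can be assembled with scipy from the kind of triples shown in the table above; the variable names and shape are mine, not the project’s, and the shape just reuses the two vocabulary sizes mentioned in this post:

from scipy.sparse import csr_matrix

rows  = [0, 0, 0, 0, 0, 0]
cols  = [6, 9, 11, 12, 14, 17]
probs = [0.0105564, 0.0888391, 0.123789,
         8.75807e-06, 0.000144811, 3.12898e-05]

alignment = csr_matrix((probs, (rows, cols)), shape=(72167, 94125))
print(alignment[0, 11])     # -> 0.123789
# The word-to-word transition matrix is built the same way, just with
# (prev word, next word, probability) triples from the language model.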
Attempt two
After a discussion with one of my friends, it occurred to me that the above algorithm naively checks transitions from every state to every other state, whereas in a real language one word leads to only a handful of other words. I was unnecessarily iterating over null transitions!
Let’s take a look at the code:
# 't' is the index of the current word in the sentence
for idx1, i in enumerate(transition[0]):          # every candidate current state
    maxScore = seqscore[0][t]
    for idx2, j in enumerate(transition[0]):      # every candidate previous state
        score = (seqscore[idx2][t-1] * transition[idx2][idx1]
                 * emission[wordindex][idx1])
Clearly the transition evaluation needed some love. I googled for ways to get only the nonzero entries of a row in the transition matrix and found scipy.sparse.find, so let’s see:
from scipy.sparse import find         # returns the nonzero (rows, cols, values) of a matrix

tr_row = transition.getrow(idx1)      # sparse row for the current state
...
rs, indices, vals = find(tr_row)      # only the nonzero entries of that row
for idx2, v2 in zip(indices, vals):   # iterate over real transitions only
I did end up with some nasty code in the inner loop, but that’s what it was; I had to trade off an easy-to-understand implementation for speed.
As a result, I got an instant improvement in speed! The search method, which would earlier take ages for one sentence, now took 5 minutes. However, I was still not satisfied, so I decided to dig further, this time using tech ;)
Attempt three
I remembered that Python has a profiling library, so I decided to do some actual profiling.
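For reference, this is the standard-library way to get such a profile; translate.py and viterbi_decode(sentence) below are just placeholders for however the script is actually run:

# Whole-script profile, sorted by cumulative time:
#     python -m cProfile -s cumtime translate.py
# or, from inside the code, profile just the decoding call:
import cProfile
cProfile.run('viterbi_decode(sentence)', sort='cumtime')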
From the profiling data, it was clear that ‘scipy/sparse/coo.py’ was taking the largest number of individual calls. After some digging, I found that although I was reducing the number of transitions by only evaluating non-zero ones, the getrow method from scipy for fetching individual rows of a sparse matrix is itself quite expensive.
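One way to sidestep getrow entirely, shown here purely as an illustration rather than as what I finally shipped, is to read the CSR internals directly:

# CSR already stores each row's nonzero column indices and values as
# contiguous slices of .indices and .data, so a row can be walked
# without building a new single-row matrix.
import numpy as np
from scipy.sparse import csr_matrix

transition = csr_matrix(np.array([[0.0, 0.7, 0.3],
                                  [0.0, 0.0, 1.0],
                                  [0.5, 0.5, 0.0]]))
idx1 = 0
start, end = transition.indptr[idx1], transition.indptr[idx1 + 1]
for idx2, v2 in zip(transition.indices[start:end], transition.data[start:end]):
    print(idx2, v2)    # same (index, value) pairs that find(getrow(idx1)) gives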
Wrap-up
Whoa, so much for a simple Machine Translation task! But it was a good exercise that made me familiar with sparse matrices, actual optimization of memory and time at the data structure level, and profiling!