I have often had to explain my seemingly “complicated” undergraduate research project, or at least one that seems so because the topic is relatively new. So I am finally taking the time to write it down once and for all, as a reference for myself and for anyone else in the future, with the aim of comprehensively describing the topic and the research I conducted for my UG thesis project at IIT Patna, 2016-17. For a concise overview of the project, head over to the poster here.
The title of my UG research project is “Entity Relation ranking from unstructured text”. Let's first see what entity relations are.
Entity relations are the fundamental building blocks of structured knowledge bases. A knowledge base stores its data as entities, each of which represents a real-world object or a concept. Entities are linked to each other by subject-predicate-object triples, where the subject and the object are entities and the predicate is a property linking them. For example, cat is an entity and animal is an entity; “cat is an animal” is a relation that links cat and animal through the “is” predicate.
My project was about ranking these “entity relations”, that is, ranking several subject-predicate-object triples that share a common subject. It's best illustrated through a snapshot of the dataset I’m using.
Here, each relation is a Person - Profession relation, the only relation type my project covers. Clearly, each person can have several different professions, and the goal is to rank these relations by how relevant each profession is to the person it's associated with.
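To make the triple idea concrete, here is a minimal Python sketch. The `Triple` type and the `relations_of` helper are names I made up for illustration, and only the cat/animal fact comes from the text; the other two facts are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object fact."""
    subject: str
    predicate: str
    obj: str


knowledge_base = [
    Triple("cat", "is", "animal"),   # the example from the text
    Triple("dog", "is", "animal"),   # hypothetical extra fact
    Triple("cat", "eats", "fish"),   # hypothetical extra fact
]


def relations_of(subject, kb):
    """Return every (predicate, object) pair recorded for `subject`."""
    return [(t.predicate, t.obj) for t in kb if t.subject == subject]


print(relations_of("cat", knowledge_base))
```

Collecting all triples that share a subject, as `relations_of` does, is the grouping that a ranking over relations operates on.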
E.g. William Shakespeare makes more sense with Playwright than with Lyricist.
The quantitative definition of “relevance” is taken to be the amount of information about the profession present in Wikipedia related to that person.
The dataset came from the WSDM Cup 2017 competition. I used only the profession entities for my project. Each person-profession pair had a label from 0 to 7, with 7 being the most relevant, and similar labels had to be predicted for the test samples based on the relevance information.
During the course of my project I took two approaches: one feature based, and one deep learning based, the latter being a novel proposal of mine.
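As a rough sketch of what the ranking looks like once every pair has a label, the snippet below sorts one person's professions by score. The scores here are made up for illustration and are not taken from the dataset; "Poet" and the `rank_professions` helper are also my own additions.

```python
# Hypothetical relevance labels on the 0-7 scale (7 = most relevant).
# These numbers are invented for illustration, not dataset values.
labels = {
    ("William Shakespeare", "Playwright"): 7,
    ("William Shakespeare", "Poet"): 6,
    ("William Shakespeare", "Lyricist"): 1,
}


def rank_professions(person, labels):
    """Return the person's professions sorted from most to least relevant."""
    scored = [(prof, score) for (p, prof), score in labels.items() if p == person]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [prof for prof, _ in scored]


ranking = rank_professions("William Shakespeare", labels)
print(ranking)
```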
Feature Based Method
The first step was to expand each profession word into a query of related words, to get more context out of the individual profession word. E.g. for ‘actor’, words like ‘acting’, ‘acted’, ‘starred’, ‘cast’, ‘played’ were extracted. This was done for each of the 200 professions in the dataset using a Logistic Regression classifier, where the positive samples were Wikipedia articles of persons who worked only in that profession and the negative samples were Wikipedia articles of persons who never worked in that profession. After training, words like the ones above come up as the top features.
Using the above expanded query for each profession, a set of features was computed over the person's Wikipedia article for each person-profession pair.
These sets of features were then fed to an SVM or Random Forest classifier, and ranking was done by following a hierarchical classification strategy, i.e., first classifying at the top level as 0 or 1, then at the next level as 0 or 1, and then once more at the level below that. This way we get a tree with eight leaves corresponding to the 8 different labels from 0 to 7. The results are in the poster.
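The idea of reading an expanded query off the weights of a logistic regression can be sketched as follows. This is not my original code: the toy "articles" are invented, the tokenization is a plain whitespace split, and the classifier is a small hand-written gradient descent rather than a library implementation, so treat every name and hyperparameter here as an assumption.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles (hypothetical text, not real data).
positive_docs = [  # people who worked only as actors
    "she acted in films and starred in a drama",
    "he starred in the cast of a film and played the lead",
    "she played a role and acted on stage",
]
negative_docs = [  # people who never worked as actors
    "he wrote a novel and published essays",
    "she studied physics and published a paper",
    "he played football and scored in the final",
]

docs = positive_docs + negative_docs
y = [1] * len(positive_docs) + [0] * len(negative_docs)

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}


def vectorize(doc):
    """Bag-of-words count vector over the shared vocabulary."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(doc.split()).items():
        vec[index[word]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


X = [vectorize(d) for d in docs]

# Batch gradient descent on the L2-regularized logistic loss.
weights = [0.0] * len(vocab)
bias = 0.0
lr, l2, epochs = 0.1, 0.01, 500
for _ in range(epochs):
    grad_w = [l2 * w for w in weights]
    grad_b = 0.0
    for x, target in zip(X, y):
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - target
        for i, xi in enumerate(x):
            grad_w[i] += err * xi / len(X)
        grad_b += err / len(X)
    weights = [w - lr * g for w, g in zip(weights, grad_w)]
    bias -= lr * grad_b

# Words with the largest positive weights form the expanded query.
top = sorted(vocab, key=lambda w: weights[index[w]], reverse=True)[:5]
print(top)
```

On a corpus this small, function words can still sneak into the top of the list; with many articles per class they appear on both sides and their weights shrink, which is what lets profession-specific words stand out.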
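The exact feature set is not listed in this post, so the sketch below uses a few illustrative features of my own choosing (match count, match fraction, number of distinct matched terms, and how early the first match occurs). The function name `query_features` and the sample sentence are hypothetical as well.

```python
import re


def query_features(article, profession, expanded_query):
    """Compute a few illustrative features for one person-profession pair."""
    tokens = re.findall(r"[a-z]+", article.lower())
    terms = set(expanded_query) | {profession}
    hits = [i for i, tok in enumerate(tokens) if tok in terms]
    n = len(tokens)
    return {
        "match_count": len(hits),
        "match_fraction": len(hits) / n if n else 0.0,
        "distinct_terms": len({tokens[i] for i in hits}),
        # Earlier mentions give a value closer to 1; 0.0 means no match.
        "first_match_earliness": 1.0 - hits[0] / n if hits else 0.0,
    }


# Made-up sentence standing in for a person's Wikipedia article.
article = "He was an actor who starred in many films. He also played football."
expanded = ["acting", "acted", "starred", "cast", "played"]
features = query_features(article, "actor", expanded)
print(features)
```

Each person-profession pair then becomes one such feature vector, which is the kind of input a downstream classifier can consume.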
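The three-level strategy can be sketched as a walk down a binary tree with one binary classifier per internal node (seven in total). Two assumptions of mine: the three decisions are read as a binary number with the top-level decision as the most significant bit, and the trained SVM or Random Forest models are replaced by simple threshold functions on a made-up `score` feature so the example runs on its own.

```python
def make_threshold_classifiers():
    """Stand-in node classifiers: each splits a score in [0, 1) at its midpoint.

    In the real pipeline every node would be a trained binary classifier.
    """
    classifiers = {}

    def build(path, lo, hi):
        if len(path) == 3:  # leaves carry labels, not classifiers
            return
        mid = (lo + hi) / 2
        classifiers[path] = lambda f, mid=mid: int(f["score"] >= mid)
        build(path + (0,), lo, mid)
        build(path + (1,), mid, hi)

    build((), 0.0, 1.0)
    return classifiers


def hierarchical_label(features, classifiers):
    """Make three successive 0/1 decisions and map them to a label in 0..7."""
    path = ()
    for _ in range(3):
        path += (classifiers[path](features),)
    # Read the decisions as a binary number, e.g. (1, 0, 1) -> 5.
    return path[0] * 4 + path[1] * 2 + path[2]


classifiers = make_threshold_classifiers()
print(hierarchical_label({"score": 0.7}, classifiers))
```

Keying the classifiers by the path taken so far means each node only ever has to separate the samples that reach it, which is the point of classifying level by level instead of predicting all eight labels at once.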
Deep Learning Based Method
For this approach, the feature selection strategy above was replaced by a CNN, which classified a profession as relevant or non-relevant with respect to a person based on features it identified automatically. Unlike the feature based method, a single classifier was used across all professions; the sample selection strategy for each profession was similar to the one above.
A lot of work could be taken up in the domain of “ranking structured knowledge base entities from unstructured text”. I’ve shown the viability of the deep learning method for the problem, but not the final rankings produced by the CNN, as I had to stop at the stage of predicting a single profession of a person as relevant or non-relevant due to time constraints. This could be taken up as an interesting research problem.
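The network architecture is not described in this post, so the following is only a generic illustration of how a text CNN turns a word sequence into a relevant/non-relevant probability: embed the words, slide a filter over consecutive words, apply ReLU, take the maximum over positions, and pass it through a sigmoid. All embeddings and weights are hand-set and hypothetical, and the sketch leaves out how the profession itself would be fed to a single shared classifier.

```python
import math

# Hypothetical 2-dimensional word embeddings; a real model would learn these.
embeddings = {
    "starred": [1.0, 0.0],
    "in": [0.0, 0.1],
    "films": [0.9, 0.1],
    "wrote": [0.0, 1.0],
    "novels": [0.1, 0.9],
}
UNKNOWN = [0.0, 0.0]

# One convolution filter spanning 2 consecutive words (shape 2 x 2),
# hand-set so that it responds to the first embedding dimension.
conv_filter = [[1.0, -1.0], [1.0, -1.0]]
conv_bias = 0.0
out_weight, out_bias = 2.0, -1.0


def relevance_probability(tokens):
    """Forward pass: embed -> 1D convolution -> ReLU -> max-pool -> sigmoid."""
    vectors = [embeddings.get(t, UNKNOWN) for t in tokens]
    width = len(conv_filter)
    activations = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        z = conv_bias + sum(
            w * x
            for filt_row, vec in zip(conv_filter, window)
            for w, x in zip(filt_row, vec)
        )
        activations.append(max(0.0, z))  # ReLU
    pooled = max(activations) if activations else 0.0  # max over positions
    return 1.0 / (1.0 + math.exp(-(out_weight * pooled + out_bias)))


p_actor_text = relevance_probability(["starred", "in", "films"])
p_writer_text = relevance_probability(["wrote", "novels"])
print(p_actor_text, p_writer_text)
```

In a trained network there would be many such filters of several widths, and their weights would be learned from the positive and negative articles rather than written by hand.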