Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors

Published in CSCW 2021

Recommended citation: Sumit Asthana, Sabrina Tobar Thommel, Aaron Halfaker, and Nikola Banovic. Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors. In Proceedings of the ACM on Human-Computer Interaction, Vol. 5, No. CSCW2, Article 359 (October 2021). ACM, 23 pages.

Our pipeline for labeling low-quality sentences on Wikipedia. We start with our automated labeling approach (left): we obtain a large corpus of historical Wikipedia sentence edits and label their semantic intent using programmatic rules. We extract positive sentences from edits with relevant semantic intents and negative sentences from Featured Articles. We then use these labels to train existing machine learning models and test them against labeling approaches from past research (middle). Models trained on our labels can then be deployed to automatically detect Wikipedia sentences that require improvement (right).
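
The labeling step (left) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the regular-expression rules, the intent names, the `SentenceEdit` type, and the choice to treat the pre-edit version of a sentence as the positive (needs improvement) example are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical rules mapping an edit comment to a semantic intent.
# These patterns are illustrative assumptions, not the rules from the paper.
INTENT_RULES = {
    "citation": re.compile(r"\b(cite|citation|ref|source)", re.IGNORECASE),
    "neutrality": re.compile(r"\b(npov|pov|neutral|peacock|weasel)", re.IGNORECASE),
    "clarification": re.compile(r"\b(clarif|reword|rephrase|copy-?edit)", re.IGNORECASE),
}


@dataclass
class SentenceEdit:
    """One historical sentence-level edit (hypothetical record layout)."""
    before: str   # sentence text before the edit
    after: str    # sentence text after the edit
    comment: str  # the editor's edit summary


def label_intent(edit):
    """Return the first intent whose rule matches the edit comment, else None."""
    for intent, pattern in INTENT_RULES.items():
        if pattern.search(edit.comment):
            return intent
    return None


def build_dataset(edits, featured_sentences):
    """Build (sentence, label, intent) triples.

    Assumption: the pre-edit sentence of an edit with a relevant intent is a
    positive example (label 1); Featured Article sentences are negatives (0).
    """
    examples = []
    for edit in edits:
        intent = label_intent(edit)
        if intent is not None:
            examples.append((edit.before, 1, intent))
    for sentence in featured_sentences:
        examples.append((sentence, 0, None))
    return examples
```

The resulting triples could then be passed to any off-the-shelf text classifier; edits whose comments match no rule are simply dropped rather than guessed at.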