Measuring content quality in user generated content systems: a machine learning approach
Abstract User Generated Content (UGC) has radically transformed the Web from its humble origins as a document-publishing platform. Currently, and most likely in the foreseeable future as well, the Web serves primarily as a social medium, a largely unmoderated platform where millions of people share experiences and knowledge from their own points of view. While this freedom is empowering in general, when left unguided the Web becomes a cacophony of voices, where fact and fiction, good information and deception, blur. When faced with poor quality content, users are left with the feeling that nothing on the Web can be trusted. In order to tackle this issue of trust in unmoderated publishing media, I focus my work on Wikipedia. I set out to devise an efficient mechanism for automatic detection of low quality contributions, commonly known as "vandalism", and, at the same time, detect contributors who systematically behave as vandals. First I mine the Wikipedia history pages in order to extract user edit patterns. Then I use these patterns to derive several computational models of a user's reputation. Secondly, based on these models, I generate several new user reputation features and show that they are strong predictors for locating low quality content. To improve the accuracy of my approach, I extend the feature set by adding other textual features. I describe a method for detecting vandalism that is more accurate than others previously developed. Because of the high turnaround in user generated content systems, it is important for vandalism detection tools to be scalable and run in real time. I explain how we can implement the system in a distributed way. In addition, I use cost-sensitive feature selection to reduce the total computational cost of executing our models. This work is a starting point, but it will prove to be one of great importance if it contributes to a better understanding of user generated content and the methods of measuring and ensuring its quality. The methods I use in this thesis are general and can be applied to numerous other UGC systems such as Facebook and Twitter.
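The abstract names reputation models derived from mined edit histories but does not spell them out on this page. Purely as an illustrative sketch (the edit records, the `survived` flag, and the smoothing prior below are assumptions, not the thesis's actual models), a simple survival-based reputation score per user could look like:

```python
from collections import defaultdict

# Hypothetical edit-history records mined from Wikipedia history pages:
# (user, survived), where `survived` means the edit was not later reverted.
edits = [
    ("alice", True), ("alice", True), ("alice", False),
    ("bob", False), ("bob", False), ("bob", True),
]

def reputation_scores(edit_log, prior=1.0):
    """Smoothed fraction of a user's past edits that survived.

    This is only a stand-in for the thesis's reputation models; the
    Laplace-style prior keeps users with few edits near a neutral 0.5.
    """
    survived = defaultdict(float)
    total = defaultdict(float)
    for user, ok in edit_log:
        total[user] += 1.0
        survived[user] += 1.0 if ok else 0.0
    return {u: (survived[u] + prior) / (total[u] + 2.0 * prior) for u in total}

print(reputation_scores(edits))  # e.g. {'alice': 0.6, 'bob': 0.4}
```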
Added by wikilit team Added on initial load
Collected data time dimension Longitudinal
Conclusion In this thesis I studied the problem of measuring user reputation and content quality in Wikipedia. I modeled user reputation in Wikipedia and showed the effectiveness of three models for predicting user behavior. I used these reputation models to estimate content quality in Wikipedia. I showed that content quality is highly correlated with the reputation of its authors. I then extended the approach to enable me to locate low quality content in Wikipedia in the form of vandalism. I described a machine learning approach to detect vandalism in Wikipedia. I modeled vandalism as a binary classification problem. Based on the validated corpus of Wikipedia edits from the PAN 2010 competition, I trained and tested several binary classifiers utilizing learning methods that have been widely used for spam detection, such as Naive Bayes, Logistic Regression, and SVMs.
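The conclusion names the learner families but not the training setup. A minimal sketch of that binary classification step, assuming scikit-learn and using synthetic placeholder features and labels in place of the PAN 2010 corpus, might look like:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder data: rows are edits, columns are reputation + textual features,
# labels are 1 for vandalism and 0 for legitimate edits (PAN 2010-style labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```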
Data source Wikipedia pages
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Measuring%2Bcontent%2Bquality%2Bin%2Buser%2Bgenerated%2Bcontent%2Bsystems%3A%2Ba%2Bmachine%2Blearning%2Bapproach%22
Has author Sara Javanmardi
Has domain Computer science, Information systems
Has topic Vandalism
Month October
Peer reviewed Yes
Publication type Thesis
Published in ProQuest Dissertations & Theses
Research design Statistical analysis
Research questions In order to tackle this issue of trust in In order to tackle this issue of trust in unmoderated publishing media, I focus my work on Wikipedia. I set out to devise an e�cient mechanism for automatic detection of low quality contributions, commonly known as \vandalism," and, at the same time, detect contributors who systematically behave as vandals. First I mine the Wikipedia history pages in order to extract user edit patterns. Then I use these patterns to derive several computational models of a user's reputation. Secondly, based on these models, I generate several new user reputation features and show that they are strong predictors for locating low quality content. To improve the accuracy of my approach, I extend the feature set by adding other textual features. I describe a method for detecting vandalism that is more accurate than others previously developed. Because of the high turnaround in user generated content systems, it is important for vandalism detection tools to be scalable and run in real time. I explain how we can implement the system in a distributed way. In addition, I use cost sensitive feature selection to reduce the total computational cost of executing our models.omputational cost of executing our models.
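Cost-sensitive feature selection is only named in this summary. One common greedy formulation, shown purely as an illustrative sketch (the feature names, gains, and costs are invented, and this is not claimed to be the thesis's actual procedure), trades estimated accuracy gain against per-edit computation cost under a budget:

```python
def greedy_cost_sensitive_selection(features, gain, cost, budget):
    """Pick features by marginal gain per unit of computational cost.

    `gain[f]` is the estimated improvement from adding feature f (e.g. AUC gain
    on held-out data) and `cost[f]` is its per-edit computation cost; both are
    assumed precomputed. Selection stops when the cost budget is exhausted.
    """
    selected, spent = [], 0.0
    remaining = set(features)
    while remaining:
        best = max(remaining, key=lambda f: gain[f] / cost[f])
        if spent + cost[best] > budget or gain[best] <= 0:
            break
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected, spent

# Invented example values: textual features are cheap to compute per edit,
# while a reputation feature needs extra history lookups and costs more.
gain = {"upper_ratio": 0.02, "digit_ratio": 0.01, "user_reputation": 0.05}
cost = {"upper_ratio": 1.0, "digit_ratio": 1.0, "user_reputation": 10.0}
print(greedy_cost_sensitive_selection(list(gain), gain, cost, budget=12.0))
```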
Revid 10,868
Theory type Design and action
Title Measuring content quality in user generated content systems: a machine learning approach
Unit of analysis Edit
Url http://gradworks.umi.com/34/73/3473551.html
Wikipedia coverage Main topic
Wikipedia data extraction Dump
Wikipedia language English
Wikipedia page type Article
Year 2011
Creation date 9 August 2012 21:03:50
Categories Vandalism, Computer science, Information systems, Publications with missing theories, Publications with missing comments, Publications
Modification date 30 January 2014 20:29:47