Information extraction from Wikipedia: moving down the long tail
Abstract Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments "these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision" p. 731
Conclusion Kylin has demonstrated the ability to perform self-supervised information extraction from Wikipedia [26]. While Kylin achieved high precision and reasonable recall when infobox classes had a large number of instances, most classes sit on the long tail of few instances. For example, 82% of classes can provide fewer than 100 training examples, and for these classes Kylin's performance is unacceptable. Furthermore, even when Kylin does learn an effective extractor, there are many cases where Wikipedia's article on a topic is too short to hold the needed information. This paper describes three powerful methods for increasing recall with respect to these two long-tailed challenges: shrinkage, retraining, and supplementing Wikipedia extractions with those from the Web. Our experiments show that each of these methods is effective individually: shrinkage primarily addresses the first long-tailed challenge (sparse classes), while the latter two primarily address the second (short articles). We evaluate design tradeoffs within each method. Most importantly, we show that in concert these methods constitute a huge improvement to Kylin's performance (Figure 8): • Precision is modestly improved in most classes, with larger gains if sparsity is extreme (e.g., "Irish newspaper"). • Recall sees extraordinary improvement, with gains from 5.8% to 50.8% (a factor of 8.8) in extremely sparse classes such as "Irish newspaper." Even though the "Writer" class is populated with over 2000 infoboxes, its recall improves from 18.1% to 32.5% (a factor of 1.8) at equivalent levels of precision. • Calculating the area under the precision/recall curve also demonstrates substantial improvement, with improvement factors of 23.3, 1.98, 2.02, and 1.96 for "Irish newspaper," "Performer," "Baseball stadium," and "Writer," respectively.
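The shrinkage idea summarized above can be illustrated with a minimal sketch: a sparse class borrows statistical strength from its parent in the subsumption taxonomy. All names and the smoothing constant `k` here are illustrative assumptions; the paper's actual system blends training data for CRF extractors across the taxonomy rather than mixing distributions directly.

```python
# Minimal sketch of shrinkage over a subsumption taxonomy: a sparse
# class ("Irish newspaper") borrows strength from its parent
# ("Newspaper"). Function name and constant k are hypothetical.

def shrink(child_dist, parent_dist, n_child, k=50.0):
    """Blend the child's feature distribution with its parent's.
    The weight on the child's own estimate grows with the number of
    child training examples n_child; k is a smoothing constant."""
    lam = n_child / (n_child + k)
    features = set(child_dist) | set(parent_dist)
    return {f: lam * child_dist.get(f, 0.0)
               + (1.0 - lam) * parent_dist.get(f, 0.0)
            for f in features}

# With only 5 training articles, the blended estimate leans heavily
# on the better-estimated parent class.
child = {"circulation": 0.2, "editor": 0.8}
parent = {"circulation": 0.5, "editor": 0.4, "founded": 0.1}
blended = shrink(child, parent, n_child=5)
```

Note that attributes unseen in the sparse child (e.g., `founded`) receive nonzero weight from the parent, which is the source of the recall gains reported for extremely sparse classes.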
Conference location Las Vegas, NV, United States +
Data source Experiment responses  + , Wikipedia pages  +
Dates 24-27 +
Doi 10.1145/1401890.1401978 +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Information%2Bextraction%2Bfrom%2BWikipedia%3A%2Bmoving%2Bdown%2Bthe%2Blong%2Btail%22  +
Has author Fei Wu + , Raphael Hoffmann + , Daniel S. Weld +
Has domain Computer science +
Has topic Information extraction +
Month August  +
Pages 731-739  +
Peer reviewed Yes  +
Publication type Conference paper  +
Published in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining +
Research design Experiment  +
Research questions This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web.
Revid 10,825  +
Theories Undetermined
Theory type Design and action  +
Title Information extraction from Wikipedia: moving down the long tail
Unit of analysis Article  +
Url http://dl.acm.org/citation.cfm?id=1401978  +
Wikipedia coverage Main topic  +
Wikipedia data extraction Dump  +
Wikipedia language English  +
Wikipedia page type Article  +
Year 2008  +
Creation date 15 March 2012 20:29:01  +
Categories Information extraction  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:28:55  +