Information extraction from Wikipedia: moving down the long tail

From WikiLit
Authors: Fei Wu, Raphael Hoffmann, and Daniel S. Weld
Citation: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining: 731-739. 2008 August 24-27. Las Vegas, NV, United States.
Publication type: Conference paper
Peer-reviewed: Yes
DOI: 10.1145/1401890.1401978
Link(s): http://dl.acm.org/citation.cfm?id=1401978
Added by Wikilit team: Added on initial load
Information extraction from Wikipedia: moving down the long tail is a publication by Fei Wu, Raphael Hoffmann, and Daniel S. Weld.


Abstract

Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
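To make technique (1) concrete, the sketch below shows how shrinkage over a subsumption taxonomy can pool training data for a sparse infobox class from its ancestors, down-weighting borrowed examples by taxonomic distance. This is a minimal Python sketch under stated assumptions: the class names, the hand-written taxonomy, and the exponential decay weighting are illustrative stand-ins; the paper learns the taxonomy automatically and uses its own weighting scheme.

```python
from collections import defaultdict

# Toy subsumption (is-a) taxonomy: child class -> parent class.
# In the paper this hierarchy is learned automatically; these names are made up.
TAXONOMY = {
    "irish_newspaper": "newspaper",
    "newspaper": "periodical",
    "periodical": "media",
}

def ancestors(cls, taxonomy):
    """Yield (ancestor, distance) pairs walking up the is-a hierarchy."""
    dist = 1
    while cls in taxonomy:
        cls = taxonomy[cls]
        yield cls, dist
        dist += 1

def shrinkage_training_set(target, examples_by_class, taxonomy, decay=0.5):
    """Weighted training set for `target`: its own examples at weight 1.0,
    plus examples inherited from ancestors at weight decay**distance."""
    weighted = [(x, 1.0) for x in examples_by_class.get(target, [])]
    for anc, dist in ancestors(target, taxonomy):
        weighted.extend((x, decay ** dist) for x in examples_by_class.get(anc, []))
    return weighted

# Usage: a sparse class ("irish_newspaper") borrows labeled sentences
# from its better-populated ancestors.
examples = defaultdict(list,
                       irish_newspaper=["sent_a"],
                       newspaper=["sent_b", "sent_c"],
                       periodical=["sent_d"])
for sentence, weight in shrinkage_training_set("irish_newspaper", examples, TAXONOMY):
    print(sentence, weight)
```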

Research questions

"This paper presents three novel techniques for increasing recall from Wikipedia’s long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web."

Research details

Topics: Information extraction
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: Undetermined
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: English

Conclusion

"Kylin has demonstrated the ability to perform self-supervised information extraction from Wikipedia [26]. While Kylin achieved high precision and reasonable recall when infobox classes had a large number of instances, most classes sit on the long tail of few instances. For example, 82% classes can provide fewer than 100 training examples, and for these classes Kylin’s performance is unacceptable. Furthermore, even when Kylin does learn an effective extractor there are many cases where Wikipedia’s article on a topic is too short to hold much-needed information. This paper describes three powerful methods for increasing recall w.r.t. the above to long-tailed challenges: shrinkage, retraining, and supplementing Wikipedia extractions with those from the Web. Our experiments show that each of these methods is effective individually. Particularly, shrinkage addresses more the first longtailed challenge of sparse classes, and the latter two address more the second long-tailed challenge of short articles. We evaluate design tradeoffs within each method. Most importantly, we show that in concert, these methods constitute a huge improvement to Kylin’s performance (Figure 8): • Precision is modestly improved in most classes, with larger gains if sparsity is extreme (e.g., “Irish newspaper”). • Recall sees extraordinary improvement with gains from 5.8% to 50.8% (a factor of 8.8) in extremely sparse classes such as “Irish newspaper.” Even though the “Writer” class is populated with over 2000 infoboxes, its recall improves from 18.1% to 32.5% (a factor of 1.8) at equivalent levels of precision. • Calculating the area under the precision / recall curve also demonstrates substantial improvement, with an improvement factor of 23.3, 1.98, 2.02, and 1.96 for “Irish newspaper,” “Performer,” “Baseball stadium,” and “Writer,” respectively."

Comments

""these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision" p. 731"

