Using Wikipedia to bootstrap open information extraction
Abstract We often use 'Data Management' to refer to the manipulation of relational or semi-structured information, but much of the world's data is unstructured, for example the vast amount of natural-language text on the Web. The ability to manage the information underlying this unstructured text is therefore increasingly important. While information retrieval techniques, as embodied in today's sophisticated search engines, offer important capabilities, they lack the most important faculties found in relational databases: 1) queries comprising aggregation, sorting and joins, and 2) structured visualization such as faceted browsing.
Collected data time dimension Cross-sectional  +
Comments We advocate an alternative approach: using Wikipedia to generate relation-specific training data for a broad set of thousands of relations.
Conclusion This paper describes Kylin, which uses self-supervised learning to train relationally-targeted extractors from Wikipedia infoboxes. We explain how shrinkage and retraining allow Kylin to improve extractor robustness, and we demonstrate that these extractors can successfully mine tuples from a broader set of Web pages. Finally, we argue that the best way to utilize human effort is to invite humans to quickly validate the correctness of machine-generated extractions.
Data source Experiment responses  + , Wikipedia pages  +
Doi 10.1145/1519103.1519113 +
Google scholar url  +
Has author Daniel S. Weld + , Raphael Hoffmann + , Fei Wu +
Has domain Computer science +
Has topic Information extraction +
Issue 4  +
Pages 62-68  +
Peer reviewed Yes  +
Publication type Journal article  +
Published in ACM SIGMOD Record +
Research design Experiment  +
Research questions This paper presents Kylin as a case study of open IE. We start by describing Kylin's use of Wikipedia to power the self-supervised training of information extractors. Then, in Section 3, we show how Wikipedia training can be seen as a bootstrapping method enabling extraction from the wider set of general Web pages. Not even the best machine-learning algorithms have production-level precision.
Revid 11,183  +
Theories Undetermined
Theory type Design and action  +
Title Using Wikipedia to bootstrap open information extraction
Unit of analysis Article  +
Url  +
Volume 37  +
Wikipedia coverage Other  +
Wikipedia data extraction Dump  +
Wikipedia language Not specified  +
Wikipedia page type Article  +
Year 2009  +
Creation date 15 March 2012 20:32:28  +
Categories Information extraction  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:32:09  +
