Autonomously semantifying Wikipedia

From WikiLit
Authors: Fei Wu, Daniel S. Weld
Citation: CIKM '07 Proceedings of the sixteenth ACM conference on Conference on information and knowledge management: 41-50. 2007 November 6-9. Lisboa, Portugal. Association for Computing Machinery.
Publication type: Conference paper
Peer-reviewed: Yes
DOI: 10.1145/1321440.1321449
Link(s): http://dl.acm.org/citation.cfm?id=1321449
Added by Wikilit team: Added on initial load
Autonomously semantifying Wikipedia is a publication by Fei Wu and Daniel S. Weld.


Abstract

Berners-Lee’s compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping method — creating enough structured data to motivate the development of applications. This paper argues that autonomously “Semantifying Wikipedia” is the best way to solve the problem. We choose Wikipedia as an initial data source, because it is comprehensive, not too large, high-quality, and contains enough manually-derived structure to bootstrap an autonomous, self-supervised process. We identify several types of structures which can be automatically enhanced in Wikipedia (e.g., link structure, taxonomic data, infoboxes, etc.), and we describe a prototype implementation of a self-supervised, machine learning system which realizes our vision. Preliminary experiments demonstrate the high precision of our system’s extracted data — in one case equaling that of humans.
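The self-supervised bootstrapping idea in the abstract can be sketched in a few lines: existing infobox attribute/value pairs are matched against sentences in the article body to produce labeled training examples without any manual annotation. This is an illustrative sketch of the general technique, not the paper's actual code; the function name and sample data are invented for the example.

```python
# Illustrative sketch of self-supervised training-data generation:
# infobox (attribute, value) pairs label matching sentences in the article.

def generate_training_examples(infobox, sentences):
    """Return (sentence, attribute) pairs where the sentence contains the
    infobox value for that attribute. These pairs serve as positive training
    examples for a sentence classifier, with no human labeling required.

    infobox: dict mapping attribute name -> value string
    sentences: list of sentence strings from the article body
    """
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            # Case-insensitive substring match stands in for the paper's
            # more careful heuristics.
            if value.lower() in sentence.lower():
                examples.append((sentence, attribute))
    return examples

infobox = {"capital": "Lisbon", "population": "10.3 million"}
sentences = [
    "Portugal's capital and largest city is Lisbon.",
    "The country has a population of roughly 10.3 million.",
    "Portuguese is the official language.",
]
print(generate_training_examples(infobox, sentences))
```

A real system must additionally handle noisy matches and attributes whose values never appear verbatim in the text, which is where the classifier-based refinement mentioned in the conclusion comes in.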

Research questions

"Berners-Lee’s compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping method — creating enough structured data to motivate the development of applications. This paper argues that autonomously “Semantifying Wikipedia” is the best way to solve the problem. We choose Wikipedia as an initial data source, because it is comprehensive, not too large, high-quality, and contains enough manually-derived structure to bootstrap an autonomous, self-supervised process. We identify several types of structures which can be automatically enhanced in Wikipedia (e.g., link structure, taxonomic data, infoboxes, etc.), and we describe a prototype implementation of a self-supervised, machine learning system which realizes our vision."

Research details

Topics: Information extraction
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: Undetermined
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"We propose bootstrapping the Semantic Web by mining Wikipedia and we identify some unique challenges (lack of redundancy) and opportunities (unique identifiers, user-supplied training data, lists, categories, etc.) of this approach. We also identify additional issues resulting from Wikipedia’s growth through decentralized authoring (e.g., inconsistency, schema drift, etc.). This high-level analysis should benefit future work on Wikipedia and similar collaborative knowledge repositories.
• We describe a system for automatically generating attribute/value pairs summarizing an article’s properties. Based on self-supervised learning, KYLIN achieves performance which is roughly comparable with that of human editors. In one case, KYLIN does even better.
• By automatically identifying missing internal links for proper nouns, more semantic tags are added. Because these links resolve noun phrases to unique identifiers, they are useful for many purposes such as information retrieval, structural analysis, and further semantic processing. Meaning lies in the graph structure of concepts defined in terms of each other, and KYLIN helps complete that graph.
• Collaboratively authored data is rife with noise and incompleteness. We identify robust learning methods which can cope in this environment. Extensive experiments demonstrate the performance of our system and characterize some of the crucial architectural choices (e.g., the optimal ordering of heuristics, the utility of classifier-based training data refinement, a pipelined architecture for attribute extraction)."
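The second contribution above, identifying missing internal links for proper nouns, can be illustrated with a minimal sketch: phrases in the wikitext that match existing article titles but are not already wiki-linked become candidate links. The function and sample text are hypothetical, and a real system would also disambiguate between candidate targets.

```python
# Hedged sketch of missing-internal-link detection: suggest article titles
# that occur in the text but are not yet marked up as [[links]].
import re

def suggest_links(text, known_titles):
    """Return known article titles that appear unlinked in the wikitext."""
    # Titles already linked via [[Title]] or [[Title|label]] syntax.
    already_linked = set(re.findall(r"\[\[([^\]|]+)", text))
    suggestions = []
    for title in known_titles:
        if title in already_linked:
            continue
        if re.search(re.escape(title), text):
            suggestions.append(title)
    return suggestions

text = "KYLIN was evaluated on [[Wikipedia]] dumps covering Seattle and Lisbon."
print(suggest_links(text, ["Seattle", "Lisbon", "Wikipedia"]))
# Seattle and Lisbon are candidates; Wikipedia is already linked.
```

Because each suggested link resolves a noun phrase to a unique article identifier, even this simple matching step adds the kind of semantic tags the conclusion describes.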

Comments

A system was proposed to extract structured data from Wikipedia; the main issues result "from Wikipedia’s growth through decentralized authoring... [Also, the] list and category information [in Wikipedia] is rudimentary" (p. 49).

