Information arbitrage across multi-lingual Wikipedia

Authors: Eytan Adar, Michael Skinner, Daniel S. Weld
Citation: WSDM '09 Proceedings of the Second ACM International Conference on Web Search and Data Mining: 94-103. 2009 February 9-12. Barcelona, Spain.
Publication type: Conference paper
Peer-reviewed: Yes
Database(s):
DOI: 10.1145/1498759.1498813
Link(s): http://dl.acm.org/citation.cfm?id=1498813
Added by Wikilit team: Added on initial load
Information arbitrage across multi-lingual Wikipedia is a publication by Eytan Adar, Michael Skinner, and Daniel S. Weld.


Abstract

The rapid globalization of Wikipedia is generating a parallel, multi-lingual corpus of unprecedented scale. Pages for the same topic in many different languages emerge both as a result of manual translation and independent development. Unfortunately, these pages may appear at different times, vary in size, scope, and quality. Furthermore, differential growth rates cause the conceptual mapping between articles in different languages to be both complex and dynamic. These disparities provide the opportunity for a powerful form of information arbitrage: leveraging articles in one or more languages to improve the content in another. Analyzing four large language domains (English, Spanish, French, and German), we present Ziggurat, an automated system for aligning Wikipedia infoboxes, creating new infoboxes as necessary, filling in missing information, and detecting discrepancies between parallel pages. Our method uses self-supervised learning and our experiments demonstrate the method's feasibility, even in the absence of dictionaries.
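
As a rough illustration of the "filling in missing information" and "detecting discrepancies" tasks named in the abstract, the toy sketch below copies attribute values from one language's infobox into another and flags values that disagree. It is not the Ziggurat system: the paper learns alignments with self-supervised learning, whereas here the attribute mapping is hand-written. The function name `arbitrage`, the mapping, and all infobox values are hypothetical and made up for the example.

```python
# Toy illustration only, NOT the Ziggurat method described in the paper.
# Assumption: an attribute alignment between two languages is already known
# and is given as a plain dict (Ziggurat learns this alignment instead).

def arbitrage(source, target, mapping):
    """Return (filled, conflicts) for a target-language infobox.

    source, target: dicts of attribute name -> value
    mapping: dict of source attribute name -> target attribute name
    """
    filled = dict(target)   # copy, so the original infobox is left untouched
    conflicts = {}
    for src_attr, tgt_attr in mapping.items():
        if src_attr not in source:
            continue
        value = source[src_attr]
        if tgt_attr not in target:
            # Attribute is missing in the target language: fill it in.
            filled[tgt_attr] = value
        elif target[tgt_attr] != value:
            # Both languages have a value but they differ: flag it.
            conflicts[tgt_attr] = (target[tgt_attr], value)
    return filled, conflicts


# Made-up infoboxes for a fictional place; values are placeholders.
english = {"name": "Exampleville", "country": "Examplia", "population": "1200"}
spanish = {"nombre": "Exampleville", "población": "1000"}
en_to_es = {"name": "nombre", "country": "país", "population": "población"}

filled, conflicts = arbitrage(english, spanish, en_to_es)
print(filled)     # Spanish infobox with the missing "país" attribute added
print(conflicts)  # attributes whose values disagree across languages
```

Values are copied verbatim in this sketch; a real system would also need to translate or normalize them, which is outside what this example attempts.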

Research questions

"The globalization of Wikipedia shows no apparent slowdown and there is a unique opportunity to utilize the parallel work of editors versed in different languages. As content is created at different rates in different languages, and the quality of that content is highly variable, there is a huge opportunity to resolve differences and inconsistencies. In this paper we introduce Ziggurat, a system to automatically resolve differentials in infobox completeness."

Research details

Topics: Other content topics, Text classification
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: English, French, German, Spanish

Conclusion

"The system provides a unique mechanism that allows the content in one language to benefit from parallel content in others. By utilizing the notion that this differential is exploitable (an arbitrage opportunity), we develop an accurate system for filling in missing infobox data."

Comments

"The system provides a unique mechanism that allows the content in one language to benefit from parallel content in others." (p. 103)


Further notes

Facts about "Information arbitrage across multi-lingual Wikipedia"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: Barcelona, Spain
Data source: Experiment responses, Wikipedia pages
Dates: 9-12
Doi: 10.1145/1498759.1498813
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Information%2Barbitrage%2Bacross%2Bmulti-lingual%2BWikipedia%22
Has author: Eytan Adar, Michael Skinner, Daniel S. Weld
Has domain: Computer science
Has topic: Other content topics, Text classification
Month: February
Pages: 94-103
Peer reviewed: Yes
Publication type: Conference paper
Published in: WSDM '09 Proceedings of the Second ACM International Conference on Web Search and Data Mining
Research design: Experiment
Revid: 10,824
Theories: Undetermined
Theory type: Design and action
Title: Information arbitrage across multi-lingual Wikipedia
Unit of analysis: Article
Url: http://dl.acm.org/citation.cfm?id=1498813
Wikipedia coverage: Main topic
Wikipedia data extraction: Dump
Wikipedia language: English, French, German, Spanish
Wikipedia page type: Article
Year: 2009