Wikipedia revision toolkit: efficiently accessing Wikipedia's edit history

Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History
Authors: Oliver Ferschke, Torsten Zesch, Iryna Gurevych
Citation: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 97-102. June 21, 2011. Portland, Oregon, USA.
Publication type: Conference paper
Peer-reviewed: Yes
Link(s): http://dl.acm.org/citation.cfm?id=2002440.2002457
Added by Wikilit team: Yes
Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History is a publication by Oliver Ferschke, Torsten Zesch, and Iryna Gurevych.


Abstract

We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.
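The abstract does not spell out the storage format; the general idea behind such massive size reductions is delta encoding: keeping the first revision in full and storing every later revision only as a compact diff against its predecessor, replaying diffs on demand. The toolkit itself is a Java extension of JWPL, so the sketch below is purely illustrative of the technique, not the toolkit's API (all names are hypothetical):

```python
import difflib

def make_delta(old, new):
    """Encode `new` relative to `old`: copy instructions for shared
    spans, literal text only where the revisions differ."""
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))   # reuse old[i1:i2]
        else:                                # replace / delete / insert
            delta.append(("data", new[j1:j2]))
    return delta

def apply_delta(old, delta):
    """Rebuild the newer revision from the older one plus its delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            _, i1, i2 = op
            parts.append(old[i1:i2])
        else:
            parts.append(op[1])
    return "".join(parts)

# Toy edit history: each delta references only the previous revision.
r0 = "Wikipedia is a free encyclopedia."
r1 = "Wikipedia is a free online encyclopedia."
r2 = "Wikipedia is a free online encyclopedia, created in 2001."
d1 = make_delta(r0, r1)
d2 = make_delta(r1, r2)
assert apply_delta(r0, d1) == r1
assert apply_delta(apply_delta(r0, d1), d2) == r2
```

Since consecutive Wikipedia revisions are usually near-identical, the deltas are tiny compared to the full texts, which is why diff-based storage can shrink the history to a small fraction of its original size.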

Research questions

"We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia."

Research details

Topics: Other natural language processing topics
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: Undetermined
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Longitudinal
Unit of analysis: Edit
Wikipedia data extraction: Dump
Wikipedia page type: Article, History
Wikipedia language: English

Conclusion

"In this paper, we presented an open-source toolkit which extends JWPL, an API for accessing Wikipedia, with the ability to reconstruct past states of Wikipedia, and to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia, and is also a requirement for the creation of time-based series of Wikipedia snapshots and for assessing the influence of Wikipedia growth on NLP algorithms. Furthermore, Wikipedia’s edit history has been shown to be a valuable knowledge source for NLP, which is hard to access because of the lack of efficient tools for managing the huge amount of revision data. By utilizing a dedicated storage format for the revisions, our toolkit massively decreases the amount of data to be stored. At the same time, it provides an easy-to-use interface to access the revision data. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history. The toolkit will be made available as part of JWPL, and can be obtained from the project’s website at Google Code. (http://jwpl.googlecode.com)"
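Reconstructing a past state of Wikipedia conceptually amounts to selecting, for each article, the latest revision created at or before the chosen snapshot date. The toolkit's Java API is not reproduced in this entry; the sketch below only illustrates that selection step (all names, ids, and dates are hypothetical):

```python
from bisect import bisect_right
from datetime import datetime

def revision_at(timestamps, revision_ids, snapshot):
    """Return the id of the latest revision created at or before
    `snapshot`, or None if the article did not yet exist.
    `timestamps` must be sorted ascending, parallel to `revision_ids`."""
    i = bisect_right(timestamps, snapshot)
    return revision_ids[i - 1] if i else None

# Toy history of a single article.
ts = [datetime(2004, 5, 1), datetime(2006, 3, 12), datetime(2009, 8, 30)]
ids = [101, 202, 303]

assert revision_at(ts, ids, datetime(2007, 1, 1)) == 202
assert revision_at(ts, ids, datetime(2003, 1, 1)) is None
```

Applying this lookup across every article with the same snapshot date yields a consistent historical dump, which is what makes time-based series of Wikipedia snapshots possible.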

Comments

