Mining meaning from Wikipedia
Abstract Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.
Added by wikilit team Added on initial load  +
Collected data time dimension N/A  +
Comments Unit of analysis: scholarly articles
Conclusion No conclusion, this is a summary of the whole literature review: Summary A whole host of researchers have been quick to grasp the potential of Wikipedia as a resource for mining meaning: the literature is large and growing rapidly. We began this article by describing Wikipedia's creation process and structure (Section 2). The unique open editing philosophy, which accounts for its success, is subversive. Although regarded as suspect by the academic establishment, it is a remarkable concrete realization of the American pragmatist philosopher Peirce's proposal that knowledge be defined through its public character and future usefulness rather than any prior justification. Wikipedia is not just an encyclopedia but can be viewed as anything from a corpus, taxonomy, thesaurus, hierarchy of knowledge topics to a full-blown ontology. It includes explicit information about synonyms (redirects) and word senses (disambiguation pages), database-style information (infoboxes), semantic network information (hyperlinks), category information (category structure), discussion pages, and the full edit history of every article. Each of these sources of information can be mined in various ways. Section 3 explains how Wikipedia is being exploited for natural language processing. Unlike WordNet, it was not created as a lexical resource that reflects the intricacies of human language. Instead, its primary goal is to provide encyclopedic knowledge across subjects and languages. However, the research described here demonstrates that it has, unexpectedly, immense potential as a repository of linguistic knowledge for natural language applications. In particular, its unique features allow well-defined tasks such as word sense disambiguation and word similarity to be addressed automatically, and the resulting level of performance is remarkably high.
ESA (Gabrilovich and Markovitch, 2007) and Wikipedia Link-based Measure (Milne and Witten, 2008a), for example, take advantage of the extended and hyperlinked description of concepts that in WordNet were restricted to short glosses. Furthermore, whereas in WordNet the sense frequency was defined by a simple ranking of meaning, Wikipedia implicitly contains conditional probabilities of word meanings (Mihalcea and Csomai, 2007), which allows more accurate similarity computation and word sense disambiguation. While the current research in this area has been mostly restricted to English, the approaches are general enough to apply to other languages. Researchers on co-reference resolution and mining of multilingual information have only recently discovered Wikipedia; significant improvements in these areas can be expected. To our knowledge, its use as a resource for other tasks such as natural language generation, machine translation and discourse analysis has not yet been explored. These areas are ripe for exploitation, and exciting discoveries can be expected. Section 4 describes applications in information retrieval. New techniques for document classification and topic indexing make productive use of Wikipedia for searching and organizing document collections. These areas can take advantage of its unique properties while grounding themselves in, and building upon, existing research. In particular, document classification has gathered momentum and significant advances have been obtained over the state of the art. Question answering and entity ranking are less well addressed, because current techniques do not seem to take full advantage of Wikipedia: most simply treat it as just another corpus.
We found little evidence of cross-pollination between this work and the information extraction efforts described in Section 5. Given how closely question answering and entity ranking depend on the extraction of facts and entities, we expect this to become a fruitful line of enquiry. In Section 5, we turn to information extraction and ontology building; mining Wikipedia for topics, relations and facts and then organizing them into a single resource. This task is less well defined than those in Sections 3 and 4. Different researchers focus on different kinds of information: we have reviewed projects that identify movie directors and soccer players, composers, corporate descriptions and hierarchical and ontological relations. Techniques range from those developed for standard text corpora to ones that utilize Wikipedia-specific properties such as hyperlinks and the category structure. The extracted resources range in size from several hundred to several million relations, but the lack of a common basis for evaluation prevents any overall conclusions as to which approach performs best. We believe that an extrinsic evaluation would be most meaningful, and hope to see these systems compete on a well-defined task in an independent evaluation. It will also be interesting to see to what extent these resources are exploited by other research communities in the future. Some authors have suggested using the Wikipedia editors themselves to perform ontology building, an enterprise that might be thought of as mining Wikipedia's people rather than its data. Perhaps they understand the underlying driving force behind this massively successful resource better than most! Only time will tell whether the community is amenable to following such suggestions. The idea of moving to a more structured and ontologically principled Wikipedia raises an interesting question: how will it interact with the public, amateur-editor model? Does this signal the emergence of the Semantic Web? 
We suspect that, like the success of Wikipedia itself, the result will be something new, something that experts have not foreseen and may not condone. That is the glory of Wikipedia.
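The summary above notes that Wikipedia implicitly contains conditional probabilities of word meanings. A minimal sketch of one way such probabilities could be estimated is shown here, assuming link data is available as (anchor text, target article) pairs. The function name `sense_probabilities` and the toy link list are hypothetical illustrations, not taken from the reviewed article or from any of the cited systems.

```python
from collections import Counter, defaultdict


def sense_probabilities(links):
    """Estimate P(target article | anchor text) from (anchor, target) pairs.

    Anchors are lower-cased so that "Jaguar" and "jaguar" are pooled.
    """
    counts = defaultdict(Counter)
    for anchor, target in links:
        counts[anchor.lower()][target] += 1

    probs = {}
    for anchor, targets in counts.items():
        total = sum(targets.values())
        # Relative frequency of each target among all links with this anchor.
        probs[anchor] = {t: n / total for t, n in targets.items()}
    return probs


# Hypothetical toy data: each pair is one hyperlink found in some article.
toy_links = [
    ("jaguar", "Jaguar (animal)"),
    ("jaguar", "Jaguar Cars"),
    ("jaguar", "Jaguar Cars"),
    ("Jaguar", "Jaguar Cars"),
    ("apple", "Apple Inc."),
]

probs = sense_probabilities(toy_links)
```

With this toy data, three of the four "jaguar" links point to the car maker, so that sense receives the larger share of the probability mass. A real system would extract the pairs from a Wikipedia dump and would likely need smoothing for rare anchors.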
Data source Scholarly articles  +
Doi 10.1016/j.ijhcs.2009.05.004 +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Mining%2Bmeaning%2Bfrom%2BWikipedia%22  +
Has author Olena Medelyan + , David N. Milne + , Catherine Legg + , Ian H. Witten +
Has domain Computer science + , Information science +
Has topic Other corpus topics + , Literature review +
Issue 9  +
Pages 716-754  +
Peer reviewed Yes  +
Publication type Journal article  +
Published in International Journal of Human Computer Studies +
Research design Literature review  +
Research questions It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.
Revid 10,873  +
Theories Wikipedia's editing process can be grounded in the knowledge theory proposed by the 19th Century pragmatist Peirce. According to Peirce, beliefs can be understood as knowledge not due to their prior justification, but due to their usefulness, public character and future development. His account of knowledge was based on a unique account of truth, which claimed that true beliefs are those that all sincere participants in a “community of inquiry” would converge on, given enough time. Influential 20th Century philosophers (e.g. Quine, 1960) scoffed at this notion as being insufficiently objective. Yet Peirce claimed that there is a kind of person whose greatest passion is to render the Universe intelligible and will freely give time to do so, and that over the long run, within a sufficiently broad community, the use of signs is intrinsically self-correcting (Peirce, 1877). Wikipedia can be seen as a fascinating and unanticipated concrete realization of these apparently wildly idealistic claims. In this context it is interesting to note that Larry Sanger, Wikipedia's co-founder and original editor-in-chief, had his initial training as a philosopher, with a specialization in theory of knowledge. In public accounts of his work he has tried to bypass vexed philosophical discussions of truth by claiming that Wikipedians are not seeking it but rather a neutral point of view. But as the purpose of this is to support every reader being able to build their own opinion, it can be argued that somewhat paradoxically this is the fastest route to genuine consensus. Interestingly, however, he and the other co-founder Jimmy Wales eventually clashed over the role of expert opinion in Wikipedia. In 2007, Sanger diverged to found a new public online encyclopedia, Citizendium, in an attempt to “do better” than Wikipedia, apparently reasserting validation by external authority, e.g., academics.
Interestingly, although it is early days, Citizendium seems to lack Wikipedia's popularity and momentum. Duffy, 2001: An alternative to PageRank and HITS is the Green method (Duffy, 2001), which Ollivier and Senellart (2007) applied to Wikipedia's hyperlink network structure in order to find related articles. This method, which is based on Markov Chain theory, is related to the topic-sensitive version of PageRank introduced by Haveliwala (2003). Given a target article, one way of finding related articles is to look at nodes with high PageRank in its immediate neighborhood. For this a topic-sensitive measure like Green's is more appropriate than the global PageRank.
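The idea of ranking articles relative to a single target article can be illustrated with a small sketch. The code below is not the Green method itself; it is a plain personalized (topic-sensitive) PageRank computed by power iteration, in which all teleportation returns to the target article. The function names, the damping value of 0.85, the iteration count, and the toy link graph are assumptions made for illustration only. It also assumes every linked article appears as a key in the graph.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power-iteration PageRank where every random jump returns to `source`.

    `graph` maps each article to the list of articles it links to.
    Assumption: every link target is also a key of `graph`.
    """
    rank = {node: 0.0 for node in graph}
    rank[source] = 1.0  # all probability mass starts on the target article

    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, out_links in graph.items():
            if out_links:
                # Spread the damped mass evenly over outgoing links.
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    nxt[target] += share
            else:
                # Dangling article: return its damped mass to the source.
                nxt[source] += damping * rank[node]
        # Total mass is 1, so the teleport contribution is (1 - damping).
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


def related_articles(graph, source, top_n=3):
    """Return the highest-scoring articles other than `source`."""
    rank = personalized_pagerank(graph, source)
    others = [(node, score) for node, score in rank.items() if node != source]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:top_n]


# Hypothetical toy link graph; "D" links to "A" but nothing links to "D".
toy_graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
}

ranks = personalized_pagerank(toy_graph, "A")
top = related_articles(toy_graph, "A", top_n=2)
```

Because the random walk always restarts at the target article, articles that cannot be reached from it (such as "D" in the toy graph) receive no score, which is the behaviour one wants when looking for related articles rather than globally important ones.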
Theory type Analysis  +
Title Mining meaning from Wikipedia
Unit of analysis Scholarly article  +
Url http://dx.doi.org/10.1016/j.ijhcs.2009.05.004  +
Volume 67  +
Wikipedia coverage Main topic  +
Wikipedia data extraction N/A  +
Wikipedia language Not specified  +
Wikipedia page type N/A  +
Year 2009  +
Creation date 15 March 2012 20:29:38  +
Categories Other corpus topics  + , Literature review  + , Computer science  + , Information science  + , Publications  +
Modification date 30 January 2014 20:29:49  +