Mining meaning from Wikipedia

From WikiLit
Authors: Olena Medelyan, David N. Milne, Catherine Legg, Ian H. Witten
Citation: International Journal of Human Computer Studies 67 (9): 716-754. 2009.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1016/j.ijhcs.2009.05.004
Added by Wikilit team: Added on initial load
Mining meaning from Wikipedia is a publication by Olena Medelyan, David N. Milne, Catherine Legg, and Ian H. Witten.


Abstract

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.

Research questions

"It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced."

Research details

Topics: Other corpus topics, Literature review
Domains: Computer science, Information science
Theory type: Analysis
Wikipedia coverage: Main topic
Theories: "Wikipedia's editing process can be grounded in the knowledge theory proposed by the 19th Century pragmatist Peirce. According to Peirce, beliefs can be understood as knowledge not due to their prior justification, but due to their usefulness, public character and future development. His account of knowledge was based on a unique account of truth, which claimed that true beliefs are those that all sincere participants in a “community of inquiry” would converge on, given enough time. Influential 20th Century philosophers (e.g. Quine, 1960) scoffed at this notion as being insufficiently objective. Yet Peirce claimed that there is a kind of person whose greatest passion is to render the Universe intelligible and will freely give time to do so, and that over the long run, within a sufficiently broad community, the use of signs is intrinsically self-correcting (Peirce, 1877). Wikipedia can be seen as a fascinating and unanticipated concrete realization of these apparently wildly idealistic claims.

In this context it is interesting to note that Larry Sanger, Wikipedia's co-founder and original editor-in-chief, had his initial training as a philosopher—with a specialization in theory of knowledge. In public accounts of his work he has tried to bypass vexed philosophical discussions of truth by claiming that Wikipedians are not seeking it but rather a neutral point of view. But as the purpose of this is to support every reader being able to build their own opinion, it can be argued that somewhat paradoxically this is the fastest route to genuine consensus. Interestingly, however, he and the other co-founder Jimmy Wales eventually clashed over the role of expert opinion in Wikipedia. In 2007, Sanger diverged to found a new public online encyclopedia, Citizendium, in an attempt to “do better” than Wikipedia, apparently reasserting validation by external authority—e.g., academics. Interestingly, although it is early days, Citizendium seems to lack Wikipedia's popularity and momentum.


Duffy, 2001: An alternative to PageRank and HITS is the Green method (Duffy, 2001), which Ollivier and Senellart (2007) applied to Wikipedia's hyperlink network structure in order to find related articles. This method, which is based on Markov Chain theory, is related to the topic-sensitive version of PageRank introduced by Haveliwala (2003). Given a target article, one way of finding related articles is to look at nodes with high PageRank in its immediate neighborhood. For this a topic-sensitive measure like Green's is more appropriate than the global PageRank."

Research design: Literature review
Data source: Scholarly articles
Collected data time dimension: N/A
Unit of analysis: Scholarly article
Wikipedia data extraction: N/A
Wikipedia page type: N/A
Wikipedia language: Not specified
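The topic-sensitive PageRank mentioned under Theories biases a random-walk score toward a target article's neighborhood by concentrating the teleport distribution on the target instead of spreading it uniformly. The sketch below is an illustration of that general idea, not code from the paper; the toy hyperlink graph and the damping factor are invented.

```python
def topic_sensitive_pagerank(links, topic, damping=0.85, iters=50):
    """links: dict of node -> list of outgoing links; topic: set of teleport nodes."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    # Teleport mass goes only to the topic nodes, not uniformly to all nodes.
    teleport = {n: (1.0 / len(topic) if n in topic else 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links[n]
            if not out:
                # Dangling node: redistribute its rank via the teleport vector.
                for m in nodes:
                    new[m] += damping * rank[n] * teleport[m]
            else:
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
        rank = new
    return rank

# Toy hyperlink graph (invented article titles).
graph = {
    "Markov chain": ["Probability", "Andrey Markov"],
    "Probability": ["Markov chain"],
    "Andrey Markov": ["Markov chain", "Probability"],
    "Unrelated": ["Probability"],
}
scores = topic_sensitive_pagerank(graph, topic={"Markov chain"})
```

With the walk personalized to "Markov chain", its immediate neighborhood scores highly while the article with no inlinks from that neighborhood decays toward zero, which is the behavior that makes such measures useful for finding related articles.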

Conclusion

"No conclusion; this is a summary of the whole literature review:

Summary

A whole host of researchers have been quick to grasp the potential of Wikipedia as a resource for mining meaning: the literature is large and growing rapidly. We began this article by describing Wikipedia's creation process and structure (Section 2). The unique open editing philosophy, which accounts for its success, is subversive. Although regarded as suspect by the academic establishment, it is a remarkable concrete realization of the American pragmatist philosopher Peirce's proposal that knowledge be defined through its public character and future usefulness rather than any prior justification. Wikipedia is not just an encyclopedia but can be viewed as anything from a corpus, taxonomy, thesaurus, hierarchy of knowledge topics to a full-blown ontology. It includes explicit information about synonyms (redirects) and word senses (disambiguation pages), database-style information (infoboxes), semantic network information (hyperlinks), category information (category structure), discussion pages, and the full edit history of every article. Each of these sources of information can be mined in various ways.
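Of the information sources listed above, the redirect structure is the simplest to mine: each redirect maps a surface form to a canonical article title. The sketch below illustrates the idea only; the redirect pairs are hand-entered stand-ins for data that would come from a parsed dump, and real redirects are noisier than a clean synonym table.

```python
from collections import defaultdict

# Hand-entered (redirect source, redirect target) pairs standing in for a dump.
redirects = [
    ("UK", "United Kingdom"),
    ("Great Britain", "United Kingdom"),  # strictly a related term: redirects are noisy
    ("NYC", "New York City"),
]

# Invert the redirect table into a synonym index keyed by canonical title.
synonyms = defaultdict(set)
for source, target in redirects:
    synonyms[target].add(source)
```

After inversion, `synonyms["United Kingdom"]` holds the surface forms that redirect there, which is how several surveyed systems bootstrap synonym lists.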

Section 3 explains how Wikipedia is being exploited for natural language processing. Unlike WordNet, it was not created as a lexical resource that reflects the intricacies of human language. Instead, its primary goal is to provide encyclopedic knowledge across subjects and languages. However, the research described here demonstrates that it has, unexpectedly, immense potential as a repository of linguistic knowledge for natural language applications. In particular, its unique features allow well-defined tasks such as word sense disambiguation and word similarity to be addressed automatically—and the resulting level of performance is remarkably high. ESA (Gabrilovich and Markovitch, 2007) and Wikipedia Link-based Measure (Milne and Witten, 2008a), for example, take advantage of the extended and hyperlinked description of concepts that in WordNet were restricted to short glosses. Furthermore, whereas in WordNet the sense frequency was defined by a simple ranking of meaning, Wikipedia implicitly contains conditional probabilities of word meanings (Mihalcea and Csomai, 2007), which allows more accurate similarity computation and word sense disambiguation. While the current research in this area has been mostly restricted to English, the approaches are general enough to apply to other languages. Researchers on co-reference resolution and mining of multilingual information have only recently discovered Wikipedia; significant improvements in these areas can be expected. To our knowledge, its use as a resource for other tasks such as natural language generation, machine translation and discourse analysis, has not yet been explored. These areas are ripe for exploitation, and exciting discoveries can be expected.
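The "conditional probabilities of word meanings" idea can be made concrete by counting link anchors: the fraction of times a phrase is used as anchor text for each article estimates P(sense | phrase). The sketch below is illustrative only; the anchor counts are invented, not taken from Wikipedia, and the function is a hypothetical helper rather than any surveyed system's code.

```python
from collections import Counter

# (anchor text, link target) pairs as they might be harvested from article
# wikitext; counts are invented for illustration.
anchor_links = (
    [("jaguar", "Jaguar")] * 70           # the animal
    + [("jaguar", "Jaguar Cars")] * 25
    + [("jaguar", "Jaguar (band)")] * 5
)

def sense_probabilities(pairs, phrase):
    """Estimate P(target article | anchor phrase) from link-anchor counts."""
    counts = Counter(target for anchor, target in pairs if anchor == phrase)
    total = sum(counts.values())
    return {target: n / total for target, n in counts.items()}

probs = sense_probabilities(anchor_links, "jaguar")
# probs["Jaguar"] -> 0.7, so the animal is the default sense of "jaguar"
```

A disambiguator can use these priors alone as a strong baseline, or combine them with relatedness to the surrounding context, which is the strategy taken by the Wikify!-style systems discussed in Section 3.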

Section 4 describes applications in information retrieval. New techniques for document classification and topic indexing make productive use of Wikipedia for searching and organizing document collections. These areas can take advantage of its unique properties while grounding themselves in—and building upon—existing research. In particular, document classification has gathered momentum and significant advances have been obtained over the state of the art. Question answering and entity ranking are less well addressed, because current techniques do not seem to take full advantage of Wikipedia—most simply treat it as just another corpus. We found little evidence of cross-pollination between this work and the information extraction efforts described in Section 5. Given how closely question answering and entity ranking depend on the extraction of facts and entities, we expect this to become a fruitful line of enquiry.

In Section 5, we turn to information extraction and ontology building; mining Wikipedia for topics, relations and facts and then organizing them into a single resource. This task is less well defined than those in Sections 3 and 4. Different researchers focus on different kinds of information: we have reviewed projects that identify movie directors and soccer players, composers, corporate descriptions and hierarchical and ontological relations. Techniques range from those developed for standard text corpora to ones that utilize Wikipedia-specific properties such as hyperlinks and the category structure. The extracted resources range in size from several hundred to several million relations, but the lack of a common basis for evaluation prevents any overall conclusions as to which approach performs best. We believe that an extrinsic evaluation would be most meaningful, and hope to see these systems compete on a well-defined task in an independent evaluation. It will also be interesting to see to what extent these resources are exploited by other research communities in the future.
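One common strategy among the infobox-mining systems surveyed here is to read each infobox field as a subject-attribute-value triple. The sketch below is a deliberately simplified illustration, not the code of any surveyed project: the wikitext fragment is hand-written, the field names are illustrative, and real infobox parsing must handle templates, nesting and markup far messier than this.

```python
import re

# Hand-written, simplified infobox fragment (illustrative, not real dump data).
wikitext = """{{Infobox film
| name     = Heat
| director = [[Michael Mann]]
| released = 1995
}}"""

def infobox_triples(page_title, text):
    """Turn 'key = value' infobox lines into (subject, attribute, value) triples."""
    triples = []
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*(.+)", text):
        # Strip [[link]] and [[link|label]] markup down to the link target.
        value = re.sub(r"\[\[([^|\]]+)(?:\|[^\]]+)?\]\]", r"\1", value).strip()
        triples.append((page_title, key, value))
    return triples

# infobox_triples("Heat (film)", wikitext)
# -> [("Heat (film)", "name", "Heat"),
#     ("Heat (film)", "director", "Michael Mann"),
#     ("Heat (film)", "released", "1995")]
```

Aggregated over many pages, triples like these form the raw material for the movie-director and similar relation sets mentioned above, which is why evaluation of the resulting resources matters so much.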

Some authors have suggested using the Wikipedia editors themselves to perform ontology building, an enterprise that might be thought of as mining Wikipedia's people rather than its data. Perhaps they understand the underlying driving force behind this massively successful resource better than most! Only time will tell whether the community is amenable to following such suggestions. The idea of moving to a more structured and ontologically principled Wikipedia raises an interesting question: how will it interact with the public, amateur-editor model? Does this signal the emergence of the Semantic Web? We suspect that, like the success of Wikipedia itself, the result will be something new, something that experts have not foreseen and may not condone. That is the glory of Wikipedia."

Comments

"Unit of analysis: scholarly articles"

