
Computing semantic relatedness using Wikipedia-based explicit semantic analysis

Publication
Authors: Evgeniy Gabrilovich, Shaul Markovitch
Citation: IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606-1611, 2007.
Publication type: Conference paper
Peer-reviewed: Yes
Computing semantic relatedness using Wikipedia-based explicit semantic analysis is a publication by Evgeniy Gabrilovich and Shaul Markovitch.


Abstract

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
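As described above, ESA maps a text fragment to a weighted vector of Wikipedia-based concepts and scores relatedness with a conventional vector metric such as cosine. The following is a minimal, illustrative Python sketch, not the authors' implementation: the three toy "concepts", the simple TF-IDF weighting, and the function names (e.g., esa_vector) are assumptions made only for this example.

    import math
    from collections import Counter

    # Toy "Wikipedia concepts": each concept is an article title plus its text.
    # In the paper these are real Wikipedia articles; here they are made up.
    concepts = {
        "Computer": "computer program machine hardware software data",
        "Finance": "bank money market stock investment interest",
        "Music": "music instrument melody concert band guitar",
    }

    def tokenize(text):
        return text.lower().split()

    # Build an inverted index: word -> {concept: tf-idf weight}.
    doc_tokens = {c: tokenize(t) for c, t in concepts.items()}
    df = Counter(w for toks in doc_tokens.values() for w in set(toks))
    n_docs = len(concepts)

    inverted = {}
    for concept, toks in doc_tokens.items():
        tf = Counter(toks)
        for w, f in tf.items():
            idf = math.log(n_docs / df[w]) + 1.0   # smoothed idf (an assumption)
            inverted.setdefault(w, {})[concept] = f * idf

    def esa_vector(text):
        """Represent a text fragment as a weighted vector over concepts."""
        vec = Counter()
        for w in tokenize(text):
            for concept, weight in inverted.get(w, {}).items():
                vec[concept] += weight
        return vec

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    # Relatedness of two fragments = cosine of their concept vectors.
    print(cosine(esa_vector("stock market investment"),
                 esa_vector("bank interest money")))    # high relatedness
    print(cosine(esa_vector("stock market investment"),
                 esa_vector("guitar concert")))         # low relatedness

In the paper the concept space is built from the full set of Wikipedia articles and the word-to-concept weights are stored in an inverted index, so real ESA vectors are far higher-dimensional and sparser than this toy version.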

Research questions

"We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia (http://en.wikipedia.org), the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text."

Research details

Topics: Semantic relatedness
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: "Undetermined"
Research design: Design science, Experiment
Data source: Archival records, Experiment responses, Wikipedia pages
Collected data time dimension: Longitudinal
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"Compared to LSA, which only uses statistical cooccurrence information, our methodology explicitly uses the knowledge collected and organized by humans. Compared to lexical resources such as WordNet, our methodology leverages knowledge bases that are orders of magnitude larger and more comprehensive. Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness. Compared with the previous state of the art, using ESA results in notable improvements in correlation of computed relatedness scores with human judgements: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Furthermore, due to the use of natural concepts, the ESA model is easy to explain to human users."

Comments

""Empirical evaluation confirms that using ESA, [the method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia] leads to substantial improvements in computing word and text relatedness." p. 1611"


Further notes