Computing semantic relatedness using Wikipedia-based explicit semantic analysis
|Authors:||Evgeniy Gabrilovich, Shaul Markovitch|
|Citation:||IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence : 1606-1611. 2007.|
|Publication type:||Conference paper|
|Added by Wikilit team:||Added on initial load|
Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
"We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia (http://en.wikipedia.org), the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text."
|Theory type:||Design and action|
|Wikipedia coverage:||Main topic|
|Research design:||Design science, Experiment|
|Data source:||Archival records, Experiment responses, Wikipedia pages|
|Collected data time dimension:||Longitudinal|
|Unit of analysis:||Article|
|Wikipedia data extraction:||Dump|
|Wikipedia page type:||Article|
|Wikipedia language:||Not specified|
"Compared to LSA, which only uses statistical cooccurrence information, our methodology explicitly uses the knowledge collected and organized by humans. Compared to lexical resources such as WordNet, our methodology leverages knowledge bases that are orders of magnitude larger and more comprehensive. Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness. Compared with the previous state of the art, using ESA results in notable improvements in correlation of computed relatedness scores with human judgements: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Furthermore, due to the use of natural concepts, the ESA model is easy to explain to human users."
""Empirical evaluation confirms that using ESA, [the method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia] leads to substantial improvements in computing word and text relatedness." p. 1611"