Wikipedia-based semantic interpretation for natural language processing

From WikiLit
Authors: Evgeniy Gabrilovich, Shaul Markovitch [edit item]
Citation: Journal of Artificial Intelligence Research 34 : 443-498. 2009.
Publication type: Journal article
Peer-reviewed: Yes
Link(s): http://dl.acm.org/citation.cfm?id=1622728
Added by Wikilit team: Added on initial load
Wikipedia-based semantic interpretation for natural language processing is a publication by Evgeniy Gabrilovich and Shaul Markovitch.


Abstract

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

Research questions

"Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts."
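The representation described above can be illustrated with a minimal sketch: a tiny invented "Wikipedia" of three concepts, an inverted index mapping each word to a TF-IDF-weighted vector of concepts, and semantic relatedness computed as the cosine between the concept vectors of two text fragments. The corpus, concept titles, and weighting details below are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import defaultdict

# Toy "Wikipedia" corpus: concept title -> article text (invented for illustration).
concepts = {
    "Computer science": "algorithm computation data program software computer",
    "Linguistics": "language grammar word meaning semantics syntax",
    "Encyclopedia": "knowledge article reference encyclopedia wikipedia",
}

# Count term frequencies per concept and document frequencies per word.
doc_freq = defaultdict(int)
term_freq = {}
for title, text in concepts.items():
    counts = defaultdict(int)
    for w in text.split():
        counts[w] += 1
    term_freq[title] = counts
    for w in counts:
        doc_freq[w] += 1

# Inverted index: word -> {concept: TF-IDF weight}, i.e. each word is
# represented as a weighted vector of Wikipedia-derived concepts.
n = len(concepts)
index = defaultdict(dict)
for title, counts in term_freq.items():
    for w, tf in counts.items():
        index[w][title] = tf * math.log(n / doc_freq[w])

def esa_vector(text):
    """Represent a text fragment as a weighted concept vector by summing
    the concept vectors of its words (the fragment's concept centroid)."""
    vec = defaultdict(float)
    for w in text.lower().split():
        for title, weight in index.get(w, {}).items():
            vec[title] += weight
    return vec

def cosine(u, v):
    """Relatedness of two fragments = cosine of their concept vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(esa_vector("word meaning"), esa_vector("grammar syntax")))
print(cosine(esa_vector("word meaning"), esa_vector("software program")))
```

Both "word meaning" and "grammar syntax" activate only the Linguistics concept in this toy index, so their cosine is maximal, while "software program" activates a disjoint concept and scores zero.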

Research details

Topics: Semantic relatedness
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: Undetermined
Research design: Design science
Data source: Archival records
Collected data time dimension: Longitudinal
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"We succeeded to make automatic use of an encyclopedia without deep language understanding, specially crafted inference rules or relying on additional common-sense knowledge bases. This was made possible by applying standard text classification techniques to match document texts with relevant Wikipedia articles. Empirical evaluation confirmed the value of Explicit Semantic Analysis for two common tasks in natural language processing. Compared with the previous state of the art, using ESA results in significant improvements in automatically assessing semantic relatedness of words and texts. Specifically, the correlation of computed relatedness scores with human judgements increased from r = 0.56 to 0.75 (Spearman) for individual words and from r = 0.60 to 0.72 (Pearson) for texts. In contrast to existing methods, ESA offers a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Using ESA to perform feature generation for text categorization yielded consistent improvements across a diverse range of datasets. Recently, the performance of the best text categorization systems became similar, and previous work mostly achieved small improvements. Using Wikipedia as a source of external knowledge allowed us to improve the performance of text categorization across a diverse collection of datasets."
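The two correlation measures quoted above (Spearman for individual words, Pearson for texts) can be reproduced with a short standard-library sketch. The score lists below are invented toy numbers for demonstration, not the paper's evaluation data.

```python
import math

# Hypothetical relatedness scores for five word pairs (invented numbers):
# system-computed scores vs. averaged human judgements.
system = [0.9, 0.7, 0.4, 0.2, 0.1]
human = [0.95, 0.6, 0.5, 0.3, 0.05]

def pearson(x, y):
    """Pearson correlation: linear agreement of the raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks, so it measures
    agreement of orderings rather than of raw score values (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

print(round(pearson(system, human), 3))
print(round(spearman(system, human), 3))
```

Here the system ranks all five pairs in the same order as the human judges, so Spearman is 1.0 even though the raw scores differ and Pearson is somewhat lower.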

Comments


Further notes
