Enhancing text clustering by leveraging Wikipedia semantics
Abstract Most traditional text clustering methods are based on "bag of words" (BOW) representation based on frequency statistics in a set of documents. BOW, however, ignores the important information on the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich text representation with external resource in the past, such as WordNet. However, many of these approaches suffer from some limitations: 1) WordNet has limited coverage and has a lack of effective word-sense disambiguation ability; 2) Most of the text representation enrichment strategies, which append or replace document terms with their hypernym and synonym, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance traditional content similarity measure for text clustering. The experimental results on Reuters and OHSUMED datasets show that with the help of Wikipedia thesaurus, the clustering performance of our method is improved as compared to previous methods. In addition, with the optimized weights for hypernym, synonym, and associative concepts that are tuned with the help of a few labeled data users provided, the clustering performance can be further improved.
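The abstract describes enhancing a content similarity measure with weighted synonym, hypernym, and associative concepts. This record does not give the paper's actual formula, so the following is only a minimal sketch under an assumed form: a linear combination of cosine similarities over term and concept vectors. The function names, the document layout, and the weight values are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enriched_similarity(doc_a, doc_b, weights):
    """Plain term similarity plus weighted concept similarities.

    ASSUMPTION: a simple linear combination; the paper's framework
    may combine the relations differently.
    """
    sim = cosine(doc_a["terms"], doc_b["terms"])
    for relation, weight in weights.items():
        sim += weight * cosine(doc_a.get(relation, {}), doc_b.get(relation, {}))
    return sim


# Toy documents: no shared surface terms, but shared thesaurus concepts.
doc_a = {
    "terms": Counter(["car", "engine"]),
    "synonym": Counter(["Automobile"]),
    "hypernym": Counter(["Vehicle"]),
}
doc_b = {
    "terms": Counter(["automobile", "motor"]),
    "synonym": Counter(["Automobile"]),
    "hypernym": Counter(["Vehicle"]),
}

# Hypothetical weights; the abstract says such weights are tuned
# with a few user-provided labeled documents.
weights = {"synonym": 0.5, "hypernym": 0.3, "associative": 0.2}

bow_only = cosine(doc_a["terms"], doc_b["terms"])
enriched = enriched_similarity(doc_a, doc_b, weights)
print(bow_only, enriched)
```

In this toy case the two documents share no surface terms, so BOW alone treats them as unrelated, while the concept vectors let the enriched score register their overlap.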
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments "The text clustering experiments on two datasets indicate that with the help of our built Wikipedia thesaurus, the clustering performance of our method is improved compared with previous methods." p.186 Wikipedia pages; secondary datasets (Reuters-21578 and OHSUMED)
Conclusion The text clustering experiments on two datasets indicate that with the help of our built Wikipedia thesaurus, the clustering performance of our method is improved compared with previous methods. Meanwhile, with the optimized parameters based on a few labeled data users provide, the clustering performance can be further improved - 16.2% and 18.8% improvement compared with the baseline on Reuters and OHSUMED, respectively.
Conference location Singapore, Singapore +
Data source Archival records  + , Experiment responses  + , Wikipedia pages  +
Dates 20-24 +
Doi 10.1145/1390334.1390367 +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Enhancing%2Btext%2Bclustering%2Bby%2Bleveraging%2BWikipedia%2Bsemantics%22  +
Has author Jian Hu + , Lujun Fang + , Yang Cao + , Hua-Jun Zeng + , Hua Li + , Qiang Yang + , Zheng Chen +
Has domain Computer science +
Has topic Ranking and clustering systems + , Semantic relatedness +
Month July  +
Pages 179-186  +
Peer reviewed Yes  +
Publication type Conference paper  +
Published in SIGIR '08 Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval +
Publisher Association for Computing Machinery +
Research design Experiment  +
Research questions In this paper, we show that by fully leveraging the structural relationship information in Wikipedia, we can enhance the clustering result by obtaining a more accurate distance measure. In particular, we first build an informative and easy-to-use thesaurus from Wikipedia, which explicitly derives the concept relationships based on the structural knowledge in Wikipedia, including synonymy, polysemy, hypernymy and associative relation. The generated thesaurus serves as a control vocabulary that bridges the variety of idiolects and terminologies present in the document corpus.
Revid 10,748  +
Theories Undetermined
Theory type Design and action  +
Title Enhancing text clustering by leveraging Wikipedia semantics
Unit of analysis Article  +
Url http://dl.acm.org/citation.cfm?id=1390367  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Dump  +
Wikipedia language Not specified  +
Wikipedia page type Article  +
Year 2008  +
Creation date 15 March 2012 20:26:15  +
Categories Ranking and clustering systems  + , Semantic relatedness  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:25:58  +