Enhancing text clustering by leveraging Wikipedia semantics

Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen
Citation: SIGIR '08 Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval: 179-186. 2008 July 20-24. Singapore, Singapore. Association for Computing Machinery.
Publication type: Conference paper
Peer-reviewed: Yes
DOI: 10.1145/1390334.1390367
Link(s): http://dl.acm.org/citation.cfm?id=1390367
Added by Wikilit team: Added on initial load
Enhancing text clustering by leveraging Wikipedia semantics is a publication by Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen.


Abstract

Most traditional text clustering methods are based on "bag of words" (BOW) representation based on frequency statistics in a set of documents. BOW, however, ignores the important information on the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich text representation with external resource in the past, such as WordNet. However, many of these approaches suffer from some limitations: 1) WordNet has limited coverage and has a lack of effective word-sense disambiguation ability; 2) Most of the text representation enrichment strategies, which append or replace document terms with their hypernym and synonym, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance traditional content similarity measure for text clustering. The experimental results on Reuters and OHSUMED datasets show that with the help of Wikipedia thesaurus, the clustering performance of our method is improved as compared to previous methods. In addition, with the optimized weights for hypernym, synonym, and associative concepts that are tuned with the help of a few labeled data users provided, the clustering performance can be further improved.
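The idea of weighting synonym, hypernym, and associative concepts alongside the plain bag-of-words similarity can be illustrated with a small sketch. This is not the authors' implementation: the toy thesaurus entries, the function names (`cosine`, `represent`, `similarity`), the default weights, and the choice of a simple linear combination of cosine similarities are all assumptions made here for illustration.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus; in the paper these relations come from Wikipedia.
SYNONYMS = {"car": "automobile", "auto": "automobile"}            # word -> preferred concept
HYPERNYMS = {"automobile": ["vehicle"], "truck": ["vehicle"]}     # concept -> broader concepts
ASSOCIATIVE = {"automobile": ["engine"], "truck": ["engine", "cargo"]}  # concept -> related concepts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts or Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def represent(text):
    """Build four count vectors for a document: words, synonym-normalized
    concepts, hypernym concepts, and associative concepts."""
    words = Counter(text.lower().split())
    concepts = Counter()
    for word, count in words.items():
        concepts[SYNONYMS.get(word, word)] += count
    hyper, assoc = Counter(), Counter()
    for concept, count in concepts.items():
        for h in HYPERNYMS.get(concept, []):
            hyper[h] += count
        for a in ASSOCIATIVE.get(concept, []):
            assoc[a] += count
    return {"bow": words, "syn": concepts, "hyper": hyper, "assoc": assoc}


def similarity(doc1, doc2, w_syn=0.5, w_hyper=0.3, w_assoc=0.2):
    """Bag-of-words cosine plus weighted concept-level cosines.
    The weights are illustrative defaults, not values from the paper."""
    r1, r2 = represent(doc1), represent(doc2)
    return (cosine(r1["bow"], r2["bow"])
            + w_syn * cosine(r1["syn"], r2["syn"])
            + w_hyper * cosine(r1["hyper"], r2["hyper"])
            + w_assoc * cosine(r1["assoc"], r2["assoc"]))


if __name__ == "__main__":
    # "car" and "truck" share no words, so plain BOW similarity is zero,
    # but they share a hypernym and an associative concept in the toy thesaurus.
    print(similarity("car", "truck"))
```

In this sketch, two documents with no words in common can still receive a nonzero similarity through shared concepts, which is the gap in plain BOW that the abstract describes. The three weights play the role of the hypernym, synonym, and associative weights that the abstract says can be tuned with a small amount of labeled data.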

Research questions

"In this paper, we show that by fully leveraging the structural relationship information in Wikipedia, we can enhance the clustering result by obtaining a more accurate distance measure. In particular, we first build an informative and easy-to-use thesaurus from Wikipedia, which explicitly derives the concept relationships based on the structural knowledge in Wikipedia, including synonymy, polysemy, hypernymy and associative relation. The generated thesaurus serves as a control vocabulary that bridges the variety of idiolects and terminologies present in the document corpus."

Research details

Topics: Ranking and clustering systems, Semantic relatedness
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: Undetermined
Research design: Experiment
Data source: Archival records, Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"The text clustering experiments on two datasets indicate that with the help of our built Wikipedia thesaurus, the clustering performance of our method is improved compared with previous methods. Meanwhile, with the optimized parameters based on a few labeled data users provide, the clustering performance can be further improved - 16.2% and 18.8% improvement compared with the baseline on Reuters and OHSUMED, respectively."

Comments

"The text clustering experiments on two datasets indicate that with the help of our built Wikipedia thesaurus, the clustering performance of our method is improved compared with previous methods." (p. 186)

Data: Wikipedia pages; secondary datasets (Reuters-21578 and OHSUMED).


Facts about "Enhancing text clustering by leveraging Wikipedia semantics"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: Singapore, Singapore
Data source: Archival records, Experiment responses, Wikipedia pages
Dates: 20-24
DOI: 10.1145/1390334.1390367
Google Scholar URL: http://scholar.google.com/scholar?ie=UTF-8&q=%22Enhancing%2Btext%2Bclustering%2Bby%2Bleveraging%2BWikipedia%2Bsemantics%22
Has author: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen
Has domain: Computer science
Has topic: Ranking and clustering systems, Semantic relatedness
Month: July
Pages: 179-186
Peer reviewed: Yes
Publication type: Conference paper
Published in: SIGIR '08 Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Publisher: Association for Computing Machinery
Research design: Experiment
Revid: 10,748
Theories: Undetermined
Theory type: Design and action
Title: Enhancing text clustering by leveraging Wikipedia semantics
Unit of analysis: Article
URL: http://dl.acm.org/citation.cfm?id=1390367
Wikipedia coverage: Sample data
Wikipedia data extraction: Dump
Wikipedia language: Not specified
Wikipedia page type: Article
Year: 2008