Building semantic kernels for text classification using Wikipedia

Authors: Pu Wang, Carlotta Domeniconi
Citation: International Conference on Knowledge Discovery and Data Mining: 713-721. 2008. New York, USA. Association for Computing Machinery.
Publication type: Conference paper
Peer-reviewed: Yes
DOI: 10.1145/1401890.1401976
Added by Wikilit team: Added on initial load
Building semantic kernels for text classification using Wikipedia is a publication by Pu Wang and Carlotta Domeniconi.


Abstract

Document classification presents difficult challenges due to the sparsity and the high dimensionality of text data, and to the complex semantics of the natural language. The traditional document representation is a word-based vector (Bag of Words, or BOW), where each dimension is associated with a term of the dictionary containing all the words that appear in the corpus. Although simple and commonly used, this representation has several limitations. It is essential to embed semantic information and conceptual patterns in order to enhance the prediction capabilities of classification algorithms. In this paper, we overcome the shortages of the BOW approach by embedding background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the representation of documents. Our empirical evaluation with real data sets demonstrates that our approach successfully achieves improved classification accuracy with respect to the BOW technique, and to other recently developed methods.
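
To make the limitation of the BOW representation concrete, here is a minimal Python sketch (not taken from the paper). The toy documents, the helper names `bow_vector` and `cosine`, and the use of cosine similarity are assumptions made purely for illustration. It shows that two documents about the same animal, written with synonyms, share no dimensions and therefore look unrelated.

```python
import math
from collections import Counter


def bow_vector(text, vocabulary):
    """Map a document to term counts over a fixed vocabulary (one dimension per term)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents: same topic, but no term in common.
doc_a = "puma spotted near town"
doc_b = "cougar seen close to village"

# The dictionary contains every word that appears in the (toy) corpus.
vocabulary = sorted({word for doc in (doc_a, doc_b) for word in doc.lower().split()})

vec_a = bow_vector(doc_a, vocabulary)
vec_b = bow_vector(doc_b, vocabulary)

# The vectors are orthogonal, so BOW similarity is zero despite the shared meaning.
similarity = cosine(vec_a, vec_b)
print(similarity)
```

Because "puma" and "cougar" occupy different dimensions, no amount of term weighting on these vectors alone can recover their relatedness; that is the gap the background knowledge is meant to fill.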

Research questions

"In this paper, we overcome the shortages of the BOW approach by embedding background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the representation of documents. Our empirical evaluation with real data sets demonstrates that our approach successfully achieves improved classification accuracy with respect to the BOW technique, and to other recently developed methods."

Research details

Topics: Text classification
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Archival records, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"To the best of our knowledge, this paper represents a first attempt to improve text classification by defining concept-based kernels using Wikipedia. Our approach overcomes the limitations of the bag-of-words approach by incorporating background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the content of documents. This methodology is able to keep multi-word concepts unbroken, it captures the semantic closeness to synonyms, and performs word sense disambiguation for polysemous terms. We note that our approach to highlight the semantic content of documents, from the definition of a proximity matrix, to the disambiguation of terms and to the identification of eligible candidate concepts, is totally unsupervised, i.e. makes no use of the class labels associated to documents. Thus, the same enrichment procedure could be extended to enhance the clustering of documents, when indeed class labels are not available, or too expensive to obtain. On the other hand, for classification problems where class labels are available, one could use them to facilitate the disambiguation process, and the identification of crucial concepts in a document."
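
The role of a proximity matrix in a semantic kernel can be sketched as follows. This is a generic illustration in Python, not the authors' implementation: the three-term vocabulary, the proximity values in `P`, and the helper names `mat_vec` and `semantic_kernel` are all hypothetical, and the values are hand-picked rather than derived from Wikipedia. The sketch assumes the common form k(d1, d2) = (d1 P) . (d2 P), where P holds term-to-term proximities.

```python
def mat_vec(vector, matrix):
    """Multiply a row vector by a matrix (vector @ matrix) using plain lists."""
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(n_cols)
    ]


def semantic_kernel(d1, d2, proximity):
    """Inner product of two documents after enriching each with the proximity matrix."""
    e1 = mat_vec(d1, proximity)
    e2 = mat_vec(d2, proximity)
    return sum(a * b for a, b in zip(e1, e2))


# Hypothetical vocabulary and hand-picked proximities (assumption, for illustration only).
terms = ["puma", "cougar", "stock"]
P = [
    [1.0, 0.9, 0.0],  # puma is close to cougar, unrelated to stock
    [0.9, 1.0, 0.0],  # cougar is close to puma
    [0.0, 0.0, 1.0],  # stock is related only to itself
]

doc_puma = [1, 0, 0]
doc_cougar = [0, 1, 0]
doc_stock = [0, 0, 1]

# Plain BOW inner product: the synonym documents do not overlap at all.
linear = sum(a * b for a, b in zip(doc_puma, doc_cougar))

# Semantic kernel: the proximity matrix lets the synonyms contribute to similarity.
k_synonyms = semantic_kernel(doc_puma, doc_cougar, P)
k_unrelated = semantic_kernel(doc_puma, doc_stock, P)
print(linear, k_synonyms, k_unrelated)
```

With an identity matrix in place of `P`, the function reduces to the ordinary BOW inner product, which is one way to see the enrichment as a strict generalization of the word-based representation.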

Facts about "Building semantic kernels for text classification using Wikipedia"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Conference location: New York, USA
Data source: Experiment responses, Archival records, Wikipedia pages
DOI: 10.1145/1401890.1401976
Google Scholar URL: http://scholar.google.com/scholar?ie=UTF-8&q=%22Building%2Bsemantic%2Bkernels%2Bfor%2Btext%2Bclassification%2Busing%2BWikipedia%22
Has author: Pu Wang, Carlotta Domeniconi
Has domain: Computer science
Has topic: Text classification
Pages: 713-721
Peer reviewed: Yes
Publication type: Conference paper
Published in: International Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Research design: Experiment
Revid: 10,689
Theories: Undetermined
Theory type: Design and action
Title: Building semantic kernels for text classification using Wikipedia
Unit of analysis: Article
URL: http://dl.acm.org/citation.cfm?id=1401976
Wikipedia coverage: Main topic
Wikipedia data extraction: Dump
Wikipedia language: Not specified
Wikipedia page type: Article
Year: 2008