Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge

Authors: Evgeniy Gabrilovich, Shaul Markovitch
Citation: AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence, Volume 2, pp. 1301-1306. July 16-20, 2006. Menlo Park, CA, USA. AAAI Press.
Publication type: Conference paper
Peer-reviewed: Yes
Added by Wikilit team: Added on initial load
Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge is a publication by Evgeniy Gabrilovich and Shaul Markovitch.


Abstract

When humans approach the task of text categorization, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. On the other hand, state-of-the-art information retrieval systems are quite brittle--they traditionally represent documents as bags of words, and are restricted to learning from individual word occurrences in the (necessarily limited) training set. For instance, given the sentence "Wal-Mart supply chain goes real time", how can a text categorization system know that Wal-Mart manages its stock with RFID technology? And having read that "Ciprofloxacin belongs to the quinolones group", how on earth can a machine know that the drug mentioned is an antibiotic produced by Bayer? In this paper we present algorithms that can do just that. We propose to enrich document representation through automatic use of a vast compendium of human knowledge--an encyclopedia. We apply machine learning techniques to Wikipedia, the largest encyclopedia to date, which surpasses in scope many conventional encyclopedias and provides a cornucopia of world knowledge. Each Wikipedia article represents a concept, and documents to be categorized are represented in the rich feature space of words and relevant Wikipedia concepts. Empirical results confirm that this knowledge-intensive representation brings text categorization to a qualitatively new level of performance across a diverse collection of datasets.
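
The idea of representing a document in a combined feature space of words and Wikipedia concepts can be illustrated with a small sketch. This is not the authors' feature generator: the three "articles" in TOY_ARTICLES are invented stand-ins for Wikipedia text, the TF-IDF cosine matching is a simple substitute chosen for illustration, and the names build_concept_index and enrich are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for Wikipedia article text; not real article content.
TOY_ARTICLES = {
    "Antibiotic": "antibiotic drug bacteria infection quinolones ciprofloxacin penicillin",
    "RFID": "rfid radio tag inventory supply chain tracking stock",
    "Retail": "retail store supply chain stock customers shopping",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_concept_index(articles):
    """Return ({concept: {word: tf-idf weight}}, {word: idf}) for the articles."""
    counts = {name: Counter(tokenize(body)) for name, body in articles.items()}
    n_articles = len(counts)
    # Document frequency: in how many articles each word appears.
    doc_freq = Counter(word for c in counts.values() for word in c)
    idf = {w: math.log((1 + n_articles) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    index = {name: {w: tf * idf[w] for w, tf in c.items()} for name, c in counts.items()}
    return index, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrich(text, index, idf, top_k=2):
    """Bag-of-words features plus the top_k best-matching concepts."""
    words = Counter(tokenize(text))
    # Only words seen in the concept index can contribute to matching.
    doc_vec = {w: tf * idf[w] for w, tf in words.items() if w in idf}
    scores = {name: cosine(doc_vec, vec) for name, vec in index.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    features = {f"word:{w}": float(tf) for w, tf in words.items()}
    for name, score in ranked[:top_k]:
        if score > 0.0:
            features[f"concept:{name}"] = score
    return features


if __name__ == "__main__":
    index, idf = build_concept_index(TOY_ARTICLES)
    for sentence in (
        "Ciprofloxacin belongs to the quinolones group",
        "Wal-Mart supply chain goes real time",
    ):
        print(sentence, "->", sorted(enrich(sentence, index, idf)))
```

Under these toy assumptions, the resulting dictionary could be passed to any standard classifier in place of a plain bag-of-words vector; the concept features carry information that never appears in the document's own wording.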

Research questions

"state-of-the-art information retrieval systems are quite brittle--they traditionally represent documents as bags of words, and are restricted to learning from individual word occurrences in the (necessarily limited) training set. In this paper we present algorithms that can resolve that. We propose to enrich document representation through automatic use of a vast compendium of human knowledge--an encyclopedia."

Research details

Topics: Text classification
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Main topic
Theories: Undetermined
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"We succeeded to make use of an encyclopedia without deep language understanding and without relying on additional common-sense knowledge bases. This was made possible by applying standard text classification techniques to match document texts with relevant Wikipedia articles. Empirical evaluation definitively confirmed the value of encyclopedic knowledge for text categorization across a range of datasets. Recently, the performance of the best text categorization systems became similar, as if a plateau has been reached, and previous work mostly achieved improvements of up to a few percentage points. Using Wikipedia allowed us to reap much greater benefits, with double-digit improvements observed on a number of datasets."

Comments

"Empirical evaluation definitively confirmed the value of encyclopedic knowledge for text categorization across a range of datasets." (p. 1306)


Further notes

Facts about "Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge"
Google Scholar URL: http://scholar.google.com/scholar?ie=UTF-8&q=%22Overcoming%2Bthe%2Bbrittleness%2Bbottleneck%2Busing%2BWikipedia%3A%2Benhancing%2Btext%2Bcategorization%2Bwith%2Bencyclopedic%2Bknowledge%22
Revid: 10,898
URL: http://www.citeulike.org/group/382/article/2157092