Feature generation for textual information retrieval using world knowledge
Abstract Imagine an automatic news filtering system that tracks company news. Given the news item "FDA approves ciprofloxacin for victims of anthrax inhalation", how can the system know that the drug mentioned is an antibiotic produced by Bayer? Or consider an information professional searching for data on RFID technology: how can a computer understand that the item "Wal-Mart supply chain goes real time" is relevant for the search? Algorithms we present can do just that.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments We therefore propose an alternative solution that capitalizes on the power of existing induction techniques while enriching the language of representation, namely, exploring new feature spaces.
Conclusion In this work we instantiated our feature generation methodology with two specific knowledge repositories, the Open Directory Project and the Wikipedia encyclopedia. We succeeded to make use of an encyclopedia without deep language understanding, specially crafted inference rules or relying on additional common-sense knowledge bases. This was made possible by applying standard text classification techniques to match document texts with relevant Wikipedia articles. The Wikipedia-based results are superior to the ODP-based ones on a number of datasets, and are comparable to it on others. Moreover, using Wikipedia imposes fewer restrictions on suitable knowledge repositories, and does not assume the availability of an ontology. In our future work, we intend to study possible ways for combining two or more knowledge repositories for improving text categorization performance even further. We also described multi-resolution analysis, which examines the document text at several levels of linguistic abstraction and performs feature generation at each level. When polysemous words are considered in their native context, word sense disambiguation is implicitly performed. Considering local contexts allows the feature generator to cope with word synonymy and polysemy. Furthermore, when the document text is processed at several levels of granularity, even briefly mentioned aspects can be identified and used. These might easily have been overlooked if the document were processed as one large chunk of text. Empirical evaluation definitively confirmed the value of knowledge-based feature generation for text categorization across a range of datasets. Recently, the performance of the best text categorization systems became similar, as if a plateau has been reached, and previous work mostly achieved small improvements.
Using the ODP and Wikipedia allowed us to reap much greater benefits and to bring text categorization to a qualitatively new level of performance, with double-digit improvements observed on a number of datasets. Given the domain-specific nature of some test collections, we also compared the utility of narrow domain-specific knowledge with that of larger amounts of information covering all branches of knowledge (Section 5.3.4). Perhaps surprisingly, we found that even for narrow-scope test collections, a wide coverage knowledge base yielded substantially greater improvements than its domain-specific subsets. This observation reinforces the breadth hypothesis, formulated by Lenat and Feigenbaum (1990), that "to behave intelligently in unexpected situations, an agent must be capable of falling back on increasingly general knowledge." We also applied our feature generation methodology to the problem of automatically assessing semantic relatedness of words and texts. To this end, we presented a novel technique, called Explicit Semantic Analysis, for representing semantics of natural language texts using natural concepts. In contrast to existing methods, ESA offers a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Moreover, using natural concepts makes the ESA model easy to interpret, as can be seen in the examples we provided. Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgements: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Consequently, we anticipate ESA to give rise to the next generation of natural language processing tools.
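The idea of representing a text by its affinity to natural concepts, and comparing two texts through those representations, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from this page: the four-entry `CONCEPTS` dictionary stands in for a real encyclopedia, TF-IDF is assumed as the word-to-concept weighting, and cosine similarity is assumed as the relatedness measure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature "encyclopedia": concept name -> article text.
CONCEPTS = {
    "Antibiotic": "antibiotic drug bacteria infection ciprofloxacin treatment",
    "Anthrax": "anthrax bacteria infection spores inhalation disease",
    "Supply chain": "supply chain logistics inventory retailer warehouse",
    "RFID": "rfid tag radio inventory tracking retailer warehouse",
}

def tokenize(text):
    return text.lower().split()

def build_index(concepts):
    """Map each word to {concept: weight}, using TF-IDF (an assumed weighting)."""
    n = len(concepts)
    tf = {c: Counter(tokenize(t)) for c, t in concepts.items()}
    df = Counter(w for counts in tf.values() for w in counts)
    index = defaultdict(dict)
    for concept, counts in tf.items():
        for word, k in counts.items():
            index[word][concept] = k * math.log(n / df[word])
    return index

def concept_vector(text, index):
    """Sum the concept weights of every known word in the text."""
    vec = defaultdict(float)
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)

def relatedness(a, b, index):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = concept_vector(a, index), concept_vector(b, index)
    dot = sum(va[c] * vb.get(c, 0.0) for c in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

INDEX = build_index(CONCEPTS)
```

Because words and longer fragments are mapped into the same concept space, `relatedness("rfid tag", "inventory warehouse", INDEX)` and a comparison of two full paragraphs go through the same function, which is the uniformity the conclusion describes. The resulting vectors are also readable: each dimension is a named concept.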
Conference location Haifa, Israel +
Data source Experiment responses  + , Wikipedia pages  +
Google scholar url  +
Has author Evgeniy Gabrilovich +
Has domain Computer science +
Has topic Textual information retrieval +
Month December  +
Peer reviewed Yes  +
Publication type Thesis  +
Published in Technion - Israel Institute of Technology +
Research design Experiment  +
Research questions We therefore propose an alternative solution that capitalizes on the power of existing induction techniques while enriching the language of representation, namely, exploring new feature spaces. Prior to text categorization, we employ a feature generator that uses common-sense and domain-specific knowledge to enrich the bag of words with new, more informative and discriminating features. Feature generation is performed automatically, using machine-readable repositories of knowledge. Many sources of world knowledge have become available in recent years, thanks to rapid advances in information processing, and Internet proliferation in particular. Examples of general purpose knowledge bases include the Open Directory Project (ODP), Yahoo! Web Directory, and the Wikipedia encyclopedia.
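A minimal sketch of enriching a bag of words with generated features might look as follows. The `KNOWLEDGE` dictionary, the keyword-overlap scoring in `top_concepts`, and the `CONCEPT:` feature prefix are all hypothetical stand-ins chosen for brevity; the thesis matches text against ODP and Wikipedia with text classification techniques, which this sketch does not reproduce. Contexts are taken at two granularities (each sentence and the whole document) to echo the multi-resolution analysis mentioned in the conclusion.

```python
from collections import Counter

# Hypothetical knowledge repository: concept -> descriptive keywords.
KNOWLEDGE = {
    "Antibiotics": {"ciprofloxacin", "antibiotic", "penicillin", "bacteria"},
    "Bayer": {"bayer", "ciprofloxacin", "aspirin", "pharmaceutical"},
    "Retail logistics": {"wal-mart", "supply", "chain", "inventory"},
}

def tokenize(text):
    tokens = (w.strip('.,"?!').lower() for w in text.split())
    return [t for t in tokens if t]

def top_concepts(words, k=2):
    """Rank concepts by keyword overlap with the context (a toy matcher)."""
    present = set(words)
    scores = {c: len(kw & present) for c, kw in KNOWLEDGE.items()}
    ranked = sorted((c for c in scores if scores[c] > 0),
                    key=lambda c: (-scores[c], c))
    return ranked[:k]

def enrich(document, k=2):
    """Return the bag of words extended with generated concept features."""
    words = tokenize(document)
    features = Counter(words)
    # Contexts at two resolutions: each sentence, then the whole document.
    sentences = [tokenize(s) for s in document.split(".") if s.strip()]
    for context in sentences + [words]:
        for concept in top_concepts(context, k):
            features["CONCEPT:" + concept] += 1
    return features
```

Calling `enrich("FDA approves ciprofloxacin for victims of anthrax inhalation.")` keeps the original word counts and adds `CONCEPT:Antibiotics` and `CONCEPT:Bayer` features, which a downstream classifier could then use even though neither word appears in the news item.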
Revid 10,771  +
Theories Feature generation (FG), also known as feature construction, constructive induction or bias shift, is a process of building new features based on those present in the examples supplied to the system, possibly using the domain theory (i.e., information about goals, constraints and operators of the domain) (Fawcett, 1993). Feature construction techniques can be useful when the attributes supplied with the data are insufficient for concise concept learning.
Theory type Design and action  +
Title Feature generation for textual information retrieval using world knowledge
Unit of analysis Article  +
Url  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Live Wikipedia  +
Wikipedia language Not specified  +
Wikipedia page type Article  +
Year 2006  +
Creation date 15 March 2012 20:28:22  +
Categories Textual information retrieval  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:27:33  +