Feature generation for textual information retrieval using world knowledge

Publication
Feature generation for textual information retrieval using world knowledge
Authors: Evgeniy Gabrilovich
Citation: PhD thesis, Technion - Israel Institute of Technology, Haifa, Israel, December 2006.
Publication type: Thesis
Peer-reviewed: Yes
Database(s):
DOI:
Google Scholar cites:
Link(s): http://www.cs.technion.ac.il/~gabr/papers/phd-thesis.pdf
Added by Wikilit team: Added on initial load
Feature generation for textual information retrieval using world knowledge is a publication by Evgeniy Gabrilovich.


Abstract

Imagine an automatic news filtering system that tracks company news. Given the news item "FDA approves ciprofloxacin for victims of anthrax inhalation", how can the system know that the drug mentioned is an antibiotic produced by Bayer? Or consider an information professional searching for data on RFID technology - how can a computer understand that the item "Wal-Mart supply chain goes real time" is relevant for the search? Algorithms we present can do just that.

Research questions

"We therefore propose an alternative solution that capitalizes on the power of existing induction techniques while enriching the language of representation, namely, exploring new feature spaces. Prior to text categorization, we employ a feature generator that uses common-sense and domain-speci¯c knowledge to enrich the bag of words with new, more informative and discriminating features. Feature generation is performed automatically, using machine-readable reposito- ries of knowledge. Many sources of world knowledge have become available in recent years, thanks to rapid advances in information processing, and Internet proliferation in particular. Examples of general purpose knowledge bases include the Open Directory Project (ODP), Yahoo! Web Directory, and the Wikipedia encyclopedia."

Research details

Topics: Textual information retrieval
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Feature generation (FG), also known as feature construction, constructive induc-

tion or bias shift, is a process of building new features based on those present in the examples supplied to the system, possibly using the domain theory (i.e., in- formation about goals, constraints and operators of the domain) (Fawcett, 1993). Feature construction techniques can be useful when the attributes supplied with the data are insu±cient for concise concept learning." [edit item]

Research design: Experiment
Data source:
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Live Wikipedia
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"In this work we instantiated our feature generation methodology with two spe- ci¯c knowledge repositories, the Open Directory Project and the Wikipedia en- cyclopedia. We succeeded to make use of an encyclopedia without deep language understanding, specially crafted inference rules or relying on additional common- sense knowledge bases. This was made possible by applying standard text clas- si¯cation techniques to match document texts with relevant Wikipedia articles. The Wikipedia-based results are superior to the ODP-based ones on a number of datasets, and are comparable to it on others. Moreover, using Wikipedia im- poses fewer restrictions on suitable knowledge repositories, and does not assume the availability of an ontology. In our future work, we intend to study possi- ble ways for combining two or more knowledge repositories for improving text categorization performance even further. We also described multi-resolution analysis, which examines the document text at several levels of linguistic abstraction and performs feature generation at each level. When polysemous words are considered in their native context, word sense disambiguation is implicitly performed. Considering local contexts allows the feature generator to cope with word synonymy and polysemy. Furthermore, when the document text is processed at several levels of granularity, even brie°y mentioned aspects can be identi¯ed and used. These might easily have been overlooked if the document were processed as one large chunk of text. Empirical evaluation de¯nitively con¯rmed the value of knowledge-based fea- ture generation for text categorization across a range of datasets. Recently, the performance of the best text categorization systems became similar, as if a plateau has been reached, and previous work mostly achieved small improve- ments. Using the ODP and Wikipedia allowed us to reap much greater bene¯ts and to bring text categorization to a qualitatively new level of performance, with double-digit improvements observed on a number of datasets. Given the domain- speci¯c nature of some test collections, we also compared the utility of narrow domain-speci¯c knowledge with that of larger amounts of information covering all branches of knowledge (Section 5.3.4). Perhaps surprisingly, we found that even for narrow-scope test collections, a wide coverage knowledge base yielded substantially greater improvements than its domain-speci¯c subsets. This obser- vation reinforces the breadth hypothesis, formulated by Lenat and Feigenbaum (1990), that \to behave intelligently in unexpected situations, an agent must be capable of falling back on increasingly general knowledge." We also applied our feature generation methodology to the problem of au- tomatically assessing semantic relatedness of words and texts. To this end, we presented a novel technique, called Explicit Semantic Analysis, for represent- ing semantics of natural language texts using natural concepts. In contrast to existing methods, ESA o®ers a uniform way for computing relatedness of both individual words and arbitrarily long text fragments. Moreover, using natural concepts makes the ESA model easy to interpret, as can be seen in the examples we provided. Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with hu- man judgements: from r = 0:56 to 0:75 for individual words and from r = 0:60 to 0:72 for texts. 
Consequently, we anticipate ESA to give rise to the next generation of natural language processing tools."
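
The Explicit Semantic Analysis technique summarized above can likewise be sketched briefly. In the illustration below, three hand-written concept snippets stand in for Wikipedia and TF-IDF cosine scores stand in for the thesis's concept weights; only the overall shape of the computation (text, then a vector of concept weights, then cosine relatedness) is taken from the description in the conclusion.

# Illustrative sketch of Explicit Semantic Analysis (ESA): texts are
# represented as weighted vectors over Wikipedia concepts, and semantic
# relatedness is the cosine between those concept vectors. The three
# concepts below are a toy stand-in for the full encyclopedia.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concepts = {
    "Computer": "A computer executes programs and processes data with a keyboard and screen.",
    "Mouse (computing)": "A computer mouse is a pointing device used alongside a keyboard.",
    "Mouse": "A mouse is a small rodent found in fields and houses.",
}
vectorizer = TfidfVectorizer(stop_words="english")
concept_matrix = vectorizer.fit_transform(concepts.values())

def esa_vector(text: str) -> np.ndarray:
    """Project a text onto the concept space: one weight per concept."""
    return cosine_similarity(vectorizer.transform([text]), concept_matrix).ravel()

def relatedness(text_a: str, text_b: str) -> float:
    """Semantic relatedness as the cosine of two concept-space vectors."""
    a, b = esa_vector(text_a), esa_vector(text_b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0

# Works uniformly for single words and for longer fragments.
print(relatedness("mouse", "pointing device for a computer"))
print(relatedness("mouse", "a rodent in the field"))

Because any string can be projected into the concept space this way, the same function scores relatedness for single words and for long text fragments alike, which is the uniformity the conclusion highlights; the thesis builds the concept space from the full Wikipedia rather than a toy dictionary.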

Comments

"We therefore propose an alternative solution that capitalizes on the power of existing induction techniques while enriching the language of representation, namely, exploring new feature spaces."


