Extraction, selection and ranking of field association (FA) terms from domain-specific corpora for building a comprehensive FA terms dictionary
Abstract Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract and select relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. Experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps selected up to 2,517 FA Terms (single and compound) per field at precision and recall of 74–97% and 65–98%, respectively. This is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, Reuters RCV1 corpus and 20 Newsgroup data set.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments We have presented a methodology to extract and select FA Terms effectively to build a comprehensive FA Terms dictionary.
Conclusion The novel technique of using FA Terms holds much potential for use in many areas of information retrieval and natural language processing, but one of the major problems today is the lack of a comprehensive FA Terms dictionary. Therefore, we have presented a methodology to extract and select FA Terms effectively to build a comprehensive FA Terms dictionary. The methodology is based on POS pattern rules, corpora comparison and modified tf-idf weighting for selecting domain-relevant terms. Experimental evaluation carried out for 21 different fields using 306 MB of domain-specific corpora obtained from a Wikipedia dump selected 22,229 compound FA Terms and 9,005 single FA Terms. The precision and recall were 74–97% and 65–98%, respectively. The results show that the proposed methodology is effective for building a comprehensive dictionary of FA Terms.
Data source Experiment responses  + , Wikipedia pages  +
Doi 10.1007/s10115-010-0296-x +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Extraction%2C%2Bselection%2Band%2Branking%2Bof%2Bfield%2Bassociation%2B%28FA%29%2Bterms%2Bfrom%2Bdomain-specific%2Bcorpora%2Bfor%2Bbuilding%2Ba%2Bcomprehensive%2BFA%2Bterms%2Bdictionary%22  +
Has author Tshering Dorji + , El sayed Atlam + , Susumu Yata + , Masao Fuketa + , Kazuhiro Morita + , Junichi Aoe +
Has domain Computer science +
Has topic Other natural language processing topics +
Month April  +
Peer reviewed Yes  +
Publication type Journal article  +
Published in Knowledge and Information Systems +
Research design Experiment  +
Research questions Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract and select relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. Experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps selected up to 2,517 FA Terms (single and compound) per field at precision and recall of 74–97% and 65–98%, respectively. This is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, Reuters RCV1 corpus and 20 Newsgroup data set.
Revid 10,769  +
Theories Undetermined
Theory type Design and action  +
Title Extraction, selection and ranking of field association (FA) terms from domain-specific corpora for building a comprehensive FA terms dictionary
Unit of analysis Article  +
Url http://dx.doi.org/10.1007/s10115-010-0296-x  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Live Wikipedia  +
Wikipedia language English  +
Wikipedia page type Article  +
Year 2010  +
Creation date 15 March 2012 20:27:57  +
Categories Other natural language processing topics  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:27:33  +