Web-scale distributional similarity and entity set expansion
Abstract Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments "We evaluated the impact of the large similarity matrix on a set expansion task and found that the Web similarity matrix gave a large performance boost over a state-of-the-art expansion algorithm using Wikipedia." p. 946
Conclusion We proposed a highly scalable term similarity algorithm, implemented in the MapReduce framework, and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms was computed in 50 hours using 200 quad-core nodes. We evaluated the impact of the large similarity matrix on a set expansion task and found that the Web similarity matrix gave a large performance boost over a state-of-the-art expansion algorithm using Wikipedia. Finally, we release to the community a testbed for experimentally analyzing automatic set expansion, which includes a large collection of nearly random entity sets extracted from Wikipedia and over 22,000 randomly sampled seed expansion trials.
Data source Experiment responses  + , Wikipedia pages  +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Web-scale%2Bdistributional%2Bsimilarity%2Band%2Bentity%2Bset%2Bexpansion%22  +
Has author Patrick Pantel + , Eric Crestan + , Arkady Borkovsky + , Ana-Maria Popescu + , Vishnu Vyas +
Has domain Computer science +
Has topic Semantic relatedness +
Pages 938-947  +
Peer reviewed Yes  +
Publication type Conference paper  +
Published in EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 +
Research design Experiment  + , Statistical analysis  +
Research questions Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web.
Revid 11,040  +
Theories Undetermined
Theory type Design and action  +
Title Web-scale distributional similarity and entity set expansion
Unit of analysis Article  +
Url http://dl.acm.org/citation.cfm?id=1699635  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Dump  +
Wikipedia language English  +
Wikipedia page type Article  +
Year 2009  +
Creation date 15 March 2012 20:32:56  +
Categories Semantic relatedness  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:32:21  +