Comparing methods for single paragraph similarity analysis
Abstract The focus of this paper is two-fold. First, similarities generated from six semantic models were compared to human ratings of paragraph similarity on two datasets—23 World Entertainment News Network paragraphs and 50 ABC newswire paragraphs. Contrary to findings on smaller textual units such as word associations (Griffiths, Tenenbaum, & Steyvers, 2007), our results suggest that when single paragraphs are compared, simple nonreductive models (word overlap and vector space) can provide better similarity estimates than more complex models (LSA, Topic Model, SpNMF, and CSM). Second, various methods of corpus creation were explored to facilitate the semantic models' similarity estimates. Removing numeric and single characters, and also truncating document length, improved performance. Automated construction of smaller Wikipedia-based corpora proved to be very effective, even improving upon the performance of corpora that had been chosen for the domain. Model performance was further improved by augmenting corpora with dataset paragraphs.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments Clone: Wikipedia articles: All Wikipedia entries current to March 2007 were downloaded for this research. In total there were 2.8 million Wikipedia entries collected; however, the total number of documents was reduced to 1.57 million after the removal of incomplete articles contained in the original corpus. Experiment: The first dataset, referred to as the WENN dataset, was composed of similarity ratings generated by subjects comparing celebrity gossip paragraphs taken from the World Entertainment News Network (WENN). Secondary pre-processed dataset: Lee et al. (2005) recorded observations of paragraph similarity made by 83 Adelaide University students to form the Lee dataset. Two corpora were chosen to act as knowledge bases for the semantic models to allow similarity estimates to be made on the paragraphs contained in the WENN and Lee datasets. The larger set of 12,787 documents collected from WENN between April 2000 and January 2006 was considered a relevant backgrounding corpus for the 23 paragraphs contained in the WENN dataset; this larger set of documents is henceforth called the WENN corpus. It was not possible to resource the original set of 364 headlines and précis gathered by Lee et al. (2005) from the ABC online news mail service. Therefore, in an attempt to provide a news media-based corpus similar in style to the original corpus of ABC documents used by Lee and colleagues, articles from Canada's Toronto Star newspaper were used. The Toronto Star corpus comprised 55,021 current affairs articles published during 2005.
Conclusion The findings presented in this paper indicate that corpus preprocessing, document length, and content are all important factors that determine a semantic model's ability to estimate human similarity judgments on paragraphs. The online, community-driven Wikipedia encyclopedia also proved to be a valuable resource from which corpora could be derived when a more suitable domain-chosen corpus is not available. In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity judgments and machine evaluations is a result of applied significance.
Data source Wikipedia pages  +
Doi 10.1111/j.1756-8765.2010.01108.x +
Google scholar url  +
Has author Benjamin Stone + , Simon Dennis + , Peter J. Kwantes +
Has domain Psychology +
Has topic Other natural language processing topics +
Peer reviewed Yes  +
Publication type Journal article  +
Published in Topics in Cognitive Science +
Research design Statistical analysis  +
Research questions This paper describes the outcome of a systematic comparison of single paragraph similarities generated by six statistical semantic models to similarities generated by human participants. Paragraph complexity and length can vary widely. Therefore, for the purposes of this research, we define a paragraph as a self-contained section of "news" media (such as a précis), presented in approximately 50–200 words. There are two main themes explored in this paper. At one level it is an evaluation of the semantic models, in which their performance at estimating the similarity of single paragraph documents is compared against human judgments. At another level this paper explores the characteristics of the corpora or knowledge bases utilized by these models that may improve the models' performance when approximating human similarity judgments. With the exception of the word overlap model, a good background knowledge base is essential to the models' performance. To this end, four studies are described in this paper that examine the semantic models' performance relative to human ratings of paragraph similarity. In the first study, semantic models use domain-chosen corpora to generate knowledge spaces on which they make evaluations of similarity for two datasets of paragraphs. Overall, the models performed poorly using these domain-chosen corpora when estimates were compared to those made by human assessors. In the second study, improvements in the models' performance were achieved by more thoroughly preprocessing the domain-chosen corpora to remove all instances of numeric and single alphabetical characters. In the third study, smaller targeted corpora (subcorpora) constructed by querying a larger set of documents (Wikipedia) were examined to assess whether they could produce sufficient performance to be generally useful (Zelikovitz & Kogan, 2006).
In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity evaluations and semantic models' evaluations of paragraph similarity using automated methods of corpus construction is a desirable outcome. Furthermore, the document length of the essay-like Wikipedia articles was manipulated to produce better approximations of human judgment by the semantic models. Finally, in the fourth study, several of the models were found to produce better estimates of paragraph similarity when the dataset paragraphs were included in the models' backgrounding corpus.
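The simple nonreductive baselines discussed in this work (word overlap and vector space) can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the function names are invented, the overlap measure is assumed to be a Jaccard-style ratio of shared word types, and the vector-space measure is assumed to be cosine similarity over raw term-frequency vectors.

```python
import math
from collections import Counter

def overlap(a: str, b: str) -> float:
    """Word-overlap similarity: shared word types / total word types (Jaccard)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    """Vector-space similarity: cosine between raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

# Two toy "paragraphs" sharing two of four word types
print(overlap("a b c", "b c d"))  # → 0.5
```

Both measures operate directly on the paragraphs' surface words with no dimensionality reduction, which is what makes them "nonreductive" in contrast to LSA or the Topic Model.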
Revid 10,706  +
Theories To calculate the similarity of the topic distributions representing documents, we employed both the Dot Product (see Eq. 2) and Jensen-Shannon Divergence (JSD, see Eq. 3). While the Dot Product was employed for convenience, the JSD is a symmetric form of the Kullback-Leibler Divergence (D), which derives from information theory and provides a well-motivated way of comparing probability distributions.
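A minimal sketch of the JSD comparison described above, assuming base-2 logarithms (so scores fall in [0, 1]) and illustrative toy distributions rather than actual topic distributions from the paper:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence: average KL of p and q against their mean.

    Symmetric (jsd(p, q) == jsd(q, p)) and always finite, unlike raw KL.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two toy topic distributions over three topics
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, q))  # → 0.5
```

Symmetrising over the mean distribution is what makes the JSD a well-behaved comparison: the raw KL divergence is asymmetric and blows up wherever q assigns zero probability to a topic that p uses.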
Theory type Analysis  +
Title Comparing methods for single paragraph similarity analysis
Unit of analysis Article  +
Url  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Dump  +
Wikipedia language Not specified  +
Wikipedia page type Article  +
Year 2010  +
Creation date 15 March 2012 20:25:36  +
Categories Other natural language processing topics  + , Psychology  + , Publications  +
Modification date 30 January 2014 20:21:53  +