Last modified on January 30, 2014, at 20:21

Comparing methods for single paragraph similarity analysis

Publication
Authors: Benjamin Stone, Simon Dennis, Peter J. Kwantes
Citation: Topics in Cognitive Science. 2010.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1111/j.1756-8765.2010.01108.x
Added by Wikilit team: Added on initial load
Comparing methods for single paragraph similarity analysis is a publication by Benjamin Stone, Simon Dennis, and Peter J. Kwantes.

Abstract

The focus of this paper is two-fold. First, similarities generated from six semantic models were compared to human ratings of paragraph similarity on two datasets—23 World Entertainment News Network paragraphs and 50 ABC newswire paragraphs. Contrary to findings on smaller textual units such as word associations (Griffiths, Tenenbaum, & Steyvers, 2007), our results suggest that when single paragraphs are compared, simple nonreductive models (word overlap and vector space) can provide better similarity estimates than more complex models (LSA, Topic Model, SpNMF, and CSM). Second, various methods of corpus creation were explored to facilitate the semantic models’ similarity estimates. Removing numeric and single characters, and also truncating document length, improved performance. Automated construction of smaller Wikipedia-based corpora proved to be very effective, even improving upon the performance of corpora that had been chosen for the domain. Model performance was further improved by augmenting corpora with dataset paragraphs.
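The abstract's "simple nonreductive models" (word overlap and vector space) are not given formulas on this page. As a rough sketch, assuming a Jaccard-style overlap for the word overlap model and a raw term-frequency cosine for the vector space model, the two baselines might look like this:

```python
import math
from collections import Counter

def word_overlap(a, b):
    """Jaccard-style overlap: shared word types over total word types.
    One plausible reading of a word overlap model; not the paper's exact formula."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cosine_similarity(a, b):
    """Cosine between raw term-frequency vectors of two paragraphs."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

p1 = "the actor attended the premiere in london"
p2 = "the premiere in london drew the actor and fans"
print(word_overlap(p1, p2), cosine_similarity(p1, p2))
```

Both functions score paragraph pairs directly from surface tokens, which is what distinguishes these models from the reduced-dimension approaches (LSA, Topic Model, SpNMF, CSM) the paper compares them against.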

Research questions

"This paper describes the outcome of a systematic comparison of single paragraph similarities generated by six statistical semantic models to similarities generated by human participants. Paragraph complexity and length can vary widely. Therefore, for the purposes of this research, we define a paragraph as a self-contained section of ‘‘news’’ media (such as a pre´cis), presented in approximately 50–200 words. There are two main themes that are explored in this paper. At one level it is an evaluation of the semantic models, in which their performance at estimating the similarity of single paragraph documents is compared against human judgments. ......... At another level this paper explores the characteristics of the corpora or knowledge bases utilized by these models that may improve models’ performance when approximating human similarity judgments. With the exception of the word overlap model, a good background knowledge base is essential to the models’ performance. To this end, four studies are described in this paper that examine the semantic models’ performance relative to human ratings of paragraph similarity. In the first study, semantic models use domain-chosen corpora to generate knowledge spaces on which they make evaluations of similarity for two datasets of paragraphs. Overall, the models performed poorly using these domain-chosen corpora when estimates were compared to those made by human assessors. In the second study, improvements in the models’ performance were achieved by more thoroughly preprocessing the domain-chosen corpora to remove all instances of numeric and single alphabetical characters. In the third study, smaller targeted corpora (subcorpora) constructed by querying a larger set of documents (Wikipedia4) were examined to assess whether they could produce sufficient performance to be generally useful (Zelikovitz & Kogan, 2006). 
In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity evaluations and semantic models’ evaluations of paragraph similarity using automated methods of corpus construction is a desirable outcome. Furthermore, document length of the essay-like Wikipedia articles was manipulated to produce better approximations of human judgment by the semantic models. Finally, in the fourth study, several of the models were found to produce better estimates of paragraph similarity when the dataset paragraphs were included in the models’ backgrounding corpus."
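The preprocessing described in the second and third studies — stripping numeric and single alphabetical characters, then truncating document length — could be sketched as follows. The token limit of 250 is an illustrative placeholder, not a value taken from the paper:

```python
import re

def preprocess(doc, max_tokens=250):
    """Drop purely numeric tokens and single characters, then truncate.

    max_tokens is a hypothetical cutoff standing in for the paper's
    document-length manipulation on essay-like Wikipedia articles.
    """
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    kept = [t for t in tokens if not t.isdigit() and len(t) > 1]
    return kept[:max_tokens]
```

Under this reading, the same filter serves both studies: the character removal cleans the domain-chosen corpora, and the truncation normalizes the longer Wikipedia articles before they back the semantic models.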

Research details

Topics: Other natural language processing topics
Domains: Psychology
Theory type: Analysis
Wikipedia coverage: Sample data
Theories: "To calculate the similarity of the topic distributions representing documents, we employed both the Dot Product (see Eq. 2) and Jensen-Shannon Divergence (JSD, see Eq. 3). While the Dot Product was employed for convenience, the JSD is a symmetric form of the Kullback-Leibler Divergence (D), which derives from information theory and provides a well-motivated way of comparing probability distributions."
Research design: Statistical analysis
Data source: Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified
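The two similarity measures quoted under "Theories" can be written out generically. This sketch uses log base 2, so the JSD of two distributions lies in [0, 1]; the paper's Eq. 2 and Eq. 3 are not reproduced on this page, so the exact forms here are standard textbook definitions rather than the authors':

```python
import math

def dot_product(p, q):
    """Dot product of two topic distributions."""
    return sum(pi * qi for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence: symmetrised KL against the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because the mixture m is never zero where p or q has mass, the JSD avoids the undefined terms that make raw KL divergence awkward for comparing sparse topic distributions — the "well-motivated" property the quoted passage refers to.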

Conclusion

"The findings presented in this paper indicate that corpus preprocessing, document length, and content are all important factors that determine a semantic model's ability to estimate human similarity judgments on paragraphs. The online, community-driven Wikipedia encyclopedia also proved to be a valuable resource from which corpora could be derived when a more suitable domain-chosen corpus is not available. In many applications the hand construction of corpora for a particular domain is not feasible, and so the ability to show a good match between human similarity judgments and machine evaluations is a result of applied significance."

Comments

"Clone: Wikipedia articles: All Wikipedia entries current to March 2007 were downloaded for this research. In total there were 2.8 million Wikipedia entries collected; however, the total number of documents was reduced to 1.57 million after the removal of incomplete articles contained in the original corpus

Experiment: The first (dataset), which we will refer to as the WENN dataset, was composed of similarity ratings generated by subjects comparing celebrity gossip paragraphs taken from the World Entertainment News Network (WENN).

Secondary pre-processed dataset: Lee et al. (2005) recorded observations of paragraph similarity made by 83 Adelaide University students to form the Lee dataset.

Two corpora were chosen to act as knowledge bases for the semantic models to allow similarity estimates to be made on the paragraphs contained in the WENN and Lee datasets. The larger set of 12,787 documents collected from WENN between April 2000 and January 2006 was considered a relevant backgrounding corpus for the 23 paragraphs contained in the WENN dataset; this larger set of documents is henceforth called the WENN corpus. It was not possible to resource the original set of 364 headlines and précis gathered by Lee et al. (2005) from the ABC online news mail service. Therefore, in an attempt to provide a news media-based corpus that was similar in style to the original corpus of ABC documents used by Lee and colleagues, articles from Canada's Toronto Star newspaper were used. Moreover, the Toronto Star corpus comprised of 55,021 current affairs articles published during 2005."

Further notes