Facet-based opinion retrieval from blogs

Authors: Olga Vechtomova
Citation: Information Processing and Management 46 (1): 71-88. 2010.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1016/j.ipm.2009.06.005.
Added by Wikilit team: Added on initial load
Facet-based opinion retrieval from blogs is a publication by Olga Vechtomova.


Abstract

The paper presents methods of retrieving blog posts containing opinions about an entity expressed in the query. The methods use a lexicon of subjective words and phrases compiled from manually and automatically developed resources. One of the methods uses the Kullback-Leibler divergence to weight subjective words occurring near query terms in documents, another uses proximity between the occurrences of query terms and subjective words in documents, and the third combines both factors. Methods of structuring queries into facets, facet expansion using Wikipedia, and a facet-based retrieval are also investigated in this work. The methods were evaluated using the TREC 2007 and 2008 Blog track topics, and proved to be highly effective.
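
The facet structuring and Wikipedia-based expansion mentioned in the abstract can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: the `wiki_titles` set and `wiki_redirects` mapping are assumed to be loaded from a Wikipedia dump, and the greedy longest-match step and the `full_expansion` switch are assumptions made here for illustration.

```python
def build_faceted_query(topic_title, wiki_titles, wiki_redirects, full_expansion=False):
    """Sketch of facet construction and Wikipedia-based expansion.

    `wiki_titles` is a set of (lower-cased) Wikipedia article titles and
    `wiki_redirects` maps a redirect title to its target title; both are
    assumed to be loaded from a dump.
    """
    words = topic_title.lower().split()
    facets = []
    i = 0
    while i < len(words):
        # Greedily match the longest phrase starting at position i against
        # Wikipedia article titles; fall back to the single word.
        concept, next_i = words[i], i + 1
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in wiki_titles or phrase in wiki_redirects:
                concept, next_i = phrase, j
                break
        i = next_i

        facet = {concept}
        # Limited expansion: add only the title the matched concept redirects to.
        target = wiki_redirects.get(concept, concept)
        facet.add(target)
        if full_expansion:
            # Full expansion: also add other redirect titles pointing to the
            # same target.
            facet.update(r for r, t in wiki_redirects.items() if t == target)
        facets.append(facet)
    return facets
```

For a topic title containing two Wikipedia concepts, this yields two facets, each expanded with the title its redirect resolves to; setting full_expansion=True additionally pulls in other redirects to the same target, corresponding roughly to the "fullQE" variant that the conclusion reports to be less effective than the limited one.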

Research questions

"The paper presents methods of retrieving blog posts containing opinions about an entity expressed in the query. The methods use a lexicon of subjective words and phrases compiled from manually and automatically developed resources. One of the methods uses the Kullback-Leibler divergence to weight subjective words occurring near query terms in documents, another uses proximity between the occurrences of query terms and subjective words in documents, and the third combines both factors. Methods of structuring queries into facets, facet expansion using Wikipedia, and a facet-based retrieval are also investigated in this work. The methods were evaluated using the TREC 2007 and 2008 Blog track topics, and proved to be highly effective."

Research details

Topics: Textual information retrieval
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "The Kullback-Leibler divergence measures the relative entropy between two probability distributions. It was defined in

information theory (Losee, 1990) and used in many information retrieval and natural language processing tasks, for example, in query expansion following pseudo-relevance feedback (Carpineto et al., 2001)." [edit item]

Research design: Design science, Experiment, Mathematical modeling, Statistical analysis
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Not specified
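
For reference, the Kullback-Leibler divergence named in the Theories field, together with the term-level form commonly used for weighting in pseudo-relevance feedback (as in Carpineto et al., 2001), can be written as below. The notation p(t|W) for the language model of windows around query-term occurrences and p(t|C) for the collection model is an assumption made here for illustration, not necessarily the paper's exact formulation.

```latex
% Relative entropy between two probability distributions P and Q
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}

% A common term-level weight derived from it, with W a set of windows around
% query-term occurrences and C the whole collection (assumed notation)
\mathrm{KLD}(t) \;=\; p(t \mid W)\,\log\frac{p(t \mid W)}{p(t \mid C)}
```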

Conclusion

"In this paper new methods of retrieving documents containing opinions expressed about an entity or entities specified by the user in the query were proposed. The main stages of the proposed methods are as follows: (1) Collection pre-processing. Our experiments demonstrate that this stage has a significant impact on the effectiveness of document retrieval from blogs in terms of both topic- and opinion-relevance. The major performance-improving steps at this stage include the removal of HTML tags, scripts and style definitions, and all lines where the hyperlinks account for 50% or more of the words. (2) Query processing. A method of building faceted queries by utilising Wikipedia was developed. The method consisted of the following steps: identifying concepts in the topic titles by matching them to Wikipedia article titles, grouping concepts into facets, and expanding each facet with new concepts by using Wikipedia article redirects and valid abbreviations. The evaluation of different query processing levels demonstrates the merit of all three steps in query processing. Expanding queries using only target pages, redirected to from the Wikipedia page titles found in the query (“limitedQE-Wiki-phrases”) is better than expanding the query with other pages redirecting to the same targets (“fullQE-Wiki-phrases”). (3) Document retrieval. Retrieval of the initial document set using a topic-based ranking method, such as BM25. (4) Opinion-based document re-ranking. Three methods were proposed: - KLD-based method (KLD), using the Kullback-Leibler divergence scores of the subjective words in the windows around query term occurrences; - Proximity-based method (dist), using distances between a query term occurrence and each of the co-occurring subjective words; - A method combining the previous two (KLD+dist). In addition, all of these methods contain a Facet Distance component, which factors in the distance between query terms/phrases from different facets, and a Facet Validation component, which down-ranks documents that do not contain at least one concept from each facet. Evaluation demonstrates that the proposed methods are highly effective, and are among the best-performing methods developed by the Blog track 2007 and 2008 participants. Specifically, the proposed methods achieved the highest improvements over the standard baseline run provided by Blog 2008 organisers “Baseline 4” compared to other opinionfinding runs submitted by the participants. Series of experiments were conducted to determine the effect of the major components (FV, FD, KLD and dist) on performance. The results indicate that all components, in general, have a positive effect on performance. However, the proximity of query terms to subjective words does not always improve the performance when used in conjunction with KL divergence of subjective words. Specifically, “KLD+dist-FD-FV-subj-bm25” yielded lower MAPop and P10op than the run “KLD-FD-FV-subj-bm25” on limitedQE-Wiki-phrase baseline (Blog 2007 topics). “KLD+dist-FD-FV-subj-bm25” and “KLD+dist-FD-FV-subj-b4” on the other hand, yielded higher MAPop and R-precisionop on the other two baselines: limitedQE-Wiki-phrase (Blog 2008 topics) and Baseline 4. An analysis of the methods’ performance by topic categories based on the type of entity expressed in the query was performed. It was found that the methods are most effective in finding opinions about events, products, geographical locations and people. 
They were least effective in finding opinions about entities in the category “media/art”, which included TV shows, films and books, and in the category “miscellaneous”, which mostly contained abstract concepts."
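
To make stages (1) and (4) concrete, the sketch below illustrates, in Python, a link-heavy-line filter and an opinion-based re-ranking score that combines KLD weights of subjective words with their proximity to query-term occurrences and a simple facet-validation penalty. It is a minimal sketch under stated assumptions: the window size, the distance damping, the penalty factor and the exact combination formula are not taken from the paper.

```python
import re

ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")

def drop_link_heavy_lines(html_lines, threshold=0.5):
    """Stage (1) sketch: drop lines in which words inside hyperlinks make up
    `threshold` (here 50%) or more of all words on the line."""
    kept = []
    for line in html_lines:
        anchor_words = sum(len(TAG_RE.sub(" ", a).split())
                           for a in ANCHOR_RE.findall(line))
        total_words = len(TAG_RE.sub(" ", line).split())
        if total_words == 0 or anchor_words / total_words < threshold:
            kept.append(line)
    return kept

def opinion_rerank_score(tokens, facets, subjective_kld, topical_score,
                         window=15, facet_penalty=0.5):
    """Stage (4) sketch: add to a topical score (e.g. BM25) an opinion score
    built from subjective words found near query-term occurrences.

    `facets` is a list of sets of query terms (one set per facet) and
    `subjective_kld` maps a subjective word to a precomputed KLD weight.
    """
    query_terms = set().union(*facets)
    positions = [i for i, tok in enumerate(tokens) if tok in query_terms]

    opinion = 0.0
    for pos in positions:
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for j in range(lo, hi):
            word = tokens[j]
            if j != pos and word in subjective_kld:
                # KLD weight of the subjective word, damped by its distance
                # from the query term (the "KLD+dist" idea).
                opinion += subjective_kld[word] / abs(j - pos)

    # Facet Validation: down-rank documents that do not contain at least one
    # concept from each facet.
    doc_vocab = set(tokens)
    if any(not (facet & doc_vocab) for facet in facets):
        opinion *= facet_penalty

    return topical_score + opinion
```

In such a pipeline, documents retrieved by the initial topic-based ranking (stage 3) would be re-ranked by this combined score.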

Comments

"In this paper new methods of retrieving documents containing opinions expressed about an entity or entities specified by the user in the query were proposed.

Research design: now "Design science, Statistical analysis", but it could also include "Experiment" and "Mathematical modeling". Tables 1 and 2 display results of an experiment with mathematical analysis, and the section "Opinion-based document ranking methods" contains some mathematical modeling.

"Data source" could, apart from Wikipedia pages, also be "Experiment responses".

"Wikipedia data extraction": where is it written that this is live or from a dump? It seems just to state "Wikipedia"."

