Difference between revisions of "Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction"

From WikiLit
(abstract fix)
(changed the collected data time dimension)
Line 29:
 |research_design=Experiment
 |collected_datatype=Experiment responses, Wikipedia pages
-|collected_data_time_dimension=N/A
+|collected_data_time_dimension=Cross-sectional
 |unit_of_analysis=Article
 |wikipedia_data_extraction=Secondary dataset

Revision as of 02:08, January 29, 2014

Publication
Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction
Authors: Jovan Pehcevski, James A. Thom, Anne-Marie Vercoustre, Vladimir Naumovski
Citation: Information Retrieval 13 (5): 568. 2010 October.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1007/s10791-009-9125-9
Added by Wikilit team: Added on initial load
Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction is a publication by Jovan Pehcevski, James A. Thom, Anne-Marie Vercoustre and Vladimir Naumovski.


Abstract

Entity ranking has recently emerged as a research field that aims at retrieving entities as answers to a query. Unlike entity extraction where the goal is to tag names of entities in documents, entity ranking is primarily focused on returning a ranked list of relevant entity names for the query. Many approaches to entity ranking have been proposed, and most of them were evaluated on the INEX Wikipedia test collection. In this paper, we describe a system we developed for ranking Wikipedia entities in answer to a query. The entity ranking approach implemented in our system utilises the known categories, the link structure of Wikipedia, as well as the link co-occurrences with the entity examples (when provided) to retrieve relevant entities as answers to the query. We also extend our entity ranking approach by utilising the knowledge of predicted classes of topic difficulty. To predict the topic difficulty, we generate a classifier that uses features extracted from an INEX topic definition to classify the topic into an experimentally pre-determined class. This knowledge is then utilised to dynamically set the optimal values for the retrieval parameters of our entity ranking system. Our experiments demonstrate that the use of categories and the link structure of Wikipedia can significantly improve entity ranking effectiveness, and that topic difficulty prediction is a promising approach that could also be exploited to further improve the entity ranking performance.
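
The abstract describes a system that mixes category evidence, link evidence and the query match under tunable retrieval parameters, but it does not give the scoring formula. As a rough illustration only, the sketch below assumes a simple linear mix of three per-entity scores. The names `combined_score` and `rank_entities`, the weights `alpha` and `beta`, and every score value are hypothetical.

```python
def combined_score(link_score, category_score, text_score, alpha, beta):
    """Linear mix of three evidence scores (assumed form, not taken from the source).

    alpha weights the link evidence, beta the category evidence, and the
    remainder (1 - alpha - beta) weights the full-text match.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * link_score + beta * category_score + (1 - alpha - beta) * text_score


def rank_entities(candidates, alpha, beta):
    """Sort candidate entity names by descending combined score.

    `candidates` maps an entity name to a (link, category, text) score triple.
    """
    return sorted(
        candidates,
        key=lambda name: combined_score(*candidates[name], alpha=alpha, beta=beta),
        reverse=True,
    )


# Made-up scores for three candidate pages.
candidates = {
    "Page A": (0.2, 0.9, 0.5),
    "Page B": (0.8, 0.1, 0.4),
    "Page C": (0.5, 0.5, 0.5),
}
category_heavy = rank_entities(candidates, alpha=0.1, beta=0.8)
link_heavy = rank_entities(candidates, alpha=0.8, beta=0.1)
```

With these invented numbers, shifting weight from categories to links reverses the order of the first and last pages, which is the kind of sensitivity that makes per-topic parameter tuning worthwhile.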

Research questions

"In this paper, we describe our approach to ranking entities from the Wikipedia XML document collection. Our approach is based on the following hypotheses:

1. A good entity page is a page that answers the query, or a query extended with names of target categories (task 1) or entity examples (task 2).
2. A good entity page is a page associated with a category close to the target category (task 1) or to the categories of the entity examples (task 2).
3. A good entity page is referred to by a page answering the query; this is an adaptation of the HITS (Kleinberg 1999) algorithm to the problem of entity ranking.
4. A good entity page is referred to by focused contexts with many occurrences of the entity examples (task 2). A broad context could be the full page that contains the entity examples, while smaller and more narrow focused contexts could be XML elements such as paragraphs, lists, or tables."
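
The fourth hypothesis contrasts a broad context (the full page) with narrow contexts (individual XML elements). A minimal sketch of that contrast follows, assuming each context has already been reduced to the list of pages it links to. The function `cooccurrence_counts`, the toy counting rule and the sample page are all hypothetical and are not the paper's actual scoring.

```python
from collections import Counter


def cooccurrence_counts(contexts, examples):
    """Count, per candidate link, the example links sharing its context.

    `contexts` is a list of contexts, each a list of linked page names.
    A candidate earns one point for every entity example linked from the
    same context. This is an assumed toy measure for illustration.
    """
    examples = set(examples)
    counts = Counter()
    for links in contexts:
        present = set(links)
        shared = len(present & examples)
        for candidate in present - examples:
            counts[candidate] += shared
    return counts


# Hypothetical page: two paragraphs and one list, each reduced to its links.
elements = [
    ["France", "Germany", "Italy"],      # paragraph naming countries
    ["Euro", "European Central Bank"],   # paragraph on currency
    ["France", "Spain", "Germany"],      # list of countries
]
examples = ["France", "Germany"]

# Narrow contexts: each XML element is scored separately.
narrow = cooccurrence_counts(elements, examples)

# Broad context: the whole page is treated as a single context.
full_page = [[link for element in elements for link in element]]
broad = cooccurrence_counts(full_page, examples)
```

In this toy page the narrow contexts separate "Italy" and "Spain" (two shared examples each) from "Euro" (none), whereas the full-page context gives every candidate the same count.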

Research details

Topics: Ranking and clustering systems
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Secondary dataset
Wikipedia page type: Article
Wikipedia language: Not specified

Conclusion

"In this paper, we have presented our system for ranking Wikipedia entities in answer to a query. The system implements an entity ranking approach for the INEX Wikipedia XML document collection that is based on exploiting the interesting structural and semantic properties of the collection. Our experiments have demonstrated that utilising categories and the link structure of Wikipedia can significantly improve entity ranking effectiveness, and that topic difficulty prediction is a promising approach that could also be used to further improve the entity ranking performance.

When utilising Wikipedia categories in our system, we found that using lexical similarity between category names results in an effective entity ranking approach, so long as the category index comprises documents containing only category names. We also found that the best entity ranking approach is the one that uses sets of categories directly attached to both the example and the answer entities, and that using various extensions of these two sets significantly decreases the entity ranking performance. Importantly, when comparing the scores of the best performing approaches across the two entity ranking tasks, we found that the query strategy that uses example entities to identify the set of target categories is significantly more effective than the strategy that uses the set of loosely defined target categories. In the future, we plan to introduce different category weighting rules that we hope would better distinguish the answer entities that are more similar to the entity examples.

When utilising the link structure of Wikipedia in our system, we found that the locality of Wikipedia links can be exploited to significantly improve the effectiveness of entity ranking. Using an approach that takes the broad context of a full Wikipedia page as a baseline, we evaluated two alternative approaches that take narrow contexts around the entity examples: one that uses static (predefined) types of elements such as paragraphs, lists and tables; and another that dynamically identifies the contexts by utilising the underlying XML document structure. Although the entity ranking performances of the two approaches were similar, both of them nevertheless significantly outperformed the approach that takes the broad context of a full Wikipedia page. In the future, we plan to further improve our linkrank algorithm by varying the number of entity examples and incorporating relevance feedback that we expect would reveal other useful entities that could be used to identify better contexts.

When utilising topic difficulty prediction in our system, we found that it is possible to predict accurately a two-class level of topic difficulty with a classifier generated from a selected number of static features extracted from the INEX topic definition and the Wikipedia document collection. Interestingly, when analysing the impact of four classes of topic difficulty on the optimal parameter values of our system, we found that for the Easy topics the use of Wikipedia categories is very important while for the Difficult topics the link structure plays an important role. The application of topic prediction in tuning our system has shown encouraging improvement over the approach that does not use prediction, but we need a larger test collection to confirm the significance of our findings. The major limitation of our topic prediction approach is that it relies on the INEX topic definition that is much richer than standard Web queries (and so hardly applicable in practice). In the future we plan to develop a dynamic query prediction approach based solely on the query terms and (among other things) on the similarity scores of the relevant entities retrieved by our XER system.

Our entity ranking system was evaluated as one of the best performing systems when compared with other participating systems in both the INEX 2007 and INEX 2008 XER tracks. In the future, we aim at further developing our entity ranking algorithms by incorporating natural language processing techniques that we expect would reveal more potentially relevant entities. We also recognise that the entity ranking techniques presented in this paper have been developed for specific INEX tasks and tested on one (XML) version of the Wikipedia. However, they demonstrate the potential of using Wikipedia pages for assisting in more general focused retrieval tasks performed on even more diverse collections (such as the Web), which we plan to investigate and carry out in the future."
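
The conclusion reports that a predicted topic difficulty class is used to tune the system's parameters, with categories mattering most for Easy topics and link structure for Difficult ones. The sketch below shows one way such a lookup could be wired up. The weight values, the names `PARAMS_BY_CLASS`, `DEFAULT_PARAMS` and `params_for`, and the stand-in `toy_classifier` rule are all assumptions made for illustration; the real classifier and parameter values are not given here.

```python
# Placeholder weights per predicted difficulty class; the numbers are invented.
# They only mirror the reported tendency: category evidence (beta) dominates
# for Easy topics, link evidence (alpha) for Difficult ones.
PARAMS_BY_CLASS = {
    "Easy": {"alpha": 0.1, "beta": 0.8},
    "Difficult": {"alpha": 0.7, "beta": 0.1},
}
DEFAULT_PARAMS = {"alpha": 0.3, "beta": 0.3}


def params_for(topic, classify):
    """Return retrieval parameters for the class that `classify` predicts.

    `classify` is any callable mapping a topic to a class label. Unknown
    labels fall back to DEFAULT_PARAMS.
    """
    return PARAMS_BY_CLASS.get(classify(topic), DEFAULT_PARAMS)


def toy_classifier(topic):
    """Arbitrary stand-in for a trained classifier (demonstration only)."""
    return "Easy" if len(topic["title"].split()) <= 2 else "Difficult"


easy_params = params_for({"title": "Italian composers"}, toy_classifier)
hard_params = params_for({"title": "Films shot in Venice before 1960"}, toy_classifier)
```

Passing the classifier in as a callable keeps the lookup independent of how the class is predicted, so a trained model could replace the toy rule without changing `params_for`.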


Facts about "Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction"
Added by wikilit team: Added on initial load
Collected data time dimension: Cross-sectional
Doi: 10.1007/s10791-009-9125-9
Google scholar url: http://scholar.google.com/scholar?ie=UTF-8&q=%22Entity%2Branking%2Bin%2BWikipedia%3A%2Butilising%2Bcategories%2C%2Blinks%2Band%2Btopic%2Bdifficulty%2Bprediction%22
Has author: Jovan Pehcevski, James A. Thom, Anne-Marie Vercoustre and Vladimir Naumovski
Has domain: Computer science
Has topic: Ranking and clustering systems
Issue: 5
Month: October
Pages: 568
Peer reviewed: Yes
Publication type: Journal article
Published in: Information Retrieval
Research design: Experiment
Revid: 10,604
Theories: Undetermined
Theory type: Design and action
Title: Entity ranking in Wikipedia: utilising categories, links and topic difficulty prediction
Unit of analysis: Article
Url: http://proquest.umi.com/pqdweb?did=2152237331&Fmt=7&clientId=10306&RQT=309&VName=PQD
Volume: 13
Wikipedia coverage: Sample data
Wikipedia data extraction: Secondary dataset
Wikipedia language: Not specified
Wikipedia page type: Article
Year: 2010