Why finding entities in Wikipedia is difficult, sometimes

From WikiLit
Authors: Gianluca Demartini, Claudiu S. Firan, Tereza Iofciu, Ralf Krestel, Wolfgang Nejdl
Citation: Information Retrieval 13(5): 534, October 2010.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1007/s10791-010-9135-7
Link(s): http://proquest.umi.com/pqdweb?did=2152237371&Fmt=7&clientId=10306&RQT=309&VName=PQD
Added by Wikilit team: Added on initial load
Why finding entities in Wikipedia is difficult, sometimes is a publication by Gianluca Demartini, Claudiu S. Firan, Tereza Iofciu, Ralf Krestel, and Wolfgang Nejdl.


Abstract

Entity Retrieval (ER)--in comparison to classical search--aims at finding individual entities instead of relevant documents. Finding a list of entities therefore requires techniques different from those of classical search engines. In this paper, we present a model to describe entities more formally and show how an ER system can be built on top of it. We compare different approaches designed for finding entities in Wikipedia and report on results using standard test collections. An analysis of entity-centric queries reveals different aspects and problems related to ER and shows limitations of current systems performing ER with Wikipedia. It also indicates which approaches are suitable for which kinds of queries.
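
The abstract's distinction between returning documents and returning entities can be made concrete with a small sketch. The Python fragment below is purely illustrative and is not the system described in the paper: the page structure, the toy scoring function, and the example query are assumptions. It contrasts classical document retrieval with an entity-retrieval step that restricts candidates to Wikipedia pages whose categories match the entity type the query asks for.

```python
# Illustrative sketch only: contrasts classical document retrieval with entity
# retrieval over Wikipedia pages. Data structures, the toy scoring function,
# and the example query are assumptions, not the system from the paper.

from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    text: str
    categories: set = field(default_factory=set)


def text_score(query, page):
    """Toy relevance score: fraction of query terms occurring in the page text."""
    terms = query.lower().split()
    body = page.text.lower()
    return sum(t in body for t in terms) / len(terms)


def document_retrieval(query, pages, k=10):
    """Classical search: return the most relevant pages, whatever they describe."""
    return sorted(pages, key=lambda p: text_score(query, p), reverse=True)[:k]


def entity_retrieval(query, target_category, pages, k=10):
    """Entity retrieval: keep only pages representing entities of the requested
    type (approximated here by a category match), then rank those candidates."""
    candidates = [p for p in pages if target_category in p.categories]
    return sorted(candidates, key=lambda p: text_score(query, p), reverse=True)[:k]


if __name__ == "__main__":
    pages = [
        Page("Euro", "The euro is the common currency of the eurozone ...", {"Currencies"}),
        Page("France", "France is a European country that uses the euro ...", {"Eurozone countries"}),
        Page("Germany", "Germany is a European country that uses the euro ...", {"Eurozone countries"}),
    ]
    query = "countries that use the euro"
    # Document retrieval may rank the 'Euro' article itself highly; entity
    # retrieval returns only pages that stand for the requested entity type.
    print([p.title for p in document_retrieval(query, pages)])
    print([p.title for p in entity_retrieval(query, "Eurozone countries", pages)])
```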

Research questions

"we present a model to describe entities more formally and how an ER system can be build on top of it. We compare different approaches designed for finding entities in Wikipedia and report on results using standard test collections."

Research details

Topics: Other information retrieval topics
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Other
Theories: Undetermined
Research design: Experiment
Data source: Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article, Information categorization and navigation
Wikipedia language: English

Conclusion

"In this paper we presented a general model for ranking entities and we showed how the model can be applied to different real world scenarios. We described in detail a possible instantiation of the model and a set of algorithms designed for the Wikipedia dataset. We make use of the Wikipedia structure—page links and categories—and employ an accurate ontology to remove possible noise in Wikipedia category assignments. The results show that, in the used test collection, category assignments can be both very helpful for retrieval as well as misleading depending on the query syntax. We also employ several NLP techniques to transform the query and to fill the gaps between the query and the Wikipedia language models. We extract essential information (lexical expressions, key concepts, named entities) from the query, as well as expand the terms (by means of synonyms or related words) to find entities by specific spelling variants of their attributes. By combining several techniques we can achieve a relatively high effectiveness of the ER system; still, further improvement is possible by selectively applying the methods for different queries. The experimental evaluation of the ER algorithms has shown that by combining our approaches we achieve an average improvement of 24% in terms of xInfAP and of 30% in terms of P@10 on the XER task of the INEX-XER 2008 test collection. While the proposed techniques were designed for the ER task, experimental results for the list completion task are consistent. While more experimentation is needed to conclude that the proposed techniques perform well in general, we have shown how they improve effectiveness on the used test collection.

We also saw that it might be possible to apply and/or combine different approaches depending on the query in order to maximize effectiveness—e.g., by using our methods we achieve an xInfAP value of over 0.7 for 20% of the queries of the used test collection and the mean xInfAP can be further boosted by 27% only by selecting the appropriate approach for each given topic. We leave as future work the research question of automatically selecting appropriate approaches for each query (e.g., by estimating the expected number of relevant results). We also point out that initial steps toward this goal have been done in Vercoustre et al. (2009) by applying machine learning techniques to predict query difficulty."
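
The combination of evidence described in the conclusion (ontology-cleaned category assignments, page links, and NLP-based query expansion) can be pictured with a short sketch. The code below is a hedged illustration, not the authors' algorithm: the linear weighting, the helper functions, and the tiny synonym table merely stand in for the category, link, and query-expansion components the paper actually evaluates.

```python
# Hedged sketch of combining evidence for entity ranking, loosely following the
# ingredients named in the conclusion: categories, page links, and query
# expansion. Weights, helpers, and the synonym table are illustrative only.

SYNONYMS = {"movie": {"film"}, "car": {"automobile"}}  # stand-in for a lexical resource


def expand_query(query):
    """Add synonyms/related words so spelling variants of attributes still match."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def category_score(page_categories, target_categories):
    """Overlap between a page's (cleaned) category set and the target categories."""
    if not target_categories:
        return 0.0
    return len(page_categories & target_categories) / len(target_categories)


def link_score(page_links, seed_titles):
    """Fraction of a page's outgoing links that point to already-relevant pages."""
    if not page_links:
        return 0.0
    return len(page_links & seed_titles) / len(page_links)


def text_score(expanded_terms, text):
    """Toy lexical match between the expanded query and the page text."""
    body = text.lower()
    return sum(t in body for t in expanded_terms) / max(len(expanded_terms), 1)


def entity_score(query, page_text, page_categories, page_links,
                 target_categories, seed_titles,
                 w_text=0.5, w_cat=0.3, w_link=0.2):
    """Linear combination of the three evidence sources (weights are arbitrary)."""
    terms = expand_query(query)
    return (w_text * text_score(terms, page_text)
            + w_cat * category_score(page_categories, target_categories)
            + w_link * link_score(page_links, seed_titles))
```

In this picture, the conclusion's point about selectively applying methods per query would correspond to choosing the weights (or dropping a component) for each topic, which the authors leave as future work on predicting query difficulty.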

Comments

""In this paper we presented a general model for ranking entities and we showed how the model can be applied to different real world scenarios." p. 564"

