Exploiting Wikipedia and EuroWordNet to solve cross-lingual question answering

Authors: Sergio Ferrandez, Antonio Toral, Oscar Ferrandez, Antonio Ferrandez, Rafael Munoz
Citation: Information Sciences 179 (20): 3473-3488. 2009.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1016/j.ins.2009.06.031
Google Scholar cites: http://scholar.google.com/scholar?ie=UTF-8&q=%22Exploiting%2BWikipedia%2Band%2BEuroWordNet%2Bto%2Bsolve%2Bcross-lingual%2Bquestion%2Banswering%22
Link(s): http://dx.doi.org/10.1016/j.ins.2009.06.031
Added by Wikilit team: Added on initial load
Exploiting Wikipedia and EuroWordNet to solve cross-lingual question answering is a publication by Sergio Ferrandez, Antonio Toral, Oscar Ferrandez, Antonio Ferrandez, and Rafael Munoz.


Abstract

This paper describes a new advance in solving Cross-Lingual Question Answering (CL-QA) tasks. It is built on three main pillars: (i) the use of several multilingual knowledge resources to reference words between languages (the Inter Lingual Index (ILI) module of EuroWordNet and the multilingual knowledge encoded in Wikipedia); (ii) the consideration of more than only one translation per word in order to search candidate answers; and (iii) the analysis of the question in the original language without any translation process. This novel approach overcomes the errors caused by the common use of Machine Translation (MT) services by CL-QA systems. We also expose some studies and experiments that justify the importance of analyzing whether a Named Entity should be translated or not. Experimental results in bilingual scenarios show that our approach performs better than an MT-based CL-QA approach, achieving an average improvement of 36.7%.

Research questions

"This paper describes a new advance in solving Cross-Lingual Question Answering (CL–QA) tasks. It is built on three main pillars: (i) the use of several multilingual knowledge resources to reference words between languages (the Inter Lingual Index (ILI) module of EuroWordNet and the multilingual knowledge encoded in Wikipedia); (ii) the consideration of more than only one translation per word in order to search candidate answers; and (iii) the analysis of the question in the original language without any translation process. This novel approach overcomes the errors caused by the common use of Machine Translation (MT) services by CL–QA systems. We also expose some studies and experiments that justify the importance of analyzing whether a Named Entity should be translated or not."

Research details

Topics: Cross-language information retrieval
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Data source: Archival records, Experiment responses, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: English, Spanish

Conclusion

"To sum up, this paper illustrates a new advance to solve CL–QA tasks. Specifically, we present a robust CL methodology (implemented in a CL–QA system called BRILIW) and its evaluation in English–Spanish scenarios. The main contributions of our research are listed below:

• The use of several multilingual knowledge resources (Wikipedia and ILI) to reference words between languages (proposal (i)). Our hypothesis is that both resources contain complementary information and therefore a combination could achieve better CL–QA performance. Wikipedia multilingual knowledge is incorporated into our CL–QA system in the NET module. In contrast, the other multilingual resource (ILI) employed by the system is used for translating common nouns and verbs.

• The consideration of more than only one translation per word in order to search candidate answers (proposal (ii)). Different from common MT-based CL–QA systems, our proposal considers more than one translation per word by means of using the different synsets of each word in the ILI module of EWN.

• The analysis of the question in its original language (proposal (iii)). Analyzing the question in its original language avoids lexical and syntactical noise that could be introduced to the system by wrong question translations.

• A study justifying the need for correct translations of NEs has been presented, in which we observed that the percentage of questions with NEs is quite high (87.7% on average). Nearly half of these entities should be translated (41.2% on average) since these NEs are named differently depending on the language. Obviously, the remaining percentage of NEs must be checked since the system does not know whether the NEs should be translated or not. In our strategy, this control is developed by the NET module using Wikipedia, where most of the NEs are present.

• Three different experiments are detailed in order to present the evolution of our CL–QA approach. Furthermore, these experiments show the improvements obtained with each new proposal (adding internal bilingual dictionaries +9.3% and adding the NET module +22.2%). Exploiting Wikipedia to support this procedure ensures the up-to-date status of the module. To our knowledge, we have been the first research group to apply multilingual knowledge from Wikipedia within the CL–QA environment.

• The evaluation of our CL–QA methodology is developed using the CLEF 2004, 2005 and 2006 sets of English and Spanish questions (1200 in total). The results obtained are very promising. BRILIW obtains an improvement of 36.7% compared to the MT based approach.

Our technique for solving the CL–QA task increases the precision of the CL run to the level of monolingual QA runs. Compared to other state-of-the-art CL–QA systems, our approach obtains better results. In fact, our average precision loss of CL with respect to the monolingual run is around 7.2%, whereas in the English–Spanish QA task at CLEF 2006 [39] the precision of the English–Spanish CL–QA task was approximately 50% lower than for the monolingual Spanish task.

• Our CL methodology is able to solve questions that cannot be solved by MT-based CL–QA approaches. There are many questions that no MT-based CL–QA system would be able to solve.

With the aim of proving this last affirmation and the up-to-date and realistic characteristics of our strategy, we have tested the NET module with questions that contain the NEs involved in the Top Searches Google News 2001–2006, and the precision of translating NEs climbed to 100%.

This last test of the NET module demonstrates the correct NE translation process carried out by our CL technique. Compared to on-line MT services (results down to 46.7% on average), our approach obtains much better results. Furthermore, we want to emphasize that for most of these questions, an MT-based CL–QA system would not be able to find the correct answer, because the NE of the question is usually wrongly translated by the MT services."
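
The NET module's reliance on Wikipedia's cross-language links can be illustrated with a short script. The paper works over a Wikipedia dump, so the sketch below, which instead queries the live MediaWiki API and uses an invented helper name (translate_named_entity), is only a minimal illustration of the idea and not the authors' BRILIW implementation.

    # Minimal sketch: map a Named Entity between languages via Wikipedia's
    # interlanguage links. BRILIW's NET module works over a Wikipedia dump;
    # this illustration queries the public MediaWiki API, which exposes the
    # same cross-language links.
    import requests

    def translate_named_entity(title, source="en", target="es"):
        """Return the target-language Wikipedia title for `title`, or None."""
        api = f"https://{source}.wikipedia.org/w/api.php"
        params = {
            "action": "query",
            "titles": title,
            "prop": "langlinks",
            "lllang": target,   # only interlanguage links to the target language
            "redirects": 1,     # follow redirects (alternative NE spellings)
            "format": "json",
        }
        pages = requests.get(api, params=params, timeout=10).json()["query"]["pages"]
        for page in pages.values():
            for link in page.get("langlinks", []):
                return link["*"]    # equivalent article title in the target language
        return None                 # NE not found: keep the original, untranslated form

    # e.g. "London" maps to "Londres", whereas an NE whose article title is the
    # same in both languages needs no translation at all, which is exactly the
    # decision the NET module has to make for each question.
    print(translate_named_entity("London"))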

Comments
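
An illustrative note on proposal (ii): the multiple translations per word come from the synsets of EuroWordNet's Inter Lingual Index, a resource that is not freely redistributable. The sketch below uses NLTK's Open Multilingual WordNet as a rough, freely available stand-in to show the idea of collecting every Spanish lemma reachable through an English word's synsets; the function name candidate_translations is invented for the example, and none of this is the code or resource used in the paper.

    # Sketch of proposal (ii): keep every target-language lemma of every synset
    # of a source-language word as a candidate translation, instead of the single
    # output of an MT service. Open Multilingual WordNet stands in here for the
    # EuroWordNet Inter Lingual Index used by BRILIW.
    import nltk
    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)
    from nltk.corpus import wordnet as wn

    def candidate_translations(word, pos=None, lang="spa"):
        """Return all `lang` lemmas of all synsets of `word`."""
        translations = set()
        for synset in wn.synsets(word, pos=pos):
            for lemma in synset.lemma_names(lang):
                translations.add(lemma.replace("_", " "))
        return sorted(translations)

    # A polysemous noun yields several candidates, each of which can be tried
    # when searching the Spanish document collection for answers.
    print(candidate_translations("bank", pos=wn.NOUN))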

