Cross-language plagiarism detection
Abstract Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (i) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (ii) state-of-the-art solutions for two important subtasks are reviewed, (iii) retrieval models for the assessment of cross-language similarity are surveyed, and (iv) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120 000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the 6 languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on “exact” translations but does not generalize well.
Added by wikilit team Added on initial load  +
Collected data time dimension Cross-sectional  +
Comments Wikipedia pages and another corpus (JRC-Acquis)
Conclusion The evaluation covers three experiments with two aligned corpora, the comparable Wikipedia corpus and the parallel JRC-Acquis corpus. In the experiments the models are employed in different tasks related to cross-language ranking in order to determine whether or not they can be used to retrieve documents known to be highly similar across languages. Our findings include that the CL-C3G model and the CL-ESA model are in general better suited for this task, while CL-ASA achieves good results on professional and automatic translations. CL-CNG outperforms CL-ESA and CL-ASA. However, unlike the former, CL-ESA and CL-ASA can also be used on language pairs whose alphabet or syntax are unrelated.
Data source Experiment responses  + , Archival records  + , Wikipedia pages  +
Doi 10.1007/s10579-009-9114-z +
Google scholar url http://scholar.google.com/scholar?ie=UTF-8&q=%22Cross-language%2Bplagiarism%2Bdetection%22  +
Has author Martin Potthast + , Alberto Barrón-Cedeño + , Benno Stein + , Paolo Rosso +
Has domain Computer science +
Has topic Cross-language information retrieval +
Month January  +
Peer reviewed Yes  +
Publication type Journal article  +
Published in Language Resources and Evaluation +
Research design Experiment  +
Research questions Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (i) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (ii) state-of-the-art solutions for two important subtasks are reviewed, (iii) retrieval models for the assessment of cross-language similarity are surveyed, and (iv) the three models CL-CNG, CL-ESA and CL-ASA are compared.
Revid 10,721  +
Theories Undetermined
Theory type Design and action  +
Title Cross-language plagiarism detection
Unit of analysis Article  +
Url http://dx.doi.org/10.1007/s10579-009-9114-z  +
Wikipedia coverage Sample data  +
Wikipedia data extraction Dump  +
Wikipedia language Multiple  +
Wikipedia page type Article  +
Year 2010  +
Creation date 15 March 2012 20:25:39  +
Categories Cross-language information retrieval  + , Computer science  + , Publications  +
Modification date 30 January 2014 20:22:50  +