Cross-language plagiarism detection

From WikiLit
Cross-language plagiarism detection
Authors: Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, Paolo Rosso
Citation: Language Resources and Evaluation, January 2010.
Publication type: Journal article
Peer-reviewed: Yes
DOI: 10.1007/s10579-009-9114-z
Link(s): http://dx.doi.org/10.1007/s10579-009-9114-z
Added by Wikilit team: Added on initial load
Cross-language plagiarism detection is a publication by Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso.


Abstract

Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (i) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (ii) state-of-the-art solutions for two important subtasks are reviewed, (iii) retrieval models for the assessment of cross-language similarity are surveyed, and (iv) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on “exact” translations but does not generalize well.
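The abstract only names the three retrieval models, so the sketch below illustrates the idea behind the simplest of them, CL-CNG with character 3-grams (CL-C3G): texts in different languages are mapped to character 3-gram count vectors and compared with cosine similarity. This is a minimal approximation under assumed preprocessing choices, not the authors' implementation.

```python
# Minimal sketch of the CL-C3G idea (character 3-gram overlap across languages).
# Assumed simplifications: lowercase text, punctuation stripped, raw counts,
# plain cosine similarity; this is not the authors' implementation.
import re
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after light normalization."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Syntactically related languages share many character 3-grams, which is why
# this simple model ranks well on pairs such as English and Spanish.
en = "Plagiarism detection identifies reused text in a document collection."
es = "La detección de plagio identifica texto reutilizado en una colección de documentos."
print(round(cosine(char_ngrams(en), char_ngrams(es)), 3))
```

Unlike CL-ESA and CL-ASA, this character-level comparison presupposes a shared alphabet and related syntax, which matches the finding quoted in the conclusion below.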

Research questions

"Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections fromthe document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (i) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (ii) state-of-the-art solutions for two important subtasks are reviewed, (iii) retrieval models for the assessment of cross-language similarity are surveyed, and, (iv) the three models CL-CNG, CL-ESA and CL-ASA are compared."

Research details

Topics: Cross-language information retrieval
Domains: Computer science
Theory type: Design and action
Wikipedia coverage: Sample data
Theories: "Undetermined"
Research design: Experiment
Data source: Experiment responses, Archival records, Wikipedia pages
Collected data time dimension: Cross-sectional
Unit of analysis: Article
Wikipedia data extraction: Dump
Wikipedia page type: Article
Wikipedia language: Multiple

Conclusion

"The evaluation covers three experiments with two aligned corpora, the comparable Wikipedia corpus and the parallel JRC-Acquis corpus. In the experiments the models are employed in different tasks related to cross-language ranking in order to determine whether or not they can be used to retrieve documents known to be highly similar across languages. Our findings include that the CL-C3G model and the CLESA model are in general better suited for this task, while CL-ASA achieves good results on professional and automatic translations. CL-CNG outperforms CL-ESA and CL-ASA. However, unlike the former, CL-ESA and CL-ASA can also be used on language pairs whose alphabet or syntax are unrelated."

Comments

"wikipedia pages and another corpus (JRC-Acquis)"

