Abstract

Plagiarism detection is the process of analysing a scientific text to find potentially plagiarised passages. In this context, non-automated procedures have proven to be time-consuming and subjective. Especially in light of the steadily increasing number of scientific publications, automated, software-aided approaches are valuable instruments for detecting plagiarised text effectively. Conventional plagiarism software compares text passages against potential original documents based on matching strings. In contrast, intrinsic plagiarism detection attempts to detect plagiarised sections based on stylometric features, which makes it possible to discover sudden changes in writing style. The recognition of stylistic inconsistencies is closely related to the field of authorship attribution, especially in its use of textual features. This thesis focuses on the development and implementation of a prototype for intrinsic plagiarism detection. The developed approach automatically extracts stylometric features from a given text and performs a multivariate cluster analysis. The resulting clusters represent groups of text passages exhibiting similar stylometric properties and can therefore be associated with the corresponding number of authors. The input texts are articles from the English-language edition of the online encyclopedia Wikipedia. The evaluation results demonstrate that the procedure makes it possible to approximately distinguish between text passages originating from different authors. Furthermore, the reliability of the results is shown to depend strongly on the number of authors. How well the correct author class structure is approximated depends, among other things, on the determination of the number of clusters. The resulting number is validated by a quality measure developed specifically for this thesis.
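The pipeline described above, feature extraction followed by cluster analysis, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names neither the stylometric features nor the clustering algorithm, so the four features, the `FUNCTION_WORDS` list, the use of k-means with farthest-first initialisation, and all function names are hypothetical stand-ins rather than the thesis's method. The number of clusters `k` is passed in by hand here; the thesis determines it and validates it with its own quality measure, which is not reproduced.

```python
import re
from statistics import mean, pstdev

# Hypothetical function-word list; the thesis's feature set is not given in the abstract.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}


def stylometric_features(passage):
    """Return a small, illustrative feature vector for one text passage."""
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    words = re.findall(r"[A-Za-z']+", passage.lower())
    if not words or not sentences:
        return [0.0, 0.0, 0.0, 0.0]
    return [
        mean(len(w) for w in words),                            # average word length
        len(words) / len(sentences),                            # words per sentence
        len(set(words)) / len(words),                           # type-token ratio
        sum(w in FUNCTION_WORDS for w in words) / len(words),   # function-word rate
    ]


def standardize(rows):
    """Z-score each feature column so that no single feature dominates distances."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant columns keep a divisor of 1
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append([mean(col) for col in zip(*members)] if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def cluster_passages(passages, k):
    """Assign each passage a cluster label based on its standardised features."""
    return kmeans(standardize([stylometric_features(p) for p in passages]), k)
```

Calling `cluster_passages(passages, k)` on a list of passages from one article returns one cluster label per passage. Passages whose label differs from that of their neighbours would be candidates for a change of author, under the assumptions stated above.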

Reference

Schneider, D. (2015). Intrinsische Plagiatserkennung durch stilometrische Clusteranalyse [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.25561