Text Anomalies Detection Using Histograms of Words
Abstract
Authors of written texts can largely be characterized by a collection of attributes extracted from their texts. Texts by the same author are stylistically very similar. We can therefore assume that the attributes of a full text are very similar to the attributes of its parts, and in the same way different parts of one text can be compared with each other. In this paper, we describe an algorithm based on histograms of a text mapped onto an interval; the mapping keeps the word order of the text. The histograms are analyzed from a clustering point of view. If the cluster dispersion is not large, the text was probably written by a single author. If the cluster dispersion is large, the text is split into two or more parts and the same analysis is applied to each part. Experiments were carried out on English and Arabic texts. For combined English texts, our algorithm detects that the texts were not written by one author. We obtained similar results for combined Arabic texts. Our algorithm can be used for a basic analysis of whether a text was written by one author.
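The split-and-recheck procedure outlined in the abstract can be illustrated with a minimal sketch. The abstract does not specify the mapping, the histogram features, the dispersion measure, or the splitting rule, so everything concrete here is an assumption made for illustration: histograms are taken over word lengths in consecutive, order-preserving windows, dispersion is the mean Euclidean distance of the window histograms from their centroid, and a text with large dispersion is simply cut in half. The names `length_histogram`, `dispersion`, `find_segments` and the parameters `threshold`, `n_windows`, `min_words` are hypothetical and do not come from the paper.

```python
import re
from collections import Counter

MAX_LEN = 10  # assumption: word lengths above this share the last bin


def tokenize(text):
    """Lowercase the text and return its words in their original order."""
    return re.findall(r"\w+", text.lower())


def length_histogram(words):
    """Normalized histogram of word lengths with bins 1..MAX_LEN."""
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    total = len(words)
    return [counts.get(k, 0) / total for k in range(1, MAX_LEN + 1)]


def windows(words, n_windows):
    """Cut the word sequence into consecutive equal-size chunks.

    Word order is preserved; leftover words at the end are ignored.
    """
    size = len(words) // n_windows
    return [words[i * size:(i + 1) * size] for i in range(n_windows)]


def dispersion(histograms):
    """Mean Euclidean distance of the histograms from their centroid."""
    dim = len(histograms[0])
    n = len(histograms)
    centroid = [sum(h[k] for h in histograms) / n for k in range(dim)]
    dists = [
        sum((h[k] - centroid[k]) ** 2 for k in range(dim)) ** 0.5
        for h in histograms
    ]
    return sum(dists) / n


def find_segments(words, threshold=0.1, n_windows=4, min_words=40, start=0):
    """Return (start, end, dispersion) triples for the detected segments.

    A segment is reported when its dispersion is at most `threshold`
    (probably one author) or when it is too short to analyze further,
    in which case its dispersion is None.
    """
    if len(words) < max(min_words, n_windows):
        return [(start, start + len(words), None)]
    d = dispersion([length_histogram(w) for w in windows(words, n_windows)])
    if d <= threshold:
        return [(start, start + len(words), d)]
    # Large dispersion: split in two and repeat the analysis on each part.
    mid = len(words) // 2
    left = find_segments(words[:mid], threshold, n_windows, min_words, start)
    right = find_segments(words[mid:], threshold, n_windows, min_words,
                          start + mid)
    return left + right
```

A call such as `find_segments(tokenize(document))` returns one triple for a text that looks homogeneous under these assumed features and several triples when the windows disagree. The threshold value is arbitrary and would need tuning on real corpora.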