This paper reports on experiments investigating the use of the syntactical structure of sentences, combined with the sentences' terms, for calculating document similarity. Each document's sentences were first converted into ordered strings of Part of Speech (POS) tags, which were then fed into the Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing documents sentence by sentence. In the first stage, these syntactical features were used as a structural representation of the document's text. The resulting tag strings not only represent the text but also reduce its size, which improves the efficiency of comparing the documents' representative strings with the LCS algorithm. A score is generated by computing a cumulative value based on the number of LCSs found. In the second stage, documents that score well in the first stage are compared further using the actual words of the sentences (the content), again sentence by sentence. An overall final score is then generated as a measure of similarity from the common words (accumulated over the whole document) and the total number of LCSs from the first stage. Experiments were conducted on two different corpora. The results show the utility of the proposed procedure for calculating similarity between written documents: the overall discrimination power was maintained while the size of the documents was reduced by using only a tag-string representative of each document.
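The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract gives no scoring formulas, so the function names (`lcs_length`, `structural_score`, `content_score`, `similarity`), the thresholds `min_ratio` and `stage1_cutoff`, and the equal-weight combination in the final score are all assumptions made for illustration. Sentences are assumed to be already tagged, given as lists of (word, POS tag) pairs; no particular tagger is implied.

```python
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # (word, POS tag) pairs; tagging is assumed done elsewhere
Document = List[Sentence]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_score(doc_a: Document, doc_b: Document, min_ratio: float = 0.8) -> float:
    """Stage 1 (assumed scoring): fraction of sentences in doc_a whose tag string
    shares a sufficiently long LCS with some sentence in doc_b."""
    if not doc_a or not doc_b:
        return 0.0
    hits = 0
    for sa in doc_a:
        tags_a = [tag for _, tag in sa]
        best = max(
            lcs_length(tags_a, [tag for _, tag in sb]) / max(len(sa), len(sb), 1)
            for sb in doc_b
        )
        if best >= min_ratio:  # hypothetical threshold, not from the paper
            hits += 1
    return hits / len(doc_a)


def content_score(doc_a: Document, doc_b: Document) -> float:
    """Stage 2 (assumed scoring): common words found sentence by sentence,
    accumulated over the whole document and normalised by doc_a's word count."""
    total = sum(len({w.lower() for w, _ in s}) for s in doc_a)
    if total == 0 or not doc_b:
        return 0.0
    common = 0
    for sa in doc_a:
        words_a = {w.lower() for w, _ in sa}
        common += max(len(words_a & {w.lower() for w, _ in sb}) for sb in doc_b)
    return common / total


def similarity(doc_a: Document, doc_b: Document, stage1_cutoff: float = 0.5) -> float:
    """Run stage 2 only for pairs that pass stage 1, then combine both scores
    with equal weights (an assumption; the paper's formula is not given here)."""
    s1 = structural_score(doc_a, doc_b)
    if s1 < stage1_cutoff:
        return 0.0  # filtered out by the structural stage
    return 0.5 * (s1 + content_score(doc_a, doc_b))
```

Comparing tag strings first keeps the expensive word-level comparison for document pairs that already look structurally alike, which mirrors the filtering role the abstract assigns to the first stage.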
Published in | International Journal of Intelligent Information Systems (Volume 5, Issue 6)
DOI | 10.11648/j.ijiis.20160506.11
Page(s) | 82-87
Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright | Copyright © The Author(s), 2016. Published by Science Publishing Group
Keywords: Syntactical Structures, Document Similarity, Bag-of-words, Longest Common Subsequence
APA Style
Mohamed Taybe Elhadi. (2016). Using Text's Terms and Syntactical Properties for Document Similarity. International Journal of Intelligent Information Systems, 5(6), 82-87. https://doi.org/10.11648/j.ijiis.20160506.11
ACS Style
Mohamed Taybe Elhadi. Using Text's Terms and Syntactical Properties for Document Similarity. Int. J. Intell. Inf. Syst. 2016, 5(6), 82-87. doi: 10.11648/j.ijiis.20160506.11
AMA Style
Mohamed Taybe Elhadi. Using Text's Terms and Syntactical Properties for Document Similarity. Int J Intell Inf Syst. 2016;5(6):82-87. doi: 10.11648/j.ijiis.20160506.11
@article{10.11648/j.ijiis.20160506.11,
  author = {Mohamed Taybe Elhadi},
  title = {Using Text's Terms and Syntactical Properties for Document Similarity},
  journal = {International Journal of Intelligent Information Systems},
  volume = {5},
  number = {6},
  pages = {82-87},
  doi = {10.11648/j.ijiis.20160506.11},
  url = {https://doi.org/10.11648/j.ijiis.20160506.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.20160506.11},
  year = {2016}
}
TY  - JOUR
T1  - Using Text's Terms and Syntactical Properties for Document Similarity
AU  - Mohamed Taybe Elhadi
Y1  - 2016/12/05
PY  - 2016
N1  - https://doi.org/10.11648/j.ijiis.20160506.11
DO  - 10.11648/j.ijiis.20160506.11
T2  - International Journal of Intelligent Information Systems
JF  - International Journal of Intelligent Information Systems
JO  - International Journal of Intelligent Information Systems
SP  - 82
EP  - 87
PB  - Science Publishing Group
SN  - 2328-7683
UR  - https://doi.org/10.11648/j.ijiis.20160506.11
VL  - 5
IS  - 6
ER  -