Peer-Reviewed

Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus

Received: 23 March 2016     Accepted: 7 June 2016     Published: 18 June 2016
Abstract

Multilingual corpora are the main resources in language information retrieval. The quality of much research, such as work on machine translation, depends strongly on the quality of these corpora. One type of multilingual corpus is the comparable corpus. In terms of quality, comparable corpora contain a broad range of information, but constructing them poses particular problems, which leave a comparable corpus with only a small number of pairs despite the large size of its underlying dataset. In this paper we present a new method for increasing the quality and quantity of a comparable corpus. We built a Persian-English comparable corpus from two independent news collections: BBC news in English and Hamshahri news in Persian.

Published in International Journal of Intelligent Information Systems (Volume 5, Issue 3)
DOI 10.11648/j.ijiis.20160503.12
Page(s) 42-47
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2016. Published by Science Publishing Group

Keywords

Comparable Corpus, Corpus Quality, Hamshahri Corpus, Query, RATF Factor

Cite This Article
  • APA Style

    Seyede Roya Mohammadi, Noushin Riahi. (2016). Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. International Journal of Intelligent Information Systems, 5(3), 42-47. https://doi.org/10.11648/j.ijiis.20160503.12

    ACS Style

    Seyede Roya Mohammadi; Noushin Riahi. Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. Int. J. Intell. Inf. Syst. 2016, 5(3), 42-47. doi: 10.11648/j.ijiis.20160503.12

    AMA Style

    Seyede Roya Mohammadi, Noushin Riahi. Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus. Int J Intell Inf Syst. 2016;5(3):42-47. doi: 10.11648/j.ijiis.20160503.12

  • BibTeX

    @article{10.11648/j.ijiis.20160503.12,
      author = {Seyede Roya Mohammadi and Noushin Riahi},
      title = {Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus},
      journal = {International Journal of Intelligent Information Systems},
      volume = {5},
      number = {3},
      pages = {42-47},
      doi = {10.11648/j.ijiis.20160503.12},
      url = {https://doi.org/10.11648/j.ijiis.20160503.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.20160503.12},
     year = {2016}
    }
    

  • RIS

    TY  - JOUR
    T1  - Presenting an Optimal Method for Constructing an English-Persian Comparable Corpus
    AU  - Seyede Roya Mohammadi
    AU  - Noushin Riahi
    Y1  - 2016/06/18
    PY  - 2016
    N1  - https://doi.org/10.11648/j.ijiis.20160503.12
    DO  - 10.11648/j.ijiis.20160503.12
    T2  - International Journal of Intelligent Information Systems
    JF  - International Journal of Intelligent Information Systems
    JO  - International Journal of Intelligent Information Systems
    SP  - 42
    EP  - 47
    PB  - Science Publishing Group
    SN  - 2328-7683
    UR  - https://doi.org/10.11648/j.ijiis.20160503.12
    VL  - 5
    IS  - 3
    ER  - 

Author Information
  • Seyede Roya Mohammadi, Computer Engineering Department, Alzahra University, Tehran, Iran

  • Noushin Riahi, Computer Engineering Department, Alzahra University, Tehran, Iran
