Motivation: The discovery and assessment of genetic variants from Next Generation Sequencing (NGS) data, including Restriction site Associated DNA sequencing (RADSeq) data, is an important task in bioinformatics and comparative genetics. The genetic variants, identified by comparison to a reference genome, can be single-nucleotide polymorphisms (SNPs) or insertions and deletions (Indels). Usually, the short reads are first aligned to a reference genome using NGS alignment software such as the Burrows-Wheeler Aligner (BWA). The alignment is usually stored in a BAM file, the binary form of the standard SAM (Sequence Alignment/Map) format. Analysis software such as the Genome Analysis Toolkit (GATK) or SAMTools [30] [31], together with scripts written in the R programming language, can then provide an efficient solution for calling variants. We focused on RADSeq-based marker selection for Arabidopsis thaliana. RADSeq data consist of short reads that do not cover the whole reference genome. Finally, the SNPs, output in Variant Call Format (VCF), were visualized with the Integrative Genomics Viewer (IGV) software. We found that visualizing SNPs and Indels is helpful and provides valuable insights into marker selection. We also found that applying a Chi-Square test of Hardy-Weinberg Equilibrium (HWE) to all target genotypes, namely homozygous reference 0/0, heterozygous variant 0/1 and homozygous variant 1/1, significantly reduces the false positive rate, and we showed that our pipeline is efficient for RADSeq-based marker selection.
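The Chi-Square test of HWE mentioned in the abstract can be illustrated with a minimal sketch. This is not the author's R implementation, which is not shown here; it is the textbook goodness-of-fit test for a single biallelic site, written in Python. The function names `hwe_chi_square` and `passes_hwe`, and the significance threshold `alpha = 0.05`, are assumptions made for illustration only.

```python
import math


def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square goodness-of-fit test for HWE at one biallelic site.

    Arguments are the observed sample counts of the genotypes
    0/0, 0/1 and 1/1. Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotyped samples at this site")

    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele
    q = 1.0 - p                            # alternate allele

    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)

    # Classes with zero expectation (monomorphic sites) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_hwe(n_ref_hom, n_het, n_alt_hom, alpha=0.05):
    """Hypothetical filter: keep a site if HWE is not rejected at level alpha."""
    _, p_value = hwe_chi_square(n_ref_hom, n_het, n_alt_hom)
    return p_value >= alpha


if __name__ == "__main__":
    # 25 samples 0/0, 50 samples 0/1, 25 samples 1/1: exactly HWE proportions.
    print(hwe_chi_square(25, 50, 25))  # (0.0, 1.0)
    # No heterozygotes at all: a strong departure from HWE.
    print(passes_hwe(50, 0, 50))       # False
```

One degree of freedom is used because there are three genotype classes and one allele frequency is estimated from the data. For sites with small expected counts an exact test, such as the one described in [33], may be preferable to the Chi-Square approximation.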
Published in | European Journal of Biophysics (Volume 6, Issue 1) |
DOI | 10.11648/j.ejb.20180601.12 |
Page(s) | 7-16 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2018. Published by Science Publishing Group |
NGS-RADSeq, Arabidopsis thaliana (TAIR10), GATK, SAMTools, Chi-Square Test, HWE-P, Reliable SNPs
[1] | Allen, R. S., Nakasugi, K., Doran, R. L., Millar, A. A., & Waterhouse, P. M. (2013). Facile mutant identification via a single parental backcross method and application of whole genome sequencing based mapping pipelines. Frontiers in plant science, 4. |
[2] | Almgren, P., Bendahl, P., Bengtsson, H., Hössjer, O., & Perfekt, R. (2003). Statistics in genetics. Lecture notes, Lund. |
[3] | Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G., & Hohenlohe, P. A. (2016). Harnessing the power of radseq for ecological and evolutionary genomics. Nature Reviews Genetics, 17 (2), 81–92. |
[4] | Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A.,... Johnson, E. A. (2008). Rapid snp discovery and genetic mapping using sequenced rad markers. PloS one, 3 (10), e3376. |
[5] | Bergelson, J., Kreitman, M., & Nordborg, M. (n. d.). Columbia col-0: n28167 or cs28167 [Computer software manual]. Retrieved from https://www.arabidopsis.org/abrc/catalog/natural a ccession 5 html. |
[6] | Catchen, J., Hohenlohe, P. A., Amores, S. B. A., & Cresko, W. A. (2014 Nov 25). Stacks: An analysis tool set for population genomics. NIH Public Access, PMC. |
[7] | Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A.,... others (2011). The variant call format and vcftools. Bioinformatics, 27 (15), 2156–2158. |
[8] | Davey, J., & Blaxter, M. L. (2011). Radseq: next-generation population genetics. Briefings in Functional Genomics, 9, 108. |
[9] | De Pristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C.,... others (2011). A framework for variation discovery and genotyping using next-generation dna sequencing data. Nature genetics, 43 (5), 491–498. |
[10] | De Summa, S., Malerba, G., Pinto, R., Mori, A., Mijatovic, V., & Tommasi, S. (2017). Gatk hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC bioinformatics, 18 (5), 119. |
[11] | Emigh, T. H. (1980). A comparison of tests for hardy-weinberg equilibrium. Biometrics, 627–642. Group, S. F. S. W., et al. (2014). Sequence alignment/map format specification. Tech. rep. Version 1. 2015. URL: http://samtools.github.io/hts-specs/SAMv1.pdf (visited on 01/04/2015). |
[12] | Herzeel, C., Costanza, P., Ashby, T., & Wuyts, R. (2013). Performance analysis of bwa alignment (Tech. Rep.). Technical Report Exascience Life Lab. |
[13] | Ishii, K., Kazama, Y., Hirano, T., Hamada, M., Ono, Y., Yamada, M., & Abe, T. (2016). Amap: A pipeline for whole-genome mutation detection in arabidopsis thaliana. Genes & genetic systems, 91 (4), 229–233. |
[14] | Kosugi, S., Natsume, S., Yoshida, K., MacLean, D., Cano, L., Kamoun, S., & Terauchi, R. (2013). Coval: Improving alignment quality and variant calling accuracy for next-generation sequencing data. PLoS one. |
[15] | Li, H. (2010). Mathematical notes on samtools algorithms. October. |
[16] | Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv. |
[17] | Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30 (20), 2843–2851. |
[18] | Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with burrows–wheeler-transform. Bioinformatics, 26 (5), 589–595. |
[19] | Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,... Durbin, R. (2009). The sequence alignment/map format and samtools. Bioinformatics, 25 (16), 2078–2079. |
[20] | Marrano, A., Birolo, G., Prazzoli, M. L., Lorenzi, S., Valle, G., & Grando, M. S. (2017). Snp-discovery by rad-sequencing in a germplasm collection of wild and cultivated grapevines (v. vinifera l.). PloS one, 12 (1), e0170655. |
[21] | McCormick, R. F., Truong, S. K., & Mullet, J. E. (2015). Rig: recalibration and interrelation of genomic sequence data with the gatk. G3: Genes, Genomes, Genetics, 5 (4), 655–665. |
[22] | McKenna, A., et al. (2016). The genome analysis toolkit: A mapreduce framework for analyzing next-generation dna sequencing data. Genome research. Published in advance Jul. 19, 2010. |
[23] | Molnar, M., & Ilie, L. (2015). Correcting illumina data. Briefings in bioinformatics, 16, 588–599. |
[24] | Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and snp calling from next-generation sequencing data. Nature Reviews Genetics, 12 (6), 443–451. |
[25] | Ossowski, S., Schneeberger, K., Clark, R. M., Lanz, C., Warthmann, N., & Weigel, D. (2008). Sequencing of natural strains of arabidopsis thaliana with short reads. Genome research, 18 (12), 2024–2033. |
[26] | Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2009). The sanger fastq file format for sequences with quality scores, and the solexa/illumina fastq variants. Nucleic acids research, 38 (6). |
[27] | Runs, E. S. (n. d.). Estimating sequencing coverage. |
[28] | Thorvaldsdóttir, H., Robinson, J. T., & Mesirov, J. P. (2013). Integrative genomics viewer (igv): high-performance genomics data visualization and exploration. Briefings in bioinformatics. |
[29] | Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy Moonshine, A.,... others (2013). From fastq data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in bioinformatics, 11–10. |
[30] | Wang, J., Scofield, D., Street, N. R., & Ingvarsson, P. K. (2015). Variant calling using ngs data in european aspen (populus tremula). In Advances in the understanding of biological sciences using next generation sequencing (ngs) approaches (pp. 43–61). Springer. |
[31] | Warden, C. D., Adamson, A. W., Neuhausen, S. L., & Wu, X. (2014). Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ, 2, e600. |
[32] | Weigel, D., & Mott, R. (2009). The 1001 genomes project for arabidopsis thaliana. Genome biology, 10 (5), 107. |
[33] | Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of hardy-weinberg equilibrium. The American Journal of Human Genetics, 76 (5), 887–893. |
APA Style
Hanan Begali. (2018). A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (RADSeq). European Journal of Biophysics, 6(1), 7-16. https://doi.org/10.11648/j.ejb.20180601.12
ACS Style
Hanan Begali. A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (RADSeq). Eur. J. Biophys. 2018, 6(1), 7-16. doi: 10.11648/j.ejb.20180601.12
AMA Style
Hanan Begali. A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (RADSeq). Eur J Biophys. 2018;6(1):7-16. doi: 10.11648/j.ejb.20180601.12
@article{10.11648/j.ejb.20180601.12,
  author = {Hanan Begali},
  title = {A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (RADSeq)},
  journal = {European Journal of Biophysics},
  volume = {6},
  number = {1},
  pages = {7-16},
  doi = {10.11648/j.ejb.20180601.12},
  url = {https://doi.org/10.11648/j.ejb.20180601.12},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ejb.20180601.12},
  year = {2018}
}
TY - JOUR
T1 - A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (RADSeq)
AU - Hanan Begali
Y1 - 2018/01/20
PY - 2018
N1 - https://doi.org/10.11648/j.ejb.20180601.12
DO - 10.11648/j.ejb.20180601.12
T2 - European Journal of Biophysics
JF - European Journal of Biophysics
JO - European Journal of Biophysics
SP - 7
EP - 16
PB - Science Publishing Group
SN - 2329-1737
UR - https://doi.org/10.11648/j.ejb.20180601.12
VL - 6
IS - 1
ER -