The mechanism of prokaryotic gene expression remains incompletely understood. Promoters are genomic regions located upstream of genes that regulate gene expression. Although an increasing number of E. coli K-12 promoter sequences have been determined experimentally, and some regions such as the -10 and -35 regions have been described, the features of promoter sequences are far from fully characterized. Here, we address this challenge using an approach based on a deep convolutional neural network (CNN). We collected six classes of E. coli K-12 promoter sequences from the RegulonDB database, all annotated with strong evidence and each belonging to only one promoter class. We then applied the CNN model to recognize the six classes of promoters; the model achieved an accuracy above 97% for all six classes. Next, we extracted the weight matrix of the last convolutional layer of the CNN with the Grad-CAM algorithm and converted it to an information content matrix. Finally, we visualized the information content matrix as promoter logos using the Logomaker tool and identified the promoter features in the six classes. Our approach not only recovered previously described promoter feature regions but also discovered promoter features with better sensitivity and accuracy. We provide a novel computational approach for discovering features in biological sequences.
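The conversion from a weight matrix to an information content matrix can be illustrated with a minimal sketch. The abstract does not give the exact formula used, so the code below assumes the standard sequence-logo definition (per-position information = log2(4) minus Shannon entropy, distributed over bases by their relative weight). The helper names `one_hot` and `information_content`, the pseudocount, and the toy sequences are all hypothetical; the toy count matrix merely stands in for a non-negative weight matrix such as one derived from Grad-CAM.

```python
import numpy as np

ALPHABET = "ACGT"  # column order assumed for this sketch


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix.

    Bases outside A/C/G/T (e.g. N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    mat = np.zeros((len(seq), len(ALPHABET)), dtype=float)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            mat[pos, index[base]] = 1.0
    return mat


def information_content(weights, pseudocount=1e-9):
    """Convert a non-negative (L, 4) weight matrix to bits per base.

    Assumption: standard sequence-logo information content, not
    necessarily the exact conversion used by the authors.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    # Normalize each position to a probability distribution over bases.
    smoothed = weights + pseudocount
    probs = smoothed / smoothed.sum(axis=1, keepdims=True)
    # Shannon entropy per position, in bits.
    entropy = -(probs * np.log2(probs)).sum(axis=1, keepdims=True)
    # Total information per position: maximum entropy minus observed entropy.
    total_ic = np.log2(weights.shape[1]) - entropy
    # Letter heights: each base gets its share of the position's information.
    return probs * total_ic


# Toy stand-in for a learned weight matrix: base counts over three
# hypothetical -10-like hexamers (illustrative only, not real data).
toy_sites = ["TATAAT", "TATGAT", "TACAAT"]
counts = np.sum([one_hot(s) for s in toy_sites], axis=0)
ic_matrix = information_content(counts)
```

A matrix like `ic_matrix` could then be wrapped in a pandas DataFrame with columns A, C, G, T and passed to `logomaker.Logo` to draw the logo; fully conserved positions approach the 2-bit maximum, while uniform positions contribute no height.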
Published in | Computational Biology and Bioinformatics (Volume 8, Issue 1) |
DOI | 10.11648/j.cbb.20200801.13 |
Page(s) | 15-19 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2020. Published by Science Publishing Group |
Convolutional Neural Network (CNN), Promoter, Biological Sequence, Features
APA Style
Mengmeng Zhang, Lu Wang, Ping Wan. (2020). Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network. Computational Biology and Bioinformatics, 8(1), 15-19. https://doi.org/10.11648/j.cbb.20200801.13
ACS Style
Mengmeng Zhang; Lu Wang; Ping Wan. Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network. Comput. Biol. Bioinform. 2020, 8(1), 15-19. doi: 10.11648/j.cbb.20200801.13
AMA Style
Mengmeng Zhang, Lu Wang, Ping Wan. Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network. Comput Biol Bioinform. 2020;8(1):15-19. doi: 10.11648/j.cbb.20200801.13
@article{10.11648/j.cbb.20200801.13,
  author = {Mengmeng Zhang and Lu Wang and Ping Wan},
  title = {Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network},
  journal = {Computational Biology and Bioinformatics},
  volume = {8},
  number = {1},
  pages = {15-19},
  doi = {10.11648/j.cbb.20200801.13},
  url = {https://doi.org/10.11648/j.cbb.20200801.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.cbb.20200801.13},
  year = {2020}
}
TY  - JOUR
T1  - Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network
AU  - Mengmeng Zhang
AU  - Lu Wang
AU  - Ping Wan
Y1  - 2020/06/20
PY  - 2020
N1  - https://doi.org/10.11648/j.cbb.20200801.13
DO  - 10.11648/j.cbb.20200801.13
T2  - Computational Biology and Bioinformatics
JF  - Computational Biology and Bioinformatics
JO  - Computational Biology and Bioinformatics
SP  - 15
EP  - 19
PB  - Science Publishing Group
SN  - 2330-8281
UR  - https://doi.org/10.11648/j.cbb.20200801.13
VL  - 8
IS  - 1
ER  -