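1. State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
2. Department of Chemistry, Fudan University, Shanghai 200243, China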
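LYU Jingwei (b. 1996), master's student. Research interest: mining of natural product biosynthetic genes. E-mail: jingwei_lv@sjtu.edu.cn
DING Wei (b. 1981), male, PhD, associate professor, doctoral supervisor. Research interests: microbial metabolism and synthetic biology. E-mail: weiding@sjtu.edu.cn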
Received: 2022-03-07
Revised: 2022-04-23
Published in print: 2022-12-31
LYU Jingwei, DENG Zixin, ZHANG Qi, DING Wei. Identification of RiPPs precursor peptides and cleavage sites based on deep learning[J]. Synthetic Biology Journal, 2022, 3(6): 1262-1276 DOI: 10.12211/2096-8280.2022-016.
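Thanks to rapid advances in gene sequencing technology, genome sequencing data have grown explosively. Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a large class of peptide natural products that have come into view over the past decade. These compounds are extremely widespread in nature, show rich diversity in both structure and biological activity, and are an important source of natural drugs. The discovery of RiPPs has relied mainly on low-throughput biological experiments; such traditional methods are accurate but costly. With the continued iteration of computing technology, bioinformatics tools including antiSMASH and RiPP-PRISM can greatly accelerate RiPPs mining, but they still cannot overcome the limitation of homology-based methods (for example, searching for conserved biosynthetic enzymes): they cannot effectively identify novel RiPPs with different biosynthetic mechanisms. Here, building for the first time on the natural language processing pre-trained model BERT, we propose four deep learning models that identify RiPPs from sequence data alone rather than from homology and genomic context. After validating and comparing the models, we identify the best-performing one, BERiPPs (bidirectional language model for enhancing the performance of identification of RiPPs precursor peptides). BERiPPs can identify RiPPs precursor peptides in an unbiased manner without considering genomic background, and can predict leader peptide cleavage sites through a conditional random field. It offers an approach to high-throughput mining of entirely new RiPPs and, to some extent, reveals the underlying biological relationship between precursor peptides and modifying enzymes.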
Genome sequencing data have grown explosively thanks to the rapid development of DNA sequencing technology. Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a class of peptide natural products that has gradually come into view over the last decade. These compounds are widely distributed in nature, diverse in structure and bioactivity, and are important sources of natural drugs. The discovery of RiPPs mainly relies on low-throughput biological experiments, which are accurate but costly. With the development of new information technologies, bioinformatics tools such as antiSMASH and RiPP-PRISM can greatly accelerate RiPPs mining. However, methods based on gene homology, such as searching for conserved biosynthetic enzymes, are still unable to effectively identify novel RiPPs with different biosynthetic mechanisms. Here, for the first time, four deep learning models based on the natural language processing pre-trained model BERT are proposed and trained on the same RiPPs dataset; they rely entirely on sequence data to identify RiPPs rather than on homology and genomic context information. Verification and comparison of these models show that the best model, BERiPPs, performs well on RiPPs identification and is as accurate as the homology-based method. BERiPPs can identify RiPPs precursor peptides and RiPPs classes in an unbiased manner regardless of the genomic background, extending the range of novel RiPPs captured by approximately 60% compared to homology-based approaches. When BERiPPs is combined with a conditional random field, the cleavage site of the leader peptide can be predicted indirectly and with high accuracy from the label assigned to each amino acid in the sequence. Deep learning based on a pre-trained model makes high-throughput mining of novel RiPPs possible in a manner different from gene context-dependent methods and reveals the underlying biological relationship between precursor peptides and modifying enzymes.
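The final step the abstract describes, reading a leader peptide cleavage site off per-residue labels, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-label tag set (`L` for leader, `C` for core), the emission scores, and the transition scores are all hypothetical values invented for the example. In the setting described here, the per-residue scores would presumably come from the BERT-based model and the transition scores from the trained conditional random field. The sketch runs Viterbi decoding for a linear-chain CRF and reports the index of the first core residue as the predicted cleavage site.

```python
from typing import Dict, List, Optional

LABELS = ["L", "C"]  # hypothetical tag set: L = leader residue, C = core residue

# Hypothetical transition scores (log-space). The large penalty on C -> L
# encodes the assumption that the leader always precedes the core.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "L": {"L": 0.0, "C": -1.0},
    "C": {"L": -100.0, "C": 0.0},
}


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label path for per-residue emission scores."""
    if not emissions:
        return []
    # best[i][y] = best score of any path ending at position i with label y
    best = [{y: emissions[0][y] for y in LABELS}]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITIONS[p][y])
            row[y] = best[-1][prev] + TRANSITIONS[prev][y] + scores[y]
            ptr[y] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    label = max(LABELS, key=lambda y: best[-1][y])
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    return path[::-1]


def cleavage_site(path: List[str]) -> Optional[int]:
    """0-based index of the first core residue, or None if no L -> C boundary."""
    for i in range(1, len(path)):
        if path[i - 1] == "L" and path[i] == "C":
            return i
    return None


# Toy example: made-up scores for a 6-residue peptide (not real model output).
sequence = "MKAGCT"
emissions = [
    {"L": 2.0, "C": 0.0},
    {"L": 1.5, "C": 0.2},
    {"L": 0.4, "C": 0.6},  # locally ambiguous residue
    {"L": 1.0, "C": 0.1},
    {"L": 0.0, "C": 2.0},
    {"L": 0.1, "C": 1.8},
]
path = viterbi(emissions)
site = cleavage_site(path)
print("".join(path), site)  # prints: LLLLCC 4
```

In this toy input, labeling each residue independently would tag the third residue as core and yield the inconsistent path `LLCLCC`. Decoding the whole sequence jointly gives `LLLLCC`, which has a single leader/core boundary and therefore a single cleavage site; that is the usual motivation for placing a CRF on top of per-residue scores.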