

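1. CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
2. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA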
Received: 02 January 2025
Revised: 04 March 2025
Published: 30 June 2025
ZHANG Chengxin. Challenges and opportunities in text mining-based protein function annotation[J]. Synthetic Biology Journal, 2025, 6(3): 603-616. DOI: 10.12211/2096-8280.2025-002.
Understanding the biological function of proteins is a prerequisite for successful quantitative synthetic biology. Except for a small number of model organisms, most species contain many proteins whose functions have not been experimentally characterized, which makes accurate, automated protein function annotation methods essential. Recent progress in protein bioinformatics, particularly in predicting protein structures and functions, has been driven largely by artificial intelligence (AI) algorithms, most notably deep learning models. For instance, the top-ranked methods in recent Critical Assessment of Function Annotation (CAFA) challenges have used deep learning models, primarily large language models, to perform text mining-based protein function annotation. These methods predict Gene Ontology (GO) terms either directly from text features extracted from the scientific literature or by transferring annotations from template proteins with similar literature. Despite extensive work on increasingly powerful deep learning models for text mining-based protein function annotation, several major challenges in parsing scientific literature data have long been overlooked. This manuscript first reviews existing methods and challenges in protein function annotation. First, many text mining-based protein function predictors rely exclusively on PubMed abstracts collected manually by UniProt curators for the query protein, ignoring literature that has not yet been reviewed by biocurators. Consequently, protein functions predicted by text mining might overlap with those already obtained through manual curation in the UniProt Gene Ontology Annotation database. Second, nearly all methods parse only PubMed abstracts, ignoring the more informative full-text documents often available in the PubMed Central and Europe PMC repositories. Third, few studies automatically differentiate between categories of literature, such as low-throughput experiments, high-throughput studies, and computational predictions, which considerably increases the difficulty of text-based function annotation. This manuscript also proposes promising approaches that use the latest developments in AI to enhance text mining-based protein function annotation. These approaches are expected to contribute to next-generation text mining tools that target the current difficulties in text data processing and deliver more accurate function annotation.
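As a rough illustration of the second strategy mentioned in the abstract (predicting GO terms through template proteins with similar literature), the following minimal sketch transfers GO terms from the templates whose abstracts are most similar to the query abstract. It is not the method of any CAFA participant: plain TF-IDF with cosine similarity stands in for the language-model embeddings those methods use, and the abstracts, the `TEMPLATES` list, and the function names are hypothetical toy data invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_go(query_text, templates, top_k=2):
    """Score GO terms by similarity-weighted voting over the top_k templates."""
    vectors = tfidf_vectors([query_text] + [text for text, _ in templates])
    query_vec, template_vecs = vectors[0], vectors[1:]
    sims = [cosine(query_vec, vec) for vec in template_vecs]
    # Rank templates by similarity and drop those sharing no vocabulary.
    ranked = sorted(zip(sims, templates), key=lambda pair: pair[0], reverse=True)
    neighbours = [(sim, go_terms) for sim, (_, go_terms) in ranked[:top_k] if sim > 0]
    total = sum(sim for sim, _ in neighbours)
    scores = defaultdict(float)
    for sim, go_terms in neighbours:
        for go_term in go_terms:
            scores[go_term] += sim / total
    return dict(scores)


# Hypothetical template abstracts paired with GO annotations (toy data).
TEMPLATES = [
    ("This kinase phosphorylates serine residues and binds ATP in vitro",
     {"GO:0004674", "GO:0005524"}),  # Ser/Thr kinase activity, ATP binding
    ("Transcription factor recognizing promoter DNA to activate gene expression",
     {"GO:0003677"}),                # DNA binding
    ("ATP dependent kinase activity measured after substrate phosphorylation",
     {"GO:0005524"}),                # ATP binding
]

QUERY = "The protein is a kinase that binds ATP and phosphorylates substrates"
predictions = predict_go(QUERY, TEMPLATES, top_k=2)
for go_term, score in sorted(predictions.items(), key=lambda kv: -kv[1]):
    print(go_term, round(score, 3))
```

In this toy setting, ATP binding receives the highest score because both retained templates carry it, while DNA binding is never proposed because its template shares no vocabulary with the query. A realistic pipeline would replace the TF-IDF vectors with embeddings from a biomedical language model and would need to address the three literature-handling issues raised in the abstract.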