The growing volume of biomedical big data is driving a tremendous demand for advanced bioinformatics software.
Artificial intelligence (AI) is expected to accelerate and enhance the development process of bioinformatics software.
AI-augmented cloud computing is paving the way for the future of autonomous research.
Ma, X.-K., Yu, Y., Huang, T., et al. (2024). Bioinformatics software development: Principles and future directions. The Innovation Life 2(3): 100083. https://doi.org/10.59717/j.xinn-life.2024.100083
The growth of data at NCBI.
Comparison of the package managers pip and conda.
Diagram of the GitHub workflow.
The development of R package simpleGO.
Development of software for analyzing NGS datasets.
Global architecture of large-scale platform for open science.