The growing volume of biomedical big data is driving a tremendous demand for advanced bioinformatics software.
Artificial intelligence (AI) is expected to accelerate and enhance the development process of bioinformatics software.
AI-augmented cloud computing is paving the way for the future of autonomous research.
Ma, X.-K., Yu, Y., Huang, T., et al. (2024). Bioinformatics software development: Principles and future directions. The Innovation Life 2(3): 100083. https://doi.org/10.59717/j.xinn-life.2024.100083
The growth of data at NCBI.
Comparison of the package managers pip and conda.
Diagram of the GitHub workflow.
The development of R package simpleGO.
Development of software for analyzing NGS datasets.
Global architecture of a large-scale platform for open science.