Implementation and data mining of external biological databases

V. Langraf; K. Petrovičová; V. V. Brygadyrenko

doi:10.15421/0225114

V. Langraf, Constantine the Philosopher University in Nitra
K. Petrovičová, University of Agriculture in Nitra
V. V. Brygadyrenko, Oles Honchar Dnipro National University

Keywords: ITIS, SQL, query, data quality, SSMS.

Abstract

Integrating external biological databases allows researchers to consolidate information scattered across different sources into a single, shared system. In practice, data from projects such as GenBank, UniProt, and Ensembl are automatically retrieved, transformed into a unified format, and stored in a relational or NoSQL database using ETL processes. This approach keeps sequence data, gene annotations, and protein information consistent and ready for further analysis, and avoids the errors that come with manual copying or incorrect mapping of entities. The aim of this study was to design and implement a process for integrating data from the external ITIS (Integrated Taxonomic Information System) database into a relational database in a Microsoft SQL Server environment. After analysing the ITIS schemas and data formats, we prepared tools for automated ETL (Extract, Transform, Load), which loaded 19 source files containing taxonomic data and metadata using bulk import (BULK INSERT). Data normalisation and consistency checking ensured reliable linking of entities (identifiers, authors, comments, and vernaculars). To demonstrate the usefulness of the solution, we performed a preliminary analysis using SQL queries: the database contains 107,540 unique references to genera, of which the most numerous is the genus Euphorbia (5,009 records); the most comments on taxa were added in 2015 and 2001; and the highest frequency of publications was recorded in 2018–2023. These results confirm the suitability of MS SQL for systematic taxonomy studies and leave room for further automation of updates and for extending the analysis to temporal or geolocation trends.
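The load-and-query workflow described in the abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, a five-row toy sample instead of the 19 ITIS source files, and illustrative table and column names (taxonomic_units, tsn, unit_name1, unit_name2) that are not taken from the study. It parses a pipe-delimited file, trims the fields, loads the rows in one batch, and counts records per genus name.

```python
import csv
import io
import sqlite3

# Toy stand-in for one pipe-delimited source file (hypothetical rows and layout).
RAW = """tsn|unit_name1|unit_name2
1|Euphorbia|peplus
2|Euphorbia|esula
3|Quercus|robur
4|Euphorbia|
5| Quercus |alba
"""


def load(conn, text):
    """Extract rows from delimited text, trim the fields, and load them in one batch."""
    conn.execute(
        "CREATE TABLE taxonomic_units ("
        "tsn INTEGER PRIMARY KEY, unit_name1 TEXT NOT NULL, unit_name2 TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = [
        # Transform step: strip stray whitespace, store empty epithets as NULL.
        (int(r["tsn"]), r["unit_name1"].strip(), r["unit_name2"].strip() or None)
        for r in reader
    ]
    # executemany stands in here for BULK INSERT on SQL Server (an assumption
    # of this sketch, not the authors' tooling).
    conn.executemany("INSERT INTO taxonomic_units VALUES (?, ?, ?)", rows)
    return len(rows)


def genus_counts(conn):
    """Return (genus, record count) pairs, most frequent genus first."""
    return conn.execute(
        "SELECT unit_name1, COUNT(*) AS n FROM taxonomic_units "
        "GROUP BY unit_name1 ORDER BY n DESC, unit_name1"
    ).fetchall()


conn = sqlite3.connect(":memory:")
loaded = load(conn, RAW)
counts = genus_counts(conn)
print(loaded, counts)
```

On SQL Server the same GROUP BY query would run unchanged in SSMS once the files are loaded. The whitespace trimming in the transform step is what keeps " Quercus " and "Quercus" from being counted as two different genera.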

