Scientometric Analysis of the Relationship Between Artificial Intelligence and Data Engineering: Trends, Collaboration, and Evolution
Keywords:
Artificial intelligence, Data Engineering, Data Preprocessing, Deep learning, Machine learning, Scientometrics
Abstract
Data engineering has become a fundamental step for artificial intelligence (AI) because the quality of
training depends on the quality of the data. This thematic connection is relatively new, and the
literature on the topic is scattered. This review therefore aims to identify the main
contributions in this area by applying the Tree of Science algorithm. This algorithm processes query
results from Scopus and Web of Science to classify articles into root, trunk, and branches. The main
result of the study was the identification of three emerging areas. The first focuses on the use of AI
to analyze and enhance complex scientific data, with an emphasis on solving specific optimization
challenges. The second addresses the practical implementation of AI, tackling issues such as data
cleaning and improving operational efficiency. The third highlights the development of AI from its
creation to its maintenance once implemented, leveraging data engineering as a key tool to enhance
AI training and performance. These findings are noteworthy because they shed light on the current
use of AI applications to optimize processes in various sectors. The ability to process large volumes
of data quickly improves efficiency and accelerates decision-making in sectors such as healthcare,
industry, and cybersecurity. This enables personalized diagnostics, process optimization, and
immediate responses to threats. These benefits, however, depend on data quality and on the proper
implementation of analysis systems, which ensure effective processing and reliable results.
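
The root, trunk, and branch classification mentioned above can be illustrated with a minimal sketch. The abstract does not describe how the Tree of Science algorithm ranks articles, so the degree-based rule below is an assumption made for illustration only: articles that are cited but cite nothing inside the set are treated as roots, articles that both cite and are cited as trunk, and articles that only cite others as branches. The function name `classify_tree` and the sample edges are hypothetical, and the citation pairs would in practice come from the Scopus and Web of Science exports.

```python
from collections import defaultdict


def classify_tree(citations):
    """Label each paper as 'root', 'trunk', or 'branch'.

    `citations` is an iterable of (citing, cited) pairs.
    Simplified heuristic (an assumption, not the published algorithm):
      root   -> cited by others, cites nothing inside the set
      trunk  -> both cites and is cited
      branch -> cites others, not yet cited
    """
    in_deg = defaultdict(int)   # how often a paper is cited
    out_deg = defaultdict(int)  # how many papers it cites
    papers = set()
    for citing, cited in citations:
        out_deg[citing] += 1
        in_deg[cited] += 1
        papers.update((citing, cited))

    labels = {}
    for paper in papers:
        if in_deg[paper] > 0 and out_deg[paper] == 0:
            labels[paper] = "root"
        elif in_deg[paper] > 0 and out_deg[paper] > 0:
            labels[paper] = "trunk"
        else:
            labels[paper] = "branch"
    return labels


# Hypothetical toy citation network: C and D are recent papers,
# B is an intermediate paper, A is a foundational one.
example_edges = [("C", "B"), ("B", "A"), ("C", "A"), ("D", "B")]
labels = classify_tree(example_edges)
```

A fuller implementation would rank articles within each group rather than only label them, for example by citation counts or a centrality measure; this sketch leaves ranking out.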