Transcription from audio to text in Municipal Sessions in Planeta Rica

Jaime Andrés Ruiz-Melendres; Jorge Eliecer Gómez-Gómez

Authors

Jaime Andrés Ruiz-Melendres, Universidad de Córdoba, https://orcid.org/0009-0000-7511-4131
Jorge Eliecer Gómez-Gómez, Universidad de Córdoba, https://orcid.org/0000-0001-8746-9386

Keywords:

AI, artificial intelligence, audio-to-text, transcription, sessions, Whisper, Spleeter

Abstract

This paper addresses the manual transcription of municipal council sessions in Planeta Rica. It investigates open-source tools for automating audio-to-text transcription of these sessions, with the aim of improving the efficiency and accuracy of the process. It emphasizes integrating several models into the system, each handling a different aspect of the task, to raise transcription quality. Two artificial intelligence models are considered: OpenAI's Whisper, a general-purpose speech recognition model, and Deezer's Spleeter, an audio source separation tool that uses pre-trained models to separate voices from any audio track. An architecture is developed to integrate these models automatically: Python manages the artificial intelligence models, the application's backend is written in Go, and the frontend is built with Next.js/React. This made it possible to automate the transcription of Planeta Rica's municipal council sessions, improving both the efficiency and the accuracy of the process.
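
The two-stage flow described in the abstract (isolate the voices with Spleeter, then transcribe them with Whisper) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, directory layout, file name `session_01.mp3`, Whisper model size `small`, and language code `es` are all hypothetical, and the abstract does not say how the Go backend invokes the Python layer. The sketch only builds the command lines for the publicly documented `spleeter` and `whisper` CLIs, and running them requires both tools to be installed.

```python
from pathlib import Path
import subprocess


def build_pipeline_commands(audio_path, work_dir, model="small", language="es"):
    """Return the two CLI invocations: Spleeter vocal isolation, then Whisper.

    All default values here are illustrative assumptions, not settings
    taken from the paper.
    """
    audio = Path(audio_path)
    work = Path(work_dir)
    stems_dir = work / "stems"
    # Spleeter's default layout writes <output dir>/<input name>/vocals.wav
    vocals = stems_dir / audio.stem / "vocals.wav"

    # Stage 1: split the recording into vocals and accompaniment.
    separate = [
        "spleeter", "separate",
        "-p", "spleeter:2stems",
        "-o", str(stems_dir),
        str(audio),
    ]
    # Stage 2: transcribe only the isolated vocal track.
    transcribe = [
        "whisper", str(vocals),
        "--model", model,
        "--language", language,
        "--output_format", "txt",
        "--output_dir", str(work / "transcripts"),
    ]
    return [separate, transcribe]


def run_pipeline(audio_path, work_dir):
    """Run both stages in order and return the transcript directory."""
    for cmd in build_pipeline_commands(audio_path, work_dir):
        # check=True stops the pipeline if either tool fails.
        subprocess.run(cmd, check=True)
    return Path(work_dir) / "transcripts"


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without Spleeter or Whisper installed.
    for cmd in build_pipeline_commands("session_01.mp3", "out"):
        print(" ".join(cmd))
```

Keeping command construction separate from execution means a backend service could log or queue the commands before running them; whether the described system works this way is not stated in the abstract.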
