<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Integrating AI Speech and Voice Recognition: Advancing Large Language Models with Speech-to-Phoneme and Text Recognition Systems</title>
  <journal>International Journal of Computational Linguistics Research</journal>
  <author>Jiang Hua</author>
  <volume>17</volume>
  <issue>1</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/ijclr/2026/17/1/18-34</doi>
  <url>https://www.dline.info/jcl/fulltext/v17n1/jclv17n1_2.pdf</url>
  <abstract>This paper proposes a novel Speech-to-Phoneme-to-Text (SPG) framework that integrates speech recognition
with Large Language Models (LLMs) through phoneme-level intermediation. The approach aims to enhance
LLMs' ability to process spoken input accurately, especially under challenging conditions such as accented
speech, background noise, or low-resource languages, while enabling real-time, multimodal, and speaker-aware
voice interaction.
The system comprises four core components: (1) a Speech-to-Phoneme (S2P) module, built on self-supervised
models such as Wav2Vec 2.0 or Wav2VecBERT 2.0 and fine-tuned to output International Phonetic Alphabet (IPA)
sequences; (2) a Phoneme-to-Text (P2G/LLM) module, which uses multilingual LLMs (e.g., mT5) to convert
phonemes into fluent text, leveraging contextual understanding for better disambiguation; (3) a Voice Recognition
Enhancer, which integrates speaker embeddings (e.g., from ECAPA-TDNN) for diarization and
personalization; and (4) innovative training and inference strategies, including Data Augmentation with
Noisy Phonemes (DANP), Top-K Marginalized (TKM) training, LoRA-based fine-tuning, and delayed fusion to
reduce latency and computational load.
Evaluation targets include &lt;5% Word Error Rate (WER) on CommonVoice, &lt;10% Phoneme Error Rate (PER),
&lt;3.5 tokens/sec latency, and &lt;5% speaker verification Equal Error Rate (EER). The framework is validated
on datasets such as CommonVoice, TIMIT, and LibriSpeech, as well as a multilingual PR corpus. TF-IDF
analysis of this corpus reveals language-specific expressions of concepts such as &quot;sustainability,&quot; underscoring
the need for semantic, rather than lexical, alignment in multilingual ASR. The SPG architecture advances
accessibility, cross-lingual transfer, and real-world robustness while reducing token usage by 86% compared
with direct audio-to-LLM baselines.</abstract>
</record>
