<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Integrating AI Speech and Voice Recognition: Advancing Large Language Models with Speech-to-Phoneme and Text Recognition Systems</title>
  <journal>International Journal of Computational Linguistics Research</journal>
  <author>Jiang Hua</author>
  <volume>17</volume>
  <issue>1</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/ijclr/2026/17/1/18-34</doi>
  <url>https://www.dline.info/jcl/fulltext/v17n1/jclv17n1_2.pdf</url>
  <abstract>This paper proposes a novel Speech-to-Phoneme-to-Text (SPG) framework that integrates speech recognition
with Large Language Models (LLMs) through phoneme-level intermediation. The approach aims to enhance
LLMs' ability to process spoken input accurately, especially under challenging conditions such as accented
speech, background noise, or low-resource languages, while enabling real-time, multimodal, and speaker-aware
voice interaction.
The system comprises four core components: (1) a Speech-to-Phoneme (S2P) module, built on self-supervised
models such as Wav2Vec 2.0 or Wav2VecBERT 2.0 and fine-tuned to output International Phonetic Alphabet (IPA)
sequences; (2) a Phoneme-to-Text (P2G/LLM) module, which uses multilingual LLMs (e.g., mT5) to convert
phonemes into fluent text, leveraging contextual understanding for better disambiguation; (3) a Voice Recognition
Enhancer, which integrates speaker embeddings (e.g., from ECAPA-TDNN) for diarization and
personalization; and (4) innovative training and inference strategies, including Data Augmentation with
Noisy Phonemes (DANP), Top-K Marginalized (TKM) training, LoRA-based fine-tuning, and delayed fusion to
reduce latency and computational load.
Evaluation targets include &lt;5% Word Error Rate (WER) on CommonVoice, &lt;10% Phoneme Error Rate (PER),
&lt;3.5 tokens/sec latency, and &lt;5% speaker verification Equal Error Rate (EER). The framework is validated
on datasets such as CommonVoice, TIMIT, and LibriSpeech, as well as a multilingual PR corpus. TF-IDF
analysis of this corpus reveals language-specific expressions of concepts such as &quot;sustainability,&quot; underscoring
the need for semantic, rather than lexical, alignment in multilingual ASR. The SPG architecture advances
accessibility, cross-lingual transfer, and real-world robustness while reducing token usage by 86% compared
with direct audio-to-LLM baselines.</abstract>
</record>
