Modeling strategies for speech enhancement in the latent space of a neural audio codec

Author(s): Kammoun, Sofiene; Alameda-Pineda, Xavier; Leglaive, Simon
Source:
ICASSP 2026 - IEEE International Conference on Acoustics, Speech and Signal Processing, May 2026, Barcelona, Spain ; https://hal.science/hal-05335192
Subject Terms:
Speech enhancement; neural audio codec; autoregressive modeling; latent representations; discrete tokens; [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing
Document Type:
conference object
Language:
English

Additional Information
- Contributors:
  Institut d'Électronique et des Technologies du numéRique (IETR); Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes); Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - Ecole Polytechnique de l'Université de Nantes (Nantes Univ - EPUN); Nantes Université - pôle Sciences et technologie; Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ)-Nantes Université - pôle Sciences et technologie; Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ); Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN); Centre Inria de l'Université Grenoble Alpes; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA); ANR-23-CE23-0009,DEGREASE,Modèles génératifs et d'inférence par apprentissage profond pour le rehaussement de la parole faiblement supervisé(2023)
- Publication Information:
  CCSD
- Publication Date:
  2026
- Collection:
  Université Grenoble Alpes: HAL
- Subject Terms:
  Spain; Barcelona, Spain
- Abstract:
  Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline in which the NAC encoder is fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and adding encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
- Online Access:
  https://hal.science/hal-05335192
  https://hal.science/hal-05335192v3/document
  https://hal.science/hal-05335192v3/file/ICASSP_2026__copy_%20%281%29.pdf
- Rights:
  https://creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/OpenAccess
- Accession Number:
  edsbas.9F240E60
