SPARCLE: A Speaker-Aware Grapheme Model for Improved Text-to-Speech Synthesis

[Submitted on 1 May 2026]

Summary

This paper introduces SPARCLE, a speaker-aware grapheme representation model for text-to-speech (TTS) synthesis. Phoneme-based approaches depend on grapheme-to-phoneme (G2P) systems; SPARCLE instead enriches character representations with speaker-specific acoustic information, using a contrastive learning objective that aligns graphemes with Wav2Vec2 representations. The model improves TTS generation quality, particularly in low-resource settings: in the extreme low-resource case, it halves word error rates relative to standard grapheme-based models.
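To make the idea of a contrastive alignment objective concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the exact loss, so InfoNCE with cosine similarity is a stand-in, and `condition_on_speaker` (element-wise addition of a speaker vector), the `temperature` value, and the toy 2-dimensional vectors are all hypothetical. The acoustic vectors stand in for Wav2Vec2 features; no real model is loaded.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def condition_on_speaker(grapheme_vec, speaker_vec):
    """Hypothetical conditioning: add a speaker vector to a grapheme vector."""
    return [g + s for g, s in zip(grapheme_vec, speaker_vec)]


def info_nce(text_vecs, acoustic_vecs, temperature=0.1):
    """InfoNCE-style loss: text_vecs[i] should match acoustic_vecs[i]
    better than it matches any other acoustic_vecs[j]."""
    n = len(text_vecs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(text_vecs[i], acoustic_vecs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the correct (i, i) pair.
        total += log_denominator - logits[i]
    return total / n


# Toy data (assumed, for illustration only).
graphemes = [[1.0, 0.0], [0.0, 1.0]]
speaker = [0.1, 0.1]
acoustic = [[1.0, 0.1], [0.1, 1.0]]  # stand-ins for Wav2Vec2 features

conditioned = [condition_on_speaker(g, speaker) for g in graphemes]
aligned_loss = info_nce(conditioned, acoustic)
shuffled_loss = info_nce(conditioned, list(reversed(acoustic)))
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

In this toy setup the loss is small when each conditioned grapheme vector is paired with its own acoustic vector and large when the pairing is shuffled, which is the behavior a contrastive objective rewards during training.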

Source

SPARCLE: A Speaker-Aware Grapheme Model for Improved Text-to-Speech Synthesis (arxiv.org)

Key quotes

Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling.

Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings.

SPARCLE is trained with a contrastive objective to align graphemes with corresponding Wav2Vec2 acoustic representations while conditioned on speaker identity.

We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models.
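For readers unfamiliar with the metric behind that last quote, the sketch below computes word error rate (WER) as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The quotes do not describe the paper's evaluation pipeline; for TTS, WER is commonly obtained by transcribing the synthesized audio with a speech recognizer and comparing the transcript with the input text, and that is assumed here. The function name `word_error_rate` and the example sentences are illustrative and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


# Made-up transcripts, for illustration only.
reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on the mat"))  # 0.0
print(word_error_rate(reference, "a cat sit on mat"))        # 0.5
```

Under this definition, "reducing word error rates by half" means the ratio of edit operations to reference words drops to half its baseline value, for example from 0.5 to 0.25 on the same test set.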

Snippet from the RSS feed

Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speak
