SPARCLE: A Speaker-Aware Grapheme Model for Improved Text-to-Speech Synthesis
[Submitted on 1 May 2026]
Summary
This paper introduces SPARCLE, a speaker-aware grapheme representation model for text-to-speech (TTS) synthesis. Unlike traditional phoneme-based approaches, which rely on grapheme-to-phoneme (G2P) systems, SPARCLE enriches character representations with speaker-specific acoustic information. It is trained with a contrastive objective that aligns graphemes with their corresponding Wav2Vec2 acoustic representations, conditioned on speaker identity. The authors report improved TTS generation quality, with word error rates cut in half in extreme low-resource settings compared to standard grapheme-based models.
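To make the contrastive alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between speaker-conditioned grapheme embeddings and acoustic embeddings. The summary does not state the exact loss, the conditioning mechanism, or any hyperparameters, so everything here is an assumption for illustration: the function names `condition_on_speaker` and `info_nce_loss`, the additive speaker conditioning, and the temperature value are hypothetical, and the acoustic embeddings would in practice come from Wav2Vec2 rather than from plain arrays.

```python
import numpy as np


def condition_on_speaker(grapheme_emb, speaker_emb):
    """Hypothetical conditioning: add one speaker vector to every grapheme embedding.

    grapheme_emb: (batch, dim) array, one row per grapheme.
    speaker_emb:  (dim,) array for a single speaker.
    """
    return grapheme_emb + speaker_emb


def info_nce_loss(grapheme_emb, acoustic_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of grapheme_emb is treated as the positive match for row i of
    acoustic_emb; every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    g = grapheme_emb / np.linalg.norm(grapheme_emb, axis=1, keepdims=True)
    a = acoustic_emb / np.linalg.norm(acoustic_emb, axis=1, keepdims=True)
    logits = (g @ a.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the grapheme-to-acoustic and acoustic-to-grapheme directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Minimising a loss of this shape pulls each grapheme embedding toward its own acoustic frame and pushes it away from the other frames in the batch, which is one common way to realise "aligning" two representation spaces.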
Key quotes
Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling.
Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings.
SPARCLE is trained with a contrastive objective to align graphemes with corresponding Wav2Vec2 acoustic representations while conditioned on speaker identity.
We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models.
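The headline result is stated in terms of word error rate (WER). For TTS, WER is typically obtained by transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text, though the quotes above do not describe the paper's exact evaluation setup. As a sketch of the standard definition, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words; the function name `word_error_rate` and the whitespace tokenization below are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, "reducing word error rates by half" means the ratio for SPARCLE is about half the ratio for the baseline on the same test sentences.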