Microsoft Open-Sources VibeVoice: A Speech-to-Text AI for Long-Form Audio Transcription
By tosh
Summary
Microsoft has open-sourced VibeVoice, a frontier voice AI system that includes VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass. The model generates structured transcriptions that record speaker identity (Who), timestamps (When), and content (What), and it supports user-customized context. VibeVoice-ASR is now part of a Hugging Face Transformers release, so it can be used directly through that library. The project is hosted on GitHub under the Microsoft organization, and the team has also added experimental speakers.
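To make the Who / When / What structure concrete, here is a minimal Python sketch of how one transcript segment might be represented and printed. It is an illustration only: the `Segment` class, its field names, and the `render` helper are hypothetical and are not taken from VibeVoice-ASR or from the Transformers API, whose actual output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment (hypothetical schema, not the model's real output)."""
    speaker: str    # Who: speaker label
    start_s: float  # When: segment start, in seconds
    end_s: float    # When: segment end, in seconds
    text: str       # What: transcribed content


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Render segments as one '[start-end] speaker: text' line each."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] "
        f"{s.speaker}: {s.text}"
        for s in segments
    )


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    demo = [
        Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
        Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text together in one record means a long recording can be filtered by speaker or sliced by time without re-parsing the transcript text.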
Key quotes
We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context.
VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.
We added experimental speakers to VibeVoice‑R