Stable Audio 3: Open-Source Latent Diffusion Models for Variable-Length Audio Generation

By guardienaveugle · 11d ago · 2 min read

Summary

Stable Audio 3 is a family of latent diffusion models (small, medium, large) for variable-length audio generation and editing. The models can generate several minutes of audio, support inpainting for targeted editing, and run on top of a novel semantic-acoustic autoencoder that compresses audio into a compact latent space for efficient generation. They are trained on licensed and Creative Commons data and generate music and sounds in under 2 seconds on an H200 GPU, or within a few seconds on a MacBook Pro M4. The weights of the small and medium models, which can run on consumer-grade hardware, are released together with the training and inference pipeline.
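
To make "variable-length" and "inpainting" a little more concrete, here is a minimal Python sketch of the bookkeeping such a system might need: how many latent frames cover a requested duration, and which frames a time-range edit would regenerate. This is not the Stable Audio 3 pipeline. The function names, the sample rate, and the `hop` value (audio samples per latent frame) are all hypothetical, chosen only for illustration.

```python
import math


def latent_frames(duration_s: float, sample_rate: int, hop: int) -> int:
    """Number of latent frames needed to cover `duration_s` seconds of audio.

    `hop` is the assumed number of audio samples per latent frame
    (a hypothetical autoencoder compression factor).
    """
    return math.ceil(duration_s * sample_rate / hop)


def inpaint_mask(n_frames: int, start_s: float, end_s: float,
                 sample_rate: int, hop: int) -> list[int]:
    """Mask over latent frames: 1 = regenerate, 0 = keep the original."""
    first = math.floor(start_s * sample_rate / hop)
    last = min(n_frames, math.ceil(end_s * sample_rate / hop))
    return [1 if first <= i < last else 0 for i in range(n_frames)]


def blend(known: list[float], generated: list[float],
          mask: list[int]) -> list[float]:
    """Keep original frames where mask is 0, take generated ones where it is 1."""
    return [g if m else k for k, g, m in zip(known, generated, mask)]


# Illustrative values only (not taken from the paper).
SAMPLE_RATE = 1000  # samples per second
HOP = 100           # samples per latent frame

n = latent_frames(3.0, SAMPLE_RATE, HOP)
mask = inpaint_mask(n, 1.0, 2.0, SAMPLE_RATE, HOP)
edited = blend([0.0] * n, [1.0] * n, mask)
```

With these toy values, only the latent frames spanning seconds 1 to 2 are replaced. In a diffusion-based inpainting setup, a mask like this would typically be applied during sampling rather than once at the end, but the source does not describe the exact mechanism.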

Key quotes

Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent.
We release the weights of small and medium, that can run on consumer-grade hardware, together with their training and inference pipeline.
Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than a 2s on an H200 GPU and less than a few seconds on a MacBook Pro M4.
Snippet from the RSS feed
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing