CraneAI Labs Releases v1.2 of Streaming ASR Model for Luganda, Shona, and Swahili
Summary
CraneAI Labs has released version 1.2 of crane-nemo-asr, a streaming automatic speech recognition (ASR) model fine-tuned from NVIDIA's nemotron-3.5-asr-streaming-0.6b. It supports Luganda, Shona, and Swahili, with English retained, and uses a FastConformer Cache-Aware RNN-Transducer architecture with roughly 600M parameters. The main change in v1.2 is the recovery of training clips longer than 20 seconds, which earlier versions dropped because the trainer could not fit them; this brings long conversational monologues back into the training data. The model performs real-time, cache-aware streaming transcription and is conditioned on a language-ID prompt.
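To make the 20-second cutoff concrete, here is a minimal sketch of the kind of duration filter that would drop long clips from a training set. It is an illustration only: the clip records, the `duration` field name, and the `split_by_duration` helper are hypothetical, and the source does not say how CraneAI Labs implemented the filter or how v1.2 recovers the dropped clips.

```python
# Hypothetical illustration of a max-duration filter on training clips.
# Field names and the helper are assumptions, not CraneAI Labs' code.

MAX_DURATION_S = 20.0  # cutoff stated in the release notes


def split_by_duration(clips, max_duration_s=MAX_DURATION_S):
    """Partition clips into (kept, dropped); clips longer than the cutoff are dropped."""
    kept, dropped = [], []
    for clip in clips:
        if clip["duration"] <= max_duration_s:
            kept.append(clip)
        else:
            dropped.append(clip)
    return kept, dropped


# Toy data: durations in seconds.
clips = [
    {"id": "a", "duration": 4.2},
    {"id": "b", "duration": 20.0},
    {"id": "c", "duration": 37.5},
]

kept, dropped = split_by_duration(clips)
kept_ids = [c["id"] for c in kept]
dropped_ids = [c["id"] for c in dropped]
dropped_seconds = sum(c["duration"] for c in dropped)
```

Here `dropped` holds the clips that a filter like this would exclude, which is the portion of the data the release notes say version 1.2 brings back into training.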
Key quotes
A streaming automatic speech recognition model for Luganda, Shona, and Swahili (with English retained), fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b
The model transcribes conversational and read speech in real time (cache-aware streaming) and is conditioned on a language-ID prompt.
What's new in 1.2 — more training data, from the data we already had.
Earlier versions dropped every training clip longer than 20 s (long conversational monologues), because the trainer can't fit them. 1.2 recovers them.