CraneAI Labs Releases v1.2 of Streaming ASR Model for Luganda, Shona, and Swahili

15h ago · 4 min read

technology, programming, ai/ml, speech recognition

Summary

CraneAI Labs has released version 1.2 of its crane-nemo-asr model, a streaming automatic speech recognition (ASR) system fine-tuned from NVIDIA's nvidia/nemotron-3.5-asr-streaming-0.6b. It supports Luganda, Shona, and Swahili (with English retained) and uses a FastConformer Cache-Aware RNN-Transducer architecture with roughly 600M parameters. The key change in v1.2 is more training data drawn from the existing corpus: earlier versions dropped every training clip longer than 20 seconds (mostly long conversational monologues) because the trainer could not fit them, and v1.2 recovers those clips. The model transcribes conversational and read speech in real time using cache-aware streaming and is conditioned on a language-ID prompt.
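To make the 20-second cutoff concrete, here is a minimal sketch of the kind of duration filter that would drop long clips from a training set. It is an illustration only, not CraneAI's pipeline: the JSON-lines manifest layout with `audio_filepath`, `duration`, and `text` fields, the file names, and the `split_by_duration` helper are all assumptions, and the release notes do not say how v1.2 actually recovers the long clips.

```python
import json

# Threshold quoted in the release notes: clips longer than 20 s were dropped.
MAX_DURATION_S = 20.0


def split_by_duration(manifest_lines, max_duration_s=MAX_DURATION_S):
    """Partition JSON-lines manifest entries into (kept, dropped) by duration.

    Hypothetical helper: assumes each line is a JSON object with a numeric
    "duration" field given in seconds.
    """
    kept, dropped = [], []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Only clips strictly longer than the threshold are dropped.
        if entry["duration"] > max_duration_s:
            dropped.append(entry)
        else:
            kept.append(entry)
    return kept, dropped


# Hypothetical manifest entries, for illustration only.
manifest = [
    '{"audio_filepath": "read_001.wav", "duration": 4.2, "text": "sample"}',
    '{"audio_filepath": "read_002.wav", "duration": 20.0, "text": "sample"}',
    '{"audio_filepath": "monologue_001.wav", "duration": 37.5, "text": "sample"}',
]

kept, dropped = split_by_duration(manifest)
print(f"{len(kept)} kept, {len(dropped)} dropped")  # 2 kept, 1 dropped
```

Running a filter like this over a corpus shows how much audio sits above the cutoff, which is the data the v1.2 notes describe as recovered.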

Source

Twitter / X · CraneAI Labs Releases v1.2 of Streaming ASR Model for Luganda, Shona, and Swahili · huggingface.co

Key quotes


A streaming automatic speech recognition model for Luganda, Shona, and Swahili (with English retained), fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b

The model transcribes conversational and read speech in real time (cache-aware streaming) and is conditioned on a language-ID prompt.

What's new in 1.2 — more training data, from the data we already had.

Earlier versions dropped every training clip longer than 20 s (long conversational monologues), because the trainer can't fit them. 1.2 recovers them.

We're on a journey to advance and democratize artificial intelligence through open source and open science.

