Workers AI - Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA support
1y ago
Source
CloudflareWorkers AI - Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA supportcloudflare.comHappy Developer Week 2025! Workers AI is excited to announce a couple of new features and improvements available today. Check out our blog for all the announcement details. Faster inference + New models We’re rolling out some in-place improvements to our models that can help speed up inference by 2-4x! Users of the models below will enjoy an automatic speed boost starting today: @cf/meta/llama-3.3-70b-instruct-fp8-fast gets a speed boost of 2-4x, leveraging techniques like speculative decoding, prefix caching, and an updated inference backend. @cf/baai/bge-small-en-v1.5 , @cf/baai/bge-base-en-v1.5 , @cf/baai/bge-large-en-v1.5 get an updated back end, which should improve inference times by 2x. With the bge models, we’re also announcing a new parameter called pooling which can take cls or mean as options. We highly recommend using pooling: cls which will help generate more accurate embeddings. However, embeddings generated with cls pooling are not backwards compatible with mean pooling. For this to not be a breaking change, the default remains as mean pooling. Please specify pooling: cls to enjoy more accurate embeddings going forward. We’re also excited to launch a few new models in our catalog to help round out your experience with Workers AI. We’ll be deprecating some older models in the future, so stay tuned for a deprecation announcement. Today’s new models include: @cf/mistralai/mistral-small-3.1-24b-instruct : a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling. @cf/google/gemma-3-12b-it : well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages. @cf/qwen/qwq-32b : a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini. @cf/qwen/qwen2.5-coder-32b-instruct : the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o. Batch Inference Introducing a new batch inference feature that allows you to send us an array of requests, which we will fulfill as fast as possible and send them back as an array. This is really helpful for large workloads such as summarization, embeddings, etc. where you don’t have a human-in-the-loop. Using the batch API will guarantee that your requests are fulfilled eventually, rather than erroring out if we don’t have enough capacity at a given time. Check out the tutorial to get started! Models that support batch inference today include: @cf/meta/llama-3.3-70b-instruct-fp8-fast @cf/baai/bge-small-en-v1.5 @cf/baai/bge-base-en-v1.5 @cf/baai/bge-large-en-v1.5 @cf/baai/bge-m3 @cf/meta/m2m100-1.2b Expanded LoRA support We’ve upgraded our LoRA experience to include 8 newer models, and can support ranks of up to 32 with a 300MB safetensors file limit (previously limited to rank of 8 and 100MB safetensors) Check out our LoRAs page to get started. Models that support LoRAs now include: @cf/meta/llama-3.2-11b-vision-instruct @cf/meta/llama-3.3-70b-instruct-fp8-fast @cf/meta/llama-guard-3-8b @cf/meta/llama-3.1-8b-instruct-fast (coming soon) @cf/deepseek-ai/deepseek-r1-distill-qwen-32b (coming soon) @cf/qwen/qwen2.5-coder-32b-instruct @cf/qwen/qwq-32b @cf/mistralai/mistral-small-3.1-24b-instruct @cf/google/gemma-3-12b-it
You might also wanna read
How Multi-Token Prediction drafters accelerate Gemma 4 inference by up to 3x
This article explains how Google's Gemma 4 models achieve up to 3x faster inference through Multi-Token Prediction (MTP) drafters and specul
General Compute Launches ASIC-Based Inference Cloud for Faster AI Agent Performance
General Compute is an inference cloud built on ASICs (purpose-built alternatives to Nvidia GPUs) designed specifically for AI inference, not
Factory: AI Coding Agents That Integrate with Existing Developer Workflows
Factory is a platform for AI coding agents called 'Droids' that integrates with developers' existing workflows and tools. Unlike other AI de

New Alibaba AI framework skips loading every tool, cutting agent token use 99%
VentureBeat·2d ago
How OpenClaw and AI agent harnesses are reshaping LLMs, inference, and CPU demand
The article discusses how AI agent harnesses like OpenClaw are transforming the LLM landscape by enabling models to automate complex tasks b
Understanding Continuous Batching in Large Language Models: From Attention Mechanisms to Throughput Optimization
This technical blog post explains continuous batching in large language models (LLMs) by starting from first principles of attention mechani
huggingface.co·4mo ago

Comments
Sign in to join the conversation.
No comments yet. Be the first.