Workers AI - Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA support

Happy Developer Week 2025! Workers AI is excited to announce several new features and improvements available today. Check out our blog for all the announcement details.

Faster inference + New models

We’re rolling out in-place improvements to our models that can speed up inference by 2-4x. Users of the models below will get an automatic speed boost starting today:

- @cf/meta/llama-3.3-70b-instruct-fp8-fast gets a speed boost of 2-4x, leveraging techniques like speculative decoding, prefix caching, and an updated inference backend.
- @cf/baai/bge-small-en-v1.5, @cf/baai/bge-base-en-v1.5, and @cf/baai/bge-large-en-v1.5 get an updated backend, which should improve inference times by 2x.

With the bge models, we’re also announcing a new parameter called pooling, which accepts cls or mean. We highly recommend using pooling: cls, which will help generate more accurate embeddings. However, embeddings generated with cls pooling are not backwards compatible with embeddings generated with mean pooling. To avoid a breaking change, the default remains mean pooling. Please specify pooling: cls to get more accurate embeddings going forward.

We’re also excited to launch a few new models in our catalog to help round out your experience with Workers AI. We’ll be deprecating some older models in the future, so stay tuned for a deprecation announcement. Today’s new models include:

- @cf/mistralai/mistral-small-3.1-24b-instruct: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.
- @cf/google/gemma-3-12b-it: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning, with a 128K context window and multilingual support in over 140 languages.
- @cf/qwen/qwq-32b: a medium-sized reasoning model capable of competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1 and o1-mini.
- @cf/qwen/qwen2.5-coder-32b-instruct: the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.

Batch Inference

We’re introducing a new batch inference feature that lets you send us an array of requests, which we will fulfill as fast as possible and send back as an array. This is really helpful for large workloads such as summarization and embeddings, where you don’t have a human in the loop. Using the batch API guarantees that your requests are fulfilled eventually, rather than erroring out if we don’t have enough capacity at a given time. Check out the tutorial to get started!

Models that support batch inference today include:

- @cf/meta/llama-3.3-70b-instruct-fp8-fast
- @cf/baai/bge-small-en-v1.5
- @cf/baai/bge-base-en-v1.5
- @cf/baai/bge-large-en-v1.5
- @cf/baai/bge-m3
- @cf/meta/m2m100-1.2b

Expanded LoRA support

We’ve upgraded our LoRA experience to include 8 newer models, and we can now support ranks of up to 32 with a 300MB safetensors file limit (previously limited to a rank of 8 and 100MB safetensors files). Check out our LoRAs page to get started.

Models that support LoRAs now include:

- @cf/meta/llama-3.2-11b-vision-instruct
- @cf/meta/llama-3.3-70b-instruct-fp8-fast
- @cf/meta/llama-guard-3-8b
- @cf/meta/llama-3.1-8b-instruct-fast (coming soon)
- @cf/deepseek-ai/deepseek-r1-distill-qwen-32b (coming soon)
- @cf/qwen/qwen2.5-coder-32b-instruct
- @cf/qwen/qwq-32b
- @cf/mistralai/mistral-small-3.1-24b-instruct
- @cf/google/gemma-3-12b-it
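Because cls and mean embeddings from the bge models are not compatible with each other, it can help to record the pooling mode next to every stored vector and refuse to compare across modes. The Python sketch below is one way to do that, not an official client. The `pooling` parameter and its two values come from this announcement; the `text` request field, the `StoredEmbedding` class, and the helper names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

MODEL = "@cf/baai/bge-base-en-v1.5"


def build_embedding_payload(texts, pooling="mean"):
    """Build a request body for a bge embedding model.

    `pooling` is the parameter from the announcement ("cls" or "mean").
    The `text` field name is an assumption; check the model's schema.
    """
    if pooling not in ("cls", "mean"):
        raise ValueError("pooling must be 'cls' or 'mean'")
    return {"text": list(texts), "pooling": pooling}


@dataclass(frozen=True)
class StoredEmbedding:
    """Hypothetical record that keeps the pooling mode with the vector."""
    vector: tuple
    pooling: str


def cosine_similarity(a, b):
    """Cosine similarity that rejects vectors from different pooling modes."""
    if a.pooling != b.pooling:
        raise ValueError("cannot compare cls-pooled and mean-pooled embeddings")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)
```

In practice this means an existing index built with the default mean pooling would need to be re-embedded before switching queries to pooling: cls.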
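For the batch inference feature, a large workload usually has to be split into arrays of requests before it is sent. The Python sketch below only prepares those arrays and makes no network calls. The `requests` wrapper key, the `text` field, and the `max_per_batch` limit are assumptions made for illustration; the tutorial is the authority on the actual request shape and on how results are retrieved.

```python
import json

# One of the models listed as supporting batch inference.
BATCH_MODEL = "@cf/baai/bge-base-en-v1.5"


def build_batches(texts, max_per_batch):
    """Split `texts` into batch payloads of at most `max_per_batch` requests.

    Each payload is a dict with a `requests` array (assumed key name),
    where every element is one embedding request (assumed `text` field).
    """
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    texts = list(texts)
    batches = []
    for start in range(0, len(texts), max_per_batch):
        chunk = texts[start:start + max_per_batch]
        batches.append({"requests": [{"text": t} for t in chunk]})
    return batches


def serialize(batch):
    """Encode one batch payload as a JSON string ready to be sent."""
    return json.dumps(batch)


if __name__ == "__main__":
    for payload in build_batches(["doc one", "doc two", "doc three"], 2):
        print(BATCH_MODEL, serialize(payload))
```

Keeping payload construction separate from the transport makes it straightforward to swap in whichever client the tutorial recommends.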

