Apple Releases Pico-Banana-400K: Large-Scale Dataset for Text-Guided Image Editing Research

dvrp

7mo ago · 5 min read · en · Code

Summary

Apple has released Pico-Banana-400K, a dataset of roughly 400,000 text-image-edit triplets for research in text-guided image editing. It spans 35 edit operations across 8 semantic categories, from low-level color adjustments to high-level object, scene, and stylistic edits. It includes about 257K single-turn text-image-edit triplets for supervised fine-tuning (SFT) and about 56K single-turn preference examples, each pairing an instruction with a positive and a negative edited image. The dataset is hosted on GitHub in the apple/pico-banana-400k repository.
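
To make the two record shapes concrete, here is a minimal Python sketch of how an SFT triplet and a preference example might be represented. The class names, field names, and the `to_preference_pair` helper are hypothetical and chosen for illustration; they are not taken from the repository, whose actual file layout and schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    """One SFT-style record: an instruction, a source image, and its edit.

    Field names are illustrative assumptions, not the dataset's real schema.
    """
    instruction: str   # text describing the edit
    source_image: str  # path or URL of the original image
    edited_image: str  # path or URL of the edited result


@dataclass(frozen=True)
class PreferenceExample:
    """One preference-style record: a positive and a negative edit
    of the same source image under the same instruction."""
    instruction: str
    source_image: str
    positive_image: str  # the preferred edit
    negative_image: str  # the rejected edit

    def as_sft_triplet(self) -> EditTriplet:
        # Keeping only the positive edit yields an ordinary triplet.
        return EditTriplet(self.instruction, self.source_image, self.positive_image)


def to_preference_pair(example: PreferenceExample) -> dict:
    """Reshape a record into the prompt/chosen/rejected form that many
    preference-learning pipelines expect (hypothetical helper)."""
    return {
        "prompt": {"instruction": example.instruction, "image": example.source_image},
        "chosen": example.positive_image,
        "rejected": example.negative_image,
    }


if __name__ == "__main__":
    ex = PreferenceExample("make the sky orange", "src.jpg", "good.jpg", "bad.jpg")
    print(ex.as_sft_triplet())
    print(to_preference_pair(ex))
```

The point of the sketch is only that a preference example carries one more image than a triplet, so the same instruction and source can supervise both a target edit and a comparison between edits.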

Key quotes

Pico-Banana-400K is a large-scale dataset of ~400K text–image–edit triplets designed to advance research in text-guided image editing.

The dataset spans 35 edit operations across 8 semantic categories, covering diverse transformations—from low-level color adjustments to high-level object, scene, and stylistic edits.

~257K single-turn text–image–edit triplets for SFT, ~56K single-turn text-image(positive) - image(negative)-edit for preference

Snippet from the RSS feed

Contribute to apple/pico-banana-400k development by creating an account on GitHub.
