Apple Releases Pico-Banana-400K: Large-Scale Dataset for Text-Guided Image Editing Research
By
dvrp
Summary
Apple has released Pico-Banana-400K, a large-scale dataset of approximately 400,000 text-image-edit triplets designed to advance research in text-guided image editing. The dataset spans 35 edit operations across 8 semantic categories, ranging from low-level color adjustments to high-level object, scene, and stylistic edits. It includes ~257K single-turn text-image-edit triplets for supervised fine-tuning (SFT) and ~56K single-turn preference examples, each carrying a positive and a negative image, for preference learning. The dataset is hosted on GitHub as an open-source contribution to the research community.
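To make the two record shapes concrete, here is a minimal Python sketch of how an SFT triplet and a preference example might be modeled. The class names, field names, and the helper `to_sft_triplet` are hypothetical and chosen for illustration only; they are not the dataset's actual schema. The sketch also assumes that each preference example keeps the source image alongside the positive and negative edits, which the summary does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    """One SFT record: instruction, source image, edited image (hypothetical fields)."""
    instruction: str   # text describing the edit
    source_image: str  # path or URL of the original image
    edited_image: str  # path or URL of the edited result


@dataclass(frozen=True)
class PreferenceExample:
    """One preference record with a positive and a negative edit (hypothetical fields)."""
    instruction: str
    source_image: str     # assumption: the source image is kept with the pair
    positive_image: str   # the preferred edit
    negative_image: str   # the rejected edit


def to_sft_triplet(example: PreferenceExample) -> EditTriplet:
    """Keep only the positive edit, turning a preference record into a triplet."""
    return EditTriplet(
        instruction=example.instruction,
        source_image=example.source_image,
        edited_image=example.positive_image,
    )


if __name__ == "__main__":
    pref = PreferenceExample(
        instruction="make the sky pink",
        source_image="src.png",
        positive_image="good.png",
        negative_image="bad.png",
    )
    print(to_sft_triplet(pref))
```

Under these assumptions, a preference record is a superset of a triplet, so the positive side can be reused for supervised fine-tuning while the full pair feeds a preference-learning objective.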
Key quotes
Pico-Banana-400K is a large-scale dataset of ~400K text–image–edit triplets designed to advance research in text-guided image editing.
The dataset spans 35 edit operations across 8 semantic categories, covering diverse transformations—from low-level color adjustments to high-level object, scene, and stylistic edits.
~257K single-turn text–image–edit triplets for SFT, ~56K single-turn text-image(positive) - image(negative)-edit for preference
