In this paper, we focus on the task of instruction-based image editing. Previous works such as InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations remain: First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on text while rich image information is underexplored, making them inferior at following complex instructions and maintaining background consistency. To address these issues, we first curate the AdvancedEdit dataset using a novel data construction pipeline, forming a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject rich image information, we introduce a two-stream bridging mechanism that utilizes both the textual and visual features reasoned by a powerful Multimodal Large Language Model (MLLM) to guide the image editing process more precisely. Extensive experiments demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling at following complex instructions and maintaining high background consistency with the original image.
We propose an automated data construction pipeline focused on generating high-fidelity, fine-grained image-editing pairs with detailed instructions that demonstrate advanced reasoning and understanding. We categorize image editing tasks into three types: removal, addition, and replacement. The figure below presents our data preparation workflow.
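To make the structure of the resulting data concrete, the sketch below shows one way an editing pair produced by such a pipeline could be represented. The field names (`source_image`, `target_image`, `edit_type`, `instruction`, `mask`) are illustrative assumptions, not the released AdvancedEdit schema.

```python
# Illustrative record format for a constructed editing pair; field names are
# assumptions for exposition, not the actual AdvancedEdit schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EditType(Enum):
    REMOVAL = "removal"
    ADDITION = "addition"
    REPLACEMENT = "replacement"


@dataclass
class EditingPair:
    source_image: str           # path to the original image
    target_image: str           # path to the edited image
    edit_type: EditType         # one of the three task categories
    instruction: str            # detailed, reasoning-style instruction
    mask: Optional[str] = None  # optional region mask used during construction


# Example record for a replacement task.
sample = EditingPair(
    source_image="images/000123_src.png",
    target_image="images/000123_tgt.png",
    edit_type=EditType.REPLACEMENT,
    instruction="Replace the ceramic mug on the desk with a glass of orange juice.",
)
print(sample.edit_type.value)  # -> "replacement"
```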
The overall architecture of InsightEdit is depicted in the figure below. It mainly consists of a comprehension module, a bridging module, and a generation module. Specifically, the comprehension module leverages an MLLM to comprehend the image editing task; the bridging module integrates both text and image features into the denoising process of the diffusion model; and the generation module receives the editing guidance and generates the target image with the diffusion model.
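The PyTorch-style sketch below illustrates how such a three-module layout could be wired, assuming the bridging module projects the MLLM's text and image streams into the cross-attention conditioning space of a diffusion denoiser. The module names, dimensions, and call signatures are illustrative assumptions, not the InsightEdit implementation.

```python
# Minimal sketch of the three-module layout described above; names, dimensions,
# and the denoiser interface are assumptions for illustration only.
import torch
import torch.nn as nn


class BridgingModule(nn.Module):
    """Projects MLLM-reasoned text and image features into the conditioning
    space of the diffusion denoiser (the two-stream bridge)."""

    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(mllm_dim, cond_dim)
        self.image_proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate the two streams along the token axis so both can be
        # attended to by the denoiser's cross-attention layers.
        return torch.cat([self.text_proj(text_feats), self.image_proj(image_feats)], dim=1)


class InsightEditSketch(nn.Module):
    def __init__(self, comprehension: nn.Module, denoiser: nn.Module):
        super().__init__()
        self.comprehension = comprehension  # MLLM reasoning over (image, instruction)
        self.bridge = BridgingModule()
        self.denoiser = denoiser            # diffusion UNet conditioned via cross-attention

    def forward(self, noisy_latents, timesteps, source_image, instruction):
        # Comprehension module returns reasoned text and image feature streams.
        text_feats, image_feats = self.comprehension(source_image, instruction)
        cond = self.bridge(text_feats, image_feats)
        # Generation module: the denoiser predicts noise for the target image
        # conditioned on the bridged text + image guidance.
        return self.denoiser(noisy_latents, timesteps, encoder_hidden_states=cond)
```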