Alterbute: Editing Intrinsic Attributes of Objects in Images

1Google, 2The Hebrew University of Jerusalem, 3Reichman University
*Indicates Equal Advising

Abstract

We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) – fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

Material Edits

Color Edits

Shape Edits

Texture Edits


Visual Named Entity (VNE)

Visual Named Entities (VNEs) are fine-grained visual identity categories (e.g., "Porsche 911 Carrera", "iPhone 16 Pro") that reflect how people naturally refer to specific object types. Unlike broad categories (e.g., "car"), which are too coarse and permit excessive variation that conflicts with our intuitive sense of identity, or instance-level identifiers, which are overly restrictive and allow minimal variation, VNEs strike a practical balance. Specifically, VNEs group visually similar objects sharing a common semantic label, permitting variations in intrinsic and extrinsic attributes while preserving identity.
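To make the granularity concrete, a VNE cluster can be viewed as a shared label plus the set of object instances that carry it. The sketch below is purely illustrative; the field names are ours, not the paper's data format.

```python
# Illustrative record layout for a VNE cluster; all field names are
# assumptions, not the paper's actual data format.
from dataclasses import dataclass, field


@dataclass
class VNEObject:
    image_id: str                    # source image in the dataset
    bbox: tuple[int, int, int, int]  # object location (x0, y0, x1, y1)
    attributes: str                  # intrinsic description, e.g. "matte red paint"


@dataclass
class VNECluster:
    label: str                       # e.g. "Porsche 911 Carrera"
    members: list[VNEObject] = field(default_factory=list)
```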



We use Gemini to assign textual VNE labels to objects detected in OpenImages. VNE objects (e.g., "Porsche 911 Carrera") are grouped into VNE clusters, while unlabeled instances are filtered out. Example VNE clusters are shown on the right.
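A hypothetical sketch of this labeling step follows, using the public google-genai SDK. The model name, prompt wording, and filtering rule are assumptions; only the generate_content call pattern follows the SDK's documented usage.

```python
# Hypothetical VNE-labeling sketch; prompt text and model choice are assumptions.
from collections import defaultdict

from google import genai
from PIL import Image

client = genai.Client()  # expects GEMINI_API_KEY in the environment

VNE_PROMPT = (
    "Name the fine-grained visual identity of the main object in this image, "
    "e.g. 'Porsche 911 Carrera' or 'iPhone 16 Pro'. "
    "Answer exactly 'unknown' if no such name applies."
)


def label_crop(crop: Image.Image) -> str | None:
    """Return a VNE label for an object crop, or None if the model finds none."""
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=[crop, VNE_PROMPT]
    )
    label = response.text.strip()
    return None if label.lower() == "unknown" else label


def build_vne_clusters(object_crops: list[Image.Image]) -> dict[str, list[Image.Image]]:
    """Group labeled crops into VNE clusters; unlabeled instances are dropped."""
    clusters: dict[str, list[Image.Image]] = defaultdict(list)
    for crop in object_crops:
        if (label := label_crop(crop)) is not None:
            clusters[label].append(crop)
    return dict(clusters)
```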



For each VNE-labeled object, we additionally prompt Gemini to extract intrinsic attribute descriptions, which serve as textual prompts during training.
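A companion sketch, reusing the client from the block above; the prompt wording is again our own guess rather than the paper's actual prompt.

```python
# Hypothetical attribute-extraction prompt; wording is an assumption.
ATTR_PROMPT = (
    "Describe only the intrinsic attributes of the main object in this image: "
    "its color, material, texture, and shape, in one short sentence."
)


def describe_attributes(crop: Image.Image) -> str:
    """Return an intrinsic-attribute description used as a training prompt."""
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=[crop, ATTR_PROMPT]
    )
    return response.text.strip()
```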

Approach

Training


Alterbute fine-tunes a diffusion model for text-guided intrinsic attribute editing. Inputs are arranged in a 1 x 2 image grid. The left half contains the noisy latent of the target image, while the right half contains a reference image sampled from the same VNE cluster. The model is conditioned on this reference image, a textual prompt describing the desired intrinsic attributes, a background image, and a binary object mask (both represented as grids). The diffusion loss is applied only to the left half to focus the learning on the edited region.
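A minimal PyTorch sketch of this step is given below. The denoiser interface, the zero right halves of the conditioning grids, and the DDPM-style noising are simplifying assumptions; text encoding and latent extraction are not shown.

```python
# Sketch of the 1 x 2 grid assembly and the left-half diffusion loss.
import torch
import torch.nn.functional as F


def add_noise(x: torch.Tensor, noise: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Standard DDPM forward process; alpha_bar has shape (B, 1, 1, 1)."""
    return alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * noise


def training_step(denoiser, target_lat, ref_lat, bg_lat, mask_lat, text_emb, t, alpha_bar):
    """target_lat, ref_lat, bg_lat: (B, C, H, W) latents; mask_lat: (B, 1, H, W)."""
    noise = torch.randn_like(target_lat)
    noisy_target = add_noise(target_lat, noise, alpha_bar)

    # Left half: noisy target latent; right half: clean reference latent
    # sampled from the same VNE cluster.
    grid = torch.cat([noisy_target, ref_lat], dim=-1)             # (B, C, H, 2W)

    # Background and mask conditions arranged as matching grids
    # (zero right halves are a simplifying assumption).
    bg_grid = torch.cat([bg_lat, torch.zeros_like(bg_lat)], dim=-1)
    mask_grid = torch.cat([mask_lat, torch.zeros_like(mask_lat)], dim=-1)

    pred_noise = denoiser(grid, t, text_emb, bg_grid, mask_grid)  # noise prediction over the grid

    # Apply the diffusion loss only to the left (target) half.
    w = target_lat.shape[-1]
    return F.mse_loss(pred_noise[..., :w], noise)
```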

Inference


Using the same architecture (grid omitted for clarity), Alterbute edits the input image directly by reusing its original background and mask. For color, texture, or material edits, we use precise segmentation masks (top). For shape edits where the target geometry is unknown, we use coarse bounding-box masks (bottom).
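The two mask regimes can be summarized in a short sketch; the segmentation-to-bounding-box conversion below is our own illustration of the idea.

```python
# Choosing the inference mask: precise segmentation for appearance edits,
# coarse bounding box for shape edits (conversion logic is illustrative).
import torch


def inference_mask(seg_mask: torch.Tensor, edit_type: str) -> torch.Tensor:
    """seg_mask: (H, W) binary object segmentation of the input image."""
    if edit_type in {"color", "texture", "material"}:
        return seg_mask  # precise mask: the object's geometry is preserved

    # Shape edit: the target geometry is unknown, so dilate to the bounding box.
    ys, xs = torch.nonzero(seg_mask, as_tuple=True)
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    bbox_mask = torch.zeros_like(seg_mask)
    bbox_mask[y0:y1, x0:x1] = 1
    return bbox_mask
```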

Acknowledgements

We thank Shira Bar-On for creating the figures and visualizations. We also thank Tomer Golany, Dani Lischinski, Asaf Shul, Shmuel Peleg, Bar Cavia, and Nadav Magar for their valuable feedback and discussions. Tal Reiss is supported by the Google PhD Fellowship.

We thank the owners of the images on this site (link to attributions) for sharing their valuable assets.

BibTeX