We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. It allows changing the color, texture, material, and even shape of an object while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity, or use overly restrictive supervision that prevents meaningful intrinsic variation. Our method relies on two components: (i) a relaxed training objective that lets the model change both intrinsic and extrinsic attributes, conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context; at inference, we restrict extrinsic changes by reusing the original background and object mask, ensuring that only the desired intrinsic attributes are altered; and (ii) Visual Named Entities (VNEs): fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
Visual Named Entities (VNEs) are fine-grained visual identity categories (e.g., "Porsche 911 Carrera", "iPhone 16 Pro") that reflect how people naturally refer to specific object types. Unlike broad categories (e.g., "car"), which are too coarse and permit excessive variation that conflicts with our intuitive sense of identity, or instance-level identifiers, which are overly restrictive and allow minimal variation, VNEs strike a practical balance. Specifically, VNEs group visually similar objects sharing a common semantic label, permitting variations in intrinsic and extrinsic attributes while preserving identity.
We use Gemini to assign textual VNE labels to objects detected in Open Images. Objects sharing the same VNE label (e.g., "Porsche 911 Carrera") are grouped into VNE clusters, while unlabeled instances are filtered out. Example VNE clusters are shown on the right.
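As a rough illustration of this labeling step, the sketch below assumes a hypothetical `query_vlm` helper built on the `google-generativeai` SDK; the prompt wording, model name, and "unknown" sentinel are our own placeholders, not the exact setup used for Alterbute.

```python
from collections import defaultdict

import google.generativeai as genai  # assumed Gemini SDK; any VLM client would do
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

VNE_PROMPT = (
    "Name the specific, fine-grained visual identity of the main object "
    "(e.g., 'Porsche 911 Carrera'). Answer 'unknown' if you cannot tell."
)

def query_vlm(crop: Image.Image, prompt: str) -> str:
    """Send an object crop and a text prompt to the VLM, return its text answer."""
    return model.generate_content([crop, prompt]).text

def build_vne_clusters(object_crops: list[Image.Image]) -> dict[str, list[Image.Image]]:
    """Group crops by their VNE label; drop crops the VLM cannot name."""
    clusters: dict[str, list[Image.Image]] = defaultdict(list)
    for crop in object_crops:
        label = query_vlm(crop, VNE_PROMPT).strip()
        if label.lower() == "unknown":
            continue                      # unlabeled instances are filtered out
        clusters[label].append(crop)      # crops sharing a VNE label form a cluster
    return clusters
```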
For each VNE-labeled object, we additionally prompt Gemini to extract intrinsic attribute descriptions, which serve as textual prompts during training.
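Continuing the sketch, a second hypothetical prompt (again, illustrative wording only) asks the same `query_vlm` helper for an intrinsic-attribute description; the returned text becomes the training prompt for that object.

```python
ATTRIBUTE_PROMPT = (
    "Describe only the intrinsic attributes of the main object: its color, "
    "texture, material, and shape. Do not mention the background, pose, or lighting."
)

def extract_intrinsic_description(crop):
    # Reuses the hypothetical query_vlm helper from the sketch above.
    # The returned text (e.g., "glossy red metallic body with rounded fenders")
    # serves as the textual conditioning prompt for this object during training.
    return query_vlm(crop, ATTRIBUTE_PROMPT).strip()
```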
Alterbute fine-tunes a diffusion model for text-guided intrinsic attribute editing. Inputs are arranged in a 1×2 image grid: the left half contains the noisy latent of the target image, and the right half contains a reference image sampled from the same VNE cluster. The model is conditioned on this reference image, a textual prompt describing the desired intrinsic attributes, a background image, and a binary object mask (both also represented as grids). The diffusion loss is applied only to the left half, focusing learning on the edited target.
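A minimal PyTorch-style sketch of one training step under these assumptions is shown below; `vae`, `text_encoder`, `denoiser`, and `scheduler` are hypothetical placeholders for the underlying diffusion components, and the exact conditioning interface of Alterbute may differ.

```python
import torch
import torch.nn.functional as F

def training_step(target_img, reference_img, background_img, object_mask, prompt,
                  vae, text_encoder, denoiser, scheduler):
    """One Alterbute-style training step (sketch): masked diffusion loss on a 1x2 grid."""
    # Encode target and reference images; the reference comes from the same VNE cluster.
    z_target = vae.encode(target_img)
    z_ref = vae.encode(reference_img)

    # Standard noising of the target latent.
    noise = torch.randn_like(z_target)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (z_target.shape[0],), device=z_target.device)
    z_noisy = scheduler.add_noise(z_target, noise, t)

    # 1x2 grid: noisy target latent on the left, clean reference latent on the right.
    grid = torch.cat([z_noisy, z_ref], dim=-1)

    # Extrinsic context (background image and binary object mask), also laid out as
    # grids aligned with the target half; the reference half carries no context.
    bg_grid = torch.cat([vae.encode(background_img), torch.zeros_like(z_ref)], dim=-1)
    mask_grid = torch.cat([object_mask, torch.zeros_like(object_mask)], dim=-1)

    # Text prompt describing the desired intrinsic attributes.
    cond = text_encoder(prompt)

    pred = denoiser(grid, t, cond, bg_grid, mask_grid)

    # Apply the diffusion loss only to the left (target) half of the grid.
    half_width = grid.shape[-1] // 2
    return F.mse_loss(pred[..., :half_width], noise)
```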
Using the same architecture (grid omitted for clarity), Alterbute edits the input image directly by reusing its original background and mask. For color, texture, or material edits, we use precise segmentation masks (top). For shape edits where the target geometry is unknown, we use coarse bounding-box masks (bottom).
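For concreteness, a hypothetical inference wrapper (names like `alterbute.edit` are illustrative, not a released API) might select the mask and conditioning as follows:

```python
def edit_intrinsic_attributes(alterbute, image, seg_mask, bbox_mask, prompt, edit_type):
    """Sketch of inference: reuse the original background and mask so only the
    prompted intrinsic attributes change. `alterbute.edit` is a hypothetical wrapper
    around the fine-tuned model; argument names are illustrative."""
    if edit_type in ("color", "texture", "material"):
        mask = seg_mask    # precise segmentation mask: the object's geometry is kept
    else:                  # shape edit: the target geometry is unknown
        mask = bbox_mask   # coarse bounding-box mask leaves room for new geometry
    background = image * (1 - mask)      # original scene outside the object region
    return alterbute.edit(
        reference=image,                 # the input image provides the identity reference
        prompt=prompt,                   # textual description of the target intrinsic attributes
        background=background,
        mask=mask,
    )
```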
We thank Shira Bar-On for creating the figures and visualizations. We also thank Tomer Golany, Dani Lischinski, Asaf Shul, Shmuel Peleg, Bar Cavia, and Nadav Magar for their valuable feedback and discussions. Tal Reiss is supported by the Google PhD Fellowship.
We thank the owners of the images on this site (link for attributions) for sharing their valuable assets.