We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. It allows changing the color, texture, material, and even shape of an object while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity, or use overly restrictive supervision that prevents meaningful intrinsic variation. Our method relies on two components: (i) a relaxed training objective that lets the model change both intrinsic and extrinsic attributes, conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context; at inference, we restrict extrinsic changes by reusing the original background and object mask, ensuring that only the desired intrinsic attributes are altered; and (ii) Visual Named Entities (VNEs): fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
Visual Named Entities (VNEs) are fine-grained visual identity categories (e.g., "Porsche 911 Carrera", "iPhone 16 Pro") that reflect how people naturally refer to specific object types. Unlike broad categories (e.g., "car"), which are too coarse and permit excessive variation that conflicts with our intuitive sense of identity, or instance-level identifiers, which are overly restrictive and allow minimal variation, VNEs strike a practical balance. Specifically, VNEs group visually similar objects sharing a common semantic label, permitting variations in intrinsic and extrinsic attributes while preserving identity.
We use Gemini to assign textual VNE labels to objects detected in Open Images. Objects sharing the same VNE label (e.g., "Porsche 911 Carrera") are grouped into VNE clusters, while unlabeled instances are filtered out. Example VNE clusters are shown on the right.
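As a rough illustration of this labeling step, the sketch below assumes a hypothetical `query_vlm` helper built on the `google-generativeai` SDK; the prompt wording, model name, and "unknown" sentinel are our own placeholders, not the exact setup used for Alterbute.

```python
from collections import defaultdict

import google.generativeai as genai  # assumed Gemini SDK; any VLM client would do
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

VNE_PROMPT = (
    "Name the specific, fine-grained visual identity of the main object "
    "(e.g., 'Porsche 911 Carrera'). Answer 'unknown' if you cannot tell."
)

def query_vlm(crop: Image.Image, prompt: str) -> str:
    """Send an object crop and a text prompt to the VLM, return its text answer."""
    return model.generate_content([crop, prompt]).text

def build_vne_clusters(object_crops: list[Image.Image]) -> dict[str, list[Image.Image]]:
    """Group crops by their VNE label; drop crops the VLM cannot name."""
    clusters: dict[str, list[Image.Image]] = defaultdict(list)
    for crop in object_crops:
        label = query_vlm(crop, VNE_PROMPT).strip()
        if label.lower() == "unknown":
            continue                      # unlabeled instances are filtered out
        clusters[label].append(crop)      # crops sharing a VNE label form a cluster
    return clusters
```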
For each VNE-labeled object, we additionally prompt Gemini to extract intrinsic attribute descriptions, which serve as textual prompts during training.
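Continuing the sketch, a second hypothetical prompt (again, illustrative wording only) asks the same `query_vlm` helper for an intrinsic-attribute description; the returned text becomes the training prompt for that object.

```python
ATTRIBUTE_PROMPT = (
    "Describe only the intrinsic attributes of the main object: its color, "
    "texture, material, and shape. Do not mention the background, pose, or lighting."
)

def extract_intrinsic_description(crop):
    # Reuses the hypothetical query_vlm helper from the sketch above.
    # The returned text (e.g., "glossy red metallic body with rounded fenders")
    # serves as the textual conditioning prompt for this object during training.
    return query_vlm(crop, ATTRIBUTE_PROMPT).strip()
```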
Alterbute fine-tunes a diffusion model for text-guided intrinsic attribute editing. Inputs are arranged in a 1×2 image grid: the left half contains the noisy latent of the target image, and the right half contains a reference image sampled from the same VNE cluster. The model is conditioned on this reference image, a textual prompt describing the desired intrinsic attributes, a background image, and a binary object mask (both also represented as grids). The diffusion loss is applied only to the left half, focusing learning on the edited target.
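A minimal PyTorch-style sketch of one training step under these assumptions is shown below; `vae`, `text_encoder`, `denoiser`, and `scheduler` are hypothetical placeholders for the underlying diffusion components, and the exact conditioning interface of Alterbute may differ.

```python
import torch
import torch.nn.functional as F

def training_step(target_img, reference_img, background_img, object_mask, prompt,
                  vae, text_encoder, denoiser, scheduler):
    """One Alterbute-style training step (sketch): masked diffusion loss on a 1x2 grid."""
    # Encode target and reference images; the reference comes from the same VNE cluster.
    z_target = vae.encode(target_img)
    z_ref = vae.encode(reference_img)

    # Standard noising of the target latent.
    noise = torch.randn_like(z_target)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (z_target.shape[0],), device=z_target.device)
    z_noisy = scheduler.add_noise(z_target, noise, t)

    # 1x2 grid: noisy target latent on the left, clean reference latent on the right.
    grid = torch.cat([z_noisy, z_ref], dim=-1)

    # Extrinsic context (background image and binary object mask), also laid out as
    # grids aligned with the target half; the reference half carries no context.
    bg_grid = torch.cat([vae.encode(background_img), torch.zeros_like(z_ref)], dim=-1)
    mask_grid = torch.cat([object_mask, torch.zeros_like(object_mask)], dim=-1)

    # Text prompt describing the desired intrinsic attributes.
    cond = text_encoder(prompt)

    pred = denoiser(grid, t, cond, bg_grid, mask_grid)

    # Apply the diffusion loss only to the left (target) half of the grid.
    half_width = grid.shape[-1] // 2
    return F.mse_loss(pred[..., :half_width], noise)
```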
Using the same architecture (grid omitted for clarity), Alterbute edits the input image directly by reusing its original background and mask. For color, texture, or material edits, we use precise segmentation masks (top). For shape edits where the target geometry is unknown, we use coarse bounding-box masks (bottom).
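For concreteness, a hypothetical inference wrapper (names like `alterbute.edit` are illustrative, not a released API) might select the mask and conditioning as follows:

```python
def edit_intrinsic_attributes(alterbute, image, seg_mask, bbox_mask, prompt, edit_type):
    """Sketch of inference: reuse the original background and mask so only the
    prompted intrinsic attributes change. `alterbute.edit` is a hypothetical wrapper
    around the fine-tuned model; argument names are illustrative."""
    if edit_type in ("color", "texture", "material"):
        mask = seg_mask    # precise segmentation mask: the object's geometry is kept
    else:                  # shape edit: the target geometry is unknown
        mask = bbox_mask   # coarse bounding-box mask leaves room for new geometry
    background = image * (1 - mask)      # original scene outside the object region
    return alterbute.edit(
        reference=image,                 # the input image provides the identity reference
        prompt=prompt,                   # textual description of the target intrinsic attributes
        background=background,
        mask=mask,
    )
```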
We thank Shira Bar-On for creating the figures and visualizations. We also thank Tomer Golany, Dani Lischinski, Asaf Shul, Shmuel Peleg, Bar Cavia, and Nadav Magar for their valuable feedback and discussions. Tal Reiss is supported by the Google PhD Fellowship.
We thank the owners of the images on this site (link for attributions) for sharing their valuable assets.