CareCom: Generative Image Composition with Calibrated Reference Features

Jiaxuan Chen¹, Bo Zhang¹, Qingdong He², Jinlong Peng², Li Niu¹,³

¹ Shanghai Jiao Tong University ² Youtu Lab, Tencent ³ miguo.ai

Accepted by AAAI 2026

Abstract

Image composition aims to seamlessly insert a foreground object into a background image. Despite substantial progress in generative image composition, existing methods still struggle to preserve foreground details while also adjusting the foreground pose/view. To address this issue, we extend an existing generative composition model to a multi-reference version, which accepts an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of the foreground reference images so that they are compatible with the background information. The calibrated reference features supplement the original reference features with useful global and local information of the proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model benefits greatly from the calibrated reference features.

The necessity of using multiple reference images

Multiple reference images can provide complementary information, whereas a single reference image may lack information about other viewpoints or occluded parts. The following example shows composition results using 1 to 5 reference images. With only one reference image, which captures the left side of the bus, the generated bus has no door, because the model cannot infer that a bus without a door on its left side should have one on its right side. With more reference images, the generated bus has a door, because the model can draw on both the left and right sides of the bus.

How to get matched reference images?

A matched reference image is a foreground image that is compatible with the background in terms of illumination, geometry, and semantics.

(a) If a large pool of reference images is available, we can use Foreground Object Search (FOS) to retrieve a suitable reference image from the pool, given the background image with the foreground bounding box.

(b) Alternatively, we can hallucinate matched reference images. For efficiency, we hallucinate reference features (the features of reference images) rather than the reference images themselves. Given the background image with the foreground bounding box and the existing reference features, we generate matched reference features, that is, the features that matched reference images would have.
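Option (a) amounts to ranking a pool of candidates by how well each one fits the background. The sketch below illustrates only that ranking step. It assumes that the background (with its bounding box) and each candidate foreground have already been embedded into a shared vector space, and it uses cosine similarity as a stand-in compatibility score. The function names, file names, and vectors are hypothetical; this is not the actual FOS model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_references(query_embedding, pool, top_k=1):
    """Return the names of the top_k pool entries most similar to the query.

    query_embedding: hypothetical embedding of the background plus bounding box.
    pool: dict mapping a reference image name to its hypothetical embedding.
    """
    ranked = sorted(
        pool.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Toy pool of foreground references (made-up names and vectors).
pool = {
    "bus_left.jpg": [0.9, 0.1, 0.0],
    "bus_right.jpg": [0.1, 0.9, 0.2],
    "bus_front.jpg": [0.5, 0.5, 0.7],
}
query = [0.2, 1.0, 0.1]

best = retrieve_references(query, pool, top_k=2)
print(best)
```

Retrieval of this kind can only return what the pool already contains, which is why option (b) generates matched features instead when no suitable reference exists.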

Method

Illustration of our method: (a) Given multiple foreground reference images, we extract their global/local features, which are passed through the calibration module. The calibrated features are injected into the decoder of the denoising UNet. (b) Finding the spatial correspondence of local patches between the foreground reference and the ground-truth foreground. (c) The structure of the calibration module.
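The data flow in (a) can be sketched in a few lines. This is an illustration under stated assumptions only. The structure of the calibration module is not described in the text here, so `calibrate` below is a hypothetical stand-in: a single-head cross-attention without learned projections, in which each reference token attends to background tokens and adds the result back as a residual. Appending the calibrated tokens to the original ones is likewise an assumed way of letting calibrated features supplement the originals. All names and toy vectors are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def calibrate(reference_tokens, background_tokens):
    """Hypothetical calibration: each reference token attends to the background
    tokens and adds the attended context back to itself (residual)."""
    dim = len(reference_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    calibrated = []
    for query in reference_tokens:
        weights = softmax([dot(query, key) * scale for key in background_tokens])
        context = [
            sum(w * value[i] for w, value in zip(weights, background_tokens))
            for i in range(dim)
        ]
        calibrated.append([q + c for q, c in zip(query, context)])
    return calibrated

def build_conditioning(reference_sets, background_tokens):
    """Flatten tokens from any number of references, then append calibrated copies."""
    original = [token for reference in reference_sets for token in reference]
    return original + calibrate(original, background_tokens)

# Toy input: two references, each with one global token followed by two local tokens.
reference_sets = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]],
]
background_tokens = [[0.2, 0.4], [0.6, 0.0]]

conditioning = build_conditioning(reference_sets, background_tokens)
print(len(conditioning))
```

Because the tokens are flattened into one sequence, the same code path handles one reference or many. In a real model this sequence would be consumed by the cross-attention layers of the denoising UNet decoder.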

Results

More comparison results between our method and the baselines. From left to right, we show the background image with the foreground bounding box, the foreground reference images, and the composition results of Anydoor, ControlCom, ObjectStitch, InsertAnything, Unicombine, and our method.

Previous work

CareCom builds on our previous work MureObjectStitch, which is an extension of ObjectStitch. MureObjectStitch is simple and effective, and it performs more stably than CareCom. For practical use and comparison, we recommend MureObjectStitch.

BibTeX

@inproceedings{carecom2026,
  title={CareCom: Generative Image Composition with Calibrated Reference Features},
  author={Chen, Jiaxuan and Zhang, Bo and He, Qingdong and Peng, Jinlong and Niu, Li},
  booktitle={AAAI},
  year={2026}
}