CareCom: Generative Image Composition with Calibrated Reference Features


Image composition aims to seamlessly insert a foreground object into a background. Despite substantial progress in generative image composition, existing methods still struggle to preserve foreground details while also adjusting the foreground pose/view. To address this issue, we extend an existing generative composition model to a multi-reference version, which accepts an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of the foreground reference images to make them compatible with the background information. The calibrated reference features supplement the original reference features with useful global and local information of the proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model benefits greatly from the calibrated reference features.
Multiple reference images provide supplementary information that a single reference image may lack, such as other viewpoints or occluded parts. In the following example, we show the composition results using 1 to 5 reference images. With only one reference image, which captures the left side of the bus, the generated bus has no door, because the model cannot infer that a bus without a door on its left side should have one on its right side. With more reference images, the generated bus has a door, because the model draws on both the left and right sides of the bus.
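To make the multi-reference idea concrete, here is a minimal sketch, not the CareCom implementation. It assumes that tokens from all reference images are concatenated into one sequence and read through scaled dot-product attention; this page does not specify the conditioning mechanism. The names `concat_reference_tokens` and `attend`, and the toy feature values, are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def concat_reference_tokens(refs: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Flatten per-image token lists into one list (any number of references)."""
    return [token for image_tokens in refs for token in image_tokens]


def attend(query: Vector, tokens: Sequence[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over all tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy features: only the right-side view carries a "door-like" token.
left_view = [[1.0, 0.0], [0.9, 0.1]]
right_view = [[0.0, 1.0]]

tokens = concat_reference_tokens([left_view, right_view])
weights = attend([0.0, 4.0], tokens)  # a query that looks for the door
```

In this toy setting the query places most of its weight on the token from the right-side view, which illustrates how an extra reference can supply content that the first reference lacks.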
A matched reference image is a foreground image that is compatible with the background in terms of illumination, geometry, and semantics.
Illustration of our method: (a) Given multiple foreground reference images, we extract their global and local features, which are passed through the calibration module. The calibrated features are injected into the decoder of the denoising UNet. (b) Finding the spatial correspondence of local patches between the foreground reference and the ground-truth foreground. (c) The structure of the calibration module.
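The patch correspondence in (b) can be sketched as a nearest-neighbour search over patch features. This is a minimal illustration under stated assumptions: the page does not say how correspondences are computed, so cosine similarity is an assumed matching criterion, and `cosine` and `match_patches` are hypothetical helper names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_patches(ref_patches: Sequence[Vector], gt_patches: Sequence[Vector]) -> List[int]:
    """For each ground-truth patch, return the index of the most similar reference patch."""
    return [
        max(range(len(ref_patches)), key=lambda j: cosine(gt, ref_patches[j]))
        for gt in gt_patches
    ]


# Hypothetical 2-D patch features, chosen only for illustration.
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gt = [[0.1, 2.0], [3.0, 2.9]]

matches = match_patches(ref, gt)  # -> [1, 2]
```

Each ground-truth patch is paired with exactly one reference patch, while a reference patch may be matched more than once or not at all.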
More comparison results between our method and the baselines. From left to right: the background image with the foreground bounding box, the foreground reference images, and the composition results of Anydoor, ControlCom, ObjectStitch, InsertAnything, Unicombine, and our method.
CareCom is built upon our previous work MureObjectStitch, which is an extension of ObjectStitch. MureObjectStitch is simple and effective, and in practice it performs more stably than CareCom. For practical use and comparison, we recommend MureObjectStitch.
@inproceedings{carecom2026,
title={CareCom: Generative Image Composition with Calibrated Reference Features},
author={Chen, Jiaxuan and Zhang, Bo and He, Qingdong and Peng, Jinlong and Niu, Li},
booktitle={AAAI},
year={2026}
}