SAMControl: Controlling Pose and Object for Image Editing with Soft Attention Mask

Zhang, Y; Wang, C; Fang, F; Zhuge, Y; Fan, H; Chang, X; Deng, C; Yang, Y

Publisher: Association for Computing Machinery (ACM)
Publication Type: Journal Article
Citation: ACM Transactions on Multimedia Computing Communications and Applications

Copyright clearance status: In Progress

Filename	Description	Size
SAMControl.pdf	Accepted version	37.93 MB

This item is being processed and is not currently available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Y
dc.contributor.author	Wang, C https://orcid.org/0000-0003-1297-768X
dc.contributor.author	Fang, F
dc.contributor.author	Zhuge, Y
dc.contributor.author	Fan, H
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Deng, C
dc.contributor.author	Yang, Y
dc.date.accessioned	2025-03-02T23:51:34Z
dc.date.available	2025-03-02T23:51:34Z
dc.identifier.citation	ACM Transactions on Multimedia Computing Communications and Applications
dc.identifier.issn	1551-6857
dc.identifier.issn	1551-6865
dc.identifier.uri	http://hdl.handle.net/10453/185478
dc.description.abstract	<jats:p>To achieve content-consistent results in text-conditioned image editing, existing methods typically employ a <jats:bold>reconstruction branch</jats:bold> to capture the source image details via diffusion inversion and a <jats:bold>generation branch</jats:bold> to synthesize the target image based on the given textual prompt and the masked source image details. However, accurately segmenting source details is challenging with the current fixed-threshold mask strategy. Additionally, the inadequacies in the inversion process can lead to insufficient retention of source details. In this paper, we propose a method called SAMControl (<jats:bold>S</jats:bold>oft <jats:bold>A</jats:bold>ttention <jats:bold>M</jats:bold>ask) to adaptively control the pose and object details for image editing. SAMControl dynamically learns flexible attention masks for different images at various diffusion steps. Furthermore, in the reconstruction branch, we utilize a direct inversion technique to ensure the fidelity of source details within SAM. Extensive qualitative and quantitative results demonstrate the effectiveness of the proposed method.</jats:p>
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.ispartof	ACM Transactions on Multimedia Computing Communications and Applications
dc.relation.isbasedon	10.1145/3702999
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.subject	0803 Computer Software, 0805 Distributed Computing, 0806 Information Systems
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4606 Distributed computing and systems software
dc.subject.classification	4607 Graphics, augmented reality and games
dc.title	SAMControl: Controlling Pose and Object for Image Editing with Soft Attention Mask
dc.type	Journal Article
utslib.for	0803 Computer Software
utslib.for	0805 Distributed Computing
utslib.for	0806 Information Systems
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	University of Technology Sydney/UTS Groups
pubs.organisational-group	University of Technology Sydney/UTS Groups/Australian Artificial Intelligence Institute (AAII)
utslib.copyright.status	in_progress	*
dc.date.updated	2025-03-02T23:51:30Z
pubs.publication-status	Published online

Abstract:

To achieve content-consistent results in text-conditioned image editing, existing methods typically employ a reconstruction branch to capture the source image details via diffusion inversion and a generation branch to synthesize the target image based on the given textual prompt and the masked source image details. However, accurately segmenting source details is challenging with the current fixed-threshold mask strategy. Additionally, the inadequacies in the inversion process can lead to insufficient retention of source details. In this paper, we propose a method called SAMControl (Soft Attention Mask) to adaptively control the pose and object details for image editing. SAMControl dynamically learns flexible attention masks for different images at various diffusion steps. Furthermore, in the reconstruction branch, we utilize a direct inversion technique to ensure the fidelity of source details within SAM. Extensive qualitative and quantitative results demonstrate the effectiveness of the proposed method.
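
Illustrative sketch (not the authors' implementation): the abstract contrasts a fixed-threshold mask with a soft attention mask. The toy example below shows only that general contrast when blending source details into a target. The function names, the sigmoid form, and the `center` and `sharpness` parameters are assumptions made for illustration; in SAMControl the mask is described as learned per image and per diffusion step, which this sketch does not attempt to reproduce.

```python
import math

def hard_mask(attn, threshold=0.5):
    """Fixed-threshold mask: 1.0 where attention exceeds the threshold, else 0.0."""
    return [1.0 if a > threshold else 0.0 for a in attn]

def soft_mask(attn, center=0.5, sharpness=10.0):
    """Soft mask (assumed sigmoid form): weights strictly between 0 and 1.

    `center` and `sharpness` are hypothetical stand-ins for values that a
    learned mask would adapt per image and per diffusion step.
    """
    return [1.0 / (1.0 + math.exp(-sharpness * (a - center))) for a in attn]

def blend(source, target, mask):
    """Per-position mix: mask=1 keeps the source value, mask=0 keeps the target value."""
    return [m * s + (1.0 - m) * t for s, t, m in zip(source, target, mask)]

# Toy attention scores; two of them sit close to the 0.5 threshold.
attn = [0.05, 0.45, 0.55, 0.95]
source = [1.0, 1.0, 1.0, 1.0]   # stands in for source-image details
target = [0.0, 0.0, 0.0, 0.0]   # stands in for generated target features

hard = blend(source, target, hard_mask(attn))
soft = blend(source, target, soft_mask(attn))

print("hard:", hard)
print("soft:", [round(v, 3) for v in soft])
```

With the hard mask, the two scores near the threshold (0.45 and 0.55) are sent to opposite extremes, whereas the soft mask gives them intermediate weights, so a small change in attention produces a small change in how much source detail is retained.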

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/185478