Journal of Textile Research ›› 2025, Vol. 46 ›› Issue (03): 167-176. doi: 10.13475/j.fzxb.20240503801

• Apparel Engineering •

High-precision 3-D virtual try-on model based on cross-attention multi-view generation and diffusion

YU Haoran, WANG Ping, WANG Hao, DING Dong

  1. College of Information Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2024-05-16 Revised: 2024-11-19 Online: 2025-03-15 Published: 2025-04-16
  • Contact: WANG Ping, E-mail: pingwang@dhu.edu.cn

Abstract:

Objective In the era of 5G Internet communication, virtual try-on technology not only enriches the online fashion consumption experience but also drives the advancement of digital fashion. However, existing generative 3-D virtual try-on models often lack adequate depth features, leading to issues such as stereo distortion and reduced accuracy. In this study, a high-precision 3-D virtual try-on model, MV-3DVTON, was introduced based on a generative diffusion architecture.

Method A cross-attention mechanism was integrated into the generative diffusion architecture to model the semantic relationship between two information sequences: the features of the front-view try-on image and the pose features of the target human body in the back view. This integration was aimed at generating multi-view fitting images. A multi-view depth encoder-decoder network was then employed to extract high-precision depth information from the different perspectives. Finally, the generated multi-view information was fused into a 3-D colored point cloud, which, through Poisson reconstruction, enabled a 360° viewable, lifelike 3-D virtual try-on system.
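
To make the conditioning step concrete, the following is a minimal PyTorch sketch of a cross-attention block as it might sit inside one denoising step: queries come from the noisy front-view try-on features, while keys and values come from the target-view (back) pose features. All names, tensor shapes, and the surrounding network are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a cross-attention conditioning block inside a denoising step.
    # Shapes, names and the surrounding U-Net are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        def __init__(self, d_model: int = 256, n_heads: int = 8):
            super().__init__()
            self.norm_q = nn.LayerNorm(d_model)
            self.norm_kv = nn.LayerNorm(d_model)
            # Queries come from the noisy try-on features, keys/values from the
            # target-view pose features, relating the two information sequences.
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            )

        def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # x:    (B, N, d_model) noisy front-view try-on feature tokens
            # cond: (B, M, d_model) target-view (e.g. back) pose feature tokens
            attn_out, _ = self.attn(self.norm_q(x), self.norm_kv(cond), self.norm_kv(cond))
            x = x + attn_out          # residual cross-attention
            x = x + self.ff(x)        # residual feed-forward
            return x

    # Usage: inside each denoising step of the diffusion network, the block injects
    # target-view pose information into the front-view garment features.
    block = CrossAttentionBlock()
    x = torch.randn(2, 1024, 256)     # flattened front-view feature map (assumed size)
    cond = torch.randn(2, 18, 256)    # e.g. 18 pose-keypoint embeddings (assumed)
    out = block(x, cond)              # (2, 1024, 256)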

Results In comparison with CASD and ADGAN, MV-3DVTON exhibited superiority across all metrics. Notably, MV-3DVTON achieved a 22.96% reduction in LPIPS (learned perceptual image patch similarity) and a 12.08% decrease in FID (Fréchet inception distance) compared with the CASD model. These decreases signify that the generated images align better with human perception, resulting in a more realistic image effect with higher distribution similarity. To visually demonstrate the model's generalization across diverse application scenarios, comparison results of multi-view image generation were presented for different genders across four clothing virtual fitting scenarios: solid-color, striped, checkered, and patterned garments. The images generated by MV-3DVTON exhibited increased similarity with real images in the various test scenarios, indicating enhanced generalization capability and the ability to produce more realistic multi-view fitting images. Contrasted with M3D-VTON, MV-3DVTON demonstrated superior performance across all error indices. Particularly noteworthy are the decreases in the AbsRel (absolute relative error) index by approximately 12.21% and in the RMSE (root mean square error) index by about 5%. These decreases highlight the new model's capability to reduce errors between the estimated and real values of depth information, thereby enhancing the accuracy of depth estimation. Moreover, the novel approach leveraged the frontal and back images of the human body as inputs to the model. With the ability to predict and integrate depth information from multiple viewpoints, the new model captured finer depth details than existing methods that rely solely on mirroring the monocular frontal view. Examining the depth maps generated by MV-3DVTON and M3D-VTON from the frontal and back views, it is evident that M3D-VTON exhibits inaccuracies in delineating human body and garment boundaries, whereas the depth maps generated by the new model offer clearer boundaries of the various human body structures, with enhanced realism in details. Furthermore, the comparison of selected front, side, and back effects demonstrates MV-3DVTON's ability to generate multi-view virtual fitting effects effectively. This underscores that the predicted multi-view human body depth features are more accurate, providing a comprehensive, multi-view, high-precision 3-D virtual fitting experience.
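
For reference, the percentage reductions quoted above follow directly from the values reported in Tab. 1 and Tab. 3 below, using relative reduction = (baseline − ours) / baseline:

    # Relative reductions reported in the Results, computed from Tab.1 and Tab.3 values.
    lpips_drop  = (0.1620 - 0.1248) / 0.1620      # ≈ 0.2296 → 22.96% vs. CASD
    fid_drop    = (25.3476 - 22.2862) / 25.3476   # ≈ 0.1208 → 12.08% vs. CASD
    absrel_drop = (0.0860 - 0.0755) / 0.0860      # ≈ 0.1221 → 12.21% vs. M3D-VTON
    rmse_drop   = (0.1987 - 0.1894) / 0.1987      # ≈ 0.0468 → about 5% vs. M3D-VTON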

Conclusion A thorough qualitative and quantitative evaluation of the MV-3DVTON method was conducted. The statistical metrics on the test dataset demonstrate that the new model, with its cross-attention multi-view generative diffusion scheme, achieves high-precision generation, accurate depth prediction, and effective fusion of multi-view depth feature information within a point cloud framework. It effectively addresses the inadequate depth information and low accuracy in 3-D feature extraction encountered in conventional monocular mirroring methods, and offers a superior multi-view 3-D virtual fitting experience that accommodates various human body postures and clothing textures. The model provides user-friendly interaction and holds promising prospects for applications in the digital fashion domain.
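
As an illustration of the point-cloud stage described above, the sketch below back-projects predicted front- and back-view RGB-D outputs into a single colored point cloud and runs Poisson surface reconstruction with Open3D. The pinhole intrinsics, placeholder inputs, and the simple mirroring of the back view are assumptions for illustration only, not the authors' pipeline.

    # Illustrative fusion of front/back RGB-D predictions into a colored point cloud,
    # followed by Poisson surface reconstruction (Open3D). All inputs are placeholders.
    import numpy as np
    import open3d as o3d

    def backproject(depth, rgb, fx, fy, cx, cy, flip_z=False):
        """Pinhole back-projection of a depth map into 3-D points with colors."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.astype(np.float32)
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        if flip_z:                       # crude mirroring of the back view (assumption)
            z = -z
        pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        cols = rgb.reshape(-1, 3) / 255.0
        valid = z.reshape(-1) != 0       # drop background pixels with no depth
        return pts[valid], cols[valid]

    # Placeholder predictions standing in for the model's depth maps and try-on images.
    front_depth = np.random.rand(256, 192).astype(np.float32) + 0.5
    front_rgb = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
    back_depth, back_rgb = front_depth.copy(), front_rgb.copy()

    front_pts, front_cols = backproject(front_depth, front_rgb, 500, 500, 96, 128)
    back_pts, back_cols = backproject(back_depth, back_rgb, 500, 500, 96, 128, flip_z=True)

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.vstack([front_pts, back_pts]))
    pcd.colors = o3d.utility.Vector3dVector(np.vstack([front_cols, back_cols]))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))

    # Poisson reconstruction yields a watertight, 360°-viewable colored mesh.
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)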

Key words: 3-D virtual try-on, generation diffusion, cross-attention, multi-view generation, point cloud reconstruction

CLC Number: TS942.8

Fig.1

Overall architecture of MV-3DVTON model

Fig.2

Multi-view generative diffusion model with cross-attention layers. (a) Architecture of the generation and diffusion processes; (b) Denoising neural network unit with cross-attention

Fig.3

Depth information prediction network

Fig.4

Comparison of multi-view fitting image generation. (a) Front-view source image of input model; (b) Human posture characteristics of target viewpoint; (c) Sample real image of target viewpoint; (d) Target viewpoint image generation results by ADGAN model; (e) Target viewpoint image generation results by CASD model; (f) Target viewpoint image generation results by the proposed MV-3DVTON model

Tab.1

Evaluation metrics for virtual try-on images

Method                 LPIPS    FID       SSIM     PSNR
ADGAN                  0.2303   29.2968   0.6318   15.5631
CASD                   0.1620   25.3476   0.6957   17.0876
MV-3DVTON (proposed)   0.1248   22.2862   0.7329   17.6507
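
For readers reproducing Tab. 1, the sketch below computes the per-image metrics with widely used open-source tools: SSIM and PSNR via scikit-image (higher is better) and LPIPS via the lpips package (lower is better). FID is a distribution-level statistic computed over the whole test set with an Inception network (e.g. the pytorch-fid package), so it is not shown here; the placeholder images and preprocessing are assumptions.

    # Sketch of the per-image metrics in Tab.1; inputs are placeholder images.
    import numpy as np
    import torch
    import lpips
    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    def image_metrics(generated, real, lpips_fn):
        """generated, real: uint8 RGB images of identical shape (H, W, 3)."""
        ssim = structural_similarity(generated, real, channel_axis=-1, data_range=255)
        psnr = peak_signal_noise_ratio(real, generated, data_range=255)
        # LPIPS expects NCHW float tensors scaled to [-1, 1]
        to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
        lp = lpips_fn(to_tensor(generated), to_tensor(real)).item()
        return {"SSIM": ssim, "PSNR": psnr, "LPIPS": lp}

    lpips_fn = lpips.LPIPS(net="alex")   # AlexNet-based perceptual distance
    gen = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)   # placeholder output
    ref = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)   # placeholder ground truth
    print(image_metrics(gen, ref, lpips_fn))                         # smoke test only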

Tab.2

Effects of cross-attention on image metrics

Method                    LPIPS    FID       SSIM     PSNR
Without cross-attention   0.1475   34.3922   0.6743   16.0345
MV-3DVTON (proposed)      0.1248   22.2862   0.7329   17.6507

Fig.5

Results of ablation experiments with multi-view virtual fitting. (a) Results of female multi-view try-on ablation trial; (b) Results of male multi-view try-on ablation trial

Tab.3

Error evaluation of 3-D depth information

Method                 AbsRel   SqRel    RMSE
M3D-VTON               0.0860   0.3243   0.1987
MV-3DVTON (proposed)   0.0755   0.3117   0.1894
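
The error measures in Tab. 3 follow the standard monocular depth-estimation definitions (cf. Eigen et al. [29]): absolute relative error (AbsRel), squared relative error (SqRel), and root mean square error (RMSE), averaged over valid pixels; lower is better for all three. A minimal sketch of how they are typically computed is given below; the masking of invalid pixels is an assumption, since the paper's exact evaluation protocol is not reproduced here.

    # Sketch of the depth-error metrics in Tab.3 from predicted and ground-truth depth maps.
    import numpy as np

    def depth_errors(pred: np.ndarray, gt: np.ndarray) -> dict:
        mask = gt > 0                        # evaluate only where ground truth exists (assumed)
        pred, gt = pred[mask], gt[mask]
        abs_rel = np.mean(np.abs(pred - gt) / gt)      # absolute relative error
        sq_rel = np.mean((pred - gt) ** 2 / gt)        # squared relative error
        rmse = np.sqrt(np.mean((pred - gt) ** 2))      # root mean square error
        return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse}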

Fig.6

Comparison of depth maps in multi-view virtual try-on

Fig.7

Comparison of 3-D virtual try-on effects. (a) Front view; (b) Side view; (c) Back view

[1] ZHENG Xiaohu, LIU Zhenghao, CHEN Feng, et al. Current status and prospect of intelligent development in textile industry[J]. Journal of Textile Research, 2023, 44(8): 205-216.
[2] LEI Ge, LI Xiaohui. Review of digital pattern-making technology in garment production[J]. Journal of Textile Research, 2022, 43(4): 203-209.
[3] UMETANI N, KAUFMAN D M, IGARASHI T, et al. Sensitive couture for interactive garment modeling and editing[J]. ACM Transactions on Graphics, 2011, 30(4): 1-12.
[4] BARTLE A, SHEFFER A, KIM V G, et al. Physics-driven pattern adjustment for direct 3D garment editing[J]. ACM Transactions on Graphics, 2016, 35(4): 1-11.
[5] WAN Yan, CHEN Linxia. Research on virtual fitting technology based on Kinect[J]. Intelligent Computer and Application, 2018, 8(4): 63-68.
[6] LI Bowen, WANG Ping, LIU Yuye. 3-D virtual try-on technique based on dynamic feature of body postures[J]. Journal of Textile Research, 2021, 42(9): 144-149.
[7] ZHAO F, XIE Z, KAMPFFMEYER M, et al. M3D-VTON: a monocular-to-3D virtual try-on network[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 13239-13249.
[8] YUAN Tiantian, WANG Xin, LUO Weihao, et al. Three-dimensional virtual try-on network based on attention mechanism and vision transformer[J]. Journal of Textile Research, 2023, 44(7): 192-198.
[9] HAN X, WU Z, WU Z, et al. VITON: an image-based virtual try-on network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7543-7552.
[10] WANG B, ZHENG H, LIANG X, et al. Toward characteristic-preserving image-based virtual try-on network[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer, 2018: 589-604.
[11] RAFFIEE A, SOLLAMI M. GarmentGAN: photo-realistic adversarial fashion transfer[C]// Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR). Piscataway: IEEE, 2021: 3923-3930.
[12] LIU Yuye, WANG Ping. High-precision intelligent algorithm for virtual fitting based on texture feature learning[J]. Journal of Textile Research, 2023, 44(5): 177-183.
[13] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// Proceedings of the 34th Conference on Neural Information Processing Systems. San Diego: NIPS, 2020: 6840-6851.
[14] WANG B, ZHENG H, LIANG X, et al. TryOnDiffusion: a tale of two UNets[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 4606-4615.
[15] REN Y, YU X, CHEN J, et al. Deep image spatial transformation for person image generation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 7690-7699.
[16] ROY P, BHATTACHARYA S, GHOSH S, et al. Multi-scale attention guided pose transfer[J]. Pattern Recognition, 2023. DOI: 10.1016/j.patcog.2023.109315.
[17] LU Y, GU B, OUYANG W, et al. LSG-GAN: latent space guided generative adversarial network for person pose transfer[J]. Knowledge-Based Systems, 2023. DOI: 10.1016/j.knosys.2023.110852.
[18] KHATUN A, DENMAN S, SRIDHARAN S, et al. Pose-driven attention-guided image generation for person re-identification[J]. Pattern Recognition, 2023. DOI: 10.1016/j.patcog.2022.109246.
[19] BHUNIA A K, KHAN S, CHOLAKKAL H, et al. Person image synthesis via denoising diffusion model[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Piscataway: IEEE, 2023: 5968-5976.
[20] ZHU Z, HUANG T, SHI B, et al. Progressive pose attention transfer for person image generation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 2347-2356.
[21] TANG H, BAI S, ZHANG L, et al. XingGAN for person image generation[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer, 2020: 717-734.
[22] ZHANG P, YANG L, LAI J H, et al. Exploring dual-task correlation for pose guided person image generation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 7713-7722.
[23] PU G, MEN Y, MAO Y, et al. Controllable person image synthesis with attribute-decomposed GAN[C]// Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence. Piscataway: IEEE, 2022: 1514-1532.
[24] ZHOU X, YIN M, CHEN X, et al. Cross attention based style distribution for controllable person image synthesis[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer, 2022: 161-178.
[25] KARRAS J, HOLYNSKI A, WANG T C, et al. DreamPose: fashion image-to-video synthesis via stable diffusion[C]// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2023: 22623-22633.
[26] LIU Z, LUO P, QIU S, et al. DeepFashion: powering robust clothes recognition and retrieval with rich annotations[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 1096-1104.
[27] SAITO S, SIMON T, SARAGIH J, et al. PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CA: IEEE Computer Society, 2020: 84-93.
[28] CAO Yudong, LIU Haiyan, JIA Xu, et al. Overview of image quality assessment method based on deep learning[J]. Computer Engineering and Applications, 2021, 57(23): 27-36. DOI: 10.3778/j.issn.1002-8331.2106-0228.
[29] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. San Diego: NIPS, 2014: 2366-2374.