Journal of Textile Research ›› 2023, Vol. 44 ›› Issue (07): 192-198. doi: 10.13475/j.fzxb.20220508401

• Apparel Engineering •

Three-dimensional virtual try-on network based on attention mechanism and vision transformer

YUAN Tiantian1, WANG Xin1, LUO Weihao1, MEI Chennan1, WEI Jingyan1, ZHONG Yueqi1,2

1. College of Textiles, Donghua University, Shanghai 201620, China
    2. Key Laboratory of Textile Science & Technology, Ministry of Education, Donghua University, Shanghai 201620, China
Received: 2022-05-30; Revised: 2022-11-24; Online: 2023-07-15; Published: 2023-08-10

Abstract:

Objective Three-dimensional (3-D) virtual try-on can provide an intuitive and realistic view for online shopping and has great potential commercial value. However, existing 3-D virtual try-on networks suffer from problems such as inaccurately generated 3-D human models, blurred model edges, and excessive clothing deformation during virtual fitting, which greatly limit the application of this technology in real scenarios.

Method To address the above problems, this research proposed T3D-VTON, a deep neural network introducing a convolutional attention mechanism and a vision transformer. The network was designed with three modules: 1) a feature extraction module augmented with a convolutional block attention module, which makes the network focus on key information and reduces the influence of irrelevant information; 2) a depth estimation network with an encoder-decoder structure, built as a multi-scale network combining ResNet and a transformer; 3) a feature fusion module that fuses 2-D and 3-D information to obtain the final 3-D virtual fitting model. The effect of the convolutional attention mechanism and the vision transformer module on network performance was investigated in detail. Performance was evaluated mainly in terms of the virtual fitting results and the accuracy of the generated human body model, and qualitative and quantitative comparisons were conducted against the baseline network M3D-VTON.
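As a concrete illustration of module 1), the following is a minimal PyTorch sketch of a convolutional block attention module in the spirit of CBAM [15]: channel attention followed by spatial attention. The reduction ratio, kernel size, and feature dimensions are illustrative assumptions, not the exact T3D-VTON configuration.

    # Minimal CBAM sketch (channel attention, then spatial attention).
    # Hyperparameters below are illustrative, not the paper's settings.
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze spatial dims with avg/max pooling, score channels
        with a shared MLP, then gate the feature map."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            avg = self.mlp(x.mean(dim=(2, 3)))   # (B, C)
            mx = self.mlp(x.amax(dim=(2, 3)))    # (B, C)
            scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
            return x * scale

    class SpatialAttention(nn.Module):
        """Pool across channels, then a 7x7 conv yields a per-pixel gate."""
        def __init__(self, kernel_size: int = 7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W)
            mx = x.amax(dim=1, keepdim=True)     # (B, 1, H, W)
            gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * gate

    class CBAM(nn.Module):
        """Channel attention followed by spatial attention (cf. Fig. 2)."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.channel = ChannelAttention(channels, reduction)
            self.spatial = SpatialAttention()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.spatial(self.channel(x))

    # Example: refine a 64-channel feature map from the extractor.
    feat = torch.randn(1, 64, 128, 96)
    refined = CBAM(64)(feat)   # same shape, attention-weighted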

Results The quantitative results showed that the structural similarity index measure (SSIM) was improved by 0.015 7 over the baseline network, and the peak signal-to-noise ratio (PSNR) was improved by 0.113 2 dB, indicating that image generation quality was improved without much loss of information. In terms of human model generation accuracy, the absolute relative error of depth estimation was reduced by 0.037 and the square relative error by 0.014 compared with the baseline, indicating that the 3-D human model generated by this network was more accurate and that the estimated depth maps were more consistent with the given ground truth. The qualitative results showed that the deformed garment fitted the corresponding region of the target body more closely, without excessive deformation and with fewer garment artifacts. When dealing with complex textures, the network better preserved the pattern and material of the garment fabric. Front and side views of the generated 3-D try-on model showed clearer contour edges and effective elimination of adhesion between the arms and the abdomen; when the knees were close together, for example, the network was also able to eliminate adhesion between the knees.
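For reference, the metrics reported above can be computed as in the following sketch (NumPy, with scikit-image for SSIM). PSNR, absolute relative error (AbsRel), and square relative error (SqRel) follow their standard definitions; the paper's exact resolution, masking, and normalization choices are not reproduced here, and the stand-in data are hypothetical.

    # Sketch of the evaluation metrics under standard definitions.
    import numpy as np
    from skimage.metrics import structural_similarity

    def psnr(pred: np.ndarray, gt: np.ndarray, data_range: float = 255.0) -> float:
        """Peak signal-to-noise ratio in dB."""
        mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")
        return 10.0 * np.log10(data_range ** 2 / mse)

    def depth_errors(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
        """AbsRel = mean(|d - d*| / d*), SqRel = mean((d - d*)^2 / d*),
        computed over valid (positive) ground-truth depths."""
        valid = gt > 0
        d, dstar = pred[valid], gt[valid]
        abs_rel = float(np.mean(np.abs(d - dstar) / dstar))
        sq_rel = float(np.mean((d - dstar) ** 2 / dstar))
        return abs_rel, sq_rel

    # Example with random stand-in images and depth maps.
    img_pred = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
    img_gt = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
    print("SSIM:", structural_similarity(img_pred, img_gt, channel_axis=-1))
    print("PSNR:", psnr(img_pred, img_gt))

    depth_pred = np.random.rand(256, 192) + 0.5
    depth_gt = np.random.rand(256, 192) + 0.5
    print("AbsRel, SqRel:", depth_errors(depth_pred, depth_gt))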

Conclusion The convolutional block attention module and the vision transformer introduced in the T3D-VTON network preserve the textural patterns and brand logos on the garment surface when dealing with complex textures. The network effectively regulates garment deformation and blends the garment reasonably with the dressing area of the target person. When generating the 3-D human body model, the network produces clearer edges and more accurate body shapes. The method finally presents a 3-D human body model with richer surface texture and more accurate body shape, providing a fast and economical solution for single-image 3-D virtual try-on applications.
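To make the single-image-to-3-D idea concrete, the sketch below lifts a 2-D try-on image and estimated front/back depth maps into a colored point cloud by orthographic back-projection. This is an illustrative assumption about the fusion step, not the exact T3D-VTON fusion module; surface meshing (e.g. Poisson reconstruction) would follow.

    # Illustrative RGB-D fusion: back-project try-on image + depth maps
    # into a colored point cloud (orthographic camera assumed).
    import numpy as np

    def depth_to_points(image: np.ndarray, depth: np.ndarray,
                        mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """Return (N, 3) points and (N, 3) colors for foreground pixels.
        x, y come from the normalized pixel grid, z from the depth map."""
        h, w = depth.shape
        ys, xs = np.nonzero(mask)
        pts = np.stack([xs / w - 0.5, 0.5 - ys / h, depth[ys, xs]], axis=1)
        cols = image[ys, xs] / 255.0
        return pts, cols

    # Fuse front and back depth estimates into one cloud (stand-in data).
    img = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
    front = np.random.rand(256, 192)
    back = front + 0.2                 # hypothetical back-surface depth
    mask = np.ones((256, 192), dtype=bool)
    pts_f, col_f = depth_to_points(img, front, mask)
    pts_b, col_b = depth_to_points(img, back, mask)
    cloud = np.concatenate([pts_f, pts_b])   # mesh extraction would follow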

Key words: virtual try-on, vision transformer, attention mechanism, depth estimation, three-dimensional reconstruction

CLC Number: TS942.8

Fig. 1 T3D-VTON overall network structure

Fig. 2 Convolutional block attention module

Fig. 3 Depth estimation network

Fig. 4 Flowchart of 3-D fusion network

Tab. 1 Quantitative evaluation results of 2-D virtual try-on

Network     SSIM      PSNR/dB
M3D-VTON    0.921 8   20.420 5
T3D-VTON    0.937 5   20.533 7

Fig. 5 Visual comparison of 2-D virtual try-on: (a) comparison of garment deformation; (b) comparison of garment deformation with artifacts eliminated; (c) comparison of logo preservation with artifacts eliminated

Tab. 2 Quantitative evaluation results of depth estimation

Network     AbsRel    SqRel
M3D-VTON    7.223     0.375
T3D-VTON    7.186     0.361

Fig. 6 Visual comparison of 3-D human body: (a) visual comparison of abdominal adhesions; (b) visual comparison of abdominal and knee adhesions

[1] HAN X T, WU Z X, WU Z, et al. VITON: an image-based virtual try-on network[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2018: 7543-7552.
[2] WANG B C, ZHENG H B, LIANG X D, et al. Toward characteristic-preserving image-based virtual try-on network[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer-Verlag, 2018: 589-604.
[3] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Advances in Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2672-2680.
[4] HONDA S. VITON-GAN: virtual try-on image generator trained with adversarial loss[C]// Proceedings of Eurographics: Short Papers. Genova: The Eurographics Association, 2019: 9-10.
[5] CHOI S, PARK S, LEE M, et al. VITON-HD: high-resolution virtual try-on via misalignment-aware normalization[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2021: 14131-14140.
[6] LOPER M, MAHMOOD N, ROMERO J, et al. SMPL: a skinned multi-person linear model[J]. ACM Transactions on Graphics, 2015, 34(6): 1-16.
[7] KANAZAWA A, BLACK M J, JACOBS D W, et al. End-to-end recovery of human shape and pose[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2018: 7122-7131.
[8] ZHU H, ZUO X X, WANG S, et al. Detailed human shape estimation from a single image by hierarchical mesh deformation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2019: 4491-4500.
[9] SAITO S, HUANG Z, NATSUME R, et al. PIFu: pixel-aligned implicit function for high-resolution clothed human digitization[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos: IEEE Computer Society, 2019: 2304-2314.
[10] HE T, COLLOMOSSE J, JIN H L, et al. Geo-PIFu: geometry and pixel aligned implicit functions for single-view human reconstruction[J]. Advances in Neural Information Processing Systems, 2020, 33: 9276-9287.
[11] HUANG Z, XU Y L, LASSNER C, et al. ARCH: animatable reconstruction of clothed humans[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2020: 3093-3102.
[12] EFENDIEV Y, LEUNG W T, LIN G, et al. HEI-human: a hybrid explicit-implicit learning for multiscale problems[C]// 4th Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Berlin: Springer-Verlag, 2021: 251-262.
[13] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2018: 7132-7141.
[14] FU J, LIU J, TIAN H, et al. Dual attention network for scene segmentation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2019: 3146-3154.
[15] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer-Verlag, 2018: 3-19.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems. Cambridge: MIT Press, 2017: 5998-6008.
[17] ZHAO F W, XIE Z Y, KAMPFFMEYER M, et al. M3D-VTON: a monocular-to-3D virtual try-on network[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE Communications Society, 2021: 13239-13249.
[18] WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612. doi: 10.1109/tip.2003.819861. pmid: 15376593.
[19] HUYNH-THU Q, GHANBARI M. Scope of validity of PSNR in image/video quality assessment[J]. Electronics Letters, 2008, 44(13): 800-801. doi: 10.1049/el:20080522.