Volume 45 Issue 10
Oct.  2023
TONG Wei, ZHANG Miaomiao, LI Dongfang, WU Qi, SONG Aiguo. Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3483-3491. doi: 10.11999/JEIT221244

Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer

doi: 10.11999/JEIT221244
Funds: The National Natural Science Foundation of China (U1933125, 62171274), The National Natural Science Foundation of China "Ye Qisun" Key Project (U2241228), The Defense Innovation Project (193-CXCY-A04-01-11-03, 223-CXCY-A04-05-09-01), Shanghai Science and Technology Major Project (2021SHZDZX)
  • Received Date: 2022-09-26
  • Rev Recd Date: 2022-11-28
  • Available Online: 2022-11-30
  • Publish Date: 2023-10-31
  • Learning-based Multi-View Stereo (MVS) aims to reconstruct a dense 3D representation of a scene. However, previous methods rely on additional 2D network modules to learn cross-view visibility for cost aggregation, ignoring the consistency of 2D contextual features along the 3D depth direction. Moreover, current multi-stage depth inference models still require a high depth sampling rate, and depth hypotheses are sampled within a static, preset depth range, which tends to produce erroneous depth inferences at object boundaries and in occluded areas. To alleviate these problems, a multi-view stereo network based on an edge-assisted epipolar Transformer is proposed. The improvements of this work over the state of the art are as follows: depth regression is replaced by classification over multiple depth hypotheses to preserve accuracy under a limited depth sampling rate and GPU memory budget; an epipolar Transformer block is developed for reliable cross-view cost aggregation, and an edge detection branch is designed to constrain the consistency of edge features along the epipolar direction; a dynamic depth range sampling mechanism based on the probabilistic cost volume is applied to improve accuracy in uncertain areas. Comprehensive comparisons with the state of the art on public benchmarks indicate that the proposed method reconstructs dense scene representations within a limited memory bottleneck. Specifically, compared with Cas-MVSNet, memory consumption is reduced by 35%, the depth sampling rate is reduced by about 50%, and the overall error on the DTU dataset is reduced from 0.355 to 0.325.
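The first improvement above, replacing soft depth regression with classification over a fixed set of depth hypotheses, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, the winner-take-all readout, and the score convention are all assumptions for illustration.

```python
import numpy as np

def depth_from_classification(cost_volume, depth_hypotheses):
    """Turn a per-pixel matching score over D depth hypotheses into a depth map.

    cost_volume: (D, H, W) array of matching scores (higher = better match).
    depth_hypotheses: (D,) array of candidate depth values.
    Returns the winning hypothesis per pixel (classification) and the
    softmax probability volume, which later stages can reuse.
    """
    # Softmax along the depth dimension -> per-pixel probability volume.
    scores = cost_volume - cost_volume.max(axis=0, keepdims=True)  # stability
    prob = np.exp(scores)
    prob /= prob.sum(axis=0, keepdims=True)
    # Classification: pick the most probable hypothesis instead of a
    # soft-argmin regression, keeping accuracy at a low sampling rate.
    best = prob.argmax(axis=0)              # (H, W) hypothesis indices
    depth_map = depth_hypotheses[best]      # (H, W) depths
    return depth_map, prob
```

In training, such a head would be supervised with a cross-entropy loss against the hypothesis nearest the ground-truth depth, rather than an L1 loss on a regressed depth.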
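The epipolar Transformer block aggregates source-view features sampled along the epipolar line of each reference pixel. A minimal single-query attention sketch, with all names and shapes assumed for illustration (the actual block operates batched over all pixels, views, and stages):

```python
import numpy as np

def epipolar_attention(q, k, v):
    """Attention for one reference pixel along one epipolar line.

    q: (C,) reference-view feature at the pixel.
    k, v: (N, C) source-view features sampled at N positions along the
          pixel's epipolar line in the source image.
    Returns the aggregated source feature, shape (C,).
    """
    c = q.shape[0]
    logits = k @ q / np.sqrt(c)   # (N,) scaled similarity per sample
    logits -= logits.max()        # numerical stability
    w = np.exp(logits)
    w /= w.sum()                  # attention weights along the line
    return w @ v                  # weighted aggregation of source features
```

The edge detection branch described in the abstract would additionally constrain these features so that edge responses stay consistent along the epipolar direction; that supervision is outside this sketch.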
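The dynamic depth range mechanism derives the next stage's per-pixel search interval from the current stage's probabilistic cost volume instead of a static preset range. A hedged sketch, assuming the common mean-plus-variance formulation (the paper's exact formulation may differ):

```python
import numpy as np

def dynamic_depth_range(prob, depth_hypotheses, k=1.5):
    """Per-pixel depth search range for the next stage.

    prob: (D, H, W) softmax probabilities over depth hypotheses.
    depth_hypotheses: (D,) candidate depths of the current stage.
    Uncertain pixels (flat distributions) get a wide interval, confident
    pixels a narrow one. k scales the interval width (assumed value).
    """
    d = depth_hypotheses[:, None, None]           # (D, 1, 1) for broadcasting
    mu = (prob * d).sum(axis=0)                   # (H, W) expected depth
    var = (prob * (d - mu) ** 2).sum(axis=0)      # (H, W) per-pixel variance
    sigma = np.sqrt(var)
    return mu - k * sigma, mu + k * sigma         # (low, high) depth maps
```

Narrowing the interval where the distribution is peaked is what allows the next stage to keep accuracy while roughly halving the depth sampling rate, as reported in the abstract.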
  • [1]
    GALLIANI S, LASINGER K, and SCHINDLER K. Massively parallel multiview stereopsis by surface normal diffusion[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 873–881.
    [2]
    GU Xiaodong, FAN Zhiwen, ZHU Siyu, et al. Cascade cost volume for high-resolution multi-view stereo and stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2492–2501.
    [3]
    LUO Keyang, GUAN Tao, JU Lili, et al. Attention-aware multi-view stereo[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1587–1596.
    [4]
    YAO Yao, LUO Zixin, LI Shiwei, et al. MVSNet: Depth inference for unstructured multi-view stereo[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 785–801.
    [5]
    YAO Yao, LUO Zixin, LI Shiwei, et al. Recurrent MVSNet for high-resolution multi-view stereo depth inference[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, California, USA, 2019: 5520–5529.
    [6]
    XU Haofei and ZHANG Juyong. AANET: Adaptive aggregation network for efficient stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1956–1965.
    [7]
    CHENG Shuo, XU Zexiang, ZHU Shilin, et al. Deep stereo using adaptive thin volume representation with uncertainty awareness[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2521–2531.
    [8]
    YI Hongwei, WEI Zizhuang, DING Mingyu, et al. Pyramid multi-view stereo net with self-adaptive view aggregation[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 766–782.
    [9]
    HE Chenhang, ZENG Hui, HUANG Jianqiang, et al. Structure aware single-stage 3D object detection from point cloud[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11870–11879.
    [10]
    MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 405–421.
    [11]
    LUO Shitong and HU Wei. Diffusion probabilistic models for 3D point cloud generation[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2836–2844.
    [12]
    ZHANG Jingyang, YAO Yao, LI Shiwei, et al. Visibility-aware multi-view stereo network[C/OL]. Proceedings of the 31st British Machine Vision Conference, 2020.
    [13]
    XI Junhua, SHI Yifei, WANG Yijie, et al. RayMVSNet: Learning ray-based 1D implicit fields for accurate multi-view stereo[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8585–8595.
    [14]
    YU Zehao and GAO Shenghua. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss–Newton refinement[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1946–1955.
    [15]
    WANG Fangjinhua, GALLIANI S, VOGEL C, et al. PatchmatchNet: Learned multi-view patchmatch stereo[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Minnepolis, USA, 2021: 14189–14198.
    [16]
    DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minnepolis, USA, 2019: 4171–4186.
    [17]
    LI Zhaoshuo, LIU Xingtong, DRENKOW N, et al. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6177–6186.
    [18]
    SUN Jiaming, SHEN Zehong, WANG Yuang, et al. LoFTR: Detector-free local feature matching with transformers[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8918–8927.
    [19]
    WANG Xiaofeng, ZHU Zheng, QIN Fangbo, et al. MVSTER: Epipolar transformer for efficient multi-view stereo[J]. arXiv: 2204.07346, 2022.
    [20]
    DING Yikang, YUAN Wentao, ZHU Qingtian, et al. TransMVSNet: Global context-aware multi-view stereo network with transformers[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8575–8584.
    [21]
    ZHU Jie, PENG Bo, LI Wanqing, et al. Multi-view stereo with transformer[J]. arXiv: 2112.00336, 2021.
    [22]
    WEI Zizhuang, ZHU Qingtian, MIN Chen, et al. AA-RMVSNet: Adaptive aggregation recurrent multi-view stereo network[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6167–6176.
    [23]
    YI Puyuan, TANG Shengkun, and YAO Jian. DDR-Net: Learning multi-stage multi-view stereo with dynamic depth range[J]. arXiv: 2103.14275, 2021.
  • 加载中
