Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer

TONG Wei, ZHANG Miaomiao, LI Dongfang, WU Qi, SONG Aiguo

Citation: TONG Wei, ZHANG Miaomiao, LI Dongfang, WU Qi, SONG Aiguo. Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3483-3491. doi: 10.11999/JEIT221244


doi: 10.11999/JEIT221244
Funds: The National Natural Science Foundation of China (U1933125, 62171274), the National Natural Science Foundation of China "Ye Qisun" Key Project (U2241228), the Defense Innovation Project (193-CXCY-A04-01-11-03, 223-CXCY-A04-05-09-01), and the Shanghai Science and Technology Major Project (2021SHZDZX)
Article Information
    About the authors:

    TONG Wei: male, Ph.D. candidate; research interests include SLAM, scene perception, and human-machine interaction

    ZHANG Miaomiao: female, Ph.D. candidate; research interests include neural networks, reinforcement learning, and dynamic programming

    LI Dongfang: male, Ph.D., lecturer; research interests include bionic robots and UAV flight control

    WU Qi: male, Ph.D., professor; research interests include brain cognition and vision-brain interaction

    SONG Aiguo: male, Ph.D., professor; research interests include brain-computer interfaces and brain-machine fusion technology

    Corresponding author:

    WU Qi, Edmondqwu@sjtu.edu.cn

  • CLC number: TP391.4

  • Abstract: Deep-learning-based Multi-View Stereo (MVS) aims to reconstruct dense 3D scenes from multiple views. However, existing methods typically design complex 2D network modules to learn the cross-view visibility used in cost volume aggregation, while ignoring the assumption that cross-view 2D context features should be consistent along the 3D depth direction. In addition, multi-stage depth inference methods still require a high depth sampling rate and sample depth values within a static, pre-defined range, which easily produces erroneous depth estimates at object boundaries and in occluded or poorly illuminated regions. To alleviate these problems, this paper proposes a dense depth inference model based on an edge-assisted epipolar Transformer. Compared with existing work, the improvements are as follows: depth regression is reformulated as multi-depth-value classification, which preserves inference accuracy under a limited depth sampling rate and GPU memory footprint; an epipolar Transformer module is designed to improve the reliability of cross-view cost volume aggregation, and an edge detection branch is introduced to constrain the consistency of edge features along the epipolar direction; and a dynamic depth range sampling mechanism based on the probability cost volume is designed to improve accuracy in weakly textured regions. Comprehensive comparisons with mainstream methods on public datasets show that the proposed model reconstructs dense and accurate 3D scenes with a limited memory footprint. In particular, compared with Cas-MVSNet, the proposed model reduces GPU memory usage by 35%, lowers the depth sampling rate by about 50%, and reduces the overall error on the DTU dataset from 0.355 to 0.325.
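The reformulation of depth regression as multi-depth-value classification can be made concrete with a short sketch. The PyTorch-style code below is only an illustration under assumed tensor shapes and names, not the authors' released implementation: the cost volume over D depth hypotheses is normalized into a per-pixel probability volume, supervised with a cross-entropy-style loss against the hypothesis nearest the ground truth, and read out either as the most probable hypothesis or as a probability-weighted expectation.

```python
# Minimal sketch (assumed shapes/names, not the paper's code) of depth inference
# treated as multi-depth-value classification over D sampled hypotheses.
import torch
import torch.nn.functional as F

def depth_from_probability_volume(cost_volume, depth_hypotheses, use_expectation=True):
    """cost_volume, depth_hypotheses: [B, D, H, W]; returns depth [B, H, W] and the probability volume."""
    prob_volume = F.softmax(cost_volume, dim=1)                    # per-pixel distribution over the D hypotheses
    if use_expectation:
        depth = torch.sum(prob_volume * depth_hypotheses, dim=1)  # soft, expectation-style readout
    else:
        idx = torch.argmax(prob_volume, dim=1, keepdim=True)       # winner-take-all readout
        depth = torch.gather(depth_hypotheses, 1, idx).squeeze(1)
    return depth, prob_volume

def classification_loss(prob_volume, depth_hypotheses, gt_depth, valid_mask):
    """Cross-entropy against the hypothesis closest to the ground truth (gt_depth, valid_mask: [B, H, W], float mask)."""
    gt_index = torch.argmin(torch.abs(depth_hypotheses - gt_depth.unsqueeze(1)), dim=1)
    log_prob = torch.log(prob_volume.clamp(min=1e-6))
    nll = -torch.gather(log_prob, 1, gt_index.unsqueeze(1)).squeeze(1)
    return (nll * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```

Besides working with a sparse set of depth samples, the probability volume produced this way also supplies the per-pixel confidence that the dynamic depth range sampling mentioned in the abstract relies on.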
  • Figure 1  Architecture of the proposed multi-view depth map inference network

    Figure 2  Attention module for cross-view cost volume aggregation

    Figure 3  Comparison of GPU memory usage and running time of different methods

    Figure 4  Comparison of reconstruction results between the proposed method and Cas-MVSNet

    Figure 5  Reconstruction results of the proposed method on the Tanks & Temples dataset

    Figure 6  Visual comparison of feature maps from cost volume aggregation

    Figure 7  Qualitative comparison of depth maps from different methods

    Table 1  Quantitative comparison of reconstruction results of different methods on the DTU test set

    Method          Accuracy   Completeness   Overall
    Gipuma          0.283      0.873          0.578
    Colmap          0.400      0.664          0.532
    MVSNet          0.456      0.646          0.551
    MVSCRF          0.371      0.426          0.398
    Fast-MVSNet     0.336      0.403          0.370
    R-MVSNet        0.383      0.452          0.417
    Cas-MVSNet      0.325      0.385          0.355
    PatchmatchNet   0.427      0.277          0.352
    AA-RMVSNet      0.376      0.339          0.357
    Ours            0.364      0.286          0.325

    Table 2  Quantitative comparison of different methods on the Tanks & Temples dataset

    Method        Mean    M60     Train   Horse   Lighthouse   Family   Panther   Playground   Francis
    MVSNet        43.48   55.99   28.55   25.07   50.09        53.96    50.86     47.90        34.69
    DDR-Net       54.91   55.57   47.17   43.43   55.20        76.18    52.28     56.04        53.36
    UCSNet        54.83   55.60   47.89   43.03   54.00        76.09    51.49     57.38        53.16
    AA-RMVSNet    61.51   64.05   46.65   51.53   64.02        77.77    59.47     60.85        54.90
    Cas-MVSNet    56.42   53.96   46.56   46.20   55.33        76.36    54.02     58.17        58.45
    Ours          56.75   57.33   50.49   51.12   56.09        75.59    54.26     56.10        53.03

    Table 3  Quantitative comparison of the ablation study on the DTU test set

    Method                           Mean absolute error   Accuracy at fixed thresholds (%)
                                                           < 2 mm    < 4 mm    < 8 mm
    Baseline                         8.42                  77.17     83.03     89.86
    Baseline + classification loss   8.30                  79.07     86.70     90.10
    Ours                             7.69                  80.25     86.81     90.52

    Table 4  Quantitative comparison of different module combinations (classification loss, epipolar Transformer, edge-assisted module, dynamic sampling module) on the DTU test set (mm)

    Model      Accuracy   Completeness   Overall
    Baseline   0.346      0.398          0.372
    Ours-A     0.380      0.334          0.357
    Ours-B     0.360      0.302          0.331
    Ours-C     0.351      0.303          0.327
    Ours       0.364      0.286          0.325
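Table 4 ablates, among other components, the epipolar Transformer used for cross-view cost volume aggregation. As a rough illustration of the underlying idea (attention over the depth hypotheses along the epipolar direction), the sketch below uses assumed tensor shapes and omits the learned projections and multi-view fusion of the actual module:

```python
# Minimal sketch (assumptions, not the paper's module) of cross-view attention
# computed along the D depth hypotheses of the epipolar line.
import torch
import torch.nn.functional as F

def epipolar_attention(ref_feat, warped_src_feat):
    """ref_feat: [B, C, H, W] reference features (queries);
       warped_src_feat: [B, C, D, H, W] source features warped to D depth hypotheses (keys/values)."""
    scale = warped_src_feat.shape[1] ** 0.5
    sim = torch.sum(ref_feat.unsqueeze(2) * warped_src_feat, dim=1) / scale   # [B, D, H, W]
    attn = F.softmax(sim, dim=1)                                              # attention over depth hypotheses
    aggregated = torch.sum(attn.unsqueeze(1) * warped_src_feat, dim=2)        # [B, C, H, W]
    return aggregated, attn
```

Here warped_src_feat would come from homography-warping the source-view features to each depth hypothesis, so the attention weights can also act as a per-pixel, visibility-aware weighting when several source views are aggregated.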

    Table 5  Quantitative comparison of the ablation study on the dynamic sampling module on the DTU test set

    Method                    Stage-1 range (mm)   Stage-2 range (mm)   Stage-2 coverage   Stage-3 range (mm)   Stage-3 coverage   Depth samples
    Cas-MVSNet                508.8                169.72               0.9532             21.09                0.8441             48, 32, 8
    UCSNet                    508.8                29.46                0.8507             10.10                0.7310             64, 32, 8
    DDR-Net                   508.8                139.46               0.9317             19.24                0.8435             48, 32, 8
    Ours                      508.8                54.42                0.8891             9.16                 0.8381             16, 8, 4
    Ours + dynamic sampling   508.8                78.12                0.9003             9.16                 0.8412             16, 8, 4
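The per-pixel ranges and coverage ratios in Table 5 come from narrowing the depth search interval stage by stage according to the probability volume of the previous stage. The sketch below is a generic, variance-based version of such dynamic range sampling with assumed names and a 3-sigma width; the paper's exact rule may differ:

```python
# Minimal sketch (assumptions, not the paper's exact formulation) of probability-volume-driven
# dynamic depth-range sampling for the next stage of a coarse-to-fine MVS network.
import torch

def next_stage_depth_range(prob_volume, depth_hypotheses, num_sigma=3.0, min_half_range=1.0):
    """prob_volume, depth_hypotheses: [B, D, H, W]; returns per-pixel (depth_min, depth_max), each [B, H, W]."""
    mean_depth = torch.sum(prob_volume * depth_hypotheses, dim=1)                            # expectation of the distribution
    var = torch.sum(prob_volume * (depth_hypotheses - mean_depth.unsqueeze(1)) ** 2, dim=1)  # per-pixel spread
    sigma = torch.sqrt(var.clamp(min=1e-8))
    # confident pixels get a narrow interval; uncertain (e.g. weakly textured) pixels keep a wide one
    half_range = torch.clamp(num_sigma * sigma, min=min_half_range)
    return mean_depth - half_range, mean_depth + half_range
```

The next stage then places its smaller number of hypotheses (16, 8, 4 per stage for the proposed model in Table 5) inside this per-pixel interval, which is how a reduced sampling rate can still keep the ground-truth depth covered for most pixels.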
  • [1] GALLIANI S, LASINGER K, and SCHINDLER K. Massively parallel multiview stereopsis by surface normal diffusion[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 873–881.
    [2] GU Xiaodong, FAN Zhiwen, ZHU Siyu, et al. Cascade cost volume for high-resolution multi-view stereo and stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2492–2501.
    [3] LUO Keyang, GUAN Tao, JU Lili, et al. Attention-aware multi-view stereo[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1587–1596.
    [4] YAO Yao, LUO Zixin, LI Shiwei, et al. MVSNet: Depth inference for unstructured multi-view stereo[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 785–801.
    [5] YAO Yao, LUO Zixin, LI Shiwei, et al. Recurrent MVSNet for high-resolution multi-view stereo depth inference[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, California, USA, 2019: 5520–5529.
    [6] XU Haofei and ZHANG Juyong. AANET: Adaptive aggregation network for efficient stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1956–1965.
    [7] CHENG Shuo, XU Zexiang, ZHU Shilin, et al. Deep stereo using adaptive thin volume representation with uncertainty awareness[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2521–2531.
    [8] YI Hongwei, WEI Zizhuang, DING Mingyu, et al. Pyramid multi-view stereo net with self-adaptive view aggregation[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 766–782.
    [9] HE Chenhang, ZENG Hui, HUANG Jianqiang, et al. Structure aware single-stage 3D object detection from point cloud[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11870–11879.
    [10] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 405–421.
    [11] LUO Shitong and HU Wei. Diffusion probabilistic models for 3D point cloud generation[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2836–2844.
    [12] ZHANG Jingyang, YAO Yao, LI Shiwei, et al. Visibility-aware multi-view stereo network[C/OL]. Proceedings of the 31st British Machine Vision Conference, 2020.
    [13] XI Junhua, SHI Yifei, WANG Yijie, et al. RayMVSNet: Learning ray-based 1D implicit fields for accurate multi-view stereo[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8585–8595.
    [14] YU Zehao and GAO Shenghua. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss–Newton refinement[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1946–1955.
    [15] WANG Fangjinhua, GALLIANI S, VOGEL C, et al. PatchmatchNet: Learned multi-view patchmatch stereo[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 14189–14198.
    [16] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171–4186.
    [17] LI Zhaoshuo, LIU Xingtong, DRENKOW N, et al. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6177–6186.
    [18] SUN Jiaming, SHEN Zehong, WANG Yuang, et al. LoFTR: Detector-free local feature matching with transformers[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8918–8927.
    [19] WANG Xiaofeng, ZHU Zheng, QIN Fangbo, et al. MVSTER: Epipolar transformer for efficient multi-view stereo[J]. arXiv: 2204.07346, 2022.
    [20] DING Yikang, YUAN Wentao, ZHU Qingtian, et al. TransMVSNet: Global context-aware multi-view stereo network with transformers[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8575–8584.
    [21] ZHU Jie, PENG Bo, LI Wanqing, et al. Multi-view stereo with transformer[J]. arXiv: 2112.00336, 2021.
    [22] WEI Zizhuang, ZHU Qingtian, MIN Chen, et al. AA-RMVSNet: Adaptive aggregation recurrent multi-view stereo network[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6167–6176.
    [23] YI Puyuan, TANG Shengkun, and YAO Jian. DDR-Net: Learning multi-stage multi-view stereo with dynamic depth range[J]. arXiv: 2103.14275, 2021.
Publication History
  • Received: 2022-09-26
  • Revised: 2022-11-28
  • Available online: 2022-11-30
  • Published in issue: 2023-10-31
