Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer

TONG Wei, ZHANG Miaomiao, LI Dongfang, WU Qi, SONG Aiguo

Citation: TONG Wei, ZHANG Miaomiao, LI Dongfang, WU Qi, SONG Aiguo. Multiview Scene Reconstruction Based on Edge Assisted Epipolar Transformer[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3483-3491. doi: 10.11999/JEIT221244


doi: 10.11999/JEIT221244
Funds: The National Natural Science Foundation of China (U1933125, 62171274), the National Natural Science Foundation of China "Ye Qisun" Key Project (U2241228), the Defense Innovation Project (193-CXCY-A04-01-11-03, 223-CXCY-A04-05-09-01), and the Shanghai Science and Technology Major Project (2021SHZDZX)
Article information
    About the authors:

    TONG Wei: Male, Ph.D. candidate. His research interests include SLAM, scene perception, and human-machine interaction

    ZHANG Miaomiao: Female, Ph.D. candidate. Her research interests include neural networks, reinforcement learning, and dynamic programming

    LI Dongfang: Male, Ph.D., lecturer. His research interests include bionic robots and UAV flight control

    WU Qi: Male, Ph.D., professor. His research interests include brain cognition and vision-brain interaction

    SONG Aiguo: Male, Ph.D., professor. His research interests include brain-computer interfaces and brain-computer fusion technology

    Corresponding author:

    WU Qi, Edmondqwu@sjtu.edu.cn

  • CLC number: TP391.4

  • Abstract: Deep-learning-based Multi-View Stereo (MVS) aims to reconstruct dense 3D scenes from multiple views. However, existing methods typically design complex 2D network modules to learn cross-view visibility for cost volume aggregation, while ignoring the assumption that cross-view 2D context features should be consistent along the 3D depth direction. Moreover, multi-stage depth inference methods still require a relatively high depth sampling rate and sample depth values within a static, predefined range, which easily produces erroneous depth estimates at object boundaries and in regions affected by illumination and occlusion. To alleviate these problems, this paper proposes a dense depth inference model based on an edge-assisted epipolar Transformer. Compared with existing work, the main improvements are as follows: depth regression is recast as classification over multiple depth values, which preserves inference accuracy under a limited depth sampling rate and GPU memory footprint; an epipolar Transformer module is designed to improve the reliability of cross-view cost volume aggregation, and an edge detection branch is introduced to constrain the consistency of edge features along the epipolar direction; and a dynamic depth range sampling mechanism based on the probability cost volume is designed to improve accuracy in weakly textured regions. Comprehensive comparisons with mainstream methods on public datasets show that the proposed model reconstructs dense and accurate 3D scenes with a limited memory footprint. In particular, compared with Cas-MVSNet, the proposed model reduces GPU memory usage by 35% and the depth sampling rate by about 50%, and lowers the overall error on the DTU dataset from 0.355 to 0.325.
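For intuition only, the sketch below illustrates two of the ideas summarized in the abstract under assumed conventions; it is not the authors' implementation. It treats the depth readout as a classification over D per-pixel depth hypotheses and derives a dynamic per-pixel depth range for the next stage from the spread of the probability cost volume. The tensor shapes ([B, D, H, W]), the function names, and the scale parameter are illustrative assumptions.

```python
# Minimal sketch (assumed names and shapes, not the paper's released code):
# depth-as-classification readout and probability-volume-driven dynamic range sampling.
import torch
import torch.nn.functional as F

def classification_loss(cost_volume, gt_bin_index, valid_mask):
    """Cross-entropy against the ground-truth depth bin: one way to cast depth inference as classification.
    cost_volume: [B, D, H, W] raw scores; gt_bin_index: [B, H, W] long; valid_mask: [B, H, W] float."""
    ce = F.cross_entropy(cost_volume, gt_bin_index, reduction='none')  # per-pixel loss, [B, H, W]
    return (ce * valid_mask).sum() / valid_mask.sum().clamp_min(1.0)

def classify_depth(cost_volume, depth_hypotheses):
    """Winner-take-all depth readout over the D hypotheses instead of a soft regression."""
    prob = F.softmax(cost_volume, dim=1)                       # per-pixel distribution over depth classes
    idx = prob.argmax(dim=1, keepdim=True)                     # most likely depth class, [B, 1, H, W]
    depth = torch.gather(depth_hypotheses, 1, idx).squeeze(1)  # selected depth value, [B, H, W]
    return depth, prob

def dynamic_depth_range(prob, depth_hypotheses, scale=1.0):
    """Per-pixel sampling interval for the next stage: wider where the distribution is flat
    (e.g. weakly textured regions), narrower where it is sharply peaked."""
    mean = (prob * depth_hypotheses).sum(dim=1)                # expected depth, [B, H, W]
    var = (prob * (depth_hypotheses - mean.unsqueeze(1)) ** 2).sum(dim=1)
    sigma = var.clamp_min(1e-8).sqrt()
    return mean - scale * sigma, mean + scale * sigma          # lower and upper bounds per pixel
```

In a cascade pipeline, the returned per-pixel interval would replace a fixed, pre-set range when generating the next stage's depth hypotheses (compare the per-stage range and coverage columns of Table 5).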
  • Figure 1  Architecture of the proposed multi-view depth map inference network

    Figure 2  Attention module for cross-view cost volume aggregation

    Figure 3  Comparison of GPU memory usage and running time of different methods

    Figure 4  Comparison of reconstruction results between the proposed method and Cas-MVSNet

    Figure 5  Reconstruction results of the proposed method on the Tanks & Temples dataset

    Figure 6  Visual comparison of feature maps for cost volume aggregation

    Figure 7  Qualitative comparison of depth maps from different methods

    Table 1  Quantitative comparison of reconstruction results of different methods on the DTU test set (mm)

    Method          Accuracy   Completeness   Overall
    Gipuma          0.283      0.873          0.578
    Colmap          0.400      0.664          0.532
    MVSNet          0.456      0.646          0.551
    MVSCRF          0.371      0.426          0.398
    Fast-MVSNet     0.336      0.403          0.370
    R-MVSNet        0.383      0.452          0.417
    Cas-MVSNet      0.325      0.385          0.355
    PatchmatchNet   0.427      0.277          0.352
    AA-RMVSNet      0.376      0.339          0.357
    Ours            0.364      0.286          0.325

    Table 2  Quantitative comparison of different methods on the Tanks & Temples dataset

    Method       Mean    M60     Train   Horse   Lighthouse   Family   Panther   Playground   Francis
    MVSNet       43.48   55.99   28.55   25.07   50.09        53.96    50.86     47.90        34.69
    DDR-Net      54.91   55.57   47.17   43.43   55.20        76.18    52.28     56.04        53.36
    UCSNet       54.83   55.60   47.89   43.03   54.00        76.09    51.49     57.38        53.16
    AA-RMVSNet   61.51   64.05   46.65   51.53   64.02        77.77    59.47     60.85        54.90
    Cas-MVSNet   56.42   53.96   46.56   46.20   55.33        76.36    54.02     58.17        58.45
    Ours         56.75   57.33   50.49   51.12   56.09        75.59    54.26     56.10        53.03

    Table 3  Quantitative comparison of the ablation study on the DTU test set

    Method                           Mean absolute error   <2 mm (%)   <4 mm (%)   <8 mm (%)
    Baseline                         8.42                  77.17       83.03       89.86
    Baseline + classification loss   8.30                  79.07       86.70       90.10
    Ours                             7.69                  80.25       86.81       90.52

    Table 4  Quantitative comparison of different modules (classification loss, epipolar Transformer, edge-assisted module, dynamic sampling module) on the DTU test set (mm)

    Model      Accuracy   Completeness   Overall
    Baseline   0.346      0.398          0.372
    Ours-A     0.380      0.334          0.357
    Ours-B     0.360      0.302          0.331
    Ours-C     0.351      0.303          0.327
    Ours       0.364      0.286          0.325

    Table 5  Quantitative comparison of the ablation study on the dynamic sampling module on the DTU test set

    Method                    Stage-1 range (mm)   Stage-2 range (mm)   Stage-2 coverage   Stage-3 range (mm)   Stage-3 coverage   Depth samples
    Cas-MVSNet                508.8                169.72               0.9532             21.09                0.8441             48, 32, 8
    UCSNet                    508.8                29.46                0.8507             10.10                0.7310             64, 32, 8
    DDR-Net                   508.8                139.46               0.9317             19.24                0.8435             48, 32, 8
    Ours                      508.8                54.42                0.8891             9.16                 0.8381             16, 8, 4
    Ours + dynamic sampling   508.8                78.12                0.9003             9.16                 0.8412             16, 8, 4
  • [1] GALLIANI S, LASINGER K, and SCHINDLER K. Massively parallel multiview stereopsis by surface normal diffusion[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 873–881.
    [2] GU Xiaodong, FAN Zhiwen, ZHU Siyu, et al. Cascade cost volume for high-resolution multi-view stereo and stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2492–2501.
    [3] LUO Keyang, GUAN Tao, JU Lili, et al. Attention-aware multi-view stereo[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1587–1596.
    [4] YAO Yao, LUO Zixin, LI Shiwei, et al. MVSNet: Depth inference for unstructured multi-view stereo[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 785–801.
    [5] YAO Yao, LUO Zixin, LI Shiwei, et al. Recurrent MVSNet for high-resolution multi-view stereo depth inference[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5520–5529.
    [6] XU Haofei and ZHANG Juyong. AANET: Adaptive aggregation network for efficient stereo matching[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1956–1965.
    [7] CHENG Shuo, XU Zexiang, ZHU Shilin, et al. Deep stereo using adaptive thin volume representation with uncertainty awareness[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 2521–2531.
    [8] YI Hongwei, WEI Zizhuang, DING Mingyu, et al. Pyramid multi-view stereo net with self-adaptive view aggregation[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 766–782.
    [9] HE Chenhang, ZENG Hui, HUANG Jianqiang, et al. Structure aware single-stage 3D object detection from point cloud[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11870–11879.
    [10] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 405–421.
    [11] LUO Shitong and HU Wei. Diffusion probabilistic models for 3D point cloud generation[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 2836–2844.
    [12] ZHANG Jingyang, YAO Yao, LI Shiwei, et al. Visibility-aware multi-view stereo network[C/OL]. Proceedings of the 31st British Machine Vision Conference, 2020.
    [13] XI Junhua, SHI Yifei, WANG Yijie, et al. RayMVSNet: Learning ray-based 1D implicit fields for accurate multi-view stereo[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8585–8595.
    [14] YU Zehao and GAO Shenghua. Fast-MVSNet: Sparse-to-dense multi-view stereo with learned propagation and Gauss–Newton refinement[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1946–1955.
    [15] WANG Fangjinhua, GALLIANI S, VOGEL C, et al. PatchmatchNet: Learned multi-view patchmatch stereo[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 14189–14198.
    [16] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171–4186.
    [17] LI Zhaoshuo, LIU Xingtong, DRENKOW N, et al. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6177–6186.
    [18] SUN Jiaming, SHEN Zehong, WANG Yuang, et al. LoFTR: Detector-free local feature matching with transformers[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 8918–8927.
    [19] WANG Xiaofeng, ZHU Zheng, QIN Fangbo, et al. MVSTER: Epipolar transformer for efficient multi-view stereo[J]. arXiv: 2204.07346, 2022.
    [20] DING Yikang, YUAN Wentao, ZHU Qingtian, et al. TransMVSNet: Global context-aware multi-view stereo network with transformers[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8575–8584.
    [21] ZHU Jie, PENG Bo, LI Wanqing, et al. Multi-view stereo with transformer[J]. arXiv: 2112.00336, 2021.
    [22] WEI Zizhuang, ZHU Qingtian, MIN Chen, et al. AA-RMVSNet: Adaptive aggregation recurrent multi-view stereo network[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 6167–6176.
    [23] YI Puyuan, TANG Shengkun, and YAO Jian. DDR-Net: Learning multi-stage multi-view stereo with dynamic depth range[J]. arXiv: 2103.14275, 2021.
Figures (7) / Tables (5)
Metrics
  • Article views: 608
  • HTML full-text views: 391
  • PDF downloads: 148
  • Citations: 0
Publication history
  • Received: 2022-09-26
  • Revised: 2022-11-28
  • Available online: 2022-11-30
  • Published in issue: 2023-10-31
