
D-net: A generalised and optimised deep network for monocular depth estimation

Journal Article


Abstract


  • Depth estimation is an essential component in computer vision systems for achieving 3D scene understanding. Efficient and accurate depth map estimation has numerous applications, including self-driving vehicles and virtual reality tools. This paper presents a new deep network, called D-Net, for depth estimation from a single RGB image. The proposed network can be trained end-to-end, and its structure can be customised to meet different requirements in model size, speed, and prediction accuracy. Our approach gathers strong global and local contextual features at multiple resolutions, and then transfers them to high resolution to produce sharper depth maps. For the encoder backbone, D-Net can utilise many state-of-the-art models, including EfficientNet, HRNet, and Swin Transformer, to obtain dense depth maps. The proposed D-Net is designed to have minimal parameters and reduced computational complexity. Extensive evaluations on the NYUv2 and KITTI benchmark datasets show that our model is highly accurate across multiple backbones, and it achieves state-of-the-art performance on both benchmarks when combined with the Swin Transformer and HRNet.
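  • The abstract describes gathering contextual features at multiple resolutions and transferring them to high resolution; the paper itself defines the exact architecture, but as a loose, hypothetical illustration of that general multi-scale fusion idea (not D-Net's actual layers), the following NumPy sketch upsamples toy "encoder outputs" at three resolutions and averages them into one full-resolution map:

```python
import numpy as np

def upsample(feat, factor):
    # Nearest-neighbour upsampling along both spatial axes.
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_multiscale(features):
    # features: list of 2D maps at resolutions (H, W), (H/2, W/2), ...
    # Each map is upsampled to the finest resolution and averaged.
    target_h, target_w = features[0].shape
    fused = np.zeros((target_h, target_w))
    for f in features:
        factor = target_h // f.shape[0]
        fused += upsample(f, factor)
    return fused / len(features)

# Toy stand-ins for encoder feature maps at 8x8, 4x4, and 2x2.
rng = np.random.default_rng(0)
feats = [rng.random((8, 8)), rng.random((4, 4)), rng.random((2, 2))]
depth = fuse_multiscale(feats)
print(depth.shape)  # (8, 8)
```

    A real depth network would use learned convolutions and bilinear interpolation rather than this nearest-neighbour averaging; the sketch only shows the shape bookkeeping behind multi-resolution feature fusion.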

Publication Date


  • 2021

Citation


  • Thompson, J. L., Phung, S. L., & Bouzerdoum, A. (2021). D-net: A generalised and optimised deep network for monocular depth estimation. IEEE Access, 9, 134543-134555. doi:10.1109/ACCESS.2021.3116380

Scopus EID


  • 2-s2.0-85116974992

Start Page


  • 134543

End Page


  • 134555

Volume


  • 9
