Depth perception is essential for scene understanding, autonomous navigation, and augmented reality. Estimating depth from a single 2D image is challenging because reliable cues, e.g., stereo correspondence and motion, are unavailable. Modern approaches exploit multi-scale feature extraction to provide more powerful representations for deep networks; however, these studies combine the extracted multi-scale features only by simple addition or concatenation. This paper proposes a novel region-based self-attention (rSA) unit for effective feature fusion. The rSA unit recalibrates the multi-scale responses by explicitly modelling the dependencies between channels in separate image regions. We discretize continuous depths to formulate an ordinal depth classification problem in which the relative order between categories is preserved. The experiments are performed on a dataset of 4410 RGB-D images captured in outdoor environments on the University of Wollongong's campus. The proposed module improves models trained on small-sized datasets by 22% to 40%.
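The abstract describes the rSA unit as recalibrating multi-scale responses by modelling channel dependencies within separate image regions. As an illustration only, the sketch below shows one plausible reading under stated assumptions: the feature map is split into horizontal stripes, and within each stripe a squeeze-and-excitation-style gate (per-region channel means, a tiny two-layer gating network, sigmoid) rescales every channel. The function name `rsa_recalibrate`, the stripe partition, and the weights `w1`/`w2` are hypothetical stand-ins, not the paper's actual architecture.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rsa_recalibrate(fmap, n_regions, w1, w2):
    """Hypothetical sketch of a region-based channel recalibration.

    fmap: list of C channels, each an H x W grid (list of lists).
    n_regions: number of horizontal stripes the map is split into
               (H is assumed divisible by n_regions).
    w1, w2: C x C weights of a tiny two-layer gating network,
            stand-ins for learned parameters.
    Returns a feature map of the same shape in which each channel,
    inside each region, is scaled by a sigmoid gate computed from the
    mean responses of all channels in that region.
    """
    C = len(fmap)
    H = len(fmap[0])
    W = len(fmap[0][0])
    out = [[row[:] for row in ch] for ch in fmap]
    stripe = H // n_regions
    for r in range(n_regions):
        rows = range(r * stripe, (r + 1) * stripe)
        # Squeeze: mean response of each channel within this region.
        z = [sum(fmap[c][i][j] for i in rows for j in range(W))
             / (stripe * W) for c in range(C)]
        # Excitation: linear -> ReLU -> linear -> sigmoid gates.
        h = [max(0.0, sum(w1[k][c] * z[c] for c in range(C)))
             for k in range(C)]
        g = [sigmoid(sum(w2[c][k] * h[k] for k in range(C)))
             for c in range(C)]
        # Rescale each channel's responses inside the region only.
        for c in range(C):
            for i in rows:
                for j in range(W):
                    out[c][i][j] = fmap[c][i][j] * g[c]
    return out
```

Because the gates are sigmoids in (0, 1) and are computed independently per region, channels can be emphasised or suppressed differently in different parts of the image, which is the behaviour the abstract attributes to the rSA unit.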