Gaussian process (GP) models are powerful tools for Bayesian classification, but they suffer from high computational cost. Existing approximation methods that reduce this cost fall into two categories: global and local approaches. Global approximations, which summarize the training data with inducing points, cannot account for non-stationarity and locality in complex datasets. Local approximations, which fit a separate GP to each sub-region of the input space, are prone to overfitting. This paper proposes a GP classification method that effectively exploits both global and local information through a hierarchical model. The upper layer consists of a global sparse GP that coarsely models the entire dataset. The lower layer is composed of a mixture of GP experts, which use local information to learn a fine-grained model. The key idea for avoiding overfitting and enforcing correlation among the experts is to incorporate global information into their shared prior mean function. A variational inference algorithm is developed that jointly learns the global GP, the experts, and the gating network by maximizing a lower bound on the log marginal likelihood. We explicitly represent the variational distributions of the global variables so that, conditioned on these variables, the model factorizes over the observations. This allows stochastic optimization to be employed during learning, catering for large-scale problems. Experiments on a wide range of benchmark datasets demonstrate the advantages of the model, both as a stand-alone classifier and as the top layer of a deep neural network, in terms of scalability and predictive power.
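The core hierarchical idea — local GP experts whose shared prior mean comes from a global sparse GP — can be sketched in a toy regression setting. This is a simplification, not the paper's algorithm: it uses a subset-of-regressors sparse approximation in place of variational inference, a hard sign-based split in place of a learned gating network, and a regression likelihood in place of classification. All names, kernels, and hyperparameters below are illustrative assumptions.

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls ** 2)

# Toy 1-D dataset
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)
noise = 0.1 ** 2

# Upper layer: global sparse GP with M inducing points (subset-of-regressors
# posterior mean over the inducing variables; a stand-in for the variational q(u)).
Z = np.linspace(-3, 3, 10)[:, None]            # inducing inputs
Kzz = rbf(Z, Z) + 1e-6 * np.eye(10)
Kxz = rbf(X, Z)
A = Kxz.T @ Kxz / noise + Kzz
m_u = Kzz @ np.linalg.solve(A, Kxz.T @ y) / noise

def global_mean(Xs):
    """Coarse global predictive mean, shared by all experts as prior mean."""
    return rbf(Xs, Z) @ np.linalg.solve(Kzz, m_u)

# Lower layer: each local expert is a GP whose prior mean is the global mean,
# so it only models the residual within its own region.
def expert_predict(Xl, yl, Xs, ls=0.5):
    K = rbf(Xl, Xl, ls) + noise * np.eye(len(Xl))
    r = yl - global_mean(Xl)                   # residual w.r.t. shared prior mean
    return global_mean(Xs) + rbf(Xs, Xl, ls) @ np.linalg.solve(K, r)

# Hard gating by sign of x (a stand-in for the learned gating network).
Xs = np.linspace(-3, 3, 50)[:, None]
left, right = X[:, 0] < 0, X[:, 0] >= 0
pred = np.where(Xs[:, 0] < 0,
                expert_predict(X[left], y[left], Xs),
                expert_predict(X[right], y[right], Xs))
```

Because every expert shares the same global prior mean, predictions in sparsely populated regions fall back toward the coarse global fit rather than toward zero, which is the mechanism the abstract credits with reducing overfitting and correlating the experts.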