This paper examines network and server infrastructures suited to providing a realistic audio scene from the perspective of each avatar in large-scale virtual environments. The audio scene of an avatar combines the voices and other sound sources in its vicinity, spatially placed and attenuated according to distance from the listener, with sound effects added to reflect the acoustic characteristics of the environment. We examine a range of delivery options: a central server, peer-to-peer with and without multicast, distributed proxies, distributed locale servers, and a hybrid model. We present numerical results on the effect of virtual-world characteristics such as avatar density, hearing range, and the correlation between positions in the virtual and physical worlds. We compare the delivery architectures using a set of delay metrics that capture both the interactive delay between avatars and the accuracy of the rendered scene, and we make several recommendations for scalable implementations of such applications.
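The per-avatar scene mixing described above can be illustrated with a minimal sketch. The inverse-distance attenuation law, the hard `hearing_range` cutoff, and all names below are illustrative assumptions for this sketch, not the paper's actual model:

```python
import math

def attenuate(amplitude, distance, hearing_range, ref_distance=1.0):
    """Inverse-distance attenuation, silent beyond the hearing range.

    Assumption: gain is flat inside ref_distance and falls off as 1/d.
    """
    if distance >= hearing_range:
        return 0.0
    return amplitude * ref_distance / max(distance, ref_distance)

def mix_scene(listener_pos, sources, hearing_range):
    """Sum the attenuated amplitudes of all sources audible to a listener.

    sources: iterable of (position, amplitude) pairs; positions are 2-D
    points in virtual-world coordinates.
    """
    total = 0.0
    for pos, amp in sources:
        d = math.dist(listener_pos, pos)
        total += attenuate(amp, d, hearing_range)
    return total
```

A real implementation would also apply spatial placement (panning or HRTFs) and environmental effects per source before mixing; this sketch only captures the distance and range terms the abstract refers to.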