The paper presents two empirical cases of expert musicians (a classical string quartet and a solo free-improvisation saxophonist) to analyze the explanatory power and reach of theories in the fields of expertise studies and joint action. We argue that neither the positions stressing top-down capacities of prediction, planning or perspective-taking, nor those emphasizing bottom-up embodied processes of entrainment, motor responses and emotional sharing can do justice to the empirical material. We then turn to hybrid theories in the expertise debate and interactionist accounts of cognition. Attempting to strengthen and extend them, we offer ‘Arch’: an overarching conception of musical interaction as an externalized cognitive scaffold that encompasses high- and low-level cognition, internal and external processes, as well as the shared normative space, including the musical materials, in which the musicians perform. In other words, ‘Arch’ proposes interaction as a multivariate, multimodal, overarching scaffold necessary to explain not only cases of joint performance but equally cases of solo improvisation.