Automated radiographic report generation is challenging in at least two respects. First, medical images are highly similar to one another, and the visual differences of clinical importance are often fine-grained. Second, disease-related words may be submerged by the many similar sentences describing content common to all images, causing abnormal findings to be misinterpreted as normal in the worst case. To tackle these challenges, this paper proposes a pure transformer-based framework that jointly enforces better visual-textual alignment, multi-label diagnostic classification, and word-importance weighting to facilitate report generation. To the best of our knowledge, this is the first pure transformer-based framework for medical report generation, which enjoys the transformer's capacity for learning long-range dependencies over both image regions and sentence words. Specifically, for the first challenge, we design a novel mechanism that embeds an auxiliary image-text matching objective into the transformer's encoder-decoder structure, so that better-correlated image and text features can be learned, helping the generated report discriminate between similar images. For the second challenge, we integrate an additional multi-label classification task into our framework to guide the model toward correct diagnostic predictions. In addition, we propose a term-weighting scheme that reflects the importance of individual words during training, so that our model does not miss key discriminative information. Our method achieves promising performance compared with the state of the art on two benchmark datasets, including the largest dataset, MIMIC-CXR.
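The multi-task objective described above can be sketched as a weighted sum of three losses: a word-importance-weighted generation loss, an image-text matching loss, and a multi-label classification loss. The function names, the per-word weight table, and the loss coefficients below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def weighted_generation_loss(log_probs, targets, term_weights):
    """Term-weighted cross-entropy for report generation.

    log_probs: (T, V) per-step log-probabilities over the vocabulary
    targets: (T,) gold word ids
    term_weights: (V,) importance weight per vocabulary word
                  (assumed form of the paper's term-weighting scheme)
    """
    per_token = -log_probs[np.arange(len(targets)), targets]
    w = term_weights[targets]
    # Up-weight discriminative (e.g. disease-related) words, normalize by total weight.
    return float((w * per_token).sum() / w.sum())

def binary_ce(logits, labels):
    """Binary cross-entropy with logits; reused for both the multi-label
    diagnostic classification head and the matched/mismatched pair objective."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return float(-(labels * np.log(p + eps)
                   + (1 - labels) * np.log(1 - p + eps)).mean())

def total_loss(gen, match, cls, lam_match=0.5, lam_cls=0.5):
    # lam_match and lam_cls are hypothetical trade-off coefficients.
    return gen + lam_match * match + lam_cls * cls
```

In this sketch the matching loss treats image-report pairs as binary matched/mismatched examples, and the classification loss scores a multi-label disease vector, both with the same binary cross-entropy; how the paper actually couples these objectives inside the encoder-decoder is not specified by the abstract.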