Background: Electronic medical records (EMRs) are increasingly used for research, but their data have had limited external validation. Objective: This study investigates the validity of electronic medical data (EMD) for estimating diabetes prevalence in general practitioner (GP) patients by comparing EMD with national Bettering the Evaluation and Care of Health (BEACH) data. Method: A “decision tree” was created using inclusion/exclusion of pre-agreed variables to determine the probability of diabetes in the absence of a diagnostic label. The variables included diagnoses (coded/free-text diabetes, polycystic ovarian syndrome, impaired glucose tolerance, impaired fasting glucose), diabetic annual cycle of care (DACC), hemoglobin (HbA1c) >6.5%, and prescription (metformin, other diabetes medications). Cases were identified via SQL query in the EMD of five Illawarra and Southern Practice Network practices (30,007 active patients; the 2 years to January 2015). Patient-based Supplementary Analysis of Nominated Data (SAND) sub-studies from BEACH investigating diabetes prevalence (1172 GPs; 35,162 patients; November 2012 to February 2015) served as comparison data. SAND results were adjusted for the number of GP encounters per patient per year and then age–sex standardised to match the age–sex distribution of EMD patients. Cluster-adjusted 95% confidence intervals (CIs) were calculated for both datasets. Results: EMD diabetes prevalence (type 1 and/or type 2) was 6.5% (95% CI: 4.1–8.9). Following age–sex standardisation, SAND prevalence was 6.7% (95% CI: 6.3–7.1), not significantly different from the EMD estimate. Extracting only coded diagnoses missed 13.0% of probable cases, which were subsequently identified through the presence of metformin/other diabetes medications (6.1%), free-text diabetes label (3.8%), HbA1c result* (1.6%), DACC* (1.3%), and diabetes medications* (0.2%) (*without other indicator variables). Discussion: Although complex to apply, proxy variables can improve the usefulness of EMD for research. Without their consideration, EMD results should be interpreted with caution. Conclusion: Enforceable, transparent data linkages in EMRs would resolve many problems with identification of diagnoses. Ongoing data quality improvement remains essential.
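
The proxy-variable approach described in the Method can be illustrated with a minimal Python sketch. It is not the study's actual decision tree or SQL query: the record fields, the order in which indicators are checked, and the rule that metformin alone is discounted when polycystic ovarian syndrome, impaired glucose tolerance, or impaired fasting glucose is recorded are all assumptions made for illustration. Only the variable list and the HbA1c >6.5% threshold come from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

HBA1C_THRESHOLD = 6.5  # percent; threshold stated in the abstract


@dataclass
class PatientRecord:
    """Hypothetical flattened view of one patient's EMD (field names are assumptions)."""
    coded_diabetes: bool = False
    free_text_diabetes: bool = False
    pcos_igt_or_ifg: bool = False          # alternative reasons for metformin
    dacc_recorded: bool = False            # diabetic annual cycle of care
    hba1c_percent: Optional[float] = None  # most recent result, if any
    on_metformin: bool = False
    on_other_diabetes_medication: bool = False


def classify(record: PatientRecord) -> Optional[str]:
    """Return the first indicator that flags probable diabetes, or None.

    The order of the checks is an assumed precedence, not the published tree.
    """
    if record.coded_diabetes:
        return "coded diagnosis"
    if record.free_text_diabetes:
        return "free-text label"
    if record.on_other_diabetes_medication:
        return "other diabetes medication"
    if record.hba1c_percent is not None and record.hba1c_percent > HBA1C_THRESHOLD:
        return "HbA1c result"
    if record.dacc_recorded:
        return "DACC"
    # Assumed exclusion: metformin alone does not count when another
    # recorded diagnosis could explain the prescription.
    if record.on_metformin and not record.pcos_igt_or_ifg:
        return "metformin"
    return None


def crude_prevalence(records: Iterable[PatientRecord]) -> float:
    """Crude prevalence (%) of probable diabetes; no standardisation or CI."""
    records = list(records)
    cases = sum(1 for r in records if classify(r) is not None)
    return 100.0 * cases / len(records)
```

Under these assumptions, a patient on metformin with a recorded polycystic ovarian syndrome diagnosis and no other indicator is not counted, while a patient whose only indicator is an HbA1c above the threshold is. The sketch returns a crude proportion only; the encounter adjustment, age–sex standardisation, and cluster-adjusted confidence intervals reported in the study are not reproduced here.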