SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.


💡 Research Summary

SynGP500 is a newly released, clinician‑curated synthetic corpus of 500 Australian general practice (GP) medical notes designed to fill a critical gap in primary‑care NLP resources. The authors combine three pillars: curriculum‑based clinical breadth (RACGP 2022 Curriculum), epidemiological calibration (BEACH study), and multi‑dimensional contextual grounding. Together these are intended to make the dataset reflect both the statistical distribution of real‑world GP encounters and the full spectrum of conditions that Australian GPs must be prepared to manage, including rare, curriculum‑specified diseases that are seldom seen in a single practice.

Case selection is performed in three tiers: common conditions (e.g., hypertension, type‑2 diabetes), less‑common but curriculum‑important conditions (e.g., Addison’s disease), and long‑tail rare diseases (e.g., granulomatosis with polyangiitis). Frequencies are weighted to match the national BEACH prevalence data so that the overall condition distribution approximates actual Australian GP consultation patterns. The authors report that, across 28 BEACH categories, most frequencies differ from the BEACH figures by only ±1–2%, with a modest under‑representation of an “Other” bucket that reflects the intentional curriculum focus.
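The prevalence weighting described above can be illustrated with a small quota calculation. This is a minimal sketch, not the authors' procedure: the function name `allocate_quotas`, the category names, and the weights are all hypothetical and are not taken from BEACH or the paper. It uses the largest-remainder method so that the per-category quotas always sum to the corpus size.

```python
def allocate_quotas(weights, total):
    """Split `total` notes across categories in proportion to integer weights.

    Uses the largest-remainder method with integer arithmetic, so the
    quotas always sum exactly to `total`.
    """
    weight_sum = sum(weights.values())
    # Integer share for each category, rounded down.
    quotas = {k: total * w // weight_sum for k, w in weights.items()}
    # Size of the part lost to rounding down, used to rank categories.
    remainders = {k: total * w % weight_sum for k, w in weights.items()}
    leftover = total - sum(quotas.values())
    # Give the leftover notes to the categories with the largest remainders.
    for k in sorted(weights, key=lambda k: remainders[k], reverse=True)[:leftover]:
        quotas[k] += 1
    return quotas


# Hypothetical weights (encounters per 100), for illustration only.
example_weights = {
    "respiratory": 20,
    "cardiovascular": 15,
    "musculoskeletal": 17,
    "other": 48,
}
quotas = allocate_quotas(example_weights, 500)
```

In a tiered design such as the one summarised here, a calculation like this would presumably cover only the prevalence-weighted portion of the corpus, with curriculum-specified and rare conditions added separately.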

Multi‑dimensional grounding adds realism beyond disease prevalence. Five grounding dimensions are defined: (1) up‑to‑date Australian clinical guidelines and PBS prescribing rules, (2) nine distinct consultation contexts (standard clinic, bulk‑billing, residential aged‑care, telehealth, home visit, after‑hours, community health, Aboriginal health services, mobile outreach), (3) geographic remoteness (MM1–MM7 based on a modified Monash model), (4) psychosocial determinants (housing instability, cultural factors, language barriers, family dynamics, adherence challenges), and (5) SNOMED CT‑AU mapping for each condition. For example, the management plan for STEMI varies from immediate PCI in MM1 settings to thrombolysis and patient retrieval in MM6‑MM7 remote locations, illustrating how resource constraints shape clinical reasoning.

Synthetic note generation leverages GPT‑5 (temperature 1.0) with prompts that encode the above grounding information. To avoid the “mode collapse” typical of naïve LLM generation, the authors create a library of synthetic GP “personas” that vary in verbosity, abbreviation density, note structure (SOAP, narrative, hybrid), explicitness of differential diagnosis, safety‑netting detail, and typo rate. Consequently, the 500 notes display a wide length distribution (213–1,444 words, mean 606 ± 257) and realistic stylistic noise (0.83 % typo rate). The dataset includes adult and elderly patients (no pediatric cases in this release) and is annotated with SNOMED CT‑AU concept identifiers.
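To make the persona idea concrete, the sketch below shows one way a persona and the grounding fields could be turned into a generation prompt. It is an illustration under stated assumptions: the `Persona` fields mirror the style dimensions listed above, but the class, the option values, the prompt wording, and the helper names `sample_persona` and `build_prompt` are hypothetical. The authors' actual prompts are not reproduced in this summary, and no LLM call is made here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical writing-style profile for a synthetic GP author."""
    verbosity: str
    abbreviation_density: str
    note_structure: str
    explicit_differentials: bool
    safety_netting: str
    typo_rate: float


def sample_persona(rng):
    """Draw one persona; the option lists are illustrative assumptions."""
    return Persona(
        verbosity=rng.choice(["terse", "moderate", "verbose"]),
        abbreviation_density=rng.choice(["low", "medium", "high"]),
        note_structure=rng.choice(["SOAP", "narrative", "hybrid"]),
        explicit_differentials=rng.choice([True, False]),
        safety_netting=rng.choice(["minimal", "detailed"]),
        typo_rate=rng.choice([0.0, 0.005, 0.01, 0.02]),
    )


def build_prompt(condition, context, remoteness, persona):
    """Assemble a plain-text prompt from grounding fields and a persona."""
    differentials = (
        "state differentials explicitly"
        if persona.explicit_differentials
        else "leave differentials implicit"
    )
    return "\n".join([
        f"Write a synthetic Australian GP consultation note for: {condition}.",
        f"Consultation context: {context}; remoteness: {remoteness}.",
        f"Style: {persona.verbosity} verbosity, "
        f"{persona.abbreviation_density} abbreviation density, "
        f"{persona.note_structure} structure, {differentials}, "
        f"{persona.safety_netting} safety-netting, "
        f"approximate typo rate {persona.typo_rate:.1%}.",
    ])


# Seeded generator so the same persona is drawn on every run.
persona = sample_persona(random.Random(0))
prompt = build_prompt("hypertension", "telehealth", "MM5", persona)
```

Sampling the persona independently of the clinical scenario is one plausible way to keep style and content decorrelated, which is the property the stylometric validation later checks for.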

Validation proceeds on three fronts. (1) Epidemiological validation uses LLM‑based categorisation of notes into BEACH complaint categories, with 10 % manual review confirming classification accuracy; the resulting prevalence aligns closely with real data. (2) Stylometric analysis shows high lexical diversity (moving‑average type‑token ratio, MA‑TTR, of 0.946 for 25‑word windows) and substantial intra‑note variation in article and copula density, indicating that the text is not template‑driven. (3) Semantic diversity is assessed via all‑mpnet‑base‑v2 embeddings; pairwise cosine similarity averages 0.52 (range 0.09–0.95) and UMAP visualisations reveal a dispersed embedding space, further evidence against mode collapse.
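Two of the metrics named above are simple enough to sketch directly. The code below is an illustrative implementation, not the authors' evaluation code: `mattr` computes a moving-average type-token ratio over fixed-size windows, and `mean_pairwise_cosine` averages cosine similarity over all pairs of vectors. Tokenisation and the embedding step are assumed to happen elsewhere; the source reports using all‑mpnet‑base‑v2 embeddings, whereas this sketch simply accepts precomputed vectors as lists of floats.

```python
from itertools import combinations
from math import sqrt


def mattr(tokens, window=25):
    """Moving-average type-token ratio over sliding windows of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A MATTR close to 1 means nearly every token in a typical window is distinct. A low mean pairwise cosine with a wide range, as reported above, indicates that notes are spread across the embedding space and not clustered around a few templates.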

To demonstrate downstream utility, the authors conduct an exploratory medical concept extraction experiment. They create 19 fictional GP notes containing 648 manually annotated entities (SNOMED CT‑AU concepts) and evaluate MedCAT v2.0 after 0–4 epochs of self‑supervised training on SynGP500. Performance improves with more training epochs: the most clinically relevant “grouped‑type” F1 rises from 0.71 (no SynGP500 training) to 0.78 after four epochs, suggesting that, in this exploratory setting, the synthetic corpus can improve concept‑recognition models.
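For readers unfamiliar with how extraction scores like these are computed, the sketch below shows a strict-match precision, recall, and F1 calculation over annotated entities. It is an assumption-laden illustration: entities are represented as hypothetical `(start, end, concept_id)` tuples, and matching is exact. The summary does not define the "grouped‑type" matching scheme behind the reported F1, so this sketch does not attempt to reproduce it.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match scores over collections of (start, end, concept_id) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans and placeholder concept labels, for illustration only.
gold_entities = [(0, 5, "A"), (10, 15, "B"), (20, 25, "C"), (30, 35, "D")]
predicted_entities = [(0, 5, "A"), (10, 15, "B"), (40, 45, "E")]
precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
```

Here two of the three predictions are correct and two of the four gold entities are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.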

The dataset is released under a CC BY‑NC‑SA 4.0 license as plain‑text UTF‑8 files named {SNOMED code}_{ID}_{condition name}.txt. Because it is fully synthetic and privacy‑preserving, researchers and educators can use it without ethics board approval, accelerating development of Australian‑specific clinical NLP tools for decision support, quality improvement, and education.
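The file-naming pattern above lends itself to a small parsing helper. The sketch below is a guess at how the pattern could be read, not a documented loader: it assumes that the SNOMED code and the ID contain no underscores while the condition name may, and the example filename is invented for illustration, with a placeholder code in place of a real SNOMED CT‑AU identifier.

```python
from pathlib import Path
from typing import NamedTuple


class NoteName(NamedTuple):
    snomed_code: str
    note_id: str
    condition: str


def parse_note_filename(path):
    """Split '{SNOMED code}_{ID}_{condition name}.txt' into its three parts."""
    stem = Path(path).stem  # drop the directory and the .txt suffix
    # Split on the first two underscores only, so the condition name
    # keeps any underscores of its own.
    parts = stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected filename pattern: {path!r}")
    return NoteName(*parts)


# Invented filename with a placeholder code, for illustration only.
example = parse_note_filename("notes/123456789_0042_type_2_diabetes.txt")
```

Keeping the three fields as strings avoids losing leading zeros in the ID and makes no assumption about the length of the code.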

Key contributions: (1) First publicly available Australian GP text corpus that is synthetic yet clinically grounded, (2) A reproducible framework that integrates curriculum breadth, epidemiological realism, and contextual diversity, (3) Comprehensive validation showing linguistic, semantic, and epidemiological fidelity, and (4) Empirical evidence that pre‑training on SynGP500 improves downstream medical concept extraction. Future work includes extending the corpus to pediatric cases, adding rare specialty conditions, and conducting formal validation against real GP notes once ethical clearance is obtained.

