Methodology

Hybrid Pipeline Targets 94% TSTR Fidelity

Combining structural generation with GAN and diffusion-based correction and LLM enrichment is designed to produce synthetic data that performs nearly as well as real clinical records in downstream ML tasks.

Published April 13, 2026

Synthetic medical data promises to accelerate clinical AI development while preserving patient privacy, but commodity generators produce datasets that fail downstream machine learning tasks. We present a four-stage hybrid pipeline that combines rule-based patient trajectory generation, generative adversarial and diffusion-based distribution correction, large language model clinical text enrichment, and automated six-layer validation to produce synthetic electronic health records of substantially higher fidelity than any single-method approach.

Using the Train-Synthetic-Test-Real (TSTR) paradigm as our primary benchmark, we target 94% fidelity, measured as the ratio of TSTR performance to that of models trained on real data (Train-Real-Test-Real, TRTR). This would be a substantial improvement over the 65–75% TSTR scores typical of rule-based generators alone. Full experimental validation on the MIMIC-IV and eICU datasets is underway, pending data use agreement approval.
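As a rough illustration of the metric, the sketch below computes a TSTR score, a Train-Real-Test-Real (TRTR) score, and their ratio on toy data. The nearest-centroid classifier, the accuracy metric, and the `toy_cohort` generator are stand-ins chosen so the example runs on the standard library alone; they are assumptions for illustration, not the models, metrics, or data used in the paper.

```python
import random
from statistics import mean

def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    by_class = {}
    for x, y in zip(rows, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_class.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(labels)

def tstr_trtr_ratio(synth, real_train, real_test):
    """Each argument is a (rows, labels) pair; returns (tstr, trtr, ratio)."""
    tstr = accuracy(fit_centroids(*synth), *real_test)       # train synthetic, test real
    trtr = accuracy(fit_centroids(*real_train), *real_test)  # train real, test real
    return tstr, trtr, tstr / trtr

def toy_cohort(rng, n, shift):
    """Hypothetical two-feature cohort; `shift` mimics distributional drift."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append([rng.gauss(2.0 * y + shift, 1.0), rng.gauss(-1.0 * y, 1.0)])
        labels.append(y)
    return rows, labels

rng = random.Random(0)
real_train = toy_cohort(rng, 500, shift=0.0)
real_test = toy_cohort(rng, 500, shift=0.0)
synthetic = toy_cohort(rng, 500, shift=0.4)  # deliberately imperfect synthetic data

tstr, trtr, ratio = tstr_trtr_ratio(synthetic, real_train, real_test)
print(f"TSTR={tstr:.3f} TRTR={trtr:.3f} ratio={ratio:.3f}")
```

A ratio of 1.0 would mean the model trained on synthetic data matches the model trained on real data when both are scored on the same held-out real test set; the 94% target corresponds to a ratio of 0.94.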

The pipeline addresses a critical gap: health systems need synthetic data that is good enough to train clinical models, not merely good enough to pass visual inspection. Each stage targets specific failure modes — Synthea provides clinical coherence, CTGAN and TabDDPM correct distributional errors, LLM enrichment adds clinical narrative with hallucination detection, and a six-layer validation suite enforces minimum quality standards.
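The staged structure described above can be sketched as an ordered list of transformations followed by a validation gate. This is a minimal sketch under stated assumptions: `run_pipeline`, `tag`, the stage labels, and the single check with its 0.95 minimum are all hypothetical names and values introduced for illustration, and the placeholder stages only tag records rather than calling Synthea, CTGAN, TabDDPM, or an LLM.

```python
from typing import Callable

Records = list[dict]
Stage = Callable[[Records], Records]
Check = Callable[[Records], float]

def run_pipeline(
    seed: Records,
    stages: list[tuple[str, Stage]],
    checks: dict[str, tuple[Check, float]],
) -> tuple[Records, dict[str, float]]:
    """Apply stages in order, then enforce every validation minimum."""
    records = seed
    for _name, stage in stages:
        records = stage(records)
    scores = {name: check(records) for name, (check, _) in checks.items()}
    failed = [name for name, (_, minimum) in checks.items() if scores[name] < minimum]
    if failed:
        raise ValueError(f"validation failed: {', '.join(failed)}")
    return records, scores

def tag(label: str) -> Stage:
    """Placeholder stage: appends its label so the stage ordering is visible."""
    return lambda records: [
        {**r, "history": r.get("history", []) + [label]} for r in records
    ]

stages = [
    ("structural", tag("synthea")),         # rule-based trajectory generation
    ("correction", tag("ctgan_tabddpm")),   # distribution correction
    ("enrichment", tag("llm_notes")),       # clinical text enrichment
]
# One hypothetical check with an assumed minimum; a full suite would register six layers.
checks = {"temporal_consistency": (lambda records: 1.0, 0.95)}

records, scores = run_pipeline([{"patient_id": 1}], stages, checks)
print(records[0]["history"], scores)
```

Keeping validation as a hard gate at the end, rather than as advisory logging, reflects the idea that the suite enforces minimum quality standards: a batch that misses any minimum is rejected instead of being released.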

Published TSTR benchmarks for raw Synthea output on clinical prediction tasks typically range from 0.65 to 0.75. CTGAN and TabDDPM individually achieve 0.78–0.90 on healthcare tabular data. By layering these methods, using each to correct the characteristic failures of the previous stage, we project aggregate TSTR/TRTR ratios of 0.94, a 19–29 percentage point improvement over raw Synthea baselines.

This paper describes the pipeline architecture in detail, presents an ablation study design that quantifies each stage's contribution, and grounds performance projections in published baselines from 16 cited studies including Synthea, CTGAN, TabDDPM, MIMIC-IV, medGAN, and CorGAN.

Full paper available

Download the complete white paper with methodology details, references, and supplementary data.


Related Research

Validation Framework

6-Layer Automated Validation for Synthetic Clinical Data

A comprehensive quality framework spanning statistical fidelity, clinical pathway accuracy, temporal consistency, TSTR utility, NLP coherence, and differential privacy guarantees.

Benchmark Study

Why Raw Synthetic Data Fails Clinical AI

Commodity synthetic generators score 65–75% on Train-Synthetic-Test-Real benchmarks. We quantify the gap across clinical domains and demonstrate how hybrid correction closes it.

Questions about our methodology?

We welcome collaboration with health systems, academic researchers, and AI teams.