
Differential Privacy and Federated Learning: HIPAA-Compliant Healthcare ML Pipeline Architecture

Lucio Durán
Engineering Manager & AI Solutions Architect

The Architecture

Hospital A ──┐
Hospital B ──┤
Hospital C ──┼──→ [Secure Aggregation Server (AWS)] ──→ [Global Model]
Hospital D ──┤ ↑ only encrypted gradients
Hospital E ──┘ ↓ only aggregated updates

Each hospital runs a local training node. The central server never sees raw data, raw gradients, or individual model updates. It only sees the result of secure aggregation — the encrypted sum of all updates, which it can decrypt but cannot decompose back into individual hospital contributions.

Differential Privacy: The Theory You Actually Need

Differential privacy is a mathematical definition, not a technique. A randomized algorithm M satisfies (ε, δ)-differential privacy if for all datasets D and D' differing in one record, and for all possible outputs S:

P[M(D) ∈ S] ≤ e^ε · P[M(D') ∈ S] + δ

In English: removing or adding any single patient's data changes the output distribution by at most a factor of e^ε, plus a negligible δ failure probability. The smaller ε is, the stronger the privacy guarantee.
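To build intuition for what a given ε buys, the worst-case factor e^ε can be computed directly. A back-of-the-envelope check, not part of the pipeline:

```python
import math

# Worst-case multiplicative change in any output probability
# when one patient's record is added or removed (ignoring the δ slack).
for eps in [0.5, 1.0, 8.0]:
    print(f"eps = {eps}: output probabilities differ by at most {math.exp(eps):.1f}x")
```

Note that e^8 is roughly 3000x; the formal worst-case bound at ε=8 is loose, which is one reason empirical attack testing still matters alongside the mathematical guarantee.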

For deep learning, this is applied through DP-SGD (Differentially Private Stochastic Gradient Descent):

  1. Clip each per-sample gradient to a maximum norm C
  2. Sum the clipped gradients
  3. Add Gaussian noise calibrated to C and ε
  4. Update the model with the noisy gradient
import torch
import torch.nn.functional as F
from opacus import PrivacyEngine

model = PneumoniaResNet(pretrained=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
data_loader = torch.utils.data.DataLoader(hospital_dataset, batch_size=64)

# Wrap with Opacus privacy engine
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    epochs=50,
    target_epsilon=8.0,
    target_delta=1e-5,
    max_grad_norm=1.0,  # Clipping bound C
)

# Training loop — looks identical to normal PyTorch
for epoch in range(50):
    for batch in data_loader:
        images, labels = batch
        optimizer.zero_grad()
        output = model(images)
        loss = F.binary_cross_entropy_with_logits(output, labels)
        loss.backward()
        optimizer.step()  # Opacus handles clipping + noise internally

    # Check remaining privacy budget
    epsilon = privacy_engine.get_epsilon(delta=1e-5)
    print(f"Epoch {epoch}: ε = {epsilon:.2f}")
    if epsilon > 8.0:
        print("Privacy budget exhausted. Stopping training.")
        break

The key parameter decisions that require careful experimentation:

  • max_grad_norm=1.0: Too low and you clip away useful signal. Too high and you need more noise to achieve the same ε. Sweeping from 0.1 to 10.0 typically yields 1.0 as optimal for ResNet architectures.
  • target_epsilon=8.0: A pragmatic choice. ε=1.0 is the academic gold standard but cost us 8% accuracy. ε=8.0 gave meaningful privacy guarantees while keeping accuracy within 2% of the non-private baseline.
  • target_delta=1e-5: Standard practice — δ should be less than 1/N where N is the dataset size. Our smallest hospital had ~40k images, so 1e-5 gives good margin.
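The four DP-SGD steps can also be sketched without Opacus. The following is a NumPy toy (function name and gradients are hypothetical), useful for seeing what max_grad_norm actually does:

```python
import numpy as np

def dp_sgd_step(per_sample_grads, C=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD update: clip each per-sample gradient to norm C,
    sum, then add Gaussian noise with std = noise_multiplier * C."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, C / max(np.linalg.norm(g), 1e-12))
               for g in per_sample_grads]                        # step 1: clip
    total = np.sum(clipped, axis=0)                              # step 2: sum
    total += rng.normal(0.0, noise_multiplier * C, total.shape)  # step 3: noise
    return total / len(per_sample_grads)                         # step 4: average

grads = [np.array([3.0, 4.0]),   # norm 5.0 -> scaled down to norm 1.0
         np.array([0.1, 0.2])]   # norm ~0.22 -> left untouched
update = dp_sgd_step(grads)
```

With noise_multiplier=0.0 the result is exactly the clipped average, [0.35, 0.5] for the toy gradients above: the outlier sample's influence is capped at norm C no matter how large its true gradient is, which is precisely what bounds any one patient's effect on the model.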

Rényi Differential Privacy for Budget Tracking

This is where most tutorials get the math wrong. If you run DP-SGD for 100 epochs, the naive privacy cost is the sum of the per-step costs: ε_total = k × ε_per_step over all k noisy steps. This is called basic composition, and it's dramatically pessimistic.

Rényi Differential Privacy (RDP) gives tighter bounds through a different divergence measure:

from opacus.accountants import RDPAccountant

accountant = RDPAccountant()

for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        # Each step adds noise with these parameters
        accountant.step(
            noise_multiplier=noise_multiplier,
            sample_rate=batch_size / len(dataset),
        )

    # Convert RDP guarantee to (ε, δ)-DP
    epsilon = accountant.get_epsilon(delta=1e-5)
    print(f"After epoch {epoch}: ε = {epsilon:.2f} (RDP)")

In a representative pipeline, basic composition gives ε=12.4 after 50 epochs, while RDP gives ε=4.7 for the exact same training run. That's not a minor improvement — it's the difference between a publishable result and a rejected paper. Same model, same noise, same actual privacy, but a much tighter accounting of the budget.
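To see why basic composition is so loose, compare it against the classic advanced (strong) composition theorem, which grows like √k rather than k. This is only an illustration with hypothetical parameter values; Opacus's RDP accountant is tighter still, because it also exploits the privacy amplification from Poisson subsampling.

```python
import math

def basic_composition(eps_step, k):
    # k repetitions of an eps_step-DP mechanism: linear growth in k
    return k * eps_step

def advanced_composition(eps_step, k, delta_prime=1e-5):
    # Dwork-Rothblum-Vadhan strong composition: ~sqrt(k) growth
    return (eps_step * math.sqrt(2 * k * math.log(1 / delta_prime))
            + k * eps_step * (math.exp(eps_step) - 1))

k = 5_000        # e.g. 50 epochs x 100 steps per epoch
eps_step = 0.01
print(basic_composition(eps_step, k))     # 50.0
print(advanced_composition(eps_step, k))  # under 4: an order of magnitude tighter
```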

Federated Learning with PySyft

PySyft provides the federation layer. Each hospital runs a DataSite that controls access to its data:

import syft as sy

# At Hospital A — the data owner
hospital_a = sy.orchestra.launch(
    name="hospital-a",
    port=8081,
    dev_mode=False,
    reset=True,
)

# Register the dataset with privacy budget
dataset = sy.Dataset(
    name="chest-xray-hospital-a",
    description="De-identified chest X-rays, 2020-2025",
    asset_list=[
        sy.Asset(
            name="images",
            data=real_image_tensor,       # Never leaves this node
            mock=generate_mock_images(),  # Synthetic data for testing
        ),
        sy.Asset(
            name="labels",
            data=real_labels,
            mock=generate_mock_labels(),
        ),
    ],
)
hospital_a.upload_dataset(dataset)

The central coordinator (running on AWS infrastructure) orchestrates training without accessing data:

# At the coordinator — the model trainer
hospitals = [
    sy.login(url="https://hospital-a.internal:8081", email="ml@coordinator.org", password="..."),
    sy.login(url="https://hospital-b.internal:8082", email="ml@coordinator.org", password="..."),
    sy.login(url="https://hospital-c.internal:8083", email="ml@coordinator.org", password="..."),
    sy.login(url="https://hospital-d.internal:8084", email="ml@coordinator.org", password="..."),
    sy.login(url="https://hospital-e.internal:8085", email="ml@coordinator.org", password="..."),
]

# Define the training code that runs on each hospital
@sy.syft_function_single_use(
    images=hospitals[0].datasets["chest-xray-hospital-a"]["images"],
    labels=hospitals[0].datasets["chest-xray-hospital-a"]["labels"],
)
def train_local(images, labels):
    import torch
    import torch.nn.functional as F
    from opacus import PrivacyEngine

    model = load_global_model()  # Deserialized from coordinator
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data_loader = make_loader(images, labels, batch_size=64)

    privacy_engine = PrivacyEngine()
    model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
        module=model, optimizer=optimizer, data_loader=data_loader,
        epochs=1, target_epsilon=0.5, target_delta=1e-5, max_grad_norm=1.0,
    )

    model.train()
    for batch_images, batch_labels in data_loader:
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()

    return extract_model_updates(model)  # Only gradients, not raw data

Secure Aggregation: The Missing Piece

Even with differential privacy on each hospital's update, the coordinator could potentially learn something about a specific hospital's data distribution from its individual gradient update. Secure aggregation prevents this.

The protocol (simplified):

import numpy as np
from cryptography.hazmat.primitives.asymmetric import x25519

def secure_aggregate(hospital_updates, private_keys, public_keys):
    """
    Hospitals secret-share pairwise masks so the coordinator
    only sees the sum of all updates, not individual ones.
    Simplified: derive_shared_seed and prng_from_seed are helpers
    wrapping an x25519 key exchange and a seeded PRNG.
    """
    n = len(hospital_updates)
    masked_updates = []

    for i in range(n):
        mask = np.zeros_like(hospital_updates[i])
        for j in range(n):
            if i == j:
                continue
            # Derive shared seed from Diffie-Hellman key exchange
            shared_seed = derive_shared_seed(private_keys[i], public_keys[j])
            pairwise_mask = prng_from_seed(shared_seed, shape=mask.shape)

            # Masks cancel out when summed: hospital_i adds +mask_ij,
            # hospital_j adds -mask_ij
            if i < j:
                mask += pairwise_mask
            else:
                mask -= pairwise_mask

        masked_updates.append(hospital_updates[i] + mask)

    # Coordinator sums masked updates — masks cancel, only true sum remains
    aggregate = sum(masked_updates) / n
    return aggregate

When the coordinator sums all masked updates, the pairwise masks cancel out perfectly, leaving only the true average of all hospitals' updates. The coordinator never sees any individual hospital's actual gradient.
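The cancellation is easy to verify numerically. A minimal sketch with plain NumPy random masks standing in for the Diffie-Hellman-derived PRNG streams:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5
updates = [rng.normal(size=4) for _ in range(n)]  # true gradient updates

# One shared mask per hospital pair; i adds +m_ij, j subtracts it.
pair_masks = {(i, j): rng.normal(size=4) for i in range(n) for j in range(i + 1, n)}

masked = []
for i in range(n):
    mask = np.zeros(4)
    for j in range(n):
        if i < j:
            mask += pair_masks[(i, j)]
        elif i > j:
            mask -= pair_masks[(j, i)]
    masked.append(updates[i] + mask)

# Each masked update alone looks like noise, but the sum is exact.
assert np.allclose(sum(masked), sum(updates))
```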

The FLAME (Federated Learning with Anonymous Masked Exchange) variant adds a dropout-tolerant threshold scheme: if a hospital disconnects mid-round, the remaining hospitals can still reconstruct the aggregate without that hospital's update.

The Non-IID Problem: Why Hospital Data Is Weird

Real hospital data is wildly non-IID (not independently and identically distributed). Hospital A is a pediatric center — mostly children's chest X-rays. Hospital C is a geriatric facility — elderly patients with comorbidities. Hospital E is a rural clinic — different imaging equipment, different patient demographics.

Standard FedAvg falls apart with non-IID data. After 10 rounds, divergent local models average to garbage. The fix was FedProx with a proximal term that penalizes local models for drifting too far from the global model:

def train_local_fedprox(model, global_model, optimizer, data_loader, mu=0.01):
    model.train()
    # Snapshot of global weights; detached so no gradients flow into them
    global_params = {name: param.detach().clone()
                     for name, param in global_model.named_parameters()}

    for batch_images, batch_labels in data_loader:
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(batch_images), batch_labels)

        # Proximal term: penalize drift from global model
        proximal_loss = 0.0
        for name, param in model.named_parameters():
            proximal_loss += ((param - global_params[name]) ** 2).sum()
        loss += (mu / 2) * proximal_loss

        loss.backward()
        optimizer.step()

This single change brought convergence from "never" to 30 rounds. The mu parameter controls the tug-of-war between fitting local data and staying close to the global model.
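Note that FedProx only changes the client objective; the server side remains the same weighted average as FedAvg. A sketch of that aggregation step (function and variable names are hypothetical, and updates are assumed to arrive as flat arrays):

```python
import numpy as np

def fedavg_aggregate(updates, sample_counts):
    """Average hospital updates weighted by local dataset size,
    so a 200k-image hospital counts more than a 40k-image clinic."""
    total = sum(sample_counts)
    agg = np.zeros_like(updates[0])
    for update, n_k in zip(updates, sample_counts):
        agg += (n_k / total) * update
    return agg

updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
agg = fedavg_aggregate(updates, sample_counts=[100, 300])  # weights 0.25 / 0.75
```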

Results That Passed the Audit

After 50 federated rounds with DP (ε=8.0 total budget per hospital):

| Metric | Centralized | Federated (no DP) | Federated + DP (ε=8) |
|--------|-------------|-------------------|----------------------|
| AUC | 0.961 | 0.942 | 0.934 |
| Sensitivity | 0.923 | 0.911 | 0.897 |
| Specificity | 0.954 | 0.938 | 0.931 |
| PPV | 0.891 | 0.872 | 0.858 |

The 2.7-point AUC drop from centralized to federated+DP (0.961 → 0.934) is clinically acceptable for a screening tool. The HIPAA auditor's report specifically called out the formal privacy guarantee (ε=8.0, δ=1e-5) as a strength — they'd never seen a quantitative privacy bound before, just "the data was anonymized."

Membership Inference Attack Testing

Running membership inference attacks against the trained model validates the privacy guarantees empirically:

from ml_privacy_meter import run_population_metric_attack

# Train shadow models on similar data
attack_results = run_population_metric_attack(
    target_model=federated_dp_model,
    population_data=held_out_data,
    member_data=training_data_sample,
    num_shadow_models=10,
)

print(f"Attack AUC: {attack_results.auc:.3f}")
# Without DP: Attack AUC = 0.71 (model is leaking membership info)
# With DP ε=8: Attack AUC = 0.53 (barely better than random guessing)
# With DP ε=1: Attack AUC = 0.51 (essentially random)

At ε=8.0, the membership inference attack achieved only 53% AUC — barely above the 50% random baseline. This empirically confirms that the DP guarantee is working: an attacker can't meaningfully determine whether a specific patient's X-ray was in the training set.
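The intuition behind membership inference is that overfit models assign systematically lower loss to examples they were trained on. A minimal loss-threshold attack on synthetic loss values (a toy, not the shadow-model attack above) makes the AUC numbers concrete:

```python
import numpy as np

def attack_auc(member_losses, nonmember_losses):
    """AUC of 'predict member if loss is below threshold t', swept over t.
    Equals the probability that a random member has lower loss than a
    random non-member (the pairwise-comparison form of AUC)."""
    wins = sum((m < nm) + 0.5 * (m == nm)
               for m in member_losses for nm in nonmember_losses)
    return wins / (len(member_losses) * len(nonmember_losses))

rng = np.random.default_rng(0)
# Leaky model: members have clearly lower loss than non-members.
leaky = attack_auc(rng.normal(0.2, 0.1, 500), rng.normal(0.5, 0.1, 500))
# DP-trained model: the two loss distributions nearly coincide.
private = attack_auc(rng.normal(0.4, 0.1, 500), rng.normal(0.41, 0.1, 500))
```

Here leaky comes out well above 0.9 while private lands near 0.5, mirroring (in exaggerated form) the 0.71 vs 0.53 gap measured on the real model.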

Practical Lessons

Communication costs dominate. Each federated round sends model updates (~180 MB for a ResNet-50) from five hospitals through encrypted channels. That's nearly 1 GB per round, 50 GB total. Compressing gradients with top-k sparsification (sending only the top 1% of gradient values) cuts communication to ~5 MB per round with minimal accuracy loss.
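A sketch of the top-k idea follows (helper names are hypothetical). Real implementations transmit the index/value pairs and usually keep the discarded residual locally as error feedback, accumulating it into the next round's gradient, which is what preserves accuracy over many rounds:

```python
import numpy as np

def topk_sparsify(grad, k_fraction=0.01):
    """Keep only the largest-magnitude k% of entries; transmit
    (indices, values) instead of the dense gradient."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_fraction))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # top-k by magnitude
    return idx, flat[idx]

def densify(idx, values, shape):
    """Reconstruct a dense gradient on the receiving side."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = values
    return out.reshape(shape)

grad = np.random.default_rng(1).normal(size=1000)
idx, vals = topk_sparsify(grad)           # 10 of 1000 entries survive
restored = densify(idx, vals, grad.shape)
```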

Privacy budgets are a one-way street. Once you spend epsilon, it's gone forever for that dataset. It is common to burn through significant budget during hyperparameter tuning because every evaluation on the private data costs epsilon. The fix: do all tuning on the mock (synthetic) data, and only run the final training on real data.

Hospital IT departments are the real bottleneck. The ML engineering typically takes 6 weeks. Getting five hospital IT departments to open firewall ports, approve Docker containers, and allocate GPU resources can take 10 weeks. Starting the IT procurement process on day one is essential.

This type of pipeline runs in production, retraining monthly with fresh data from all participating hospitals. No patient data ever leaves a hospital network. The model improves over time as it encounters more diverse cases. And the privacy guarantee is mathematically provable, not just a promise in a privacy policy.

differential-privacy · federated-learning · pysyft · healthcare-ml · privacy · secure-aggregation
