Neel Shah
PySpark · 9 min read · July 15, 2025

Databricks + PySpark for Government Health Data: Architecture Considerations

What changes when your PySpark workloads run on Databricks for a government health client — governance requirements, Unity Catalog, workspace isolation, and the procurement reality.

Databricks · PySpark · Government · Healthcare · Unity Catalog · Data Governance
Neel Shah
Tech Lead · Senior Data Engineer · Ottawa, Canada

Running PySpark on Databricks for a government health organisation is fundamentally different from running it for a startup. The technical platform is largely the same. The governance requirements, procurement constraints, security standards, and approval processes are not.

Having operated in this environment at CIHI, here’s what you need to know if you’re moving a government health data workload to Databricks.

The Procurement Reality

Government procurement is slow and detailed. For a cloud platform like Databricks, you’re typically looking at a formal procurement process, security assessment, cloud service agreement review, and potentially a Privacy Impact Assessment before a single byte of health data can be processed in the environment.

Plan for this. If you’re being brought in to build a system and the platform hasn’t been procured yet, the realistic timeline to a production environment capable of handling regulated health data is months, not weeks.

The security assessment will focus on:

  • Data residency (must data stay in Canada? Which Azure regions qualify?)
  • Encryption at rest and in transit
  • Access control architecture
  • Audit logging capabilities
  • Incident response and breach notification

Databricks on Azure (Canadian regions) typically satisfies these requirements, but you need documentation to demonstrate that, not just assertions.

Unity Catalog: Essential for Health Data Governance

Unity Catalog is Databricks’ unified governance layer. For government health data, it’s not optional — it’s the architecture.

Key Unity Catalog capabilities for health data:

Column-level security: Restrict access to sensitive columns by role. In health data, this means clinical staff might see diagnosis codes and patient identifiers, while analysts only see pseudonymised data and aggregated statistics.

-- Grant access to pseudonymised view, not the source table
GRANT SELECT ON VIEW analytics.patient_cohort_masked TO `analyst-group`;

-- Source table with real identifiers — restricted to authorised users only
REVOKE SELECT ON TABLE raw.patient_records FROM `analyst-group`;
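The masked view has to be produced somewhere. One common approach is a keyed hash of the identifier columns: stable within a project (so joins still work) but not reversible without the key. A minimal sketch in plain Python, with an illustrative salt — in practice the key would live in a Databricks secret scope, never in code:

```python
import hashlib
import hmac

# Illustrative only: in a real pipeline this comes from a secret scope
SALT = b"project-secret-salt"

def pseudonymise(identifier: str) -> str:
    """Keyed SHA-256 hash of an identifier: the same input always maps to
    the same pseudonym, so joins across tables still work, but the mapping
    cannot be reversed without the salt."""
    return hmac.new(SALT, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

Registered as a UDF, a function like this can back the masked view, while the source table with real identifiers stays behind the REVOKE above.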

Row-level security: Province-specific data access. A user from the Ontario Ministry of Health should only see Ontario records.

-- In Unity Catalog, a row filter is a boolean SQL UDF attached via ALTER TABLE.
-- current_user_province() stands in for a custom province-lookup function.
CREATE FUNCTION provincial_access(province_code STRING)
RETURN is_account_group_member('admins')
  OR province_code = current_user_province();

ALTER TABLE health_data.patient_records
  SET ROW FILTER provincial_access ON (province_code);

Data lineage: Automatic tracking of which tables and columns flow through which transformations. For health data audits, being able to demonstrate that output statistics came from approved source tables — with no unapproved data joins — is a compliance requirement.

Tags and classifications: Tag columns containing PII, sensitive clinical data, or data subject to specific use restrictions. This metadata flows through the lineage graph, so you can query “which output tables contain columns derived from PII sources.”
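That “derived from PII” question is, structurally, a reachability query over the lineage graph: start from the PII-tagged tables and follow the edges downstream. A toy sketch with made-up table names (Unity Catalog answers this for you; this just shows the shape of the query):

```python
from collections import deque

# Toy lineage graph: edges point from source table to derived table
lineage = {
    "raw.patient_records": ["staging.patients_clean"],
    "staging.patients_clean": ["analytics.patient_cohort_masked"],
    "raw.facility_list": ["analytics.facility_summary"],
}

# Tables carrying PII-classified columns (illustrative)
pii_tagged = {"raw.patient_records"}

def derived_from_pii(lineage, pii_sources):
    """Breadth-first walk: return every table reachable from a PII source."""
    tainted, queue = set(), deque(pii_sources)
    while queue:
        table = queue.popleft()
        for downstream in lineage.get(table, []):
            if downstream not in tainted:
                tainted.add(downstream)
                queue.append(downstream)
    return tainted
```

Here `analytics.facility_summary` would come back clean, because nothing upstream of it carries a PII tag.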

Workspace Isolation

Government health data typically requires strict separation between environments and between projects. Databricks’ workspace isolation model maps well to this requirement:

  • Separate workspaces per environment (dev, test, production): No sharing of clusters or data between environments. Production workspace access is restricted to authorised production data handlers.
  • Separate workspaces per project where required: If different projects have different data sharing agreements or different sets of authorised users, separate workspaces ensure there’s no possibility of data leakage between projects.
  • Network isolation: Databricks workspaces in government environments are typically deployed with no public internet access. All traffic routes through your organisation’s network via private endpoints.

Cluster Configuration for Compliance

Cluster configuration affects compliance posture in several ways:

Single-user vs. shared clusters: For health data processing, single-user clusters (or isolated cluster modes) prevent users from inspecting each other’s data. Shared clusters are operationally cheaper but create isolation risks.

No credential passthrough to personal accounts: Clusters should run under managed service principals, not individual user credentials. This ensures data access is governed by the service principal’s permissions, not by whoever happens to be running jobs.

Auto-termination: Clusters should auto-terminate after a defined period of inactivity. This reduces cost and reduces the attack surface — a running cluster with health data loaded in memory is a security risk that termination eliminates.
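Put together, a compliant cluster definition for the Databricks Clusters API might look like the sketch below. The field names follow the API; the runtime version, node type, and service principal name are illustrative, so check your region’s offerings and your own naming conventions:

```python
# Sketch of a cluster spec (JSON payload for the Databricks Clusters API).
# Values are illustrative; field names follow the API.
cluster_spec = {
    "cluster_name": "health-etl-prod",
    "spark_version": "14.3.x-scala2.12",   # an LTS runtime, illustrative
    "node_type_id": "Standard_DS3_v2",     # Azure node type, illustrative
    "num_workers": 4,
    "data_security_mode": "SINGLE_USER",   # isolated access mode, UC enforced
    "single_user_name": "svc-health-etl",  # a service principal, not a person
    "autotermination_minutes": 30,         # idle clusters shut themselves down
}
```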

Databricks Jobs and Workflow Design

Production health data pipelines in government run as scheduled Databricks Jobs, not interactive notebooks. This is an important distinction:

  • Jobs run with service principal credentials, not personal credentials
  • Jobs have defined inputs, outputs, and parameters — no ad-hoc modifications
  • Job run history is logged and auditable
  • Failed jobs can be retried without manual intervention

For the CMG Grouper pipeline I built at CIHI, the Databricks Workflow runs nightly, processes the day’s incoming records, and produces output tables consumed by downstream analytical systems. The workflow has defined checkpoints, so a failure in one stage doesn’t require rerunning the entire pipeline.
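The checkpoint idea is easy to sketch outside Databricks: record which stages completed, and skip them on retry. The stage names below are illustrative, not the actual CMG pipeline:

```python
def run_pipeline(stages, completed):
    """Run stages in order, skipping any already in `completed`.
    `completed` records progress, so a retry resumes mid-pipeline
    instead of rerunning everything from the start."""
    for name, stage_fn in stages:
        if name in completed:
            continue  # checkpoint hit: this stage already succeeded
        stage_fn()
        completed.add(name)
    return completed

# Illustrative stages: each appends to a log so we can see what ran
log = []
stages = [
    ("ingest", lambda: log.append("ingest")),
    ("group", lambda: log.append("group")),
    ("publish", lambda: log.append("publish")),
]

# Simulate a retry where "ingest" already succeeded on the first attempt
completed = {"ingest"}
run_pipeline(stages, completed)
```

In Databricks, task-level repair runs on a multi-task Job give you the same behaviour without hand-rolling it.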

Cost Management

Government cloud contracts often have strict budget controls. Databricks costs can scale quickly with cluster size and job frequency. Strategies that have worked well:

Spot instances for non-critical workloads: Development and testing workloads can use spot (preemptible) instances at significantly lower cost. Production jobs should use on-demand instances.

Photon for repeated transformations: Databricks Photon (vectorised query engine) is significantly faster for certain operations. For pipelines that run repeatedly on similar data, the cost premium for Photon often pays back in reduced cluster runtime.

Delta table optimisation: Regularly running OPTIMIZE and VACUUM on Delta tables maintains query performance and reduces storage costs from accumulated small files.

What Government Clients Actually Care About

Having worked with federal and provincial health clients, I find the concerns that come up most in government health data work are:

  1. Can I prove what happened? — Complete audit trail, data lineage, access logs
  2. Who had access to what data? — Access controls, service account management, temporary access review
  3. Where did the data go? — Data residency confirmation, no unexpected network calls
  4. What happens if something goes wrong? — Incident response plan, breach notification procedure
  5. How do we turn it off? — Data destruction plan when project ends

Databricks on Azure Canadian regions, with Unity Catalog, addresses most of these technically. The rest is process and documentation.

The platform is genuinely good for this use case. The friction is in the governance processes that surround it — and understanding those processes is as important as understanding the technical platform.