Neel Shah
PySpark · 10 min read · September 15, 2025

PII Compliance at Scale: PIPEDA and Health Data Privacy with PySpark

A practical engineering guide to building PySpark pipelines that handle Personal Health Information in compliance with PIPEDA and provincial health privacy legislation in Canada.

PySpark · PIPEDA · PII · Compliance · Healthcare · Privacy
Neel Shah
Tech Lead · Senior Data Engineer · Ottawa, Canada

Privacy compliance in Canadian health data isn’t a checkbox exercise. PIPEDA, PHIPA (Ontario), PIPA (BC/Alberta), and the various provincial health information acts create a complex web of requirements that your data engineering architecture needs to support — not just your legal documentation.

After years of building PII-compliant health data systems at CIHI, here are the engineering patterns I rely on.

The Compliance Landscape

Before writing a line of code, you need to understand what legislation applies to your data. In Canada:

  • PIPEDA (federal): Applies to private-sector organisations handling personal information in commercial activities
  • PHIPA (Ontario): Governs Personal Health Information for Ontario health information custodians — stricter than PIPEDA in important ways
  • PIPA (BC/Alberta): Provincial private-sector privacy legislation substantially similar to PIPEDA
  • Quebec Law 25: Now one of the strictest privacy regimes in Canada, with significant new obligations around AI and automated decision-making
  • Provincial health information acts: Each province has specific legislation governing how health data collected by the provincial health system can be used

For national health data work, you’re typically operating under data sharing agreements that specify exactly which legislation applies and what additional restrictions the data provider has imposed. Read these agreements carefully — they often add constraints beyond what the legislation requires.

Architecture Principle: Data Minimisation

The most important engineering principle for PII compliance is data minimisation: collect the minimum data needed, process only what you need, and don’t retain what you no longer need.

In PySpark terms, this means:

from pyspark.sql.functions import col

# Bad: load full dataset, filter later
df_all = spark.read.parquet("s3://health-data/patients/")
df_needed = df_all.filter(df_all.province == "ON")

# Good: push predicate to data source, load only what you need
df_needed = spark.read.parquet("s3://health-data/patients/") \
    .filter(col("province") == "ON") \
    .select("patient_id_hash", "diagnosis_code", "admission_date", "discharge_date")
# Note: no name, DOB, address — fields not needed for this analysis

The second approach minimises how much PII enters the Spark execution environment. Fewer PII fields in memory means smaller blast radius if something goes wrong, and clearer audit trails.

Pseudonymisation vs. Anonymisation

A common confusion: pseudonymisation (replacing identifiers with reversible tokens) is not anonymisation (irreversible removal of identifying information). PIPEDA and health privacy acts treat these very differently.

For health data pipelines, we use a consistent pseudonymisation approach:

from pyspark.sql.functions import col, concat_ws, lit, sha2

def pseudonymise_patient_id(df, salt: str):
    """
    Replace direct patient identifier with a pseudonymous hash.
    Salt must be stored in a separate secure key management system.
    This is reversible only by someone with access to the salt.
    """
    return df.withColumn(
        "patient_id_hash",
        sha2(concat_ws("|", col("patient_health_number"), lit(salt)), 256)
    ).drop("patient_health_number")

Key points:

  • The salt is stored in a separate secrets manager (Azure Key Vault in our case), not in the code or config files
  • The original identifier is dropped immediately after pseudonymisation
  • The pseudonymous hash is consistent across runs (same input + same salt = same hash), enabling record linkage without storing the original

Truly anonymised data — where re-identification is not possible even with access to all other datasets — is extremely difficult to achieve for health records. Most health data pipelines work with pseudonymised data under controlled access, not truly anonymous data.
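Why this enables linkage: Spark's `sha2(concat_ws("|", ...), 256)` is simply SHA-256 over the string `id|salt`, so the determinism is easy to illustrate in plain Python. The hashlib sketch below is for illustration only; the pipeline itself uses the Spark expression above, and the salt values here are obviously not real.

```python
import hashlib

def pseudonymise(health_number: str, salt: str) -> str:
    # Same construction as Spark's sha2(concat_ws("|", col, lit(salt)), 256):
    # SHA-256 over "health_number|salt", hex-encoded.
    return hashlib.sha256(f"{health_number}|{salt}".encode("utf-8")).hexdigest()

# Deterministic: same input + same salt -> same hash, so records link across runs
h1 = pseudonymise("1234567890", salt="demo-salt")
h2 = pseudonymise("1234567890", salt="demo-salt")
assert h1 == h2

# A different salt produces an unlinkable hash, which is why rotating the salt
# (or losing it) breaks record linkage
h3 = pseudonymise("1234567890", salt="other-salt")
assert h1 != h3
```

This is also why the salt must never be rotated mid-project without a migration plan: every downstream dataset keyed on the old hashes silently stops linking.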

Access Controls in PySpark

Who can access what data is a compliance requirement, not just a security one. In Databricks (our execution environment), we implement:

  1. Column-level security: Sensitive columns (direct identifiers, sensitive diagnosis codes) are restricted by role at the Unity Catalog level
  2. Row-level security: Users from specific provinces can only access that province’s data, enforced by dynamic data masking policies
  3. Audit logging: All data access is logged via Databricks audit logs, which we forward to our SIEM

# Unity Catalog column mask (SQL, applied at the catalog level)
# This is not PySpark code — it's a catalog-level policy
# CREATE FUNCTION sensitive_diagnoses_mask(diagnosis_code STRING)
# RETURN CASE WHEN is_member('sensitive-data-access')
#             THEN diagnosis_code ELSE '***' END;
# ALTER TABLE health_data.patient_records
#   ALTER COLUMN diagnosis_code SET MASK sensitive_diagnoses_mask;
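Row-level security (point 2 above) follows the same pattern: a SQL UDF decides visibility per row, and the policy is attached to the table in the catalog. A sketch, in the same commented-SQL style; the `province-` group naming convention is illustrative, not our actual configuration.

```python
# Unity Catalog row filter (SQL, applied at the catalog level — not PySpark code)
# CREATE FUNCTION province_row_filter(province STRING)
# RETURN is_member(concat('province-', lower(province)));
# ALTER TABLE health_data.patient_records
#   SET ROW FILTER province_row_filter ON (province);
```

Because both policies live in the catalog rather than in pipeline code, every access path (notebooks, jobs, SQL warehouses) is covered without each pipeline having to reimplement the checks.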

Data Retention and Deletion

PIPEDA requires that personal information be retained only as long as necessary for the identified purpose. For health data, this often means data destruction schedules tied to the research or operational purpose.

In Delta Lake, we implement retention policies using:

# Vacuum old versions (removes deleted records from storage after retention period)
spark.sql("""
    VACUUM health_data.patient_cohort
    RETAIN 2160 HOURS  -- 90 days of version history
""")

# For hard-deletion requirements (right to erasure, where applicable):
# delete the matching records, then vacuum immediately (see below)
spark.sql("""
    DELETE FROM health_data.patient_cohort
    WHERE patient_id_hash = 'hash_value_to_delete'
""")

Hard deletion in Delta Lake requires care — the default behaviour retains old versions for 7 days. For regulatory deletion requirements, you need to vacuum immediately after deletion, which requires explicit confirmation that you understand the implications.
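The immediate-vacuum step looks roughly like this. It's a hedged sketch, not a drop-in: the `retentionDurationCheck` toggle disables a Delta safety check for the whole session, so scope it tightly and restore it afterwards.

```python
# Allow VACUUM with a retention window shorter than the 7-day default.
# This disables a safety check session-wide, so re-enable it when done.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
try:
    # Physically remove all unreferenced file versions,
    # including the ones holding the just-deleted records
    spark.sql("VACUUM health_data.patient_cohort RETAIN 0 HOURS")
finally:
    # Restore the safety check for everything else in this session
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
```

Note that concurrent readers relying on time travel or long-running queries against old snapshots can break when you vacuum this aggressively, so regulatory deletions are best run in a dedicated maintenance window.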

The Privacy Impact Assessment

Every new pipeline that handles PII should go through a Privacy Impact Assessment (PIA) before going into production. The PIA documents:

  1. What personal information is collected/processed
  2. Legal authority for the collection/processing
  3. How data flows through the system
  4. Security controls in place
  5. Retention and destruction schedule
  6. Privacy risks and mitigations

Your PySpark architecture decisions directly inform the PIA. Data minimisation, pseudonymisation approach, access controls, and audit logging all need to be described and justified.

I use the architecture documentation to draft the technical sections of the PIA, then have legal review the whole document. This is much faster than writing the PIA from scratch, and it ensures the technical documentation and the privacy documentation are consistent.

The Bottom Line

PII-compliant PySpark pipelines aren’t architecturally complex — but they require consistent discipline. Data minimisation, pseudonymisation at ingestion, column-level access controls, comprehensive audit logging, and documented retention policies are the core practices.

The key is treating compliance as an engineering requirement, not a documentation requirement. If your pipeline doesn’t implement pseudonymisation technically, a policy document saying you pseudonymise data is not compliant. The code is the compliance.