Technical blog on data engineering & AI tools
Practical writing on PySpark at scale, PII compliance, cloud data architecture, and using modern AI tools to build faster.
PySpark in Production
Processing 1 Billion+ Health Records with PySpark: Architecture and Lessons
A deep technical walkthrough of the architecture decisions, partitioning strategies, and hard-won lessons from building a PySpark pipeline that processes Canada's national health dataset at population scale.
PII Compliance at Scale: PIPEDA and Health Data Privacy with PySpark
A practical engineering guide to building PySpark pipelines that handle Personal Health Information in compliance with PIPEDA and provincial health privacy legislation in Canada.
Real-time Credit Risk Monitoring with PySpark Streaming in Financial Services
How we built a PySpark Structured Streaming pipeline to process 1 million financial transactions per hour for credit risk monitoring across Apple Card, Walmart Card, and GM Card portfolios.
Databricks + PySpark for Government Health Data: Architecture Considerations
What changes when your PySpark workloads run on Databricks for a government health client — governance requirements, Unity Catalog, workspace isolation, and the procurement reality.
Optimising PySpark Jobs for Large-Scale Financial Transaction Processing
Performance tuning patterns for PySpark pipelines handling hundreds of millions of financial transactions — partitioning strategy, join optimisation, and the profiling workflow that surfaces real bottlenecks.
AI Tools for Data Engineers
How I Use Claude to Accelerate Enterprise Data Engineering
A practical look at how Claude fits into real data engineering workflows — from PySpark code review to PII audit documentation — and why it's become an indispensable productivity tool.
Gemini for Data Analysis: A Practical Review for Enterprise Teams
After testing Google Gemini on real data engineering tasks — schema inference, SQL generation, and multi-modal data documentation — here's an honest assessment of where it fits in an enterprise stack.
Integrating OpenAI into Healthcare Data Pipelines: Lessons Learned
Real lessons from integrating GPT-4 into health data workflows — what works, what requires careful safeguards, and the compliance questions you need to answer before you start.
Perplexity AI as a Research Tool for Compliance and Regulatory Work
Perplexity's citation-backed, real-time search makes it surprisingly useful for staying current on Canadian health privacy regulations, PIPEDA updates, and financial compliance requirements.
Sarvam AI: Building Multilingual AI for Diverse Patient Populations
Sarvam AI's focus on Indian languages and healthcare has lessons for any system serving linguistically diverse populations — and raises important questions about representation in health AI.