Neel Shah
PySpark · 10 min read · June 15, 2025

Optimising PySpark Jobs for Large-Scale Financial Transaction Processing

Performance tuning patterns for PySpark pipelines handling hundreds of millions of financial transactions — partitioning strategy, join optimisation, and the profiling workflow that surfaces real bottlenecks.

PySpark · Performance · Financial Services · Optimisation · Spark Tuning
Neel Shah
Tech Lead · Senior Data Engineer · Ottawa, Canada

Slow PySpark jobs in financial services aren’t just an engineering problem — they’re a business problem. A credit risk reconciliation job that takes 4 hours instead of 45 minutes means risk exposure that isn’t visible to decision-makers until the next morning. Transaction reporting that runs too slowly to meet regulatory submission deadlines is a compliance issue.

This post is about the practical performance optimisation work I’ve done on financial transaction pipelines — what the bottlenecks actually are, how to find them, and the fixes that produce real improvements.

Start With Profiling, Not Assumptions

The most common mistake in PySpark performance work is optimising based on intuition rather than evidence. I’ve been wrong about what’s slow more times than I’d like to admit.

The Spark UI is the right starting point. For every slow job, I look at:

  1. Stage timeline: Which stages take the most time? Is it one stage dominating, or distributed across many?
  2. Task distribution within stages: Are there a small number of tasks that take much longer than the others? That’s a data skew problem.
  3. Shuffle read/write: Large shuffle volumes are expensive. If shuffle write is larger than input data, something is wrong.
  4. GC time: High garbage collection time in tasks indicates memory pressure — you’re allocating and freeing more than you should be.
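When the same triage has to run across dozens of jobs, the Spark UI's data is also available as JSON via the monitoring REST API (`/api/v1/applications/<app-id>/stages`), so the ranking step can be scripted. A minimal sketch — the field names follow Spark's REST API, but the sample payload here is fabricated for illustration:

```python
# Field names (status, executorRunTime) follow Spark's monitoring REST API
# StageData; in practice you'd fetch the JSON from the driver's /api/v1 endpoint.
def slowest_stages(stages, top_n=3):
    """Rank completed stages by total executor run time (milliseconds)."""
    completed = [s for s in stages if s.get("status") == "COMPLETE"]
    return sorted(completed, key=lambda s: s["executorRunTime"],
                  reverse=True)[:top_n]

# Hypothetical response for a job with one dominant stage
sample = [
    {"stageId": 1, "status": "COMPLETE", "executorRunTime": 5_000},
    {"stageId": 2, "status": "COMPLETE", "executorRunTime": 90_000},
    {"stageId": 3, "status": "ACTIVE", "executorRunTime": 70_000},
]
top = slowest_stages(sample, top_n=2)  # stage 2 first
```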

For financial transaction data specifically, I find data skew is the most common root cause of slow stages. Large accounts (high-volume merchants, corporate cards) generate dramatically more transactions than average accounts, and if those transactions end up in the same partition, you get a small number of very slow tasks.

Diagnosing and Fixing Data Skew

Skew in financial data often occurs on joins where one side has a dominant key. A join between transactions and account records will be slow if a handful of account IDs makes up a large fraction of all transactions.

# Diagnose skew: check distribution of records per join key
from pyspark.sql.functions import col

df_transactions.groupBy("account_id_hash") \
    .count() \
    .orderBy(col("count").desc()) \
    .show(20)

# If top accounts have 100x the average transaction count, you have skew
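The 100x rule of thumb can be turned into an explicit check. This is a sketch in plain Python (helper name and threshold are mine, not a hard rule); the counts would come from collecting the grouped result above:

```python
def skew_ratio(key_counts):
    """Ratio of the largest per-key count to the median per-key count."""
    ordered = sorted(key_counts)
    median = ordered[len(ordered) // 2]
    return max(key_counts) / median

# Hypothetical per-account transaction counts: one dominant corporate account
counts = [95, 105, 110, 120, 50_000]
ratio = skew_ratio(counts)  # far above 100x the median -> salt this join key
```

Comparing against the median rather than the mean matters here: one dominant key inflates the mean enough to mask its own skew.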

For skewed joins, the standard fix is salting: add a random prefix to the skewed key, explode the smaller side to match, then join on the salted key.

from pyspark.sql.functions import (
    col, lit, rand, floor, explode, array, concat
)

SALT_FACTOR = 10

# Salt the large (skewed) table
df_transactions_salted = df_transactions.withColumn(
    "salt",
    floor(rand() * SALT_FACTOR).cast("int")
).withColumn(
    "account_id_salted",
    concat(col("account_id_hash"), lit("_"), col("salt"))
)

# Explode the small (account) table to match all salt values
df_accounts_exploded = df_accounts.withColumn(
    "salt_values",
    array([lit(i) for i in range(SALT_FACTOR)])
).withColumn(
    "salt", explode(col("salt_values"))
).withColumn(
    "account_id_salted",
    concat(col("account_id_hash"), lit("_"), col("salt"))
)

# Join on salted key — now distributed evenly
df_result = df_transactions_salted.join(
    df_accounts_exploded,
    on="account_id_salted",
    how="left"
).drop("salt", "account_id_salted", "salt_values")

This is more complex than a standard join, but for genuinely skewed data, it can reduce job runtime by 50–80%.
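Before hand-rolling salting, it is worth checking whether Spark 3's adaptive query execution already handles the skew for you. These are standard Spark configs, though the thresholds shown are illustrative, not recommendations:

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed when it is both this many times larger
# than the median partition and above the byte threshold
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set(
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB"
)
```

AQE's skew handling applies to joins; manual salting is still the tool when the skew hits an aggregation on the hot key, or when AQE's splitting doesn't trigger for your join type.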

Partition Sizing

The default shuffle partition count (200) is almost never right for financial transaction data. Too few partitions and tasks are too large; too many and you’re paying overhead for thousands of tiny tasks.

The target is partition sizes of 100MB–200MB after shuffle. For financial data, I calculate:

# Estimate appropriate partition count
compressed_data_gb = 50  # your data size
uncompressed_multiplier = 3  # typical compression ratio
target_partition_mb = 150

total_uncompressed_mb = compressed_data_gb * 1024 * uncompressed_multiplier
optimal_partitions = int(total_uncompressed_mb / target_partition_mb)
# For 50GB compressed: ~1,000 partitions

spark.conf.set("spark.sql.shuffle.partitions", optimal_partitions)
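The same back-of-envelope arithmetic is worth wrapping in a helper so every pipeline sizes its shuffles the same way (function name and defaults are mine):

```python
def estimate_shuffle_partitions(compressed_gb, uncompressed_multiplier=3,
                                target_partition_mb=150):
    """Back-of-envelope shuffle partition count for a given input size."""
    total_uncompressed_mb = compressed_gb * 1024 * uncompressed_multiplier
    return max(1, int(total_uncompressed_mb / target_partition_mb))

estimate_shuffle_partitions(50)  # 1024 partitions for 50GB compressed
```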

For range partitioning on time-series financial data, I partition by (year, month), which matches the most common query patterns (monthly reporting, period reconciliation) and produces predictably sized partitions.

Broadcast Joins for Lookup Tables

Financial transaction processing involves many joins to lookup tables: merchant category codes, interchange rate tables, product fee schedules, regulatory reporting codes. These tables are typically small (< 1M rows) but are joined against every transaction record.

Forcing broadcast joins for these lookups eliminates the shuffle entirely:

from pyspark.sql.functions import broadcast

# Merchant category lookup: ~1,000 rows — safe to broadcast
df_mcc = spark.read.table("reference.merchant_category_codes")

df_enriched = df_transactions.join(
    broadcast(df_mcc),
    on="merchant_category_code",
    how="left"
)

Spark will auto-broadcast tables below spark.sql.autoBroadcastJoinThreshold (default 10MB), but financial lookup tables are often larger after deserialisation. Setting the broadcast hint explicitly is more reliable.
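If you would rather adjust the automatic behaviour than annotate every join, the threshold itself is configurable. The 64MB figure below is an example, not a recommendation:

```python
# Raise the auto-broadcast cutoff (compared against Spark's estimate of the
# table's in-memory size, not its on-disk size)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "64MB")

# Or disable auto-broadcast entirely and rely on explicit broadcast() hints
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```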

Caching Strategy

Caching is often overused. Caching a DataFrame costs memory and time (the cache isn’t free — it requires a full scan and serialisation). Only cache DataFrames that are:

  1. Expensive to compute (multi-stage transformations)
  2. Used more than once in the same job

For financial pipelines, I typically cache the enriched transaction DataFrame (after all joins and lookups are applied) if it’s used in multiple downstream aggregations:

df_enriched.cache()
df_enriched.count()  # Trigger cache materialisation

# Now multiple aggregations use the cached data without re-reading/re-joining
df_monthly_volume = df_enriched.groupBy("year", "month", "portfolio").agg(...)
df_merchant_summary = df_enriched.groupBy("merchant_category").agg(...)
df_risk_flags = df_enriched.filter(col("risk_score") > threshold).agg(...)

df_enriched.unpersist()  # Release cache when done
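One caveat: cache() is shorthand for a default storage level. When the enriched DataFrame is large relative to executor memory, an explicit persist() makes the memory-versus-recompute trade-off visible (sketch, assuming the same df_enriched as above):

```python
from pyspark import StorageLevel

# Explicit storage level: keep what fits in memory, spill the rest to local
# disk rather than recomputing evicted partitions from source
df_enriched.persist(StorageLevel.MEMORY_AND_DISK)
```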

Writing Output Efficiently

The output write is often the most neglected part of a pipeline. For financial reporting tables that are read by many downstream consumers, write performance and read performance trade off against each other.

For a Delta Lake table that’s queried primarily by date range and portfolio:

# Write with partitioning that matches read patterns
df_result.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("report_year", "report_month", "portfolio") \
    .option("replaceWhere", "report_year = 2025 AND report_month = 8") \
    .saveAsTable("finance.transaction_summary")

# Optimise after write
spark.sql("""
    OPTIMIZE finance.transaction_summary
    WHERE report_year = 2025 AND report_month = 8
    ZORDER BY (merchant_category, account_segment)
""")

The replaceWhere option replaces only the specific partition being updated, rather than the full table — essential for incremental pipeline runs on large historical tables.

The Profiling Workflow

My standard approach for optimising a slow pipeline:

  1. Run the job and record baseline wall time and Spark UI metrics
  2. Identify the slowest stage
  3. Check for skew (task time distribution) vs. volume (all tasks slow equally)
  4. Apply the appropriate fix (salting for skew, repartition for volume)
  5. Re-run and measure
  6. Repeat for the next bottleneck

The key discipline: change one thing at a time and measure the effect. Changing five things simultaneously makes it impossible to know what actually helped.
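The one-change-at-a-time discipline is easier to keep when each run's wall time lands in the same place. A minimal stopwatch wrapper (helper name is mine; the work inside the block stands in for a Spark action such as df.count()):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall time of one run under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

runs = {}
with timed("baseline", runs):
    sum(range(1_000_000))  # stand-in for a Spark action, e.g. df.count()
```

Comparing `runs` across iterations gives a simple, honest record of which single change moved the number.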

Financial transaction pipelines reward this systematic approach. The optimisations that look small — correct partition count, broadcast hints for lookups, skew salting for one bad join key — compound into pipeline runs that are 3–5x faster than naive implementations.