
Fine-tune Dumps 2.0 backfill and event ingestion.

Xcollazo requested to merge fine-tune-backfill into main

A couple of spark.sql.shuffle.partitions changes to make Dumps 2.0 backfill and event ingestion more efficient.

  • When doing the backfill MERGE, spark.sql.shuffle.partitions=5120 creates far fewer files while still generating enough tasks to keep Spark busy.
  • When doing the event MERGE, spark.sql.shuffle.partitions=64 also creates far fewer files per hour, with little effect on performance.
  • We also introduce a helper function util.dict_add_or_append_string_value() to update appendable configurations.
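
To illustrate the idea behind the helper, here is a minimal sketch of what a function like util.dict_add_or_append_string_value() could look like. The signature, the separator default, the copy-instead-of-mutate behaviour, and the example option values are all assumptions made for illustration; the actual implementation in this merge request may differ.

```python
from typing import Dict


def dict_add_or_append_string_value(
    d: Dict[str, str], key: str, value: str, separator: str = " "
) -> Dict[str, str]:
    """Hypothetical sketch: return a copy of `d` where `value` is set at `key`,
    or appended to the existing string at `key` if one is already present."""
    result = dict(d)  # shallow copy so the caller's dict is left untouched
    existing = result.get(key)
    if existing:
        result[key] = f"{existing}{separator}{value}"
    else:
        result[key] = value
    return result


# Placeholder base configuration; the option value is illustrative only.
base_conf = {"spark.driver.extraJavaOptions": "-Dexample.flag=true"}

# Appendable setting: extend the existing string rather than overwrite it.
backfill_conf = dict_add_or_append_string_value(
    base_conf, "spark.driver.extraJavaOptions", "-XX:+UseG1GC"
)

# Plain setting: the shuffle partition count used for the backfill MERGE.
backfill_conf["spark.sql.shuffle.partitions"] = "5120"

print(backfill_conf)
```

Appending matters for settings such as spark.driver.extraJavaOptions, where a plain dict update would silently drop options that were already configured.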

Bug: T340863

Edited by Xcollazo
