Fine tune Dumps 2.0 backfill and event ingestion. (!558) · Merge requests · repos / data-engineering / Airflow DAGs · GitLab

Xcollazo requested to merge fine-tune-backfill into main Dec 14, 2023

A couple of spark.sql.shuffle.partitions changes to make Dumps 2.0 backfill and event ingestion more efficient.

When doing the backfill MERGE, spark.sql.shuffle.partitions=5120 creates far fewer files while still generating enough tasks to keep Spark busy.
When doing the event MERGE, spark.sql.shuffle.partitions=64 also creates far fewer files per hour, and it doesn't affect performance much.
We also introduce a helper function util.dict_add_or_append_string_value() to update appendable configurations.
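
To illustrate the idea, here is a minimal sketch of how such a helper might behave. This is not the code from this MR: the signature, the `separator` parameter, the return value, and the example `spark.driver.extraJavaOptions` values are all assumptions; only the function name and the two spark.sql.shuffle.partitions values come from the description above.

```python
from typing import Dict


def dict_add_or_append_string_value(
    conf: Dict[str, str], key: str, value: str, separator: str = " "
) -> Dict[str, str]:
    """Set conf[key] to value, or append value to an existing string.

    Hypothetical signature: the real helper in util may differ.
    """
    existing = conf.get(key)
    if existing:
        # Key already holds a string: append rather than overwrite.
        conf[key] = f"{existing}{separator}{value}"
    else:
        # Key is missing (or empty): just set it.
        conf[key] = value
    return conf


# Hypothetical per-job Spark settings using the values from this MR.
backfill_conf = {
    "spark.sql.shuffle.partitions": "5120",
    # Assumed pre-existing appendable option, for illustration only.
    "spark.driver.extraJavaOptions": "-Dexample.flag=1",
}
event_conf = {"spark.sql.shuffle.partitions": "64"}

# Appending keeps the existing option; adding creates the key.
for conf in (backfill_conf, event_conf):
    dict_add_or_append_string_value(
        conf, "spark.driver.extraJavaOptions", "-Duser.timezone=UTC"
    )
```

The point of such a helper is that options like extraJavaOptions hold several space-separated flags in one string, so a plain dict assignment would silently drop whatever was already set.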

Bug: T340863

Edited Dec 15, 2023 by Xcollazo