Add support for Iceberg's rewrite_data_files().
In this MR, we expand the maintenance mechanism introduced on !806 (merged) to also support rewrite_data_files().
Example:
iceberg_wmf_dumps_wikitext_raw:
datastore: iceberg
table_name: wmf_dumps.wikitext_raw_rc2
maintenance:
schedule: "@daily"
rewrite_data_files:
enabled: True
strategy: "sort"
sort_order: "wiki_db ASC NULLS FIRST, revision_timestamp ASC NULLS FIRST" # Defined due to Iceberg 1.2.1 bug
options:
"max-concurrent-file-group-rewrites": "40"
"partial-progress.enabled": "true"
spark_kwargs:
driver_memory: "32g"
driver_cores: "4"
executor_memory: "20g"
executor_cores: "2"
pool: mutex_for_wmf_dumps_wikitext_raw
priority_weight: 10
spark_conf:
"spark.dynamicAllocation.maxExecutors": "64"
Notice how the config above now includes a rewrite_data_files section. Under that, Iceberg's strategy, options and more procedure parameters are verbatim from https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files. spark_kwargs and spark_conf are provided to be able to define the spark resources, as these will vary on a per table basis. If not defined, the cluster defaults apply.
Via spark_kwargs you can send in any kwargs that are relevant to a SparkSqlOperator. In the example above we send in kwargs pool and priority_weight to leverage Airflow's pools for this particular table.