
coalesce outputs with default workable values

Marco Fossati requested to merge T350009 into main

Add an optional --coalesce argument to the relevant CLIs, with default values chosen as a trade-off between fewer output files and longer execution time.
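As a rough illustration of the idea above, here is a minimal sketch of an optional --coalesce argument wired into a CLI. It is not the code from this merge request: the names `DEFAULT_COALESCE`, `build_parser` and `reduce_partitions` are hypothetical, the default of 8 is only borrowed from the value mentioned in this description, and `df` is assumed to be a `pyspark.sql.DataFrame`.

```python
import argparse

# Hypothetical per-script default; the actual defaults differ by script.
DEFAULT_COALESCE = 8


def build_parser():
    """Build a CLI parser with an optional --coalesce argument (illustrative only)."""
    parser = argparse.ArgumentParser(description="Example job with coalesced output")
    parser.add_argument(
        "--coalesce",
        type=int,
        default=DEFAULT_COALESCE,
        metavar="N",
        help="Number of partitions to coalesce the output into "
             "(fewer files, but longer execution time). Default: %(default)s",
    )
    return parser


def reduce_partitions(df, num_partitions):
    """Coalesce a data frame before writing it.

    Assumption: df is a pyspark.sql.DataFrame, whose coalesce(numPartitions)
    method merges partitions without a full shuffle.
    """
    if num_partitions < 1:
        raise ValueError("--coalesce must be a positive integer")
    return df.coalesce(num_partitions)


# Typical use inside a script's main function (df and the output path
# come from the job itself):
#   args = build_parser().parse_args()
#   reduce_partitions(df, args.coalesce).write.parquet(output_path)
```

Keeping the value as an argument rather than a constant lets each script choose a default that its executors can handle, while still allowing an override per run.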


  • was already using the default coalesce value of 8
  • a drastic coalesce to the default value leads to crashes of Spark executors in, due to too few nodes handling the whole computation


| script | coalesce | files before | files after |
|--------|----------|--------------|-------------|
|        | 8        | 2049         | 9           |
|        | 100      | 1025         | 101         |
|        | 4 ^      | 807k         | 1k          |

^ We used repartition, see

Bug: T350009

Edited by Marco Fossati
