Port mjolnir dag from airflow 1
This is the most complex workflow in the search platform airflow instance, ported over to airflow 2 + spark 3. The implementation has changed slightly here from airflow 1 to make things a bit more generic. This was the first dag we implemented in airflow 1 and we have learned a few things since then.
-
Removed MjolnirOperator and replaced with a more generic AutoSizeSparkSubmitOperator. This operator looks at the resource requirements of a task and scales the # of executors to meet limits on the aggregate memory/core count usage. For most tasks this is merely a simplification of configuration. It has a second, optional, functionality to read in a metadata file about the size of a feature matrix that will be read and scales the memory usage to match. This is critical to be able to use the same task definition to train large and small wikis with the same definition without scaling everything to match the largest wiki.
-
Uses jar artifacts for all jvm deps, rather than having spark use ivy to collect dependencies from archiva. With the artifact handling available in this repo it seems reasonable to consistently use that rather than splitting between native artifacts and ivy.
-
Ports over the HiveTablePath helper, used to translate table names into paths on hdfs when tasks are rendered. This helps when doing test runs to only need to change table names and let the paths be auto-detected from hive metadata. This is also required as mjolnir was built to write directly to hdfs and then add partitions to hive separately.
Bug: T329239