
Don't propagate the container SPARK_HOME to the hadoop workers

Brouberol requested to merge T364389 into main

While attempting to migrate the airflow-analytics-test scheduler to Kubernetes, I encountered errors when executing Skein jobs.

The Skein spec contains a spark-submit command with --conf spark.executorEnv.SPARK_HOME=/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/pyspark, which fails on the Hadoop worker side, as this path does not exist there. It only exists in our Airflow container, since Blubber won't let us install anything under /usr/lib.

By detecting that the job is running in Kubernetes, we sidestep the wrongful propagation of this environment variable into the Skein spec.
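A minimal sketch of the approach, assuming a hypothetical helper that filters the executor environment before it is written into the Skein spec. The function names are illustrative, not the actual patch; the Kubernetes check relies on the KUBERNETES_SERVICE_HOST variable that Kubernetes injects into every pod:

```python
import os


def running_in_kubernetes() -> bool:
    """Heuristic: Kubernetes injects KUBERNETES_SERVICE_HOST into every pod."""
    return "KUBERNETES_SERVICE_HOST" in os.environ


def build_executor_env(base_env: dict[str, str]) -> dict[str, str]:
    """Build the environment propagated to Spark executors via the Skein spec.

    When the scheduler runs in Kubernetes, drop SPARK_HOME: its value points
    at a path inside the Airflow container that does not exist on the Hadoop
    workers.
    """
    env = dict(base_env)
    if running_in_kubernetes():
        env.pop("SPARK_HOME", None)
    return env
```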

Bug: T364389
Signed-off-by: Balthazar Rouberol <brouberol@wikimedia.org>
