Ensure SPARK_HOME=/usr/lib/spark3

Our previous SPARK_HOME value was /tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/pyspark. I have a hunch that it was causing failures when submitting Skein jobs that launch Spark jobs, because a utility function in airflow-dags propagates that environment variable to the Hadoop worker executing the Skein job.

See https://airflow-analytics-test.wikimedia.org/dags/anomaly_detection_useragent_distribution_daily_test/grid?dag_run_id=manual__2024-11-20T09%3A41%3A31.241921%2B00%3A00&task_id=source_metrics&tab=logs

Looking at the YARN/skein logs, I was seeing:

LogContents:
.skein.sh: line 1: spark-submit: command not found

However, spark-submit does exist under /usr/lib/spark3 on an-test-client1002:

brouberol@an-test-client1002:~$ SPARK_HOME=/usr/lib/spark3
brouberol@an-test-client1002:~$ ls $SPARK_HOME/bin/spark-submit
/usr/lib/spark3/bin/spark-submit

I think that having all parties agree on a common value for SPARK_HOME might help.
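
As a rough illustration of the check done by hand above, here is a minimal Python sketch that reports whether a given environment's SPARK_HOME actually contains an executable spark-submit. The function name `spark_submit_path` is hypothetical and is not part of airflow-dags or Skein; it only mirrors the `ls $SPARK_HOME/bin/spark-submit` test.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def spark_submit_path(env: Mapping[str, str]) -> Optional[Path]:
    """Return $SPARK_HOME/bin/spark-submit if it is an executable file, else None.

    Hypothetical helper for illustration only.
    """
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        # SPARK_HOME is unset or empty: nothing to resolve.
        return None
    candidate = Path(spark_home) / "bin" / "spark-submit"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Calling `spark_submit_path(os.environ)` on a host returns None when the propagated SPARK_HOME points somewhere without a spark-submit binary, which would be consistent with the "command not found" error seen in the logs.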

Signed-off-by: Balthazar Rouberol <brouberol@wikimedia.org>
Bug: T364389
