Make SparkSqlOperator accept HQL files from GitLab raw URLs.
(Depends on repos/data-engineering/patches/wmf-sparksqlclidriver!1 (merged))
Context:
The SparkSqlOperator can take HTTPS URLs in its sql parameter. This is very convenient, since we can combine it with GitLab's raw file API to construct operators like so:
SparkSqlOperator(
    task_id="do_hql",
    # To run an HQL file, simply use a GitLab raw URI that points to it.
    # See how to build such a URI here:
    # https://docs.gitlab.com/ee/api/repository_files.html#get-raw-file-from-repository
    # We strongly recommend using an immutable URI (i.e. one that includes
    # a SHA or a tag) for reproducibility.
    sql=var_props.get(
        'hql_gitlab_raw_path',
        'https://gitlab.wikimedia.org/api/v4/projects/1261/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9'
    ),
    query_parameters={
        # The quadruple braces in the f-string render as {{ts_nodash}},
        # which Airflow then resolves as a Jinja template.
        'destination_directory': f'/tmp/xcollazo_test_generic_artifact_deployment_dag/{{{{ts_nodash}}}}',
        'snapshot': '2023-01-02',
    },
    launcher='skein',
)
The operator above pulls the HQL file from GitLab and executes it. This lets users who want to run SQL keep it in their own repositories, without creating or declaring any artifact and without depending on Data Eng's refinery release cadence.
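For reference, here is a minimal sketch of how such a raw URI could be assembled. The helper name gitlab_raw_url and its parameters are hypothetical and not part of this change; it only assumes the documented GitLab endpoint /api/v4/projects/:id/repository/files/:file_path/raw, where the file path must be URL-encoded (including slashes).

```python
from urllib.parse import quote


def gitlab_raw_url(project_id, file_path, ref, host="https://gitlab.wikimedia.org"):
    """Build a GitLab raw-file API URL (hypothetical helper, for illustration only).

    file_path is the path inside the repository; ref should be a commit SHA
    or a tag so that the resulting URL is immutable.
    """
    # safe="" forces "/" to be encoded as %2F, as the GitLab API requires.
    encoded_path = quote(file_path, safe="")
    encoded_ref = quote(ref, safe="")
    return (
        f"{host}/api/v4/projects/{project_id}"
        f"/repository/files/{encoded_path}/raw?ref={encoded_ref}"
    )


# Rebuilds the default URL used in the operator example above.
url = gitlab_raw_url(1261, "test/test_hql.hql", "0e4d2a9")
```

The result can be passed directly as the sql argument, or stored in the variable that var_props reads.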
Bug: T333001