
Make SparkSqlOperator accept HQL files from GitLab raw URLs.

Xcollazo requested to merge T333001-mechanism-for-deploying-artifacts into main

(Depends on repos/data-engineering/patches/wmf-sparksqlclidriver!1 (merged))

Context: The SparkSqlOperator can take HTTPS URLs in its sql parameter. This is very convenient, because we can combine it with GitLab's raw rendering API to construct operators like so:

SparkSqlOperator(
        task_id="do_hql",
        # To run an HQL file, simply use a GitLab raw URI that points to it.
        # See how to build such a URI here:
        # https://docs.gitlab.com/ee/api/repository_files.html#get-raw-file-from-repository
        # We strongly recommend you use an immutable URI (i.e. one that includes an SHA or a tag) for reproducibility
        sql=var_props.get(
            'hql_gitlab_raw_path',
            'https://gitlab.wikimedia.org/api/v4/projects/1261/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9'
        ),
        query_parameters={
            'destination_directory': f'/tmp/xcollazo_test_generic_artifact_deployment_dag/{{{{ts_nodash}}}}',
            'snapshot': '2023-01-02',
        },
        launcher='skein',
    )

The operator above pulls the HQL file from GitLab and executes it. This lets users who want to run SQL keep it in their own repositories, with no artifact creation or declaration, and without depending on Data Eng's refinery release cadence.
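
For reference, here is a minimal sketch of how such a raw URI could be assembled. The helper name gitlab_raw_url and the GITLAB_API constant are hypothetical and not part of this merge request; the only rule relied on is the one from the GitLab repository files API linked above, namely that the file path must be URL-encoded (so "/" becomes "%2F") and the commit, tag, or branch goes in the ref query parameter.

```python
from urllib.parse import quote, urlencode

# Assumption: base URL of the GitLab API, taken from the example URI above.
GITLAB_API = "https://gitlab.wikimedia.org/api/v4"


def gitlab_raw_url(project_id, file_path, ref, api_base=GITLAB_API):
    """Build a GitLab 'get raw file from repository' URL (hypothetical helper)."""
    # safe="" forces "/" to be percent-encoded, as the API requires for file paths.
    encoded_path = quote(file_path, safe="")
    # Prefer an immutable ref (a commit SHA or a tag) for reproducibility.
    query = urlencode({"ref": ref})
    return (
        f"{api_base}/projects/{project_id}"
        f"/repository/files/{encoded_path}/raw?{query}"
    )


if __name__ == "__main__":
    print(gitlab_raw_url(1261, "test/test_hql.hql", "0e4d2a9"))
```

Called with the project ID, path, and SHA from the example operator, this should reproduce the URI passed as the default value of sql there.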

Bug: T333001

Edited by Xcollazo
