# WMF Data Workflow Utils merge requests

Feed: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests

## Coverage report publishing
!18 · 2022-04-19 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/18

Coverage report publishing.

## Resolve "[FIXME] large conda envs cannot be shared in Gitlab"
!14 · 2022-04-13 · Gmodena
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/14

Closes #2.
_Tagging as draft to start a discussion and iron out the wrinkles._
This MR combines the build and publish jobs.
When archives exceed a certain size, they cannot be cached as job artifacts, making
publication impossible.
Based on what we discussed with @otto, I'd like to propose a workflow that allows:
- Building and publishing automatically on tagged commits.
- Manual build and publication on regular commits.
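The two trigger modes above could be sketched with GitLab CI `rules` roughly as follows; the job name and script line are illustrative assumptions, not the actual contents of `conda-dist.yml`:

```yaml
# Hypothetical sketch of the proposed triggers (names are illustrative).
build_and_publish_conda_env:
  script:
    - conda-dist  # build, pack and publish the env (illustrative)
  rules:
    # Run automatically when a tag is pushed.
    - if: $CI_COMMIT_TAG
    # Allow manual runs on regular commits.
    - if: $CI_COMMIT_BRANCH
      when: manual
```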
To support local development and testing, a dev should run conda-dist manually. See for example:
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/blob/5-add-ci-integration-package-conda-env-wip/Makefile#L50

## Use PACKAGE_VERSION from shell env
!15 · 2022-04-13 · Gmodena
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/15

Closes #1.
This MR allows downstream pipelines to define their own `PACKAGE_NAME`, `PACKAGE_VERSION` and `PACKAGE_VERSION_SCRIPT` variables.
There are two use cases for this:
- Projects that want to manage their own version scheme and update process.
- Projects that are not `setuptools` based (not sure if this is a good thing).

## Standardize CI
!16 · 2022-04-12 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/16

- tox instead of nox
- move lint and pytest configs into setup.cfg

## Error for mypy in CI
!13 · 2022-04-05 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/13

Proactively use mypy and resolve mypy errors.

## Template conda jobs
!12 · 2022-04-04 · Gmodena
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/12

This merge request adds the capability to extend `build_conda_env` and `publish_conda_env`
in CI pipelines that include them.
### Use case
As a developer I would like to install additional system dependencies in the docker image
used to build a conda environment. I would like to be able to extend `build_conda_env` and
adjust it as needed.
### Proposed solution
System deps are typically installed in a `before_script` step. This MR introduces the following changes.
#### conda.yml
A hidden `.before_script` has been added to `conda.yml`. This allows pipelines that include `conda.yml` and `conda-dist.yml` to reference and extend it.
A global `before_script` is still available and exposed by default. This preserves, for instance, backward compatibility in this repo's `.gitlab-ci.yml`.
#### conda-dist.yml
Added a `.build_conda_env` hidden job that implements the building and packing logic.
A public `build_conda_env` extends it to ensure consistent behaviour.
The use of both `.build_conda_env` and `build_conda_env` was necessary to avoid circular deps in downstream pipelines.
A similar approach has been implemented for `.publish_conda_env` and `publish_conda_env`. Technically
we don't need it there, but it keeps the convention consistent.
### Example
The CI pipeline at https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/blob/5-add-ci-integration-package-conda-env-extend-template/.gitlab-ci.yml#L66 overrides the included
`build_conda_env` with an ad-hoc one that installs required system deps:
```yaml
# Defining build_conda_env in this scope takes precedence over the global one,
# and allows overriding the included job.
build_conda_env:
  extends: .build_conda_env
  # Override the before_script defined in .build_conda_env.
  before_script:
    # This reference adds the global before_script declared in conda.yml.
    - !reference [.setup_conda, before_script]
    - apt update  # we probably don't need this, since we apt update in .setup_conda
    - apt install -y libkrb5-dev libsasl2-dev gcc g++
```

## Gitlab CI templates for using, building, and publishing conda envs
!11 · 2022-03-30 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/11

Add gitlab-ci-templates for automating publishing conda envs to gitlab.
- By including conda.yml, you get a base miniconda env in /opt/conda.
- By including conda-dist.yml, you get two manual jobs added to your pipeline:
  - `build_conda_env`: builds the conda dist env for your project and exposes it as a Gitlab CI Artifact (you can download this from the Gitlab Pipeline UI).
  - `publish_conda_env`: publishes the built conda dist env to a Gitlab Generic Package Registry.
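A downstream project might include these templates roughly like this; the `ref` and template file paths are assumptions, check this repo for the real locations:

```yaml
# Hypothetical minimal .gitlab-ci.yml for a downstream project.
include:
  - project: repos/data-engineering/workflow_utils
    ref: main                     # assumed branch/tag
    file:
      - gitlab_ci_templates/conda.yml       # assumed path
      - gitlab_ci_templates/conda-dist.yml  # assumed path
```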
Example usage over in example_job_project:
https://gitlab.wikimedia.org/repos/data-engineering/example_job_project/-/blob/master/.gitlab-ci.yml

## Split linting and testing, use nox instead of tox
!9 · 2022-03-23 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/9

No longer using pylint, preferring just flake8 instead.

## Don't capture subprocess.run output
!10 · 2022-03-22 · Gmodena
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/10

There are cases when a subprocess is long-running,
or gets stuck waiting for output. In those cases,
`subprocess.run` won't return. However, output is only read
from a completed process, making debugging
hard.
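For illustration, an uncaptured `subprocess.run` call looks like this (the child command is illustrative):

```python
import subprocess
import sys

# With capture_output=True (or stdout=PIPE), output is buffered and only
# readable once the child exits; a stuck child gives no feedback at all.
# Without capturing, the child inherits our stdout/stderr, so its output
# streams live to the caller's terminal while it runs.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child')"],
    check=True,  # raise CalledProcessError on a non-zero exit status
)

# Nothing was captured; result.stdout and result.stderr are None.
print(result.returncode)  # 0
```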
Since the output of `subprocess.run` is only used for display,
and not piped to another subprocess object, we now just
print it to the caller's stdout.

## Fix bug in fsspec exists call
!8 · 2022-03-10 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/8

## Add a Gitlab CI pipeline
!6 · 2022-03-10 · Gmodena
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/6

This MR adds a Gitlab CI pipeline to run tests, mypy
and linting on python 3.7 and 3.9.
Build jobs will be triggered each time a change set is pushed,
or when opening a merge request.
During the `test` stage, two CI jobs will be executed in parallel for
python 3.7 and 3.9 (the envs declared in `tox.ini`).
A manual step in the `publish` stage will build sdist/bdist packages and
push them to Gitlab's PyPI Registry (e.g. https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/packages/82).
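The resulting pipeline shape might be sketched roughly like this; job names, image tags, and the publish commands are illustrative assumptions rather than the repo's actual `.gitlab-ci.yml`:

```yaml
# Hypothetical sketch of the stages described above.
stages: [test, publish]

test-py37:
  stage: test
  image: python:3.7-buster   # assumed image
  script: [tox -e py37]

test-py39:
  stage: test
  image: python:3.9-buster   # assumed image
  script: [tox -e py39]

publish:
  stage: publish
  when: manual               # triggered by hand from the pipeline UI
  variables:
    TWINE_USERNAME: gitlab-ci-token
    TWINE_PASSWORD: ${CI_JOB_TOKEN}
  script:
    - python -m build        # sdist + wheel
    - twine upload --repository-url "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/pypi" dist/*
```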
### Implementation details
I opted to use dedicated Python 3.7/3.9 images rather than bootstrapping with `pyenv` (or similar),
to avoid having to manage additional setup logic. To the best of my knowledge
`pyenv` is not packaged in Debian Buster (the base image I used).

## Automate usage of fsspec hdfs URLs via new pyarrow HDFS API
!5 · 2022-03-03 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/5

- `fsspec_use_new_pyarrow_api`: call this to make fsspec always use
  the new pyarrow API with all `hdfs://` URLs.
  This is only needed until
  https://github.com/fsspec/filesystem_spec/issues/874 is resolved.
- `set_hadoop_env_vars`: sets needed env vars to work with the new pyarrow HDFS API.
  This is also called by `fsspec_use_new_pyarrow_api()` by default.

https://phabricator.wikimedia.org/T300876

## Draft: Add support for building conda-dist envs
!4 · 2022-02-07 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/4

See changes to README for description.
TODO:
- Figure out if we should build and host docker images somewhere, rather than instructing users to download Dockerfile.conda-dist and build the image themselves.
- Tests for call.py
  - Do we even want call.py? If we always use Skein to launch from conda-packed envs, we don't need call.py.

## Remove airflow utilities from this library
!3 · 2021-12-20 · Mforns
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/3

The airflow utilities were subject to frequent changes,
so we removed them, given that this library is not meant for
very frequent changes.

## Add utils for anomaly detection DAGs
!2 · 2021-12-16 · Mforns
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/2

This includes the anomaly detection DAG factory
and some commonly used custom operators:
- SparkSubmitOperator
- SparkSQLOperator
- HdfsEmailOperator

## Switch from poetry to setuptools via pyproject.toml and setup.cfg
!1 · 2021-12-14 · Ottomata
https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/1

This is needed because poetry does not support installing
arbitrary script files, which we will need for
https://phabricator.wikimedia.org/T296543