- 17 Feb, 2022 2 commits
Gmodena authored
- 14 Feb, 2022 6 commits
Gmodena authored
Refactor image-matching project structure. See merge request gmodena/platform-airflow-dags!19
Gmodena authored
`ima_v2` should match the convention introduced by the cookiecutter template. The changes to `ima.py` allow execution in cluster mode. After some experiments with replacing Hive with the `spark-sql` shell, I rolled back to using Hive for IMAv1: some semantic differences make the final export fail. We should consider dropping it anyway, since it breaks convention (data should be stored on HDFS/Swift instead). IMAv2 does use `spark-sql` under the hood.
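For context, a minimal PySpark sketch of what running SQL through Spark's engine (rather than shelling out to the Hive CLI) looks like; the session options, database, table, and query below are illustrative assumptions, not taken from `ima.py`:

```python
# Illustrative sketch only: execute SQL via Spark instead of the Hive CLI.
# Database, table, and query are made-up examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ima-example")
    .enableHiveSupport()  # lets Spark SQL read Hive-managed tables
    .getOrCreate()
)

# The same statement a Hive CLI invocation would run, executed by Spark;
# subtle semantic differences between the two engines can surface here.
df = spark.sql("SELECT wiki, image_id FROM example_db.suggestions LIMIT 10")
df.show()
```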
Gmodena authored
Add Gitlab CI pipeline config. See merge request gmodena/platform-airflow-dags!28
Gmodena authored
Gmodena authored
Move pypi extra index to requirements. See merge request gmodena/platform-airflow-dags!29
Gmodena authored
Rename repo archive. See merge request gmodena/platform-airflow-dags!30
- 10 Feb, 2022 3 commits
Gmodena authored
Gmodena authored
Make task_id configurable. See merge request gmodena/platform-airflow-dags!31
Gmodena authored
- 07 Feb, 2022 3 commits
Gmodena authored
Gmodena authored
Make task id unique and input/output optional. See merge request gmodena/platform-airflow-dags!26
Gmodena authored
GitLab PyPI indexes should be configured in a project's requirements file.
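A minimal sketch of what that can look like; the host, project id, and package name below are placeholders, and the index path follows GitLab's PyPI package-registry URL convention:

```
# requirements.txt (hypothetical entries)
--extra-index-url https://<gitlab-host>/api/v4/projects/<project_id>/packages/pypi/simple
some-internal-package==0.1.0
```

Keeping the extra index in the requirements file, rather than in CI variables, makes each project's dependency sources self-describing.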
- 23 Dec, 2021 9 commits
Gmodena authored
Add cookiecutter_replay to gitignore. See merge request gmodena/platform-airflow-dags!27
Gmodena authored
Luke Bowmaker authored
Onboard sample-project data pipeline. See merge request gmodena/platform-airflow-dags!25
Gmodena authored
Unique task ids are necessary to support dynamic dag creation. Optional input/output paths are not ideal, but useful to help backport spark scripts with different parametrisation requirements.
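As a hedged illustration of why unique ids matter when dags are generated dynamically: Airflow rejects duplicate `task_id`s within a dag, so a factory building tasks in a loop has to derive a distinct id per task. Everything below (dag id, dataset names, callable) is hypothetical, not this repo's code:

```python
# Hypothetical sketch: tasks created in a loop need distinct task_ids,
# otherwise Airflow raises DuplicateTaskIdFound at parse time.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process(name):
    print(f"processing {name}")


with DAG(dag_id="dynamic_example",
         start_date=datetime(2021, 12, 1),
         schedule_interval=None) as dag:
    for name in ["dataset_a", "dataset_b"]:
        PythonOperator(
            task_id=f"process_{name}",  # suffix keeps each id unique
            python_callable=process,
            op_kwargs={"name": name},
        )
```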
Gmodena authored
Update unit test: expected output updated after the default path changes. See merge request gmodena/platform-airflow-dags!24
Gmodena authored
Gmodena authored
T295360 more path normalisation: align conventions with documentation. See merge request gmodena/platform-airflow-dags!23
Gmodena authored
- 22 Dec, 2021 6 commits
Gmodena authored
Set dag owner. The property is required for a dag to be picked up by the scheduler and displayed in the DAG UI. See merge request gmodena/platform-airflow-dags!22
Gmodena authored
The property is required for a dag to be picked up by the scheduler and displayed in the DAG UI.
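A minimal sketch of where the property lives; the owner value and dag id are made-up examples for illustration, not this repo's configuration:

```python
# Hypothetical example: per the commit above, a dag needs an owner to be
# picked up by the scheduler and shown in the DAG UI.
from datetime import datetime

from airflow import DAG

default_args = {"owner": "gmodena"}  # assumed value for illustration

with DAG(dag_id="owned_example",
         default_args=default_args,
         start_date=datetime(2021, 12, 1),
         schedule_interval="@daily") as dag:
    pass
```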
Gmodena authored
Normalise path layout in factories and cookiecutter.

This MR fixes some inconsistencies between the dags boilerplate and the cookiecutter template:

* the expected venv location has moved one level up in the deployed project home
* the project dir (cookiecutter config) is not part of the pipelines home; we let the boilerplate chain dirs together.

See merge request gmodena/platform-airflow-dags!21
Gmodena authored
Gmodena authored
T295360 datapipeline scaffolding

This merge request adds a cookiecutter template to scaffold new data pipelines as described in https://phabricator.wikimedia.org/T295360. The template provides:

* integration with our tox config (mypy/flake8/pytest)
* a PySpark job template
* a pytest template for pyspark code
* an Airflow dag template to help users get started.

# Structure changes

The project directory largely follows `image-matching`'s structure. Notable changes are:

* Python code has been moved under `pyspark`
* Python code is pip installable. This allows packaging deps at build time and eases spark deployment (e.g. we don't need to pass each module with `--files schema.py`; imports will be resolved from the `venv`).

# How to test

Check out the `T295360-datapipeline-scaffolding` branch. A new datapipeline can be created with:

```
make datapipeline
```

This will generate a new directory for pipeline code under:

```
your_data_pipeline
```

and install an Airflow dag template under:

```
dags/your_data_pipeline_dag.py
```

From the top-level directory, you can now run `make test-dags`. The command will check that `dags/your_data_pipeline_dag.py` is a valid airflow dag. The output should look like this:

```
make test-dags
---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
dags/factory/sequence.py                   70      3    96%
dags/ima.py                                49      5    90%
dags/similarusers-train-and-ingest.py      20      0   100%
dags/your_data_pipeline_dag.py             19      0   100%
-----------------------------------------------------------
TOTAL                                     158      8    95%
=========================== 8 passed, 8 warnings in 12.75s ===========================
_______________________________________ summary _______________________________________
```

See merge request !16
Gmodena authored
- 17 Dec, 2021 1 commit
- 16 Dec, 2021 1 commit
Gmodena authored
Conda-vendored OpenJDK shows flaky behaviour with the rest of the build pipeline. This change installs AdoptOpenJDK directly on the host system.
- 15 Dec, 2021 2 commits
Gmodena authored
- 13 Dec, 2021 1 commit
- 09 Dec, 2021 1 commit
Clarakosi authored
- 08 Dec, 2021 2 commits
- 24 Nov, 2021 3 commits