[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)

# **This project has been superseded by [airflow-dags](https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags)**
Please do new development at: <https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags>.
<br>
<br>
<br>
<br>
<br>
# datapipelines

This repo contains data pipelines operationalised by the Generated Data Platform team.
You can reach out to us at:
* Wiki: [Generated Data Platform](https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream).
* Phabricator: [Generated Data Platform](https://phabricator.wikimedia.org/project/view/5517/).
* Slack: [#data-platform-value-stream](https://wikimedia.slack.com/archives/C02BB8L2S5R).

# Requirements

Tools provided by this repository require [Docker](https://www.docker.com/). 

# Data pipelines
> […] a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. […]
>
> <https://en.wikipedia.org/wiki/Pipeline_(computing)>

A Generated Data Platform pipeline is made up of two components:

1. Project-specific tasks and data transformations that operate on inputs (sources) and produce outputs (sinks). We depend on Apache Spark for elastic compute.

2. An [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html): a thin orchestration layer that composes and executes tasks.

Data pipelines are executed on Hadoop. Elastic compute is provided by Spark (jobs are deployed in cluster mode). Scheduling and orchestration are delegated to Apache Airflow. Currently we support Python-based projects; Scala support is planned.
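
For illustration, a thin orchestration DAG might look roughly like the sketch below. The DAG id, file path, and the use of `SparkSubmitOperator` are assumptions made for the example, not the repository's actual template:

```python
# Hypothetical sketch of a thin orchestration DAG: it only wires tasks together
# and submits the project's Spark job; all business logic lives in the project
# package, not in the DAG file.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "generated-data-platform",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="your_data_pipeline",  # hypothetical project name
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Submit the project's transformation to the Hadoop cluster.
    generate_dataset = SparkSubmitOperator(
        task_id="generate_dataset",
        application="your_data_pipeline/spark/transform.py",  # hypothetical path
        conf={"spark.submit.deployMode": "cluster"},
    )
```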

## Create a new data pipeline

Clone this repo and create a dev branch with:

```
git clone git@gitlab.wikimedia.org:/repos/generated-data-platform/datapipelines.git
cd datapipelines
git checkout -b your_data_pipeline_branchname
```

A new data pipeline can be created with:
```
make datapipeline
```

This will generate a new directory for pipeline code under:
```bash
your_data_pipeline
```

And install an Airflow DAG template under:
```
dags/your_data_pipeline_dag.py
```

More details about the onboarding process and requirements can be found in our [Data Pipeline Onboarding](https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Pipeline_Onboarding/) wiki page.

## Repo layout

This repository follows a [monorepo](https://en.wikipedia.org/wiki/Monorepo) strategy. Its structure matches the layout of `AIRFLOW_HOME` on the [an-airflow1003.eqiad.wmnet](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng) airflow instance.

* `dags` contains [Airflow DAGs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html) for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAG.
* `tests/` contains the `dags` validation test suite. Project-specific tests are implemented under `<project-name>`.
* `<project-name>` directories contain tasks and data transformations. For an example, see `image-matching`.
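
For illustration, a checkout might be laid out roughly as follows (file and directory names other than `dags`, `tests`, `image-matching` and `Makefile` are hypothetical):

```
datapipelines/
├── dags/
│   ├── image_matching_dag.py      # one DAG per project, scheduling only
│   └── my_new_datapipeline_dag.py
├── tests/                         # DAG validation test suite
├── image-matching/                # project tasks and data transformations
├── my_new_datapipeline/
└── Makefile
```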

## Deployment

DAGs are currently deployed and scheduled on [an-airflow1003.eqiad.wmnet](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng). This service has no SLO and is meant for development and experimentation.

The following command will run code checks and deploy data pipelines:
```
make deploy-local-build
```
### Deploy a new pipeline

Deployment pipelines are declared in the `TARGET` variable in `Makefile`.
To deploy a new pipeline, append its project directory name to `TARGET`.
For example, if a new pipeline has been created as `my_new_datapipeline`, the new
`TARGET` list would look like the following:

```
TARGET := "image-matching my_new_datapipeline"
```

# CI & code checks

We favour test-driven development with `pytest`, lint with `flake8`, and type check with `mypy`. We encourage, but do not yet enforce, the use of `isort` and `black` for formatting code. We log errors and information messages with the Python logging library.
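
As a small, hypothetical sketch (module and function names are purely illustrative), a project task might emit messages like this:

```python
# Hypothetical sketch: a project task emitting information and error messages
# with the standard logging library.
import logging

logger = logging.getLogger(__name__)


def transform(source_path: str) -> None:
    logger.info("Reading input from %s", source_path)
    try:
        ...  # project-specific transformation goes here
    except Exception:
        logger.exception("Transformation failed for %s", source_path)
        raise
```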

## Code checks

We enforce code checks at the DAG and project level.

### DAG validation
DAG validation tests live under the top-level `tests` directory. They can be triggered with
`make test_dags`.
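
For reference, a minimal validation test might look like the sketch below. It illustrates the common `DagBag` import check and is not necessarily the repository's actual suite:

```python
# Sketch of a DAG validation test: every DAG under dags/ must import cleanly
# and define at least one task. No task is executed.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags", include_examples=False)


def test_dags_import_without_errors(dag_bag):
    assert dag_bag.import_errors == {}


def test_every_dag_has_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} defines no tasks"
```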

### Project checks

The following commands can be executed at top level (they'll be invoked for all projects):

* `make lint-all` triggers project linting.
* `make mypy-all` triggers type checking.
* `make test-all` triggers unit/integration tests.

Code checks can be triggered for a specific project:

* `make lint` triggers project linting.
* `make mypy` triggers type checking.
* `make test` triggers unit/integration tests.

All targets are configured with [tox](https://pypi.org/project/tox/).

By default, code checks are executed inside a Docker container that provides a [Conda Python](https://docs.conda.io/en/latest/) distribution. They can be run "natively" by passing `SKIP_DOCKER=true`. For example:
```
make test SKIP_DOCKER=true
```
On Wikimedia stat machines we expect Conda to be installed under `/usr/lib/anaconda-wmf`. The default Conda path
can be overridden by passing `CONDA_PATH` to `make`. For example:
```
make test SKIP_DOCKER=true CONDA_PATH=/path/to/conda
```