# Data³

Data³ is a toolkit and general framework for visualizing just about any data. Wikimedia's Engineering Productivity team has begun assembling a toolkit to help us organize, analyze, and visualize data collected from our development, deployment, testing, and project planning processes. Better tooling and data collection are needed to make reliable, accessible data available for data-driven decision-making. This is important because we need to measure the impact of changes to our deployment processes and team practices, so that we can tell whether a process change is beneficial and quantify the effects of the changes we make.

The first applications for the Data³ tools focus on exploring software development and deployment data, as well as workflow metrics exported from Wikimedia's Phabricator instance.

The core of the toolkit consists of the following:

* [Datasette](https://datasette.io) provides a front-end for browsing and querying one or more SQLite databases.
* A simple dashboard web app that uses the Datasette JSON API to query SQLite and renders the resulting data as charts (rendered with Vega-Lite) or HTML templates for custom reports or interactive displays.
* A comprehensive Python library and command-line interface for querying and processing Phabricator task data exported via Conduit API requests.
* Several custom dashboards for Datasette which provide visualization of metrics related to Phabricator tasks and workflows.
* A custom dashboard to explore data and statistics about production MediaWiki deployments.

## Demo / Development Instance

There is a development & testing instance of Datasette and the Data³ Dashboard at [https://data.releng.team/dev/](https://data.releng.team/dev/).

## Status

This tool and its supporting libraries are currently experimental. The dashboard and initial data model have reached the stage of a [minimum viable product](https://en.wikipedia.org/wiki/Minimum_viable_product). The future development direction is uncertain, but this is a solid foundation to build on.

This project has a wiki page on MediaWiki.org: [Data³/Metrics-Dashboard](https://www.mediawiki.org/wiki/Data%C2%B3/Metrics-Dashboard)

## Currently supported data sources:

* Phabricator's Conduit API.

## Future Possibilities:

* Elastic ELK
* Wikimedia SAL
* GitLab APIs

# Usage

## Installation

Installing the package provides a command-line tool called `dddcli`.

To install for development use:

```bash
pip3 install virtualenv poetry
virtualenv --python=python3 .venv
source .venv/bin/activate
poetry install
```

### dddcli

You can use the following sub-commands by running `dddcli sub-command [args]` to access various functionality.

### Phabricator metrics:  `dddcli metrics`

* This tool extracts data from Phabricator and organizes it in a structure that facilitates further analysis.
* The analysis of task activities can provide some insight into workflows.
* The output of this tool will be used as the data source for charts visualizing certain agile project planning metrics.

#### cache-columns
The first thing to do is cache the columns for the project you're interested in.
This will speed up future actions because it avoids a lot of unnecessary requests
to Phabricator that would otherwise be required to resolve the names of projects
and workboard columns.

```bash
dddcli metrics cache-columns --project=PHID-PROJ-uier7rukzszoewbhj7ja
```
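The caching idea behind `cache-columns` can be sketched in a few lines. This is a stdlib-only illustration with a hypothetical table schema and function names, not ddd's actual cache code: the point is that a lookup only costs a Conduit round-trip on a cache miss.

```python
import sqlite3

def cached_lookup(conn, phid, fetch):
    """Return a column name from the local cache, calling fetch()
    (one simulated Conduit request) only on a cache miss."""
    row = conn.execute(
        "SELECT name FROM columns WHERE phid = ?", (phid,)
    ).fetchone()
    if row:
        return row[0]
    name = fetch(phid)  # the expensive round-trip, needed only once per phid
    conn.execute("INSERT INTO columns (phid, name) VALUES (?, ?)", (phid, name))
    return name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE columns (phid TEXT PRIMARY KEY, name TEXT)")

calls = []
def fake_fetch(phid):
    calls.append(phid)  # count the simulated API requests
    return "Backlog"

first = cached_lookup(conn, "PHID-PCOL-example", fake_fetch)   # miss: hits the "API"
second = cached_lookup(conn, "PHID-PCOL-example", fake_fetch)  # hit: served from cache
```

After both lookups, `calls` contains a single entry: the second request never leaves the local database.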

Then you can fetch the actual metrics and map them into local SQLite tables with the `map` sub-command:


```bash
dddcli metrics map --project=#release-engineering-team
```

Note that `--project` accepts either a `PHID` or a project `#hashtag`

To get cli usage help, try

```bash
dddcli metrics map --help
```

To run it with a test file instead of connecting to phabricator:

```bash
dddcli metrics map --mock=test/train.transactions.json
```

This runs the mapper with data from a file, treating it as a mock API call result (to speed up testing).

If you omit the `--mock` argument, the tool requests a rather large amount of data from the Phabricator API, which takes an extra 20+ seconds to fetch.

### Datasette

The main user interface for the Data³ tool is provided by Datasette.

Datasette is installed as a dependency of this repo by running `poetry install` from the repository root.

Once dependencies are installed, you can run datasette from the ddd checkout like this:

```bash
export DATASETTE_PORT=8001
export DATASETTE_HOST=localhost # or use 0.0.0.0 to listen on a public interface
export DATASETTE_DIR=./www  # this should point to the www directory included in this repo
datasette --reload --metadata www/metadata.yaml -h $DATASETTE_HOST -p $DATASETTE_PORT $DATASETTE_DIR
```

For deployment on a server, there are sample systemd units in `etc/systemd/*`, including a file watcher that restarts Datasette when the data changes. The `--reload` argument to the datasette command shown above achieves approximately the same behavior and is adequate for local development and testing.

### Datasette Plugins

Datasette has been extended with some plugins to add custom functionality.

* See `www/plugins` for Data³ customizations.
* There is also a customized version of datasette-dashboards, included via a submodule at `src/datacube-dashboards`. Run the usual `git submodule update --init` to get that source code.
* There are custom views and routes added in `ddd_datasette.py` that map URLs like `/-/ddd/$page/` to files in `www/templates/view/`.
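The URL-to-template mapping described above works roughly like this. This is a stand-alone sketch using only the standard library; the real code registers its routes through Datasette's plugin mechanism, and the route table here is hypothetical:

```python
import re

# A regex with a named group selects which file under
# www/templates/view/ gets rendered for a given URL.
ROUTES = [
    (re.compile(r"^/-/ddd/(?P<page>[^/]+)/$"), "www/templates/view/{page}.html"),
]

def resolve(path):
    """Return the template path for a URL, or None if nothing matches."""
    for pattern, template in ROUTES:
        match = pattern.match(path)
        if match:
            return template.format(**match.groupdict())
    return None
```

For example, `resolve("/-/ddd/deployments/")` would map to `www/templates/view/deployments.html`, while unmatched paths fall through to Datasette's normal routing.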

# Dashboards

The Data³ Dashboards web application is documented in [docs/DefiningDashboards.md](docs/DefiningDashboards.md).

# Example code:

## Conduit API client:

```python
from ddd.phab import Conduit

phab = Conduit()

# Call phabricator's maniphest.search API and retrieve all results
r = phab.request('maniphest.search', {'queryKey': "KpRagEN3fCBC",
                "limit": "40",
                "attachments": {
                    "projects": True,
                    "columns": True
                }})
```

This fetches every page of results. Note that the API limits a single request to **at most** 100 objects; `fetch_all` will request each page from the server until all available records have been retrieved:

```python
r.fetch_all()
```
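The pagination loop behind `fetch_all` can be illustrated with a self-contained sketch. The API function below is a stub standing in for a Conduit endpoint, and the names are hypothetical; the real client sends HTTP requests and reads the `after` cursor from each response:

```python
def fake_api(method, params):
    """Stand-in for a Conduit endpoint: returns at most 100 records per
    call, plus an `after` cursor while more pages remain."""
    records = list(range(250))  # pretend the query matches 250 tasks
    start = int(params.get("after") or 0)
    limit = min(int(params.get("limit", 100)), 100)  # server-side page cap
    page = records[start:start + limit]
    after = start + limit if start + limit < len(records) else None
    return {"data": page, "cursor": {"after": after}}

def fetch_all_pages(method, params):
    """Keep requesting pages, passing the cursor back, until exhausted."""
    results, after = [], None
    while True:
        response = fake_api(method, dict(params, after=after))
        results.extend(response["data"])
        after = response["cursor"]["after"]
        if after is None:
            return results

rows = fetch_all_pages("maniphest.search", {"limit": 100})
```

With 250 matching records and a 100-object page cap, this makes three requests and returns all 250 rows.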


## PHIDRef

Whenever we encounter a Phabricator `phid`, we wrap it in a `PHIDRef` object. This provides several conveniences for working with Phabricator objects efficiently. This interactive Python session demonstrates how it works:

```python
In [1]: phid = PHIDRef('PHID-PROJ-uier7rukzszoewbhj7ja')
# PHIDRef has a placeholder for the Project instance:
In [2]: phid.object
Out[2]: Project(name="", phid="PHID-PROJ-uier7rukzszoewbhj7ja")

# Once we call resolve_phids, then the data is filled in from cache or from a conduit request if it's not cached:
In [3]: PHObject.resolve_phids(phab, DataCache(db))
Out[3]: {'PHID-PROJ-uier7rukzszoewbhj7ja': Project(name="Releas...ewbhj7ja")}

# now phid and phid.object are useful:
In [4]: phid.object
Out[4]: Project(name="Release-Engineering-Team", phid="PHID-PROJ-uier7rukzszoewbhj7ja")

In [5]: phid
Out[5]: PHIDRef('PHID-PROJ-uier7rukzszoewbhj7ja', object='Release-Engineering-Team')

In [6]: str(phid.object)
Out[6]: 'Release-Engineering-Team'

In [7]: str(phid)
Out[7]: 'PHID-PROJ-uier7rukzszoewbhj7ja'

```

1. You can construct a bunch of `PHIDRef` instances and then later fetch all of their data in a single call to Phabricator's Conduit API. This is accomplished by calling `PHObject.resolve_phids()`.
2. `resolve_phids()` can store a local cache of the phid details in the `phobjects` table. After `resolve_phids()` completes, all `PHObject` instances will contain the `name`, `url` and `status` of the corresponding Phabricator objects.
3. An instance of `PHIDRef` can be used transparently as a database key.
4. `str(PHIDRef_instance)` returns the original `"PHID-TYPE-hash"` string.
5. `PHIDRef_instance.object` returns an instantiated `PHObject` instance.
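The batching pattern in point 1 can be sketched in plain Python. This is a simplified, hypothetical illustration of the idea, not the real `PHIDRef`/`PHObject` classes from the ddd package: constructing a ref is cheap, and one batched lookup later fills in every outstanding ref at once.

```python
class PHIDRefSketch:
    """Simplified illustration of the PHIDRef batching pattern."""

    _unresolved = {}  # phid -> ref, shared registry of pending lookups

    def __init__(self, phid):
        self.phid = phid
        self.name = None  # placeholder until resolve_phids() runs
        PHIDRefSketch._unresolved[phid] = self

    @classmethod
    def resolve_phids(cls, lookup):
        # One batched request resolves every outstanding ref at once.
        details = lookup(list(cls._unresolved))
        for phid, ref in cls._unresolved.items():
            ref.name = details[phid]
        cls._unresolved.clear()

    def __str__(self):
        # Refs stringify to the raw PHID, so they can serve as database keys.
        return self.phid


def fake_lookup(phids):
    """Stand-in for one Conduit call that resolves many phids at once."""
    return {p: "Release-Engineering-Team" for p in phids}

ref = PHIDRefSketch("PHID-PROJ-uier7rukzszoewbhj7ja")
PHIDRefSketch.resolve_phids(fake_lookup)
```

After `resolve_phids()` returns, `ref.name` is populated and the registry of pending lookups is empty, however many refs were created beforehand.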