
# ores-data
A repository containing training and test datasets for ORES.
This repository contains:
- JSON-formatted ORES training datasets that contain all attributes used for training in production (in `train/`)
- JSON-formatted test datasets that contain all attributes needed for testing/model card development (in `test/`)
- binaries of many ORES models that are in production (in `models/`)
- markdown files with ORES model architectures, training performance, and other relevant model statistics (in `model_info/`)
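Assuming the training and test files are newline-delimited JSON (one observation per line; the exact layout may vary by repo), a minimal loader might look like this. The filename in the commented example is hypothetical and only illustrates the `<wiki>.<model>` naming pattern:

```python
import json

def load_jsonl(path):
    """Load a newline-delimited JSON dataset into a list of dicts."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows

# Hypothetical filename — actual names follow the <wiki>.<model>... pattern:
# observations = load_jsonl("train/enwiki.damaging.training.json")
```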
## How I compiled these resources
This repository covers all five model/data types that are served by ORES:
- Edit quality (compiled and run in the [`editquality` repo](https://github.com/wikimedia/editquality))
- Article quality (compiled and run in the [`articlequality` repo](https://github.com/wikimedia/articlequality))
- Draft quality (compiled and run in the [`draftquality` repo](https://github.com/wikimedia/draftquality))
- Article topic (compiled and run in the [`drafttopic` repo](https://github.com/wikimedia/drafttopic))
- Draft topic (compiled and run in the [`drafttopic` repo](https://github.com/wikimedia/drafttopic))
As I started working with these repos and the more generalized engine that powers them (the [`revscoring` repo](https://github.com/wikimedia/revscoring)), I found it difficult to figure out where data and models were actually coming from. They were, for the most part, compiled/computed in memory and on the fly, which made efforts toward accessibility, transparency, accountability, and fairness difficult to embark on. So I decided to compile as many models, datasets, model performance baselines, and architectures as possible and put them here in a centralized repository to lower the bar to entry.
The hope is that this repository makes it easier for future ML developers to train models on WMF data, provides an easy place to prototype model cards and datasheets, and ultimately creates better models and data for WMF's platform.
The general recipe for assembling this repo was the following:
1. get access to analytics machines that WMF runs
2. `ssh <username>@stat100x.eqiad.wmnet`
3. `git clone https://github.com/wikimedia/<repo>.git`
4. `cd <repo>`
5. `python3 -m venv env`
6. `source env/bin/activate`
7. `pip install -r requirements.txt`
8. `make models` (for the big repos, like `editquality`, I ran `nohup make models &`, which runs the process in the background and keeps it running even if your SSH connection drops)
9. copy/rename datasets, models, and model info that result from step 8 into this repo
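Step 9 can be scripted. Here is a minimal sketch, assuming each built repo leaves its binaries as `*.model` files under a `models/` subdirectory (the paths and layout are assumptions, not guaranteed by every repo):

```python
import shutil
from pathlib import Path

def collect_models(src_repo, dest="models"):
    """Copy *.model binaries produced by `make models` into this repo.

    src_repo: path to a built repo checkout (e.g. "editquality").
    Returns the sorted list of copied filenames.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumed layout: built binaries live in <src_repo>/models/*.model
    for model in Path(src_repo, "models").glob("*.model"):
        shutil.copy2(model, dest_dir / model.name)
        copied.append(model.name)
    return sorted(copied)
```

Renaming, when needed, is a matter of passing a different destination name per file; the sketch above keeps the original `<wiki>.<model>.gradient_boosting.model` names as-is.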
## Missing datasets and models
Although this repo attempts to be as complete as possible, some datasets and models were not initially retrievable. They are listed below — this list should shrink over time.
### Models
**editquality**
- viwiki.reverted.gradient_boosting.model
- wikidatawiki.damaging.gradient_boosting.model
- wikidatawiki.goodfaith.gradient_boosting.model
**articlequality**
- frwiki.wp10.gradient_boosting.model
- nlwiki.wp10.gradient_boosting.model
- ptwiki.wp10.gradient_boosting.model
- ruwiki.wp10.gradient_boosting.model
- svwiki.wp10.gradient_boosting.model
- trwiki.wp10.gradient_boosting.model
- ukwiki.wp10.gradient_boosting.model
**draftquality**
- enwiki.draft_quality.gradient_boosting.model
- ptwiki.draft_quality.gradient_boosting.model
**drafttopic**
- arwiki.drafttopic.gradient_boosting.model
- cswiki.drafttopic.gradient_boosting.model
- enwiki.drafttopic.gradient_boosting.model
- euwiki.drafttopic.gradient_boosting.model
- huwiki.drafttopic.gradient_boosting.model
- hywiki.drafttopic.gradient_boosting.model
- kowiki.drafttopic.gradient_boosting.model
- srwiki.drafttopic.gradient_boosting.model
- ukwiki.drafttopic.gradient_boosting.model
- viwiki.drafttopic.gradient_boosting.model
**articletopic**
- cswiki.articletopic.gradient_boosting.model
- enwiki.articletopic.gradient_boosting.model
- euwiki.articletopic.gradient_boosting.model
- huwiki.articletopic.gradient_boosting.model
- hywiki.articletopic.gradient_boosting.model
- kowiki.articletopic.gradient_boosting.model
- srwiki.articletopic.gradient_boosting.model
- ukwiki.articletopic.gradient_boosting.model
- viwiki.articletopic.gradient_boosting.model
- wikidata.articletopic.gradient_boosting.model
### Training datasets
**editquality**
- translatewiki
- trwiki
- ukwiki
- urwiki
- viwiki
- wikidatawiki
**articlequality**
- frwiki
- ptwiki
- ruwiki
- svwiki
- trwiki
**draftquality**
- enwiki
- ptwiki
**drafttopic**
- cswiki
**articletopic**
- cswiki
### Testing datasets
Few test datasets are currently available for model evaluation. As new test sets become available, they will be added to the `test/` directory.