
T275685 generate production datasets

Gmodena requested to merge T275685-generate-production-datasets into main

Created by: gmodena

This PR automates the end-to-end generation of production datasets.

For more details, see the comments in publish.sh. This script will:

  • run the notebook with the algorunner wrapper
  • copy the model output to HDFS and expose it via a Hive external table (available in Superset)
  • run etl/transform.py to generate production data
  • expose the production data via a Hive external table (available in Superset)
  • collect production datasets locally

Datasets will be created for the following wikis:

enwiki arwiki kowiki cswiki viwiki frwiki fawiki ptwiki ruwiki trwiki plwiki hewiki svwiki ukwiki huwiki hywiki srwiki euwiki arzwiki cebwiki dewiki bnwiki

Usage

publish.sh <snapshot>

Each time publish.sh is invoked, it records the following data under runs/<run_id>:

  • metrics: a set of timing metrics generated by this script
  • Output: raw model output in tsv format
  • imagerec_prod_${snapshot}: production datasets in tsv format
  • regular.spark.properties: spark properties file for the transform.py job

Each run has a unique <run_id>. This UUID is propagated to the ETL transforms and populates the dataset_id field in the production datasets, so any given dataset can be traced back to the run that generated it.
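To illustrate the per-run bookkeeping described above, here is a minimal sketch of how a run identifier and its directory could be set up. It is not taken from publish.sh: the variable names (snapshot, run_id, run_dir), the use of uuidgen, and the fallback to /proc are assumptions made for the example.

```shell
# Sketch only: variable names and the uuid source are assumptions,
# not the actual contents of publish.sh.
set -eu

snapshot="${1:-2021-01-25}"

# Generate a lowercase UUID for this run. uuidgen is assumed to be installed;
# otherwise fall back to the Linux kernel's uuid source.
if command -v uuidgen >/dev/null 2>&1; then
  run_id="$(uuidgen | tr '[:upper:]' '[:lower:]')"
else
  run_id="$(cat /proc/sys/kernel/random/uuid)"
fi

# Everything recorded for this invocation lives under runs/<run_id>.
run_dir="runs/${run_id}"
mkdir -p "${run_dir}/imagerec_prod_${snapshot}"

# Later steps would receive ${run_id} so it can end up as dataset_id.
echo "run_id=${run_id} snapshot=${snapshot}"
```

Keeping every artifact under a directory named after the UUID means the dataset_id found in a production file points directly at the directory holding that run's metrics and raw output.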

Example

$ ./publish.sh 2021-01-25
[...]
Datasets are available at runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-01-25
Export summary
22 confidence_rating	source
684441
240156 high	wikidata
293089 low	commons
1182152 medium	wikipedia

$ ls runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-02-25/
prod-arwiki-2021-02-25-wd_image_candidates.tsv	 prod-huwiki-2021-02-25-wd_image_candidates.tsv
prod-arzwiki-2021-02-25-wd_image_candidates.tsv  prod-hywiki-2021-02-25-wd_image_candidates.tsv
prod-bnwiki-2021-02-25-wd_image_candidates.tsv	 prod-kowiki-2021-02-25-wd_image_candidates.tsv
prod-cebwiki-2021-02-25-wd_image_candidates.tsv  prod-plwiki-2021-02-25-wd_image_candidates.tsv
prod-cswiki-2021-02-25-wd_image_candidates.tsv	 prod-ptwiki-2021-02-25-wd_image_candidates.tsv
prod-dewiki-2021-02-25-wd_image_candidates.tsv	 prod-ruwiki-2021-02-25-wd_image_candidates.tsv
prod-enwiki-2021-02-25-wd_image_candidates.tsv	 prod-srwiki-2021-02-25-wd_image_candidates.tsv
prod-euwiki-2021-02-25-wd_image_candidates.tsv	 prod-svwiki-2021-02-25-wd_image_candidates.tsv
prod-fawiki-2021-02-25-wd_image_candidates.tsv	 prod-trwiki-2021-02-25-wd_image_candidates.tsv
prod-frwiki-2021-02-25-wd_image_candidates.tsv	 prod-ukwiki-2021-02-25-wd_image_candidates.tsv
prod-hewiki-2021-02-25-wd_image_candidates.tsv	 prod-viwiki-2021-02-25-wd_image_candidates.tsv
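
A summary like the "Export summary" above can be approximated from the collected TSV files with standard tools. The sketch below is an assumption about how such counts could be produced, not the code used by publish.sh: the demo files, the page_id column, and the column positions (confidence_rating and source as fields 2 and 3) are hypothetical.

```shell
# Sketch only: demo data and column positions are hypothetical.
set -eu
export LC_ALL=C  # make the sort order predictable

# Create two tiny TSV files with a header row each, standing in for the
# per-wiki production datasets.
demo_dir="$(mktemp -d)"
printf 'page_id\tconfidence_rating\tsource\n1\thigh\twikidata\n2\tlow\tcommons\n3\thigh\twikidata\n' \
  > "${demo_dir}/demo-a.tsv"
printf 'page_id\tconfidence_rating\tsource\n4\tmedium\twikipedia\n5\t\t\n' \
  > "${demo_dir}/demo-b.tsv"

# Count rows per (confidence_rating, source) pair across all files.
# Header rows are counted too, once per file, and rows with empty values
# are grouped under a blank key.
summary="$(cat "${demo_dir}"/*.tsv | cut -f 2,3 | sort | uniq -c)"
echo "${summary}"
```

Because the header rows are not filtered out, the count on the header line equals the number of files read, which gives a quick check that every wiki was included.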
