
Research: Add article quality DAG

Bmansurov requested to merge research-article-quality into main

This DAG computes article quality scores for all revisions within a given time frame. Some paths are currently hard-coded; we can decide on their final values later.

The results of running the following command (taken from the DAG logs):

/usr/lib/spark2/bin/spark-submit --driver-cores 2 --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=article-quality-0.0.2.conda/bin/python
    --conf spark.executorEnv.PYSPARK_PYTHON=article-quality-0.0.2.conda/bin/python
    --master yarn --conf spark.sql.shuffle.partitions=1024 --conf spark.shuffle.service.enabled=True
    --conf spark.dynamicAllocation.enabled=True --conf spark.dynamicAllocation.maxExecutors=96
    --conf spark.dynamicAllocation.minExecutors=8 --conf spark.dynamicAllocation.initialExecutors=32
    --conf spark.hadoop.fs.permissions.umask-mode=000 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=article-quality-0.0.2.conda/bin/python
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=article-quality-0.0.2.conda/bin/python
    --archives https://gitlab.wikimedia.org/api/v4/projects/211/packages/generic/article-quality/0.0.2/article-quality-0.0.2.conda.tgz#article-quality-0.0.2.conda
    --executor-cores 2 --executor-memory 16G --driver-memory 4G --keytab analytics-research.keytab
    --principal analytics-research/an-airflow1002.eqiad.wmnet@WIKIMEDIA --name research_article_quality__compute_article_quality_scores__20220912
    --queue production --deploy-mode client article-quality-0.0.2.conda/bin/article_quality_app.py
    --mediawiki_snapshot 2022-07 --wikidata_snapshot 2022-08-01 --start_date 20100101
    --end_date 20100131 --projects enwiki,uzwiki --mode production --save_directory
    /user/bmansurov/article-quality/dag/

can be found in HDFS:

/user/bmansurov/article-quality/dag/20100101_20100131_en_uz_features.parquet
/user/bmansurov/article-quality/dag/20100101_20100131_en_uz_scores.parquet
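
For reviewers, here is a minimal sketch of how the two output paths above appear to relate to the command-line arguments. The naming pattern (`<start_date>_<end_date>_<project codes>_{features,scores}.parquet`, with the `wiki` suffix dropped from each project) is inferred from this single run only, and `output_paths` is a hypothetical helper, not a function from `article_quality_app.py`.

```python
from datetime import datetime
from posixpath import join


def output_paths(save_directory, start_date, end_date, projects):
    """Return the (features, scores) parquet paths for one run.

    Hypothetical helper: the naming pattern is an assumption inferred
    from the example run in this merge request.
    """
    # Validate the YYYYMMDD format used by --start_date / --end_date.
    start = datetime.strptime(start_date, "%Y%m%d")
    end = datetime.strptime(end_date, "%Y%m%d")
    if start > end:
        raise ValueError("start_date must not be after end_date")

    # Assumption: "enwiki,uzwiki" is shortened to "en_uz" by dropping "wiki".
    codes = []
    for project in projects.split(","):
        project = project.strip()
        codes.append(project[:-4] if project.endswith("wiki") else project)

    prefix = "{}_{}_{}".format(start_date, end_date, "_".join(codes))
    features = join(save_directory, prefix + "_features.parquet")
    scores = join(save_directory, prefix + "_scores.parquet")
    return features, scores


features_path, scores_path = output_paths(
    "/user/bmansurov/article-quality/dag/",
    "20100101",
    "20100131",
    "enwiki,uzwiki",
)
print(features_path)
print(scores_path)
```

If the real application names its outputs differently for other project lists or date ranges, this helper would need to be adjusted accordingly.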
Edited by Bmansurov
