Use HDFS for internal state, new pipeline step for output datasets

Fabian Kaelin requested to merge cleanup into main
  • General cleanup, e.g. replaced func.py with wikidata.py and page_history.py; removed the article features code that has moved to the article quality repo.
  • Separated the development-mode input arguments (Hive db/table prefix) from the Hive output configuration (see the argument sketch after this list).
  • Use an HDFS directory for all intermediate datasets, and add a config file that lets pipeline steps refer to data generated by another step (see the config sketch below).
  • New pipeline step (output_datasets.py) that is responsible for preparing the output datasets (e.g. for Hive, or HDFS -> public folder). In the Airflow DAG, this step is only executed upon the successful completion of all other steps (see the DAG sketch below). It supports generating multiple output formats/datasets for the content gap metrics, e.g. content gap data in normalized, denormalized, and CSV formats (the latter required for the knowledge-gap-index), as well as raw content gap features per article.
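A minimal sketch of the input/output separation described above, assuming argparse; the argument names and defaults are hypothetical, not the actual CLI of this repo:

```python
import argparse

def parse_args():
    """Development-mode input arguments are kept separate from the Hive output configuration."""
    parser = argparse.ArgumentParser(description="Content gap metrics pipeline")
    # Development-mode input arguments: read from prefixed dev tables instead of production ones.
    parser.add_argument("--dev-hive-db", default=None,
                        help="Hive database to read input tables from in development mode")
    parser.add_argument("--dev-table-prefix", default="",
                        help="Prefix applied to input table names in development mode")
    # Hive output configuration: where the finished datasets are written.
    parser.add_argument("--output-hive-db", required=True,
                        help="Hive database the output datasets are written to")
    parser.add_argument("--output-table", required=True,
                        help="Hive table the output datasets are written to")
    return parser.parse_args()
```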
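A hedged sketch of how a step could resolve another step's intermediate dataset from the shared config; the config format, keys, and HDFS paths below are assumptions for illustration only:

```python
import yaml

# Hypothetical config: a base HDFS directory plus one entry per intermediate dataset.
CONFIG = yaml.safe_load("""
hdfs_base_dir: hdfs:///tmp/content_gaps/intermediate
datasets:
  page_history: page_history.parquet
  wikidata: wikidata.parquet
  content_gap_metrics: content_gap_metrics.parquet
""")

def dataset_path(step_name: str) -> str:
    """Return the HDFS location of the intermediate dataset produced by another step."""
    return f"{CONFIG['hdfs_base_dir']}/{CONFIG['datasets'][step_name]}"

# e.g. the metrics step reads the output of the page_history step:
# spark.read.parquet(dataset_path("page_history"))
```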
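And a sketch of the DAG wiring, showing output_datasets running only after every other step has succeeded; the operators, task ids, and commands are hypothetical, not copied from the actual DAG:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="content_gap_metrics",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    page_history = BashOperator(task_id="page_history",
                                bash_command="spark-submit page_history.py")
    wikidata = BashOperator(task_id="wikidata",
                            bash_command="spark-submit wikidata.py")
    metrics = BashOperator(task_id="content_gap_metrics",
                           bash_command="spark-submit content_gap_metrics.py")
    output_datasets = BashOperator(task_id="output_datasets",
                                   bash_command="spark-submit output_datasets.py")

    # output_datasets runs only once all upstream steps have succeeded
    # (Airflow's default trigger rule, "all_success").
    [page_history, wikidata] >> metrics >> output_datasets
```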

I used the validation notebook to verify that the content gap metrics are generated and look the same as in the previous validation pass.
