Use HDFS for internal state, new pipeline step for output datasets
- general cleanup, e.g. replaced `func.py` with `wikidata.py` and `page_history.py`; removed article features code that moved to the article quality repo
- separated the development mode input arguments (hive db/table prefix) from the hive output configuration
- use an hdfs directory for all intermediate datasets, and add a config file so that pipeline steps can refer to data generated by another step (see the config sketch after this list)
- new pipeline step (`output_datasets.py`) that is responsible for preparing output datasets (e.g. for hive, or hdfs -> public folder). in the airflow dag, this step is only executed upon the successful completion of all other steps (see the dag sketch after this list). adds support for generating multiple output formats/datasets for content gap metrics, e.g. content gap data in normalized, denormalized, and csv (required for the knowledge-gap-index) formats, plus raw content gap features per article.
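A minimal sketch of how a step could resolve intermediate dataset locations from the shared config. The `PipelineConfig` class, the base-path layout, and the step/dataset names below are illustrative assumptions, not the repo's actual config format:

```python
# Hypothetical config helper: each step writes under <base>/<step>/<dataset>,
# so downstream steps can locate upstream output without hard-coded paths.
# (Illustrative only; the real config file and keys may differ.)
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Locations of intermediate datasets shared between pipeline steps."""

    hdfs_base_dir: str  # e.g. "hdfs:///tmp/content_gaps/intermediate"

    def dataset_path(self, step_name: str, dataset_name: str) -> str:
        # Resolve the HDFS path for a dataset produced by a given step.
        return f"{self.hdfs_base_dir}/{step_name}/{dataset_name}"


config = PipelineConfig(hdfs_base_dir="hdfs:///tmp/content_gaps/intermediate")
wikidata_output = config.dataset_path("wikidata", "item_page_link")
```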
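And a rough sketch of the airflow wiring described above, with `output_datasets` downstream of every other task so it only runs once all of them have succeeded (airflow's default `all_success` trigger rule). The task ids and the use of `EmptyOperator` (airflow >= 2.3) are placeholders, not the real dag:

```python
# Illustrative-only DAG wiring; not the repo's actual airflow code.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="content_gap_metrics",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # Upstream steps that produce intermediate datasets on HDFS.
    wikidata = EmptyOperator(task_id="wikidata")
    page_history = EmptyOperator(task_id="page_history")
    metrics = EmptyOperator(task_id="content_gap_metrics")

    # Runs only after all upstream tasks succeed (default all_success rule).
    output_datasets = EmptyOperator(task_id="output_datasets")

    [wikidata, page_history, metrics] >> output_datasets
```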
I used the validation notebook to verify that content gap metrics are generated and look the same as in the previous validation pass.