Use HDFS for internal state, new pipeline step for output datasets
- general cleanup, e.g. replaced `func.py` with `wikidata.py` and `page_history.py`; removed article features code that moved to the article quality repo
- separated the development mode input arguments (hive db/table prefix) from the hive output configuration
- use an hdfs directory for all intermediate datasets, and add a config file so that pipeline steps can refer to data generated by another step (see the config sketch after this list)
- new pipeline step (`output_datasets.py`) that is responsible for preparing output datasets (e.g. for hive, or hdfs -> public folder). in the airflow dag, this step is only executed upon the successful completion of all other steps (see the dag sketch after this list). adds support for generating multiple output formats/datasets for content gap metrics, e.g. content gap data in normalized, denormalized, and csv (required for the knowledge-gap-index) formats, plus raw content gap features per article.
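A minimal sketch of how a step could resolve intermediate dataset locations from the shared config. The `PipelineConfig` class, the base-path layout, and the step/dataset names below are illustrative assumptions, not the repo's actual config format:

```python
# Hypothetical config helper: each step writes under <base>/<step>/<dataset>,
# so downstream steps can locate upstream output without hard-coded paths.
# (Illustrative only; the real config file and keys may differ.)
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Locations of intermediate datasets shared between pipeline steps."""

    hdfs_base_dir: str  # e.g. "hdfs:///tmp/content_gaps/intermediate"

    def dataset_path(self, step_name: str, dataset_name: str) -> str:
        # Resolve the HDFS path for a dataset produced by a given step.
        return f"{self.hdfs_base_dir}/{step_name}/{dataset_name}"


config = PipelineConfig(hdfs_base_dir="hdfs:///tmp/content_gaps/intermediate")
wikidata_output = config.dataset_path("wikidata", "item_page_link")
```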
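And a rough sketch of the airflow wiring described above, with `output_datasets` downstream of every other task so it only runs once all of them have succeeded (airflow's default `all_success` trigger rule). The task ids and the use of `EmptyOperator` (airflow >= 2.3) are placeholders, not the real dag:

```python
# Illustrative-only DAG wiring; not the repo's actual airflow code.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="content_gap_metrics",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # Upstream steps that produce intermediate datasets on HDFS.
    wikidata = EmptyOperator(task_id="wikidata")
    page_history = EmptyOperator(task_id="page_history")
    metrics = EmptyOperator(task_id="content_gap_metrics")

    # Runs only after all upstream tasks succeed (default all_success rule).
    output_datasets = EmptyOperator(task_id="output_datasets")

    [wikidata, page_history, metrics] >> output_datasets
```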
I used the validation notebook to verify that content gap metrics are generated and look the same as in the previous validation pass.