Use HDFS for internal state, new pipeline step for output datasets

Merged: Fabian Kaelin requested to merge cleanup into main (!22), Sep 13, 2022
  • General cleanup: replaced func.py with wikidata.py and page_history.py, and removed the article features code that moved to the article quality repo.
  • Separated the development-mode input arguments (Hive db/table prefix) from the Hive output configuration (see the argument sketch after this list).
  • Use an HDFS directory for all intermediate datasets, and add a config file that lets pipeline steps refer to data generated by another step (see the config sketch below).
  • New pipeline step (output_datasets.py) that is responsible for preparing output datasets (e.g. for Hive, or HDFS -> public folder). In the Airflow DAG, this step is only executed upon the successful completion of all other steps (see the DAG sketch below). It supports generating multiple output formats/datasets for the content gap metrics, e.g. content gap data in normalized, denormalized, and CSV formats (the CSV is required for the knowledge-gap-index), as well as raw content gap features per article.
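A minimal sketch of the input/output argument separation, assuming an argparse-based CLI; the argument names and defaults below are illustrative, not the actual interface of this repo:

```python
import argparse

parser = argparse.ArgumentParser(description="knowledge gaps pipeline step")

# Development-mode inputs: read from prefixed Hive tables instead of the
# production ones. These no longer overlap with the output settings.
dev = parser.add_argument_group("development mode inputs")
dev.add_argument("--input-hive-db", default="wmf", help="Hive database to read from")
dev.add_argument("--input-table-prefix", default="", help="prefix for input tables in dev runs")

# Output configuration: where the final datasets are written, fully
# independent of where the inputs come from.
out = parser.add_argument_group("hive output configuration")
out.add_argument("--output-hive-db", help="Hive database for output datasets")
out.add_argument("--output-table", help="name of the output Hive table")

args = parser.parse_args()
```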
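A sketch of the intermediate-dataset config: one HDFS base directory holds all internal state, and steps resolve each other's outputs through a shared mapping. The file name, keys, and paths are assumptions for illustration, not the actual config added in this MR:

```python
# config.py (illustrative): single HDFS base directory for all intermediate state.
HDFS_BASE = "hdfs:///user/analytics/knowledge_gaps/intermediate"  # assumed path

# Each pipeline step writes under its own key and can look up data
# generated by another step through the same mapping.
INTERMEDIATE_DATASETS = {
    "page_history": f"{HDFS_BASE}/page_history",
    "wikidata": f"{HDFS_BASE}/wikidata",
    "content_gap_features": f"{HDFS_BASE}/content_gap_features",
}

def dataset_path(name: str) -> str:
    """Resolve the HDFS path of a dataset produced by another step."""
    return INTERMEDIATE_DATASETS[name]
```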
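A minimal Airflow sketch of the described ordering, with hypothetical task names and commands; it only illustrates that output_datasets runs once every other step has succeeded (Airflow's default all_success trigger rule):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="knowledge_gaps",  # hypothetical DAG id
    start_date=datetime(2022, 9, 1),
    schedule_interval=None,
) as dag:
    # Hypothetical upstream steps writing intermediate data to HDFS.
    page_history = BashOperator(task_id="page_history",
                                bash_command="spark-submit page_history.py")
    wikidata = BashOperator(task_id="wikidata",
                            bash_command="spark-submit wikidata.py")
    metrics = BashOperator(task_id="content_gap_metrics",
                           bash_command="spark-submit metrics.py")

    # output_datasets is only executed upon successful completion of all
    # other steps (default trigger_rule is "all_success").
    output = BashOperator(task_id="output_datasets",
                          bash_command="spark-submit output_datasets.py")

    [page_history, wikidata, metrics] >> output
```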

I used the validation notebook to verify that the content gap metrics are generated and look the same as in the previous validation pass.
