Add feature metrics for content gap features

The dataframe of articles mapped to content gap features is joined with a set of metrics using article_id. The set of metrics is:

  • Article quality score (article-quality#1 (closed))
  • Pageviews (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly)
  • Revision Counts (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history)

Where possible, this should be done via a Hive/SQL query; otherwise, via a dataframe that reads from HDFS/Parquet directly.
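
As a rough illustration of the SQL route, the sketch below aggregates raw pageview rows into one value per article per month. It uses Python's built-in sqlite3 as a stand-in for Hive so that it runs anywhere; the table name `pageview_sample` and its columns are assumptions made for the example, not the actual Data Lake schema.

```python
import sqlite3

# In-memory database standing in for the Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names, chosen only for this sketch.
conn.execute(
    "CREATE TABLE pageview_sample ("
    "article_id INTEGER, year INTEGER, month INTEGER, view_count INTEGER)"
)
conn.executemany(
    "INSERT INTO pageview_sample VALUES (?, ?, ?, ?)",
    [(1, 2022, 4, 10), (1, 2022, 4, 5), (1, 2022, 5, 7), (2, 2022, 4, 3)],
)

# Sum raw counts into one row per (article, month).
MONTHLY_PAGEVIEWS_SQL = """
SELECT article_id, year, month, SUM(view_count) AS pageviews
FROM pageview_sample
GROUP BY article_id, year, month
ORDER BY article_id, year, month
"""

monthly_pageviews = conn.execute(MONTHLY_PAGEVIEWS_SQL).fetchall()
for row in monthly_pageviews:
    print(row)
```

The same GROUP BY shape should carry over to a Hive query, though the real table, partition columns, and filters would need to be checked against the linked documentation.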

Note that the feature metric dataframes (i.e. before they are joined with the content gap features) are time series: there will be multiple values per article (e.g. monthly).
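
A minimal sketch of what the join might look like, assuming pandas dataframes for brevity (the project may well use Spark instead). The column names `gap_feature`, `time_bucket`, and `pageviews` are hypothetical; only `article_id` comes from the description above. Because the metric is a time series, an article appears once per time bucket after the join.

```python
import pandas as pd

# Articles mapped to content gap features (feature column is hypothetical).
content_gap_features = pd.DataFrame({
    "article_id": [1, 2, 3],
    "gap_feature": ["a", "b", "c"],
})

# A feature metric time series: several values per article (here monthly).
pageviews = pd.DataFrame({
    "article_id": [1, 1, 2],
    "time_bucket": ["2022-04", "2022-05", "2022-04"],
    "pageviews": [15, 7, 3],
})

# Left join keeps articles that have no metric values (their metric is NaN).
joined = content_gap_features.merge(pageviews, on="article_id", how="left")
print(joined)
```

A left join is used here so that articles without any metric rows are not silently dropped; whether that is the desired behaviour is a design choice for the actual pipeline.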

Edited May 19, 2022 by Fabian Kaelin