Knowledge Gaps merge requests
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests
Updated: 2023-08-23T14:26:57Z

Restructure output datasets
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/33
Author: Fabian Kaelin, updated 2023-08-23T14:26:57Z

Generalize output formats; the metrics are generated for these four aggregation levels:
- `metrics_by_category`: metrics for content gap categories (e.g. female category of the gender gap)
- `metrics_by_content_gap`: metrics for content gaps (e.g. across all categories)
- `metrics_by_category_all_wikis`: metrics across all wikis per category
- `metrics_by_content_gap_all_wikis`: metrics across all wikis per content gap
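
The four aggregation levels can be pictured with a small pure-Python sketch. The row layout (`wiki_db`, `content_gap`, `category`, `article_count`), the sample values, and the `aggregate` helper are assumptions made for illustration only; the actual output schema and pipeline code are not shown in this description.

```python
from collections import defaultdict

# Hypothetical input rows, one per wiki and gap category (illustrative values).
rows = [
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "female", "article_count": 3},
    {"wiki_db": "enwiki", "content_gap": "gender", "category": "male", "article_count": 7},
    {"wiki_db": "dewiki", "content_gap": "gender", "category": "female", "article_count": 2},
    {"wiki_db": "dewiki", "content_gap": "gender", "category": "male", "article_count": 8},
]


def aggregate(rows, keys):
    """Sum article_count over all rows sharing the same values for `keys`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["article_count"]
    return dict(totals)


# One grouping per aggregation level named in the list above.
metrics_by_category = aggregate(rows, ["wiki_db", "content_gap", "category"])
metrics_by_content_gap = aggregate(rows, ["wiki_db", "content_gap"])
metrics_by_category_all_wikis = aggregate(rows, ["content_gap", "category"])
metrics_by_content_gap_all_wikis = aggregate(rows, ["content_gap"])
```

Each level drops one or both of the `wiki_db` and `category` grouping keys, which is why the same helper can produce all four tables in this sketch.
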
Additional changes
- improved configuration, including adding a `sub_content_gaps` arg to select specific gaps to compute
- updated validation notebook to use new output files
- bump version to 0.3.0

README update
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/34
Author: Nick Ifeajika, updated 2023-08-23T13:35:07Z

Updates made to the README. Subject to changes as requested.

update article features path
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/32
Author: Fabian Kaelin, updated 2023-07-28T15:06:14Z

Standard quality threshold
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/31
Author: Fabian Kaelin, updated 2023-05-03T20:59:14Z

Implements the standard quality heuristic described in https://phabricator.wikimedia.org/T332383 as a feature metric:
- the 'standard_quality' is boolean for the metric features dataset, e.g. wiki_db,page_id,time_bucket,standard_quality
- for the content gap metrics, the 'standard_quality' is the percentage of articles that are above the threshold for a given gap category/timebucket
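
The two uses of `standard_quality` can be sketched in plain Python. The threshold value, the sample rows, and the helper names below are hypothetical; the real heuristic is defined in T332383, and the pipeline's own code is not reproduced here.

```python
from collections import defaultdict

# Hypothetical threshold, for illustration only (see T332383 for the heuristic).
STANDARD_QUALITY_THRESHOLD = 0.5

# Hypothetical rows: (wiki_db, page_id, time_bucket, category, quality_score)
articles = [
    ("enwiki", 1, "2023-01", "female", 0.8),
    ("enwiki", 2, "2023-01", "female", 0.3),
    ("enwiki", 3, "2023-01", "male", 0.6),
    ("enwiki", 4, "2023-01", "male", 0.7),
]


def is_standard_quality(score, threshold=STANDARD_QUALITY_THRESHOLD):
    """Boolean feature: True when the score is above the threshold."""
    return score > threshold


# Metric features dataset: wiki_db, page_id, time_bucket, standard_quality
features = [
    (wiki_db, page_id, time_bucket, is_standard_quality(score))
    for wiki_db, page_id, time_bucket, _category, score in articles
]


def percent_standard_quality(articles):
    """Percentage of articles above the threshold per (category, time_bucket)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [above, total]
    for _wiki_db, _page_id, time_bucket, category, score in articles:
        entry = counts[(category, time_bucket)]
        entry[0] += is_standard_quality(score)
        entry[1] += 1
    return {key: 100.0 * above / total for key, (above, total) in counts.items()}


gap_metrics = percent_standard_quality(articles)
```
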
Also implemented is a forward fill for missing article quality values, though this will be further refined in a future MR.

Update geography gaps
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/30
Author: Fabian Kaelin, updated 2023-04-14T19:33:19Z

Changes to the geography gaps based on discussions in https://phabricator.wikimedia.org/T332384:
- switch to wmf internal source for geography entity mappings
- rename `geographic` gap to `geography_country`
- rename `geographic_region` gap to `geography_cultural_region`
- rename `geographic_continent` gap to `geography_continent`
- remove `geographic_sub_continent` gap
- add new `geography_wmf_region` gap

Enable writing to external hive table
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/29
Author: Fabian Kaelin, updated 2023-04-06T01:54:29Z

The pageviews dataset is now an external hive table; the data is stored on `/wmf/data/research/pageview_daily`. This requires an additional option when writing from pyspark.

Percentile metrics
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/28
Author: Fabian Kaelin, updated 2023-03-24T18:57:15Z

Generate percentiles ([0.05, 0.25, 0.5, 0.75, 0.95]) for the article_created, pageviews, quality_score, and page_revision_count metrics.
The initial motivation is to track article quality in quality buckets (e.g. what percentage of articles for a given content gap category can be considered good?).
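
As a rough illustration of the idea, the sketch below computes the listed percentiles for a list of quality scores in pure Python. The linear-interpolation method, the helper names, the sample scores, and the 0.75 cutoff for "good" are all assumptions; this is not the pipeline's implementation.

```python
import math

PERCENTILES = [0.05, 0.25, 0.5, 0.75, 0.95]


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted, non-empty list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def percentiles(values, qs=PERCENTILES):
    """Map each requested quantile to its value for the given metric values."""
    ordered = sorted(values)
    return {q: percentile(ordered, q) for q in qs}


def share_at_or_above(values, cutoff):
    """Fraction of values at or above a (hypothetical) quality cutoff."""
    return sum(v >= cutoff for v in values) / len(values)


# Hypothetical quality scores for one content gap category.
quality_scores = [0.0, 0.25, 0.5, 0.75, 1.0]
quality_percentiles = percentiles(quality_scores)
good_share = share_at_or_above(quality_scores, cutoff=0.75)
```

With per-category percentiles available, a question such as "what share of articles is at or above a given quality level" can be approximated from the bucket boundaries rather than from the raw per-article scores.
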
See the [Quantile metrics](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/blob/quantiles/interactive/datasets.ipynb) section of the example notebook.

Revert store_true action
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/27
Author: Fabian Kaelin, updated 2023-03-19T00:08:24Z

Output metric features
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/26
Author: Fabian Kaelin, updated 2023-03-18T22:39:21Z

Feature metric joins
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/25
Author: Fabian Kaelin, updated 2023-03-18T01:55:22Z

Full join for feature metric dataframes

Article quality metric feature
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/24
Author: Fabian Kaelin, updated 2023-03-17T17:44:40Z

Use the article quality score of the last revision to a given article in a given timebucket (e.g. month).

Multimedia gap for illustrated articles
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/23
Author: Fabian Kaelin, updated 2023-03-14T16:26:56Z

Add a multimedia gap ("multimedia_illustrated") to the knowledge gaps pipeline.

Use HDFS for internal state, new pipeline step for output datasets
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/22
Author: Fabian Kaelin, updated 2022-10-01T12:30:41Z

- general cleanup, e.g. replaced `func.py` with `wikidata.py` and `page_history.py`. removed article features code that moved to the article quality repo
- separated the development mode input arguments (hive db/table prefix) from the hive output configuration
- use a hdfs directory for all intermediate datasets, add a config file to enable pipeline steps to refer to data generated by another step
- new pipeline step (`output_datasets.py`) that is responsible for preparing output datasets (e.g. for hive, or hdfs->public folder). In the airflow dag, this step is only executed upon the successful completion of all other steps. It supports generating multiple output formats/datasets for content gap metrics, e.g. content gap data in normalized, denormalized, and csv (required for the knowledge-gap-index) formats, as well as raw content gap features per article.
I used the validation notebook to verify content gaps metrics are generated and look the same as for the previous validation pass.

Validation
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/21
Author: Fabian Kaelin, updated 2022-07-28T12:38:02Z

Ship development.sql when installing the package
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/20
Author: Bmansurov, updated 2022-06-22T17:34:45Z

This is needed for the Airflow job that creates development data.
Issue: #3
Assignee: Bmansurov

Parametrization
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/19
Author: Bmansurov, updated 2022-06-17T02:45:00Z

Before this MR we were unable to run the pipeline independently without overwriting each other's data. This patch allows us to generate different development datasets and to save the data in different databases and tables (defined by prefixes).
Issue: #3
Assignee: Bmansurov

Prepare the codebase to be run by Airflow
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/18
Author: Bmansurov, updated 2022-06-14T13:13:53Z

Issue: #3
Assignee: Bmansurov

Generate development data for knowledge gap features
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/17
Author: Bmansurov, updated 2022-06-06T13:25:39Z

Generated articles are made up of people articles, location articles,
and articles with time properties.
Issue: # 1
Assignee: Bmansurov

Feature metrics: add buckets
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/16
Author: Bmansurov, updated 2022-06-01T17:35:21Z
Assignee: Bmansurov

Introduce running mode for metrics
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/14
Author: Bmansurov, updated 2022-05-29T01:54:05Z
Assignee: Bmansurov