New job: scrape and aggregate page summaries
Note
Reviewers: Please see REVIEW_AIRFLOW_MRS.md for directions on what to check for.
Contributor checklist
- I have written tests for this DAG that will be merged into data-engineering/airflow-dags/tests/wmde
- I have run the above tests and code quality checks locally or with Docker as outlined in the tests section of the Airflow DAGs project readme
- I have tested the jobs for this DAG in my local database using queries generated with wmde/analytics/hql/gen_hql_test_scripts.py or by passing parameters to the production queries
- I have tested the included DAGs using the process outlined in TEST_AIRFLOW_DAGS.md and the test variable files provided for each DAG
- All Hive tables that are needed by the included DAG jobs have been created and are accessible by the analytics-wmde Airflow user
- All changes from the main branch have been rebased into this branch
Description
- T418442
- wiki_page_cite_references_monthly: Monthly job to scrape Enterprise snapshots, summarize per-page usage of Cite references, and aggregate by wiki.
Test outputs
Please describe the outputs of the tests that were run.
Destination tables summary
If applicable, include sanitized outputs of DAG jobs so that the results can be compared against expected outputs.
- wmde.cite_ref_errors_by_type: create_table_cite_ref_errors_by_type.hql
| dbname | snapshot_date | error_key | error_count |
|---|---|---|---|
| ffwiki | 2026-03-02 | cite_error_ref_too_many_keys | 4 |
- wmde.transclusions_containing_only_refs: create_table_transclusions_containing_only_refs.hql
| dbname | snapshot_date | transclusion_name | transclusion_count |
|---|---|---|---|
| ffwiki | 2026-03-02 | Efn | 67 |
- wmde.transclusions_containing_refs: create_table_transclusions_containing_refs.hql
| dbname | snapshot_date | transclusion_name | transclusion_count |
|---|---|---|---|
| ffwiki | 2026-03-02 | Officeholder table | 9 |
- wmde.transclusions_within_refs: create_table_transclusions_within_refs.hql
| dbname | snapshot_date | transclusion_name | transclusion_count |
|---|---|---|---|
| ffwiki | 2026-03-02 | Cite quran | 6 |
- wmde.wiki_page_cite_references_monthly: create_table_wiki_page_cite_references_monthly.hql
| dbname | snapshot_date | identical_refs_count | identical_refs_on_pages_with_25_or_less_refs_average | identical_refs_on_pages_with_over_25_refs_average | identical_refs_on_pages_with_over_25_refs_count | list_defined_ref_per_page_having_ref | list_defined_ref_sum | max_ref_reuse_average | nested_ref_sum | page_count | pages_with_automatically_named_refs_count | pages_with_identical_refs_and_over_25_refs_count | pages_with_identical_refs_count | pages_with_multiple_reflists_count | pages_with_named_refs_count | pages_with_nested_refs_count | pages_with_over_25_refs_count | pages_with_ref_reuse_count | pages_with_refs_count | pages_with_similar_refs_count | pages_with_subrefs_count | proportion_of_named_refs_uniquely_named_average | proportion_of_pages_with_identical_refs | proportion_of_pages_with_nested_refs | proportion_of_pages_with_similar_refs | proportion_of_pages_with_refs | proportion_of_refs_from_transclusion | proportion_of_refs_having_transclusion | proportion_of_refs_named_average | proportion_of_refs_reused_average | ref_by_transclusion_average | ref_by_transclusion_count | ref_count | ref_count_per_page | ref_count_per_page_having_ref | reflist_count | reflists_per_page_having_ref | refs_with_solely_transclusion_count | refs_with_transclusions_count | similar_refs_count | subrefs_sum | transclusion_average | transclusion_sum | wikitext_length_average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ffwiki | 2026-03-02 | 1151 | 1150.9946 | 1.4634147 | 60 | 4.4535493E-4 | 9 | 1.6768292 | 3 | 26180 | 1167 | 22 | 696 | 2176 | 4908 | 2 | 41 | 820 | 11227 | 176 | 0 | 0.92662185 | 0.06199341 | 1.7814199E-4 | 0.015676495 | 0.42883882 | 0.010996527 | 0.67252445 | 0.299617 | 0.024091247 | 0.037231673 | 418 | 38012 | 1.451948 | 3.3857665 | 16228 | 1.445444 | 24147 | 25564 | 312 | 0 | 1.8700917 | 48959 |
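Reviewers who want to sanity-check the sample row can compare its ratio columns against the count columns. The sketch below is a minimal example in Python; the mapping in `ASSUMED_RATIOS` is inferred from the sample values only and is not taken from the HQL, and `check_ratios` is a hypothetical helper, not part of this merge request.

```python
import math

# Values copied from the ffwiki / 2026-03-02 sample row above.
row = {
    "page_count": 26180,
    "pages_with_refs_count": 11227,
    "pages_with_identical_refs_count": 696,
    "ref_count": 38012,
    "reflist_count": 16228,
    "proportion_of_pages_with_refs": 0.42883882,
    "proportion_of_pages_with_identical_refs": 0.06199341,
    "ref_count_per_page": 1.451948,
    "ref_count_per_page_having_ref": 3.3857665,
    "reflists_per_page_having_ref": 1.445444,
}

# ASSUMPTION: ratio column -> (numerator column, denominator column).
# Inferred from the sample numbers, not from the production queries.
ASSUMED_RATIOS = {
    "proportion_of_pages_with_refs": ("pages_with_refs_count", "page_count"),
    "proportion_of_pages_with_identical_refs": (
        "pages_with_identical_refs_count",
        "pages_with_refs_count",
    ),
    "ref_count_per_page": ("ref_count", "page_count"),
    "ref_count_per_page_having_ref": ("ref_count", "pages_with_refs_count"),
    "reflists_per_page_having_ref": ("reflist_count", "pages_with_refs_count"),
}


def check_ratios(row, rel_tol=1e-5):
    """Return the ratio columns that do not match their assumed quotient."""
    mismatches = []
    for ratio, (numerator, denominator) in ASSUMED_RATIOS.items():
        expected = row[numerator] / row[denominator]
        # The tolerance allows for the single-precision rounding in the output.
        if not math.isclose(row[ratio], expected, rel_tol=rel_tol):
            mismatches.append(ratio)
    return mismatches


print(check_ratios(row))
```

An empty list means every checked ratio agrees with its assumed quotient to within the tolerance.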
-
wmde.wiki_page_cite_references_raw: create_table_wiki_page_cite_references_raw.hql
| automatic_ref_name_usages_count | automatic_ref_names_count | html_length | identical_ref_count | list_defined_ref_count | main_ref_count | nested_ref_count | page_id | page_namespace | page_title | potential_ref_transclusions | potential_subref_transclusions | potential_transclusions_with_top_level_refs | ratio_subrefs_to_main_refs | ref_by_top_ref_transclusion_count | ref_by_transclusion_count | ref_count | ref_error_counts_by_type | ref_reuse_count | ref_reuse_counts | ref_with_name_count | reflist_count | reflist_item_count | reflist_subref_item_count | refs_with_solely_transclusion_count | refs_with_transclusions_count | rev_id | rev_timestamp | similar_ref_count | subref_count | subref_error_counts_by_type | subref_reuse_count | subrefs_with_errors_count | transclusion_count | transclusions_inside_refs | transclusions_inside_subrefs | unique_name_count | wikitext_length | database | snapshot_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2775 | 0 | 0 | 0 | 0 | 27460 | 0 | The Fall-Down Artist | {} | {} | {} | 0.0 | 0 | 0 | 0 | {} | 0 | [] | 0 | 0 | 0 | 0 | 0 | 0 | 101582 | NULL | 0 | 0 | {} | 0 | 0 | 1 | {} | {} | 0 | 676 | ffwiki | 2026-03-02 |
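For readers comparing the raw and monthly tables, the following is a rough Python sketch of how per-page rows could roll up into per-wiki counts. It is an illustration under assumptions, not the logic of the HQL in this merge request: `aggregate_by_wiki` is a hypothetical helper, only three of the monthly columns are modeled, and the second input row is invented for the example.

```python
from collections import defaultdict


def aggregate_by_wiki(raw_rows):
    """Roll per-page rows up to one summary per (database, snapshot_date)."""
    totals = defaultdict(
        lambda: {"page_count": 0, "pages_with_refs_count": 0, "ref_count": 0}
    )
    for page in raw_rows:
        summary = totals[(page["database"], page["snapshot_date"])]
        summary["page_count"] += 1
        summary["ref_count"] += page["ref_count"]
        # ASSUMPTION: a page "has refs" when its ref_count is above zero.
        if page["ref_count"] > 0:
            summary["pages_with_refs_count"] += 1
    return dict(totals)


raw_rows = [
    # Mirrors the sample row above (page 27460 has no refs).
    {"database": "ffwiki", "snapshot_date": "2026-03-02", "page_id": 27460, "ref_count": 0},
    # Invented row, for illustration only.
    {"database": "ffwiki", "snapshot_date": "2026-03-02", "page_id": 1, "ref_count": 3},
]

print(aggregate_by_wiki(raw_rows))
```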
