Add a DAG to materialize query event data in HDFS

Each day, compute the previous day's data from event.wdqs_external_sparql_query and save it in wikidata.wdqs_external_queries_by_user_agent_daily .

Note: we might want this DAG to also write to other destination tables later, hence its more general name compared to the table.

The schema of the destination table wikidata.wdqs_external_queries_by_user_agent_daily is as follows. The counts are based on a GROUP BY on day, graph_name, and ua (user-agent).

column name description
day YYYY-MM-DD for the query (we don't retain the exact time)
graph_name main or scholarly graph
ua User-Agent as extracted from the HTTP headers
num_1_less_10ms Count of queries for this User-Agent with latency < 10ms (query time class 1)
num_2_10ms_to_100ms Count of queries for this User-Agent with 10ms <= latency < 100ms (query time class 2)
num_3_100ms_to_1s Count of queries for this User-Agent with 100ms <= latency < 1s (query time class 3)
num_4_1s_to_10s Count of queries for this User-Agent with 1s <= latency < 10s (query time class 4)
num_5_more_10s Count of queries for this User-Agent with latency >= 10s (query time class 5)
total_query_time_ms Sum of all time spent on all queries by this User-Agent
num_queries Total queries for this User-Agent (same as summing the 5 bucketing columns)
num_successful_queries Total queries for this User-Agent with HTTP status code 200

This data is what's needed to generate the data views we want in a Superset table to analyze current traffic patterns. The query used by the DAG is stored in Wikidata Platform's Analytics repo. At this time we only track User-Agents with >1000 queries in a day, but this might change in the future.

This CR closely follows the current setup for WMDE, who query the same table and retain similar metrics for their team.

Data cleanup after 90 days happens through the HQL (see link 2 lines up) before each day's query runs.

We probably will want a weekly or monthly DAG to do additional aggregation based on the results of this DAG.

We also might not want to get an email every day, but let's see if it's annoying and adjust later.

Bug:T418723

@gmodena @trueg @ahoelzl @andrewtavis-wmde

Edited by Lerickson

Merge request reports

Loading