Add a DAG to materialize query event data in HDFS
Each day, compute the previous day's data from event.wdqs_external_sparql_query and save it in wikidata.wdqs_external_queries_by_user_agent_daily .
Note: we might want this DAG to also write to other destination tables later, hence its more general name compared to the table.
The schema of the destination table wikidata.wdqs_external_queries_by_user_agent_daily is as follows. The counts are based on a GROUP BY on day, graph_name, and ua (user-agent).
| column name | description |
|---|---|
| day | YYYY-MM-DD for the query (we don't retain the exact time) |
| graph_name | main or scholarly graph |
| ua | User-Agent as extracted from the HTTP headers |
| num_1_less_10ms | Count of queries for this User-Agent with latency < 10ms (query time class 1) |
| num_2_10ms_to_100ms | Count of queries for this User-Agent with 10ms <= latency < 100ms (query time class 2) |
| num_3_100ms_to_1s | Count of queries for this User-Agent with 100ms <= latency < 1s (query time class 3) |
| num_4_1s_to_10s | Count of queries for this User-Agent with 1s <= latency < 10s (query time class 4) |
| num_5_more_10s | Count of queries for this User-Agent with latency >= 10s (query time class 5) |
| total_query_time_ms | Sum of all time spent on all queries by this User-Agent |
| num_queries | Total queries for this User-Agent (same as summing the 5 bucketing columns) |
| num_successful_queries | Total queries for this User-Agent with HTTP status code 200 |
This data is what's needed to generate the data views we want in a Superset table to analyze current traffic patterns. The query used by the DAG is stored in Wikidata Platform's Analytics repo. At this time we only track User-Agents with >1000 queries in a day, but this might change in the future.
This CR closely follows the current setup for WMDE, who query the same table and retain similar metrics for their team.
Data cleanup after 90 days happens through the HQL (see link 2 lines up) before each day's query runs.
We probably will want a weekly or monthly DAG to do additional aggregation based on the results of this DAG.
We also might not want to get an email every day, but let's see if it's annoying and adjust later.
Bug:T418723