Commit 38c7167a authored by Miriam Redi, committed by GitHub

Merge pull request #5 from gmodena/T275162-enable-spark-metrics-collection

T275162 enable spark metrics collection
parents dade3b27 b305130e
# ImageMatching
Image recommendation for unillustrated Wikipedia articles
## Getting started
Connect to stat1005 through ssh (the remote machine that will host your notebooks)
...
The output .ipynb and .tsv files can be found in your output directory
```
ls Output
hywiki_2020-12-28.ipynb hywiki_2020-12-28_wd_image_candidates.tsv
```
## Production data ETL
`etl` contains [pyspark](https://spark.apache.org/docs/latest/api/python/index.html) utilities to transform the
algorithm's raw output into a _production dataset_ that will be consumed by a service.
The transform ETL can be run on a local cluster with:
```bash
spark2-submit etl/transform.py <raw data> <production data>
```
`conf/spark.properties` provides default settings to run the ETL as a [regular-size Spark job](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Spark_Resource_Settings) on WMF's Analytics cluster.
```bash
spark2-submit --properties-file conf/spark.properties etl/transform.py <raw data> <production data>
```
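For orientation, a regular-size configuration caps executor resources along the lines sketched below. The values here are purely illustrative; the actual defaults are the ones shipped in `conf/spark.properties` and documented on the wikitech page linked above.
```properties
# Illustrative sketch only; see conf/spark.properties for the real settings
spark.master                          yarn
spark.dynamicAllocation.maxExecutors  64
spark.executor.memory                 8g
spark.executor.cores                  4
spark.driver.memory                   2g
```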
## Metrics collection
On WMF's cluster, the Hadoop Resource Manager UI (and Spark History server) is available at `https://yarn.wikimedia.org/cluster`.
Additional instrumentation can be enabled by passing a `metrics.properties` file to the notebook or ETL jobs. A template
metrics file, which outputs metrics to the driver's and executors' stdout, can be found at `conf/metrics.properties.template`.
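Spark's metrics system is configured through such a properties file; a console sink that reports to stdout looks roughly like the sketch below (the template shipped in `conf/` is the authoritative version).
```properties
# Sketch of a console sink configuration; see conf/metrics.properties.template
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```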
The easiest way to pass the metrics file to a notebook job is by setting `PYSPARK_SUBMIT_ARGS`. For example:
```bash
export PYSPARK_SUBMIT_ARGS="--files ./conf/metrics.properties --conf spark.metrics.conf=metrics.properties pyspark-shell"
python3 algorunner.py 2020-12-28 hywiki Output
```
This will submit the `algorunner.py` job with additional instrumentation enabled.
For more information, refer to [Spark's monitoring documentation](https://spark.apache.org/docs/latest/monitoring.html).
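The same metrics file can be passed to the ETL job directly through `spark2-submit` flags; a sketch, with the same placeholder arguments as above:
```bash
spark2-submit --properties-file conf/spark.properties \
  --files conf/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  etl/transform.py <raw data> <production data>
```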
%% Cell type:code id: tags:
``` python
import re
import pickle
import pandas as pd
import math
import numpy as np
import random
import requests
#from bs4 import BeautifulSoup
import json
import os
from wmfdata.spark import get_session
```
%% Cell type:code id: tags:
``` python
!which python
```
%% Output
/srv/home/clarakosi/venv/bin/python
%% Cell type:code id: tags:
``` python
qids_and_properties={}
```
%% Cell type:code id: tags:parameters
``` python
# Pass in directory to place output files
output_dir = 'Output'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# Pass in the full snapshot date
snapshot = '2020-12-28'
# Allow the passing of a single language as a parameter
language = 'arwiki'
# A spark session type determines the resource pool
# to initialise on yarn
spark_session_type = 'regular'
```
%% Cell type:code id: tags:
``` python
# We use wmfdata boilerplate to init a spark session.
# Under the hood the library uses findspark to initialise
# Spark's environment. pyspark imports will be available
# after initialisation
spark = get_session(type='regular', app_name="ImageRec-DEV Training")
import pyspark
import pyspark.sql
```
%% Cell type:code id: tags:
``` python
languages=['enwiki','arwiki','kowiki','cswiki','viwiki','frwiki','fawiki','ptwiki','ruwiki','trwiki','plwiki','hewiki','svwiki','ukwiki','huwiki','hywiki','srwiki','euwiki','arzwiki','cebwiki','dewiki','bnwiki'] #language editions to consider
#val=100 #threshold above which we consider images as non-icons
languages=[language]
```
%% Cell type:code id: tags:
``` python
reg = r'^([\w]+-[\w]+)'
short_snapshot = re.match(reg, snapshot).group()
short_snapshot
```
%% Output
'2020-12'
%% Cell type:code id: tags:
``` python
!ls /home/gmodena/ImageMatching/conf/metrics.properties
```
%% Cell type:code id: tags:
``` python
len(languages)
```
%% Output
1
%% Cell type:code id: tags:
``` python
def get_threshold(wiki_size):
    #change th to optimize precision vs recall. recommended val for accuracy = 5
    sze, th, lim = 50000, 15, 4
    if (wiki_size >= sze):
        #if wiki_size > base size, scale threshold by (log of ws/bs) + 1
        return (math.log(wiki_size/sze, 10)+1)*th
    #else scale th down by ratio bs/ws, w min possible val of th = th/limiting val
    return max((wiki_size/sze) * th, th/lim)
```
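%% Cell type:markdown id: tags:
As a worked example of the scaling rule: a wiki above the 50,000-article base size gets a threshold of `(log10(wiki_size/50000) + 1) * 15`, which for the arwiki page count retrieved below comes out to roughly 35.
%% Cell type:code id: tags:
``` python
# Worked example: ~1.1M articles is above the 50,000 base size, so the
# threshold scales logarithmically rather than linearly.
get_threshold(1097236)  # ~35.12, matching val['arwiki'] below
```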
%% Cell type:code id: tags:
``` python
val={}
total={}
for wiki in languages:
    querytot="""SELECT COUNT(*) as c
    FROM wmf_raw.mediawiki_page
    WHERE page_namespace=0
    AND page_is_redirect=0
    AND snapshot='"""+short_snapshot+"""'
    AND wiki_db='"""+wiki+"""'"""
    wikisize = spark.sql(querytot).toPandas()
    val[wiki]=get_threshold(int(wikisize['c']))
    total[wiki]=int(wikisize['c'])
```
%% Cell type:code id: tags:
``` python
val
```
%% Output
{'arwiki': 35.11995065862349}
%% Cell type:code id: tags:
``` python
total
```
%% Output
{'arwiki': 1097236}
%% Cell type:code id: tags:
``` python
wikisize
```
%% Output
c
0 1097236
%% Cell type:markdown id: tags:
The query below retrieves, for each unillustrated article: its Wikidata ID, the image of the Wikidata ID (if any), the Commons category of the Wikidata ID (if any), and the lead images of the articles in other languages (if any).
`allowed_images` contains the list of icons (images appearing in more than `val` articles)
`image_pageids` contains the list of illustrated articles (articles with images that are not icons)
`noimage_pages` contains the pageid and Qid of unillustrated articles
`qid_props` contains for each Qid in `noimage_pages`, the values of the following properties, when present: