Commit 0b55f7e7 authored by Gmodena, committed by GitHub

Add android dataset scripts. (#21)

* Add android dataset scripts.

* Add newline at end of file
parent 0f8b35e8
@@ -102,4 +102,21 @@ and **output directory** (defaults to Output)
```shell
cd dataset_metrics/
python3 dataset_metrics_runner.py 2021-01 Output
```
\ No newline at end of file
### Exporting datasets
The following scripts export the datasets currently used by client teams.
* `ddl/export_prod_data.hql` generates the canonical dataset for the `image-suggestions-api` service.
* `ddl/export_prod_data-android.hql` generates an Android-specific variant.
A template is provided at `ddl/imagerec.sqlite.template` to ingest data into SQLite
for testing and validation purposes. It is parametrized by a `SNAPSHOT` variable;
a SQLite script (DDL and `.import` commands) can be generated in Bash with:
```{bash}
export SNAPSHOT=2021-02-22
eval "cat <<EOF
$(cat imagerec.sqlite.template)
EOF
" 2> /dev/null
```
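
The Bash snippet above expands the template through `eval`, which also runs any command substitution the template happens to contain. As a hedged alternative, the same `${SNAPSHOT}` substitution can be sketched in Python with the standard library's `string.Template`. The helper name `render_template` is hypothetical and not part of this repository.

```python
from string import Template


def render_template(template_text: str, snapshot: str) -> str:
    """Replace ${SNAPSHOT} in the template text without invoking a shell.

    Hypothetical helper, not part of the repository.
    """
    # safe_substitute leaves any other placeholder or stray "$" untouched
    # instead of raising, so unrelated template content passes through as-is.
    return Template(template_text).safe_substitute(SNAPSHOT=snapshot)


if __name__ == "__main__":
    # One line in the shape used by ddl/imagerec.sqlite.template.
    sample = ".import imagerec_prod_${SNAPSHOT}/prod-enwiki-${SNAPSHOT}-wd_image_candidates.tsv t\n"
    print(render_template(sample, "2021-02-22"), end="")
```

Unlike the `eval` heredoc, this only performs variable substitution, so nothing in the template is executed.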
-- This script is used to export production datasets,
-- in a format consumable by the APIs.
--
-- Run with:
-- hive -hiveconf output_path=<output_path> -hiveconf username=${username} -hiveconf wiki=${wiki} -hiveconf snapshot=${monthly_snapshot} -f export_prod_data.hql
--
--
-- Format
-- * Include header: yes
-- * Field delimiter: "\t"
-- * Null value for missing recommendations
-- (image_id, confidence_rating, source fields): ""
-- * found_on: list of wikis delimited by ','
--
-- Changelog:
-- * 2021-03-31: creation.
--
--
use ${hiveconf:username};
set hivevar:null_value="";
set hivevar:found_on_delimiter=",";
set hive.cli.print.header=true;
insert overwrite local directory '${hiveconf:output_path}'
row format delimited fields terminated by '\t'
select page_id,
       page_title,
       nvl(image_id, ${null_value}) as image_id,
       nvl(confidence_rating, ${null_value}) as confidence_rating,
       nvl(source, ${null_value}) as source,
       dataset_id,
       insertion_ts,
       wiki,
       concat_ws(${found_on_delimiter}, found_on) as found_on
from imagerec_prod
where wiki = '${hiveconf:wiki}'
  and snapshot = '${hiveconf:snapshot}'
  and is_article_page = true
  and image_id is not null;
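
To illustrate the format described in the script header (tab-delimited fields, `""` for missing recommendation fields, comma-delimited `found_on`), here is a minimal Python sketch of a consumer. The helper name `parse_export` is hypothetical. The sketch assumes the export has already been read into a string, that the first non-blank line is the header, and that every data row carries all nine columns in the order of the select list.

```python
import csv
import io

# Column order taken from the select list of the export query.
COLUMNS = ["page_id", "page_title", "image_id", "confidence_rating",
           "source", "dataset_id", "insertion_ts", "wiki", "found_on"]
# Fields the format notes describe as using "" for a missing recommendation.
NULLABLE = {"image_id", "confidence_rating", "source"}


def parse_export(text, has_header=True):
    """Parse exported TSV text into a list of dicts (hypothetical helper)."""
    # QUOTE_NONE: quote characters in page titles are treated as plain data.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    skip_header = has_header
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        if skip_header:
            skip_header = False
            continue
        row = dict(zip(COLUMNS, fields))
        for name in NULLABLE:
            if row[name] == "":
                row[name] = None  # map the "" null marker to None
        row["page_id"] = int(row["page_id"])
        # found_on is a comma-delimited list of wikis; empty means no wikis.
        row["found_on"] = row["found_on"].split(",") if row["found_on"] else []
        rows.append(row)
    return rows
```

Other fields, such as `insertion_ts`, are left as strings here; converting them is a choice for the consumer.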
CREATE TABLE t(page_id INTEGER,
               page_title TEXT,
               image_id TEXT,
               confidence_rating TEXT,
               source TEXT,
               dataset_id TEXT,
               insertion_ts REAL,
               wiki TEXT,
               found_on TEXT);
CREATE INDEX t_wiki_page_id ON t(wiki, page_id);
.mode ascii
.separator "\t" "\n"
.timer on
.import imagerec_prod_${SNAPSHOT}/prod-arwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-arzwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-bnwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-cebwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-cswiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-dewiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-enwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-eswiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-euwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-fawiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-frwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-hewiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-huwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-hywiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-itwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-kowiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-plwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-ptwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-ruwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-srwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-svwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-trwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-ukwiki-${SNAPSHOT}-wd_image_candidates.tsv t
.import imagerec_prod_${SNAPSHOT}/prod-viwiki-${SNAPSHOT}-wd_image_candidates.tsv t
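
The `.import` lines above differ only in the wiki name, so they could also be generated from a list rather than maintained by hand. The following Python sketch shows one way to do that. The names `WIKIS` and `import_lines` are hypothetical, and the list simply mirrors the 24 wikis that appear in the template.

```python
# Wikis listed in the template above, in the same order.
WIKIS = [
    "arwiki", "arzwiki", "bnwiki", "cebwiki", "cswiki", "dewiki",
    "enwiki", "eswiki", "euwiki", "fawiki", "frwiki", "hewiki",
    "huwiki", "hywiki", "itwiki", "kowiki", "plwiki", "ptwiki",
    "ruwiki", "srwiki", "svwiki", "trwiki", "ukwiki", "viwiki",
]


def import_lines(snapshot, wikis=WIKIS):
    """Build one sqlite `.import` line per wiki for the given snapshot."""
    return [
        f".import imagerec_prod_{snapshot}/prod-{wiki}-{snapshot}-wd_image_candidates.tsv t"
        for wiki in wikis
    ]


if __name__ == "__main__":
    print("\n".join(import_lines("2021-02-22")))
```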