- 12 Aug, 2021 1 commit
-
-
Clarakosi authored
* Add columns for production data export to Elasticsearch
* Fix table creation
* Update partitioning
* Add distinct() to search table
-
- 04 Aug, 2021 1 commit
-
-
Gmodena authored
* Update spark dep in makefile
* Spark version bump in build.yml
-
- 06 May, 2021 1 commit
-
-
Clarakosi authored
* Add recurrent time frame to instances_to_filter list
* Add instance of century leap year, family name, and name to filter list
-
- 30 Apr, 2021 1 commit
-
-
Clarakosi authored
-
- 28 Apr, 2021 1 commit
-
-
Clarakosi authored
-
- 19 Apr, 2021 1 commit
-
-
Gmodena authored
* Move wiki and poc_wiki lists to a config file
* Make variable names more generic
-
- 14 Apr, 2021 1 commit
-
-
Gmodena authored
* Add android dataset scripts.
* Add newline at end of file
-
- 08 Apr, 2021 1 commit
-
-
Clarakosi authored
* Filter image suggestions detected as "placeholder images"
* Update based on code review
* Use repartition instead of coalesce
-
- 07 Apr, 2021 1 commit
-
-
Clarakosi authored
-
- 02 Apr, 2021 1 commit
-
-
Gmodena authored
-
- 01 Apr, 2021 1 commit
-
-
Gmodena authored
* Extract a list of wikis from the note column.
* Fix missing note record mock
* Store imagerec_prod as parquet
* Add found_on column to prod dataset
* Remove white spaces from found_on entries
* Fix. reformat style
* Add validation and EDA on found_on column
* Store the output of hive locally. `hive -f` output contains some Parquet log noise that is written to stdout and was redirected to the dataset. The export query and dataset generation logic have been modified to save data locally, without stdout redirection of the query result set.
* Gracefully stop spark session before exiting etl scripts.
* Fix. notebook json post-merge clutter
* Fix metrics notebook and merge with main.
* Clear notebook output
* Fix duplicated field in ddl
* Add EOL to hive queries
* Add termination after create ddl
-
- 31 Mar, 2021 3 commits
- 30 Mar, 2021 1 commit
-
-
Gmodena authored
* Add page redirect counters
* Fix table name and column.
* Fix. quote snapshot literal
-
- 22 Mar, 2021 2 commits
-
-
Clarakosi authored
* Update transform.py to parse "instance of" json blob
* Update tests and fix transform.py schema changes
* Simplify parsing logic, add metrics, and update tests
* Updates based on code review
-
Gmodena authored
* Project instanceof in model output
* Upload raw model output to HDFS as parquet
* Add elt to PYTHONPATH when running pytest
* Copy raw data to HDFS and convert it to parquet
* Update doc
* Add instance of to imagerec and store content as parquet
* Fix. append to PYTHONPATH
* Add placeholder instanceof column in mocks
-
- 18 Mar, 2021 1 commit
-
-
Clarakosi authored
-
- 17 Mar, 2021 1 commit
-
-
Clarakosi authored
* Add initial dataset metrics
* Update draft dataset metrics with updated datasets
* Add dataset metrics script and comparison of intermediate & final data
* Changes based on code review
* Update dataset_metrics_runner
-
- 16 Mar, 2021 1 commit
-
-
Gmodena authored
* Add script to generate and export production datasets
* Move hql script to ddl
* Document publish.sh
* Add some crude metrics reporting
* Store artifacts and metrics by run identifier
* Fix variable names
* Adjust var names, record timestamps in metrics
* Enable dynamic partitioning
* Add snapshot partition to production dataset
* Fix dir name
* Update publish.sh doc
* Make virtual env before activating
* Fix: confidence_rating to source mapping
* Add export data summary
* Update validation notebook with regression cases
* Add test for confidence mapping
* Fix. call uuid4 for default dataset_id
* Fix missing comma in column list
* Export NULL values as empty strings.
* Generate data for all languages
* Update data export changelog
* Update data export changelog: set month to March
* Clean up validation notebook
* Load validation data from hive
* Fix character escaping
-
- 04 Mar, 2021 3 commits
-
-
Miriam Redi authored
T275685 automate pytest
-
Miriam Redi authored
T275162 enable spark metrics collection
-
Gmodena authored
-
- 02 Mar, 2021 9 commits
- 01 Mar, 2021 2 commits
- 26 Feb, 2021 2 commits
- 24 Feb, 2021 1 commit
-
-
Gabriele Modena authored
-
- 23 Feb, 2021 3 commits
-
-
Gabriele Modena authored
-
Gmodena authored
-
Gmodena authored
-