  1. 14 Apr, 2021 1 commit
  2. 08 Apr, 2021 1 commit
  3. 07 Apr, 2021 1 commit
  4. 02 Apr, 2021 1 commit
  5. 01 Apr, 2021 1 commit
    • T277776 add found on wiki (#13) · 0fe8e0ee
      Gmodena authored
      * Extract a list of wikis from the note column.
      
      * Fix missing note record mock
      
      * store imagerec_prod as parquet
      
      * Add found_on column to prod dataset
      
      * Remove white spaces from found_on entries
      
      * Fix. reformat style
      
      * Add validation and EDA on found_on column
      
      * Store the output of hive locally.
      
      `hive -f` writes some Parquet log noise to stdout,
      and that noise was redirected into the dataset
      together with the query results.
      
      The export query and dataset generation logic have
      been modified to save data locally, without stdout
      redirection of the query result set.
      
      * Gracefully stop spark session before exiting ETL scripts.
      
      * Fix. notebook json post-merge clutter
      
      * Fix metrics notebook and merge with main.
      
      * Clear notebook output
      
      * Fix duplicated field in ddl
      
      * Add EOL to hive queries
      
      * Add termination after create ddl
  6. 31 Mar, 2021 3 commits
  7. 30 Mar, 2021 1 commit
  8. 22 Mar, 2021 2 commits
    • Implement parsing of “instance of” fields in ImageMatching production datasets (#9) · 7712d9f4
      Clarakosi authored
      * Update transform.py to parse "instance of" json blob
      
      * Update tests and fix transform.py schema changes
      
      * Simplify parsing logic, add metrics, and update tests
      
      * Updates based on code review
    • T277552 project jdata store as parquet (#10) · 292c864a
      Gmodena authored
      * Project instanceof in model output
      
      * Upload raw model output to HDFS as parquet
      
      * Add elt to PYTHONPATH when running pytest
      
      * Copy raw data to HDFS and convert it to parquet
      
      * Update doc
      
      * Add instance of to imagerec and store content as parquet
      
      * Fix. append to PYTHONPATH
      
      * Add placeholder instanceof column in mocks
  9. 18 Mar, 2021 1 commit
  10. 17 Mar, 2021 1 commit
    • T275165 dataset metrics (#8) · e4163f38
      Clarakosi authored
      * Add initial dataset metrics
      
      * Update draft dataset metrics with updated datasets
      
      * Add dataset metrics script and comparison of intermediate & final data
      
      * Changes based on code review
      
      * Update dataset_metrics_runner
  11. 16 Mar, 2021 1 commit
    • T275685 generate production datasets (#7) · 05888e6a
      Gmodena authored
      * Add script to generate and export production datasets
      
      * Move hql script to ddl
      
      * Document publish.sh
      
      * Add some crude metrics reporting
      
      * Store artifacts and metrics by run identifier
      
      * Fix variable names
      
      * Adjust var names, record timestamps in metrics
      
      * Enable dynamic partitioning
      
      * Add snapshot partition to production dataset
      
      * Fix dir name
      
      * Update publish.sh doc
      
      * Make virtual env before activating
      
      * Fix: confidence_rating to source mapping
      
      * Add export data summary
      
      * Update validation notebook with regression cases
      
      * Add test for confidence mapping
      
      * Fix. call uuid4 for default dataset_id
      
      * Fix missing comma in column list
      
      * Export NULL values as empty strings.
      
      * Generate data for all languages
      
      * Update data export changelog
      
      * Update data export changelog: set month to March
      
      * Clean up validation notebook
      
      * Load validation data from hive
      
      * Fix character escaping
  12. 04 Mar, 2021 3 commits
  13. 02 Mar, 2021 9 commits
  14. 01 Mar, 2021 2 commits
  15. 26 Feb, 2021 2 commits
  16. 24 Feb, 2021 1 commit
  17. 23 Feb, 2021 6 commits
  18. 22 Feb, 2021 3 commits