
  1. 12 Aug, 2021 1 commit
    • Add Search table (#27) · 7cb80f12
      Clara authored
      * Add columns for production data export to Elasticsearch
      
      * Fix table creation
      
      * Update partitioning
      
      * Add distinct() to search table
  2. 04 Aug, 2021 1 commit
  3. 06 May, 2021 1 commit
  4. 30 Apr, 2021 1 commit
  5. 28 Apr, 2021 1 commit
  6. 19 Apr, 2021 1 commit
  7. 14 Apr, 2021 1 commit
  8. 08 Apr, 2021 1 commit
  9. 07 Apr, 2021 1 commit
  10. 02 Apr, 2021 1 commit
  11. 01 Apr, 2021 1 commit
    • T277776 add found on wiki (#13) · 0fe8e0ee
      Gmodena authored
      * Extract a list of wikis from the note column.
      
      * Fix missing note record mock
      
      * store imagerec_prod as parquet
      
      * Add found_on column to prod dataset
      
      * Remove white spaces from found_on entries
      
      * Fix. reformat style
      
      * Add validation and EDA on found_on column
      
      * Store the output of hive locally.
      
      `hive -f` writes some Parquet log noise to stdout, and
      that noise was being redirected into the dataset along
      with the query results.
      
      The export query and dataset generation logic now
      save data locally, without redirecting the query
      result set from stdout.
      
      * Gracefully stop the Spark session before exiting ETL scripts.
      
      * Fix. notebook json post-merge clutter
      
      * Fix metrics notebook and merge with main.
      
      * Clear notebook output
      
      * Fix duplicated field in ddl
      
      * Add EOL to hive queries
      
      * Add termination after create ddl
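The "Store the output of hive locally" change above can be illustrated with a minimal sketch. It assumes the export uses Hive's `INSERT OVERWRITE LOCAL DIRECTORY` statement, which makes Hive write the result set to a local directory itself; the commit message does not say which mechanism was actually used. The helper names, table name, and output path are hypothetical.

```python
import subprocess
from pathlib import Path


def build_export_query(select_sql: str, local_dir: str) -> str:
    """Wrap a SELECT so that Hive writes the rows to a local directory itself."""
    return (
        f"INSERT OVERWRITE LOCAL DIRECTORY '{local_dir}'\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        # Normalise the statement terminator and end the file with a newline.
        f"{select_sql.rstrip().rstrip(';')};\n"
    )


def run_export(hql_file: Path) -> None:
    """Run `hive -f`; stdout is discarded, so log noise cannot reach the data."""
    subprocess.run(
        ["hive", "-f", str(hql_file)],
        stdout=subprocess.DEVNULL,
        check=True,
    )


# Hypothetical table name and output path, for illustration only.
query = build_export_query(
    "SELECT * FROM example_db.example_table;", "/tmp/example_export"
)
print(query)
```

Only `build_export_query` runs here; `run_export` needs a working `hive` client on the PATH. The point of the design is that the data never travels through stdout, so anything else written there cannot end up in the exported files.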
  12. 31 Mar, 2021 3 commits
  13. 30 Mar, 2021 1 commit
  14. 22 Mar, 2021 2 commits
    • Implement parsing of “instance of” fields in ImageMatching production datasets (#9) · 7712d9f4
      Clara authored
      * Update transform.py to parse "instance of" json blob
      
      * Update tests and fix transform.py schema changes
      
      * Simplify parsing logic, add metrics, and update tests
      
      * Updates based on code review
    • T277552 project jdata store as parquet (#10) · 292c864a
      Gmodena authored
      * Project instanceof in model output
      
      * Upload raw model output to HDFS as Parquet
      
      * Add elt to PYTHONPATH when running pytest
      
      * Copy raw data to HDFS and convert it to parquet
      
      * Update doc
      
      * Add instance of to imagerec and store content as parquet
      
      * Fix. append to PYTHONPATH
      
      * Add placeholder instanceof column in mocks
  15. 18 Mar, 2021 1 commit
  16. 17 Mar, 2021 1 commit
    • T275165 dataset metrics (#8) · e4163f38
      Clara authored
      * Add initial dataset metrics
      
      * Update draft dataset metrics with updated datasets
      
      * Add dataset metrics script and comparison of intermediate & final data
      
      * Changes based on code review
      
      * Update dataset_metrics_runner
  17. 16 Mar, 2021 1 commit
    • T275685 generate production datasets (#7) · 05888e6a
      Gmodena authored
      * Add script to generate and export production datasets
      
      * Move hql script to ddl
      
      * Document publish.sh
      
      * Add some crude metrics reporting
      
      * Store artifacts and metrics by run identifier
      
      * Fix variable names
      
      * Adjust var names, record timestamps in metrics
      
      * Enable dynamic partitioning
      
      * Add snapshot partition to production dataset
      
      * Fix dir name
      
      * Update publish.sh doc
      
      * Make virtual env before activating
      
      * Fix: confidence_rating to source mapping
      
      * Add export data summary
      
      * Update validation notebook with regression cases
      
      * Add test for confidence mapping
      
      * Fix. call uuid4 for default dataset_id
      
      * Fix missing comma in column list
      
      * Export NULL values as empty strings.
      
      * Generate data for all languages
      
      * Update data export changelog
      
      * Update data export changelog: set month to March
      
      * Clean up validation notebook
      
      * Load validation data from hive
      
      * Fix character escaping
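The "call uuid4 for default dataset_id" item in this commit points at a common Python pitfall with default values. The sketch below is an assumption about the kind of bug involved, not the repository's actual code; `make_run_metadata` is a hypothetical helper.

```python
import uuid
from typing import Optional


def make_run_metadata(dataset_id: Optional[str] = None) -> dict:
    # Calling uuid.uuid4() inside the body yields a fresh id on every call.
    # Writing `dataset_id=uuid.uuid4()` in the signature would be evaluated
    # once, at definition time, and `dataset_id=uuid.uuid4` (no parentheses)
    # would store the function object instead of an id.
    if dataset_id is None:
        dataset_id = str(uuid.uuid4())
    return {"dataset_id": dataset_id}


first = make_run_metadata()
second = make_run_metadata()
explicit = make_run_metadata("my-run")
```

Each call without an argument gets its own identifier, while an explicitly passed `dataset_id` is kept unchanged.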
  18. 04 Mar, 2021 3 commits
  19. 02 Mar, 2021 9 commits
  20. 01 Mar, 2021 2 commits
  21. 26 Feb, 2021 2 commits
  22. 24 Feb, 2021 1 commit
  23. 23 Feb, 2021 3 commits