
Add 'coalesce' parameter to control number of files generated per Hive partition.

Xcollazo requested to merge T323614-reduce-file-footprint into main

In this MR we add a command-line argument named coalesce to control the number of files generated per Hive partition.
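As a rough illustration of what such an argument could look like, here is a minimal argparse sketch. The `--coalesce` flag spelling, the `None` default (meaning "do not coalesce"), the positive-integer check and the `build_parser` helper are all assumptions made for this example; the actual MR may define the argument differently.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse a strictly positive integer (assumed validation, not from the MR)."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("coalesce must be >= 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --coalesce option."""
    parser = argparse.ArgumentParser(description="Illustrative job arguments")
    parser.add_argument(
        "--coalesce",
        type=positive_int,
        default=None,  # assumption: None means "leave Spark partitions as they are"
        help="Number of files to generate per Hive partition",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--coalesce", "2"])
print(args.coalesce)
```

Keeping the default at `None` would let existing invocations behave as before, which is one possible way to make the new parameter opt-in.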

We pass this parameter all the way down to just before the call to Spark's saveAsTable:

df
  .computation1
  .computation2
  ...
  .coalesce(2)  # <=== this effectively controls the number of files generated per Hive partition by merging Spark partitions together.
  .saveAsTable(...)
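The chain above can be sketched as a small helper, assuming PySpark, where `saveAsTable` is reached through `df.write`. The function name `save_with_coalesce` and its parameters are hypothetical, and writer options such as the save mode or partition columns are left out on purpose. The helper only relies on the `coalesce` and `write.saveAsTable` methods of the DataFrame it receives, so it does not import PySpark itself.

```python
from typing import Any, Optional


def save_with_coalesce(df: Any, table_name: str, coalesce: Optional[int] = None) -> None:
    """Optionally merge Spark partitions right before writing the table.

    df is expected to be a Spark DataFrame (assumption: PySpark API).
    When coalesce is None the DataFrame is written with its current partitioning.
    """
    if coalesce is not None:
        if coalesce < 1:
            raise ValueError("coalesce must be >= 1")
        # Merging Spark partitions caps the number of output files per Hive partition.
        df = df.coalesce(coalesce)
    df.write.saveAsTable(table_name)
```

Applying `coalesce` as the last step before the write keeps the upstream computations running with their original parallelism; only the final write stage is narrowed.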

I have done a test run and it is working as expected with no significant performance degradation. Some details of results below.

Before changes

Here is the number of files generated by a run for snapshot=2022-12-05:

# numbers below refer to #-folders, #-files and #-bytes, respectively:
xcollazo@stat1007:~$  hdfs dfs -count /user/hive/warehouse/analytics_platform_eng.db/*/snapshot=2022-12-05/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
         659       101247          213478397 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache/snapshot=2022-12-05
           1          256          125479494 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_lead_image_data/snapshot=2022-12-05
           1          256            1955511 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_search_index_delta/snapshot=2022-12-05
           1          512          831765957 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_search_index_full/snapshot=2022-12-05
         659       101247         6884570783 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions/snapshot=2022-12-05
         659       101421          292499396 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache/snapshot=2022-12-05
           1          512          501687364 /user/hive/warehouse/analytics_platform_eng.db/image_suggestions_wikidata_data/snapshot=2022-12-05

In total, we have this number of files:

xcollazo@stat1007:~$ hdfs dfs -count /user/hive/warehouse/analytics_platform_eng.db/*/snapshot=2022-12-05/ | awk '{sum+=$2;} END{print sum;}'
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
305451

In total, we have this number of bytes:

xcollazo@stat1007:~$ hdfs dfs -count /user/hive/warehouse/analytics_platform_eng.db/*/snapshot=2022-12-05/ | awk '{sum+=$3;} END{print sum;}'
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
8851436902

This is ~8.2 GiB, for an average file size of 8851436902 / 305451 ≈ 28 KiB.
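The awk totals above can be cross-checked with a short Python sketch that sums the file and byte columns of `hdfs dfs -count` output. The function name `sum_count_output` is made up for this example; the rows are copied from the listing above and re-formatted into lines, so the exact column spacing is an approximation of the real output.

```python
from typing import Tuple


def sum_count_output(text: str) -> Tuple[int, int]:
    """Sum the file-count and byte-count columns of `hdfs dfs -count` output.

    Lines whose first three fields are not integers (for example the
    JAVA_TOOL_OPTIONS notice) are skipped.
    """
    total_files = 0
    total_bytes = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not all(f.isdigit() for f in fields[:3]):
            continue
        total_files += int(fields[1])
        total_bytes += int(fields[2])
    return total_files, total_bytes


# (folders, files, bytes, table) as reported for snapshot=2022-12-05 before the change.
ROWS = [
    (659, 101247, 213478397, "image_suggestions_instanceof_cache"),
    (1, 256, 125479494, "image_suggestions_lead_image_data"),
    (1, 256, 1955511, "image_suggestions_search_index_delta"),
    (1, 512, 831765957, "image_suggestions_search_index_full"),
    (659, 101247, 6884570783, "image_suggestions_suggestions"),
    (659, 101421, 292499396, "image_suggestions_title_cache"),
    (1, 512, 501687364, "image_suggestions_wikidata_data"),
]
BASE = "/user/hive/warehouse/analytics_platform_eng.db"
sample = "Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8\n" + "\n".join(
    f"{d:>12} {f:>12} {b:>18} {BASE}/{t}/snapshot=2022-12-05" for d, f, b, t in ROWS
)

files, size = sum_count_output(sample)
print(files, size)                    # 305451 8851436902
print(round(size / files / 1024, 1))  # average file size in KiB: 28.3
```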

After changes

Here is the number of files generated by a run for snapshot=2022-12-05 with coalesce=2:

xcollazo@stat1007:~$ hdfs dfs -count /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/*/snapshot=2022-12-05/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
         659         1254           61523381 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_instanceof_cache/snapshot=2022-12-05
           1            2          123687615 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_lead_image_data/snapshot=2022-12-05
           1            2          879434997 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_delta/snapshot=2022-12-05
           1            4          867684894 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05
         659         1254         6807802929 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_suggestions/snapshot=2022-12-05
         659         1248          124526750 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_title_cache/snapshot=2022-12-05
           1            2          547493026 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_wikidata_data/snapshot=2022-12-05

In total, we have this number of files:

xcollazo@stat1007:~$  hdfs dfs -count /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/*/snapshot=2022-12-05/ | awk '{sum+=$2;} END{print sum;}'
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
3766

In total, we have this number of bytes:

xcollazo@stat1007:~$  hdfs dfs -count /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/*/snapshot=2022-12-05/ | awk '{sum+=$3;} END{print sum;}'
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
9412153592

This is ~8.8 GiB, for an average file size of 9412153592 / 3766 ≈ 2.4 MiB.

So we saw a roughly 81-fold decrease in file count (305451 / 3766) when setting coalesce=2.
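For reference, the before/after comparison can be reproduced from the four totals reported above. This is only a worked version of the arithmetic; the variable and function names are illustrative.

```python
# Totals reported by `hdfs dfs -count` for snapshot=2022-12-05.
BEFORE_FILES, BEFORE_BYTES = 305451, 8851436902
AFTER_FILES, AFTER_BYTES = 3766, 9412153592

KIB = 1024
MIB = 1024 ** 2


def average_file_size(total_bytes: int, total_files: int) -> float:
    """Average file size in bytes."""
    return total_bytes / total_files


fold_decrease = BEFORE_FILES / AFTER_FILES
avg_before_kib = average_file_size(BEFORE_BYTES, BEFORE_FILES) / KIB
avg_after_mib = average_file_size(AFTER_BYTES, AFTER_FILES) / MIB

print(round(fold_decrease, 1))   # 81.1
print(round(avg_before_kib, 1))  # 28.3
print(round(avg_after_mib, 1))   # 2.4
```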

More details

File sizes across the board are also reasonable with `coalesce=2`. See examples below:

xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_instanceof_cache/snapshot=2022-12-05/wiki=enwiki
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    622.2 K 2022-12-21 17:02 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_instanceof_cache/snapshot=2022-12-05/wiki=enwiki/part-00000-7e626ca9-9a5d-410e-a7ae-ea7079f3e8e3.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    620.6 K 2022-12-21 17:01 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_instanceof_cache/snapshot=2022-12-05/wiki=enwiki/part-00001-7e626ca9-9a5d-410e-a7ae-ea7079f3e8e3.c000.snappy.parquet

xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_lead_image_data/snapshot=2022-12-05
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users     59.0 M 2022-12-21 14:58 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_lead_image_data/snapshot=2022-12-05/part-00000-9f1e9063-40bb-42a8-a1d5-46722b45b202.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users     59.0 M 2022-12-21 14:58 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_lead_image_data/snapshot=2022-12-05/part-00001-9f1e9063-40bb-42a8-a1d5-46722b45b202.c000.snappy.parquet

xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_delta/snapshot=2022-12-05
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    467.2 M 2022-12-21 17:12 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_delta/snapshot=2022-12-05/part-00000-773fea6e-50d7-4d0b-be5d-97c2b713d0e0.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    371.4 M 2022-12-21 17:12 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_delta/snapshot=2022-12-05/part-00001-773fea6e-50d7-4d0b-be5d-97c2b713d0e0.c000.snappy.parquet


xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 4 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      6.4 M 2022-12-21 17:05 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05/part-00000-1690c012-9b2b-4897-bda0-792166b9ff80.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    435.5 M 2022-12-21 15:38 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05/part-00000-448d55dc-d5a2-4589-b4b8-4e770db5cb22.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      9.8 M 2022-12-21 17:05 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05/part-00001-1690c012-9b2b-4897-bda0-792166b9ff80.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    375.8 M 2022-12-21 15:38 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_search_index_full/snapshot=2022-12-05/part-00001-448d55dc-d5a2-4589-b4b8-4e770db5cb22.c000.snappy.parquet


xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_suggestions/snapshot=2022-12-05/wiki=enwiki
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      6.8 M 2022-12-21 16:02 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_suggestions/snapshot=2022-12-05/wiki=enwiki/part-00000-d402323a-4a17-4a6b-bd59-25931dbeae9a.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      7.2 M 2022-12-21 16:02 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_suggestions/snapshot=2022-12-05/wiki=enwiki/part-00001-d402323a-4a17-4a6b-bd59-25931dbeae9a.c000.snappy.parquet

xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_title_cache/snapshot=2022-12-05/wiki=enwiki
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      1.2 M 2022-12-21 16:32 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_title_cache/snapshot=2022-12-05/wiki=enwiki/part-00000-c239366b-5678-49a4-98e8-d8255965b8fb.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users      1.2 M 2022-12-21 16:32 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_title_cache/snapshot=2022-12-05/wiki=enwiki/part-00001-c239366b-5678-49a4-98e8-d8255965b8fb.c000.snappy.parquet

xcollazo@stat1007:~$ hdfs dfs -ls -h /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_wikidata_data/snapshot=2022-12-05
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r-----   3 analytics-privatedata analytics-privatedata-users     58.7 M 2022-12-21 14:42 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_wikidata_data/snapshot=2022-12-05/part-00000-aff8cd21-6c94-414b-998b-adb68169b28b.c000.snappy.parquet
-rw-r-----   3 analytics-privatedata analytics-privatedata-users    463.5 M 2022-12-21 14:46 /user/hive/warehouse/xcollazo_delete_me_test_image_suggestions_spark3.db/image_suggestions_wikidata_data/snapshot=2022-12-05/part-00001-aff8cd21-6c94-414b-998b-adb68169b28b.c000.snappy.parquet
Edited by Xcollazo
