Support using an existing file obj (!32) · Merge requests · repos / research / html-dumps

Matthias Mullie requested to merge mlitn/html-dumps:fileobj into main Feb 22, 2024

This provides more flexibility: for example, the dump no longer needs to be a local file.

An example use case is running within a Spark job, where there is no local file: there's no access to the NFS share and not enough disk space to download full dumps to, but I can have the dumps in HDFS and access them from within the job like so:

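import subprocess
# Stream the dump out of HDFS through a pipe; filepath is still passed alongside the stream
cat = subprocess.Popen(['hdfs', 'dfs', '-cat', wiki_dump_path], stdout=subprocess.PIPE)
html_dump = HTMLDump(filepath=wiki_dump_path, fileobj=cat.stdout)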
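To illustrate why a pipe can stand in for a local file, here is a minimal sketch of a reader that consumes a non-seekable stream. It is not the library's implementation: it assumes the dump is a gzipped tar archive of newline-delimited JSON files, and `iter_json_lines`, `make_sample_archive` and `demo` are hypothetical names. A small Python child process stands in for `hdfs dfs -cat` so the example runs without a cluster.

```python
import io
import json
import os
import subprocess
import sys
import tarfile
import tempfile


def iter_json_lines(fileobj):
    """Yield parsed JSON objects from every regular file in a gzipped tar stream."""
    # Mode "r|gz" reads the archive strictly sequentially, so a
    # non-seekable file object such as a subprocess pipe is fine.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # In stream mode each member must be read while iterating.
            for line in tar.extractfile(member):
                if line.strip():
                    yield json.loads(line)


def make_sample_archive(path):
    """Write a tiny tar.gz with one NDJSON member (assumed dump layout)."""
    payload = b'{"name": "A"}\n{"name": "B"}\n'
    with tarfile.open(path, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


def demo():
    """Read the sample archive through a pipe and return the article names."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.tar.gz")
        make_sample_archive(path)
        # Stand-in for ['hdfs', 'dfs', '-cat', path]: copy the file to stdout.
        copy_script = (
            "import shutil, sys\n"
            "with open(sys.argv[1], 'rb') as src:\n"
            "    shutil.copyfileobj(src, sys.stdout.buffer)\n"
        )
        proc = subprocess.Popen(
            [sys.executable, "-c", copy_script, path],
            stdout=subprocess.PIPE,
        )
        try:
            return [obj["name"] for obj in iter_json_lines(proc.stdout)]
        finally:
            proc.stdout.close()
            proc.wait()
```

Because the stream is read once from front to back, nothing needs to be written to local disk. The trade-off is that the reader cannot seek backwards or reopen the dump without restarting the process that feeds the pipe.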
