Support using an existing file obj

Matthias Mullie requested to merge mlitn/html-dumps:fileobj into main

This provides more flexibility: for example, the dump no longer needs to be a local file.

An example use case is running within a Spark job, where there is no local file: there is no access to the NFS share and not enough disk space to download the full dumps, but I can keep the dumps in HDFS and access them from within the job like so:

import subprocess

# Stream the dump out of HDFS and hand the pipe to HTMLDump as a file object
cat = subprocess.Popen(['hdfs', 'dfs', '-cat', wiki_dump_path], stdout=subprocess.PIPE)
html_dump = HTMLDump(filepath=wiki_dump_path, fileobj=cat.stdout)
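
For reviewers wondering why a pipe is enough here, below is a minimal sketch of reading an archive from a non-seekable file object with the standard library. It is not the code in this merge request: it assumes, for illustration only, that a dump is a gzipped tar of newline-delimited JSON, and the names `iter_ndjson_from_tar` and `make_sample_archive` are hypothetical helpers, not part of the library.

```python
import io
import json
import tarfile


def iter_ndjson_from_tar(fileobj):
    """Yield one parsed JSON object per non-empty line of every file in the archive.

    Assumption: the archive is a gzipped tar of newline-delimited JSON files.
    """
    # Mode "r|gz" reads the archive strictly front to back, so fileobj only
    # needs a read() method; it does not have to be seekable (a pipe works).
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # In stream mode each member must be consumed before moving on.
            extracted = tar.extractfile(member)
            for line in extracted:
                if line.strip():
                    yield json.loads(line)


def make_sample_archive():
    """Build a tiny in-memory .tar.gz so the sketch runs without HDFS."""
    payload = b'{"name": "A"}\n{"name": "B"}\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return io.BytesIO(buffer.getvalue())


if __name__ == "__main__":
    for record in iter_ndjson_from_tar(make_sample_archive()):
        print(record["name"])
```

In the Spark case, `cat.stdout` from the snippet above would take the place of the in-memory sample, since a subprocess pipe supports sequential reads but not seeking.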
