
Datasets library for Spark

Fabian Kaelin requested to merge datasets into main

Library for accessing productionized research datasets generated by scheduled Airflow DAGs.

Usage:

from research_common import datasets
from research_common.spark import create_yarn_spark_session

# Create a Spark session running on YARN
spark = create_yarn_spark_session(app_id='my_app')

# Load the article features dataset as a Spark DataFrame
df = datasets.article_features(spark)
df.printSchema()
root
 |-- page_id: long (nullable = true)
 |-- revision_id: long (nullable = true)
 |-- revision_timestamp: string (nullable = true)
 |-- page_length: integer (nullable = true)
 |-- num_refs: integer (nullable = true)
 |-- num_wikilinks: integer (nullable = true)
 |-- num_categories: integer (nullable = true)
 |-- num_media: integer (nullable = true)
 |-- num_headings: integer (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- time_partition: string (nullable = true)
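
A minimal sketch of downstream use, assuming only standard PySpark DataFrame operations. The column names come from the schema above; the 'enwiki' and time_partition values are placeholder examples, not values defined by this library:

from pyspark.sql import functions as F

# Restrict to a single wiki and one partition snapshot (placeholder values)
enwiki_features = (
    df.filter(F.col('wiki_db') == 'enwiki')
      .filter(F.col('time_partition') == '2024-01')
)

# Inspect a few rows of selected feature columns
enwiki_features.select('page_id', 'revision_id', 'num_refs', 'num_wikilinks').show(5)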
