Datasets library for Spark
Library for accessing productionized research datasets generated by scheduled Airflow DAGs.
Usage:
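from research_common import datasets
from research_common.spark import create_yarn_spark_session
# Start a Spark session on YARN, then load the dataset as a Spark DataFrame.
spark = create_yarn_spark_session(app_id='my_app')
df = datasets.article_features(spark)
# Print the column names and types of the loaded dataset.
df.printSchema()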
root
|-- page_id: long (nullable = true)
|-- revision_id: long (nullable = true)
|-- revision_timestamp: string (nullable = true)
|-- page_length: integer (nullable = true)
|-- num_refs: integer (nullable = true)
|-- num_wikilinks: integer (nullable = true)
|-- num_categories: integer (nullable = true)
|-- num_media: integer (nullable = true)
|-- num_headings: integer (nullable = true)
|-- wiki_db: string (nullable = true)
|-- time_partition: string (nullable = true)
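
Because the dataset is regenerated on a schedule, a downstream job may want to confirm that the columns it relies on are still present with the expected types. The following is a minimal sketch of such a check, not part of this library: the names `EXPECTED_ARTICLE_FEATURES` and `schema_problems` are hypothetical, and the expected types are simply transcribed from the schema printed above. It assumes the input is the list of `(column, type)` pairs that pyspark exposes as `df.dtypes`, where `long` is reported as `bigint` and `integer` as `int`.

```python
# Hypothetical helper, not part of research_common.
# Expected columns, transcribed from the printSchema() output above.
# Type names follow the convention of pyspark's DataFrame.dtypes
# (long -> "bigint", integer -> "int").
EXPECTED_ARTICLE_FEATURES = {
    "page_id": "bigint",
    "revision_id": "bigint",
    "revision_timestamp": "string",
    "page_length": "int",
    "num_refs": "int",
    "num_wikilinks": "int",
    "num_categories": "int",
    "num_media": "int",
    "num_headings": "int",
    "wiki_db": "string",
    "time_partition": "string",
}


def schema_problems(dtypes, expected=EXPECTED_ARTICLE_FEATURES):
    """Compare (column, type) pairs against the expected mapping.

    Returns a list of human-readable problems; an empty list means every
    expected column is present with the expected type. Extra columns in
    `dtypes` are ignored.
    """
    actual = dict(dtypes)
    problems = []
    for name, type_name in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != type_name:
            problems.append(
                f"{name}: expected {type_name}, got {actual[name]}"
            )
    return problems
```

With the `df` from the usage example, `schema_problems(df.dtypes)` would be expected to return an empty list as long as the schema still matches the one shown here. The check takes plain tuples rather than a DataFrame so it can be exercised without a Spark session.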