transform.py ($10) · Snippets

Nice!

I was able to run your script:

+--------+------------------+                                                   
|database|COUNT_OF_NEW_PAGES|
+--------+------------------+
|  enwiki|              6696|
+--------+------------------+

A couple of minor things:

spark.run should be changed to spark.sql (run is not a SparkSession method).
pandas_df is actually a Spark DataFrame. Conceptually it is similar to a Pandas DataFrame, but it has different semantics. You can convert it to Pandas with the toPandas method (pandas_df = pandas_df.toPandas()), but our best practices discourage this kind of collect-on-the-driver pattern, since it pulls the whole result into driver memory.
A Spark DataFrame does not have a rename method. To rename columns of an existing Spark DataFrame, use pandas_df = pandas_df.withColumnRenamed('domain', 'DOMAIN').withColumnRenamed('count(1)', 'COUNT_OF_NEW_PAGES') instead.
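
If more columns need renaming later, the chained withColumnRenamed calls can be folded into a small helper. This is only a sketch: rename_columns and RENAMES are hypothetical names of mine, not part of your script, and the helper assumes it is given a Spark DataFrame (it only calls withColumnRenamed on it).

```python
from functools import reduce

# Hypothetical mapping, built from the column names mentioned above.
RENAMES = {"domain": "DOMAIN", "count(1)": "COUNT_OF_NEW_PAGES"}


def rename_columns(df, renames):
    """Return a new DataFrame with each old column name replaced by its new name.

    `df` is assumed to be a pyspark.sql.DataFrame; only its
    withColumnRenamed(existing, new) method is used. Spark DataFrames are
    immutable, so each call returns a new DataFrame and `df` is unchanged.
    """
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )


# Usage in the script (assumes an active SparkSession `spark` and a SQL string `query`):
#   spark_df = spark.sql(query)
#   spark_df = rename_columns(spark_df, RENAMES)
#   spark_df.show()
```

Naming the variable spark_df rather than pandas_df would also make it clearer which kind of DataFrame the rest of the script is working with.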
