• Nice!

    I was able to run your script:

    |  enwiki|              6696|

    A couple of minor things:

    • spark.run should be changed to spark.sql (run is not a SparkSession method).
    • pandas_df is actually a Spark DataFrame. Conceptually it's similar to a Pandas DataFrame, but it has different semantics. You can convert it to Pandas with the toPandas method (pandas_df = pandas_df.toPandas()), but our best practices discourage this kind of collect-on-the-driver pattern.
    • Because it is a Spark DataFrame, pandas_df does not have a rename method. To rename columns of an existing Spark DataFrame, you should do pandas_df = pandas_df.withColumnRenamed('domain', 'DOMAIN').withColumnRenamed('count(1)', 'COUNT_OF_NEW_PAGES') instead.
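
Putting the points above together, here is a minimal sketch of how the fixed script could look. The function name run_and_rename and its query parameter are hypothetical (they are not from your script), and I'm assuming the query yields the columns domain and count(1) as in your output. The type hints are plain strings so the snippet does not depend on how pyspark is imported in your environment.

```python
def run_and_rename(spark: "SparkSession", query: str) -> "DataFrame":
    """Run a SQL query and rename the result columns without leaving Spark.

    `spark` is an existing SparkSession; `query` is the SQL string from the
    script (both are assumptions for this sketch).
    """
    # spark.sql (not spark.run) executes the query and returns a Spark DataFrame.
    df = spark.sql(query)

    # Spark DataFrames have no rename(); chain withColumnRenamed instead.
    # Renaming is lazy and stays distributed, so nothing is collected on the driver.
    return (
        df.withColumnRenamed("domain", "DOMAIN")
        .withColumnRenamed("count(1)", "COUNT_OF_NEW_PAGES")
    )
```

Calling run_and_rename(spark, query).show() would then print the renamed columns. Naming the variable something like pages_df rather than pandas_df would also make it clearer that the object is a Spark DataFrame.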