Pandas is a great library, but it is a single-machine tool with no built-in parallelism, which means it uses only one CPU core. Luckily, with Spark you can port pretty much any piece of pandas DataFrame computation to the Apache Spark parallel computation framework.
You can find Apache Spark here. There are also managed Spark services available, so you no longer have to build your own environment: just click a few buttons and start coding! AWS Glue is a fully managed serverless option, and Databricks supports both AWS and Azure, so click here to give it a try.
Key Differences
1. Immutability of Spark DataFrames
The most critical difference is that DataFrames in Spark are immutable, which means you can't modify the data in place like this:
pandas_df['col_A'] = pandas_df['col_B'] + pandas_df['col_C']
But don't panic: you can transform the DataFrame into a new one like this:
spark_df = spark_df.withColumn('col_A', spark_df['col_B'] + spark_df['col_C'])
2. Distributed DataFrames
If you load a file into a pandas DataFrame, the order of the records is the same as in the file, but things are totally different in Spark. Because Spark may load the file in parallel, there is no guarantee of record order. But you can always sort the DataFrame and then take the first rows.
Creating DataFrames
1. Load From Files
# pandas way
import pandas as pd
pdf = pd.read_csv("/data/filename.csv", header='infer', sep=',', quotechar='"')
# spark
df = spark.read.csv("/data/filename.csv", header=True, inferSchema=True, sep=',', quote='"')
df.show()
# or df = spark.read.load(path='', format='csv', schema='', header=True)
# or df = spark.read.json('file.json')
2. Create DataFrames
# pandas
import numpy as np
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]})
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['colA', 'colB'])
# with type
df = spark.createDataFrame([(x,) for x in range(10)], 'colA: int')
df.show()
DataFrame Operations
1. Select Columns (similar syntax)
# pandas
import pandas as pd
df = pd.DataFrame({'colA': [4,5,6,7], 'colB': [10,20,30,40], 'colC': [100,50,-30,-50]})
print(df.colA, df['colB'], df[['colB','colC']])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
# only return a Column object
print(df.A, df['A'])
# return a dataframe
df[['A', 'B']].show()
df.select('A').show()
df.select(['A', 'B']).show()
2. Add Column or Modify Existing Column
# pandas (in-place)
df['newA'] = df['A'] * 2
# spark (return a new dataframe)
df = df.withColumn('newA', df.B * 2)
# can also change an existing column
import pyspark.sql.functions as F
df = df.withColumn('B', F.when(F.col('B').isNull(), 500).otherwise(F.col('B') + 200))
3. Filter Rows
# pandas
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df[df.A.isin([3,4,7]) & (df.B > 1)]
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
df[df.A.isin([3,4,7]) & (df.B > 1)].show()
# sql is also available
df.filter("A in(3,4,7) and B > 1").show() # or df.where(...)
4. Aggregations
# pandas
df.groupby('A').agg({'A': 'sum', 'B': 'mean'})
# spark
import numpy as np
df = spark.createDataFrame([(int(x), int(x*2)) for x in np.random.randint(1,10,50)], ['A', 'B'])
df.groupBy('A').agg({'A': 'sum', 'B': 'mean'}).orderBy('A').show()
df.groupBy("A").avg("B").show()
import pyspark.sql.functions as F
df.groupBy("A").agg(F.avg("B").alias('avg_B'), F.min("B")).show()
5. Joins
# pandas
dfA = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfB = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfA.merge(dfB, on='B', suffixes=('_1', '_2'), how='left')
# spark
dfA = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
dfB = spark.createDataFrame([(i,i*2) for i in range(10)], ['B', 'D'])
dfC = dfA.join(dfB, on='B', how='left_outer')
6. Unions
# pandas
import pandas as pd
pd.concat([dfA, dfB], axis=0).reset_index(drop=True)
# spark
dfC = dfA.union(dfB)
7. Advanced
7.1 UDFs
import pyspark.sql.functions as F
import pyspark.sql.types as T
maturity_udf = F.udf(lambda age: "adult" if age >=18 else "child", T.StringType())
df = spark.createDataFrame([(i,i*2) for i in range(5,15)], ['no', 'age'])
df.withColumn("maturity", maturity_udf(df.age)).show()
7.2 Windowing
import pyspark.sql.functions as F
from pyspark.sql.window import Window
import numpy as np
df = spark.createDataFrame([(int(i),int(i*2)) for i in np.random.randint(5,15,50)], ['A', 'B'])
df.withColumn('cc', F.count('*').over(Window.partitionBy('A').orderBy('B'))).where('cc > 1').show()