From Pandas to Spark DataFrames

Pandas is a great library, but it is a single-machine tool with no built-in parallelism, which means it uses only one CPU core. Luckily, you can port pretty much any piece of Pandas DataFrame computation to Apache Spark's parallel computation framework.

You can find Apache Spark here. There are also managed Spark services available, so you no longer have to build your own environment - just click a few buttons and start coding! AWS Glue is a fully managed serverless environment, and Databricks supports both AWS and Azure - click here to give it a try.
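All of the Spark snippets below assume that a SparkSession object called spark already exists (the notebooks on Glue and Databricks give you one out of the box). If you run PySpark locally instead, a minimal sketch to create it yourself looks like this (the app name is just a placeholder):

# minimal local setup, assuming pyspark is installed (e.g. pip install pyspark)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pandas-to-spark") \
    .getOrCreate()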

Key Differences

1. Immutability of Spark DataFrames

The most critical difference is that Spark DataFrames are immutable, which means you can't modify the data in place like this:

pandas_df['col_A'] = pandas_df['col_B'] + pandas_df['col_C']

But don't panic - you can use something like the following to transform the DataFrame into a new one:

spark_df = spark_df.withColumn('col_A', spark_df['col_B'] + spark_df['col_C'])
2. Distributed DataFrames

If you load a file into a Pandas DataFrame, the order of the records is the same as in the file, but things are totally different in Spark. Because Spark may load the file in parallel across several partitions, there is no guarantee on the order of the rows. You can, however, always sort the DataFrame and take the first rows from the result, as sketched below.
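A quick sketch of imposing an explicit order before taking rows (colA here is just a hypothetical column name):

# sort explicitly, then take rows; 'colA' is a placeholder column name
first_row = spark_df.orderBy('colA').first()
top_ten = spark_df.orderBy(spark_df['colA'].desc()).limit(10)
top_ten.show()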

Creating DataFrames

1. Load From Files
# pandas way
import pandas as pd
pdf = pd.read_csv("/data/filename.csv", header='infer', sep=',', quotechar='"')

# spark
df = spark.read.csv("/data/filename.csv", header=True, inferSchema=True, sep=',', quote='"')
df.show()

# or df = spark.read.load(path='', format='csv', schema='', header=True)
# or df = spark.read.json('file.json')
2. Create DataFrames
# pandas
import numpy as np
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]})
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['colA', 'colB'])
# with type
df = spark.createDataFrame([(x,) for x in range(10)], 'colA: int')
df.show()
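Since the whole point is moving from Pandas, note that you can also build a Spark DataFrame straight from a Pandas one and convert back again. A minimal sketch, assuming the pdf loaded in the Pandas example above:

# pandas -> spark and back again
sdf = spark.createDataFrame(pdf)
pdf_back = sdf.toPandas()  # collects all rows to the driver, so keep it small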

DataFrame Operations

1. Select Columns (similar syntax)
# pandas
import pandas as pd
df = pd.DataFrame({'colA': [4,5,6,7], 'colB': [10,20,30,40], 'colC': [100,50,-30,-50]})
print(df.colA, df['colB'], df[['colB','colC']])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
# only return a Column object
print(df.A, df['A'])

# return a dataframe
df[['A', 'B']].show()
df.select('A').show()
df.select(['A', 'B']).show()
2. Add Column or Modify Existing Column
# pandas (in-place)
df['newA'] = df['A'] * 2

# spark (return a new dataframe)
df = df.withColumn('newA', df.B * 2)
# can also change an existing column
import pyspark.sql.functions as F
df = df.withColumn('B', F.when(F.col('B').isNull(), 500).otherwise(F.col('B') + 200))
3. Filter Rows
# pandas
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df[df.A.isin([3,4,7]) & (df.B > 1)]

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
df[df.A.isin([3,4,7]) & (df.B > 1)].show()
# sql is also available
df.filter("A in(3,4,7) and B > 1").show() # or df.where(...)
4. Aggregations
# pandas
df.groupby('A').agg({'A': 'sum', 'B': 'mean'})

# spark
import numpy as np
df = spark.createDataFrame([(int(x), int(x*2)) for x in np.random.randint(1,10,50)], ['A', 'B'])
df.groupBy('A').agg({'A': 'sum', 'B': 'mean'}).orderBy('A').show()
df.groupBy("A").avg("B").show()
import pyspark.sql.functions as F
df.groupBy("A").agg(F.avg("B").alias('avg_B'), F.min("B")).show()
5. Joins
# pandas
dfA = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfB = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfA.merge(dfB, on='B', suffixes=('_1', '_2'), how='left')

# spark
dfA = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
dfB = spark.createDataFrame([(i,i*2) for i in range(10)], ['B', 'D'])
dfC = dfA.join(dfB, on='B', how='left_outer')
6. Unions
# pandas
import pandas as pd
pd.concat([dfA, dfB], axis=0).reset_index(drop=True)

# spark
dfC = dfA.union(dfB)
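Keep in mind that union matches columns by position, not by name - dfA's columns are (A, B) while dfB's are (B, D), so the result above just stacks the values positionally. A sketch of lining the names up first (unionByName is available since Spark 2.3):

# rename dfB's columns so they line up with dfA, then union
dfD = dfA.union(dfB.toDF('A', 'B'))
# or match by column name instead of position
dfE = dfA.unionByName(dfB.toDF('A', 'B'))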
7. Advanced
7.1 UDFs
import pyspark.sql.functions as F
import pyspark.sql.types as T
maturity_udf = F.udf(lambda age: "adult" if age >=18 else "child", T.StringType())
df = spark.createDataFrame([(i,i*2) for i in range(5,15)], ['no', 'age'])
df.withColumn("maturity", maturity_udf(df.age)).show()
7.2 Windowing
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(int(i),int(i*2)) for i in np.random.randint(5,15,50)], ['A', 'B'])
df.withColumn('cc', F.count('*').over(Window.partitionBy('A').orderBy('B'))).where('cc > 1').show()