From Pandas to Spark DataFrames

This post compares the key differences between Pandas and Spark DataFrames, such as Spark's immutability and distributed nature. It covers how to create DataFrames, both by loading files and by building them directly, and walks through common DataFrame operations: selecting columns, adding or modifying columns, filtering rows, aggregations, joins, and advanced operations such as UDFs and window functions. It also touches on running parallel computation with Apache Spark and on managed services like Amazon Glue and Databricks.

Pandas is a great library, but it is a single-machine tool with no built-in parallelism, which means it uses only one CPU core. Luckily, with Spark you can port pretty much any piece of Pandas DataFrame computation to the Apache Spark parallel computation framework.

You can find Apache Spark here. There are also some managed Spark services available, so you don't have to build your own environment any more: just click a few buttons and start coding! Amazon Glue is a fully managed serverless environment, and Databricks supports both AWS and Azure - click here to have a try.

Key Differences

1. Immutability of Spark DataFrames

The most critical difference is that DataFrames in Spark are immutable, which means you can't modify the data in place the way you would in Pandas:

pandas_df['col_A'] = pandas_df['col_B'] + pandas_df['col_C']

But don't panic: you can use something like the following to transform the DataFrame into a new one:

spark_df = spark_df.withColumn('col_A', spark_df['col_B'] + spark_df['col_C'])
2. Distributed DataFrames

If you load a file into a Pandas DataFrame, the order of the records is the same as in the file, but things are totally different in Spark. Since Spark may load the file in parallel, there is no guarantee about the row order. But you can always sort the DataFrame and take the first rows from it, as shown below.
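For example (a minimal sketch with a throwaway three-row DataFrame), an explicit sort is what makes "the first rows" well defined:

# spark - row order is only deterministic after an explicit sort
df = spark.createDataFrame([(3,), (1,), (2,)], ['A'])
print(df.orderBy('A').head(2))   # the two smallest values of 'A', regardless of load order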

Creating DataFrames

1. Load From Files
# pandas way
import pandas as pd
pdf = pd.read_csv("/data/filename.csv", header='infer', sep=',', quotechar='"')

# spark
df = spark.read.csv("/data/filename.csv", header=True, inferSchema=True, sep=',', quote='"')
df.show()

# or df = spark.read.load(path='', format='csv', schema='', header=True)
# or df = spark.read.json('file.json')
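If you'd rather not rely on inferSchema, you can pass an explicit schema instead (a sketch assuming the CSV has an integer id column and a string name column; adjust to your file):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
])
df = spark.read.csv("/data/filename.csv", header=True, schema=schema)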
2. Create DataFrames
# pandas
import pandas as pd
import numpy as np
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]})
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['colA', 'colB'])
# with type
df = spark.createDataFrame([(x,) for x in range(10)], 'colA: int')
df.show()
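Since the whole point is moving from Pandas to Spark, it is also worth knowing that you can convert directly between the two (a minimal sketch; pdf here is a small Pandas DataFrame, and toPandas() collects everything to the driver, so only use it on small results):

import pandas as pd
pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
sdf = spark.createDataFrame(pdf)   # pandas -> spark
pdf2 = sdf.toPandas()              # spark -> pandas (collects all rows to the driver)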

DataFrame Operations

1. Select Columns (similar syntax)
# pandas
import pandas as pd
df = pd.DataFrame({'colA': [4,5,6,7], 'colB': [10,20,30,40], 'colC': [100,50,-30,-50]})
print(df.colA, df['colB'], df[['colB','colC']])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
# only return a Column object
print(df.A, df['A'])

# return a dataframe
df[['A', 'B']].show()
df.select('A').show()
df.select(['A', 'B']).show()
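You can also build the selection from column expressions, which scales better once you start renaming or computing columns (a small sketch using F.col and selectExpr):

import pyspark.sql.functions as F
df.select(F.col('A').alias('a'), (F.col('B') * 2).alias('b_doubled')).show()
df.selectExpr('A as a', 'B * 2 as b_doubled').show()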
2. Add Column or Modify Existing Column
# pandas (in-place)
df['newA'] = df['A'] * 2

# spark (returns a new dataframe)
df = df.withColumn('newA', df.A * 2)
# can also change an existing column
import pyspark.sql.functions as F
df = df.withColumn('B', F.when(F.col('B').isNull(), 500).otherwise(F.col('B') + 200))
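If the goal is simply to replace nulls in B with a default value, fillna is a more direct alternative to the when/otherwise pattern above (a sketch, assuming 500 is the desired default):

df = df.fillna({'B': 500})   # equivalent to df.na.fill({'B': 500})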
3. Filter Rows
# pandas
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df[df.A.isin([3,4,7]) & (df.B > 1)]

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
df[df.A.isin([3,4,7]) & (df.B > 1)].show()
# sql is also available
df.filter("A in(3,4,7) and B > 1").show() # or df.where(...)
4. Aggregations
# pandas
df.groupby('A').agg({'A': 'sum', 'B': 'mean'})

# spark
import numpy as np
df = spark.createDataFrame([(int(x), int(x*2)) for x in np.random.randint(1,10,50)], ['A', 'B'])
df.groupBy('A').agg({'A': 'sum', 'B': 'mean'}).orderBy('A').show()
df.groupBy("A").avg("B").show()
import pyspark.sql.functions as F
df.groupBy("A").agg(F.avg("B").alias('avg_B'), F.min("B")).show()
5. Joins
# pandas
dfA = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfB = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfA.merge(dfB, on='B', suffixes=('_1', '_2'), how='left')

# spark
dfA = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
dfB = spark.createDataFrame([(i,i*2) for i in range(10)], ['B', 'D'])
dfC = dfA.join(dfB, on='B', how='left_outer')
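If the join keys have different names in the two DataFrames, pass a join condition instead of a column name (a small sketch reusing dfA and dfB from above and joining dfA.A to dfB.D purely for illustration):

dfD = dfA.join(dfB, dfA.A == dfB.D, how='inner')
# both inputs keep their own 'B' column here, so disambiguate with dfA['B'] / dfB['B'] when selecting
dfD.select(dfA['B'], dfB['B']).show()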
6. Unions
# pandas
import pandas as pd
pd.concat([dfA, dfB], axis=0).reset_index(drop=True)

# spark
dfC = dfA.union(dfB)
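Note that Spark's union matches columns purely by position, not by name. If the two DataFrames have the same columns in a different order, unionByName (available since Spark 2.3) is safer. A minimal sketch with two throwaway DataFrames, dfX and dfY:

dfX = spark.createDataFrame([(1, 2)], ['A', 'B'])
dfY = spark.createDataFrame([(20, 10)], ['B', 'A'])
dfX.union(dfY).show()        # positional: 'A' ends up holding dfY's 'B' values
dfX.unionByName(dfY).show()  # by name: columns line up correctly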
7. Advanced
7.1 UDFs
import pyspark.sql.functions as F
import pyspark.sql.types as T
maturity_udf = F.udf(lambda age: "adult" if age >=18 else "child", T.StringType())
df = spark.createDataFrame([(i,i*2) for i in range(5,15)], ['no', 'age'])
df.withColumn("maturity", maturity_udf(df.age)).show()
7.2 Windowing
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(int(i), int(i*2)) for i in np.random.randint(5, 15, 50)], ['A', 'B'])
df.withColumn('cc', F.count('*').over(Window.partitionBy('A').orderBy('B'))).where('cc > 1').show()
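Another common windowing pattern is picking the top row (or top N) per group with row_number (a small sketch reusing the df above; it keeps the row with the largest B for each A):

w = Window.partitionBy('A').orderBy(F.desc('B'))
df.withColumn('rn', F.row_number().over(w)).where('rn = 1').show()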