Pandas is a great library, but it is a single-machine tool with no built-in parallelism, which means it uses only one CPU core. Luckily, with Spark you can port pretty much any piece of pandas DataFrame computation to the Apache Spark parallel computation framework.
You can find Apache Spark here. There are also managed Spark services available, so you no longer have to build your own environment: just click a few buttons and start coding! AWS Glue is a fully managed serverless option, and Databricks supports both AWS and Azure, so click here to give it a try.
Key Differences
1. Immutability of Spark DataFrames
The most critical difference is that DataFrames in Spark are immutable, which means you can't modify the data in place like this:
pandas_df['col_A'] = pandas_df['col_B'] + pandas_df['col_C']
But don't panic: you can transform the DataFrame into a new one like this:
spark_df = spark_df.withColumn('col_A', spark_df['col_B'] + spark_df['col_C'])
2. Distributed DataFrames
If you load a file into a pandas DataFrame, the order of the records is the same as in the file, but things are totally different in Spark. Because Spark may load the file in parallel, there is no guarantee of record order. But you can always sort the DataFrame and then take the first rows.
Creating DataFrames
1. Load From Files
# pandas way
import pandas as pd
pdf = pd.read_csv("/data/filename.csv", header='infer', sep=',', quotechar='"')
# spark
df = spark.read.csv("/data/filename.csv", header=True, inferSchema=True, sep=',', quote='"')
df.show()
# or df = spark.read.load(path='', format='csv', schema='', header=True)
# or df = spark.read.json('file.json')
2. Create DataFrames
# pandas
import numpy as np
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]})
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['colA', 'colB'])
# with type
df = spark.createDataFrame([(x,) for x in range(10)], 'colA: int')
df.show()
DataFrame Operations
1. Select Columns (similar syntax)
# pandas
import pandas as pd
df = pd.DataFrame({'colA': [4,5,6,7], 'colB': [10,20,30,40], 'colC': [100,50,-30,-50]})
print(df.colA, df['colB'], df[['colB','colC']])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
# only return a Column object
print(df.A, df['A'])
# return a dataframe
df[['A', 'B']].show()
df.select('A').show()
df.select(['A', 'B']).show()
2. Add Column or Modify Existing Column
# pandas (in-place)
df['newA'] = df['A'] * 2
# spark (return a new dataframe)
df = df.withColumn('newA', df.B * 2)
# can also change an existing column
import pyspark.sql.functions as F
df = df.withColumn('B', F.when(F.col('B').isNull(), 500).otherwise(F.col('B') + 200))
3. Filter Rows
# pandas
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df[df.A.isin([3,4,7]) & (df.B > 1)]
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
df[df.A.isin([3,4,7]) & (df.B > 1)].show()
# sql is also available
df.filter("A in(3,4,7) and B > 1").show() # or df.where(...)
4. Aggregations
# pandas
df.groupby('A').agg({'A': 'sum', 'B': 'mean'})
# spark
import numpy as np
df = spark.createDataFrame([(int(x), int(x*2)) for x in np.random.randint(1,10,50)], ['A', 'B'])
df.groupBy('A').agg({'A': 'sum', 'B': 'mean'}).orderBy('A').show()
df.groupBy("A").avg("B").show()
import pyspark.sql.functions as F
df.groupBy("A").agg(F.avg("B").alias('avg_B'), F.min("B")).show()
5. Joins
# pandas
dfA = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfB = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfA.merge(dfB, on='B', suffixes=('_1', '_2'), how='left')
# spark
dfA = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
dfB = spark.createDataFrame([(i,i*2) for i in range(10)], ['B', 'D'])
dfC = dfA.join(dfB, on='B', how='left_outer')
6. Unions
# pandas
import pandas as pd
pd.concat([dfA, dfB], axis=0).reset_index(drop=True)
# spark
dfC = dfA.union(dfB)
7. Advanced
7.1 UDFs
import pyspark.sql.functions as F
import pyspark.sql.types as T
maturity_udf = F.udf(lambda age: "adult" if age >=18 else "child", T.StringType())
df = spark.createDataFrame([(i,i*2) for i in range(5,15)], ['no', 'age'])
df.withColumn("maturity", maturity_udf(df.age)).show()
7.2 Windowing
import pyspark.sql.functions as F
from pyspark.sql.window import Window
import numpy as np
df = spark.createDataFrame([(int(i),int(i*2)) for i in np.random.randint(5,15,50)], ['A', 'B'])
df.withColumn('cc', F.count('*').over(Window.partitionBy('A').orderBy('B'))).where('cc > 1').show()