From Pandas to Spark DataFrames

Pandas is a great library, but it is a single-machine tool with no built-in parallelism, which means it uses only one CPU core. Luckily, you can port pretty much any piece of Pandas DataFrame computation to Apache Spark's parallel computation framework.

You can find Apache Spark here. There are also managed Spark services available, so you no longer have to build your own environment - just click a few buttons and start coding! AWS Glue is a fully managed serverless environment, and Databricks supports both AWS and Azure - click here to give it a try.
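All of the Spark snippets below assume that a SparkSession object called spark already exists (the notebooks on Glue and Databricks give you one out of the box). If you run PySpark locally instead, a minimal sketch to create it yourself looks like this (the app name is just a placeholder):

# minimal local setup, assuming pyspark is installed (e.g. pip install pyspark)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pandas-to-spark") \
    .getOrCreate()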

Key Differences

1. Immutability of Spark DataFrames

The most critical difference is that Spark DataFrames are immutable, which means you can't modify the data in place like this:

pandas_df['col_A'] = pandas_df['col_B'] + pandas_df['col_C']

But don't panic - you can use something like the following to transform the DataFrame into a new one:

spark_df = spark_df.withColumn('col_A', spark_df['col_B'] + spark_df['col_C'])
2. Distributed DataFrames

If you load a file into a Pandas DataFrame, the order of the records is the same as in the file, but things are totally different in Spark. Because Spark may load the file in parallel across several partitions, there is no guarantee on the order of the rows. You can, however, always sort the DataFrame and take the first rows from the result, as sketched below.
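A quick sketch of imposing an explicit order before taking rows (colA here is just a hypothetical column name):

# sort explicitly, then take rows; 'colA' is a placeholder column name
first_row = spark_df.orderBy('colA').first()
top_ten = spark_df.orderBy(spark_df['colA'].desc()).limit(10)
top_ten.show()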

Creating DataFrames

1. Load From Files
# pandas way
import pandas as pd
pdf = pd.read_csv("/data/filename.csv", header='infer', sep=',', quotechar='"')

# spark
df = spark.read.csv("/data/filename.csv", header=True, inferSchema=True, sep=',', quote='"')
df.show()

# or df = spark.read.load(path='', format='csv', schema='', header=True)
# or df = spark.read.json('file.json')
2. Create DataFrames
# pandas
import numpy as np
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]})
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['colA', 'colB'])
# with type
df = spark.createDataFrame([(x,) for x in range(10)], 'colA: int')
df.show()
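Since the whole point is moving from Pandas, note that you can also build a Spark DataFrame straight from a Pandas one and convert back again. A minimal sketch, assuming the pdf loaded in the Pandas example above:

# pandas -> spark and back again
sdf = spark.createDataFrame(pdf)
pdf_back = sdf.toPandas()  # collects all rows to the driver, so keep it small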

DataFrame Operations

1. Select Columns (similar syntax)
# pandas
import pandas as pd
df = pd.DataFrame({'colA': [4,5,6,7], 'colB': [10,20,30,40], 'colC': [100,50,-30,-50]})
print(df.colA, df['colB'], df[['colB','colC']])
# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
# only return a Column object
print(df.A, df['A'])

# return a dataframe
df[['A', 'B']].show()
df.select('A').show()
df.select(['A', 'B']).show()
2. Add Column or Modify Existing Column
# pandas (in-place)
df['newA'] = df['A'] * 2

# spark (return a new dataframe)
df = df.withColumn('newA', df.B * 2)
# can also change an existing column
import pyspark.sql.functions as F
df = df.withColumn('B', F.when(F.col('B').isNull(), 500).otherwise(F.col('B') + 200))
3. Filter Rows
# pandas
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
df[df.A.isin([3,4,7]) & (df.B > 1)]

# spark
df = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
df[df.A.isin([3,4,7]) & (df.B > 1)].show()
# sql is also available
df.filter("A in(3,4,7) and B > 1").show() # or df.where(...)
4. Aggregations
# pandas
df.groupby('A').agg({'A': 'sum', 'B': 'mean'})

# spark
import numpy as np
df = spark.createDataFrame([(int(x), int(x*2)) for x in np.random.randint(1,10,50)], ['A', 'B'])
df.groupBy('A').agg({'A': 'sum', 'B': 'mean'}).orderBy('A').show()
df.groupBy("A").avg("B").show()
import pyspark.sql.functions as F
df.groupBy("A").agg(F.avg("B").alias('avg_B'), F.min("B")).show()
5. Joins
# pandas
dfA = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfB = pd.DataFrame(np.arange(9).reshape(3,3), columns=['A', 'B', 'C'])
dfA.merge(dfB, on='B', suffixes=('_1', '_2'), how='left')

# spark
dfA = spark.createDataFrame([(i,i*2) for i in range(10)], ['A', 'B'])
dfB = spark.createDataFrame([(i,i*2) for i in range(10)], ['B', 'D'])
dfC = dfA.join(dfB, on='B', how='left_outer')
6. Unions
# pandas
import pandas as pd
pd.concat([dfA, dfB], axis=0).reset_index(drop=True)

# spark
dfC = dfA.union(dfB)
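Keep in mind that union matches columns by position, not by name - dfA's columns are (A, B) while dfB's are (B, D), so the result above just stacks the values positionally. A sketch of lining the names up first (unionByName is available since Spark 2.3):

# rename dfB's columns so they line up with dfA, then union
dfD = dfA.union(dfB.toDF('A', 'B'))
# or match by column name instead of position
dfE = dfA.unionByName(dfB.toDF('A', 'B'))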
7. Advanced
7.1 UDFs
import pyspark.sql.functions as F
import pyspark.sql.types as T
maturity_udf = F.udf(lambda age: "adult" if age >=18 else "child", T.StringType())
df = spark.createDataFrame([(i,i*2) for i in range(5,15)], ['no', 'age'])
df.withColumn("maturity", maturity_udf(df.age)).show()
7.2 Windowing
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(int(i),int(i*2)) for i in np.random.randint(5,15,50)], ['A', 'B'])
df.withColumn('cc', F.count('*').over(Window.partitionBy('A').orderBy('B'))).where('cc > 1').show()