Converting between pyspark.sql.DataFrame and pandas.DataFrame


Code example:

# -*- coding: utf-8 -*-
import os

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

# Configure Spark runtime parameters
os.environ["SPARK_HOME"] = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"
# Initialize the SparkContext
sc = SparkContext()

if __name__ == "__main__":
    print("1. Initialize a pandas DataFrame")
    # Initialize a pandas DataFrame
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['row1', 'row2'], columns=['c1', 'c2', 'c3'])

    # Print the data
    print(df)

    spark = SparkSession\
        .builder\
        .appName("testDataFrame")\
        .getOrCreate()

    sentenceData = spark.createDataFrame([(0.0, "I like Spark"),
                                          (1.0, "Pandas is useful"),
                                          (2.0, "They are coded by Python ")],
                                         ["label", "sentence"])
    # Show the data
    sentenceData.select("label").show()

    print("2. Convert the pandas.DataFrame to a spark.DataFrame")
    # pandas.DataFrame -> spark.DataFrame
    sqlContext = SQLContext(sc)
    spark_df = sqlContext.createDataFrame(df)

    # Show the data
    spark_df.select("c1").show()

    print("3. Convert the spark.DataFrame to a pandas.DataFrame")
    # spark.DataFrame -> pandas.DataFrame
    pandas_df = sentenceData.toPandas()

    # Print the data
    print(pandas_df)

The output is as follows:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/21 19:47:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/21 19:47:22 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.2.33.229 instead (on interface en0)
18/05/21 19:47:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/05/21 19:47:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
1. Initialize a pandas DataFrame
      c1  c2  c3
row1   1   2   3
row2   4   5   6
+-----+
|label|
+-----+
|  0.0|
|  1.0|
|  2.0|
+-----+

2. Convert the pandas.DataFrame to a spark.DataFrame
+---+
| c1|
+---+
|  1|
|  4|
+---+

3. Convert the spark.DataFrame to a pandas.DataFrame
   label                   sentence
0    0.0               I like Spark
1    1.0           Pandas is useful
2    2.0  They are coded by Python 

Process finished with exit code 0
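One caveat worth noting about step 2: when `createDataFrame` ingests a pandas DataFrame, the row index (`row1`, `row2` above) is silently dropped, as the `c1` output shows — only the columns survive the conversion. A minimal pandas-only sketch of the usual workaround, promoting the index to an ordinary column with `reset_index()` before handing the frame to Spark:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['row1', 'row2'], columns=['c1', 'c2', 'c3'])

# Spark's createDataFrame keeps only the columns, so promote the index
# to a regular column first if you need to preserve the row labels.
df_for_spark = df.reset_index()

print(df_for_spark.columns.tolist())  # ['index', 'c1', 'c2', 'c3']
```

Conversely, `toPandas()` always produces a default integer index (0, 1, 2, ... as in the step 3 output), so any index you care about must round-trip as a column.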

Reference: https://blog.csdn.net/zhurui_idea/article/details/72981715
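As a side note, the Spark schema in step 2 is inferred from the pandas dtypes. A quick pandas-only sketch of what Spark sees; in Spark 2.x, `int64` maps to `LongType`, `float64` to `DoubleType`, and `object` string columns to `StringType`, though the exact mapping should be treated as version-dependent:

```python
import pandas as pd

# Same shape of data as the example above, with one column per common dtype
df = pd.DataFrame({'c1': [1, 4], 'c2': [2.0, 5.0], 'c3': ['a', 'b']})

# createDataFrame infers the Spark schema from these dtypes:
# int64 -> LongType, float64 -> DoubleType, object -> StringType (Spark 2.x)
print(df.dtypes)
```

Columns with mixed Python types (or `object` columns that are not uniformly strings) can make schema inference fail or produce a surprising type, so it is worth checking `df.dtypes` before converting.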
