PySpark data reading and writing

DataFrame data

Saving

Create the data

output_df = sqlContext.createDataFrame(
            [{'a': [2, 3, 4], 'b': [1, 2, 3], 'c': ['a', 'b', 'c'], 'd': ['a','b','c']},
             {'a': [3,4,5], 'b': [4,5,6], 'c': ['c','c','d'], 'd': ['c','c','d']}])

Display the data

output_df.show()

+---------+---------+---------+---------+
|        a|        b|        c|        d|
+---------+---------+---------+---------+
|[2, 3, 4]|[1, 2, 3]|[a, b, c]|[a, b, c]|
|[3, 4, 5]|[4, 5, 6]|[c, c, d]|[c, c, d]|
+---------+---------+---------+---------+

At this point, saving the data directly raises an error:

output_df.write.csv('./my', header=True)

Py4JJavaError: An error occurred while calling o37.csv.
: java.lang.UnsupportedOperationException: CSV data source does not support array<bigint> data type.

The CSV writer cannot handle array columns, so the data needs to be preprocessed first:

def to_csv(x):
    # Serialize each array column into a comma-separated string,
    # since the CSV data source does not support array types
    return [','.join(str(i) for i in col) for col in x]
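The essence of to_csv is simply joining each list column into one comma-separated string. The same logic can be checked standalone, without Spark (the sample row below is illustrative):

```python
def to_csv(x):
    # Join every element of each array column into a comma-separated string
    return [','.join(str(i) for i in col) for col in x]

row = [[2, 3, 4], [1, 2, 3], ['a', 'b', 'c'], ['a', 'b', 'c']]
print(to_csv(row))  # ['2,3,4', '1,2,3', 'a,b,c', 'a,b,c']
```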

output_df = output_df.rdd.map(to_csv).toDF(['a', 'b', 'c', 'd'])
output_df.show()

+-----+-----+-----+-----+
|    a|    b|    c|    d|
+-----+-----+-----+-----+
|2,3,4|1,2,3|a,b,c|a,b,c|
|3,4,5|4,5,6|c,c,d|c,c,d|
+-----+-----+-----+-----+

Now the data saves successfully:

output_df.write.csv('./my', header=True)

The data is saved under the ./my directory, as a set of distributed CSV part files.
Here the data is saved in CSV format, and you can see there are 3 CSV files. This number is determined by the partition count: because we created only two rows, fewer than the number of partitions, 3 files were produced. If the partition count is less than or equal to the number of rows, the number of CSV files equals the partition count.
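The relationship between rows, partitions, and part files can be sketched with a toy round-robin model (an illustration only; Spark's actual partitioner may distribute rows differently):

```python
def partition_rows(rows, num_partitions):
    # Distribute rows over partitions round-robin; each partition
    # that holds data corresponds to one CSV part file with rows in it
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

parts = partition_rows(['row1', 'row2'], 8)
print(len(parts))                  # 8 partitions in total
print(sum(1 for p in parts if p))  # only 2 of them hold data
```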

Check the partition count

output_df.rdd.getNumPartitions()

8

Next, create a DataFrame with more than 8 rows

outputdf = sqlContext.createDataFrame(
            [{'a': [2, 3, 4], 'b': [1, 2, 3], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']},
             {'a': [3, 4, 5], 'b': [4, 5, 6], 'c': ['c', 'c', 'd'], 'd': ['c', 'c', 'd']}])

output_df = outputdf.rdd.map(to_csv).toDF(['a', 'b', 'c', 'd'])
output_df.show()

+-----+-----+-----+-----+
|    a|    b|    c|    d|
+-----+-----+-----+-----+
|2,3,4|1,2,3|a,b,c|a,b,c|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
|3,4,5|4,5,6|c,c,d|c,c,d|
+-----+-----+-----+-----+

Inspect each partition

partitions = output_df.rdd.glom().collect()
partitions

[[Row(a='2,3,4', b='1,2,3', c='a,b,c', d='a,b,c')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
  Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
  Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
  Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')],
 [Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d'),
  Row(a='3,4,5', b='4,5,6', c='c,c,d', d='c,c,d')]]

You can see that partitions has length 8, which means all 8 partitions hold data. This 8 is the number of CPU cores on my machine: I am using the default parallelism here, which equals the CPU count.
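glom() gathers each partition's rows into a list, which is why partitions is a list of lists. A plain-Python view of the same structure (the layout below mirrors the output above and is illustrative):

```python
# Each inner list stands for one partition, as returned by rdd.glom().collect()
partitions = [['r1'], ['r2', 'r3'], ['r4'], ['r5', 'r6'],
              ['r7'], ['r8', 'r9'], ['r10'], ['r11', 'r12']]

print(len(partitions))                  # 8 partitions
print(sum(len(p) for p in partitions))  # 12 rows in total
```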

Save the file

output_df.write.csv('./my', header=True)

Looking at the saved files, there are now 8 CSV part files, matching our partition count.
If we set the partition count manually, the file count follows that setting; by default it equals the number of CPU cores.

Reading

Read the saved CSV data back:

output_df = sqlContext.read.csv('./my', header=True)

Restore the data to a form usable for computation. Note that read.csv loads every column as strings, so each comma-joined string must be split and cast back:

def from_to_csv(s):
    # Split each comma-joined string back into a typed list; the length
    # check guards against empty strings (which would split to [''])
    return [[float(x) for x in s[0].split(',') if len(s[0]) > 0],
            [int(x) for x in s[1].split(',') if len(s[1]) > 0],
            [str(x) for x in s[2].split(',') if len(s[2]) > 0],
            [str(x) for x in s[3].split(',') if len(s[3]) > 0]]
output_df = output_df.rdd.map(from_to_csv).toDF(['a','b','c','d'])
output_df.show()

+---------------+---------+---------+---------+
|              a|        b|        c|        d|
+---------------+---------+---------+---------+
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[2.0, 3.0, 4.0]|[1, 2, 3]|[a, b, c]|[a, b, c]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
|[3.0, 4.0, 5.0]|[4, 5, 6]|[c, c, d]|[c, c, d]|
+---------------+---------+---------+---------+
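As a sanity check, to_csv and from_to_csv should round-trip a row. A Spark-free version of both (redefined here so the snippet is self-contained):

```python
def to_csv(x):
    # Serialize each array column into a comma-separated string
    return [','.join(str(i) for i in col) for col in x]

def from_to_csv(s):
    # Split each string back into a typed list; empty strings yield []
    return [[float(v) for v in s[0].split(',') if s[0]],
            [int(v) for v in s[1].split(',') if s[1]],
            [str(v) for v in s[2].split(',') if s[2]],
            [str(v) for v in s[3].split(',') if s[3]]]

row = [[2, 3, 4], [1, 2, 3], ['a', 'b', 'c'], ['a', 'b', 'c']]
restored = from_to_csv(to_csv(row))
print(restored)  # [[2.0, 3.0, 4.0], [1, 2, 3], ['a', 'b', 'c'], ['a', 'b', 'c']]
```

Note that column a comes back as floats rather than the original integers, matching the output of show() above, because from_to_csv casts it with float().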