spark
条件反射104 (Institute of Automation, University of Chinese Academy of Sciences)
Usage of flatMapValues in pyspark
```python
# flatMapValues
x = sc.parallelize([('A', (1, 2, 3)), ('B', (4, 5))])
# the function is applied to the entire value, then the result is flattened
y = x.flatMapValues(lambda x: [i**2 for i in x])
print(x.collect())  # [('A', (1, 2, 3)), ('B', (4, 5))]
print(y.collect())  # [('A', 1), ('A', 4), ('A', 9), ('B', 16), ('B', 25)]
```

(Original post, 2021-05-24)
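The flattening behaviour can be checked without a SparkContext; the helper below is a hypothetical pure-Python analogue of `RDD.flatMapValues`, written only to illustrate the semantics:

```python
def flat_map_values(pairs, f):
    # Apply f to each value and emit one (key, element) pair per
    # element of f's result, mirroring RDD.flatMapValues semantics.
    return [(k, v) for k, vs in pairs for v in f(vs)]

x = [('A', (1, 2, 3)), ('B', (4, 5))]
y = flat_map_values(x, lambda vs: [i**2 for i in vs])
print(y)  # [('A', 1), ('A', 4), ('A', 9), ('B', 16), ('B', 25)]
```

Note how each squared element becomes its own `(key, value)` pair, rather than staying grouped under one key.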
pyspark: removing None items from an rdd
```python
rdd.filter(lambda x: x is not None)
```

(Original post, 2021-05-24)
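The same predicate works on a plain Python list, which makes the behaviour easy to verify locally (the sample data here is hypothetical):

```python
data = [1, None, 2, None, 3]
# keep only the items that are not None, as rdd.filter does element-wise
cleaned = list(filter(lambda x: x is not None, data))
print(cleaned)  # [1, 2, 3]
```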
pyspark: converting between rdd and dataframe
```python
# rdd -> dataframe
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([('John', 30), ('Mary', 78)])
dataframe = spark.createDataFrame(rdd, ['name', 'age'])
# dataframe -> rdd: the .rdd attribute yields an RDD of Row objects
rdd2 = dataframe.rdd
```

(Original post, 2021-05-19)
pyspark rdd deduplication
To deduplicate an rdd in pyspark by a given column, you can use reduceByKey(): make the column to deduplicate on the key and the remaining columns the value, then keep one value per key.

```python
rdd = rdd.reduceByKey(lambda x, y: x)
```

(Original post, 2021-05-19)
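A minimal local sketch of the same idea, using hypothetical (key, value) rows. Like `reduceByKey(lambda x, y: x)`, it keeps a single representative per key; note that across partitions Spark does not guarantee *which* value survives, only that one does:

```python
rows = [('alice', 1), ('bob', 2), ('alice', 3)]
deduped = {}
for key, value in rows:
    # reduceByKey(lambda x, y: x) repeatedly keeps the left operand,
    # which locally amounts to keeping the first value seen per key
    deduped.setdefault(key, value)
print(sorted(deduped.items()))  # [('alice', 1), ('bob', 2)]
```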
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. :
Running pyspark on a server raised:

```
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.j
```

This OutOfMemoryError means the JVM could not spawn another OS thread — typically the per-user process/thread limit or the memory available for thread stacks is exhausted, rather than the Java heap.

(Original post, 2021-05-11)
rdd.foreach(print) raises SyntaxError: invalid syntax
Fix: first run

```python
from __future__ import print_function
```

and then run

```python
rdd.foreach(print)
```

In Python 2, `print` is a statement rather than a function, so it cannot be passed as an argument; the `__future__` import replaces it with the function form.

(Original post, 2021-04-26)
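With the import in effect, `print` is an ordinary callable and can be handed to higher-order functions — which is exactly what `rdd.foreach(print)` needs. A quick local check (on a plain list rather than an RDD):

```python
from __future__ import print_function  # a no-op on Python 3

# `print` is now a first-class function object
assert callable(print)
for item in ['a', 'b']:
    print(item)
```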