DataFrame operations in PySpark


广小辉


Category: AI Series, Big Data 1: DataFrames in Spark

Article link: https://blog.csdn.net/Galbraith_/article/details/86602738


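1. Queries

1.1 Row-level query operations

-- Display:

data.show(30) ------ prints the first 30 rows of the DataFrame (show() itself returns None)

-- Take a few rows:

data.head() ---------- return type: Row

data.take(5), data.head(5) -------- return type: list (of Row objects)

Note: the two return types are different!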

Print the schema as a tree, similar to DESC in SQL:

data.printSchema()

-- Total number of rows:

data.count() ----------- 3789

Number of rows where goods_id is null:

from pyspark.sql.functions import isnull
data1 = data.filter(isnull('goods_id'))

data1.count()

Number of rows where goods_id is not null: data.count() - data1.count()

-- Convert to a list ------ each element is a Row, and fields are read as row.column

rows = df.collect()

-- Deduplicate:

data1.select('goods_id').distinct() ----------- returns a DataFrame

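-- Random sampling:

data1.sample(False, 0.5, 0) ----------------- returns a DataFrame

The signature is data.sample(withReplacement=None, fraction=None, seed=None). fraction is the expected share of rows to keep, so the size of the result is approximate rather than exact.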

-- Random sampling in the SQL style:

# Take a random sample of 20 rows

sql = "select * from dm.dm_goods_detail_page_h where dt='2019020301' order by rand() limit 20"
rand_sql = ss.sql(sql)

1.2 Column-level query operations

-- Get the names of all columns (field names)

new_table.columns returns a list:

['item', 'user', 'time']

-- Select one or more columns

new_table['time'], new_table.time

Both return a Column object: Column<b'time'>

new_table.select('time')   # works, and is very convenient
new_table.select(new_table.time, new_table.user)
new_table.select(new_table['time'], new_table['user'])

Every call that goes through select returns a DataFrame.

-- The overloaded select method (it also accepts column expressions)

new_table.select(new_table.click, new_table.click + 1).show(5)

-- Sorting

Equivalent to ORDER BY ... DESC and ORDER BY ... ASC in SQL:

new_table.orderBy(new_table.item.desc()).show(5)

-- Filter by column value

new_table.where("item='163049'").show()

where() also returns a DataFrame.

Filter on NaN or null:

from pyspark.sql.functions import isnull, isnan
new_table = new_table.filter(isnull(new_table['click']))  # keep rows where click is null
new_table = new_table.filter(isnan(new_table.item))  # keep rows where item is NaN

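2. Adding and modifying data

2.1 Adding: creating a DataFrame

There are two ways to create a DataFrame: createDataFrame and toDF.

1. Convert a pandas DataFrame into a Spark DataFrame

df = pd.DataFrame()

spark_df = SparkSession.createDataFrame(df)

# This raises an error: createDataFrame has to be called on a SparkSession instance, not on the SparkSession class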
2. Create a DataFrame directly

columns = ['goods_id', 'recommends']
value = [(123, 'seg, ese')]
result = sparksession.createDataFrame(value, columns)

This is the recommended way to create a new DataFrame.

3. Convert a Spark DataFrame into a pandas DataFrame

table_df = table_data.toPandas()

4. Convert a Spark RDD into a DataFrame

from pyspark.sql import Row
row = Row("user", "item")
x = ['lisan','wangwu']
y = ['beer','milk']
new_df = ss.sparkContext.parallelize([row(x[i], y[i]) for i in range(2)]).toDF()

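Spark now ships many feature-engineering methods, so it is worth trying Spark's own feature-engineering tools (though it is best to stick to one toolkit and get fluent in it).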

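2.2 Adding a new column

The main function is withColumn:

new_table = new_table.withColumn('test', new_table['click'] * 2)
# withColumn takes the new column's name first, then a Column expression built from columns already in the data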

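2.3 Modifying an existing column's values, name, or type

-- Modify the values of an existing column

new_table = new_table.withColumn('time', new_table.time + 23)
new_table = new_table.withColumn('time', new_table.item + 23)

The difference between withColumn and select with expressions: select computes on a column and returns only the columns listed, which is handy for a quick show; withColumn is more flexible because it keeps every other column and adds or replaces just one.

-- Rename an existing column

new_table = new_table.withColumnRenamed('test', 'retest')

-- Change the data type of an existing column

new_table = new_table.withColumn("item", new_table["item"].cast("int"))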

-----------------------------------------------------------------------------------

Filtering on null values: data.filter(isnull(''))

data.na.drop()  # drop rows that contain a null in any column

data.na.drop(subset=["col1", "col2"])  # drop rows that have a null in col1 or col2
