Some Problems I Ran Into While Learning PySpark

Leviathan_Four

Categories: Notes, Big Data Development Technology
Tags: python, pyspark, RDD, SparkSQL

Article link: https://blog.csdn.net/weixin_45755831/article/details/120811750

1. Functions Called Without Parentheses

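When writing PySpark code, most calls need parentheses, but some apparent functions do not.
This confused me, so I went and read the source. It turns out that some methods are decorated with **@property**. Such a method is exposed as an attribute, usually a read-only one, so it is accessed by name and naturally takes no parentheses.
One example is rdd on a DataFrame, which returns the DataFrame's contents as an RDD object.
It is used like this: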

personRDD = personDF.rdd.map(lambda p: "Name: "+p[0]+", Age: "+str(p[1]))

Source of rdd:
Its docstring tells us that it returns an RDD.
[Screenshot of the rdd source omitted]
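
To make the difference concrete, here is a minimal plain-Python sketch of how **@property** behaves. The Person class and its fields are made up for illustration and are not part of PySpark; the point is only that a property is read without parentheses, while an ordinary method still needs them.

```python
class Person:
    """Toy class for illustration only; not a PySpark type."""

    def __init__(self, name, age):
        self._name = name
        self._age = age

    @property
    def name(self):
        # Exposed as an attribute: read it as person.name, no parentheses.
        return self._name

    def describe(self):
        # An ordinary method: it must be called with parentheses.
        return "Name: " + self._name + ", Age: " + str(self._age)


person = Person("Alice", 11)
print(person.name)        # property access, no parentheses
print(person.describe())  # method call, parentheses required

try:
    person.name = "Bob"   # no setter was defined, so assignment is rejected
except AttributeError as exc:
    print("cannot assign:", exc)
```

Because no setter is defined, the property is effectively read-only, which is presumably why DataFrame.rdd is exposed this way.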

2. Turning RDD Data into Data Python Can Work With Directly

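Sometimes we just want to see what is inside an RDD, or do something with the data locally, for example look at the first 10 items. For this we can use collect or take. collect returns the entire RDD as a list, while take accepts one argument, the number of elements to fetch.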

Here is what they return.
The result is a list, and in this case every element is a string.
[Screenshot of the output omitted]

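Below is the source of collect. The docstring says it returns a list containing all elements, and the note warns that it should only be used when the result is expected to be small, because all of the data is loaded into the driver's memory.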

    def collect(self):
        """
        Return a list that contains all of the elements in this RDD.

        .. note:: This method should only be used if the resulting array is expected
            to be small, as all the data is loaded into the driver's memory.
        """
        with SCCallSiteSync(self.context) as css:
            sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        return list(_load_from_socket(sock_info, self._jrdd_deserializer))

Below is the source of take. It returns the first num elements, scanning only as many partitions as it needs.

    def take(self, num):
        """
        Take the first num elements of the RDD.

        It works by first scanning one partition, and use the results from
        that partition to estimate the number of additional partitions needed
        to satisfy the limit.

        Translated from the Scala implementation in RDD#take().

        .. note:: this method should only be used if the resulting array is expected
            to be small, as all the data is loaded into the driver's memory.

        >>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
        [2, 3]
        >>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
        [2, 3, 4, 5, 6]
        >>> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
        [91, 92, 93]
        """
        items = []
        totalParts = self.getNumPartitions()
        partsScanned = 0

        while len(items) < num and partsScanned < totalParts:
            # The number of partitions to try in this iteration.
            # It is ok for this number to be greater than totalParts because
            # we actually cap it at totalParts in runJob.
            numPartsToTry = 1
            if partsScanned > 0:
                # If we didn't find any rows after the previous iteration,
                # quadruple and retry.  Otherwise, interpolate the number of
                # partitions we need to try, but overestimate it by 50%.
                # We also cap the estimation in the end.
                if len(items) == 0:
                    numPartsToTry = partsScanned * 4
                else:
                    # the first parameter of max is >=1 whenever partsScanned >= 2
                    numPartsToTry = int(1.5 * num * partsScanned / len(items)) - partsScanned
                    numPartsToTry = min(max(numPartsToTry, 1), partsScanned * 4)

            left = num - len(items)

            def takeUpToNumLeft(iterator):
                iterator = iter(iterator)
                taken = 0
                while taken < left:
                    try:
                        yield next(iterator)
                    except StopIteration:
                        return
                    taken += 1

            p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
            res = self.context.runJob(self, takeUpToNumLeft, p)

            items += res
            partsScanned += numPartsToTry

        return items[:num]
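
The partition-scanning loop above is easier to follow when run locally. The following is a rough plain-Python sketch, not Spark code: take_sketch is a hypothetical helper, a list of lists stands in for the RDD's partitions, and the extra rounds list is added only to show how many partitions each pass scans.

```python
def take_sketch(partitions, num):
    """Mimic the scanning strategy of RDD.take on a list of lists.

    `partitions` stands in for the RDD's partitions. This is a local
    illustration only and does not run any Spark job.
    """
    items = []
    total_parts = len(partitions)
    parts_scanned = 0
    rounds = []  # number of partitions actually scanned in each pass

    while len(items) < num and parts_scanned < total_parts:
        num_parts_to_try = 1
        if parts_scanned > 0:
            if len(items) == 0:
                # Nothing found so far: quadruple the number of partitions.
                num_parts_to_try = parts_scanned * 4
            else:
                # Extrapolate from the hit rate so far, overestimating by 50%.
                estimate = int(1.5 * num * parts_scanned / len(items)) - parts_scanned
                num_parts_to_try = min(max(estimate, 1), parts_scanned * 4)

        left = num - len(items)
        stop = min(parts_scanned + num_parts_to_try, total_parts)
        for part in partitions[parts_scanned:stop]:
            # Like takeUpToNumLeft: at most `left` elements per partition.
            items.extend(part[:left])
        rounds.append(stop - parts_scanned)
        parts_scanned += num_parts_to_try

    return items[:num], rounds


# 100 single-element partitions, keeping only values greater than 90.
filtered = [[x] if x > 90 else [] for x in range(100)]
print(take_sketch(filtered, 3))   # ([91, 92, 93], [1, 4, 20, 75])

# Five single-element partitions, as in the first doctest.
print(take_sketch([[2], [3], [4], [5], [6]], 2))   # ([2, 3], [1, 2])
```

In the first call the early partitions are empty, so the number scanned grows from 1 to 4 to 20 and then covers the remaining 75. This is why take can be much cheaper than collect when the wanted elements sit in the first few partitions.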

3. The Field Order of Row

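While working on a SparkSQL hot-words exercise, I built a DataFrame from Row objects. But when I printed the RDD of Rows, the fields in each row were not in the order I expected: I expected Row(word, cnt), but the actual order was the reverse.
[Screenshot of the output omitted]
This puzzled me. The DataFrame created from it had the same column order.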

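When querying with a SQL statement, you can specify the column order yourself.
When nothing else explained it, I read the source and found that Row sorts its fields by name when it is built from keyword arguments. Mystery solved! (This applies to Spark versions before 3.0; from Spark 3.0 on, with Python 3.6 or later, a Row built from named arguments keeps the order in which they are given.)

So the fix is either to use a SQL statement, or to specify the schema programmatically when creating the DataFrame from the RDD.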
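
The effect of sorting by field name can be reproduced without Spark. The sketch below is a simplified stand-in, not the actual Row implementation: make_row_sorted and make_row_ordered are hypothetical helpers that only imitate the two ways of fixing the field order.

```python
def make_row_sorted(**kwargs):
    """Hypothetical helper: order fields by name, as older PySpark Rows did."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


def make_row_ordered(field_names, *values):
    """Hypothetical helper: keep the field order chosen by the caller."""
    return list(field_names), tuple(values)


# Keyword arguments: the fields come back sorted, so cnt precedes word.
sorted_fields, sorted_values = make_row_sorted(word="spark", cnt=3)
print(sorted_fields, sorted_values)    # ['cnt', 'word'] (3, 'spark')

# Explicit field list, similar in spirit to supplying a schema.
ordered_fields, ordered_values = make_row_ordered(["word", "cnt"], "spark", 3)
print(ordered_fields, ordered_values)  # ['word', 'cnt'] ('spark', 3)
```

Since "cnt" sorts before "word" alphabetically, the first call matches the reversed order described above, while an explicit field list keeps the intended one.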

Specifying the schema programmatically:

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

conf = SparkConf().setAppName("test").setMaster("local")
spark = SparkSession.builder.config(conf=conf).getOrCreate()


def CodingMode():
    # Define the DataFrame schema; the field order here fixes the column order
    fieldNames = ["name", 'age']
    fieldTypes = [StringType(), IntegerType()]
    fields = [StructField(fieldName, fieldTypes[idx]) for idx, fieldName in enumerate(fieldNames)]
    schema = StructType(fields)
    # Read the data into an RDD
    lines = spark.sparkContext.textFile("../../data/people.txt")
    words = lines.map(lambda line: line.split(","))
    rows = words.map(lambda p: Row(p[0], int(p[1])))  ### diff 1

    # Create the DataFrame from the data and the schema
    df = spark.createDataFrame(rows, schema)  ### diff 2
    df.createOrReplaceTempView("people")
    personDF = spark.sql("select name, age from people where age >= 20")

    # personDF.show()
    result_rdd = personDF.rdd.map(lambda row: (row.name, row.age))
    result_rdd.foreach(print)


if __name__ == '__main__':
    CodingMode()

For reference, here is how the source itself documents the use of Row.

    >>> row = Row(name="Alice", age=11)
    >>> row
    Row(age=11, name='Alice')
    >>> row['name'], row['age']
    ('Alice', 11)
    >>> row.name, row.age
    ('Alice', 11)
    >>> 'name' in row
    True
    >>> 'wrong_key' in row
    False

    Row also can be used to create another Row like class, then it
    could be used to create Row objects, such as

    >>> Person = Row("name", "age")
    >>> Person
    <Row(name, age)>
    >>> 'name' in Person
    True
    >>> 'wrong_key' in Person
    False
    >>> Person("Alice", 11)
    Row(name='Alice', age=11)
