The difference between cleaning with and without a header row

Copyright notice: this is an original article by the blog author, released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/sinat_26566137/article/details/81190580

(1) Cleaning with a header row
When cleaning data that carries a header row, you often run into fields that contain malformed values, so a column's actual content does not match its declared type. Spark then raises type-related errors during processing that are hard to trace back to the offending rows.
(2) Cleaning without a header row
When cleaning without a header row, you can sidestep this by first reading every field as StringType, and only then casting from StringType to the target type (for example, int). If a cast fails, fall back to a default value. None is usually a better default than the empty string ""; that said, depending on the situation you can also pick a per-type default, e.g. the empty string "" for string columns.
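The cast-with-default idea can be sketched as a plain helper. The name `str_to_int` mirrors the `clean_utils` import used in the listing below, but this body is an assumption, not the actual library code:

```python
def str_to_int(value, default=None):
    """Cast a string field to int; return `default` (None by default) on failure.

    Intended for columns that were first read as StringType and may
    contain malformed values.
    """
    if value is None:
        return default
    try:
        return int(str(value).strip())
    except ValueError:
        return default
```

Returning None (rather than "") keeps the column castable to an integer Spark type later, since None maps cleanly to a SQL NULL.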

import sys

# Python 2 only: force UTF-8 as the default string encoding.
reload(sys)
sys.setdefaultencoding('utf8')

from biz.spark_session_utils.spark_session_utils import SparkSessionUtils
from pyspark.sql.types import StringType, StructField, StructType, IntegerType, ShortType, LongType
from pyspark.sql import Row
from pyspark.sql.functions import udf
from sc_risk_model_clean_utils.clean_utils.clean_utils import parse_date
from sc_risk_model_clean_utils.clean_utils.clean_utils import name_clean
from sc_risk_model_clean_utils.clean_utils.clean_utils import str_to_int
from conf.conf import HDFS_FS_PREFIX
from biz.sub.dag import TableInfo


class CleanTrademarkInfo(SparkSessionUtils):
    name_clean_udf = udf(name_clean, StringType())
    trademark_sample_table_input = "%s/test/trademark_info_sample.csv" % HDFS_FS_PREFIX
    trademark_sample_table_output = "%s/test/trademark_info_sample_rst.csv" % HDFS_FS_PREFIX
    raw_trademark_info_field = ["sc_data_id", "company_name", "reg_publication_time", "pre_publication_time", "apply_time", "int_type_code"]
    trademark_info_schema = StructType(
        [
            StructField("sc_data_id", StringType(), True),
            StructField('company_name', StringType(), True),
            StructField('reg_publication_time', StringType(), True),
            StructField('pre_publication_time', StringType(), True),
            StructField('apply_time', StringType(), True),
            StructField('int_type_code', StringType(), True)]
    )

    # Alternative: type the column directly (note IntegralType is abstract; use IntegerType):
    # StructField('int_type_code', IntegerType(), True),
    # Or build an all-string schema programmatically:
    # [StructField(field_name, StringType(), True) for field_name in raw_trademark_info_field]

    def set_table_info(self):
        depend_tb = TableInfo("trademark_sample", self.trademark_info_schema, self.trademark_sample_table_input)
        rst_tb = TableInfo("trademark_sample_rst", self.trademark_info_schema, self.trademark_sample_table_output)
        self.add_depend_tables(depend_tb)
        self.add_result_tables(rst_tb)

    def load_file(self):
        self.df_trademark_info_sample = self.session.read.load(
            self.trademark_sample_table_input, format="csv",
            schema=self.trademark_info_schema, delimiter=',')
        self.df_trademark_info_sample.createOrReplaceTempView("trademark_info_sample")

    @staticmethod
    def clean_row_name(row, raw_trademark_info_field):
        row_dict = dict([(k, row[k]) for k in raw_trademark_info_field])

        row_dict["company_name"] = name_clean(row_dict["company_name"])
        row_dict["reg_publication_time"] = parse_date(row_dict["reg_publication_time"])
        row_dict["pre_publication_time"] = parse_date(row_dict["pre_publication_time"])
        row_dict["apply_time"] = parse_date(row_dict["apply_time"])
        row_dict["int_type_code"] = str_to_int(row_dict["int_type_code"])

        return row_dict

    def rdd_map_reduce_demo(self):
        # Note: when working on the RDD, keep `self` out of the map/filter
        # lambdas -- copy the needed attributes into locals first, so the
        # closure does not capture the whole driver object.
        clean_row_name = self.clean_row_name
        raw_trademark_info_field = self.raw_trademark_info_field
        self.df_trademark_info_sample = self.df_trademark_info_sample \
            .rdd \
            .map(lambda row: Row(**clean_row_name(row, raw_trademark_info_field))) \
            .filter(lambda row: len(row["company_name"]) >= 5) \
            .toDF(self.trademark_info_schema)

        self.df_trademark_info_sample.createOrReplaceTempView("trademark_info_sample")
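The "copy `self.*` into locals before the lambda" pattern used above exists because Spark serializes the closure of each map/filter function; a captured `self` would drag the entire driver object (SparkSession included) into the serializer and fail. The pattern can be illustrated with a plain-Python stand-in for the RDD (no Spark required; the `Demo` class and its data are made up for illustration):

```python
class Demo:
    # Stand-in for an RDD of Row objects.
    rows = [{"company_name": "  ACME Corp "}, {"company_name": "x"}]

    @staticmethod
    def clean(row):
        # Stand-in for clean_row_name: normalize one field.
        row = dict(row)
        row["company_name"] = row["company_name"].strip()
        return row

    def run(self):
        # Pull the bound references into local variables first, so the
        # lambdas below close over plain, serializable objects instead
        # of `self`.
        clean = self.clean
        cleaned = map(lambda row: clean(row), self.rows)
        return [r for r in cleaned if len(r["company_name"]) >= 4]
```

In real Spark code the same two lines (`clean_row_name = self.clean_row_name`, etc.) appear just before the `.rdd.map(...)` chain, exactly as in `rdd_map_reduce_demo` above.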

    # def udf_clean_demo(self):
    #     self.df_trademark_info_sample = self.df_trademark_info_sample.withColumn("company_name", self.name_clean_udf(
    #         self.df_trademark_info_sample["name"]))
    #     self.df_trademark_info_sample.createOrReplaceTempView("trademark_info_sample")
    #
    def run_task(self):
        self.load_file()
        self.rdd_map_reduce_demo()
        self.session.sql("""
        select * from trademark_info_sample
        """).write.save(self.trademark_sample_table_output, format="csv", header=False, delimiter=',',mode="overwrite")