Writing data to Iceberg with PySpark

This post covers setting up PySpark in a local Python environment on the D: drive: adding the required dependency jars, creating a conf directory, and adding hdfs-site.xml and hive-site.xml. It then walks through Python code that creates a SparkSession, sets the relevant configuration, and processes data: reading, transforming, converting Chinese characters to pinyin, and writing the result. Several write strategies are shown, including deleting the existing partition before appending new data, and creating or replacing a temporary view and inserting from it.

PySpark environment setup

1. Put the following jars into D:\Python\python37\Lib\site-packages\pyspark\jars:
iceberg-spark3-runtime-0.13.1.jar
alluxio-2.6.2-client.jar

2. Create a conf folder under D:\Python\python37\Lib\site-packages\pyspark and put hdfs-site.xml and hive-site.xml into it.
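
If copying jars into the PySpark install directory is inconvenient, the same dependencies can also be handed to Spark when the session is built via the spark.jars option. This is only a sketch; the D:\jars paths below are hypothetical and should point at wherever the two jars actually live:

from pyspark.sql import SparkSession

# Hypothetical local paths (assumption): adjust to the real location of the jars.
jars = ",".join([
    r"D:\jars\iceberg-spark3-runtime-0.13.1.jar",
    r"D:\jars\alluxio-2.6.2-client.jar",
])

spark = SparkSession.builder \
    .config("spark.jars", jars) \
    .getOrCreate()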

Code

 
import os
import warnings

import argparse

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructField,StructType,DecimalType,IntegerType,TimestampType, StringType

import pypinyin

warnings.filterwarnings("ignore")

def get_spark():
    os.environ.setdefault('HADOOP_USER_NAME', 'root')
    spark = SparkSession.builder \
        .config('spark.sql.debug.maxToStringFields', 2000) \
        .config('spark.debug.maxToStringFields', 2000) \
        .getOrCreate()
    spark.conf.set("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    spark.conf.set("spark.sql.catalog.iceberg.type", "hive")
    spark.conf.set("spark.sql.catalog.iceberg.uri", "thrift://192.168.x.xx:9083")


    spark.conf.set("spark.sql.iceberg.handle-timestamp-without-timezone", True)
    # Cannot handle timestamp without timezone fields in Spark. Spark does not natively support this type but if you would like to handle all timestamps as timestamp with timezone set 'spark.sql.iceberg.handle-timestamp-without-timezone' to true

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

    # spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    # https://www.cnblogs.com/songchaolin/p/12098618.html pyspark.sql.utils.AnalysisException: LEGACY store assignment policy is disallowed in Spark data source V2. Please set the configuration spark.sql.storeAssignmentPolicy to other values.

    return spark

def Capitalize_hanzipinyin(word):
    # Example implementation (assumption, not from the original post):
    # convert the Chinese characters to pinyin and capitalize each syllable.
    if not word:
        return ''
    return ''.join(p.capitalize() for p in pypinyin.lazy_pinyin(word))

def main_run(dt):
    table_name='iceberg.xxx.xxx'
    target_table_name = 'iceberg.xxx.xxx'
    target_table_name_columns = ['A','B']
    sql = """
    select
	A,B
from
	%s
where 
    dt = '%s'	
    """%(table_name, dt)

    

    spark = get_spark()

    spark_df = spark.sql(sql)
    toPinyinUDF = udf(Capitalize_hanzipinyin, StringType())
    spark_df = spark_df.withColumn('A_pinyin', toPinyinUDF('A'))
  
    # solution 1: delete the existing partition, then append the new data
    delete_sql = "delete from %s where dt = '%s' "%(target_table_name,dt)
    spark.sql(delete_sql)
    spark_df.write.saveAsTable(target_table_name, None, "append", partitionBy='dt')
    # solution 2
    spark_df.createOrReplaceTempView("test")  # create a temporary view
    spark.sql(
        "insert overwrite table  %s partition(dt) select  A,B,A_pinyin from test" % target_table_name)
    # using select * fails with: Cannot safely cast '': string to int

    # solution 3: insert the selected columns with insertInto (overwrite=True)
    new_spark_df = spark.sql("SELECT A, B, A_pinyin from test")
    new_spark_df.write.insertInto(target_table_name, True)

    # solution 4: overwrites ALL data in the table
    # new_spark_df.write.saveAsTable(target_table_name, None, "overwrite",partitionBy='dt')    


    # solution 5: convert the Spark DataFrame to a pandas DataFrame; data types may not match the target schema
    df = spark_df.toPandas()
    # when converting to pandas, columns that are integer in Spark can become float in pandas
    df['A_pinyin'] = df['A'].apply(Capitalize_hanzipinyin)
    df = df[target_table_name_columns]  # reorder the columns
    schema = StructType([
                         StructField("A", StringType(), True),
                         ...  
                         ])
    # with a schema set: field A: IntegerType can not accept object 2.0 in type <class 'float'>
    DF = spark.createDataFrame(df,schema)
    # without a schema: ValueError: Some of types cannot be determined after inferring (some columns have types Spark cannot infer)
    DF.write.insertInto(target_table_name,  True)
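
argparse is imported but never wired up above; a minimal sketch of an entry point that reads the partition date from the command line and calls main_run (the --dt flag is an assumption, not from the original post):

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Write one dt partition into the Iceberg table')
    parser.add_argument('--dt', required=True, help="partition date, e.g. 2022-06-01")
    args = parser.parse_args()
    main_run(args.dt)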

 