20221209
https://iceberg.apache.org/releases/
Download iceberg-spark-runtime from here.
20220721
Iceberg tables can only be created through Spark or Flink;
DBeaver cannot create them.
Can Iceberg only drop whole tables, not delete specific rows from a table?
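On that question: with Spark 3 plus a recent Iceberg runtime (roughly 0.13+), row-level DELETE is available through Spark SQL once Iceberg's SQL extensions are enabled. A minimal sketch, reusing the Hive catalog settings from the 20220402 entry below:
```python
from pyspark.sql import SparkSession

# Iceberg's SQL extensions must be configured before the session is created.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .config("spark.sql.catalog.iceberg.uri", "thrift://192.168.1.54:9083")
    .getOrCreate()
)

# Row-level delete, not a full table drop.
spark.sql("DELETE FROM iceberg.test.end_spec WHERE shangpgg = 'aa'")
```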
20220523
https://blog.csdn.net/weixin_43161811/article/details/123647348
Core features of Iceberg in a data lake architecture
20220402
Reading and writing Iceberg from PySpark:
```python
# coding: utf-8
import os

import findspark

# Point findspark at the pyspark installation; on a server, prefer the
# pyspark path that belongs to the Spark installation itself.
findspark.init(r"D:\Python37\Lib\site-packages\pyspark")

# Set your own JDK 8 path here.
java8_location = r'D:\Java\jdk1.8.0_301/'
os.environ['JAVA_HOME'] = java8_location

from pyspark.sql import SparkSession


def get_spark():
    # pyspark session for reading Iceberg tables through the Hive catalog
    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    spark.conf.set("spark.sql.catalog.iceberg.type", "hive")
    spark.conf.set("spark.sql.catalog.iceberg.uri", "thrift://192.168.1.54:9083")
    # For a different target address / server cluster, copy that cluster's two
    # Hive config files into the local client's pyspark conf folder.
    return spark


if __name__ == '__main__':
    spark = get_spark()
    pdf = spark.sql("select shangpgg from iceberg.test.end_spec limit 10")
    spark.sql("insert into iceberg.test.end_spec values ('aa','bb')")
    pdf.show()
```
1. Create a conf folder under the pyspark installation and put the two Hive config files from the Iceberg cluster in it:
hdfs-site.xml
hive-site.xml
2. Put iceberg-spark-runtime-3.1_2.12-0.13.2.jar into pyspark's jars folder (an alternative is sketched after this list).
3. In hdfs-site.xml, replace hadoop01 with the concrete address, e.g. 192.168.1.50.
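Instead of copying the jar in step 2, the runtime jar can also be handed to Spark at session build time via spark.jars; a minimal sketch, assuming a local download path for the jar:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # assumed local path to the jar from step 2
    .config("spark.jars", r"D:\jars\iceberg-spark-runtime-3.1_2.12-0.13.2.jar")
    .getOrCreate()
)
```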
22/10/27 10:30:25 WARN Tasks: Retrying task after failure: Failed to get file system for path: alluxio://hadoop01:19998/lakehouse/ice_ods/ods_onekey_jkzj_infosp_jmd_cd_da/metadata/00135-6702d6d0-0f13-4e8c-a96f-3c3d592aeb82.gz.metadata.json
org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: alluxio://hadoop01:19998/lakehouse/ice_ods/ods_onekey_jkzj_infosp_jmd_cd_da/metadata/00135-6702d6d0-0f13-4e8c-a96f-3c3d592aeb82.gz.metadata.json
at org.apache.iceberg.hadoop.Util.getFs(Util.java:53)
Tables previously created on Alluxio can no longer be opened through Iceberg.
Possibly there are multiple Spark environments and the wrong Spark is being run; check the path passed to findspark.init(r"D:\python38_env\main_data\Lib\site-packages\pyspark").
Version matching:
pyspark 3.1.2
iceberg-spark-runtime-3.1_2.12-0.13.2.jar
the Iceberg version is 0.12
In other words, the versions have to correspond; the new cluster's Iceberg is 0.14.
20220315
```python
import trino

# Trino connection settings for the Iceberg catalog
self.config_iceberg = {
    "host": "192.168.1.55",
    "port": 8881,
    "user": "root",
    "catalog": "iceberg",
    "schema": "ice_ods",
}

if connected_type == "iceberg":
    self.conn = trino.dbapi.connect(**self.config_iceberg)
```
This is how Iceberg and Trino are connected; once the connection succeeds, it is just a matter of writing SQL for the business logic.
sink = write
source = read
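A minimal sketch of running a query over that connection through the DB-API cursor (the query and table name are illustrative):
```python
cur = self.conn.cursor()
cur.execute("select * from ice_ods.some_table limit 10")  # hypothetical table
rows = cur.fetchall()
print(rows)
```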
The number of write partitions should match the Kafka topic's partition count; otherwise the default of 200 shuffle partitions is very slow.
Partitioning by day means only one directory per day.
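A minimal DDL sketch of day partitioning using Iceberg's hidden days() transform; the table and column names are illustrative:
```python
# One directory per day: Iceberg derives the day partition from the timestamp.
spark.sql("""
    CREATE TABLE iceberg.test.events (
        id BIGINT,
        ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")
```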
https://blog.csdn.net/xuronghao/article/details/106184831
Writing to Iceberg from Spark:
partition specifies the concrete partition
hadoop_prod is the catalog, tb is the database
The warehouse can be created through a catalog of either of two types, hive or hadoop (see the sketch below).
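A minimal sketch of configuring both catalog types and writing via the DataFrameWriterV2 API; the warehouse path and table names are assumptions:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # hadoop catalog: metadata kept directly under a warehouse directory
    .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
    .config("spark.sql.catalog.hadoop_prod.warehouse",
            "hdfs://192.168.1.50:8020/warehouse")  # assumed path
    # hive catalog: metadata tracked by the Hive metastore
    .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_prod.type", "hive")
    .config("spark.sql.catalog.hive_prod.uri", "thrift://192.168.1.54:9083")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a")], ["id", "val"])
# catalog.database.table: hadoop_prod is the catalog, tb the database
df.writeTo("hadoop_prod.tb.sample").createOrReplace()  # .append() if the table exists
```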
```python
import logging

import numpy as np
# the trino:// dialect comes from the trino package's SQLAlchemy extra
from sqlalchemy import create_engine
from tqdm import tqdm

logger = logging.getLogger(__name__)


def save_to_db(data, database_type):
    """
    Save a DataFrame to the database in 1000-row batches.
    :param data: the data to save
    :return: no return value
    """
    if database_type == '生产':  # '生产' = production
        trino_engine = create_engine(
            "trino://root@192.168.1.55:8881/iceberg/ice_dwt"
        )  # production
    else:
        trino_engine = create_engine(
            "trino://root@192.168.40.11:8882/iceberg/ice_dwt"
        )  # test
    times = int(np.ceil(data.shape[0] / 1000))
    for i in tqdm(range(times)):
        data.iloc[i * 1000 : (i + 1) * 1000, :].to_sql(
            name="dwt_dm_bi_b2b_customer_churn_wide",
            con=trino_engine,
            index=False,
            if_exists="append",
            schema="ice_dwt",
            method="multi",
        )
    logger.debug("Saved to database successfully")
```
Inserting into Iceberg this way works, although it is a bit slow.
20220314
For Spark to write to Iceberg (over Alluxio), the alluxio-2.7.3-client.jar is needed;
it is inside the zip package downloaded from Alluxio.
Spark reads and writes against Iceberg have not been tested successfully yet.