Hive partitioned tables
A partition is a subdirectory of the table directory on HDFS, and data is stored under the subdirectory of the partition it belongs to. If the WHERE clause of a query contains a partition predicate, Hive reads only the matching partition directories instead of scanning the whole table directory, so a well-designed partitioning scheme can greatly improve query speed and performance.
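For example, a minimal sketch (the table, columns, and partition value are illustrative, not taken from the notes below):
CREATE TABLE access_log (uid STRING, url STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Only the dt='2024-01-01' subdirectory under the table directory is scanned.
SELECT count(*) FROM access_log WHERE dt = '2024-01-01';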
Hive bucketed tables
Partitioning offers a workable way to isolate data and optimize queries, but not every dataset can be partitioned sensibly, and more partitions are not always better: overly fine partition conditions can leave many partitions with no data at all. Hive also caps the number of partitions that dynamic partitioning may create, to keep an excessive number of partition files from burdening the file system. For these reasons, Hive provides a finer-grained way to split data: the bucketed table (bucket table).
Concept
Bucketing in Hive follows the same idea as bucketing in Java's HashMap.
When HashMap's put() method stores an entry, it first calls hashCode() on the key, then takes the hash modulo the array length to compute an index, and finally stores the entry in the linked list at that index (in JDK 1.8+ the list is converted to a red-black tree once it grows past a threshold).
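Hive bucketing applies the same scheme: the bucketing column's value is hashed and taken modulo the number of buckets, and each row is written to the corresponding bucket file. A minimal sketch (table and column names are illustrative):
CREATE TABLE user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- On older Hive versions this may be required so that inserts actually route rows into bucket files:
set hive.enforce.bucketing=true;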
Hive on MR
Optimization practice
#one
xxx disk: improved from 150M to 400M
#two
Parameters to enable Snappy compression for map output:
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
Parameters to enable LZO compression for the final (reduce) output:
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
set hive.exec.compress.intermediate=true; -- enable compression of intermediate data
set hive.exec.compress.output=true; -- enable compression of the final output
#three
xxx.enabled=true
mapreduce.fileoutputcommitter.algorithm.version=1
#four
set hive.optimize.sort.dynamic.partition=true;
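hive.optimize.sort.dynamic.partition sorts rows by the dynamic partition columns before they are written, so each reducer keeps only one record writer open per partition value, which lowers memory pressure during dynamic-partition inserts. A sketch of the kind of statement it targets (table and column names are illustrative):
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
insert overwrite table target_tbl partition(dt)
select a, b, dt from source_tbl;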
Partition-related operations
Spark operations
CREATE TABLE stable1(a string, b string) options (path "file:///home/hadoop/verify/origin") PARTITIONED BY ( p string );
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO stable1 VALUES('a2', 'b2', 'p2');
INSERT INTO stable1 VALUES('a3', 'b3', 'p3');
INSERT INTO stable1 VALUES('a11', 'b11', 'p1');
alter table stable1 add partition(p='p1') location 'file:///home/hadoop/verify/new/p=p1';
ALTER TABLE stable1 PARTITION(p='p1') SET LOCATION 'file:///home/hadoop/verify/new/p=p1';
insert overwrite table stable1 select * from VALUES('a1', 'b1', 'p1');
INSERT INTO stable1 VALUES('a11', 'b11', 'p1');
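To check that the partition points at the new location and that the inserted rows are visible, something like the following can be run (a sketch; exact output depends on the Spark/Hive version):
show partitions stable1;
describe formatted stable1 partition(p='p1');
select * from stable1 where p = 'p1';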
Adjusting Hive table locations
# spark
CREATE TABLE stable3(a string, b string) options (path "file:///home/hadoop/verify/origin") PARTITIONED BY ( p string );
# hive
CREATE TABLE htable2(a string, b string) PARTITIONED BY ( p string ) location "hdfs://HDFS1010671/user/hadoop/htable2/";
alter table stable4 add partition(p='p1') location 'hdfs://xxx:xx/user/hadoop/htable1/p=p1';
alter table stable4 partition(p='p1') set location 'hdfs://xxx:xx/user/hadoop/csv/p=p1';
Some Hive optimization settings
hive --hiveconf hive.root.logger=info,console
set hive.exec.mode.local.auto=true
hive2
CREATE TABLE htable2(a string, b string) PARTITIONED BY ( p string ) location "hdfs://xxx/user/hadoop/htable2/";
alter table htable2 add partition(p='p1') location 'hdfs://xxx/usr/hive/warehouse/htable2/p=p1';
insert into htable2 partition (p='p1') VALUES('a1', 'b1') ;
insert into htable2 partition (p='p2') VALUES('a2', 'b2') ;
hive>
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.mode.local.auto=true;
insert overwrite table htable2 partition(p) select * from htable2;
insert into htable2 partition(p) VALUES('a11', 'b11', 'p1') ;
hive3
CREATE [EXTERNAL] TABLE htable2(a string, b string) PARTITIONED BY ( p string ) location "hdfs://xxx/user/hadoop/htable1";
INSERT INTO htable2 VALUES('a2', 'b2', 'p2');
alter table htable2 add partition(p='p1') location 'cosn://xxx/tmp/hive/default.db/table1/p=p1';
describe formatted htable2 partition(p='p1');
alter table htable2 partition(p='p1') set location 'cosn://xxx/tmp/hive/default.db/table1/p=p1';
INSERT INTO htable2 VALUES('a2', 'b2', 'p1');
alter table htable2 partition(p='p2') set location "hdfs://xxx/user/hadoop/htable2/p=p2";
Loading data into Hive
Hive's LOAD DATA command loads data from HDFS or the local file system into a Hive table. Basic usage:
LOAD DATA [LOCAL] INPATH 'source_path' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
LOCAL: optional; indicates that the data file lives on the local file system rather than on HDFS. When loading from the local file system, the path must be accessible both to the Hive server and on the machine holding the data file.
INPATH 'source_path': the path of the data file. Use an HDFS path when loading from HDFS, or a local path when loading from the local file system.
OVERWRITE: optional; controls whether existing data in the target table is replaced. With OVERWRITE, existing data is overwritten; without it, the new data is appended to the table.
INTO TABLE tablename: the name of the target table.
PARTITION (partcol1=val1, partcol2=val2 ...): optional; specifies the partition columns and their values. Required when the target table is partitioned and the data should go into a specific partition.
LOAD DATA LOCAL INPATH '/home/hadoop/3.text' OVERWRITE INTO TABLE testhive;
- Note: the table must be declared with the format below, and the field delimiter must be checked carefully against the data; otherwise the load is likely to fail, or all the data may end up in the first column (see the sketch after the format lines).
row format delimited fields terminated by '\t'
STORED AS TEXTFILE
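A sketch putting the DDL and the load together (the column list is an assumption; only the table name and file path come from the note above):
CREATE TABLE testhive (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/hadoop/3.text' OVERWRITE INTO TABLE testhive;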
insert overwrite
insert overwrite table xx.xxx select * from testhive;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table keep_dw.dwd_keep_flow_item_show_click_di partition(p_date, event_type) select * from testhive;
- Note: for Hive, insert overwrite must be followed by the keyword "table", otherwise the following error is raised:
org.apache.hadoop.hive.ql.parse.ParseException: line 1:17 cannot recognize input near 'xx' '.' 'xxx' in destination specification
Spark datasource table
Setting this flag to false stops a plain CREATE TABLE from producing a Hive table, so no text-format Hive table gets created.
val LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT =
  buildConf("spark.sql.legacy.createHiveTableByDefault")
    .internal()
    .doc("When set to true, CREATE TABLE syntax without USING or STORED AS will use Hive " +
      s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(true)
spark-sql>
> CREATE TABLE default.customer_address_5000 select /*+ REPARTITION(5000) */ * from hdfs_parquet_tpcds_kit_1000.customer_address;
2024-04-22 19:49:41,806 [WARN] [main] ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
Time taken: 44.936 seconds
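As the warning suggests, setting the flag to false makes a plain CREATE TABLE (no USING or STORED AS) produce a native data source table in the default format (parquet) instead of a Hive SerDe text table. A sketch (the table name is illustrative):
set spark.sql.legacy.createHiveTableByDefault=false;
CREATE TABLE t_native(a string, b string);
-- t_native is now a parquet data source table rather than a text-format Hive table.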
Other related configuration
val CONVERT_CTAS = buildConf("spark.sql.hive.convertCTAS")
  .internal()
  .doc("When true, a table created by a Hive CTAS statement (no USING clause) " +
    "without specifying any storage property will be converted to a data source table, " +
    s"using the data source set by ${DEFAULT_DATA_SOURCE_NAME.key}.")
  .version("2.0.0")
  .booleanConf
  .createWithDefault(false)

// This is used to set the default data source
val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .version("1.3.0")
  .stringConf
  .createWithDefault("parquet")
Example
set spark.sql.hive.convertCTAS=true;
create table default.store_sales location "cosn://xx/testdata/parquet_tpcds_kit/1000/store_sales" as select * from cosn_parquet_tpcds_kit_1000.store_sales;
insert overwrite
CREATE TABLE metacosnstable(col0_str string) using parquet options (path "cosn://xxx/testdata/testparquet");
CREATE TABLE metacosnstableoverwrite(col0_str string) using parquet options (path "cosn://xxx/testdata/testparquetoverwrite");
insert overwrite metacosnstableoverwrite select * from metacosnstable;
#spark-shell
val df = Seq(
  "AB",
  "AB",
  "AB",
  "AB",
  "AB"
).toDF("col0_str")
df.coalesce(1).write.option("compression","snappy").parquet("cosn://xxx/testdata/testparquet")
#spark-sql
CREATE TABLE cosnstable(col0_str string) using parquet options (path "cosn://xxx/testdata/testparquet");
CREATE TABLE cosnstableoverwrite(col0_str string) using parquet options (path "cosn://xxx/testdata/testparquetoverwrite");
Spark SQL UTs
sql("CREATE TABLE t1(cal_dt DATE) USING PARQUET")
sql(
"""
|INSERT INTO t1 VALUES
|(date'2021-06-27'),
|(date'2021-06-28'),
|(date'2021-06-29'),
|(date'2021-06-30')""".stripMargin)
sql("CREATE VIEW t1_v AS SELECT * FROM t1")
sql(
"""
|CREATE TABLE t2(FLAG INT, CAL_DT DATE)
|USING PARQUET
|PARTITIONED BY (CAL_DT)""".stripMargin)
generate data case
On the second run, the INSERT INTO statements need to be commented out.
test("Verify Join") {
sql(
"""CREATE TABLE web_sales(ws_order_number string) using parquet options (path
"file:///Users/xiangxshen/Dev/test/case/web_sales"
);""")
// sql(
// """INSERT INTO web_sales VALUES('1001'), ('1001'), ('1002'), ('1002'), ('1002');
// """.stripMargin)
//
sql(
"""CREATE TABLE web_returns(wr_order_number string) using parquet options (path
"file:///Users/xiangxshen/Dev/test/case/web_returns"
);""")
// sql(
// """ INSERT INTO web_returns VALUES('1001'), ('1002'), ('1003');
// """.stripMargin)
withSQLConf("spark.sql.autoBroadcastJoinThreshold" -> "-1",
"spark.sql.codegen.wholeStage" -> "false") {
val df = sql("WITH ws_wh AS " +
"(SELECT ws1.ws_order_number FROM web_sales ws1, web_sales ws2 WHERE ws1.ws_order_number = ws2.ws_order_number)" +
"select wr.wr_order_number from web_returns wr where wr.wr_order_number IN " +
"(SELECT ws_order_number FROM ws_wh) order by wr.wr_order_number limit 1;")
df.show()
sql("select * from web_sales").show
}
}