HDP 3.x New Feature Tests

星辰丶晟妍


Article link: https://blog.csdn.net/qq_34648165/article/details/113366118


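The HDFS NameNode web UI on port 50070 now supports uploading files directly.
HDP 3.x manages user access permissions to HDP services through a unified Apache Ranger service.

Ranger UI login information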

http://node01.ambari3.0.com:6080/login.jsp
USERNAME: admin
PASSWORD:

Erasure coding test

Introduction to EC policies

ErasureCodingPolicy=[Name=XOR-2-1-1024k, Schema=[ECSchema=[Codec=xor, numDataUnits=2, numParityUnits=1]], CellSize=1048576, Id=4, State=DISABLED]

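hadoop-3.0.0beta1 currently supports five erasure coding policies:

RS-10-4-1024k: uses Reed-Solomon (RS) coding. Every 10 data cells produce 4 parity cells, 14 cells in total. The original data can be recovered as long as any 10 of these 14 cells survive, whether they are data cells or parity cells. Each cell is 1024k = 1024*1024 = 1048576 bytes.

RS-3-2-1024k: uses RS coding. Every 3 data cells produce 2 parity cells, 5 cells in total. The original data can be recovered as long as any 3 of these 5 cells survive, whether they are data cells or parity cells. Each cell is 1024k = 1024*1024 = 1048576 bytes.

RS-6-3-1024k: uses RS coding. Every 6 data cells produce 3 parity cells, 9 cells in total. The original data can be recovered as long as any 6 of these 9 cells survive, whether they are data cells or parity cells. Each cell is 1024k = 1024*1024 = 1048576 bytes.

RS-LEGACY-6-3-1024k: the same policy as RS-6-3-1024k above, except that the coding algorithm is rs-legacy, presumably an older RS implementation kept for compatibility.

XOR-2-1-1024k: uses XOR coding (faster than RS coding). Every 2 data cells produce 1 parity cell, 3 cells in total. The original data can be recovered as long as any 2 of these 3 cells survive, whether they are data cells or parity cells. Each cell is 1024k = 1024*1024 = 1048576 bytes.

Take RS-6-3-1024k as an example: with 6 data cells plus 3 parity cells, it tolerates the loss of any 3 cells while adding only 50% redundant data. Replication with 3 replicas adds 200% redundancy, yet it cannot tolerate the loss of any 3 units (it survives at most 2 lost replicas). So at the same level of redundancy, RS coding greatly improves data availability, and at the same level of availability, it saves a great deal of redundant space.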
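The comparison above is simple arithmetic, and it can be checked with a few lines of Python. This is only a sketch: the `Scheme` type and the helper names are made up for illustration and are not part of Hadoop, and 3x replication is modeled as one data unit plus two extra copies.

```python
from typing import NamedTuple


class Scheme(NamedTuple):
    name: str
    data_units: int
    parity_units: int


def overhead_percent(s: Scheme) -> float:
    """Extra storage beyond the data itself, as a percentage of the data."""
    return 100.0 * s.parity_units / s.data_units


def tolerated_losses(s: Scheme) -> int:
    """Number of units that may be lost while the data stays recoverable."""
    return s.parity_units


SCHEMES = [
    Scheme("RS-10-4-1024k", 10, 4),
    Scheme("RS-3-2-1024k", 3, 2),
    Scheme("RS-6-3-1024k", 6, 3),
    Scheme("XOR-2-1-1024k", 2, 1),
    # Assumption: 3x replication modeled as 1 data unit plus 2 extra copies.
    Scheme("3x replication", 1, 2),
]

for s in SCHEMES:
    print(f"{s.name}: overhead {overhead_percent(s):.0f}%, "
          f"tolerates {tolerated_losses(s)} lost unit(s)")
```

Under this model RS-6-3 comes out at 50% overhead with 3 tolerated losses, against 200% overhead and 2 tolerated losses for 3x replication, which matches the figures in the text.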

Create tables with the same structure in both the HDP 3.0 and 2.7 environments

  create database fake_db;
  use fake_db;
  CREATE TABLE fake_table_not_ec(
     name string,
     ssn string,  
     company string,
     mobile string,
     location string,
     bank_number string
  )
  comment 'fake data table'
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar" = "\"",
      "escapeChar" = "\\"
  )
  STORED AS TEXTFILE;

Generate test data
- Row count:
- Size:
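The post does not show how the test file was produced. One possible way, sketched here with only the Python standard library, is to write random rows in the table's column order. Everything in this block is an assumption made for illustration: the helper names, the value shapes, and the seed. The bank number format (GB, 2 digits, 4 letters, 14 digits) simply mirrors the shape of the values used in the test queries.

```python
import csv
import io
import random
import string

# Column order of the fake_db tables: name, ssn, company, mobile, location, bank_number


def random_text(rng, length, alphabet=string.ascii_uppercase):
    """Return `length` random characters drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def fake_bank_number(rng):
    """GB + 2 digits + 4 letters + 14 digits (22 characters in total)."""
    return ("GB" + random_text(rng, 2, string.digits)
            + random_text(rng, 4)
            + random_text(rng, 14, string.digits))


def fake_row(rng):
    """Build one row of made-up values in the table's column order."""
    return [
        random_text(rng, 8).title(),           # name
        random_text(rng, 9, string.digits),    # ssn
        random_text(rng, 10).title() + " Ltd",  # company
        random_text(rng, 11, string.digits),   # mobile
        random_text(rng, 6).title(),           # location
        fake_bank_number(rng),                 # bank_number
    ]


def write_fake_csv(out, rows, seed=0):
    """Write `rows` data rows (no header line) to the open text file `out`."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    for _ in range(rows):
        writer.writerow(fake_row(rng))


if __name__ == "__main__":
    buf = io.StringIO()
    write_fake_csv(buf, rows=3)
    print(buf.getvalue(), end="")
```

To produce a real file, open it with `newline=""` and pass it to `write_fake_csv` with the desired row count. No header line is written because the table definition does not skip one.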

Data load test (must be run on the HiveServer2 node, because load data local reads the file from that node's local filesystem)

load data local inpath '/opt/install/fake_table.csv' into table fake_db.fake_table_not_ec;

Check the space used on HDFS

[root@master install]# sudo -u hdfs hdfs dfs -du -s -h /warehouse/tablespace/managed/hive/fake_db.db/fake_table
1.4 G  4.2 G  /warehouse/tablespace/managed/hive/fake_db.db/fake_table

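No space saving shows up here. That is expected at this point: no EC policy has been set on the path yet, so the 1.4 G of data occupies 4.2 G on disk, which is 3 replicas.

Erasure coding is turned on for specific directories with the following commands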

# List the EC policies currently supported
hdfs ec -listPolicies
# Enable the desired EC policy
sudo -u hdfs hdfs ec -enablePolicy  -policy XOR-2-1-1024k
# Set the EC policy on a path
hdfs ec -setPolicy -policy XOR-2-1-1024k -path /warehouse/tablespace/managed/hive
# Show the EC policy of a path
hdfs ec -getPolicy -path /warehouse/tablespace/managed/hive
# Remove the EC policy from a path
hdfs ec -unsetPolicy -path /warehouse/tablespace/managed/hive

Time to load the local data file (rows: 11000100, size: 1.4G)
- Without an EC policy: 64.195 seconds
- With an EC policy: 10.755 seconds

Space used after the Hive load, without an EC policy versus with one

[Image unavailable in the source]
Space usage is indeed cut in half: XOR-2-1-1024k stores 1.5 times the logical size, versus 3 times for 3 replicas.

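Stop one DataNode and check whether the data is still safe

Before stopping it, count both tables (one with an EC policy, one without)
- fake_table: 11000100
- fake_table_ec: 11000100

After stopping it, count both tables again (one with an EC policy, one without)
- fake_table: 11000100
- fake_table_ec: 11000100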

Query a few arbitrary rows to confirm they can still be read. The test SQL is as follows

select * from fake_table where bank_number='GB05MEHC45622256829264';  
select * from fake_table where bank_number='GB18GLWZ88759544610352'; 
select * from fake_table where bank_number='GB49BHBS00884452230411';
select * from fake_table where bank_number='GB41TPHM25826181520045';
select * from fake_table where bank_number='GB29CSSB80998968822738';
select * from fake_table where bank_number='GB60GUNL97445046902360';
select * from fake_table where bank_number='GB30OTGY35588851274731';

Check the original data file and confirm that the rows and row counts match what the SQL returned
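The cross-check against the original file can be scripted. The sketch below counts the rows of a CSV whose sixth column (bank_number) equals a given value, so the count can be compared with the result of the corresponding query. The function name is hypothetical and the two sample rows are made up; for the real check, pass the opened /opt/install/fake_table.csv file instead.

```python
import csv
import io

BANK_NUMBER_INDEX = 5  # bank_number is the sixth column of the table


def count_matches(lines, bank_number):
    """Count CSV rows whose bank_number column equals `bank_number`."""
    return sum(
        1
        for row in csv.reader(lines)
        if len(row) > BANK_NUMBER_INDEX and row[BANK_NUMBER_INDEX] == bank_number
    )


if __name__ == "__main__":
    # Made-up sample rows standing in for the real data file.
    sample = io.StringIO(
        "Alice,111,Acme,555,London,GB05MEHC45622256829264\n"
        "Bob,222,Initech,556,Leeds,GB18GLWZ88759544610352\n"
    )
    print(count_matches(sample, "GB05MEHC45622256829264"))
```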

Restore the DataNode

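Hive table compression ratio test

Three criteria for evaluating a compression scheme
- Compression ratio: the higher the ratio, the smaller the compressed file, so higher is better
- Compression time: the faster the better
- Whether the compressed file can be split: a splittable format lets a single file be processed by multiple mappers, which parallelizes better

Hive's default execution engine is Tez
- The execution engine can only be Tez; MR is no longer supported

  [Image unavailable in the source]
- Forum thread
```
https://community.cloudera.com/t5/Support-Questions/Cannot-Disable-Tez-with-Hive-on-HDP3-0/td-p/186044
```

Compression formats supported by Hive

TEXTFILE: no compression, stored as plain text

Table creation statement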

CREATE TABLE fake_table_textfile(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED AS TEXTFILE;

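ZLIB (note: the statement below sets orc.compress to NONE, so this table actually holds uncompressed ORC data)

Table creation statement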

CREATE TABLE fake_table_zlib(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc tblproperties ("orc.compress"="NONE");

SNAPPY

Table creation statement

CREATE TABLE fake_table_snappy(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc tblproperties ("orc.compress"="SNAPPY");

ORC (default settings; ORC compresses with ZLIB by default)

Table creation statement

CREATE TABLE fake_table_orc(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc;

Load data into each of the tables above

insert into table fake_table_textfile select * from fake_table_ec;
insert into table fake_table_zlib select * from fake_table_ec;
insert into table fake_table_snappy select * from fake_table_ec;
insert into table fake_table_orc select * from fake_table_ec;

Time taken by each load
- TEXTFILE: 80.903 seconds
- ZLIB: 71.551 seconds
- SNAPPY: 67.019 seconds
- ORC: 56.889 seconds

Original data size: 1511958111 B

Compute the compression ratio of each storage format (size stored on HDFS divided by the original size, so for these figures smaller is better)

TEXTFILE: 1.080029399703389

[root@master install]# sudo -u hdfs hdfs dfs -du -s /warehouse/tablespace/managed/hive/fake_db.db/fake_table_textfile
1632959211  2452945643  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_textfile

ZLIB: 0.7492874238762558

[root@master install]# sudo -u hdfs hdfs dfs -du -s /warehouse/tablespace/managed/hive/fake_db.db/fake_table_zlib
1132891198  1703316543  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_zlib

SNAPPY: 0.5253655112671306

[root@master install]# sudo -u hdfs hdfs dfs -du -s /warehouse/tablespace/managed/hive/fake_db.db/fake_table_snappy
794330646  1196493868  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_snappy

ORC: 0.3585011602216273

[root@master install]# sudo -u hdfs hdfs dfs -du -s /warehouse/tablespace/managed/hive/fake_db.db/fake_table_orc
542038737  816690594  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_orc
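The ratios listed above are the first column of each du line divided by the original data size. A small Python sketch of that calculation follows; the function names are illustrative, and the du lines and the original size are copied from the output above.

```python
ORIGINAL_BYTES = 1511958111  # original data size reported above

# Lines printed by `hdfs dfs -du -s`: logical size, size on disk, path.
DU_OUTPUT = """\
1632959211  2452945643  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_textfile
1132891198  1703316543  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_zlib
794330646  1196493868  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_snappy
542038737  816690594  /warehouse/tablespace/managed/hive/fake_db.db/fake_table_orc
"""


def parse_du(text):
    """Map each table name to its (logical_bytes, disk_bytes) pair."""
    sizes = {}
    for line in text.splitlines():
        logical, on_disk, path = line.split()
        sizes[path.rsplit("/", 1)[-1]] = (int(logical), int(on_disk))
    return sizes


def compression_ratios(sizes, original_bytes):
    """Stored size divided by original size; smaller means better compression."""
    return {table: logical / original_bytes
            for table, (logical, _) in sizes.items()}


for table, ratio in compression_ratios(parse_du(DU_OUTPUT), ORIGINAL_BYTES).items():
    print(f"{table}: {ratio:.4f}")
```

In every line the second column is roughly 1.5 times the first, which is consistent with the XOR-2-1-1024k policy having been set on the warehouse path.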

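Judging by both compression ratio and load time, the ORC table with default settings does best.

Repeat the above tests in the HDP 2.x environment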

TEXTFILE: no compression, stored as plain text

Table creation statement

CREATE TABLE fake_table_textfile(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED AS TEXTFILE;

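ZLIB (note: the statement below sets orc.compress to NONE, so this table actually holds uncompressed ORC data)

Table creation statement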

CREATE TABLE fake_table_zlib(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc tblproperties ("orc.compress"="NONE");

SNAPPY

Table creation statement

CREATE TABLE fake_table_snappy(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc tblproperties ("orc.compress"="SNAPPY");

ORC (default settings; ORC compresses with ZLIB by default)

Table creation statement

CREATE TABLE fake_table_orc(
   name string,
   ssn string,  
   company string,
   mobile string,
   location string,
   bank_number string
)
comment 'fake data table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
)
STORED as orc;

Load data into each of the tables above

load data local inpath '/opt/install/fake_table.csv' into table fake_db.fake_table;
ALTER TABLE fake_table_zlib SET FILEFORMAT ORC;
ALTER TABLE fake_table_snappy SET FILEFORMAT ORC;
ALTER TABLE fake_table_orc SET FILEFORMAT ORC;
insert into table fake_table_textfile select * from fake_table;
insert into table fake_table_zlib select * from fake_table;
insert into table fake_table_snappy select * from fake_table;
insert into table fake_table_orc select * from fake_table;

Time taken by each load
- TEXTFILE: 41.094 seconds
- ZLIB: 63.606 seconds
- SNAPPY: 50.819 seconds
- ORC: 50.17 seconds

Original data size: 1511955903 B

Compute the compression ratio of each storage format (size stored on HDFS divided by the original size, so for these figures smaller is better)

TEXTFILE: 1.080029516575127

[root@master install]# hdfs dfs -du -s /apps/hive/warehouse/fake_db.db/fake_table_textfile
1632957003  /apps/hive/warehouse/fake_db.db/fake_table_textfile

ZLIB: 0.7549669065976721

[root@master install]# hdfs dfs -du -s /apps/hive/warehouse/fake_db.db/fake_table_zlib
1141476671  /apps/hive/warehouse/fake_db.db/fake_table_zlib

SNAPPY: 0.5198459309828165

[root@master install]# hdfs dfs -du -s /apps/hive/warehouse/fake_db.db/fake_table_snappy
785984124  /apps/hive/warehouse/fake_db.db/fake_table_snappy

ORC: 0.344330013174994

[root@master install]# hdfs dfs -du -s /apps/hive/warehouse/fake_db.db/fake_table_orc
520611796  /apps/hive/warehouse/fake_db.db/fake_table_orc

The comparison shows that compression ratios differ very little between HDP 2.x and 3.x
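To put a number on "very little", the following Python sketch computes, per table, the absolute gap between the two environments' ratios. The byte counts are the stored sizes and original sizes reported in the two test runs; the dictionary layout and function names are just one convenient way to arrange them.

```python
# Stored sizes in bytes (first column of `hdfs dfs -du -s`) and original sizes,
# as reported for each environment.
HDP3 = {"original": 1511958111, "textfile": 1632959211, "zlib": 1132891198,
        "snappy": 794330646, "orc": 542038737}
HDP2 = {"original": 1511955903, "textfile": 1632957003, "zlib": 1141476671,
        "snappy": 785984124, "orc": 520611796}


def ratios(sizes):
    """Stored size divided by original size, per table."""
    original = sizes["original"]
    return {k: v / original for k, v in sizes.items() if k != "original"}


def ratio_gaps(a, b):
    """Absolute difference between the two environments' ratios, per table."""
    ra, rb = ratios(a), ratios(b)
    return {k: abs(ra[k] - rb[k]) for k in ra}


for fmt, gap in ratio_gaps(HDP3, HDP2).items():
    print(f"{fmt}: ratios differ by {gap:.4f}")
```

With these inputs every gap stays below 0.02, the largest being the one for the default ORC table.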

Atlas test

Does Atlas 1.1.0 support creating relationships through the API?

How can all the information in a lineage be retrieved?

# List all type definitions
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/types/typedefs"
# List type definition headers (summary only)
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/types/typedefs/headers"
# Search for all Hive tables; the guid of a table can be found through this endpoint
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/search/basic?typeName=hive_table"
# Search by table name
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/search/basic?query=fake_table_ec&typeName=hive_table"
# How to add a new entity
# The /v2/entity endpoint creates an entity; it can be called like this:
curl -u admin:1518808692lhjLHJ  -ik -H "Content-Type: application/json" -X POST -d '{"entity": {"typeName" : "source_data", "attributes" : {"name" : "customs_source_data", "qualifiedName" : "customs_source_data@Ambari", "url" : "www.google.com", "clusterName":"Ambari"}}}' http://node01:21000/api/atlas/v2/entity
----------------------------
# Query entity information by guid; the response also shows relationships to other entities
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/entity/bulk?minExtInfo=yes&guid=5c870fa6-b743-4634-acef-0ce3cdadb91d"
# Fetch the definition of an entity
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/entity/bulk?minExtInfo=yes&guid=7232f92a-35ed-491c-b63c-66d462ab2e02"
# Query the lineage of an entity
curl -s -u admin:1518808692lhjLHJ "http://node01:21000/api/atlas/v2/lineage/5c870fa6-b743-4634-acef-0ce3cdadb91d"
# Create a relationship (a Process entity linking an input entity to an output entity)
curl -u admin:1518808692lhjLHJ  -ik -H "Content-Type: application/json" -X POST -d '{"entities":[{"typeName":"Process","attributes":{"owner":"root","createTime":"2021-01-08T17:38:21.0Z","updateTime":"","qualifiedName":"spider_process@spider","name":"spider_process","description":"资源爬取","comment":"资源爬取","contact_info":"jdbc","type":"table","inputs":[{"guid":"7232f92a-35ed-491c-b63c-66d462ab2e02","typeName":"source_data"}],"outputs":[{"guid":"5c870fa6-b743-4634-acef-0ce3cdadb91d","typeName":"hive_table"}]}}]}' http://node01:21000/api/atlas/v2/entity/bulk
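The JSON body of the last command is hard to read on one line. As a sketch, the same structure can be assembled in Python before it is posted to /api/atlas/v2/entity/bulk. The helper names `entity_ref` and `build_process_payload` are hypothetical, only a subset of the attributes from the curl command is included, and the block builds and prints the payload without sending any request.

```python
import json


def entity_ref(guid, type_name):
    """Reference to an existing Atlas entity by guid and type name."""
    return {"guid": guid, "typeName": type_name}


def build_process_payload(name, qualified_name, inputs, outputs, owner="root"):
    """Build a bulk-entity body with one Process entity linking inputs to outputs."""
    return {
        "entities": [
            {
                "typeName": "Process",
                "attributes": {
                    "owner": owner,
                    "qualifiedName": qualified_name,
                    "name": name,
                    "inputs": inputs,
                    "outputs": outputs,
                },
            }
        ]
    }


# guids and type names taken from the curl command above
payload = build_process_payload(
    name="spider_process",
    qualified_name="spider_process@spider",
    inputs=[entity_ref("7232f92a-35ed-491c-b63c-66d462ab2e02", "source_data")],
    outputs=[entity_ref("5c870fa6-b743-4634-acef-0ce3cdadb91d", "hive_table")],
)
print(json.dumps(payload, indent=2))
```

The lineage shown for the hive_table entity comes from this Process entity: its inputs and outputs are what connect the source_data entity to the table.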

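YARN timeline test
- The YARN timeline feature (Timeline Service v2) is new in HDP 3.x
- Can it be understood as collecting information about running and completed applications? Information from every node?
- What is better than before?
- YARN Timeline Service v2 provides a generic module through which applications share information and storage, and it can store data such as metrics. It supports distributed writer instances and a scalable storage backend. v2 also improves stability and performance: the previous version did not suit large clusters, and v2 replaces LevelDB with HBase as the backend store.

HDFS supports more than two NameNodes
- With high availability enabled, HDP 2.x has one active NameNode and one standby NameNode, i.e. only two NameNodes
- HDP 3.x has three NameNodes
- Better disaster tolerance

HDP 3.x service ports are fixed rather than ephemeral
- In HDP 2.x some services used ephemeral ports; in HDP 3.x all service ports are fixed

Data balancing inside a DataNode
- HDP 2.x can only balance data between DataNodes. When disks are added to or removed from a DataNode, the data on that node ends up unevenly spread across its disks. Balancing data inside a DataNode is not possible in 2.x but is possible in 3.x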
