Miscellaneous Notes: Optimization

Cluster installation and setup
Node role assignment
Centralized masters: co-locate the single points of failure (SPOFs) on dedicated master nodes
NameNode,JobTracker / ResourceManager
Hive Metastore,HiveServer2
Impala StateStore,Catalog Server
Spark Master
Node kernel parameters
–ulimit: raise the nofile (max open files) limit in /etc/security/limits.conf
–THP (Transparent Huge Pages), ACPI, and memory overcommit issues
–Apply different settings to nodes with different roles:
Memory-heavy? Needs swap tuning?
Needs high disk throughput?
CPU-bound, with high system load?
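A minimal sketch of these OS-level settings (run as root; the nofile value and the THP sysfs path are common defaults and vary by distribution):

```shell
# Raise the open-file limit for the Hadoop service users
# (65536 is a typical starting point, not a universal recommendation)
cat >> /etc/security/limits.conf <<'EOF'
hdfs  -  nofile  65536
yarn  -  nofile  65536
EOF

# Disable Transparent Huge Pages (some distros use
# /sys/kernel/mm/redhat_transparent_hugepage instead)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Discourage swapping on memory-heavy nodes
sysctl -w vm.swappiness=1
```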

Linux commands
ping ip                       # basic reachability
nc ip port                    # is the TCP port open?
telnet ip port                # same idea, interactive
w3m http://localhost:port     # browse a web UI from the terminal

Advanced HDFS operations
–HttpFS (HDFS over HTTP)
http://host:14000/webhdfs/v1/?op=xxx&user.name=hdfs
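For example, with curl (assuming HttpFS on its default port 14000; LISTSTATUS and OPEN are standard WebHDFS operations, the host and file path are placeholders):

```shell
# List the HDFS root directory through HttpFS
curl "http://httpfs-host:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"

# Read a file; -L follows any redirect
curl -L "http://httpfs-host:14000/webhdfs/v1/tmp/test.txt?op=OPEN&user.name=hdfs"
```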
–HA
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-High-Availability-Guide/CDH5-High-Availability-Guide.html
–DistCp
hftp
hadoop distcp hftp://oldHDFS:50070/ hdfs://newHDFS:8020/
–Cache
Centralized Cache Management in HDFS
–NFS
sudo service portmap stop
sudo hdfs portmap 2>~/portmap.err &
sudo -u hdfs hdfs nfs3 2>~/nfs3.err &
rpcinfo -p xxx.xxx.xxx.xxx
showmount -e xxx.xxx.xxx.xxx
sudo mount -t nfs -o vers=3,proto=tcp,nolock $HOSTNAME:/ /mnt/hdfs
–upgrade
stop the NameNode, then restart it with: hdfs namenode -upgrade
hdfs dfsadmin -finalizeUpgrade    # once the upgrade is verified
–Sync user/group mappings
hdfs dfsadmin -refreshUserToGroupsMappings
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
–fsck / block
hdfs fsck /xxx -files -blocks -locations
–setrep
hadoop fs -setrep [-R] [-w] <numReplicas> <path>
–balancer
sudo -u hdfs hdfs balancer -threshold x
–Tuning settings
dfs.client.read.shortcircuit
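Short-circuit reads also need a UNIX domain socket shared by the DataNode and clients; a typical hdfs-site.xml fragment (the socket path below is a common convention, adjust to your layout):

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```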

MapReduce / YARN configuration and tuning
–Job settings
mapreduce.job.reduces: set according to the workload
a multiple of (number of nodes * reduce slots per node)
mapreduce.client.submit.file.replication: in my view 2 is enough
mapreduce.map.output.compress + codec
–io.sort
mapreduce.task.io.sort.factor = 10 or more
mapreduce.task.io.sort.mb
–io.sort.mb
mapreduce.map.sort.spill.percent
–io.sort.spill.percent
mapreduce.reduce.shuffle.parallelcopies
–between sqrt(nodes * map slots per node) and (nodes * map slots per node) / 2
Other: io.file.buffer.size = 128 KB (131072 bytes) for SequenceFiles
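The job-side settings above can be collected into a mapred-site.xml (or per-job `-D`) fragment; the values are illustrative starting points, not recommendations for every workload:

```xml
<property><name>mapreduce.task.io.sort.mb</name><value>256</value></property>
<property><name>mapreduce.task.io.sort.factor</name><value>64</value></property>
<property><name>mapreduce.map.sort.spill.percent</name><value>0.8</value></property>
<property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>10</value></property>
<property><name>mapreduce.map.output.compress</name><value>true</value></property>
<property><name>mapreduce.map.output.compress.codec</name><value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
<property><name>io.file.buffer.size</name><value>131072</value></property>
```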
Where do the map/reduce slot counts come from?
–NodeManager
–CPU: yarn.nodemanager.resource.cpu-vcores
–Memory: yarn.nodemanager.resource.memory-mb
–map slot
client side
–CPU: mapreduce.map.cpu.vcores
–Memory: mapreduce.map.memory.mb
–Heap size: adjust -Xmx in mapreduce.map.java.opts to match
–reduce slot
client side
–CPU: mapreduce.reduce.cpu.vcores
–Memory: mapreduce.reduce.memory.mb
–Heap size: adjust -Xmx in mapreduce.reduce.java.opts to match
Note that the ApplicationMaster also takes memory and CPU cores
ResourceManager
–keep resource allocation consistent with the NodeManager side
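As an illustration of keeping the two sides consistent, assume a worker with 48 GB RAM and 16 cores, reserving some for the OS and the DataNode (all numbers below are example values, not recommendations):

```xml
<!-- yarn-site.xml: what one NodeManager offers -->
<property><name>yarn.nodemanager.resource.memory-mb</name><value>40960</value></property>
<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>14</value></property>

<!-- mapred-site.xml: what each task asks for; -Xmx stays below the container size -->
<property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
<property><name>mapreduce.map.java.opts</name><value>-Xmx1638m</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx3276m</value></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>2048</value></property>
```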
Troubleshooting
–Container XXX is running beyond virtual memory limits
Set on the NodeManager side; analogous to the OS-level overcommit problem
–yarn.nodemanager.vmem-pmem-ratio
–or set yarn.nodemanager.vmem-check-enabled to false
–OOM
memory and heap sizing
–Compression codecs
mapreduce.map.output.compress.codec
mapreduce.output.fileoutputformat.compress.codec
mapreduce.output.fileoutputformat.compress.type
–org.apache.hadoop.io.compress.DefaultCodec
–org.apache.hadoop.io.compress.SnappyCodec
–org.apache.hadoop.io.compress.BZip2Codec / GzipCodec
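These can also be toggled per job on the command line; a sketch (the example jar and the input/output paths are placeholders):

```shell
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  /input /output
```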

Hive tuning settings
Number of reducers
hive.exec.reducers.bytes.per.reducer
mapred.reduce.tasks=-1
Permission issues
hive.warehouse.subdir.inherit.perms
HiveServer2 memory issues
–set -Xmx as large as you can afford…
-Xmx2048m or even -Xmx4g
Disable speculative execution
hive.mapred.reduce.tasks.speculative.execution
mapreduce.reduce.speculative
Client side
hive.cli.print.current.db
hive.cli.print.header
Parallel execution!
hive.exec.parallel
hive.exec.parallel.thread.number
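In a Hive session these look like the following (4 is just an example thread count):

```sql
-- Run independent stages of one query in parallel
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=4;
```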
MapJoin
hive.auto.convert.join
hive.mapjoin.smalltable.filesize
hive.mapjoin.followby.gby.localtask.max.memory.usage=0.55
hive.mapjoin.followby.map.aggr.hash.percentmemory=0.3
hive.mapjoin.localtask.max.memory.usage=0.9
hive.ignore.mapjoin.hint
Local Mode
hive.exec.mode.local.auto
hive.exec.mode.local.auto.input.files.max
hive.exec.mode.local.auto.inputbytes.max
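For small inputs, local mode can be enabled like this (the thresholds are illustrative):

```sql
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.input.files.max=4;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
```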


msck repair table xxx — load partitions of an external table
desc formatted xxx
desc formatted xxx partition(xxx=xxx)
invalidate metadata — sync metadata (Impala)
Virtual columns
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Storing different partitions of one table in different formats
CREATE TABLE order_created_dynamic_partition_parquet (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;

MSCK REPAIR TABLE order_created_dynamic_partition_parquet;

-- set this partition to text file format; Hive bug: INSERT INTO reverts to the format declared at table creation
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;
–Hive: /*+MAPJOIN(xxx)*/ hint — select /*+MAPJOIN(xxx)*/ a.x, b.x from a join b on a.x = b.x
automatic conversion: hive.auto.convert.join
–Impala: [SHUFFLE] vs. [BROADCAST] — select a.x, b.x from a join [SHUFFLE] b on a.x = b.x
When to use which, and for what kind of tables?
Hive complex data types: array
–collect_set
–collect_list
–array_contains
–sort_array
group_concat in Impala
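A small example of the array functions above (the `ad_clicks` table and its columns are hypothetical; `ad_list` with an array column `catalogs` is used in the lateral view example below):

```sql
-- Aggregate values into an array: deduplicated (collect_set) vs. not (collect_list)
SELECT ad_id,
       sort_array(collect_set(catalog)) AS uniq_catalogs,
       collect_list(catalog)            AS all_catalogs
FROM ad_clicks
GROUP BY ad_id;

-- Membership test on an array column
SELECT * FROM ad_list WHERE array_contains(catalogs, 'books');
```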
lateral view
–explode(array)
–LATERAL VIEW OUTER
select ad_id, catalog from ad_list LATERAL VIEW OUTER explode(catalogs) t AS catalog