Big Data Operations: Summary of Issues Handled in Production (Continuously Updated)

lxqyx007

Category: Big Data Operations. Tags: operations, big data, hadoop, issue summary, CDH

Article link: https://blog.csdn.net/lxqyx007/article/details/96993378

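2020.05.13
hdfs dfsadmin -report  Show the state of the HDFS cluster. The block counters to watch:
Under replicated blocks         number of blocks with fewer replicas than the configured replication factor
Blocks with corrupt replicas    number of blocks that have a corrupt replica
Missing blocks                  number of missing blocks
hdfs fsck -list-corruptfileblocks  List the missing (corrupt) blocks
hdfs fsck /tmp/temporary-b8a2bca9-a9ed-4de2-8779-14f65d719622 -delete  Check this directory and delete any files in it that have missing blocks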
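
When this check is run regularly, it can help to pull just the three block counters out of the full report. The following is a minimal sketch, assuming the report prints each counter as a `Name: value` line starting in the first column. The function name `block_health` and the output key names are made up for illustration and are not part of HDFS.

```shell
# block_health: read `hdfs dfsadmin -report` output on stdin and print the
# three block counters as name=value lines.
# The key names on the left of "=" are hypothetical, chosen for this sketch.
block_health() {
  awk -F': *' '
    /^Under replicated blocks:/      { print "under_replicated=" $2 }
    /^Blocks with corrupt replicas:/ { print "corrupt_replicas=" $2 }
    /^Missing blocks:/               { print "missing=" $2 }
  '
}
```

On a node that has the hdfs client on its PATH, this would be used as `hdfs dfsadmin -report | block_health`.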

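2020.04.01
Today I connected a new data source to an existing Kafka topic.
Symptom: Kafka did not accept the new data, while the data that was already flowing in was fine.
Test 1: An existing test topic could not receive data either. I went through a full round of network checks and found nothing wrong.
Test 2: A newly created topic received data normally. I suspected that the topic's partitions were stored on different brokers and that only some of those brokers were accepting data, so I created test topics with different partition counts. That ruled the theory out and confirmed that the brokers' network was fine.
At this point it looked like a problem in Kafka itself. Checking topic status showed that the existing production topic (the one that could not accept the new data) looked normal, while the existing test topic showed a leader of -1:
kafka-topics --zookeeper master3:2181 --describe --topic AIS_110_Test
The fix in the end was to restart Kafka, after which data was received normally. The test topic's leader was still -1, so I simply deleted that test topic.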
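
Spotting a leader of -1 by eye is easy to miss when a topic has many partitions. Here is a small sketch that filters the describe output for leaderless partitions. It assumes each partition line contains the whitespace-separated fields `Topic: <name>`, `Partition: <n>` and `Leader: <id>`, which is how the describe output usually looks. The function name `leaderless_partitions` is hypothetical.

```shell
# leaderless_partitions: read `kafka-topics --describe` output on stdin and
# print "topic partition" for every partition whose leader is -1.
leaderless_partitions() {
  awk '{
    topic = ""; part = ""; leader = ""
    # Walk the fields and pick up the value that follows each label.
    for (i = 1; i < NF; i++) {
      if ($i == "Topic:")          topic  = $(i + 1)
      else if ($i == "Partition:") part   = $(i + 1)
      else if ($i == "Leader:")    leader = $(i + 1)
    }
    if (leader == "-1") print topic, part
  }'
}
```

For example: `kafka-topics --zookeeper master3:2181 --describe | leaderless_partitions`. An empty result means every partition has a leader.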

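2020.01.16
I. Backup
1. Hostname, IP, firewall rules, fstab, zabbix_agentd.conf, userparameter_mysql.conf, zabbix-agent-3.2.11-1.el7.x86_64.rpm, cm-5.8.1, /etc/alternatives/spark2-conf, crontab contents
2. The roles this node holds in the big data cluster
These are the items backed up when taking M7 offline; extend the list as needed for other nodes.
II. Taking the node offline
1. hdfs fsck /  Check the state of the cluster's data
2. CDH management page: Hosts, select the host to take offline, Actions for Selected, stop all roles on the host
3. CDH management page: Hosts, select the host to take offline, Actions for Selected, decommission the host (can be skipped if time is short; this step waits until all under-replicated blocks have been re-replicated, which takes a long time)
4. CDH management page: Hosts, select the host to take offline, Actions for Selected, remove from cluster (this step also has the decommission option checked; uncheck it if time is short)
5. /opt/cm-5.8.1/etc/init.d/cloudera-scm-agent stop  Stop the CDH agent
6. CDH management page: Hosts, select the host to take offline, Actions for Selected, delete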
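Problems encountered during this decommission
1. HDFS reported missing blocks. It turned out that the affected blocks did not have enough replicas. The final fix was to force the replication factor of this directory to 3:
hdfs dfs -setrep 3 /user/oozie/share/lib/lib_20170206175024
hdfs fsck -list-corruptfileblocks  List the missing blocks in the HDFS cluster
hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica  Show details of the missing blocks
find /data1/dfs/dn -name 'blk_100376901*'  Search the decommissioned node for the missing block
hdfs debug recoverLease -path /user/oozie/share/lib/lib_20170206175024/sqoop/stringtemplate-3.2.1.jar  Repair the missing block (I tried two; both reported success, but it had no effect)
2. YARN reported that there was no JobHistory Server. Install this role on M1 before taking the node offline.
3. The YARN node on M3 could not start because the disk ran out of space while replicas were being regenerated. This had no lasting effect once the node was back online.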
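III. Reinstalling
1. Download the latest CentOS 7.7.
2. Write the image to a USB stick (the installer sometimes drops into emergency mode after starting, because the USB volume label is not read in full; either shorten the label when writing the image, or press Tab on the installer boot screen and delete the part of the label that is too long).
3. To get eth0-style network interface names, press Tab on the installer boot screen, add a space at the end of the command line, type net.ifnames=0 biosdevname=0 and press Enter (optional; skipping it does no harm).
4. Partition in this order: BIOS Boot 2048MiB, /boot 2048MiB, / 1000GiB, swap 4096MiB, /home all remaining space on the system disk.
5. Restore the hostname, IP, the qroot user, the root password and the crontab jobs; run timedatectl set-timezone Asia/Shanghai; reboot.
6. Restore fstab, create the matching mount directories for the remaining disks, reboot.
7. Turn off swap with swapoff /dev/mapper/centos-swap, then comment out the following boot-time mount entry in /etc/fstab: /dev/mapper/centos-swap swap swap defaults 0 0
8. Restore the iptables firewall: yum install iptables-services, systemctl disable firewalld, systemctl enable iptables, sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config, reboot.
9. Change the SSH port to 2420 and set up passwordless SSH between all cluster nodes (this can be done earlier to make copying files between nodes easier).
10. yum -y update, yum install vim -y, yum install lrzsz -y
11. Install Java: yum localinstall java-1.8.0-openjdk-headless-1.8.0.101-3.b13.el7_2.x86_64.rpm and yum localinstall java-1.8.0-openjdk-1.8.0.101-3.b13.el7_2.x86_64.rpm
12. Install Zabbix and restore zabbix_agentd.conf and userparameter_mysql.conf: rpm -ivh zabbix-agent-3.2.11-1.el7.x86_64.rpm, then systemctl enable zabbix-agent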
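
The fstab edit in step 7 can be previewed instead of done by hand in an editor. This is a sketch under the assumption that swap entries are identified by `swap` in the third (filesystem type) column, as in the entry quoted above. The function name `comment_swap` is hypothetical, and the function only prints the result rather than modifying any file.

```shell
# comment_swap: read an fstab on stdin and print it with every active swap
# entry commented out. Lines that are already comments are left untouched.
comment_swap() {
  awk '
    /^[ \t]*#/   { print; next }          # existing comment: keep as is
    $3 == "swap" { print "#" $0; next }   # third column is the fs type
    { print }                             # everything else unchanged
  '
}
```

For example, `comment_swap < /etc/fstab > /tmp/fstab.new` writes a candidate file that can be reviewed with diff before it replaces /etc/fstab.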
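IV. Bringing the node back online
1. Restore /opt/cm-5.8.1 and start the CDH agent: /opt/cm-5.8.1/etc/init.d/cloudera-scm-agent start
2. On the CDH management page, Hosts shows the newly started node with a green status (wait a few minutes after starting the CDH agent).
3. On the CDH management page, Hosts, add new hosts to the cluster and follow the wizard.
4. Add role instances to the new node as needed; this time I added hdfs, hbase and yarn node for now.
5. CDH management page: hdfs, Balancer, Actions, Rebalance
6. Restore /etc/alternatives/spark2-conf and recreate the symlink: ln -s /etc/alternatives/spark2-conf /etc/spark2/conf
7. Restore geomesa-hbase-distributed-runtime_2.11-2.3.0.jar and mysql-connector-java-5.1.38-bin under /opt/cloudera/parcels/CDH/lib/hbase/lib

This shows that taking a node offline does not affect the data, and neither does reinstalling the OS. If time is short, keep the existing data: it is faster, and the cluster quickly deletes the excess replicas. If time allows, wipe the data, which also rebalances the data as a side effect.

Problems encountered
1. Spark jobs misbehaved; steps IV-6 and IV-7 fixed them.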

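2019.10.18
Requirement: replicate data from the company's Kafka to Alibaba Cloud Kafka using MirrorMaker
Pitfalls:
1. Adding the CMM Kafka role instance in CDH: it does not appear to support SSL connections.
2. VPC access: I did not know that the purchased Alibaba Cloud instance had a VPC network; that connection is not SSL-encrypted.
3. MirrorMaker from Kafka 0.10.2 could not connect to the self-hosted cluster.
4. The Alibaba Cloud console labels the endpoint as an SSL endpoint, but the authentication actually requires SASL_SSL.
5. Not knowing Java, I did not know where to put export KAFKA_OPTS="-Djava.security.auth.login.config=kafka_client_jaas.conf"
6. ssl.truststore.location=kafka.client.truststore.jks
ssl.truststore.password=KafkaOnsClient  The certificate needs a full path, and this password is fixed; I had used a different password.
Preparation:
1. Download kafka_2.12-2.2.1.tgz, one minor version higher than the one Alibaba Cloud recommends.
2. Download kafka.client.truststore.jks; ask Alibaba Cloud for it, or use the download link in Alibaba Cloud's documentation.
3. Create kafka_client_jaas.conf by hand; its contents are given below.
Deployment:
1. Make sure the server can reach port 9092 on the self-hosted cluster and port 9093 on the Alibaba Cloud cluster.
2. Upload and unpack kafka_2.12-2.2.1.tgz (do not configure ZooKeeper here; Kafka does not need to be started).
3. Create kafka_client_jaas.conf in the config directory (under the Kafka extraction directory).
4. Create a cert directory and upload the kafka.client.truststore.jks certificate into it (under the Kafka extraction directory).
5. vim /etc/profile and add at the very bottom: export KAFKA_OPTS="-Djava.security.auth.login.config=xxxxxx/kafka_client_jaas.conf" (use the real directory here)
6. Edit kafka_client_jaas.conf, consumer.properties and producer.properties.
7. Start it in the background: nohup bin/kafka-mirror-maker.sh --consumer.config config/consumer.properties --producer.config config/producer.properties --whitelist AIS_11_AisMySql,AIS_99_FWP &
8. Check that messages arrive in the target topic.
Configuration file contents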
1. kafka_client_jaas.conf

 // Get the username and password from the Alibaba Cloud console
 KafkaClient {
     org.apache.kafka.common.security.plain.PlainLoginModule required
     username="xxxxxx"
     password="xxxxxx";
 };

2. consumer.properties

  # list of brokers used for bootstrapping knowledge about the rest of the cluster
  #format: host1:port1,host2:port2 ...
  bootstrap.servers=自建集群ip:9092
  #consumer group id
  group.id=test-consumer-group
  #consumer partition assignment strategy
  partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
  #What to do when there is no initial offset in Kafka or if the current
  #offset does not exist any more on the server: latest, earliest, none
  #auto.offset.reset=

3. producer.properties

############################# Producer Basics #############################
# list of brokers used for bootstrapping knowledge about the rest of the cluster
# format: host1:port1,host2:port2 ...
bootstrap.servers=阿里云集群ip:9093

# specify the compression codec for all data generated: none, gzip, snappy, lz4, zstd
compression.type=none

# name of the partitioner class for partitioning events; default partition spreads data randomly
#partitioner.class=

# the maximum amount of time the client will wait for the response of a request
#request.timeout.ms=

# how long `KafkaProducer.send` and `KafkaProducer.partitionsFor` will block for
#max.block.ms=

# the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together
#linger.ms=

# the maximum size of a request in bytes
#max.request.size=

# the default batch size in bytes when batching multiple records sent to a partition
#batch.size=

# the total bytes of memory the producer can use to buffer records waiting to be sent to the server
#buffer.memory=
ssl.truststore.location=/application/kafka/cert/kafka.client.truststore.jks
ssl.truststore.password=KafkaOnsClient
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
ssl.endpoint.identification.algorithm=
# The preceding line (ssl.endpoint.identification.algorithm=) is only needed with Kafka 2.x.x and later
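
Several of the pitfalls above came down to a missing or wrong security setting on the producer side. A quick sanity check before starting MirrorMaker might look like the sketch below. The list of required keys is taken from the producer.properties shown above; the function name `missing_sasl_ssl_keys` is made up for this example, and the check only looks at whether a key is present, not whether its value is right.

```shell
# missing_sasl_ssl_keys: read a producer.properties on stdin and print every
# SASL_SSL-related key that is absent or commented out.
# Prints nothing when all four keys are set.
missing_sasl_ssl_keys() {
  awk -F'=' '
    /^[ \t]*#/ { next }    # commented-out settings do not count
    NF >= 2    { key = $1; gsub(/[ \t]/, "", key); seen[key] = 1 }
    END {
      n = split("security.protocol sasl.mechanism ssl.truststore.location ssl.truststore.password", need, " ")
      for (i = 1; i <= n; i++)
        if (!(need[i] in seen)) print need[i]
    }
  '
}
```

For example: `missing_sasl_ssl_keys < config/producer.properties`.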

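2019.07.23
Requirement: add a data source to the existing Filebeat instances and deploy Filebeat on a new server
Procedure:
1. Adding a data source
Filebeat installation directories
C:\Program Files\filebeat1\filebeat
C:\Program Files\filebeat2\filebeat

Edit filebeat.yml and add this configuration
- input_type: log
  paths:
    - D:\DataServer\ExactEarth\OriginalData*
  document_type: AIS_20_ExactAis
  scan_frequency: 100s
  close_inactive: 5m
  ignore_older: 2h
  clean_removed: true
  close_removed: true
  clean_inactive: 3h

Restart filebeat in the Services manager

PS: the company runs separate internal-network and external-network big data clusters, so two Filebeat instances run side by side, filebeat1 and filebeat2; any configuration change has to be added to both.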

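2. Redeploying Filebeat
The program was copied over from the original server.
Filebeat installation directories
C:\Program Files\filebeat1
C:\Program Files\filebeat2

Registering the service
Running the script directly in PowerShell fails with: File ******.ps1 cannot be loaded because running scripts is disabled on this system.
Run PowerShell as administrator and execute Set-ExecutionPolicy RemoteSigned

Edit install-service-filebeat.ps1 and change the service name to filebeat1 or filebeat2
New-Service -name filebeat1 -displayName filebeat1

Run PowerShell as administrator, execute cd 'C:\Program Files\filebeat1' (the path contains a space, so it needs single quotes), then execute .\install-service-filebeat.ps1 to complete the service registration

Edit the configuration file filebeat.yml

Filebeat input configuration

- input_type: log
  paths:
    - d:\SpirePrediction\server\DataLogs*
  document_type: AIS_9_Spire_ANJI
  scan_frequency: 100s
  close_inactive: 5m
  ignore_older: 2h
  clean_removed: true
  close_removed: true
  clean_inactive: 3h

Filebeat output to Kafka
output.kafka:
  # The Kafka broker hosts
  hosts: ["xxx.xxx.xxx.xxx:9092"]
  enabled: true
  codec.format:
    string: '%{[message]}'
  topic: '%{[type]}'
  required_acks: 1
  max_message_bytes: 1000000
  bulk_max_size: 10240

Restart filebeat in the Services manager

PS: the company runs separate internal-network and external-network big data clusters, so two Filebeat instances run side by side, filebeat1 and filebeat2; any configuration change has to be added to both.

3. Set the PowerShell execution policy back to the default:
Set-ExecutionPolicy Restricted

4. There is also a linecount program to configure. It is an in-house program, so it is not described in detail. It counts the lines of every file in the Filebeat data sources and sends the counts to Kafka, mainly to check whether Kafka is receiving the data correctly. Its configuration is as follows: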

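2019.07.23
Requirement: add a new data source to Kafka on the internal and external big data platforms
Procedure:
1. List the existing Kafka topics
Internal: /usr/local/kafka/bin/kafka-topics.sh --zookeeper host5:2188/kafka --list
External: kafka-topics --zookeeper master3:2181 --list
PS: the internal cluster runs the stock Apache release with no environment variables configured; the external one is CDH, which sets them up automatically.

2. Create the new topic
Internal: /usr/local/kafka/bin/kafka-topics.sh --zookeeper host5:2188/kafka --create --topic AIS_20_ExactAis --replication-factor 2 --partitions 1
External: kafka-topics --zookeeper master3:2181 --create --topic AIS_20_ExactAis --replication-factor 2 --partitions 1

3. The data source was connected incorrectly, so the topic's data had to be cleared; I chose to delete and recreate the topic
Internal commands
vi /usr/local/kafka/config/server.properties and confirm the following settings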

#A comma seperated list of directories under which to store log files 
log.dirs=/usr/local/kafka_data

#Topic deletion (without delete.topic.enable=true in server.properties the topic is only marked for deletion)
delete.topic.enable=true

#Disable automatic topic creation
auto.create.topics.enable=false

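/usr/local/kafka/bin/kafka-topics.sh --delete --zookeeper host5:2188/kafka --topic AIS_20_ExactAis

External commands
In the CDH management UI, Clusters, kafka, Configuration: search for the delete-topic setting (enable it) and the auto-create-topic setting (disable it)
kafka-topics --delete --zookeeper master3:2181 --topic AIS_20_ExactAis

PS: judging from this operation, deleting a topic is not hard; these two commands are enough (possibly because there were no consumers; if there are any, stop them first). I did not run into the case where the topic cannot be deleted and has to be removed with the ZooKeeper client.
4. If a topic cannot be deleted normally, remove Kafka's metadata for it from ZooKeeper
Open the ZooKeeper client and delete the topic's nodes
./zkCli.sh -server master1:2181

Find the topic node
ls /brokers/topics
Delete the topic
rmr /brokers/topics/topic-name
If the topic is marked for deletion, find it with ls /admin/delete_topics and then delete it:
Find the node
ls /admin/delete_topics
Delete the node
rmr /admin/delete_topics/[topic name]

Find the node
ls /config/topics
Delete the topic
rmr /config/topics/topic-name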
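
Because a typo in one of these paths removes the wrong znode, it can be safer to generate the three commands from the topic name and review them before pasting them into zkCli. The sketch below only prints text and never talks to ZooKeeper. The function name `zk_topic_cleanup` and its optional second argument (a chroot such as the `/kafka` suffix used by the internal cluster) are assumptions made for this example.

```shell
# zk_topic_cleanup: print the zkCli commands that remove a topic's znodes.
# $1 = topic name (required), $2 = optional ZooKeeper chroot, e.g. /kafka
zk_topic_cleanup() {
  topic="${1:-}"
  chroot="${2:-}"
  # Refuse an empty name: "rmr /brokers/topics/" would target every topic.
  [ -n "$topic" ] || { echo "usage: zk_topic_cleanup TOPIC [CHROOT]" >&2; return 1; }
  for base in /brokers/topics /admin/delete_topics /config/topics; do
    printf 'rmr %s%s/%s\n' "$chroot" "$base" "$topic"
  done
}
```

For example, `zk_topic_cleanup AIS_20_ExactAis` prints the three `rmr` lines for that topic. On ZooKeeper 3.5 and later, `rmr` is deprecated in favour of `deleteall`, so the printed verb may need adjusting there.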

5. After creating the topic, use the command from step 1 to confirm that it exists; on CDH it can also be checked through the Kafka charts
6. Check that data is arriving
Internal: /usr/local/kafka/bin/kafka-console-consumer.sh --zookeeper host10:2188/kafka --topic AIS_9_Spire_ANJI
External: kafka-console-consumer --zookeeper master3:2181 --topic AIS_9_Spire_ANJI

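2019.07.15
In the company's internal big data environment, running hdfs dfsadmin -report on any node of the cluster showed one node under Dead datanodes (1), with Last contact: Mon Jun 24 16:40:00 CST 2019.
Procedure:
1. Asked the previous engineer whether the node had been taken out on purpose. It had not.
2. Given how HDFS works, after being offline this long the re-replication of its blocks should already have finished, so the data is not at risk and there is no rush.
3. Logged in to the faulty node and ran hadoop-daemon.sh start datanode; the status did not recover.
4. hadoop-root-datanode-xxx.log showed that loading the /sdb directory failed during startup.
5. df -h looked normal.
6. cd /sdb1 followed by ls returned Input/output error.
7. Running ll in / showed broken information for the sdb1 directory: a row of question marks (??????????).
8. After umount, mount failed with mount: /dev/sdb1: can't read superblock
9. The filesystem is XFS, so I tried xfs_repair; it failed (I no longer remember the exact message).
10. fdisk -l did not show the disk, although it was visible under /dev.
11. cat /proc/partitions did show it.
12. Backed up and edited /usr/local/hadoop/etc/hadoop/hdfs-site.xml, removing file:/sdb from dfs.datanode.data.dir so that it reads:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/dfs,file:/sdd,file:/sdc</value>
</property>

13. Ran hadoop-daemon.sh start datanode; the log showed no errors, but the status still had not recovered (the DataNode needs some time to scan the data on the remaining disks and report it to the NameNode).
14. After a while, hdfs dfsadmin -report showed the node as recovered (from here the cluster gradually verifies the data and deletes the excess replicas; to speed this up, adjust the block report parameters yourself).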

Dead datanodes (1):

Name: 192.168.1.170:50010 (xxx)
Hostname:
Decommission Status : Normal
Configured Capacity: 23952530427904 (21.78 TB)
DFS Used: 7689460762544 (6.99 TB)
Non DFS Used: 592783763536 (552.07 GB)
DFS Remaining: 15668718210960 (14.25 TB)
DFS Used%: 32.10%
DFS Remaining%: 65.42%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Mon Jun 24 16:40:00 CST 2019

The same node after it recovered (with /sdb removed, Configured Capacity is lower):

Name: 192.168.1.170:50010 (xxx)
Hostname: xxx
Decommission Status : Normal
Configured Capacity: 17954062262272 (16.33 TB)
DFS Used: 6108392244325 (5.56 TB)
Non DFS Used: 603698887579 (562.24 GB)
DFS Remaining: 11241853299200 (10.22 TB)
DFS Used%: 34.02%
DFS Remaining%: 62.61%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Mon Jul 15 16:04:30 CST 2019
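
The dead node here went unnoticed for three weeks, so a periodic check that lists dead DataNodes with their last contact time would have helped. The following is a sketch, assuming the report groups nodes under `Live datanodes (N):` and `Dead datanodes (N):` headers and that each node block has `Name:` and `Last contact:` lines as in the output above. The function name `dead_datanodes` is hypothetical.

```shell
# dead_datanodes: read `hdfs dfsadmin -report` output on stdin and print
# "address<TAB>last contact" for every node listed in the Dead section.
dead_datanodes() {
  awk '
    /^Live datanodes/ { dead = 0 }            # entering the live section
    /^Dead datanodes/ { dead = 1 }            # entering the dead section
    dead && /^Name:/  { name = $2 }           # remember ip:port of this node
    dead && /^Last contact:/ {
      sub(/^Last contact: */, "")             # keep only the timestamp
      print name "\t" $0
    }
  '
}
```

For example, `hdfs dfsadmin -report | dead_datanodes` could be run from cron and its output mailed whenever it is not empty.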
