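Troubleshooting HAWQ segments that fail to start on k8s

Symptom

HAWQ can create tables but cannot perform writes (inserts, updates).

Root cause

No segment is available.

[Screenshot image-20200814161244403: no available segment]

Solution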

echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf
sysctl -p
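
Note that >> appends a new line every time the command runs, so rerunning it leaves duplicate kernel.sem entries in /etc/sysctl.conf. A minimal sketch of an idempotent variant follows; ensure_line is a hypothetical helper name introduced here for illustration, not part of any tool used in this post.

```shell
#!/bin/sh
# ensure_line FILE LINE
# Hypothetical helper: append LINE to FILE only if an identical line
# is not already present. Runs in a subshell so its variables stay local.
ensure_line() (
  file=$1
  line=$2
  # -q: no output, -x: match the whole line, -F: treat LINE as a fixed string
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
)
```

It could be called as ensure_line /etc/sysctl.conf "kernel.sem = 250 512000 100 2048" and followed by sysctl -p as above.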

Troubleshooting process

Approach 1: recompile HAWQ

Official guide

Install dependencies

wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# For CentOs 7 the link is https://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-9.noarch.rpm
rpm -ivh epel-release-latest-7.noarch.rpm
yum makecache
# On redhat7, make sure enabled rhel-7-server-extras-rpms and rhel-7-server-optional-rpms channel in /etc/yum.repos.d/redhat.repo
# Otherwise yum will prompt some packages(e.g. gperf) not be found
yum install -y man passwd sudo tar which git mlocate links make bzip2 net-tools \
  autoconf automake libtool m4 gcc gcc-c++ gdb bison flex gperf maven indent \
  libuuid-devel krb5-devel libgsasl-devel expat-devel libxml2-devel \
  perl-ExtUtils-Embed pam-devel python-devel libcurl-devel snappy-devel \
  thrift-devel libyaml-devel libevent-devel bzip2-devel openssl-devel \
  openldap-devel protobuf-devel readline-devel net-snmp-devel apr-devel \
  libesmtp-devel python-pip json-c-devel \
  java-1.7.0-openjdk-devel lcov cmake3 \
  openssh-clients openssh-server perl-JSON perl-Env
 
# need tomcat6 if enable-rps
# download from http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.44/
 
ln -s /usr/bin/cmake3 /usr/bin/cmake
pip --retries=50 --timeout=300 install pycrypto

Dependency image

registry.cn-chengdu.aliyuncs.com/sunwu/hadoop-with-hawqpackage

Build HAWQ

With the necessary dependencies installed and Hadoop ready, the next step is to get the code and build HAWQ.

# The Apache HAWQ source code can be obtained from the following link:
# Apache Repo: https://git-wip-us.apache.org/repos/asf/hawq.git or
# GitHub Mirror: https://github.com/apache/hawq.
git clone https://git-wip-us.apache.org/repos/asf/hawq.git

# The code directory is hawq.
CODE_BASE=`pwd`/hawq
  
cd $CODE_BASE
  
# Run command to generate makefile.
./configure
 
# Or you could use --prefix=/hawq/install/path to change the Apache HAWQ install path, 
# and you can also add some optional components using options (--with-python --with-perl)
# For El Capitan (Mac OS 10.11), you may need to do: export CPPFLAGS="-I/usr/local/include -L/usr/local/lib" if the configure cannot find some components
./configure --prefix=/hawq/install/path --with-python --with-perl
 
# If you need to Enable RPS for Ranger Integration
export CATALINA_HOME=/tomcat/install/path
./configure --prefix=/hawq/install/path --enable-rps
 
# You can also run the command with --help for more configuration.
./configure --help
 
 
# Run command to build and install
# To build concurrently , run make with -j option. For example, make -j8
# On Linux system without large memory, you will probably encounter errors like
# "Error occurred during initialization of VM" and/or "Could not reserve enough space for object heap"
# and/or "out of memory", try to set vm.overcommit_memory = 1 temporarily, and/or avoid "-j" build,
# and/or add more memory and then rebuild.
# On mac os, you will probably see this error: "'openssl/ssl.h' file not found".
# "brew link openssl --force" should be able to solve the issue.
make -j8
  
# Install HAWQ
make install

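The build did not succeed, so I abandoned this approach and went back to the root cause.

Approach 2: work from the symptoms to the cause

Use hawq init master and hawq init segment to investigate.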

hawq init master

Error 1:
ERROR: failed to list directory hdfs://localhost:8020/hawq_default or it is not empty

Cause:

Hadoop is not running, or the directory already exists.

Fix:

  • Start Hadoop: start-dfs.sh

  • Delete the directory: hdfs dfs -rm -r /hawq_default

Error 2:
[ERROR]:-Data directory /home/bigdata/hawq-data-directory/masterdd is not empty on 58acfd266eba

Cause:

The data directory already exists and is not empty.

Fix:

rm -rf /home/bigdata/hawq-data-directory/masterdd

After this, hawq init master runs without problems.

hawq init segment

Error 1:

[ERROR]:-Data directory /home/bigdata/hawq-data-directory/segmentdd is not empty on 58acfd266eba

Fix:

 rm -rf /home/bigdata/hawq-data-directory/segmentdd
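
Both "is not empty" errors come from data left behind by an earlier init attempt. A small pre-check before rerunning hawq init could be sketched as follows; dir_not_empty and report_stale_dirs are hypothetical helper names, and the paths you pass in would be the ones from the error messages above.

```shell
#!/bin/sh
# dir_not_empty DIR
# Hypothetical helper: succeeds if DIR exists and contains at least one
# entry (hidden files included); fails otherwise.
dir_not_empty() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# report_stale_dirs DIR...
# Print every given directory that still has content and would therefore
# trigger the "is not empty" error.
report_stale_dirs() {
  for d in "$@"; do
    if dir_not_empty "$d"; then
      echo "not empty: $d"
    fi
  done
}
```

For example, report_stale_dirs /home/bigdata/hawq-data-directory/masterdd /home/bigdata/hawq-data-directory/segmentdd lists whichever of the two still needs cleaning. The sketch only reports; deleting is left as an explicit manual step.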

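Error 2:

[ERROR]:-Postgres initdb failed

Cause:

My guess was a PostgreSQL problem, so I turned to PostgreSQL itself.

PG

After some digging, the likely cause was that the database had never been initialized.

Database initialization

Running initdb fails with:

Failed system call was semget(48, 17, 03600)

semget is the System V semaphore call, so I turned to the Linux kernel parameters managed by sysctl.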
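
To read the failing call: semget takes a key, the number of semaphores in the set, and a flags word, so semget(48, 17, 03600) asks for a new set of 17 semaphores. The kernel.sem setting holds four limits, in the order SEMMSL, SEMMNS, SEMOPM, SEMMNI. The sketch below only labels those values and decodes the octal flags; the function names are hypothetical, and the error message above does not say which of the limits was actually hit.

```shell
#!/bin/sh
# describe_sem "SEMMSL SEMMNS SEMOPM SEMMNI"
# Label the four fields of kernel.sem in the order the kernel prints them.
describe_sem() (
  set -- $1                 # split the single argument on whitespace
  echo "SEMMSL=$1"          # max semaphores per set
  echo "SEMMNS=$2"          # max semaphores system-wide
  echo "SEMOPM=$3"          # max operations per semop() call
  echo "SEMMNI=$4"          # max number of semaphore sets
)

# decode_semget_flags FLAGS
# Decode the third argument of semget(); a leading 0 means octal.
# IPC_CREAT is 01000 and IPC_EXCL is 02000; the low nine bits are the mode.
decode_semget_flags() (
  flags=$(( $1 ))
  if [ $(( flags & 01000 )) -ne 0 ]; then echo "IPC_CREAT"; fi
  if [ $(( flags & 02000 )) -ne 0 ]; then echo "IPC_EXCL"; fi
  printf 'mode=%04o\n' $(( flags & 0777 ))
)
```

Calling decode_semget_flags 03600 shows that the call creates a new, exclusive set with mode 0600, and describe_sem "$(cat /proc/sys/kernel/sem)" shows the limits the container currently sees.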

sysctl

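Raise the kernel semaphore limits:

sysctl -w kernel.sem="250 512000 100 2048"

sysctl -w applies the value immediately but does not persist it. To keep it, put the same setting in /etc/sysctl.conf (as in the solution above) and reload the file:

sysctl -p

Verify:

sysctl -a|grep sem

Output like the following means it worked: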

[root@48ce8e4e3f79 /]# sysctl -a|grep sem
kernel.sem = 250	512000	100	2048
kernel.sem_next_id = -1
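
The same check can be scripted rather than read by eye. A minimal sketch, assuming the target values used in this post; sem_at_least is a hypothetical helper name.

```shell
#!/bin/sh
# sem_at_least "CURRENT" "REQUIRED"
# Hypothetical helper: succeeds if each of the four kernel.sem fields in
# CURRENT is greater than or equal to the matching field in REQUIRED.
sem_at_least() (
  current=$1
  required=$2
  set -- $current           # split on spaces or tabs
  c1=$1; c2=$2; c3=$3; c4=$4
  set -- $required
  [ "$c1" -ge "$1" ] && [ "$c2" -ge "$2" ] &&
    [ "$c3" -ge "$3" ] && [ "$c4" -ge "$4" ]
)
```

On a running system it could be used as sem_at_least "$(cat /proc/sys/kernel/sem)" "250 512000 100 2048" && echo ok, which works with the tab-separated form shown in the output above.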

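Dockerfile

Because the sysctl command cannot be used at Docker build time, the setting is written into the image and applied by the entrypoint script when the container starts:

# Set the parameter
RUN echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf

# Create the entrypoint script
RUN echo "sysctl -p " >> /entrypoint.sh
RUN echo "/usr/sbin/sshd -D " >>  /entrypoint.sh
RUN chmod +x  /entrypoint.sh

# Set the entrypoint
ENTRYPOINT  /entrypoint.sh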

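Error on k8s

The container is not allowed to modify kernel parameters:

[Screenshot image-20200814172414983: error when changing kernel parameters inside the container]

Fix:

Run the container in privileged mode by adding the following to the container spec in the ReplicaSet:

          securityContext:
            privileged: true

The complete ReplicaSet: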

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: {{ .Values.name }}
  labels:
    app: {{ .Values.name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image.hub }}/{{ .Values.image.namespace }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          securityContext:
            privileged: true

The fixed Dockerfile

FROM akshays/hawq_hadoop:latest

RUN  yum makecache fast && yum -y install openssh-server openssh-clients passwd

RUN sed -i "s/#PermitRootLogin no/PermitRootLogin yes/g" /etc/ssh/sshd_config
RUN sed -i "s/#Port 22/Port 22/g" /etc/ssh/sshd_config
RUN sed -i "s/PermitRootLogin without-password/PermitRootLogin yes/g" /etc/ssh/sshd_config


RUN echo root|passwd --stdin root

RUN mkdir -p /var/run/sshd
RUN /usr/bin/ssh-keygen -A
RUN echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf
RUN echo "sysctl -p " >> /entrypoint.sh

RUN echo "/usr/sbin/sshd -D " >>  /entrypoint.sh
RUN chmod +x  /entrypoint.sh
ENTRYPOINT  /entrypoint.sh

Test

Log in to the container

su bigdata
start-dfs.sh

start-dfs.sh asks "Are you sure you want to continue connecting (yes/no)?"; answer yes.

Verify

jps

Output like the following means success:

# Confirm that it works
[bigdata@5fd15f58ab9b /]$ jps
481 SecondaryNameNode
268 DataNode
620 Jps
159 NameNode

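Set environment variables

source  /home/bigdata/hawq/hawq-2.0.0/greenplum_path.sh

Start the cluster

hawq start cluster

A successful start looks like this:

[Screenshot image-20200814164151216: output of a successful hawq start cluster]

Verify the cluster

hawq state

Output like the following means success: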

[bigdata@2116fb98c099 /]$ hawq state 
--- many log lines omitted ---
...	Total segment valid (at master)        = 1
...	Total segment failures (at master)     = 0
--- many log lines omitted ---
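
Since the original symptom was "no segment available", the number to watch is the valid segment count. A minimal sketch for pulling it out of the output, assuming the line format shown above stays the same; valid_segments is a hypothetical helper name.

```shell
#!/bin/sh
# valid_segments
# Hypothetical helper: read "hawq state" output on stdin and print the
# number after "=" on the first "Total segment valid" line.
valid_segments() {
  awk -F'=' '/Total segment valid/ {
    gsub(/[ \t]/, "", $2)   # strip the padding around the number
    print $2
    exit
  }'
}
```

It would be used as hawq state | valid_segments; a result of 0, or no output at all, means the segment is still not registered with the master.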

Run a few queries to verify correctness

psql -d postgres
create table t ( i int );
insert into t values(5555555);
insert into t select generate_series(1,9);
select count(*) from t;
select * from t limit 10;

A successful run looks like this:

[Screenshot image-20200814164250733: query results]
