Recently, to handle queries over 3 billion+ line-item records, I tried Presto.
Option 1: use Deepgreen, optimizing table distribution and building indexes
Option 2: use Hadoop + Presto
Here is a recap of how the Hadoop cluster was set up:
1.1 Prepare the machines
10.1.240.183 base0183
10.1.240.184 base0184
10.1.240.185 base0185
10.1.240.186 base0186
10.1.240.187 base0187
Create the user and group:
groupadd hadoopGroup
useradd -g hadoopGroup hadoop
passwd hadoop
1.2 Download the installation files
Pick a version from http://mirror.bit.edu.cn/apache/hadoop/common/; hadoop-2.7.5.tar.gz is used here.
Download the matching JDK from the Oracle website; jdk-8u161-linux-x64.tar.gz is used here.
1.3 Set up SSH
Hadoop uses SSH to start the daemons on the hosts in the slaves list.
Generate a key pair with the RSA method (run in ~):
ssh-keygen -t rsa -P ""
Press Enter; the files are generated in /home/hadoop/.ssh
Append id_rsa.pub to authorized_keys: cat .ssh/id_rsa.pub >> .ssh/authorized_keys
Generate a key pair on each slave as well: ssh-keygen -t rsa -P ""
Copy the master's authorized_keys to each slave:
scp ~/.ssh/authorized_keys hadoop@base0183:~/.ssh/
scp ~/.ssh/authorized_keys hadoop@base0184:~/.ssh/
scp ~/.ssh/authorized_keys hadoop@base0185:~/.ssh/
scp ~/.ssh/authorized_keys hadoop@base0186:~/.ssh/
Test the SSH trust: ssh hadoop@base0183
It works if no password is required anymore.
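The per-slave distribution above can also be sketched as a small loop; ssh-copy-id (where available) appends the public key to the remote authorized_keys for you. The host list is the one from this setup, and the remote call is left commented so the sketch is safe to run anywhere:

```shell
# Distribute the master's public key to every slave in one loop.
# ssh-copy-id appends ~/.ssh/id_rsa.pub to the remote authorized_keys.
SLAVES="base0183 base0184 base0185 base0186"
for host in $SLAVES; do
  echo "copying key to $host"
  # ssh-copy-id hadoop@"$host"   # uncomment to run against the real hosts
done
```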
1.4 Configure the hostnames. Only the master's hostname is changed here (to master); this step is optional as long as the hostnames used in the Hadoop configuration files below are adjusted accordingly.
hostname master
vi /etc/hosts and add the following entries:
10.1.240.183 base0183
10.1.240.184 base0184
10.1.240.185 base0185
10.1.240.186 base0186
10.1.240.187 master
1.5 Install and configure the JDK
Hadoop is written in Java; building Hadoop and running MapReduce both require a JDK.
tar -zxvf jdk-8u161-linux-x64.tar.gz
[root@base0187 ~]# mkdir -p /usr/local/java
[root@base0187 ~]# chown hadoop:hadoopGroup /usr/local/java
vi .bash_profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_161
export JRE_HOME=/usr/local/java/jdk1.8.0_161/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
source .bash_profile
Verify with java -version:
[hadoop@master ~]$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Copy the JDK directory to each slave, to the same location as on the master:
scp -r /usr/local/java/jdk1.8.0_161 hadoop@base0183:/usr/local/java
scp -r /usr/local/java/jdk1.8.0_161 hadoop@base0184:/usr/local/java
scp -r /usr/local/java/jdk1.8.0_161 hadoop@base0185:/usr/local/java
scp -r /usr/local/java/jdk1.8.0_161 hadoop@base0186:/usr/local/java
1.6 Install Hadoop
Create /opt/hadoop as the Hadoop installation directory, and three directories under /data for the HDFS configuration below:
mkdir -p /opt/hadoop
mkdir -p /data/hdfs/name
mkdir -p /data/hdfs/data
mkdir -p /data/hdfs/tmp
chown hadoop:hadoopGroup /opt/hadoop
chown -R hadoop:hadoopGroup /data/hdfs
Run the commands above on every node.
Unpack the installer: tar -zxvf hadoop-2.7.5.tar.gz -C /opt/hadoop
Hadoop environment variables:
vi .bash_profile
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.5
export JAVA_HOME=/usr/local/java/jdk1.8.0_161
export JRE_HOME=/usr/local/java/jdk1.8.0_161/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin
source .bash_profile
Configure etc/hadoop/hadoop-env.sh. If the default ${JAVA_HOME} reference is kept here, startup fails on some systems; use the absolute path instead.
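A sketch of that hadoop-env.sh fix; the paths are the ones used in this setup, so adjust them if your layout differs:

```shell
# Replace the ${JAVA_HOME} placeholder in hadoop-env.sh with the absolute JDK path.
HADOOP_ENV=/opt/hadoop/hadoop-2.7.5/etc/hadoop/hadoop-env.sh
if [ -f "$HADOOP_ENV" ]; then
  sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/java/jdk1.8.0_161|' "$HADOOP_ENV"
fi
```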
Configure etc/hadoop/core-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<!-- Address of the NameNode -->
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<!-- Base directory for files that Hadoop generates at runtime -->
<name>hadoop.tmp.dir</name>
<value>file:/data/hdfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
Configure etc/hadoop/mapred-site.xml (if it does not exist, create it from mapred-site.xml.template):
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
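Note that mapred.job.tracker is an MRv1 (JobTracker) setting; on Hadoop 2.x the MapReduce framework otherwise defaults to local execution. A commonly used alternative on a YARN cluster (an assumption, not part of the original setup) is:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN instead of the MRv1 JobTracker -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```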
Configure etc/hadoop/hdfs-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:9001</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
dfs.namenode.name.dir determines where on the local filesystem the DFS NameNode stores the name table (fsimage).
If this is a comma-separated list of directories, the name table is replicated into all of them, for redundancy.
dfs.datanode.data.dir determines where on the local filesystem a DFS DataNode stores its blocks.
If this is a comma-separated list of directories, data is stored in all of the named directories, typically on different devices. The directories should be tagged with their storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies.
The default storage type is DISK if a directory has no explicit tag. Directories that do not exist are created if local filesystem permissions allow.
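As an illustration of the comma-separated form described above (the extra paths and the SSD volume are hypothetical, not part of this setup):

```xml
<!-- Example only: redundant fsimage copies and two storage-type-tagged data volumes -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/data/hdfs/name,file:/backup/hdfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file:/ssd/hdfs/data,[DISK]file:/data/hdfs/data</value>
</property>
```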
Add the slave hostnames (the DataNodes) to the slaves file:
[hadoop@master hadoop]$ cat slaves
base0183
base0184
base0185
base0186
Copy the Hadoop directory to each slave, to the same location as on the master:
scp -r /opt/hadoop/hadoop-2.7.5/ hadoop@base0183:/opt/hadoop/
scp -r /opt/hadoop/hadoop-2.7.5/ hadoop@base0184:/opt/hadoop/
scp -r /opt/hadoop/hadoop-2.7.5/ hadoop@base0185:/opt/hadoop/
scp -r /opt/hadoop/hadoop-2.7.5/ hadoop@base0186:/opt/hadoop/
Copy .bash_profile to each slave, to the same location as on the master:
scp .bash_profile hadoop@base0183:~/
scp .bash_profile hadoop@base0184:~/
scp .bash_profile hadoop@base0185:~/
scp .bash_profile hadoop@base0186:~/
On each slave, make sure the copied directories are owned by the hadoop user:
chown hadoop:hadoopGroup /opt/hadoop
chown -R hadoop:hadoopGroup /data/hdfs
1.7 Start the cluster
cd /opt/hadoop/hadoop-2.7.5/bin
./hdfs namenode -format # format the HDFS filesystem
[hadoop@master ~]$ hdfs namenode -format
18/03/12 10:45:43 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = master/10.1.240.187
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.5
STARTUP_MSG: classpath = /opt/hadoop/hadoop-2.7.5/etc/
cd /opt/hadoop/hadoop-2.7.5/sbin
./start-all.sh
Check the connection status on the NameNode:
hdfs dfsadmin -report
Open http://10.1.240.187:50070 in a browser.
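Another quick sanity check is jps on every node: the master should show NameNode, SecondaryNameNode and ResourceManager, each slave DataNode and NodeManager. A sketch, with the remote call left commented so it only runs on the real cluster:

```shell
# List the Hadoop JVM processes on every node with jps.
NODES="master base0183 base0184 base0185 base0186"
for host in $NODES; do
  echo "--- $host ---"
  # ssh hadoop@"$host" '/usr/local/java/jdk1.8.0_161/bin/jps'   # uncomment on the real cluster
done
```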
1.8 Test the setup with the bundled wordcount example
Create an arbitrary text file, e.g. andy_test.txt, and upload it to HDFS:
hdfs dfs -mkdir /input
hdfs dfs -put andy_test.txt /input
Run the wordcount demo:
hadoop jar /opt/hadoop/hadoop-2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount /input /output
This failed with:
java.net.NoRouteToHostException: No route to host
It turned out that the firewall was still running on node base0184:
service iptables status
service iptables start
service iptables stop
To keep it disabled across reboots: chkconfig iptables off
[root@base0184 ~]# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num target prot opt source destination
1 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
2 ACCEPT icmp -- 0.0.0.0/0 0.0.0.0/0
3 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
4 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:22
5 REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
Chain FORWARD (policy ACCEPT)
num target prot opt source destination
1 REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
[root@base0184 ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@base0184 ~]#
Run it again; this time it succeeds. Check the result:
[hadoop@master mapreduce]$ hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 2 hadoop supergroup 0 2018-03-12 12:58 /output/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 303 2018-03-12 12:58 /output/part-r-00000
[hadoop@master mapreduce]$ hdfs dfs -cat /output/part-r-00000
And 1
Give 3
above 1....
With that, the cluster is up and running!
Next up: viewing the cluster's default configuration.