Companion textbook
实战大数据(Hadoop+Spark+Flink)从平台构建到交互式数据分析(离线/实时) — JD.com
CentOS 7 is used as the example throughout.
1. Install CentOS 7
2. Configure a static IP
vi /etc/sysconfig/network-scripts/ifcfg-ens33
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.88.111
NETMASK=255.255.255.0
GATEWAY=192.168.88.2
DNS1=192.168.88.2
service network restart
3. Hostnames and host mappings
vi /etc/hosts
192.168.88.111 hadoop01
192.168.88.112 hadoop02
192.168.88.113 hadoop03
4. Disable the firewall
systemctl disable firewalld
systemctl stop firewalld
5. Create the user and group
groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop
visudo    # add: hadoop ALL=(ALL) NOPASSWD: ALL
6. Clone the virtual machine twice and rename the clones
7. Passwordless SSH
ssh-keygen -t rsa                             # on all three servers
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop01     # on all three servers
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop02
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop03
scp -r authorized_keys 192.168.88.113:~/.ssh/
8. Set the hostnames
hostnamectl set-hostname hadoop01   # on hadoop01
hostnamectl set-hostname hadoop02   # on hadoop02
hostnamectl set-hostname hadoop03   # on hadoop03
9. Configure chrony
Master (hadoop01): vi /etc/chrony.conf
server ntp.aliyun.com
local stratum 10
allow 192.168.88.0/24
Slaves (hadoop02/03): vi /etc/chrony.conf
server 192.168.88.111
10. Write the cluster scripts
On 01:
mkdir tools
vi deploy.conf
# cluster role plan
hadoop01,master,all,zookeeper,namenode,datanode
hadoop02,slave,all,zookeeper,namenode,datanode
hadoop03,slave,all,zookeeper,datanode
vi deploy.sh
#!/bin/bash
# Distribute a file or directory to every host whose deploy.conf entry contains the given tag.
if [ $# -lt 3 ]
then
  echo "too few arguments"
  exit 1
fi
src=$1
dest=$2
tag=$3
if [ ','$4',' == ',,' ]
then
  confFile=/home/hadoop/tools/deploy.conf
else
  confFile=$4
fi
if [ -f $confFile ]
then
  if [ -f $src ]
  then
    for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
    do
      scp $src $server":"${dest}
    done
  elif [ -d $src ]
  then
    for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
    do
      scp -r $src $server":"${dest}
    done
  else
    echo "source file does not exist"
  fi
else
  echo "confFile does not exist"
fi
vi runRemoteCmd.sh
#!/bin/bash
# Run a command over SSH on every host whose deploy.conf entry contains the given tag.
if [ $# -lt 2 ]
then
  echo "too few arguments"
  exit 1
fi
cmd=$1
tag=$2
if [ 'a'$3'a' == 'aa' ]
then
  confFile=/home/hadoop/tools/deploy.conf
else
  confFile=$3
fi
if [ -f $confFile ]
then
  for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
  do
    echo "*****************$server*********************"
    ssh $server "source ~/.bashrc; $cmd"
  done
else
  echo "confFile does not exist"
fi
root: vi /etc/profile.d/mytools.sh
export PATH=$PATH:/home/hadoop/tools
As hadoop:
chmod u+x deploy.sh
chmod u+x runRemoteCmd.sh
runRemoteCmd.sh "mkdir /home/hadoop/data" all
runRemoteCmd.sh "mkdir /home/hadoop/app" all
Hadoop cluster setup
1. Install the Java JDK
Upload the installation package and unpack it.
Delete the original archive.
Configure the environment:
ln -s jdk1.8.0_51 jdk
vim ~/.bashrc
JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:/home/hadoop/tools:$PATH
export JAVA_HOME CLASSPATH PATH
source ~/.bashrc
Verify:
[hadoop@hadoop01 app]$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
Installation succeeded.
deploy.sh jdk1.8.0_51 /home/hadoop/app/ slave    # push the JDK to the slave nodes
Create the jdk symlink on the slave nodes.
Set the environment variables on the slave nodes:
02, 03: vim ~/.bashrc
JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH
ZooKeeper installation
Upload and unpack the package, and create a symlink named zookeeper.
Configure ZooKeeper:
cd zookeeper/conf
Upload the zoo.cfg configuration file into zookeeper/conf.
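zoo.cfg itself is not reproduced in these notes. A minimal sketch of what it contains, assuming the data/log directories created a few steps below and the standard ports:
# zoo.cfg (sketch; the server ids match the myid values written below)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/data/zookeeper/zkdata
dataLogDir=/home/hadoop/data/zookeeper/zkdatalog
clientPort=2181
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888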
Distribute zookeeper to the slave nodes.
Create the symlink on the slave nodes.
Create the data and log directories on all nodes:
01:
runRemoteCmd.sh "mkdir -p /home/hadoop/data/zookeeper/zkdata" all
runRemoteCmd.sh "mkdir -p /home/hadoop/data/zookeeper/zkdatalog" all
Write the ZooKeeper server id into zkdata:
on 01, 02 and 03, enter /home/hadoop/data/zookeeper/zkdata
01,02,03: touch myid
01: echo 1 > myid
02: echo 2 > myid
03: echo 3 > myid
Start ZooKeeper:
01: runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh start" all
Check the processes with jps:
[hadoop@hadoop01 ~]$ jps
1750 QuorumPeerMain
1782 Jps
Run runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh status" all; output like the following means the ensemble is healthy:
[hadoop@hadoop01 ~]$ runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh status" all
*****************hadoop01*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
*****************hadoop02*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: leader
*****************hadoop03*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
HDFS and YARN installation
Upload and unpack the package, and create the symlink.
Copy the provided core*, hadoop*, hdfs* and slaves configuration files
over the ones in /home/hadoop/app/hadoop-2.9.1/etc/hadoop.
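The configuration files themselves come with the course material and are not copied into these notes. As a rough sketch, the HA-related part of core-site.xml looks something like the following; the nameservice name mycluster is taken from the hdfs://mycluster/... paths used later, the tmp dir is an assumption:
<!-- core-site.xml sketch (assumed values) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data/tmp</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
  </property>
</configuration>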
Distribute to the slave nodes:
deploy.sh hadoop-2.9.1 /home/hadoop/app/ slave
Create the symlink on the slave nodes:
02, 03, in app: ln -s hadoop-2.9.1 hadoop
HDFS is now installed.
Test it.
Start the JournalNodes:
01 (one single command): runRemoteCmd.sh "/home/hadoop/app/hadoop/sbin/hadoop-daemon.sh start journalnode" all
Format:
in the hadoop directory (i.e. /home/hadoop/app/hadoop):
bin/hdfs namenode -format
bin/hdfs zkfc -formatZK
Start the NameNode:
bin/hdfs namenode
Set 02 to follow as the standby:
02 (also in the hadoop directory): bin/hdfs namenode -bootstrapStandby
Stop the JournalNodes:
01: runRemoteCmd.sh "/home/hadoop/app/hadoop/sbin/hadoop-daemon.sh stop journalnode" all
One-command start:
01: sbin/start-dfs.sh
jps output:
3088 JournalNode
3377 DFSZKFailoverController
3874 Jps
1750 QuorumPeerMain
2760 NameNode
2873 DataNode
Check the HA state:
01:
bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2
In the Edge browser on the Windows host, enter the address in the address bar,
check the processes,
and test the web UI.
In the hadoop directory, write a test file: vim wd.txt (fill in some words yourself)
bin/hdfs dfs -ls /                 # list the file system
bin/hdfs dfs -mkdir /text          # create a directory
bin/hdfs dfs -put wd.txt /text     # upload
bin/hdfs dfs -cat /text/wd.txt     # view
Test passed.
YARN installation
Upload the mapred-site and yarn-site configuration files.
Distribute them to the other nodes:
01:
deploy.sh yarn-site.xml /home/hadoop/app/hadoop/etc/hadoop slave
deploy.sh mapred-site.xml /home/hadoop/app/hadoop/etc/hadoop slave
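The yarn-site.xml from the course material is not reproduced here. A sketch of the ResourceManager-HA part it needs: rm1/rm2 match the rmadmin commands below and hadoop01/hadoop02 match the two 8088 web UIs; everything else is an assumption.
<!-- yarn-site.xml sketch (assumed values) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop01</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop02</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>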
One-command start of YARN:
sbin/start-yarn.sh
Start the standby ResourceManager on 02:
02: sbin/yarn-daemon.sh start resourcemanager
Check the HA state:
bin/yarn rmadmin -getServiceState rm1
bin/yarn rmadmin -getServiceState rm2
Edit the hosts file on the local Windows machine:
C:\Windows\System32\drivers\etc\hosts gets the same mappings as on Linux.
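That is, append:
192.168.88.111 hadoop01
192.168.88.112 hadoop02
192.168.88.113 hadoop03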
Open the web pages (in Edge):
http://192.168.88.111:8088/ http://192.168.88.112:8088/
http://192.168.88.112:8088/ automatically redirects to hadoop01.
Installation succeeded.
Test YARN:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar wordcount /text/wd.txt /text/out
The job shows up in the web UI; it ran successfully.
bin/hdfs dfs -ls /text
You can see the output directory:
bin/hdfs dfs -ls /text/out
You can see the output files:
bin/hdfs dfs -cat /text/out/*
This shows the final word-count result.
The same result is also visible in the web UI.
Installation complete.
MapReduce and WordCount
Build a Java project in IDEA.
Open Maven's conf directory and edit settings.xml.
Add:
<profile>
  <id>development</id>
  <activation>
    <jdk>8</jdk>
    <activeByDefault>true</activeByDefault>
  </activation>
  <properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
  </properties>
</profile>
Search Bing for hadoop, open Getting Started, open the MapReduce Tutorial, and copy the source code from Source Code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Create a WordCount class at the same level as the App class and paste the code in.
Add a sources directory under main.
Add the log4j resource file;
its contents:
log4j.rootLogger = debug,stdout
### print log output to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
Add to the pom:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.1</version>
</dependency>
Rebuild (reload the Maven project).
At the same level as src, create an input directory, add a txt file with some words in it, and copy the directory path.
Edit the WordCount run configuration (program arguments),
for example:
D:\26284\Documents\JAVA\hadoop\in\TXT.txt D:\26284\Documents\JAVA\hadoop\in\out
Install the Windows patch file into C:\Windows\System32 and run it:
winutils.exe
Run the program.
Finally, package it.
File → Project Structure → Artifacts → new JAR → select the whole package → OK → Build Artifacts.
Or:
open cmd,
cd into the project,
mvn clean package
Upload the jar to the Hadoop directory.
bin/hadoop jar wc.jar com.hadoop.WordCount /text/wd.txt /text/out2    # jar name, package name, class name
Success.
Managing multiple MapReduce jobs with Maven
Create a TextDriver class:
package com.hadoop;

import org.apache.hadoop.util.ProgramDriver;

public class TextDriver {

  public static void main(String argv[]) {
    int exitCode = -1;
    ProgramDriver pgd = new ProgramDriver();
    try {
      pgd.addClass("wordcount", WordCount.class,
          "A map/reduce program that counts the words in the input files.");
      /***********************************************************************************/
      exitCode = pgd.run(argv);
    } catch (Throwable e) {
      e.printStackTrace();
    }
    System.exit(exitCode);
  }
}
In the pom, add the following after </dependencies>:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.5.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.hadoop.TextDriver</mainClass>
              </transformer>
            </transformers>
            <createDependencyReducedPom>false</createDependencyReducedPom>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
Package with Maven and upload the jar.
bin/hadoop jar hadoop-1.0-SNAPSHOT.jar wordcount /text/wd.txt /text/out3
Success.
Building the data storage and exchange layer
HBASE
Upload, unpack, and create the symlink.
Upload the configuration files.
Distribute with the script:
deploy.sh hbase-1.2.0 /home/hadoop/app/ slave
Create the symlinks.
Update the environment variables on 01, 02 and 03:
vi ~/.bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

JAVA_HOME=/home/hadoop/app/jdk
ZOOKEEPER_HOME=/home/hadoop/app/zookeeper
HADOOP_HOME=/home/hadoop/app/hadoop
HBASE_HOME=/home/hadoop/app/hbase
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME

# User specific aliases and functions
source ~/.bashrc
Start HBase:
bin/start-hbase.sh
jps
HMaster and HRegionServer should be running.
hadoop01:16010  hadoop02:16010    # web UIs
Test from the shell:
bin/hbase shell
status    # check the cluster status
version   # show the version

Basic operations
Enter the HBase shell:
hbase shell
Show help:
help
List the tables in the current database:
list

Table operations
Create a table:
create 'student','info'
Insert data:
put 'student','1001','info:sex','male'
put 'student','1001','info:age','18'
put 'student','1002','info:name','Janna'
put 'student','1002','info:sex','female'
put 'student','1002','info:age','20'
Scan the table:
scan 'student'
scan 'student',{STARTROW => '1001', STOPROW => '1001'}
scan 'student',{STARTROW => '1001'}
Describe the table:
describe 'student'
Update a column:
put 'student','1001','info:name','Nick'
put 'student','1001','info:age','100'
Get a row, or a column family:column, by key:
get 'student','1001'
get 'student','1001','info:name'
Count the rows:
count 'student'
Alter the table so the info column family keeps 3 versions:
alter 'student',{NAME=>'info',VERSIONS=>3}
get 'student','1001',{COLUMN=>'info:name',VERSIONS=>3}
Delete data:
deleteall 'student','1001'          # delete an entire rowkey
delete 'student','1002','info:sex'  # delete one column of a rowkey
Truncate the table (note: truncate disables the table first, then truncates it):
truncate 'student'
Drop the table (it must be disabled first):
disable 'student'
drop 'student'

Namespace operations
List namespaces:
list_namespace
Create a namespace:
create_namespace 'bigdata'
Create a table in the new namespace:
create 'bigdata:student','info'
Drop a namespace (only an empty one; drop its tables first):
drop_namespace 'bigdata'
KAFKA
Upload, unpack, and create the symlink.
Upload the configuration files into config.
Distribute:
deploy.sh kafka_2.12-1.1.1 /home/hadoop/app/ slave
Create the symlinks.
Edit server.properties under config:
change broker.id=1
to the number of the corresponding machine,
e.g. 2 for hadoop02 and 3 for hadoop03.
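Apart from broker.id, the server.properties used here presumably points at the ZooKeeper ensemble and a local log directory. A sketch (only broker.id is taken from these notes; the other values are assumptions):
# server.properties (sketch)
broker.id=1                                   # 2 on hadoop02, 3 on hadoop03
listeners=PLAINTEXT://hadoop01:9092           # the local hostname on each node
log.dirs=/home/hadoop/data/kafka-logs
zookeeper.connect=hadoop01:2181,hadoop02:2181,hadoop03:2181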
Installation done.
Start Kafka (on each of the three nodes):
01: bin/kafka-server-start.sh config/server.properties
02: bin/kafka-server-start.sh config/server.properties
03: bin/kafka-server-start.sh config/server.properties
Test.
Create a topic
named djt:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic djt --replication-factor 3 --partitions 3
List topics:
bin/kafka-topics.sh --zookeeper localhost:2181 --list
Show topic details:
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic djt
Start a consumer:
01 bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic djt
Start a producer and send messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic djt
KAFKA cluster monitoring
sudo yum install -y unzip zip
Upload kafka-manager.
Unzip it:
unzip kafka-manager-2.0.0.2.zip
Create the symlink:
ln -s kafka-manager-2.0.0.2 kafka-manager
Go into the kafka-manager directory,
then into conf,
and change these lines in application.conf:
kafka-manager.zkhosts="kafka-manager-zookeeper:2181"
basicAuthentication.username="admin"
basicAuthentication.password="password"
basicAuthentication.enabled=false
to:
01:
kafka-manager.zkhosts="192.168.88.111:2181"
basicAuthentication.username="yhzyh"
basicAuthentication.enabled=true
basicAuthentication.password="yhzyh"
Start kafka-manager,
from the kafka-manager directory:
bin/kafka-manager -Dhttp.port=9999
Open in the browser:
http://192.168.88.111:9999/
Add a new cluster:
Cluster Name
my-kafka-cluster
Cluster Zookeeper Hosts
hadoop01:2181
Kafka Version
1.1.1
Flume
Upload, unpack, and enter the flume directory,
then enter conf:
mv flume-conf.properties.template flume-conf.properties
Distribute:
deploy.sh apache-flume-1.8.0-bin /home/hadoop/app slave
Create the symlink:
ln -s apache-flume-1.8.0-bin flume
Start Flume in the foreground (for testing only; skip for now):
bin/flume-ng agent -n agent -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console
Upload the taildir configuration file to conf on hadoop01.
Upload the avro configuration file to conf on 02 and 03.
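The two course-provided Flume configs are not copied into these notes. Roughly, hadoop01 tails sogou.log and forwards events over Avro, while hadoop02/03 receive Avro events and log them. A simplified single-sink sketch of the hadoop01 side; the agent name agent1 and the log path come from the commands below, while the port and the remaining settings are assumptions:
# taildir-file-selector-avro.properties (simplified sketch)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
agent1.sources.r1.type = TAILDIR
agent1.sources.r1.filegroups = f1
agent1.sources.r1.filegroups.f1 = /home/hadoop/data/flume/logs/sogou.log
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoop02
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1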
On 01:
cd
cd data
mkdir flume
cd flume
mkdir logs
cd logs
vi sogou.log       # write "yhzyh", save and quit
Start Flume on 02 and 03:
02, 03: bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-logger.properties -Dflume.root.logger=INFO,console
01:
bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
Once everything is up,
prepare test data:
on 01, go to data/flume/logs and append data to sogou.log:
echo "hadoop1112" >> sogou.log
The line is printed on 02 or 03.
Done.
FLUME and KAFKA
Upload the configuration file avro-file-selector-kafka.properties to conf on 02 and 03.
Start Flume on 02 and 03, then start Flume on 01.
Create the topic on 01:
In the kafka directory: bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic sogoulogs --replication-factor 3 --partitions 3
Start Flume on 02 and 03:
02, 03: bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-kafka.properties -Dflume.root.logger=INFO,console
01
bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
Start a Kafka console consumer,
in the kafka directory: bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sogoulogs
Once everything is up,
prepare test data:
on 01, go to data/flume/logs and append data to sogou.log:
echo "hadoop1112" >> sogou.log
The line shows up in the Kafka consumer.
Done.
FLUME and HBASE
Upload flume-ng-hbase-sink-1.8.0.jar to the lib directory on 02 and 03.
In the configuration file avro-file-selector-hbase.properties, set
agent1.sinks.k1.table = sogoulogs
Upload avro-file-selector-hbase.properties to the conf directory on 02 and 03.
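For orientation, a rough sketch of what the sink side of that file looks like; the table name comes from the line above and the column family from the create statement further down, while the port and serializer class are assumptions (the course file ships its own serializer inside flume-ng-hbase-sink-1.8.0.jar):
# avro-file-selector-hbase.properties (sink side, simplified sketch)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
agent1.sources.r1.type = avro
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 4545
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = hbase
agent1.sinks.k1.table = sogoulogs
agent1.sinks.k1.columnFamily = info
agent1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent1.sinks.k1.channel = c1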
Update ~/.bashrc on 01, 02 and 03 as follows:
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ZOOKEEPER_HOME=/home/hadoop/app/zookeeper
HADOOP_HOME=/home/hadoop/app/hadoop
HBASE_HOME=/home/hadoop/app/hbase
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions
source ~/.bashrc
Installation succeeded.
Test:
go to the hbase directory
and enter the shell:
bin/hbase shell
Create the table:
create 'sogoulogs','info'
Start Flume on 02 and 03:
bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-hbase.properties -Dflume.root.logger=INFO,console
Start Flume on 01:
bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
Start the HBase shell:
hbase: bin/hbase shell
Prepare test data:
on 01, go to data/flume/logs and append data to sogou.log:
echo "hadoop1112" >> sogou.log
Check HBase:
scan 'sogoulogs'
HBase, Kafka and Flume together
Upload avro-file-selector-hbase-kafka.properties to conf on 02 and 03.
Installation done.
Test:
02, 03: bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-hbase-kafka.properties -Dflume.root.logger=INFO,console
01: bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
Start the HBase shell:
hbase: bin/hbase shell
Start a Kafka console consumer,
in the kafka directory: bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sogoulogs
On 01, go to data/flume/logs and append data to sogou.log:
echo "hadoop1112" >> sogou.log
Check HBase:
scan 'sogoulogs'
Flume is done.
HIVE
MySQL installation and configuration
On 01, as root:
yum install -y wget
Remove any installed MariaDB packages:
rpm -qa|grep mariadb|xargs rpm -e --nodeps
Download the repo package:
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
Install the repo information with rpm:
rpm -ivh mysql-community-release-el7-5.noarch.rpm
Install MySQL with yum:
yum install mysql-server
During the installation, answer y to every confirmation prompt.
Check that the installation finished:
rpm -qa | grep mysql
Output here means the installation is complete.
Check whether MariaDB was replaced:
rpm -qa | grep mariadb
No output means MariaDB has been replaced successfully.
systemctl stop mysqld.service      # stop mysql
systemctl start mysqld.service     # start mysql
systemctl restart mysqld.service   # restart mysql
systemctl enable mysqld.service    # start mysql on boot
Here we
enable it on boot
and start MySQL.
Log in to MySQL:
mysql -u root -p
root has no password yet, so just press Enter.
Set the root password:
set password for root@localhost=password('root');
The password is root.
Quit and log back in.
Create the hive user:
create user 'hive' identified by 'hive';
Grant hive all privileges:
grant all on *.* to 'hive'@'hadoop01' identified by 'hive';
grant all on *.* to hive@'%' identified by "hive";
Flush privileges:
flush privileges;
Log in as the hive user:
mysql -u hive -p
Password:
hive
Allow remote login for root:
GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "root";
Log in:
mysql -h hadoop01 -u hive -p
Install Hive.
Upload the apache-hive package (5.3 material) to the app directory on 01.
Unpack it, create a symlink named hive, and delete the archive.
Upload hive-site.xml and hive-env.sh to hive's conf directory.
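hive-site.xml comes with the course material; the part that matters for this setup is the MySQL metastore connection, which looks roughly like this (the hive/hive user and the hadoop01 host follow from the MySQL steps above, the rest is an assumption):
<!-- hive-site.xml sketch (metastore connection only) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>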
Upload the mysql-connector jar into the lib directory.
As root:
vi /etc/my.cnf
and add
skip_ssl
Save and quit.
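skip_ssl typically goes under the existing [mysqld] section, e.g.:
[mysqld]
skip_ssl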
systemctl restart mysqld
Installation done.
Initialize the metastore,
in the hive directory: bin/schematool -dbType mysql -initSchema
Run Hive,
in the hive directory: bin/hive
and enter:
show databases;
hive> show databases;
OK
default
Time taken: 0.243 seconds, Fetched: 1 row(s)
hive>
This output means Hive works.
Hive and HBase
Go to HBase's lib directory and copy the following jars:
cp hbase-client-1.2.0.jar /home/hadoop/app/hive/lib/
cp hbase-common-1.2.0.jar /home/hadoop/app/hive/lib/
cp hbase-server-1.2.0.jar /home/hadoop/app/hive/lib/
cp hbase-common-1.2.0-tests.jar /home/hadoop/app/hive/lib/
cp hbase-protocol-1.2.0.jar /home/hadoop/app/hive/lib/
cp htrace-core-3.1.0-incubating.jar /home/hadoop/app/hive/lib/
cp zookeeper-3.4.6.jar /home/hadoop/app/hive/lib/
Enter Hive:
bin/hive
Enter:
CREATE EXTERNAL TABLE sogoulogs(
  id string,
  datatime string,
  userid string,
  searchname string,
  retorder string,
  cliorder string,
  cliurl string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:datatime,info:userid,info:searchname,info:retorder,info:cliorder,info:cliurl")
TBLPROPERTIES ("hbase.table.name" = "sogoulogs");
Query:
select * from sogoulogs limit 5;
Setup complete.
spark
Upload to the app directory on 01, unpack, and create a symlink named spark.
In the spark directory:
vi djt.log
Fill it with any content.
Start Spark:
bin/spark-shell
spark wordcount
As follows:
val line = sc.textFile("/home/hadoop/app/spark/djt.log")
line.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
This prints the word counts for djt.log.
Build a Spark WordCount project with IDEA and Maven.
Add the following dependencies to the pom:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.1</version>
</dependency>
Install the Scala plugin.
Create a scala directory under main and mark it as a source root.
Inside it,
create a Scala file named MyScalaWordCout:
import org.apache.spark.{SparkConf, SparkContext}

object MyScalaWordCout {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: MyWordCount <input> <output>")
      System.exit(1)
    }
    val input = args(0)
    val output = args(1)
    val conf = new SparkConf().setAppName("myWordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile(input)
    val resultRdd = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    resultRdd.saveAsTextFile(output)
    sc.stop()
  }
}
Add a resources directory under main and add
a log4j.properties file:
log4j.rootLogger = debug,stdout
### print log output to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
At the same level as the project files, create a test file with some content of your own, e.g. djt.txt.
Set the program arguments, for example:
F:\apache-flume-1.8.0-src\flume-ng-sinks\testspark\djt.txt F:\apache-flume-1.8.0-src\flume-ng-sinks\testspark\out
Run it; the output is produced successfully.
Package with mvn clean package.
SPARK STANDALONE mode
Upload the configuration files from the 6.3 material into conf.
Back in the app directory, distribute to 02 and 03:
deploy.sh spark-2.3.1-bin-hadoop2.7 /home/hadoop/app slave
Create the symlinks on 02 and 03.
Edit spark-env.sh in conf
and append at the bottom:
HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop
LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
Start the Spark cluster:
01: sbin/start-all.sh
On 01, jps shows
Worker and Master.
02: sbin/start-master.sh
02 now also has a Master.
Open in the browser:
hadoop01:8888
hadoop02:8888
Package with Maven and upload the jar to hadoop/shell/lib.
Make sure a test directory containing djt.txt exists in HDFS.
In the spark directory:
spark: bin/spark-submit --master spark://hadoop01:7077,hadoop02:7077 --class com.hadoop.MyScalaWordCout /home/hadoop/shell/lib/testSpark-1.0-SNAPSHOT.jar /test/djt.txt /test/output1
spark on yarn
In conf:
vi spark-env.sh
add:
HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop
In the spark directory:
bin/spark-submit --master yarn --class com.hadoop.MyScalaWordCout /home/hadoop/shell/lib/testSpark-1.0-SNAPSHOT.jar /test/djt.txt /test/output1
spark streaming
As root,
enter MySQL:
GRANT CREATE ON *.* TO 'hive'@'%';
Quit,
then log in as the hive user:
create database test;
use test;
create table newscount ( name varchar(50) not null, count int(11) not null );
create table periodcount ( logtime varchar(50) not null, count int(11) not null );
Unpack the 6.4 code (learningspark).
Then,
on hadoop01, under /home/hadoop:
mkdir shell
cd shell
mkdir lib
mkdir data
mkdir bin
Package learningspark with mvn,
take the jar
and upload it to
/home/hadoop/shell/lib on hadoop01.
Write a sogoulogs.log file yourself,
put data in it, and upload it to /home/hadoop/shell/data.
Edit the 6.4 script to:
#!/bin/sh
home=$(cd `dirname $0`; cd ..; pwd)

. ${home}/bin/common.sh

echo "start analog data ****************"
java -cp ${lib_home}/learningspark.jar com.hadoop.java.AnalogData ${data_home}/sogoulogs.log /home/hadoop/data/flume/logs/sogou.log
Upload it to /home/hadoop/shell/bin.
Then
go to /home/hadoop/shell/bin:
vi common.sh
#!/bin/sh
home=$(cd `dirname $0`; cd ..; pwd)

bin_home=$home/bin
conf_home=$home/conf
logs_home=$home/logs
data_home=$home/data
lib_home=$home/lib

flume_home=/home/hadoop/app/flume
kafka_home=/home/hadoop/app/kafka
mv sogoulogs.sh sogoulogs1.sh
cat sogoulogs1.sh > sogoulogs.sh
rm -f sogoulogs1.sh
chmod u+x sogoulogs.sh
Open learningspark in IDEA.
Start the Flume services on 02, 03 and 01.
Run the sogoulogs.sh script on 01 and check the results in MySQL:
./sogoulogs.sh
Spark with Hive, MySQL and HBase
With Hive:
On 01, edit hive-site.xml in hive's conf and set
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoop01:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
Save, then:
cp hive-site.xml /home/hadoop/app/spark/conf
On 01, go into lib:
cp mysql-connector-java-5.1.38.jar /home/hadoop/app/spark/jars
With MySQL running,
on 01 go to
/home/hadoop/shell/data
vim course.txt
001 hadoop
002 storm
003 spark
004 flink
Go to the hive directory and start the metastore,
on 01: bin/hive --service metastore
On 01, enter Hive:
bin/hive
create database djt;
use djt;
create table if not exists course(
  cid string,
  name string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS textfile;
load data local inpath "/home/hadoop/shell/data/course.txt" into table course;
select * from course;
The result looks like this (the name column is NULL because the file is not tab-separated):
hive> select * from course;
OK
001 hadoop	NULL
002 storm	NULL
003 spark	NULL
004 flink	NULL
Time taken: 1.703 seconds, Fetched: 4 row(s)
On 01, go to the spark directory:
bin/spark-shell
spark.sql("select * from djt.course").show
The result:
+----------+----+
|       cid|name|
+----------+----+
|001 hadoop|null|
| 002 storm|null|
| 003 spark|null|
| 004 flink|null|
+----------+----+
On 01, run:
bin/spark-sql
use djt;
select * from course;
The result:
001 hadoop	NULL
002 storm	NULL
003 spark	NULL
004 flink	NULL
Time taken: 2.236 seconds, Fetched 4 row(s)
With MySQL:
Insert a row into MySQL:
use test
insert into newscount value ('yhzyh',123)
On 01, in the spark directory:
bin/spark-shell
:paste
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.88.111:3306/test")
  .option("dbtable", "newscount")
  .option("user", "hive")
  .option("password", "hive")
  .load()
scala> df.show
The result:
+-----+-----+
| name|count|
+-----+-----+
|yhzyh|  123|
+-----+-----+
Spark with HBase
Go to HBase's lib directory:
cp hbase-client-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-common-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-protocol-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-server-1.2.0.jar /home/hadoop/app/spark/jars/
cp htrace-core-3.1.0-incubating.jar /home/hadoop/app/spark/jars/
cp metrics-core-2.2.0.jar /home/hadoop/app/spark/jars/
cp hive-hbase-handler-2.3.7.jar /home/hadoop/app/spark/jars/
cp mysql-connector-java-5.1.38.jar /home/hadoop/app/spark/jars/
Go to the spark directory:
bin/spark-shell
spark.sql("select * from sogoulogs").show
The result (the table is still empty):
+---+--------+------+----------+--------+--------+------+
| id|datatime|userid|searchname|retorder|cliorder|cliurl|
+---+--------+------+----------+--------+--------+------+
+---+--------+------+----------+--------+--------+------+
Spark offline analysis
Unpack the 6.5 code and open it.
Set the program arguments and run it; that's all.
Spark Structured Streaming real-time analysis
Unpack and open the 6.6 code.
In shell/bin:
./sogoulogs.sh
That's it.
Browser addresses:
192.168.88.111:8081 192.168.88.112:8081 192.168.88.113:8081
flink
Upload, unpack, and create the symlink.
In the flink directory:
vim djt.log
Write some content yourself.
Edit conf/flink-conf.yaml and add:
rest.port: 8083
In the flink directory:
bin/start-scala-shell.sh local
Started successfully.
Test:
val lines = benv.readTextFile("/home/hadoop/app/flink/djt.log");
val wordcounts = lines.flatMap(_.split("\\s+")).map(word => (word,1)).groupBy(0).sum(1);
wordcounts.print()
My result:
scala> wordcounts.print()
(flink,3)
(hadoop,3)
(spark,3)
Flink cluster
flink Standalone
Upload
the 7.2 material.
In the masters configuration file, change 8081 to 8083.
Copy all configuration files into conf.
Edit conf/flink-conf.yaml and add:
rest.port: 8083
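The masters and slaves files from the 7.2 material are not copied into these notes; given the two web UIs opened below and the hadoop02 jobmanager address set next, they presumably look like this (assumed):
# conf/masters (sketch)
hadoop01:8083
hadoop02:8083
# conf/slaves (sketch)
hadoop01
hadoop02
hadoop03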
Upload the 7.2 jar package into lib.
Distribute the installation to the slave nodes:
01app: deploy.sh flink-1.9.1 /home/hadoop/app slave
Create the symlink on the slave nodes:
ln -s flink-1.9.1 flink
On 02, go to flink's conf
and edit flink-conf.yaml:
jobmanager.rpc.address: hadoop02
Start:
01: bin/start-cluster.sh
Open in the browser:
hadoop01:8083 hadoop02:8083
Test:
bin/flink run -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output2
Either output data appearing, or the message "Could not build the program from JAR file.", counts as success here.
flink on yarn
vi ~/.bashrc
Add:
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop
source ~/.bashrc
Run:
bin/yarn-session.sh -n 2 -s 2 -jm 1024 -nm test_flink_cluster
jps
FlinkYarnSessionCli
NameNode
FlinkYarnSessionCli
QuorumPeerMain
On 01, go to the hadoop directory:
bin/yarn application -list | grep test_flink_cluster | awk '{print $1}'
My result:
23/10/19 22:13:21 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
application_1697675641964_0002
Go to the flink directory:
bin/flink run -yid application_1697675641964_0002 -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output4
Go to the hadoop directory:
bin/hdfs dfs -cat /test/output4/*
The second mode,
in the flink directory:
bin/flink run -m yarn-cluster -p 2 -yn 2 -ys 2 -yjm 1024 -ytm 1024 -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output4
Go to the hadoop directory:
bin/hdfs dfs -cat /test/output4/*
FLINK DATASTREAM
Unpack the 7.3 code,
open it, and run the sogoulogs.sh script in shell/bin.
Done.
FLINK DATASET
Unpack the 7.4 code
and just run it.