Hands-On Big Data (Hadoop + Spark + Flink)

Companion notes for the tutorial book 实战大数据(Hadoop+Spark+Flink)从平台构建到交互式数据分析(离线/实时) (available on JD.com).

CentOS 7 is used throughout as the example OS.

1. Install CentOS 7

2. Configure a static IP

vi /etc/sysconfig/network-scripts/ifcfg-ens33

ONBOOT=yes

BOOTPROTO=static

IPADDR=192.168.88.111
NETMASK=255.255.255.0
GATEWAY=192.168.88.2
DNS1=192.168.88.2

service network restart

3. Configure the hostname mapping

vi /etc/hosts
192.168.88.111 hadoop01
192.168.88.112 hadoop02
192.168.88.113 hadoop03
​

4. Disable the firewall

systemctl disable firewalld
systemctl stop firewalld
​

5. Create the hadoop user and group

groupadd hadoop
useradd -g hadoop hadoop
passwd hadoop
visudo
hadoop ALL=(ALL) NOPASSWD: ALL

6. Clone the VM twice and rename the clones

7. Passwordless SSH

ssh-keygen -t rsa    # run on all three servers
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop01    # run on all three servers
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop02
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop03
scp -r authorized_keys 192.168.88.113:~/.ssh/
​

8. Set the hostnames

hostnamectl set-hostname hadoop01
hostnamectl set-hostname hadoop02
hostnamectl set-hostname hadoop03

9. Configure chrony (time synchronization)

Master (hadoop01):
vi /etc/chrony.conf
server ntp.aliyun.com
local stratum 10
allow 192.168.88.0/24
Slaves (hadoop02/03):
server 192.168.88.111
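After editing chrony.conf, restart and enable chronyd on every node so the change takes effect (a quick sanity check with chronyc is optional):

systemctl enable chronyd
systemctl restart chronyd
chronyc sources    # verify the configured time source is reachable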

10. Write the cluster helper scripts

On hadoop01:

mkdir tools
vi deploy.conf

# Cluster role plan
hadoop01,master,all,zookeeper,namenode,datanode
hadoop02,slave,all,zookeeper,namenode,datanode
hadoop03,slave,all,zookeeper,datanode

vi deploy.sh
#!/bin/bash
# Copy a file or directory to every host in deploy.conf whose line matches the given tag.

if [ $# -lt 3 ]
then
  echo "too few arguments"
  exit 1
fi

src=$1
dest=$2
tag=$3

if [ ','$4',' == ',,' ]
then
  confFile=/home/hadoop/tools/deploy.conf
else
  confFile=$4
fi

if [ -f $confFile ]
then
  if [ -f $src ]
  then
    for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
    do
      scp $src $server":"${dest}
    done
  elif [ -d $src ]
  then
    for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
    do
      scp -r $src $server":"${dest}
    done
  else
    echo "source file $src not found"
  fi
else
  echo "config file $confFile not found"
fi

vi runRemoteCmd.sh
#!/bin/bash
# Run a command over SSH on every host in deploy.conf whose line matches the given tag.

if [ $# -lt 2 ]
then
  echo "too few arguments"
  exit 1
fi

cmd=$1
tag=$2

if [ 'a'$3'a' == 'aa' ]
then
  confFile=/home/hadoop/tools/deploy.conf
else
  confFile=$3
fi

if [ -f $confFile ]
then
  for server in `cat $confFile | grep -v '^#' | grep ','$tag',' | awk -F',' '{print $1}'`
  do
    echo "*****************$server*********************"
    ssh $server "source ~/.bashrc; $cmd"
  done
else
  echo "config file $confFile does not exist"
fi
As root:

vi /etc/profile.d/mytools.sh
export PATH=$PATH:/home/hadoop/tools

As the hadoop user:
chmod u+x deploy.sh
chmod u+x runRemoteCmd.sh
runRemoteCmd.sh "mkdir /home/hadoop/data" all
runRemoteCmd.sh "mkdir /home/hadoop/app" all
​

Hadoop Cluster Setup

1. Install the Java JDK

Upload the JDK archive to the app directory and extract it.

Delete the archive afterwards.

Configure the environment variables:

ln -s jdk1.8.0_51 jdk 
vim ~/.bashrc
​
JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:/home/hadoop/tools:$PATH
export JAVA_HOME CLASSPATH PATH
​
​
source ~/.bashrc
​

Verify:

[hadoop@hadoop01 app]$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
Installation succeeded.

Distribute the JDK to the slave nodes:

deploy.sh jdk1.8.0_51 /home/hadoop/app/ slave

Create the jdk symlink on the slave nodes, then set their environment variables.

On hadoop02 and hadoop03:
vim ~/.bashrc
​
​
JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH

ZooKeeper Installation

Upload and extract the ZooKeeper archive, and create a symlink named zookeeper.

Configure ZooKeeper

cd zookeeper/conf
​

Upload the provided zoo.cfg configuration file into zookeeper/conf.
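zoo.cfg ships with the book's materials; a minimal sketch consistent with the data directories created below (the exact values are assumptions) would be:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/data/zookeeper/zkdata
dataLogDir=/home/hadoop/data/zookeeper/zkdatalog
clientPort=2181
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888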

Distribute ZooKeeper to the slave nodes.

Create the symlink on the slave nodes.

Create the ZooKeeper data and log directories on every node. On hadoop01:
runRemoteCmd.sh "mkdir -p /home/hadoop/data/zookeeper/zkdata" all
runRemoteCmd.sh "mkdir -p /home/hadoop/data/zookeeper/zkdatalog" all
​

Write each server's id into a myid file in the zkdata directory.

On hadoop01/02/03, go into /home/hadoop/data/zookeeper/zkdata and create the file:

touch myid

On hadoop01:
echo 1 > myid
On hadoop02:
echo 2 > myid
On hadoop03:
echo 3 > myid

Start ZooKeeper

On hadoop01:
runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh start" all

Check the processes with jps:

[hadoop@hadoop01 ~]$ jps
1750 QuorumPeerMain
1782 Jps

Run runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh status" all; output like the following means the cluster is working:

[hadoop@hadoop01 ~]$ runRemoteCmd.sh "/home/hadoop/app/zookeeper/bin/zkServer.sh status" all
*****************hadoop01*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
*****************hadoop02*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: leader
*****************hadoop03*********************
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
​

HDFS and YARN Installation

Upload and extract the Hadoop archive, and create a symlink.

Copy the provided core-site.xml, hdfs-site.xml, slaves (and hadoop-env.sh) files over the defaults in /home/hadoop/app/hadoop-2.9.1/etc/hadoop.
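These files come with the book's materials. A minimal HA sketch consistent with the nameservice (mycluster) and NameNode ids (nn1/nn2) used later in this guide might look like the following; the property names are standard Hadoop 2.x, but the ports and exact values here are assumptions:

<!-- core-site.xml -->
<property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
<property><name>ha.zookeeper.quorum</name><value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value></property>

<!-- hdfs-site.xml (excerpt) -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>hadoop01:9000</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>hadoop02:9000</value></property>
<property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/mycluster</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>

<!-- slaves -->
hadoop01
hadoop02
hadoop03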

Distribute to the slave nodes:

deploy.sh hadoop-2.9.1 /home/hadoop/app/ slave

Create the symlink on the slave nodes. On hadoop02 and hadoop03, in the app directory:
ln -s hadoop-2.9.1 hadoop

HDFS installation is complete.

Test

Start the JournalNodes. On hadoop01 (one single command):



runRemoteCmd.sh "/home/hadoop/app/hadoop/sbin/hadoop-daemon.sh start journalnode" all



Format the NameNode and the ZKFC znode. Run from the Hadoop home directory (i.e. the hadoop directory under app):


bin/hdfs namenode -format


bin/hdfs zkfc -formatZK

Start the NameNode:

bin/hdfs namenode

Bootstrap hadoop02 as the standby NameNode. On hadoop02 (also from the Hadoop directory):


 bin/hdfs namenode -bootstrapStandby

Stop the JournalNodes. On hadoop01:

runRemoteCmd.sh "/home/hadoop/app/hadoop/sbin/hadoop-daemon.sh stop journalnode" all


Start everything with one command. On hadoop01:

sbin/start-dfs.sh

jps output:

3088 JournalNode
3377 DFSZKFailoverController
3874 Jps
1750 QuorumPeerMain
2760 NameNode
2873 DataNode

Check the NameNode HA state. On hadoop01:


bin/hdfs haadmin -getServiceState nn1




bin/hdfs haadmin -getServiceState nn2

On your Windows machine, open the NameNode web UIs in a browser (e.g. Edge):

http://192.168.88.111:50070/

http://192.168.88.112:50070/

Test HDFS

From the Hadoop directory, create a test file and fill it with some words of your own:

vim wd.txt

bin/hdfs dfs -ls /               # list the file system root

bin/hdfs dfs -mkdir /text        # create a directory

bin/hdfs dfs -put wd.txt /text   # upload the file

bin/hdfs dfs -cat /text/wd.txt   # view it

Test passed.

YARN Installation

Upload the provided mapred-site.xml and yarn-site.xml.
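The book supplies both files. A minimal sketch matching the rm1/rm2 ResourceManager ids used below (standard YARN HA property names; the values are assumptions) might be:

<!-- mapred-site.xml -->
<property><name>mapreduce.framework.name</name><value>yarn</value></property>

<!-- yarn-site.xml (excerpt) -->
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>hadoop01</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>hadoop02</value></property>
<property><name>yarn.resourcemanager.zk-address</name><value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value></property>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>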

Distribute them to every node. On hadoop01:



deploy.sh yarn-site.xml /home/hadoop/app/hadoop/etc/hadoop slave




deploy.sh mapred-site.xml /home/hadoop/app/hadoop/etc/hadoop slave

Start YARN with one command:

sbin/start-yarn.sh

Start the standby ResourceManager. On hadoop02:
sbin/yarn-daemon.sh start resourcemanager

Check the ResourceManager HA state:

bin/yarn rmadmin -getServiceState rm1



bin/yarn rmadmin -getServiceState rm2

Update the hosts file on your own machine: add the same hostname mappings as on Linux to C:\Windows\System32\drivers\etc\hosts.

Open the ResourceManager web UIs in the browser (Edge):

http://192.168.88.111:8088/




http://192.168.88.112:8088/

http://192.168.88.112:8088/ automatically redirects to hadoop01 (the active ResourceManager).

Installation succeeded.

Testing YARN

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar wordcount /text/wd.txt /text/out




The job appears in the web UI and completes successfully.

bin/hdfs dfs -ls /text

shows the output directory;

bin/hdfs dfs -ls /text/out

shows the job output files;

bin/hdfs dfs -cat /text/out/*

shows the final word-count results.

The same results are visible in the web UI.

Installation complete.

MapReduce and WordCount

Build a Java project in IDEA

Open Maven's conf directory, edit settings.xml, and add:

  <profile>
      <id>development</id>
          <activation>
            <jdk>8</jdk>
            <activeByDefault>true</activeByDefault>
          </activation>
      <properties>
            <maven.compiler.source>8</maven.compiler.source>
            <maven.compiler.target>8</maven.compiler.target>
            <maven.compiler.compilerVersion>1.8</maven.compiler.compilerVersion>
      </properties>
 </profile>

Search for Hadoop, open the MapReduce Tutorial in the official documentation, and copy the WordCount source code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Create a WordCount class at the same level as the default App class and paste the code in.

Add a resources directory under main.

Add a log4j.properties file to it with the following content:

log4j.rootLogger = debug,stdout

### log to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

Add the following dependencies to pom.xml:

<dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.1</version>
      </dependency>
    
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.9.1</version>
      </dependency>

      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.9.1</version>
      </dependency>

      <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
      </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.9.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.9.1</version>
    </dependency>

Rebuild the project (reimport Maven so the dependencies resolve).

Create a directory at the same level as src (any name), put a .txt file with some words in it, and copy its path.

Edit the WordCount run configuration and set the program arguments, for example:

D:\26284\Documents\JAVA\hadoop\in\TXT.txt D:\26284\Documents\JAVA\hadoop\in\out

Install the Windows patch file winutils.exe into C:\Windows\System32.

Run the program.

Finally, package the project.

In IDEA: File > Project Structure > Artifacts > add a new JAR (include the whole project) > OK, then Build > Build Artifacts.

Or, from a command prompt (cmd), cd into the project directory and run:

mvn clean package

Upload the JAR to the Hadoop directory on hadoop01 and run:

bin/hadoop jar wc.jar com.hadoop.WordCount /text/wd.txt /text/out2

(wc.jar is the JAR name, com.hadoop the package, WordCount the class.)

Success.

Managing multiple MapReduce jobs with Maven

Create a TextDriver class:

package com.hadoop;

import org.apache.hadoop.util.ProgramDriver;

public class TextDriver {
    public static void main(String argv[]) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
            pgd.addClass("wordcount", WordCount.class,
                    "A map/reduce program that counts the words in the input files.");


            /***********************************************************************************/


            exitCode = pgd.run(argv);
        }
        catch(Throwable e){
            e.printStackTrace();
        }

        System.exit(exitCode);
    }
}

In pom.xml, add the following after </dependencies>:

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.5.1</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <transformers>
                  <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                    <mainClass>com.hadoop.TextDriver</mainClass>
                  </transformer>
                </transformers>
                <createDependencyReducedPom>false</createDependencyReducedPom>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>

Package with Maven, upload the JAR, and run:

bin/hadoop jar hadoop-1.0-SNAPSHOT.jar wordcount /text/wd.txt /text/out3

Success.

Building the Data Storage and Exchange System

HBASE

Upload, extract, and create the symlink.

Upload the provided configuration files.
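The configuration files ship with the book. A minimal sketch of hbase-site.xml for this cluster, plus the regionservers and backup-masters files (the values are assumptions; a backup master on hadoop02 matches the two web UIs opened below), might be:

<!-- hbase-site.xml (excerpt) -->
<property><name>hbase.rootdir</name><value>hdfs://mycluster/hbase</value></property>
<property><name>hbase.cluster.distributed</name><value>true</value></property>
<property><name>hbase.zookeeper.quorum</name><value>hadoop01,hadoop02,hadoop03</value></property>

<!-- regionservers -->
hadoop01
hadoop02
hadoop03

<!-- backup-masters -->
hadoop02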

Distribute with the script:

deploy.sh hbase-1.2.0 /home/hadoop/app/ slave

Create the symlink on the slave nodes.

Update ~/.bashrc on hadoop01/02/03:

vi ~/.bashrc
# .bashrc
​
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
​
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
JAVA_HOME=/home/hadoop/app/jdk
ZOOKEEPER_HOME=/home/hadoop/app/zookeeper
HADOOP_HOME=/home/hadoop/app/hadoop
HBASE_HOME=/home/hadoop/app/hbase
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME
# User specific aliases and functions
​

source ~/.bashrc

Start HBase:

bin/start-hbase.sh

jps should now show HMaster and HRegionServer.

Web UIs:

hadoop01:16010
hadoop02:16010

Shell test

bin/hbase shell
status     # check cluster status
version    # show the version

Basic operations

Enter the HBase shell:
hbase shell
Show help:
help
List the current tables:
list

Table operations

Create a table:
create 'student','info'
Insert data:
put 'student','1001','info:sex','male'
put 'student','1001','info:age','18'
put 'student','1002','info:name','Janna'
put 'student','1002','info:sex','female'
put 'student','1002','info:age','20'
Scan the table:
scan 'student'
scan 'student',{STARTROW => '1001', STOPROW  => '1001'}
scan 'student',{STARTROW => '1001'}
Describe the table:
describe 'student'
Update specific cells:
put 'student','1001','info:name','Nick'
put 'student','1001','info:age','100'
Get a specific row, or a specific "column family:column":
get 'student','1001'
get 'student','1001','info:name'
Count the rows:
count 'student'
Alter the table so the info column family keeps 3 versions:
alter 'student',{NAME=>'info',VERSIONS=>3}
get 'student','1001',{COLUMN=>'info:name',VERSIONS=>3}
Delete data:
1) delete all data for a rowkey:
deleteall 'student','1001'
2) delete one column of a rowkey:
delete 'student','1002','info:sex'
Truncate the table (truncate disables the table first, then empties it):
truncate 'student'
Drop a table:
1) first disable it:
disable 'student'
2) then drop it:
drop 'student'

Namespace operations

List namespaces:
list_namespace
Create a namespace:
create_namespace 'bigdata'
Create a table in the new namespace:
create 'bigdata:student','info'
Drop a namespace (only an empty namespace can be dropped; drop its tables first):
drop_namespace 'bigdata'

KAFKA

Upload, extract, and create the symlink.

Upload the provided server.properties into the config directory.

Distribute:

deploy.sh kafka_2.12-1.1.1 /home/hadoop/app/ slave
​

Create the symlink on the slave nodes.

In config/server.properties on each node, change broker.id to match the node number, e.g. broker.id=2 on hadoop02.
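For reference, a sketch of the relevant server.properties entries (the log directory is an assumption; broker.id and the listeners hostname change per node):

broker.id=1
listeners=PLAINTEXT://hadoop01:9092
log.dirs=/home/hadoop/data/kafka-logs
zookeeper.connect=hadoop01:2181,hadoop02:2181,hadoop03:2181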

Installation done.

Start Kafka (run this on each of the three nodes):

bin/kafka-server-start.sh config/server.properties

Test

Create a topic named djt:

bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic djt --replication-factor 3 --partitions 3

List the topics:

bin/kafka-topics.sh --zookeeper localhost:2181 --list

Show the topic details:

bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic djt

Start a console consumer on hadoop01:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic djt

Start a console producer and send some messages:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic djt
​

Kafka Cluster Monitoring

sudo yum install -y unzip zip

Upload kafka-manager and extract it:

unzip kafka-manager-2.0.0.2.zip

Create the symlink:

ln -s kafka-manager-2.0.0.2 kafka-manager

Go into the kafka-manager directory. In conf/application.conf the defaults are:

kafka-manager.zkhosts="kafka-manager-zookeeper:2181"

basicAuthentication.username="admin"

basicAuthentication.password="password"

basicAuthentication.enabled=false

Change them on hadoop01 to:
kafka-manager.zkhosts="192.168.88.111:2181"
basicAuthentication.username="yhzyh"
basicAuthentication.enabled=true
basicAuthentication.password="yhzyh"

Start kafka-manager from its directory:

bin/kafka-manager -Dhttp.port=9999

Open in the browser:

http://192.168.88.111:9999/

Create a new cluster with:

Cluster Name: my-kafka-cluster
Cluster Zookeeper Hosts: hadoop01:2181
Kafka Version: 1.1.1

Flume

Upload, extract, and enter the flume directory.

In the conf directory:

mv flume-conf.properties.template flume-conf.properties

Distribute:

deploy.sh apache-flume-1.8.0-bin /home/hadoop/app slave

Create the symlink:

ln -s apache-flume-1.8.0-bin flume

Start Flume in the foreground (for testing only; skip this for now):

bin/flume-ng agent -n agent -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console
​

Upload taildir-file-selector-avro.properties into the conf directory on hadoop01, and avro-file-selector-logger.properties into the conf directories on hadoop02/03.
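Both property files ship with the book. A rough sketch of what they contain, assuming a memory channel, avro port 1234 and a load-balancing sink group (all of those values are assumptions):

# taildir-file-selector-avro.properties (hadoop01): tail sogou.log and fan events out over avro
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1 k2
agent1.sources.r1.type = TAILDIR
agent1.sources.r1.positionFile = /home/hadoop/data/flume/taildir_position.json
agent1.sources.r1.filegroups = f1
agent1.sources.r1.filegroups.f1 = /home/hadoop/data/flume/logs/sogou.log
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoop02
agent1.sinks.k1.port = 1234
agent1.sinks.k1.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = hadoop03
agent1.sinks.k2.port = 1234
agent1.sinks.k2.channel = c1
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance

# avro-file-selector-logger.properties (hadoop02/03): receive avro events and print them
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
agent1.sources.r1.type = avro
agent1.sources.r1.bind = 0.0.0.0
agent1.sources.r1.port = 1234
agent1.sources.r1.channels = c1
agent1.channels.c1.type = memory
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1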

On hadoop01, create the log directory and a seed file:

cd
cd data
mkdir flume
cd flume
mkdir logs
cd logs
vi sogou.log

Write something (e.g. yhzyh), save, and exit.

Start the agents on hadoop02 and hadoop03:
bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-logger.properties -Dflume.root.logger=INFO,console
​

Then on hadoop01:

bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
​
​

Everything is now running. Prepare test data: on hadoop01, go to data/flume/logs and append to sogou.log:

echo "hadoop1112" >> sogou.log 
​

The line is printed in the console on hadoop02 or hadoop03. Done.

FLUME + KAFKA

Upload avro-file-selector-kafka.properties into the conf directories on hadoop02/03.

Start Flume on hadoop02/03, then on hadoop01.

On hadoop01, create the topic (from the Kafka directory):
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic sogoulogs --replication-factor 3 --partitions 3

Start Flume on hadoop02/03:
bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-kafka.properties -Dflume.root.logger=INFO,console
​

Then on hadoop01:

bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console
​

Start a Kafka console consumer (from the Kafka directory):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sogoulogs

Once everything is up, prepare test data: on hadoop01, append to data/flume/logs/sogou.log:

echo "hadoop1112" >> sogou.log 
​

The line shows up in the Kafka consumer. Done.

FLUME + HBASE

Upload flume-ng-hbase-sink-1.8.0.jar into the lib directory on hadoop02 and hadoop03.

In avro-file-selector-hbase.properties, set

agent1.sinks.k1.table = sogoulogs

and upload the file into the conf directory on hadoop02 and hadoop03.
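For reference, the HBase sink part of that file is probably along these lines (the serializer class and column family are assumptions; the avro source and memory channel mirror the earlier files):

agent1.sinks.k1.type = hbase
agent1.sinks.k1.table = sogoulogs
agent1.sinks.k1.columnFamily = info
agent1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent1.sinks.k1.channel = c1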

Update ~/.bashrc on hadoop01/02/03 as follows:

# .bashrc
​
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
JAVA_HOME=/home/hadoop/app/jdk
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ZOOKEEPER_HOME=/home/hadoop/app/zookeeper
HADOOP_HOME=/home/hadoop/app/hadoop
HBASE_HOME=/home/hadoop/app/hbase
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$ZOOKEEPER_HOME/bin:$HBASE_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH HADOOP_HOME ZOOKEEPER_HOME HBASE_HOME
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
​
# User specific aliases and functions
​
source ~/.bashrc
​

Setup complete.

Test

Enter the HBase shell:

bin/hbase shell

Create the table:

create 'sogoulogs','info'

Start Flume on hadoop02/03:

bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-hbase.properties -Dflume.root.logger=INFO,console

Start the agent on hadoop01:

bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console

Open the HBase shell (from the HBase directory):

bin/hbase shell

Prepare test data: on hadoop01, append to data/flume/logs/sogou.log:

echo "hadoop1112" >> sogou.log 

Check HBase:

scan 'sogoulogs'

HBase + Kafka + Flume

Upload avro-file-selector-hbase-kafka.properties into the conf directories on hadoop02/03.

Setup done.

Test. On hadoop02/03:

bin/flume-ng agent -n agent1 -c conf -f conf/avro-file-selector-hbase-kafka.properties -Dflume.root.logger=INFO,console
On hadoop01:

bin/flume-ng agent -n agent1 -c conf -f conf/taildir-file-selector-avro.properties -Dflume.root.logger=INFO,console

Open the HBase shell (from the HBase directory):

bin/hbase shell

Start a Kafka console consumer (from the Kafka directory):

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sogoulogs

On hadoop01, append test data to data/flume/logs/sogou.log:

echo "hadoop1112" >> sogou.log 

Check HBase:

scan 'sogoulogs'

That completes the Flume setup.

HIVE

MySQL installation and configuration

On hadoop01, switch to root.

yum install -y wget

Remove the preinstalled MariaDB packages:

rpm -qa|grep mariadb|xargs rpm -e --nodeps

Download the MySQL repository package:

wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm

Install the repository with rpm:

rpm -ivh mysql-community-release-el7-5.noarch.rpm





Install MySQL with yum:

yum install mysql-server

When prompted during installation, answer y and press Enter.

Verify the installation:

rpm -qa | grep mysql

Output here means MySQL is installed.

Check that MariaDB has been replaced:

rpm -qa | grep mariadb

No output means MariaDB has been fully replaced.

systemctl stop mysqld.service     # stop MySQL
systemctl start mysqld.service    # start MySQL
systemctl restart mysqld.service  # restart MySQL
systemctl enable mysqld.service   # start MySQL at boot

Here we enable it at boot and start it.

Log in to MySQL:

mysql -u root -p

root has no password yet, so just press Enter at the prompt.

Set the root password (to root):

set password for root@localhost=password('root');

Exit and log back in.

Create the hive user:

create user 'hive' identified by 'hive';

Grant hive full privileges:

grant all on *.* to 'hive'@'hadoop01' identified by 'hive';

grant all on *.* to hive@'%' identified by "hive";

Flush privileges:

flush privileges;

Log in as the hive user (password: hive):

mysql -u hive -p

Allow remote login for root:

GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "root";

Log in to verify:

mysql -h hadoop01 -u hive -p

Install Hive

Upload the apache-hive package (chapter 5.3 of the book's materials) to the app directory on hadoop01.

Extract it, create a symlink named hive, and delete the archive.

Upload the provided hive-site.xml and hive-env.sh into hive/conf.

Upload the mysql-connector JAR into hive/lib.
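hive-site.xml ships with the book; at minimum it points the metastore at the MySQL instance configured above. A sketch (the values are assumptions based on the hive/hive account created earlier):

<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>hive</value></property>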

As root, edit /etc/my.cnf and add (under the [mysqld] section):

skip_ssl

Save and exit, then restart MySQL:

systemctl restart mysqld

Installation done.

Initialize the metastore schema from the hive directory:
bin/schematool -dbType mysql -initSchema

Start Hive from the hive directory:

bin/hive

and enter:

show databases;

hive> show databases;
OK
default
Time taken: 0.243 seconds, Fetched: 1 row(s)
hive> 

This means Hive is working.

Hive and HBase

Go into HBase's lib directory and copy the following JARs into Hive's lib:

cp hbase-client-1.2.0.jar /home/hadoop/app/hive/lib/

cp hbase-common-1.2.0.jar /home/hadoop/app/hive/lib/

cp hbase-server-1.2.0.jar /home/hadoop/app/hive/lib/

cp hbase-common-1.2.0-tests.jar /home/hadoop/app/hive/lib/

cp hbase-protocol-1.2.0.jar /home/hadoop/app/hive/lib/

cp htrace-core-3.1.0-incubating.jar /home/hadoop/app/hive/lib/

cp zookeeper-3.4.6.jar /home/hadoop/app/hive/lib/

Enter the Hive shell:

bin/hive

and run:

CREATE EXTERNAL TABLE sogoulogs(
  id string,
  datatime string,
  userid string,
  searchname string,
  retorder string,
  cliorder string,
  cliurl string
) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:datatime,info:userid,info:searchname,info:retorder,info:cliorder,info:cliurl") 
TBLPROPERTIES ("hbase.table.name" = "sogoulogs");


Query it:

select * from sogoulogs limit 5;

Integration complete.

spark

Upload the Spark archive to the app directory on hadoop01, extract it, and create a symlink named spark.

In the spark directory, create a test file and fill it with whatever words you like:

vi djt.log

Start the Spark shell:

bin/spark-shell

Spark word count:

​
val line = sc.textFile("/home/hadoop/app/spark/djt.log")

line.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)

The word counts for djt.log are printed.

Building a Spark WordCount with Maven in IDEA

Add the following dependencies to pom.xml:

     <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.9.1</version>
    </dependency>
​
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.9.1</version>
    </dependency>
​
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>2.9.1</version>
    </dependency>
​
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
​
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.9.1</version>
    </dependency>
​
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.9.1</version>
    </dependency>

Install the Scala plugin.

Create a scala directory under main and mark it as a sources root.

Inside it, create a Scala object file named MyScalaWordCout:
import org.apache.spark.{SparkConf,SparkContext}
object MyScalaWordCout{
  def main(args:Array[String]):Unit = {
​
    if(args.length < 2) {
      System.err.println("Usage:MyWordCount<input><output>")
      System.exit(1)
    }
    val input = args(0)
    val output = args(1)
    val conf = new SparkConf().setAppName("myWordCount").setMaster("local")
    val sc=new SparkContext(conf)
    val lines = sc.textFile(input)
    val resultRdd = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    resultRdd.saveAsTextFile(output)
    sc.stop()
​
  }
}
​

Add a resources directory under main and create a log4j.properties file in it:

log4j.rootLogger = debug,stdout
​
### log to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

At the same level as the project files, create a test file of your own, e.g. djt.txt.

Set the program arguments in the run configuration, for example:

F:\apache-flume-1.8.0-src\flume-ng-sinks\testspark\djt.txt F:\apache-flume-1.8.0-src\flume-ng-sinks\testspark\out

Run it; the output is written successfully.

Package with mvn clean package.

SPARK STANDALONE MODE

Upload the chapter 6.3 configuration files into spark/conf.
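The chapter 6.3 files include spark-env.sh and slaves. Since two Masters (hadoop01/hadoop02) and web UI port 8888 are used below, a sketch could look like this (the ZooKeeper-based recovery settings are assumptions):

# spark-env.sh (excerpt)
SPARK_MASTER_WEBUI_PORT=8888
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop01:2181,hadoop02:2181,hadoop03:2181 -Dspark.deploy.zookeeper.dir=/spark"

# slaves
hadoop01
hadoop02
hadoop03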

Back in the app directory, distribute Spark to hadoop02/03:

deploy.sh spark-2.3.1-bin-hadoop2.7 /home/hadoop/app slave

Create the symlink on hadoop02/03.

Edit conf/spark-env.sh and add at the bottom:

HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop
LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
​

Start the Spark cluster. On hadoop01:

sbin/start-all.sh

jps on hadoop01 now shows Worker and Master.

On hadoop02:

sbin/start-master.sh

hadoop02 now runs a (standby) Master as well.

Open in the browser:

hadoop01:8888
hadoop02:8888

Package with Maven and upload the JAR to /home/hadoop/shell/lib.

Make sure HDFS contains a /test directory with djt.txt in it.

Then, from the spark directory:
bin/spark-submit --master spark://hadoop01:7077,hadoop02:7077 --class com.hadoop.MyScalaWordCout /home/hadoop/shell/lib/testSpark-1.0-SNAPSHOT.jar /test/djt.txt /test/output1
​

spark on yarn

In the conf directory:

vi spark-env.sh

and add:

HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop

Then, from the spark directory:
bin/spark-submit --master yarn --class com.hadoop.MyScalaWordCout /home/hadoop/shell/lib/testSpark-1.0-SNAPSHOT.jar /test/djt.txt /test/output1



spark streaming

Log in to MySQL as root and run:

GRANT CREATE ON *.* TO 'hive'@'%';

Exit, then log back in as the hive user:

create database test;
use test;

create table newscount ( name varchar(50) not null, count int(11) not null );

create table periodcount ( logtime varchar(50) not null, count int(11) not null );

Extract the chapter 6.4 code (learningspark).

Then, on hadoop01 under /home/hadoop:

mkdir shell
cd shell
mkdir lib
mkdir data

mkdir bin

Package learningspark with Maven, take the resulting JAR, and upload it to /home/hadoop/shell/lib on hadoop01.

Write a sogoulogs.log file of your own with some data and upload it to /home/hadoop/shell/data.

Edit the chapter 6.4 script so that it reads:

#!/bin/sh
home=$(cd `dirname $0`; cd ..; pwd)
. ${home}/bin/common.sh
echo "start analog data ****************"
java -cp ${lib_home}/learningspark.jar com.hadoop.java.AnalogData  ${data_home}/sogoulogs.log /home/hadoop/data/flume/logs/sogou.log

Upload it to /home/hadoop/shell/bin, then go into /home/hadoop/shell/bin:

vi common.sh
#!/bin/sh
home=$(cd `dirname $0`;cd ..; pwd)
bin_home=$home/bin
conf_home=$home/conf
logs_home=$home/logs
data_home=$home/data
lib_home=$home/lib
flume_home=/home/hadoop/app/flume
kafka_home=/home/hadoop/app/kafka                                
mv sogoulogs.sh sogoulogs1.sh
cat sogoulogs1.sh > sogoulogs.sh
rm -f sogoulogs1.sh
chmod u+x sogoulogs.sh

Open learningspark in IDEA and run the streaming job.

Start the Flume agents on hadoop02, hadoop03, and hadoop01, then run the sogoulogs.sh script on hadoop01; the results show up in MySQL.

 ./sogoulogs.sh

Spark with Hive, MySQL and HBase

With Hive:

On hadoop01, edit hive/conf/hive-site.xml and add:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

Save it, then copy it into Spark's conf directory:

cp hive-site.xml /home/hadoop/app/spark/conf

On hadoop01, go into hive/lib and copy the MySQL driver into Spark's jars:

cp mysql-connector-java-5.1.38.jar /home/hadoop/app/spark/jars

With MySQL running, on hadoop01 create a data file:

cd /home/hadoop/shell/data
vim course.txt

001 hadoop
002 storm
003 spark
004 flink

Start the Hive metastore service on hadoop01:

bin/hive --service metastore

Then, still on hadoop01, open the Hive shell:

bin/hive

create database djt;

use djt;

create table if not exists course(
  cid string,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS textfile;

load data local inpath "/home/hadoop/shell/data/course.txt" into table course;

select * from course;

The result is as follows (name is NULL because course.txt is space-separated while the table expects tab-delimited fields):

hive> select * from course;
OK
001 hadoop	NULL
002 storm	NULL
003 spark	NULL
004 flink	NULL
Time taken: 1.703 seconds, Fetched: 4 row(s)

On hadoop01, open the Spark shell:

bin/spark-shell

spark.sql("select * from djt.course").show

The result:

+----------+----+                                                               
|       cid|name|
+----------+----+
|001 hadoop|null|
| 002 storm|null|
| 003 spark|null|
| 004 flink|null|
+----------+----+


On hadoop01 you can also run spark-sql:

bin/spark-sql
use djt;
select * from course;

The result:

001 hadoop	NULL
002 storm	NULL
003 spark	NULL
004 flink	NULL
Time taken: 2.236 seconds, Fetched 4 row(s)

With MySQL:

Insert a row in MySQL first:

use test;

insert into newscount values ('yhzyh',123);

On hadoop01, open the Spark shell:

 bin/spark-shell

:paste

val df = spark
.read 
.format("jdbc") 
.option("url","jdbc:mysql://192.168.88.111:3306/test")
.option("dbtable","newscount")
.option("user","hive")
.option("password","hive") 
.load()

The result:
scala> df.show
+-----+-----+
| name|count|
+-----+-----+
|yhzyh|  123|
+-----+-----+



Spark with HBase

Go into HBase's lib directory and copy the following JARs into Spark's jars directory (hive-hbase-handler and the MySQL connector come from Hive's lib):

cp hbase-client-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-common-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-protocol-1.2.0.jar /home/hadoop/app/spark/jars/
cp hbase-server-1.2.0.jar /home/hadoop/app/spark/jars/
cp htrace-core-3.1.0-incubating.jar /home/hadoop/app/spark/jars/
cp metrics-core-2.2.0.jar /home/hadoop/app/spark/jars/
cp hive-hbase-handler-2.3.7.jar /home/hadoop/app/spark/jars/
cp mysql-connector-java-5.1.38.jar /home/hadoop/app/spark/jars/

Open the Spark shell and query the sogoulogs table that was mapped from HBase:

bin/spark-shell
spark.sql("select * from sogoulogs").show

The result (empty for now, since nothing has been loaded into sogoulogs here):

+---+--------+------+----------+--------+--------+------+
| id|datatime|userid|searchname|retorder|cliorder|cliurl|
+---+--------+------+----------+--------+--------+------+
+---+--------+------+----------+--------+--------+------+

Spark offline analysis

Extract and open the chapter 6.5 code, adjust the program arguments, and run it.

 

Spark Structured Streaming real-time analysis

Extract and open the chapter 6.6 code, then run the data-feeding script from shell/bin:

./sogoulogs.sh

That is all that is needed.

Web UIs in the browser:

192.168.88.111:8081

192.168.88.112:8081

192.168.88.113:8081

flink

Upload, extract, and create the symlink.

In the flink directory, create a test file with some content of your own:

vim djt.log

Edit conf/flink-conf.yaml and add:

rest.port: 8083
​

From the flink directory, start the local Scala shell:

bin/start-scala-shell.sh local

It starts successfully.

Test:

val lines = benv.readTextFile("/home/hadoop/app/flink/djt.log");
​
val wordcounts = lines.flatMap(_.split("\\s+")).map(word => (word,1)).groupBy(0).sum(1);
wordcounts.print()

My result:

scala> wordcounts.print()
(flink,3)
(hadoop,3)
(spark,3)
​

Flink cluster

flink Standalone

Upload the chapter 7.2 configuration files into conf, changing port 8081 to 8083 in the masters file.

Edit conf/flink-conf.yaml and add:

rest.port: 8083
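The masters and slaves files mentioned above would then contain something like the following (a sketch; port 8083 matches the rest.port change):

# conf/masters
hadoop01:8083
hadoop02:8083

# conf/slaves
hadoop01
hadoop02
hadoop03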
​

Upload the chapter 7.2 JARs into lib.

Distribute Flink to the slave nodes. On hadoop01, from the app directory:
deploy.sh flink-1.9.1 /home/hadoop/app slave
​
​

Create the symlink on the slave nodes:

ln -s flink-1.9.1 flink

In flink/conf on hadoop02, edit flink-conf.yaml and set:

jobmanager.rpc.address: hadoop02

Start the cluster. On hadoop01:
bin/start-cluster.sh
​
​
​
​
​

Open in the browser:

hadoop01:8083
​
hadoop02:8083

Test:

bin/flink run -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output2

Either output data, or the message "Could not build the program from JAR file.", indicates success.

flink on yarn

vi ~/.bashrc

and add:

export HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop

source ~/.bashrc

Run a YARN session:

bin/yarn-session.sh -n 2 -s 2 -jm 1024 -nm test_flink_cluster

jps

FlinkYarnSessionCli
NameNode
FlinkYarnSessionCli
QuorumPeerMain

On hadoop01, from the Hadoop directory, find the application id:

bin/yarn application -list | grep test_flink_cluster | awk '{print $1}'

My result:

23/10/19 22:13:21 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2




application_1697675641964_0002


From the flink directory:

bin/flink run -yid application_1697675641964_0002 -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output4

From the Hadoop directory, view the result:

bin/hdfs dfs -cat /test/output4/*

The second mode (per-job on YARN). From the flink directory:

bin/flink run -m yarn-cluster -p 2 -yn 2 -ys 2 -yjm 1024 -ytm 1024 -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs://mycluster/test/djt.txt --output hdfs://mycluster/test/output4

From the Hadoop directory:

bin/hdfs dfs -cat /test/output4/*

FLINK DATASTREAM

Extract and open the chapter 7.3 code, then run the sogoulogs.sh script under shell/bin. Done.

FLINK DATASET

Extract the chapter 7.4 code and run it.
