Big Data Hadoop (Part 3)

晓枫桥亭

Category column: Big Data Analysis. Tags: big data, hadoop, BigData, MapReduce

Article link: https://blog.csdn.net/heaterhip/article/details/97305322


MapReduce development environment (Maven)

pom.xml

<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.5.2</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.5.2</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.5.2</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.5.2</version>
</dependency>
<dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>2.5.2</version>
</dependency>

Code structure (skeleton)

/*1. Map*/
public class MyMaper extends Mapper<LongWritable, Text, Text, IntWritable>{

        @Override
        protected void map(LongWritable k1, Text v1, Context context) throws IOException, InterruptedException {
              // TODO: parse v1 and emit (k2, v2) pairs with context.write(...)
        }
    }

/*2. Reduce*/
public class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text k2, Iterable<IntWritable> v2s, Context context) throws IOException, InterruptedException {
            //todo
        }
    }

/*3. Job*/
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MyFirstJob");
        // The job is packaged and run as a jar
        job.setJarByClass(MyMapReduce.class);

        // InputFormat
        Path path = new Path("/src/data");
        TextInputFormat.addInputPath(job,path);

        // Map
        job.setMapperClass(MyMaper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Shuffle: the default behavior is used, nothing to configure

        // Reduce
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The output directory must not exist when the job starts (MR creates it),
        // so any previous output is deleted first
        Path out = new Path("/dest2");
        FileSystem fileSystem = FileSystem.get(conf);
        fileSystem.delete(out,true);
        TextOutputFormat.setOutputPath(job,out);
        // Run the job and wait for it to finish
        job.waitForCompletion(true);
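
To make the skeleton above more concrete, here is a minimal sketch of the map -> shuffle -> reduce flow for a word count, written in plain Java so it runs without a cluster. It is not Hadoop code: the class name WordCountSketch and its methods are hypothetical, and the "shuffle" is only an in-memory grouping that imitates what the framework does between Map and Reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the MapReduce data flow (no Hadoop classes involved).
class WordCountSketch {

    // "map": one input line (v1) becomes a list of (word, 1) pairs (k2, v2).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "shuffle": group all values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "reduce": sum the values that belong to one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello hadoop", "hello yarn")));
    }
}
```

In a real job the bodies of map and reduce would go into the Mapper and Reducer classes of the skeleton, and results would be emitted with context.write(...) instead of being returned.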

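Running MR

For distributed execution the job must run on a YARN cluster

With the Maven build tool, running MR on the YARN cluster can be simplified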

<!-- Build the jar (with its main class), upload it to the YARN node and run it remotely: bin/yarn jar xxx.jar -->
 <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-jar-plugin</artifactId>
         <version>2.3.2</version>
         <configuration>
           <outputDirectory>${basedir}</outputDirectory>
           <archive>
             <manifest>
               <mainClass>${baizhi-mainClass}</mainClass>
             </manifest>
           </archive>
         </configuration>
  </plugin>

 <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ssh</artifactId>
        <version>2.8</version>
      </extension>
 </extensions>

 <plugin>
         <groupId>org.codehaus.mojo</groupId>
         <artifactId>wagon-maven-plugin</artifactId>
         <version>1.0</version>
         <configuration>
           <fromFile>${project.build.finalName}.jar</fromFile>
           <url>scp://root:123456@${target-host}${target-position}</url>
           <commands>
             <command>pkill -f ${project.build.finalName}.jar</command>
             <command>nohup /opt/install/hadoop-2.5.2/bin/yarn jar ${target-position}/${project.build.finalName}.jar > /root/nohup.out 2>&amp;1 &amp;</command>
           </commands>
           <!-- Show the output of the executed commands -->
           <displayCommandOutputs>true</displayCommandOutputs>
         </configuration>
 </plugin>

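Run the Maven goals in this order: jar:jar wagon:upload wagon:sshexec (build the jar, upload it, then execute the remote commands). Note: in wagon-maven-plugin the fromFile parameter belongs to the upload-single goal, so wagon:upload-single may be the goal that matches the configuration above.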

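Custom Maven archetype

Programmers can define a Maven Archetype (project skeleton) for their own needs. Choosing the custom archetype later generates the required pom, the other configuration files and the code skeleton automatically, which simplifies development and testing

Create a template module

1. Add the coordinates of the required jars
2. Create the Java code

In the root of this project, run: mvn --settings F:\apache-maven-3.3.9\conf\settings.xml archetype:create-from-project

2. Copy the coordinates of the archetype (needed for the following installation)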

  ~~~markdown
  <groupId>com.baizhiedu</groupId>
  <artifactId>hadoop-test-archetype</artifactId>
  <version>1.0-SNAPSHOT</version>
  ~~~

Install the archetype

cd target\generated-sources\archetype
mvn clean install

Create a project from the archetype

Specify the archetype coordinates copied in step 2.


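Debugging MapReduce programs

It is recommended to debug MR code through Log4j logging

Logger logger = Logger.getLogger(xxx.class);
logger.info("message");

With the setup above, the output shows only job-level information; the log statements inside Map and Reduce are not visible.
To see them, enable the YARN job history server and log aggregation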

How to enable the history server and log aggregation in a YARN cluster

1. Configuration files
   mapred-site.xml: job history service
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop12:10020</value>
    </property>
     <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoop12:19888</value>
     </property>
   yarn-site.xml: log aggregation
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
     <!-- Retention time in seconds (604800 = 7 days) -->
     <property> 
         <name>yarn.log-aggregation.retain-seconds</name>
         <value>604800</value>
     </property>
2. Start (or stop) the history server process
 sbin/mr-jobhistory-daemon.sh start historyserver
 sbin/mr-jobhistory-daemon.sh stop historyserver

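In practice
Solving it with a shell script

Disable log aggregation (the container logs then stay on the local disk of each node)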

etc/hadoop/yarn-env.sh

export YARN_LOG_DIR=~/logs/yarn
export YARN_PID_DIR=~/data/yarn

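Create the script

#!/bin/bash
# Usage: ./scanMRLog.sh <applicationId> [logtype]   (logtype defaults to "out")
if [ $# -le 0 ]
then
    echo "Missing argument: application id"
    exit 1
fi

logtype=out

# Use the second argument as the log type only when it is actually given
if [ $# -ge 2 ]
then
    logtype=${2}
fi

for n in `cat /opt/install/hadoop-2.5.2/etc/hadoop/slaves`
do
    echo "=========== Node $n ============"
    ssh $n "cat ~/logs/yarn/userlogs/${1}/container_*/*${logtype}|grep hadoop"
done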

Run the script

1. Make the script executable (chmod +x scanMRLog.sh)
2. ./scanMRLog.sh application_1558968514803_0001

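Custom data types in MapReduce

In MapReduce, Map and Reduce exchange data across JVMs and across servers, so the data types used in MapReduce must be serializable

Writable
Comparable

WritableComparable (supports both sorting and serialization)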

IntWritable
LongWritable
FloatWritable
Text

NullWritable

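# How do you define a custom Hadoop data type?
A programmer-defined type implements Writable and Comparable (compareTo returns an int: zero, positive or negative),
or implements WritableComparable directly
write(DataOutput out)
readFields(DataInput in)
compareTo()

equals
hashCode
toString


Note: the string returned by toString of the custom type determines the format written to the output file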
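
The list above can be sketched as follows. This is only an illustration under stated assumptions: ScoreKey, its fields and the roundTrip helper are hypothetical, and the class implements plain Comparable so that it compiles without Hadoop on the classpath. In a real job the declaration would be "implements WritableComparable<ScoreKey>" from org.apache.hadoop.io, with the same write, readFields and compareTo methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical custom key type holding a name and a score.
class ScoreKey implements Comparable<ScoreKey> {
    private String name = "";
    private int score;

    // Keep a no-arg constructor: the framework creates instances reflectively.
    public ScoreKey() { }

    public ScoreKey(String name, int score) {
        this.name = name;
        this.score = score;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(score);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        score = in.readInt();
    }

    // Sort by name ascending, then by score descending.
    @Override
    public int compareTo(ScoreKey o) {
        int c = name.compareTo(o.name);
        return c != 0 ? c : Integer.compare(o.score, score);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ScoreKey)) {
            return false;
        }
        ScoreKey k = (ScoreKey) o;
        return score == k.score && name.equals(k.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, score);
    }

    // This text is what a text output format writes for the key.
    @Override
    public String toString() {
        return name + "\t" + score;
    }

    // Helper for this sketch only: serialize a key and read it back.
    static ScoreKey roundTrip(ScoreKey in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));
            ScoreKey copy = new ScoreKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new ScoreKey("tom", 90)));
    }
}
```

The important point is that readFields reads the fields in the same order in which write wrote them; equals and hashCode should agree with each other because the default partitioner relies on hashCode.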

Details about Map and Reduce in a MapReduce job

A MapReduce job can run without a Reduce phase

If the job only cleans the data and does no aggregation or deduplication, no Reduce is needed

job.setNumReduceTasks(0);
//        job.setReducerClass(MyReducer.class);
//        job.setOutputKeyClass(Text.class);
//        job.setOutputValueClass(NullWritable.class);

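How many Map tasks does a MapReduce job have?

1. For text file input, the number of Map tasks equals the number of input splits, which by default follows the HDFS blocks of the input files.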
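
As a rough illustration of how the block size drives the number of Map tasks, the sketch below counts the splits of a single splittable file. It follows the rule used by FileInputFormat, where the last split may be up to 10% larger than the split size; the class name SplitCountSketch is hypothetical, and the 128 MB split size in main is an assumption about the cluster's block size, not something stated in this article.

```java
// Counts the input splits of one non-empty, splittable file.
class SplitCountSketch {

    // The remainder is merged into the last split when it is at most 10% over the split size.
    static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileLength, long splitSize) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        int splits = 0;
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split of data remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left forms the final split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = 128 * mb; // assumed block size
        System.out.println(countSplits(300 * mb, splitSize));
        System.out.println(countSplits(130 * mb, splitSize));
    }
}
```

With these assumptions a 300 MB file yields three Map tasks, while a 130 MB file yields only one because the extra 2 MB stays within the 10% tolerance.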

How many Reduce tasks does a MapReduce job have?

1. The number of Reduce tasks is configurable
   By default there is 1 Reduce task
           mapreduce.job.reduces 1 
           
           job.setNumReduceTasks(?);

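1. Why configure more than one Reduce?
   To improve the efficiency of the MR job
2. Several Reduce tasks produce several output files, which can be fed into another Map step for a final summary
3. Partition
   The partition decides which Reduce receives each Map output record. The default implementation is HashPartitioner
   (k2.hashCode() & Integer.MAX_VALUE) % numReduceTasks, for example 0 or 1 with 2 Reduce tasks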
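
The partition formula can be tried out in isolation. The following sketch reproduces the arithmetic of the default hash partitioning in plain Java; the class name HashPartitionSketch is hypothetical. Masking with Integer.MAX_VALUE clears the sign bit, so keys with a negative hashCode still get a valid partition number.

```java
// Reproduces the arithmetic of hash partitioning without any Hadoop classes.
class HashPartitionSketch {

    // Returns a partition number in the range [0, numReduceTasks).
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        System.out.println(partition("hadoop", 2));
        System.out.println(partition("hadoop", 2));
        // A negative hashCode still maps to a valid partition.
        System.out.println(partition(-7, 2));
    }
}
```

Without the mask, -7 % 2 would evaluate to -1 in Java, which is not a valid partition number.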

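Custom partitioning algorithm

/* Keys 1, 2 and 3 go to one Reduce; keys greater than 3 go to the other Reduce */
1. Custom Partitioner
public class MyPartitioner<k2,v2> extends Partitioner<k2,v2> {
    @Override
    public int getPartition(k2 key, v2 value, int numPartitions) {

        String k = key.toString();

        try{
            int key_i = Integer.parseInt(k);
            if(key_i<=3){
                return 0;
            }else{
                return 1;
            }
        }catch(NumberFormatException e){
            // A partition number must be in [0, numPartitions); returning -1 would fail the task,
            // so non-numeric keys fall back to partition 0
            return 0;
        }
    }


}
2. Job configuration (this partitioner needs 2 Reduce tasks)
 job.setNumReduceTasks(2);
 job.setPartitionerClass(MyPartitioner.class);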

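Counters in MapReduce

Define your own counters; the increment calls are written inside Map or Reduce
   // Option 1: counter identified by a group name and a counter name
   context.getCounter("group-name","counter-name").increment(1);
   // Option 2: counter identified by an enum constant
   public enum MyCounter {
    MY_COUNTER
   }
   context.getCounter(MyCounter.MY_COUNTER).increment(1);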

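Combiner

A Combiner is a Reduce that runs on the Map side: it merges the Map output locally before it is transferred, which reduces the amount of data sent over the network and improves efficiency.
The Combiner is disabled by default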
job.setCombinerClass(MyReducer.class);
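
The effect of a Combiner can be shown with a small plain-Java sketch. It assumes a word count, where the reduce operation is a sum and can therefore safely be applied to partial results; the class CombinerSketch and its methods are hypothetical and only imitate what happens on the Map side and the Reduce side.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Imitates local pre-aggregation on the Map side followed by the final Reduce.
class CombinerSketch {

    // Combine the (word, 1) records of ONE map task into (word, partialSum) records.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce side: merge the partial sums that arrive from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, value) -> total.merge(key, value, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> task1 = combine(Arrays.asList("a", "a", "b"));
        Map<String, Integer> task2 = combine(Arrays.asList("a", "c"));
        // Five map output records shrink to four records before the transfer.
        System.out.println(task1.size() + task2.size());
        System.out.println(reduce(Arrays.asList(task1, task2)));
    }
}
```

Reusing the Reducer class as the Combiner only works when the operation gives the same result on partial data, as a sum or a maximum does; an average, for example, cannot be combined this way.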

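How a Job works

InputFormat

Reads the data from HDFS and wraps it into key-value pairs (k1, v1)

// Reads the file data from HDFS (DataNodes)

// Wraps the data of each block into a split, so that blocks and splits correspond one to one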

public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;
// Wraps the data into K,V pairs through a RecordReader
public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;
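
To illustrate what the RecordReader produces for text input, the sketch below turns a piece of text into (k1, v1) records where k1 is the byte offset of the line start and v1 is the line content, which is how TextInputFormat presents its data. It is a plain-Java imitation under simplifying assumptions: the class LineRecordSketch is hypothetical, the input is an in-memory string, lines end with '\n' and the encoding is UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Imitates how text input is cut into (byte offset, line) records.
class LineRecordSketch {

    // Returns the records in file order: key = offset of the line start, value = line text.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        long offset = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            if (end < 0) {
                end = content.length(); // last line without a trailing newline
            }
            String line = content.substring(start, end);
            out.put(offset, line);
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("hello\nhadoop yarn\n"));
    }
}
```

This is why the Mapper in the skeleton is declared with LongWritable as k1 and Text as v1.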

2. Shuffle

3. Shuffle in code

String inpath = "/hw/data1";
String outpath = "/dest";

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, MapReduceJobSubmiter.class.getName());
// The job is packaged and run as a jar
job.setJarByClass(MapReduceJobSubmiter.class);

// InputFormat
Path path = new Path(inpath);
TextInputFormat.addInputPath(job,path);

//Map
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);

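// Shuffle phase settings
job.setPartitionerClass(MyPartitioner.class);
// Sorting uses compareTo() of the custom Map output key type
job.setCombinerClass(MyReducer.class);
// Grouping comparator (MyGroupingComparator is a placeholder name for your own comparator class)
job.setGroupingComparatorClass(MyGroupingComparator.class);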


//reduce
job.setNumReduceTasks(2);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);

// The output directory must not exist when the job starts (MR creates it), so delete any previous output
Path out = new Path(outpath);
FileSystem fileSystem = FileSystem.get(conf);
fileSystem.delete(out,true);
TextOutputFormat.setOutputPath(job,out);
// Run the job and wait for it to finish
job.waitForCompletion(true);

MapReduce (secondary sort)

hadoop-mr-secondsort
1. Custom key type for the Map output
2. Custom grouping comparator (grouping)
3. Custom partitioner

Review of HDFS and MapReduce

HDFS

1. When is MR needed?
   MR is used when processing data stored in HDFS
2. Main uses of MR
   Aggregation, deduplication, sorting, data cleaning


1. Note: MapReduce sorts only by key
2. Development practice (wordcount, top N, secondary sort)
