hadoop
Final course project
Author: hahally  Start: 2019.12.1  End: 2020.1.1  Abstract: statistics over TV set-top-box viewing data (number of viewers, total viewing time, average viewing time per viewer, Top 10)
1. Data analysis
Data sample
<GHApp>
  <WIC cardNum="174041665" stbNum="01050908200014994" date="2012-09-16" pageWidgetVersion="1.0">
    <A e="23:56:45" s="23:51:45" n="133" t="2" pi="488" p="24%E5%B0%8F%E6%97%B6" sn="CCTV-13 新闻" />
    <I s="23:58:58"><URI><![CDATA[ui://standby.htm]]></URI></I>
  </WIC>
</GHApp>
1. Each `WIC` element represents one set-top-box user, identified by a unique `cardNum`
2. Each `A` tag represents one channel/viewing record
3. Each `WIC` tag contains multiple `A` tags
4. The `p` attribute value is URL-encoded and has to be decoded (see the small example after this list)
5. Some `sn` and `p` attribute values are empty (null or whitespace-only strings)
6. The meaning of `n`, `t` and `pi` is unknown, but when their values are less than 0, `p` or `sn` is empty
7. The `I` tag contains a URI resource and is not processed here
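For example, the `p` value of the sample record decodes like this; the same java.net.URLDecoder call is used later by the cleaning code (the class name DecodeDemo is only for illustration):

import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // "24%E5%B0%8F%E6%97%B6" is the p value from the sample record above
        String p = URLDecoder.decode("24%E5%B0%8F%E6%97%B6", "UTF-8");
        System.out.println(p); // prints: 24小时
    }
}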
Attribute description
cardNum ------- set-top-box card number
stbNum ------- user ID
date ------- date
e ------- end time
s ------- start time
n ------- (unknown)
t ------- (unknown)
pi ------- (unknown)
p ------- program content
sn ------- channel
2. Processing approach
Step 1: Merge the files
There are 7 days of data, one folder per day:
/73
  /2012-09-17
  /2012-09-18
  /2012-09-19
  /2012-09-20
  /2012-09-21
  /2012-09-22
  /2012-09-23
Each folder contains a large number of .txt files, for example:
···
ars10767@20120917000000.txt
···
So the first step is to merge all the .txt files under each folder into a single .txt file per day, giving:
2012-09-17.txt
2012-09-18.txt
2012-09-19.txt
2012-09-20.txt
2012-09-21.txt
2012-09-22.txt
2012-09-23.txt
Code
Notes:
Use FileUtil for the file handling.
See the official documentation for org.apache.hadoop.fs.FileUtil.
API docs: http://hadoop.apache.org/docs/r2.7.5/api/index.html
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class AllFilesToFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String srcPath = "E:/hadoop/73/";
        String dstPath = "E:/hadoop/wordcount/input/";         // output directory
        String[] pathlist = FileUtil.list(new File(srcPath));  // list the per-day folders
        for (int i = 0; i < pathlist.length; i++) {
            System.out.println(pathlist[i]);
            Path srcDir = new Path(srcPath + pathlist[i]);
            Path dstFile = new Path(dstPath + pathlist[i] + ".txt");
            FileSystem srcFS = srcDir.getFileSystem(conf);
            FileSystem dstFS = dstFile.getFileSystem(conf);
            boolean deleteSource = false;
            String addString = "";                              // separator written between merged files
            boolean s = FileUtil.copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString);
            System.out.println(s);
        }
    }
}
Note: FileUtil.copyMerge() no longer exists in Hadoop 3.2. Fix: look for a replacement (a rough sketch of one possibility follows), or use a Hadoop 2.x jar. After the sketch comes my own rather naive version; it is not recommended because it is slow.
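A rough sketch of what such a replacement could look like on Hadoop 3.x, merging every file under a source directory into one destination file with plain FileSystem streams and IOUtils.copyBytes (untested here; the class and method names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyMergeDemo {
    // merge every file directly under srcDir into the single file dstFile
    public static void copyMerge(Path srcDir, Path dstFile, Configuration conf) throws IOException {
        FileSystem srcFS = srcDir.getFileSystem(conf);
        FileSystem dstFS = dstFile.getFileSystem(conf);
        try (FSDataOutputStream out = dstFS.create(dstFile)) {
            for (FileStatus status : srcFS.listStatus(srcDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = srcFS.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, conf, false); // false: keep the output stream open
                    }
                }
            }
        }
    }
}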
Notes (on my own version below):
1. File reading
2. File writing
3. Multi-threading
4. Merge speed is only about 200 KB/s by rough observation (see the note after the save() method below)
File reading: readFile
public void readFile(String path) throws IOException {
    File file = new File(path);
    File[] fs = file.listFiles();
    for (File f : fs) {
        FileInputStream fis = new FileInputStream(f.getPath());
        InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
        BufferedReader br = new BufferedReader(isr);
        String line = "";
        while ((line = br.readLine()) != null) {
            save(line, path);   // call the file-writing method for every line
        }
        br.close();
        isr.close();
        fis.close();
    }
}
File writing: save
public void save(String content, String path) throws IOException {
    // note: the output file is reopened and closed for every single line
    BufferedWriter file = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path + ".txt", true), "UTF-8"));
    file.write(content + "\r\n");
    file.close();
}
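Most of the slowness comes from save() reopening and closing the output file for every single line. A sketch of a faster variant that opens one BufferedWriter per output file and reuses it for all lines (the method name readAndMerge is illustrative):

// faster variant of readFile(): one writer per output file, reused for every line
public void readAndMerge(String path) throws IOException {
    File[] fs = new File(path).listFiles();
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(path + ".txt", true), "UTF-8"))) {
        for (File f : fs) {
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}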
Creating the threads with the Runnable interface
public class ThreadJob implements Runnable{
public String path=null;
// constructor
public ThreadJob(String path){
this.path = path;
}
···
public void readFile(String path) throws IOException{
···
}
···
public void save(String content,String path) throws IOException{
···
}
···
@Override
public void run() {
try {
readFile(path);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Main class
public class MergeFile {
public static void main(String[] args) throws IOException {
String path = "E:/hadoop/73/";
List<String> pathList=getParentPath(path);
// create one thread per folder
Thread th1 = new Thread(new ThreadJob(pathList.get(0)));
Thread th2 = new Thread(new ThreadJob(pathList.get(1)));
Thread th3 = new Thread(new ThreadJob(pathList.get(2)));
Thread th4 = new Thread(new ThreadJob(pathList.get(3)));
Thread th5 = new Thread(new ThreadJob(pathList.get(4)));
Thread th6 = new Thread(new ThreadJob(pathList.get(5)));
Thread th7 = new Thread(new ThreadJob(pathList.get(6)));
th1.start();
th2.start();
th3.start();
th4.start();
th5.start();
th6.start();
th7.start();
}
public static List<String> getParentPath(String path) throws IOException {
File file = new File(path); // File object for the root directory
File[] fs = file.listFiles();
List<String> parentPath = new ArrayList<String>(); // the per-day sub-directories
for(File f:fs){ // iterate over the listed entries
if(f.isDirectory()){
parentPath.add(f.getPath());
}
}
return parentPath;
}
}
Step 2: Write MapReduce programs to clean the data
Three MapReduce programs were written: the output of the first is the input of the second, the output of the second is the input of the third, and the final results are imported into HBase (a sketch of chaining the first two jobs in one driver follows). Note: delete intermediate result files that are no longer needed.
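A sketch of how the first two jobs could be chained from a single driver instead of being run by hand (the per-job mapper/reducer setup is elided here because it is shown in full below; the class name PipelineDriver and the paths are illustrative):

package com.mapreducejob;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // first job: clean the merged logs (mapper/reducer/partitioner as in MapReduceDemo below)
        Job cleanJob = Job.getInstance(conf, "clean");
        cleanJob.setJarByClass(PipelineDriver.class);
        // ... same setup as in MapReduceDemo ...
        FileInputFormat.addInputPath(cleanJob, new Path("E:/hadoop/wordcount/input/"));
        FileOutputFormat.setOutputPath(cleanJob, new Path("E:/hadoop/wordcount/output"));
        if (!cleanJob.waitForCompletion(true)) {
            System.exit(1); // stop the pipeline if the cleaning job fails
        }

        // second job: per-date Top 10, reading the first job's output directly
        Job topnJob = Job.getInstance(conf, "top10");
        topnJob.setJarByClass(PipelineDriver.class);
        // ... same setup as in TopnJob ...
        FileInputFormat.addInputPath(topnJob, new Path("E:/hadoop/wordcount/output"));
        FileOutputFormat.setOutputPath(topnJob, new Path("E:/hadoop/wordcount/output1"));
        System.exit(topnJob.waitForCompletion(true) ? 0 : 1);
    }
}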
First MapReduce program
Notes: the relevant fields (program name p, date, and user ID stbNum) are extracted from the data with regular expressions, and the total viewing time, the average viewing time per viewer, and the number of viewers are computed per program and date. The output looks like this (the key and value in each line are separated by TextOutputFormat's default tab):
···
1039交通服务热线@2012-09-17 7@6034@862
···
Main class: MapReduceDemo.java
package com.mapreducejob;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MapReduceDemo {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String inputPath="E:/hadoop/wordcount/input/";
String outputPath="E:/hadoop/wordcount/output";
DealContext dc = new DealContext();
dc.deleteFile(outputPath); // delete the output directory if it already exists
args = new String[]{
inputPath,
outputPath
};
Configuration conf = new Configuration(); // load the configuration
Job job = Job.getInstance(conf); // create the job instance
job.setJarByClass(MapReduceDemo.class); // the jar to run
job.setOutputKeyClass(Text.class); // output key type
job.setOutputValueClass(Text.class); // output value type
job.setMapperClass(MapperDemo.class); // Mapper class
//job.setCombinerClass(ReducerDemo.class);
job.setReducerClass(ReducerDemo.class); // Reducer class
job.setPartitionerClass(MyPartitioner.class); // custom partitioner
job.setNumReduceTasks(10); // number of reduce tasks, matching the partitioner's numPartitions
FileInputFormat.addInputPath(job, new Path(args[0])); // add the input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // set the output path
job.waitForCompletion(true);
}
}
Mapper class: MapperDemo.java
package com.mapreducejob;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperDemo extends Mapper<LongWritable,Text,Text,Text> {
protected void map(LongWritable key, Text value,Context context) throws IOException, InterruptedException{
DealContext dc = new DealContext();
String line = value.toString();
// use DealContext to extract the fields with regular expressions
List<String> A = dc.getPattern(line, "<A(.*?)/>");
List<String> stbnum = dc.getPattern(line, "stbNum=\"(.*?)\"");
List<String> date = dc.getPattern(line, "date=\"(.*?)\"");
if(A.size()>0){
List<String> sn = dc.getPattern(line, "sn=\"(.*?)\"");
List<String> p = dc.getPattern(line, "p=\"(.*?)\"");
List<String> e = dc.getPattern(line, " e=\"(.*?)\"");
List<String> s = dc.getPattern(line, "s=\"(.*?)\"");
for (int i=0;i<A.size();i++){
try {
if(!sn.get(i).trim().equals("")||!p.get(i).trim().equals("")){
Long starttime = dc.getSecond(s.get(i).split(":")[0], s.get(i).split(":")[1], s.get(i).split(":")[2]) ;
Long endtime = dc.getSecond(e.get(i).split(":")[0], e.get(i).split(":")[1], e.get(i).split(":")[2]) ;
Long time = endtime - starttime;
if(time<0){
time = time+24*3600;
}
// only keep records longer than 1 second
if(time>1){
//System.out.print(p.get(i).trim()+"#"+date.get(0)+sn.get(i)+"\n");
context.write(new Text(p.get(i).trim()+"@"+date.get(0)),new Text(stbnum.get(0)+"@"+sn.get(i).trim()+"@"+time));
}
}
} catch (Exception e2) {
System.out.println(e2.getMessage());
}
}
}
}
}
Reducer class: ReducerDemo.java
package com.mapreducejob;
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReducerDemo extends Reducer<Text,Text,Text,Text> {
protected void reduce(Text key, Iterable<Text> values,Context context) throws IOException,InterruptedException{
Long time=(long) 0;
HashSet<String> stbset = new HashSet<String>(); // HashSet: the same set-top-box ID is only counted once
for(Text val : values){
String[] str = val.toString().split("@");
String stb = str[0];
time += Long.parseLong(str[2]);
stbset.add(stb);
}
//"人数:"+stbset.size()+" 时长:"+time+" 人均收视时长:"+time/stbset.size())
context.write(key,new Text(stbset.size()+"@"+time+"@"+time/stbset.size()));
}
}
Custom partitioner class: MyPartitioner.java
package com.mapreducejob;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyPartitioner extends Partitioner<Text, Text>{
/**
 * Partition by date: days 16..24 of the month map to partitions 0..8,
 * everything else goes to partition 9.
 */
@Override
public int getPartition(Text key, Text value, int numPartitions) {
    // the key looks like "program@2012-09-17"; take the day of month
    int day = Integer.parseInt(key.toString().split("@")[1].split("-")[2]);
    if (day >= 16 && day <= 24) {
        return day - 16; // days 16..24 -> partitions 0..8
    }
    return 9; // everything else -> partition 9
}
}
Data-handling helper class: DealContext.java
package com.mapreducejob;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DealContext {
/**
 * Extract all matches of a regular expression (and URL-decode them).
 * @param line    the string to search
 * @param pattern the regular expression, containing one capturing group
 * @return the list of decoded matches
 */
public List<String> getPattern(String line,String pattern) throws IOException {
List<String> result = new ArrayList<String>();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find( )) {
// System.out.println(m.group(0)+"\n");
// URL-decode the captured group
String str = URLDecoder.decode(m.group(1),"utf-8");
result.add(str);
}
return result;
}
/**
 * Convert a time of day into seconds.
 * @param hour   hours
 * @param minute minutes
 * @param second seconds
 * @return the time of day in seconds
 */
public Long getSecond(String hour,String minute,String second){
return Long.parseLong(hour)*3600+Long.parseLong(minute)*60+Long.parseLong(second);
}
/**
 * Delete a file or directory (recursively) if it exists.
 */
public boolean deleteFile(String path){
File dirFile = new File(path);
if (!dirFile.exists()) {
return false;
}
if (dirFile.isFile()) {
return dirFile.delete();
} else {
for (File file : dirFile.listFiles()) {
deleteFile(file.getPath());
}
}
return dirFile.delete();
}
}
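A quick way to sanity-check the two helpers above is to run them against the sample record from section 1 (a small sketch; the class name DealContextDemo is only for illustration):

package com.mapreducejob;

import java.util.List;

public class DealContextDemo {
    public static void main(String[] args) throws Exception {
        DealContext dc = new DealContext();
        // the <A .../> record from the data sample in section 1
        String line = "<A e=\"23:56:45\" s=\"23:51:45\" n=\"133\" t=\"2\" pi=\"488\" "
                + "p=\"24%E5%B0%8F%E6%97%B6\" sn=\"CCTV-13 新闻\" />";

        // regex extraction + URL decoding, exactly as MapperDemo does
        List<String> p = dc.getPattern(line, "p=\"(.*?)\"");
        System.out.println(p.get(0)); // prints: 24小时

        // viewing duration in seconds: 23:56:45 minus 23:51:45
        long duration = dc.getSecond("23", "56", "45") - dc.getSecond("23", "51", "45");
        System.out.println(duration); // prints: 300
    }
}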
Second MapReduce program
Notes:
Process the output of the first MapReduce program and compute, for each date, the Top 10 programs ranked by average viewing time per viewer. The code follows the textbook example: both the mapper and the reducer keep a TreeMap of at most 10 entries keyed by the ranking value, dropping the smallest entry whenever the map grows beyond 10.
Main class: TopnJob.java
package topN;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TopnJob {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String inputPath = "E:/hadoop/wordcount/input1";
String outputPath = "E:/hadoop/wordcount/output1";
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(TopnJob.class);
job.setMapperClass(TopnMapper.class);
job.setCombinerClass(TopnReducer.class);
job.setReducerClass(TopnReducer.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(10);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
System.exit(job.waitForCompletion(true)?0:1);
}
}
Mapper class: TopnMapper.java
package topN;
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TopnMapper extends Mapper<Object, Text, NullWritable, Text>{
private TreeMap<Integer, Text> visittimesMap = new TreeMap<Integer, Text>();
@Override
public void map(Object key,Text value, Context context){
if(value==null){
return;
}
// the first job's output is tab-separated: "program@date<TAB>viewers@totalTime@averageTime"
String[] strs = value.toString().split("\t");
if (strs.length < 2) {
return; // skip malformed lines
}
String reputation = strs[1];
// rank by the average viewing time per viewer (third @-separated field), as described above;
// entries with an equal average overwrite each other in the TreeMap
visittimesMap.put(Integer.parseInt(reputation.split("@")[2]), new Text(value));
if(visittimesMap.size()>10){
visittimesMap.remove(visittimesMap.firstKey());
}
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException{
for(Text t:visittimesMap.values()){
context.write(NullWritable.get(), t);
}
}
}
Reducer class: TopnReducer.java
package topN;
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TopnReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private TreeMap<Integer, Text> visittimesMap = new TreeMap<Integer, Text>();
@Override
public void reduce(NullWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException{
for(Text value:values){
String[] strs = value.toString().split("\t"); // the lines keep the first job's tab separator
// rank by the average viewing time per viewer (third @-separated field), as described above
visittimesMap.put(Integer.parseInt(strs[1].split("@")[2]), new Text(value));
if(visittimesMap.size()>10){
visittimesMap.remove(visittimesMap.firstKey());
}
}
for(Text t:visittimesMap.values()){
context.write(NullWritable.get(), t);
}
}
}
Custom partitioner class: MyPartitioner.java
package topN;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.io.NullWritable;
public class MyPartitioner extends Partitioner<NullWritable, Text>{
@Override
public int getPartition(NullWritable key, Text value, int numPartitions) {
    // the value looks like "program@2012-09-17<TAB>viewers@total@average"; take the day of month
    String[] strs = value.toString().split("\t");
    int day = Integer.parseInt(strs[0].split("@")[1].split("-")[2]);
    if (day >= 16 && day <= 24) {
        return day - 16; // days 16..24 -> partitions 0..8
    }
    return 9; // everything else -> partition 9
}
}
Step 3: Store the results in HBase
Upload the Top 10 results to HDFS at hdfs://master:9000/input, for example:
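A possible set of upload commands, assuming the Top 10 part files were first copied to /home/hadoop/top10 on the master node:

hadoop@master: hdfs dfs -mkdir -p /input
hadoop@master: hdfs dfs -put /home/hadoop/top10/part-r-* /input
hadoop@master: hdfs dfs -ls /input # confirm the files arrived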
Main class: TablePutTest.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class TablePutTest {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String tableName = "TVshow";
TableName tbn = TableName.valueOf(tableName); // name of the HBase table
// 1. create the required configuration (an HBase Configuration instance)
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "master,slave1"); // the quorum hosts go in one comma-separated value
conf.set("hbase.zookeeper.property.clientPort", "2181");
// if the table already exists, delete it first
Connection connection = ConnectionFactory.createConnection(conf);
Admin admin = connection.getAdmin();
if(admin.tableExists(tbn)){
admin.disableTable(tbn);
admin.deleteTable(tbn);
}
HTableDescriptor htd = new HTableDescriptor(tbn); // table descriptor
HColumnDescriptor hcd = new HColumnDescriptor("content"); // column family descriptor
htd.addFamily(hcd); // add the column family
admin.createTable(htd); // create the table
Job job = Job.getInstance(conf,"import from hdfs to hbase"); // the import job
job.setJarByClass(TablePutTest.class);
job.setMapperClass(MapperHbase.class); // set the Mapper
// configure the reducer that writes into HBase
TableMapReduceUtil.initTableReducerJob(tableName, ReducerHbase.class, job, null, null, null, null, false);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Put.class);
FileInputFormat.addInputPaths(job,"hdfs://master:9000/input");
System.exit(job.waitForCompletion(true)?0:1);
}
}
Mapper class: MapperHbase.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperHbase extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// each input line (one Top-10 record) becomes the key; the value is a count of 1
word.set(value.toString());
context.write(word, one);
}
}
Reducer class: ReducerHbase.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class ReducerHbase extends TableReducer<Text, IntWritable, Text>{
// with TableOutputFormat the output value must be a Put or Delete instance
public void reduce(Text key,Iterable<IntWritable> values,Context context)
throws IOException, InterruptedException {
// the record still carries the first job's tab separator: "program@date<TAB>viewers@total@average"
String[] line = key.toString().split("\t");
String p = line[0].split("@")[0];
System.out.println(p);
String date = line[0].split("@")[1];
String num = line[1].split("@")[0];
String time = line[1].split("@")[1];
String avertime = line[1].split("@")[2];
// the Put is keyed by the row key, here the whole record line
Put put = new Put(Bytes.toBytes(key.toString()));
// three arguments: column family "content", a column qualifier, and the cell value
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("TV"),Bytes.toBytes(String.valueOf(p)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("date"),Bytes.toBytes(String.valueOf(date)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("num"),Bytes.toBytes(String.valueOf(num)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("time"),Bytes.toBytes(String.valueOf(time)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("avertime"),Bytes.toBytes(String.valueOf(avertime)));
context.write(key, put);
}
}
Step 4: Query the results in HBase
No HBase query API was written for this project, so the results can only be inspected through the hbase shell (a rough sketch of a Java scan is included after the shell output):
hadoop@master: /usr/local/hadoop-2.9.2/sbin/start-dfs.sh
hadoop@master: /usr/local/hadoop-2.9.2/sbin/start-yarn.sh
hadoop@master: /usr/local/hbase-1.5.0/bin/start-hbase.sh # start HBase
hadoop@master: hbase shell
···
···
hbase(main):001:0> scan 'TVshow' # scan all rows
······
24\xE5\xB0\x8F\xE6\x97\xB6@2012-09-16\x09106@29785@280 column=content:TV, timestamp=1577885998265, value=24\xE5\xB0\x8F\xE6\x97\xB6
24\xE5\xB0\x8F\xE6\x97\xB6@2012-09-16\x09106@29785@280 column=content:avertime, timestamp=1577885998265, value=280
······
90 row(s) in 10.1250 seconds
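For completeness, here is a rough sketch of what a Java scan of the TVshow table could look like; this was not written as part of the project, and the connection settings are simply copied from TablePutTest:

package com.hbasetest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TableScanTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master,slave1");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // scan the whole table and print the row key plus two of the columns
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("TVshow"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result result : scanner) {
                String row = Bytes.toString(result.getRow());
                String tv = Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("TV")));
                String avertime = Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("avertime")));
                System.out.println(row + "  TV=" + tv + "  avertime=" + avertime);
            }
        }
    }
}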
3. Summary
1. The approach matters more than the code.
2. MapReduce really does lower the barrier to distributed data processing considerably: a programmer mostly only needs to care about the <key, value> input and output types of the mapper and reducer, and the logic applied to each <key, value> pair.
3. Being able to write a MapReduce job does not by itself make someone a distributed-systems programmer.
4. Careful observation of the data is the most important part; analysing the data's characteristics with code gets twice the result for half the effort.
······