hadoop
Final course project
Author: hahally  Start: 2019.12.1  End: 2020.1.1  Abstract: statistics over TV set-top-box viewing data (number of viewers, total viewing time, average viewing time per viewer, Top 10)
1. Data analysis
Data sample
<GHApp>
  <WIC cardNum="174041665" stbNum="01050908200014994" date="2012-09-16" pageWidgetVersion="1.0">
    <A e="23:56:45" s="23:51:45" n="133" t="2" pi="488" p="24%E5%B0%8F%E6%97%B6" sn="CCTV-13 新闻" />
    <I s="23:58:58"><URI><![CDATA[ui://standby.htm]]></URI></I>
  </WIC>
</GHApp>
1. Each `WIC` element represents one set-top-box user, identified by a unique `cardNum`
2. Each `A` tag represents one channel/viewing record
3. Each `WIC` tag contains multiple `A` tags
4. The `p` attribute value is URL-encoded and has to be decoded (see the small example after this list)
5. Some `sn` and `p` attribute values are empty (null or whitespace-only strings)
6. The meaning of `n`, `t` and `pi` is unknown, but when their values are less than 0, `p` or `sn` is empty
7. The `I` tag contains a URI resource and is not processed here
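For example, the `p` value of the sample record decodes like this; the same java.net.URLDecoder call is used later by the cleaning code (the class name DecodeDemo is only for illustration):

import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // "24%E5%B0%8F%E6%97%B6" is the p value from the sample record above
        String p = URLDecoder.decode("24%E5%B0%8F%E6%97%B6", "UTF-8");
        System.out.println(p); // prints: 24小时
    }
}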
Attribute description
cardNum ------- set-top-box card number
stbNum ------- user ID
date ------- date
e ------- end time
s ------- start time
n ------- (unknown)
t ------- (unknown)
pi ------- (unknown)
p ------- program content
sn ------- channel
2. Processing approach
Step 1: Merge the files
There are 7 days of data, one folder per day:
/73
  /2012-09-17
  /2012-09-18
  /2012-09-19
  /2012-09-20
  /2012-09-21
  /2012-09-22
  /2012-09-23
Each folder contains a large number of .txt files, for example:
···
ars10767@20120917000000.txt
···
So the first step is to merge all the .txt files under each folder into a single .txt file per day, giving:
2012-09-17.txt
2012-09-18.txt
2012-09-19.txt
2012-09-20.txt
2012-09-21.txt
2012-09-22.txt
2012-09-23.txt
Code
Notes:
Use FileUtil for the file handling.
See the official documentation for org.apache.hadoop.fs.FileUtil.
API docs: http://hadoop.apache.org/docs/r2.7.5/api/index.html
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class AllFilesToFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String srcPath = "E:/hadoop/73/";
        String dstPath = "E:/hadoop/wordcount/input/";         // output directory
        String[] pathlist = FileUtil.list(new File(srcPath));  // list the per-day folders
        for (int i = 0; i < pathlist.length; i++) {
            System.out.println(pathlist[i]);
            Path srcDir = new Path(srcPath + pathlist[i]);
            Path dstFile = new Path(dstPath + pathlist[i] + ".txt");
            FileSystem srcFS = srcDir.getFileSystem(conf);
            FileSystem dstFS = dstFile.getFileSystem(conf);
            boolean deleteSource = false;
            String addString = "";                              // separator written between merged files
            boolean s = FileUtil.copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString);
            System.out.println(s);
        }
    }
}
Note: FileUtil.copyMerge() no longer exists in Hadoop 3.2. Fix: look for a replacement (a rough sketch of one possibility follows), or use a Hadoop 2.x jar. After the sketch comes my own rather naive version; it is not recommended because it is slow.
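A rough sketch of what such a replacement could look like on Hadoop 3.x, merging every file under a source directory into one destination file with plain FileSystem streams and IOUtils.copyBytes (untested here; the class and method names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyMergeDemo {
    // merge every file directly under srcDir into the single file dstFile
    public static void copyMerge(Path srcDir, Path dstFile, Configuration conf) throws IOException {
        FileSystem srcFS = srcDir.getFileSystem(conf);
        FileSystem dstFS = dstFile.getFileSystem(conf);
        try (FSDataOutputStream out = dstFS.create(dstFile)) {
            for (FileStatus status : srcFS.listStatus(srcDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = srcFS.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, conf, false); // false: keep the output stream open
                    }
                }
            }
        }
    }
}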
Notes (on my own version below):
1. File reading
2. File writing
3. Multi-threading
4. Merge speed is only about 200 KB/s by rough observation (see the note after the save() method below)
File reading: readFile
public void readFile(String path) throws IOException {
    File file = new File(path);
    File[] fs = file.listFiles();
    for (File f : fs) {
        FileInputStream fis = new FileInputStream(f.getPath());
        InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
        BufferedReader br = new BufferedReader(isr);
        String line = "";
        while ((line = br.readLine()) != null) {
            save(line, path);   // call the file-writing method for every line
        }
        br.close();
        isr.close();
        fis.close();
    }
}
File writing: save
public void save(String content, String path) throws IOException {
    // note: the output file is reopened and closed for every single line
    BufferedWriter file = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path + ".txt", true), "UTF-8"));
    file.write(content + "\r\n");
    file.close();
}
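Most of the slowness comes from save() reopening and closing the output file for every single line. A sketch of a faster variant that opens one BufferedWriter per output file and reuses it for all lines (the method name readAndMerge is illustrative):

// faster variant of readFile(): one writer per output file, reused for every line
public void readAndMerge(String path) throws IOException {
    File[] fs = new File(path).listFiles();
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(path + ".txt", true), "UTF-8"))) {
        for (File f : fs) {
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}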
Creating the threads with the Runnable interface
public class ThreadJob implements Runnable{
public String path=null;
// constructor
public ThreadJob(String path){
this.path = path;
}
···
public void readFile(String path) throws IOException{
···
}
···
public void save(String content,String path) throws IOException{
···
}
···
@Override
public void run() {
try {
readFile(path);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Main class
public class MergeFile {
public static void main(String[] args) throws IOException {
String path = "E:/hadoop/73/";
List<String> pathList=getParentPath(path);
// create one thread per folder
Thread th1 = new Thread(new ThreadJob(pathList.get(0)));
Thread th2 = new Thread(new ThreadJob(pathList.get(1)));
Thread th3 = new Thread(new ThreadJob(pathList.get(2)));
Thread th4 = new Thread(new ThreadJob(pathList.get(3)));
Thread th5 = new Thread(new ThreadJob(pathList.get(4)));
Thread th6 = new Thread(new ThreadJob(pathList.get(5)));
Thread th7 = new Thread(new ThreadJob(pathList.get(6)));
th1.start();
th2.start();
th3.start();
th4.start();
th5.start();
th6.start();
th7.start();
}
public static List<String> getParentPath(String path) throws IOException {
File file = new File(path); // File object for the root directory
File[] fs = file.listFiles();
List<String> parentPath = new ArrayList<String>(); // the per-day sub-directories
for(File f:fs){ // iterate over the listed entries
if(f.isDirectory()){
parentPath.add(f.getPath());
}
}
return parentPath;
}
}
Step 2: Write MapReduce programs to clean the data
Three MapReduce programs were written: the output of the first is the input of the second, the output of the second is the input of the third, and the final results are imported into HBase (a sketch of chaining the first two jobs in one driver follows). Note: delete intermediate result files that are no longer needed.
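A sketch of how the first two jobs could be chained from a single driver instead of being run by hand (the per-job mapper/reducer setup is elided here because it is shown in full below; the class name PipelineDriver and the paths are illustrative):

package com.mapreducejob;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // first job: clean the merged logs (mapper/reducer/partitioner as in MapReduceDemo below)
        Job cleanJob = Job.getInstance(conf, "clean");
        cleanJob.setJarByClass(PipelineDriver.class);
        // ... same setup as in MapReduceDemo ...
        FileInputFormat.addInputPath(cleanJob, new Path("E:/hadoop/wordcount/input/"));
        FileOutputFormat.setOutputPath(cleanJob, new Path("E:/hadoop/wordcount/output"));
        if (!cleanJob.waitForCompletion(true)) {
            System.exit(1); // stop the pipeline if the cleaning job fails
        }

        // second job: per-date Top 10, reading the first job's output directly
        Job topnJob = Job.getInstance(conf, "top10");
        topnJob.setJarByClass(PipelineDriver.class);
        // ... same setup as in TopnJob ...
        FileInputFormat.addInputPath(topnJob, new Path("E:/hadoop/wordcount/output"));
        FileOutputFormat.setOutputPath(topnJob, new Path("E:/hadoop/wordcount/output1"));
        System.exit(topnJob.waitForCompletion(true) ? 0 : 1);
    }
}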
First MapReduce program
Notes: the relevant fields (program name p, date, and user ID stbNum) are extracted from the data with regular expressions, and the total viewing time, the average viewing time per viewer, and the number of viewers are computed per program and date. The output looks like this (the key and value in each line are separated by TextOutputFormat's default tab):
···
1039交通服务热线@2012-09-17 7@6034@862
···
Main class: MapReduceDemo.java
package com.mapreducejob;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MapReduceDemo {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String inputPath="E:/hadoop/wordcount/input/";
String outputPath="E:/hadoop/wordcount/output";
DealContext dc = new DealContext();
dc.deleteFile(outputPath); // delete the output directory if it already exists
args = new String[]{
inputPath,
outputPath
};
Configuration conf = new Configuration(); // load the configuration
Job job = Job.getInstance(conf); // create the job instance
job.setJarByClass(MapReduceDemo.class); // the jar to run
job.setOutputKeyClass(Text.class); // output key type
job.setOutputValueClass(Text.class); // output value type
job.setMapperClass(MapperDemo.class); // Mapper class
//job.setCombinerClass(ReducerDemo.class);
job.setReducerClass(ReducerDemo.class); // Reducer class
job.setPartitionerClass(MyPartitioner.class); // custom partitioner
job.setNumReduceTasks(10); // number of reduce tasks, matching the partitioner's numPartitions
FileInputFormat.addInputPath(job, new Path(args[0])); // add the input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // set the output path
job.waitForCompletion(true);
}
}
Mapper class: MapperDemo.java
package com.mapreducejob;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperDemo extends Mapper<LongWritable,Text,Text,Text> {
protected void map(LongWritable key, Text value,Context context) throws IOException, InterruptedException{
DealContext dc = new DealContext();
String line = value.toString();
// use DealContext to extract the fields with regular expressions
List<String> A = dc.getPattern(line, "<A(.*?)/>");
List<String> stbnum = dc.getPattern(line, "stbNum=\"(.*?)\"");
List<String> date = dc.getPattern(line, "date=\"(.*?)\"");
if(A.size()>0){
List<String> sn = dc.getPattern(line, "sn=\"(.*?)\"");
List<String> p = dc.getPattern(line, "p=\"(.*?)\"");
List<String> e = dc.getPattern(line, " e=\"(.*?)\"");
List<String> s = dc.getPattern(line, "s=\"(.*?)\"");
for (int i=0;i<A.size();i++){
try {
if(!sn.get(i).trim().equals("")||!p.get(i).trim().equals("")){
Long starttime = dc.getSecond(s.get(i).split(":")[0], s.get(i).split(":")[1], s.get(i).split(":")[2]) ;
Long endtime = dc.getSecond(e.get(i).split(":")[0], e.get(i).split(":")[1], e.get(i).split(":")[2]) ;
Long time = endtime - starttime;
if(time<0){
time = time+24*3600;
}
// only keep records longer than 1 second
if(time>1){
//System.out.print(p.get(i).trim()+"#"+date.get(0)+sn.get(i)+"\n");
context.write(new Text(p.get(i).trim()+"@"+date.get(0)),new Text(stbnum.get(0)+"@"+sn.get(i).trim()+"@"+time));
}
}
} catch (Exception e2) {
System.out.println(e2.getMessage());
}
}
}
}
}
Reducer class: ReducerDemo.java
package com.mapreducejob;
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReducerDemo extends Reducer<Text,Text,Text,Text> {
protected void reduce(Text key, Iterable<Text> values,Context context) throws IOException,InterruptedException{
Long time=(long) 0;
HashSet<String> stbset = new HashSet<String>(); // HashSet: the same set-top-box ID is only counted once
for(Text val : values){
String[] str = val.toString().split("@");
String stb = str[0];
time += Long.parseLong(str[2]);
stbset.add(stb);
}
//"人数:"+stbset.size()+" 时长:"+time+" 人均收视时长:"+time/stbset.size())
context.write(key,new Text(stbset.size()+"@"+time+"@"+time/stbset.size()));
}
}
Custom partitioner class: MyPartitioner.java
package com.mapreducejob;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyPartitioner extends Partitioner<Text, Text>{
/**
 * Partition by date: days 16..24 of the month map to partitions 0..8,
 * everything else goes to partition 9.
 */
@Override
public int getPartition(Text key, Text value, int numPartitions) {
    // the key looks like "program@2012-09-17"; take the day of month
    int day = Integer.parseInt(key.toString().split("@")[1].split("-")[2]);
    if (day >= 16 && day <= 24) {
        return day - 16; // days 16..24 -> partitions 0..8
    }
    return 9; // everything else -> partition 9
}
}
Data-handling helper class: DealContext.java
package com.mapreducejob;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DealContext {
/**
 * Extract all matches of a regular expression (and URL-decode them).
 * @param line    the string to search
 * @param pattern the regular expression, containing one capturing group
 * @return the list of decoded matches
 */
public List<String> getPattern(String line,String pattern) throws IOException {
List<String> result = new ArrayList<String>();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find( )) {
// System.out.println(m.group(0)+"\n");
// URL-decode the captured group
String str = URLDecoder.decode(m.group(1),"utf-8");
result.add(str);
}
return result;
}
/**
 * Convert a time of day into seconds.
 * @param hour   hours
 * @param minute minutes
 * @param second seconds
 * @return the time of day in seconds
 */
public Long getSecond(String hour,String minute,String second){
return Long.parseLong(hour)*3600+Long.parseLong(minute)*60+Long.parseLong(second);
}
/**
 * Delete a file or directory (recursively) if it exists.
 */
public boolean deleteFile(String path){
File dirFile = new File(path);
if (!dirFile.exists()) {
return false;
}
if (dirFile.isFile()) {
return dirFile.delete();
} else {
for (File file : dirFile.listFiles()) {
deleteFile(file.getPath());
}
}
return dirFile.delete();
}
}
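A quick way to sanity-check the two helpers above is to run them against the sample record from section 1 (a small sketch; the class name DealContextDemo is only for illustration):

package com.mapreducejob;

import java.util.List;

public class DealContextDemo {
    public static void main(String[] args) throws Exception {
        DealContext dc = new DealContext();
        // the <A .../> record from the data sample in section 1
        String line = "<A e=\"23:56:45\" s=\"23:51:45\" n=\"133\" t=\"2\" pi=\"488\" "
                + "p=\"24%E5%B0%8F%E6%97%B6\" sn=\"CCTV-13 新闻\" />";

        // regex extraction + URL decoding, exactly as MapperDemo does
        List<String> p = dc.getPattern(line, "p=\"(.*?)\"");
        System.out.println(p.get(0)); // prints: 24小时

        // viewing duration in seconds: 23:56:45 minus 23:51:45
        long duration = dc.getSecond("23", "56", "45") - dc.getSecond("23", "51", "45");
        System.out.println(duration); // prints: 300
    }
}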
Second MapReduce program
Notes:
Process the output of the first MapReduce program and compute, for each date, the Top 10 programs ranked by average viewing time per viewer. The code follows the textbook example: both the mapper and the reducer keep a TreeMap of at most 10 entries keyed by the ranking value, dropping the smallest entry whenever the map grows beyond 10.
Main class: TopnJob.java
package topN;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TopnJob {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String inputPath = "E:/hadoop/wordcount/input1";
String outputPath = "E:/hadoop/wordcount/output1";
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(TopnJob.class);
job.setMapperClass(TopnMapper.class);
job.setCombinerClass(TopnReducer.class);
job.setReducerClass(TopnReducer.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(10);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
System.exit(job.waitForCompletion(true)?0:1);
}
}
Mapper class: TopnMapper.java
package topN;
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TopnMapper extends Mapper<Object, Text, NullWritable, Text>{
private TreeMap<Integer, Text> visittimesMap = new TreeMap<Integer, Text>();
@Override
public void map(Object key,Text value, Context context){
if(value==null){
return;
}
// the first job's output is tab-separated: "program@date<TAB>viewers@totalTime@averageTime"
String[] strs = value.toString().split("\t");
if (strs.length < 2) {
return; // skip malformed lines
}
String reputation = strs[1];
// rank by the average viewing time per viewer (third @-separated field), as described above;
// entries with an equal average overwrite each other in the TreeMap
visittimesMap.put(Integer.parseInt(reputation.split("@")[2]), new Text(value));
if(visittimesMap.size()>10){
visittimesMap.remove(visittimesMap.firstKey());
}
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException{
for(Text t:visittimesMap.values()){
context.write(NullWritable.get(), t);
}
}
}
Reducer class: TopnReducer.java
package topN;
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TopnReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private TreeMap<Integer, Text> visittimesMap = new TreeMap<Integer, Text>();
@Override
public void reduce(NullWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException{
for(Text value:values){
String[] strs = value.toString().split("\t"); // the lines keep the first job's tab separator
// rank by the average viewing time per viewer (third @-separated field), as described above
visittimesMap.put(Integer.parseInt(strs[1].split("@")[2]), new Text(value));
if(visittimesMap.size()>10){
visittimesMap.remove(visittimesMap.firstKey());
}
}
for(Text t:visittimesMap.values()){
context.write(NullWritable.get(), t);
}
}
}
Custom partitioner class: MyPartitioner.java
package topN;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.io.NullWritable;
public class MyPartitioner extends Partitioner<NullWritable, Text>{
@Override
public int getPartition(NullWritable key, Text value, int numPartitions) {
    // the value looks like "program@2012-09-17<TAB>viewers@total@average"; take the day of month
    String[] strs = value.toString().split("\t");
    int day = Integer.parseInt(strs[0].split("@")[1].split("-")[2]);
    if (day >= 16 && day <= 24) {
        return day - 16; // days 16..24 -> partitions 0..8
    }
    return 9; // everything else -> partition 9
}
}
Step 3: Store the results in HBase
Upload the Top 10 results to HDFS at hdfs://master:9000/input, for example:
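A possible set of upload commands, assuming the Top 10 part files were first copied to /home/hadoop/top10 on the master node:

hadoop@master: hdfs dfs -mkdir -p /input
hadoop@master: hdfs dfs -put /home/hadoop/top10/part-r-* /input
hadoop@master: hdfs dfs -ls /input # confirm the files arrived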
Main class: TablePutTest.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class TablePutTest {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String tableName = "TVshow";
TableName tbn = TableName.valueOf(tableName); // name of the HBase table
// 1. create the required configuration (an HBase Configuration instance)
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "master,slave1"); // the quorum hosts go in one comma-separated value
conf.set("hbase.zookeeper.property.clientPort", "2181");
// if the table already exists, delete it first
Connection connection = ConnectionFactory.createConnection(conf);
Admin admin = connection.getAdmin();
if(admin.tableExists(tbn)){
admin.disableTable(tbn);
admin.deleteTable(tbn);
}
HTableDescriptor htd = new HTableDescriptor(tbn); // table descriptor
HColumnDescriptor hcd = new HColumnDescriptor("content"); // column family descriptor
htd.addFamily(hcd); // add the column family
admin.createTable(htd); // create the table
Job job = Job.getInstance(conf,"import from hdfs to hbase"); // the import job
job.setJarByClass(TablePutTest.class);
job.setMapperClass(MapperHbase.class); // set the Mapper
// configure the reducer that writes into HBase
TableMapReduceUtil.initTableReducerJob(tableName, ReducerHbase.class, job, null, null, null, null, false);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Put.class);
FileInputFormat.addInputPaths(job,"hdfs://master:9000/input");
System.exit(job.waitForCompletion(true)?0:1);
}
}
Mapper class: MapperHbase.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperHbase extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// each input line (one Top-10 record) becomes the key; the value is a count of 1
word.set(value.toString());
context.write(word, one);
}
}
Reducer class: ReducerHbase.java
package com.hbasetest;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class ReducerHbase extends TableReducer<Text, IntWritable, Text>{
// with TableOutputFormat the output value must be a Put or Delete instance
public void reduce(Text key,Iterable<IntWritable> values,Context context)
throws IOException, InterruptedException {
// the record still carries the first job's tab separator: "program@date<TAB>viewers@total@average"
String[] line = key.toString().split("\t");
String p = line[0].split("@")[0];
System.out.println(p);
String date = line[0].split("@")[1];
String num = line[1].split("@")[0];
String time = line[1].split("@")[1];
String avertime = line[1].split("@")[2];
// the Put is keyed by the row key, here the whole record line
Put put = new Put(Bytes.toBytes(key.toString()));
// three arguments: column family "content", a column qualifier, and the cell value
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("TV"),Bytes.toBytes(String.valueOf(p)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("date"),Bytes.toBytes(String.valueOf(date)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("num"),Bytes.toBytes(String.valueOf(num)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("time"),Bytes.toBytes(String.valueOf(time)));
put.addColumn(Bytes.toBytes("content"),Bytes.toBytes("avertime"),Bytes.toBytes(String.valueOf(avertime)));
context.write(key, put);
}
}
Step 4: Query the results in HBase
No HBase query API was written for this project, so the results can only be inspected through the hbase shell (a rough sketch of a Java scan is included after the shell output):
hadoop@master: /usr/local/hadoop-2.9.2/sbin/start-dfs.sh
hadoop@master: /usr/local/hadoop-2.9.2/sbin/start-yarn.sh
hadoop@master: /usr/local/hbase-1.5.0/bin/start-hbase.sh # start HBase
hadoop@master: hbase shell
···
···
hbase(main):001:0> scan 'TVshow' # scan all rows
······
24\xE5\xB0\x8F\xE6\x97\xB6@2012-09-16\x09106@29785@280 column=content:TV, timestamp=1577885998265, value=24\xE5\xB0\x8F\xE6\x97\xB6
24\xE5\xB0\x8F\xE6\x97\xB6@2012-09-16\x09106@29785@280 column=content:avertime, timestamp=1577885998265, value=280
······
90 row(s) in 10.1250 seconds
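For completeness, here is a rough sketch of what a Java scan of the TVshow table could look like; this was not written as part of the project, and the connection settings are simply copied from TablePutTest:

package com.hbasetest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TableScanTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master,slave1");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // scan the whole table and print the row key plus two of the columns
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("TVshow"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result result : scanner) {
                String row = Bytes.toString(result.getRow());
                String tv = Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("TV")));
                String avertime = Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("avertime")));
                System.out.println(row + "  TV=" + tv + "  avertime=" + avertime);
            }
        }
    }
}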
3. Summary
1. The approach matters more than the code.
2. MapReduce really does lower the barrier to distributed data processing considerably: a programmer mostly only needs to care about the <key, value> input and output types of the mapper and reducer, and the logic applied to each <key, value> pair.
3. Being able to write a MapReduce job does not by itself make someone a distributed-systems programmer.
4. Careful observation of the data is the most important part; analysing the data's characteristics with code gets twice the result for half the effort.
······