Hadoop Application Example: Extracting Web Page Elements

DilicelSten

Category: Hadoop. Tags: hadoop

Article link: https://blog.csdn.net/Totoro1745/article/details/55188909

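After a comfortable New Year holiday, it is time to get back to work. The break was productive: while resting at home I wrote a blog post covering an introduction to Hadoop, its installation, and its two core components, HDFS and MapReduce. This post shows, step by step, how to use Hadoop's MapReduce to extract elements from web pages. (It assumes Hadoop is already installed on the local machine; if it is not, see my previous Hadoop post.)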

About Hadoop

Data types

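BooleanWritable: wrapper for a standard boolean
ByteWritable: wrapper for a single byte
DoubleWritable: wrapper for a double-precision floating-point number
FloatWritable: wrapper for a single-precision floating-point number
IntWritable: wrapper for an int
LongWritable: wrapper for a long
Text: wrapper for text stored as UTF-8
NullWritable: placeholder used when no key or value is needed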
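
To make the idea of a "wrapper" concrete, here is a minimal sketch in plain Java with no Hadoop dependency. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the IntBox class below is a hypothetical stand-in that mimics that shape, and it is not a Hadoop class. The class and method names are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

    /** Hypothetical stand-in for an IntWritable-style wrapper; not a Hadoop class. */
    static class IntBox {
        private int value;

        IntBox() { }

        IntBox(int value) { this.value = value; }

        int get() { return value; }

        // Serialize the wrapped value to a binary stream.
        void write(DataOutput out) throws IOException {
            out.writeInt(value);
        }

        // Overwrite the wrapped value from a binary stream, reusing this object.
        void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }
    }

    /** Serializes a box to bytes, then reads the bytes back into a fresh box. */
    static int roundTrip(int n) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new IntBox(n).write(new DataOutputStream(buffer));

            IntBox copy = new IntBox();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}
```

The point of the pattern is that one object can be refilled from a stream many times, which is why the Mapper and Reducer later in this post keep their Text objects as fields and call set() on them.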

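About Mapper and Reducer

The Mapper handles the data-processing stage. It is a Java generic of the form Mapper<K1, V1, K2, V2>, where key classes implement WritableComparable and value classes implement Writable. Its one processing method, map, handles a single key/value pair.
When a reduce task receives the output of the mappers, the incoming data is sorted by key and the values that share a key are grouped together. reduce() is then called for each key and iterates over the values associated with that key to produce a list of (K3, V3) pairs.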
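
The flow described above (map, sort and group by key, then reduce) can be imitated in plain Java without Hadoop. This is only a sketch of the data flow, not Hadoop's implementation: the "user,info" input format, the "|" separator, and every name in the class are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceFlowSketch {

    /** Map phase: turn one input line "user,info" into a (key, value) pair. */
    static String[] map(String line) {
        String[] parts = line.split(",", 2);
        return new String[] { parts[0], parts[1] };
    }

    /** Reduce phase: fold every value that shares a key into one output value. */
    static String reduce(String key, List<String> values) {
        return String.join("|", values);
    }

    /** Runs map, groups the pairs by sorted key, then runs reduce once per key. */
    static Map<String, String> run(List<String> lines) {
        // A TreeMap keeps keys sorted, imitating the sort that precedes reduce.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] pair = map(line);
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        Map<String, String> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("bob,likes tea", "alice,writes code", "bob,runs daily");
        // prints {alice=writes code, bob=likes tea|runs daily}
        System.out.println(run(lines));
    }
}
```

In a real job the grouping step is done by the framework between the map and reduce tasks; user code only supplies the two functions.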

Steps for building the Hadoop application

First: switch to the Map/Reduce development perspective. From the Window menu choose Open Perspective -> Other, then select Map/Reduce in the dialog that appears.

Second: set up the connection to the Hadoop cluster. Click the Map/Reduce Locations panel in the lower-right corner of Eclipse, right-click inside the panel, and choose New Hadoop Location.

Third: so that the connection to the Hadoop cluster can succeed, start the NameNode and DataNode daemons.

cd /home/Hadoop   # first enter the local Hadoop directory
./sbin/start-dfs.sh   # start the HDFS daemons

Fourth: create a new MapReduce project

(1) Choose New -> Project…
(2) Select Map/Reduce Project and click Next
(3) Right-click the newly created project and choose New -> Class

Fifth: copy the configuration files into the project

cp /usr/local/hadoop/etc/hadoop/core-site.xml ~/workspace/UserVInfo/src
cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml ~/workspace/UserVInfo/src
cp /usr/local/hadoop/etc/hadoop/log4j.properties ~/workspace/UserVInfo/src

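Sixth: start writing code! (This post uses Java.)

PS: below are some of the methods I used in my own project
(1) Upload a local file to HDFS

// Requires java.io.IOException, java.net.URI, org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path.
public static void uploadInputFile(String localFile) throws IOException{
        Configuration conf = new Configuration();
        String hdfsPath = "hdfs://localhost:9000/";
        String hdfsInput = "hdfs://localhost:9000/user/hadoop/input";
        // Connect to the NameNode at hdfsPath, then copy the local file into the input folder
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.copyFromLocalFile(new Path(localFile), new Path(hdfsInput));
        fs.close();
        System.out.println("File uploaded to the input folder");
    }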

(2) Because the project runs in a loop, I also use a method that deletes the input folder

public static void deleteInput() throws IOException{
            Configuration conf = new Configuration();
            String hdfsInput = "hdfs://localhost:9000/user/hadoop/input";
            String hdfsPath = "hdfs://localhost:9000/";
            Path path = new Path(hdfsInput);
            FileSystem fs = FileSystem.get(URI.create(hdfsPath),conf);
            // deleteOnExit only marks the path; the delete happens when the FileSystem is closed below
            fs.deleteOnExit(path);
            fs.close();
            System.out.println("input folder deleted");
        }

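(3) The loop then needs a fresh input folder, so create one before uploading the file. Creating a folder on HDFS works differently from creating one on an ordinary file system

// Create a folder on HDFS
// FileSystem.get(conf) takes the default file system from the configuration on the classpath
// (the core-site.xml copied into the project in step five)
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(conf);
file.mkdirs(new Path("/user/hadoop/input"));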

PS: for comparison, Java code that creates a folder on a local file system such as Windows

// Create a folder on the local file system (does not work for HDFS, Hadoop's file system)
// Requires java.io.File.
    public static boolean mkDirectory(String path) {
        File file = new File(path);
        // Returns true only if the folder did not exist and was created, parents included
        return !file.exists() && file.mkdirs();
    }

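(4) Create the Mapper and Reducer classes

// Requires java.io.IOException, java.util.regex.Matcher, java.util.regex.Pattern,
// org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer.
public static class UserVMapper extends Mapper<Object, Text, Text, Text>{

        // Compiled once and reused for every record.
        // Without Pattern.DOTALL, ".*?" does not cross line breaks, and the default TextInputFormat
        // passes one line per call, so each <li> block must sit on a single line to match.
        private static final Pattern BLOCK_PATTERN = Pattern.compile("<li class=\"follow_item.*?</li>");

        private Text userV = new Text();
        private Text intro = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String html = value.toString();
            Matcher matcher = BLOCK_PATTERN.matcher(html);
            while(matcher.find()){
                String block = matcher.group();
                // getUserAndInfo is not shown in this post; it returns {user, info}, or null when nothing is found
                String[] userAndInfo = getUserAndInfo(block);
                if(userAndInfo != null){
                    userV.set(userAndInfo[0]);
                    intro.set(userAndInfo[1]);
                    context.write(userV, intro);
                }
            }
        }
    }

    public static class UserVReducer extends Reducer<Text, Text, Text, Text>{

        private Text value = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
            // Concatenate every intro emitted for the same user
            StringBuilder result = new StringBuilder();
            for(Text text: values){
                // Use toString(): getBytes() returns the whole backing array, which can hold bytes past the valid length
                result.append(text.toString());
            }
            value.set(result.toString());
            context.write(key, value);
        }
    }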
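
The mapper calls a getUserAndInfo helper that this post does not show. The sketch below is one possible shape for such a helper, written in plain Java so it can be tried without Hadoop. It is not the author's implementation: the inner markup (an a tag with class "name" and a span with class "intro") is invented for the example, and the real page's tags and class names may well differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserInfoExtractorSketch {

    // Hypothetical markup: these tag and class names are assumptions, not taken from the real page.
    private static final Pattern NAME = Pattern.compile("<a class=\"name\"[^>]*>(.*?)</a>");
    private static final Pattern INTRO = Pattern.compile("<span class=\"intro\"[^>]*>(.*?)</span>");

    /** Returns {user, info} for one <li> block, or null if either part is missing. */
    static String[] getUserAndInfo(String block) {
        Matcher name = NAME.matcher(block);
        Matcher intro = INTRO.matcher(block);
        if (!name.find() || !intro.find()) {
            return null;
        }
        return new String[] { name.group(1).trim(), intro.group(1).trim() };
    }

    public static void main(String[] args) {
        String block = "<li class=\"follow_item\"><a class=\"name\" href=\"#\">Alice</a>"
                + "<span class=\"intro\">Data engineer</span></li>";
        String[] pair = getUserAndInfo(block);
        // prints Alice and Data engineer separated by a tab
        System.out.println(pair[0] + "\t" + pair[1]);
    }
}
```

Returning null for an incomplete block matches the null check in the mapper, which simply skips blocks that cannot be parsed.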

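(5) Run the MapReduce program

// Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat and org.apache.hadoop.util.GenericOptionsParser.
public static void runMapReduce(String [] args) throws Exception{
        Configuration conf = new Configuration();
        String [] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if(otherArgs.length != 2){
            System.err.println("Usage: UserVinfo <in> <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "UserV");
        job.setJarByClass(UserVinfo.class);
        job.setMapperClass(UserVMapper.class);
        // The reducer only concatenates Text values, so it is reused as the combiner
        job.setCombinerClass(UserVReducer.class);
        job.setReducerClass(UserVReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // otherArgs[0] is the HDFS input path, otherArgs[1] the output path (which must not exist yet)
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        if(job.waitForCompletion(true)){
            System.out.println("MapReduce job finished!");
        }
    }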

Those are the main steps. The complete code is on my GitHub:
https://github.com/Totoro1997/Hadoop/blob/master/UserVinfo.java
