Hadoop and Java Client Programming

王小懒ws

Category: hadoop. Tags: hdfs

Article link: https://blog.csdn.net/wangshun_410/article/details/90681225

Preparation before development

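Some preparation is needed on a Windows development machine. Hadoop targets Linux, and the downloadable binary package is built for Linux, so developing on Windows requires downloading the source package and compiling it on Windows. Otherwise some features fail with errors such as HADOOP_HOME not being found or winutils.exe missing. A precompiled, trimmed-down Windows build is available here (about 3 MB, keeping only the essentials). Link: https://pan.baidu.com/s/1E5fsKpCigPxi1h5fMLJKRg
Extraction code: 6al7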

Unpack the Windows build of Hadoop to a directory on the Windows machine.
Set the Windows environment variable HADOOP_HOME to the unpacked hadoop directory.
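
Before running any client code, it can help to confirm that the environment variable is visible to the JVM. The following is a minimal sketch, assuming the usual layout in which winutils.exe sits in the bin directory under HADOOP_HOME; the class name HadoopHomeCheck and the method findProblem are made up for this illustration.

```java
import java.io.File;

public class HadoopHomeCheck {

    /**
     * Returns a description of what is missing, or null when the directory
     * looks usable. Assumes winutils.exe lives in HADOOP_HOME/bin.
     */
    public static String findProblem(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        File winutils = new File(new File(hadoopHome, "bin"), "winutils.exe");
        if (!winutils.isFile()) {
            return "winutils.exe not found at " + winutils.getPath();
        }
        return null;
    }

    public static void main(String[] args) {
        // Read the variable exactly as the JVM sees it
        String problem = findProblem(System.getenv("HADOOP_HOME"));
        System.out.println(problem == null ? "HADOOP_HOME looks usable" : problem);
    }
}
```

If the variable was set while the IDE was already open, restart the IDE so that the new value reaches the JVM.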

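Writing the code

1. Set up the development environment

If you prefer importing jars by hand, add the jars needed for HDFS client development to the project (they are in the Hadoop distribution under common and hdfs). I use a Maven project here. The Hadoop version declared in pom.xml should ideally match the version of the installed cluster. For other dependencies, see the repository: https://mvnrepository.com/

pom.xml is configured as follows: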

<properties>
        <hadoop-version>3.0.2</hadoop-version>
        <log-version>1.2.17</log-version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop-version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop-version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop-version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log-version}</version>
        </dependency>
    </dependencies>

2. Write the code: upload a local file to HDFS

Key point: to operate on files in HDFS, the code first needs an HDFS client object.

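import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public static void main(String[] args) throws Exception {
        // Client-side configuration. Values set here override those loaded from hdfs-site.xml
        // on the classpath; anything left unset falls back to that file or the built-in defaults.
        Configuration conf = new Configuration();
        // Replication factor
        conf.set("dfs.replication", "2");
        // Block size (the key is dfs.blocksize, not def.blocksize)
        conf.set("dfs.blocksize", "20m");
        // Get an HDFS client object: the URI is the node to connect to, conf is the configuration above,
        // and the last argument is the user identity (if omitted, the current OS user is used)
        FileSystem fs = FileSystem.get(new URI("hdfs://node1:9000"), conf, "ws");
        // Upload a local file to HDFS
        fs.copyFromLocalFile(new Path("F:\\spark-2.3.3-bin-hadoop2.7.tgz"), new Path("/"));
        // Close the client
        fs.close();
 }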

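demo1---Log upload

1. Requirements:

Business programs on the application servers continuously produce business logs (for example, the page-access logs of a website). The logs are written with log4j, which keeps rolling over to new log files. Periodically (for example, every hour) the log directory on each server must be scanned for files that are ready to be collected, and those files sent to HDFS.

Notes: there may be several business servers, so a file in HDFS cannot simply reuse the file name from the log server.

Logs collected on a given day go into that day's directory in HDFS.

Log files that have been collected are moved to a backup directory on the log server.

The backup directory is checked periodically (once an hour), and log files that have been backed up for more than 24 hours are deleted.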

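2. Design

1. Flow

    Generate logs

    Start a scheduled task:
	    - periodically scan the log source directory
	    - pick the files that need to be collected
	    - move these files to a staging directory for upload
	    - iterate over the files in the staging directory, transfer each one to the target path in HDFS, and move each transferred file to the backup directory

    Start a second scheduled task:
	- scan the backups in the backup directory, check whether each has exceeded the maximum retention time, and delete it if so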
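
The first scheduled task can be sketched without a cluster by replacing the HDFS transfer with a callback. This is only an illustration of the flow above under stated assumptions: the Uploader interface, the class name CollectFlowSketch, and the temporary directories are hypothetical, and file-name filtering is left out for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CollectFlowSketch {

    /** Hypothetical stand-in for the HDFS transfer step. */
    public interface Uploader {
        void upload(Path localFile) throws IOException;
    }

    /** Lists the regular files directly inside dir (empty if dir does not exist). */
    static List<Path> listFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return files;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        return files;
    }

    /** One run of the task: source -> staging -> upload -> backup. Returns the number of uploads. */
    public static int collectOnce(Path source, Path staging, Path backup, Uploader uploader)
            throws IOException {
        Files.createDirectories(staging);
        Files.createDirectories(backup);
        for (Path file : listFiles(source)) {
            Files.move(file, staging.resolve(file.getFileName()));
        }
        int uploaded = 0;
        // Iterate over staging, not over the files just moved, so leftovers from a failed run are retried
        for (Path file : listFiles(staging)) {
            uploader.upload(file);
            Files.move(file, backup.resolve(file.getFileName()));
            uploaded++;
        }
        return uploaded;
    }

    /** Runs the flow on two throwaway files and returns how many ended up in the backup directory. */
    public static int demo() {
        try {
            Path root = Files.createTempDirectory("collect-demo");
            Path source = Files.createDirectories(root.resolve("accesslog"));
            Files.write(source.resolve("access.log.1"), "first".getBytes());
            Files.write(source.resolve("access.log.2"), "second".getBytes());
            Path backup = root.resolve("backup");
            collectOnce(source, root.resolve("toupload"), backup,
                    file -> System.out.println("uploading " + file.getFileName()));
            return listFiles(backup).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files in backup: " + demo());
    }
}
```

Staging the files before uploading means a crash in the middle of a run leaves the unfinished files in the staging directory, where the next run picks them up again.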
	
	
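2. Plan the paths
    Log source directory: d:/logs/accesslog/
    Staging directory for upload: d:/logs/toupload/
    Backup directory: d:/logs/backup/<date>/

    HDFS storage path: /logs/<date>
    File name prefix in HDFS: access_log_
    File name suffix in HDFS: .log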
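
Because several servers produce files with identical names, the name in HDFS has to be made unique. One way to do that, and the one the collection task below relies on, is to put a random UUID between the prefix and the suffix. A small sketch of the naming rule follows; the class name HdfsTargetName and the sample date string are assumptions made for this example.

```java
import java.util.UUID;

public class HdfsTargetName {

    /** Builds <baseDir>/<date>/<prefix><random UUID><suffix>. */
    public static String build(String baseDir, String date, String prefix, String suffix) {
        String dir = baseDir.endsWith("/") ? baseDir : baseDir + "/";
        return dir + date + "/" + prefix + UUID.randomUUID() + suffix;
    }

    public static void main(String[] args) {
        // The UUID part differs on every call, so two servers never collide
        System.out.println(build("/logs/", "2019-05-30", "access_log_", ".log"));
        System.out.println(build("/logs/", "2019-05-30", "access_log_", ".log"));
    }
}
```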

3. Code design

3.1 Log generation module

Generate the log files with log4j

LoggerWriter.java

import org.apache.log4j.Logger;

public class LoggerWriter {
	public static void main(String[] args) throws Exception {
		// Look the logger up once; it is configured in log4j.properties
		Logger logger = Logger.getLogger("logRollingFile");
		while (true) {
			logger.info("111111111111111111111111111110----" + System.currentTimeMillis());
			Thread.sleep(10);
		}
	}
}

log4j.properties

#log4j.rootLogger=debug,stdout,genlog
# Format: level, appender1, appender2, ...
log4j.rootLogger=INFO,stdout


log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
        
###
log4j.logger.logRollingFile= DEBUG,test1
log4j.appender.test1 = org.apache.log4j.RollingFileAppender
log4j.appender.test1.layout = org.apache.log4j.PatternLayout
log4j.appender.test1.layout.ConversionPattern =%d{yyyy-MMM-dd HH:mm:ss}-[TS] %p %t %c - %m%n
log4j.appender.test1.Threshold = DEBUG
log4j.appender.test1.ImmediateFlush = TRUE
log4j.appender.test1.Append = TRUE
log4j.appender.test1.File = d:/logs/accesslog/access.log
# Maximum size of the log file before it is rolled over
log4j.appender.test1.MaxFileSize = 64KB
# Maximum number of rolled-over files to keep
log4j.appender.test1.MaxBackupIndex = 200

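3.2 Configuration module for the demo

As the design shows, the demo uses many file paths, so we put them in a configuration file instead of scattering string literals through the code. To avoid misspelling a key when reading the configuration, the keys are also collected as constants in one class. Finally, the configuration only needs to be loaded once and the same instance can then be used everywhere; to avoid loading it repeatedly we use the singleton pattern.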

collect.properties

LOG_SOURCE_DIR=d:/logs/accesslog/
LOG_TOUPLOAD_DIR=d:/logs/toupload/
LOG_BACKUP_BASE_DIR=d:/logs/backup/
LOG_BACKUP_TIMEOUT=24
LOG_LEGAL_PREFIX=access.log.

HDFS_URI=hdfs://node1:9000/
HDFS_DEST_BASE_DIR=/logs/
HDFS_FILE_PREFIX=access_log_
HDFS_FILE_SUFFIX=.log
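
Note the trailing dot in LOG_LEGAL_PREFIX. The log4j 1.x RollingFileAppender keeps writing to access.log and renames finished files to access.log.1, access.log.2 and so on, so matching on "access.log." selects only the rolled-over files and leaves the file that is still being written alone. A minimal sketch of that filter on plain file names (the class name RolledLogFilter is made up here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RolledLogFilter {

    /** Keeps only the names that start with the configured prefix. */
    public static List<String> collectable(List<String> fileNames, String legalPrefix) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (name.startsWith(legalPrefix)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("access.log", "access.log.1", "access.log.2", "other.txt");
        // "access.log" itself is skipped because it lacks the trailing dot
        System.out.println(collectable(names, "access.log."));
        // prints [access.log.1, access.log.2]
    }
}
```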

Constant.java

public class Constant {
       static final String LOG_SOURCE_DIR="LOG_SOURCE_DIR";
       static final String LOG_TOUPLOAD_DIR="LOG_TOUPLOAD_DIR";
       static final String LOG_BACKUP_BASE_DIR="LOG_BACKUP_BASE_DIR";
       static final String LOG_BACKUP_TIMEOUT="LOG_BACKUP_TIMEOUT";
       static final String LOG_LEGAL_PREFIX="LOG_LEGAL_PREFIX";
       static final String HDFS_URI="HDFS_URI";
       static final String HDFS_DEST_BASE_DIR="HDFS_DEST_BASE_DIR";
       static final String HDFS_FILE_PREFIX="HDFS_FILE_PREFIX";
       static final String HDFS_FILE_SUFFIX="HDFS_FILE_SUFFIX";
}

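PropertyHolderLazy.java (lazy initialization)

import java.io.InputStream;
import java.util.Properties;

public class PropertyHolderLazy {
    // volatile is required for double-checked locking to be safe
    private static volatile Properties prop = null;

    public static Properties getProps() {
        if (prop == null) {
            synchronized (PropertyHolderLazy.class) {
                if (prop == null) {
                    // Load into a local variable first so other threads never see a half-loaded instance
                    Properties loaded = new Properties();
                    try (InputStream in = PropertyHolderLazy.class.getClassLoader()
                            .getResourceAsStream("collect.properties")) {
                        loaded.load(in);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    prop = loaded;
                }
            }
        }
        return prop;
    }
}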
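
For comparison, the same "load once, on first use" behaviour can also be obtained without explicit locking through the initialization-on-demand holder idiom, which relies on the JVM initializing a class lazily and thread-safely. This is a different technique from the double-checked locking used above, shown only as a sketch. To stay self-contained it reads two sample keys from an in-memory string, which is an assumption; the demo itself would read collect.properties from the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertyHolderSketch {

    private PropertyHolderSketch() {
    }

    // The JVM initializes Holder only when getProps() first touches it,
    // and class initialization is guaranteed to be thread-safe.
    private static class Holder {
        static final Properties PROPS = load();
    }

    private static Properties load() {
        Properties p = new Properties();
        // Assumption for this sketch: sample content instead of the classpath resource
        try (StringReader in = new StringReader("LOG_BACKUP_TIMEOUT=24\nHDFS_FILE_SUFFIX=.log\n")) {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static Properties getProps() {
        return Holder.PROPS;
    }

    public static void main(String[] args) {
        System.out.println(getProps().getProperty("LOG_BACKUP_TIMEOUT"));
        System.out.println(getProps() == getProps());
    }
}
```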

3.3 Log collection module

CollectTask.java

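import java.io.File;
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.TimerTask;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CollectTask extends TimerTask {
    @Override
    public void run() {
        // Timestamp used to name both the HDFS directory and the local backup directory
        SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd-HH");
        String date = sdf.format(new Date());
        Properties props = PropertyHolderLazy.getProps();

        // Scan the log source directory for files with the legal prefix
        File srcDir = new File(props.getProperty(Constant.LOG_SOURCE_DIR));
        String legalPrefix = props.getProperty(Constant.LOG_LEGAL_PREFIX);
        File[] listFiles = srcDir.listFiles((dir, name) -> name.startsWith(legalPrefix));
        if (listFiles == null) {
            // The source directory does not exist or cannot be read
            return;
        }
        try {
            // Move these files to the staging directory
            File toupload = new File(props.getProperty(Constant.LOG_TOUPLOAD_DIR));
            for (File file : listFiles) {
                FileUtils.moveFileToDirectory(file, toupload, true);
            }
            File[] touploadFiles = toupload.listFiles();
            if (touploadFiles == null || touploadFiles.length == 0) {
                return;
            }
            // Transfer each staged file to HDFS, then move it to the backup directory
            try (FileSystem fs = FileSystem.get(new URI(props.getProperty(Constant.HDFS_URI)),
                    new Configuration(), "ws")) {
                Path hdfsDestPath = new Path(props.getProperty(Constant.HDFS_DEST_BASE_DIR) + date);
                // Create the target directory only if it is missing
                if (!fs.exists(hdfsDestPath)) {
                    fs.mkdirs(hdfsDestPath);
                }
                File backupDir = new File(props.getProperty(Constant.LOG_BACKUP_BASE_DIR) + date + "/");
                for (File file : touploadFiles) {
                    // A random UUID keeps names unique across servers
                    String destName = props.getProperty(Constant.HDFS_FILE_PREFIX)
                            + UUID.randomUUID() + props.getProperty(Constant.HDFS_FILE_SUFFIX);
                    // Path(parent, child) inserts the separator between directory and file name
                    fs.copyFromLocalFile(new Path(file.getAbsolutePath()), new Path(hdfsDestPath, destName));
                    FileUtils.moveFileToDirectory(file, backupDir, true);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}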

3.4 Expired-backup cleanup module

DeleOvertimeFile.java

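import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Properties;
import java.util.TimerTask;

import org.apache.commons.io.FileUtils;

public class DeleOvertimeFile extends TimerTask {
    @Override
    public void run() {
        // Must match the pattern CollectTask uses to name the backup directories
        SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd-HH");
        long now = System.currentTimeMillis();
        Properties props = PropertyHolderLazy.getProps();
        // Retention time in hours, read from collect.properties
        long timeoutMillis = Long.parseLong(props.getProperty(Constant.LOG_BACKUP_TIMEOUT)) * 60 * 60 * 1000L;
        File backupBaseDir = new File(props.getProperty(Constant.LOG_BACKUP_BASE_DIR));
        File[] backupDirs = backupBaseDir.listFiles();
        if (backupDirs == null) {
            // No backups yet
            return;
        }
        for (File dir : backupDirs) {
            try {
                long time = sdf.parse(dir.getName()).getTime();
                if (now - time > timeoutMillis) {
                    FileUtils.deleteDirectory(dir);
                }
            } catch (Exception e) {
                // A directory that cannot be parsed or deleted should not stop the others from being checked
                e.printStackTrace();
            }
        }
    }
}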

3.5 Main function

import java.util.Timer;

public class DataConllectMain {
    public static void main(String[] args) {
        Timer timer = new Timer();
        // Collection task: run immediately, then once an hour as the requirements state
        timer.schedule(new CollectTask(), 0, 60 * 60 * 1000L);
        // Expired-backup cleanup task: also checked once an hour
        timer.schedule(new DeleOvertimeFile(), 0, 60 * 60 * 1000L);
    }
}
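
A single java.util.Timer runs all of its tasks on one background thread, so a slow upload delays the cleanup, and an unchecked exception thrown by one task stops the timer for both. If that matters, a ScheduledExecutorService with two threads is one possible alternative. The sketch below is not part of the original demo; the class name SchedulerSketch and the short period in demo() are assumptions chosen so the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerSketch {

    /** Schedules both tasks at the same fixed rate, each on its own pool thread. */
    public static ScheduledExecutorService start(Runnable collect, Runnable cleanup,
                                                 long period, TimeUnit unit) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        pool.scheduleAtFixedRate(collect, 0, period, unit);
        pool.scheduleAtFixedRate(cleanup, 0, period, unit);
        return pool;
    }

    /** Runs two dummy tasks every 10 ms; true if each ran at least 3 times within 5 s. */
    public static boolean demo() {
        CountDownLatch collectRuns = new CountDownLatch(3);
        CountDownLatch cleanupRuns = new CountDownLatch(3);
        ScheduledExecutorService pool =
                start(collectRuns::countDown, cleanupRuns::countDown, 10, TimeUnit.MILLISECONDS);
        try {
            return collectRuns.await(5, TimeUnit.SECONDS) && cleanupRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("both tasks ran repeatedly: " + demo());
    }
}
```

TimerTask implements Runnable, so in the demo the call would look like start(new CollectTask(), new DeleOvertimeFile(), 1, TimeUnit.HOURS).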

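demo2--Simple word count

1. Requirements:

We want a simple word-count feature: given several files, count the words they contain and how many times each word occurs, and save the result to a file for later inspection.

2. Design

1. Flow

    Read the files to be counted
	    
    Save the result to the result file
	
	
2. Plan the paths
    Input files: /wordcount/input
    Result file: /wordcount/output/result.data

3. Code design

To keep the program flexible, consider a few scenarios. We may not only count words: later we may need other kinds of processing on the same files. So we write an interface and load its implementation class by reflection from a configuration file, which lets us swap implementations without touching the main program. Going one step further: if intermediate results need to be cached, how should the interface be designed? We pass a context parameter into the interface method and let the context take care of caching.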

Mapper.java-----takes a context parameter

public interface Mapper {
    void map(String line,Context context);
}

Context.java-----a simple cache backed by a HashMap

public class Context {
    HashMap<Object,Object> contextMap=new HashMap<>();
    public void write(Object key,Object value){
        contextMap.put(key,value);
    }
    public Object getValue(Object key){
        return contextMap.get(key);
    }
    public HashMap<Object,Object> getContextMap(){
        return contextMap;
    }
}

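WordCountMapper.java---implements the Mapper interface and holds the business logic

public class WordCountMapper implements Mapper {
    @Override
    public void map(String line, Context context) {
        // Split on any run of whitespace so repeated spaces or tabs do not produce empty words
        String[] words = line.trim().split("\\s+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank line yields a single empty string
            }
            Object value = context.getValue(word);
            if (null == value) {
                context.write(word, 1);
            } else {
                int v = (int) value;
                context.write(word, v + 1);
            }
        }
    }
}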
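
The point of the interface is that the main program only knows a class name. The following self-contained sketch shows a different implementation being picked up by reflection and run over in-memory lines, with no cluster involved. It carries minimal copies of Mapper and Context so it compiles on its own; LowerCaseWordCountMapper and the run method are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapperSwapSketch {

    // Minimal copies of the interface and context from this section
    public interface Mapper {
        void map(String line, Context context);
    }

    public static class Context {
        private final Map<Object, Object> contextMap = new HashMap<>();

        public void write(Object key, Object value) {
            contextMap.put(key, value);
        }

        public Object getValue(Object key) {
            return contextMap.get(key);
        }

        public Map<Object, Object> getContextMap() {
            return contextMap;
        }
    }

    /** Hypothetical alternative implementation: counts words case-insensitively. */
    public static class LowerCaseWordCountMapper implements Mapper {
        @Override
        public void map(String line, Context context) {
            for (String word : line.trim().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Object old = context.getValue(word);
                context.write(word, old == null ? 1 : (Integer) old + 1);
            }
        }
    }

    /** Instantiates the mapper named by mapperClassName and feeds it every line. */
    public static Map<Object, Object> run(String mapperClassName, List<String> lines) {
        try {
            Mapper mapper = (Mapper) Class.forName(mapperClassName)
                    .getDeclaredConstructor().newInstance();
            Context context = new Context();
            for (String line : lines) {
                mapper.map(line, context);
            }
            return context.getContextMap();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load mapper " + mapperClassName, e);
        }
    }

    public static void main(String[] args) {
        // In the demo this name would come from MAPPER_CLASS in job.properties
        String name = LowerCaseWordCountMapper.class.getName();
        System.out.println(run(name, Arrays.asList("Hello hadoop", "hello HDFS")));
    }
}
```

Switching behaviour then only requires changing the configured class name; the code that drives the mapper stays the same.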

HdfsWordCount.java---main function

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HdfsWordCount {
    public static void main(String[] args) throws Exception {
        // Initialization: load job.properties and instantiate the configured Mapper by reflection
        Properties properties = new Properties();
        properties.load(HdfsWordCount.class.getClassLoader().getResourceAsStream("job.properties"));
        String mapperClass = properties.getProperty("MAPPER_CLASS");
        Mapper mapper = (Mapper) Class.forName(mapperClass).getDeclaredConstructor().newInstance();
        Context context = new Context();

        FileSystem fs = FileSystem.get(new URI("hdfs://node1:9000"), new Configuration(), "ws");
        try {
            // Data processing: read every file in the input directory line by line
            RemoteIterator<LocatedFileStatus> iter =
                    fs.listFiles(new Path(properties.getProperty("INPUT_PATH")), false);
            while (iter.hasNext()) {
                LocatedFileStatus file = iter.next();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(file.getPath()), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Call the business logic
                        mapper.map(line, context);
                    }
                }
            }

            // Output: make sure the parent directory of OUTPUT_PATH exists, then write the result
            HashMap<Object, Object> contextMap = context.getContextMap();
            Path outFile = new Path(properties.getProperty("OUTPUT_PATH"));
            Path outDir = outFile.getParent();
            if (outDir != null && !fs.exists(outDir)) {
                fs.mkdirs(outDir);
            }
            // Closing the stream flushes the buffered data to HDFS
            try (FSDataOutputStream out = fs.create(outFile)) {
                for (Map.Entry<Object, Object> entry : contextMap.entrySet()) {
                    out.write((entry.getKey() + "\t" + entry.getValue() + "\n")
                            .getBytes(StandardCharsets.UTF_8));
                }
            }
        } finally {
            fs.close();
        }
    }
}