Introduction to Using the Hadoop API

绛门人
----------------Introduction to Using the Hadoop API---------------------




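The Hadoop API is divided into the following main packages:
org.apache.hadoop.conf      Defines the API for handling the configuration files that hold system parameters.
org.apache.hadoop.fs        Defines the abstract file system API.
org.apache.hadoop.io        Defines general-purpose I/O APIs for reading and writing data objects such as networks, databases, and files.
org.apache.hadoop.ipc       Tools for network servers and clients; wraps the basic building blocks of asynchronous network I/O.
org.apache.hadoop.mapred    Implementation of Hadoop's distributed computing system (MapReduce), including task dispatching and scheduling.
org.apache.hadoop.metrics   Defines APIs for performance statistics, used mainly by the mapred and dfs modules.
org.apache.hadoop.record    Defines record I/O API classes and a translator for a record description language, which simplifies serializing records in a language-neutral manner.
org.apache.hadoop.tools     Defines some general-purpose tools.
org.apache.hadoop.util      Defines some common utility APIs.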




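Nearly all of Hadoop's file-operation classes live in the "org.apache.hadoop.fs" package.

These APIs support operations such as opening, reading, writing, and deleting files.

The interface class that the Hadoop library ultimately exposes to users is FileSystem. It is an abstract class, so a concrete instance can only be obtained through the class's static get method.
The get method has several overloads; the most commonly used one is: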
 
static FileSystem get(Configuration conf);
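
To make the "abstract class plus static get" idea concrete, here is a minimal sketch in plain Java. It is not Hadoop's implementation: the names FactorySketch, MiniFileSystem, LocalFs, and RemoteFs are hypothetical, and a plain Map stands in for Configuration. All it models is a static get that reads a URI from the configuration and returns a concrete subclass chosen by the URI scheme.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class FactorySketch {

    // Hypothetical stand-in for an abstract file system type; NOT Hadoop's FileSystem.
    public abstract static class MiniFileSystem {
        public abstract String describe();

        // Static factory: chooses a concrete subclass from the scheme of the configured URI.
        public static MiniFileSystem get(Map<String, String> conf) {
            // Assumption for this sketch: fall back to the local file system when nothing is set.
            URI uri = URI.create(conf.getOrDefault("fs.defaultFS", "file:///"));
            String scheme = uri.getScheme();
            if ("hdfs".equals(scheme)) {
                return new RemoteFs(uri.getHost(), uri.getPort());
            }
            if ("file".equals(scheme)) {
                return new LocalFs();
            }
            throw new IllegalArgumentException("No file system for scheme: " + scheme);
        }
    }

    // Hypothetical concrete type for the local file system.
    static final class LocalFs extends MiniFileSystem {
        @Override
        public String describe() {
            return "local";
        }
    }

    // Hypothetical concrete type for a remote (HDFS-like) file system.
    static final class RemoteFs extends MiniFileSystem {
        private final String host;
        private final int port;

        RemoteFs(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String describe() {
            return "remote " + host + ":" + port;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(MiniFileSystem.get(conf).describe());
        conf.put("fs.defaultFS", "hdfs://192.168.146.128:9000");
        System.out.println(MiniFileSystem.get(conf).describe());
    }
}
```

The caller never names a concrete class; changing one configuration value is enough to switch implementations, which is why the examples that follow only ever hold a FileSystem reference.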




---------------------------------------------




package hdfstest;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsTest {

	// Create a new file
	public static void createFile(String dst, byte[] contents) throws IOException {
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://192.168.146.128:9000");
		FileSystem fs = FileSystem.get(conf);
		Path dstPath = new Path(dst); // destination path
		// Open an output stream
		FSDataOutputStream outputStream = fs.create(dstPath);
		outputStream.write(contents);
		outputStream.close();
		fs.close();
		System.out.println("File created successfully!");
	}

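	// Upload a local file
	public static void uploadFile(String src, String dst) throws IOException {
		// Note: unless core-site.xml is on the classpath, set fs.defaultFS here as in createFile;
		// otherwise FileSystem.get returns the local file system.
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path srcPath = new Path(src); // source path
		Path dstPath = new Path(dst); // destination path
		// Copy the file; the first argument says whether to delete the source
		// (true deletes it; the two-argument overload behaves as false)
		fs.copyFromLocalFile(false, srcPath, dstPath);

		// Print the target file system, then list the destination
		System.out.println("Upload to " + conf.get("fs.defaultFS"));
		System.out.println("------------list files------------" + "\n");
		FileStatus[] fileStatus = fs.listStatus(dstPath);
		for (FileStatus file : fileStatus) {
			System.out.println(file.getPath());
		}
		fs.close();
	}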

	// Rename a file
	public static void rename(String oldName, String newName) throws IOException {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path oldPath = new Path(oldName);
		Path newPath = new Path(newName);
		boolean isok = fs.rename(oldPath, newPath);
		if (isok) {
			System.out.println("rename ok!");
		} else {
			System.out.println("rename failure");
		}
		fs.close();
	}

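	// Delete a file or directory
	public static void delete(String filePath) throws IOException {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path path = new Path(filePath);
		// delete() removes the path immediately (true = recursive, needed for non-empty directories);
		// deleteOnExit() would only mark it for removal when the file system is closed
		boolean isok = fs.delete(path, true);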
		if (isok) {
			System.out.println("delete ok!");
		} else {
			System.out.println("delete failure");
		}
		fs.close();
	}

	// Create a directory
	public static void mkdir(String path) throws IOException {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path srcPath = new Path(path);
		boolean isok = fs.mkdirs(srcPath);
		if (isok) {
			System.out.println("create dir ok!");
		} else {
			System.out.println("create dir failure");
		}
		fs.close();
	}

	// Read a file and print its contents
	public static void readFile(String filePath) throws IOException {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path srcPath = new Path(filePath);
		InputStream in = null;
		try {
			in = fs.open(srcPath);
			IOUtils.copyBytes(in, System.out, 4096, false); // copy to standard output
		} finally {
			IOUtils.closeStream(in);
		}
	}

	public static void main(String[] args) throws IOException {
		// Test uploading a file
		uploadFile("D:\\c.txt", "/input");
		// Test creating a file
		byte[] contents = "hello world 世界你好\n".getBytes();
		createFile("/input/d.txt", contents);
		// Test renaming
		// rename("/user/hadoop/test/d.txt", "/user/hadoop/test/dd.txt");
		// Test deleting a file
		// delete("test/dd.txt"); // uses a relative path
		// delete("test1"); // deletes a directory
		// Test creating a directory
		// mkdir("test1");
		// Test reading a file
		readFile("/input/d.txt");
	}

}


--------------------------------


HDFS Java API
 
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://h6:9000");
FileSystem fileSystem = FileSystem.get(conf);
 
 
  
 
1. Create a directory:

Check whether it already exists,
and create it only if it does not.
if (!fileSystem.exists(new Path("/weir01"))) {
    fileSystem.mkdirs(new Path("/weir01"));
}
 
2. Create a file:
Parameters of IOUtils.copyBytes(in, out, buffSize, close):
in - InputStream to read from (the source file)
out - OutputStream to write to (the HDFS target)
buffSize - the size of the buffer
close - whether or not to close the InputStream and OutputStream at the end. The streams are closed in the finally clause.
 
 
 
FSDataOutputStream out = fileSystem.create(new Path("/d1"));
FileInputStream in = new FileInputStream("f:/hadoop.zip");
IOUtils.copyBytes(in, out, 1024, true);
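
The four parameters are easier to follow with the copy loop written out. The sketch below is a simplified plain-Java model of such a buffered copy, not the Hadoop source: CopyBytesSketch and its copyBytes helper are hypothetical, and unlike IOUtils.copyBytes the helper returns the number of bytes copied so the result is easy to inspect.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class CopyBytesSketch {

    // Hypothetical helper: copies everything from in to out through a buffSize-byte buffer.
    // When close is true, both streams are closed in the finally clause, even on failure.
    public static long copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
            throws IOException {
        long total = 0;
        try {
            byte[] buf = new byte[buffSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
            out.flush();
        } finally {
            if (close) {
                try {
                    out.close();
                } finally {
                    in.close();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyBytes(new ByteArrayInputStream(data), sink, 4, true);
        System.out.println(copied + " bytes copied: " + sink.toString("UTF-8"));
    }
}
```

Passing true for close, as the upload snippet does, means the caller does not need its own finally block; passing false (as readFile does with System.out) leaves the streams open for further use.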
 
 
3. Upload local files

delSrc - whether to delete the src
overwrite - whether to overwrite an existing file
srcs - array of paths which are source (several files can be uploaded at once as an array)
dst - destination path

fileSystem.copyFromLocalFile(src, dst);
fileSystem.copyFromLocalFile(delSrc, src, dst);
fileSystem.copyFromLocalFile(delSrc, overwrite, src, dst);
fileSystem.copyFromLocalFile(delSrc, overwrite, srcs, dst);
 
 
4. Rename an HDFS file
 
fileSystem.rename(src, dst);
 
5. Delete a file

true means delete recursively
fileSystem.delete(new Path("/d1"), true);
 
6. View directory and file information
 
FileStatus[] fs = fileSystem.listStatus(new Path("/"));
for (FileStatus f : fs) {
    String dir = f.isDirectory() ? "directory" : "file";
    String name = f.getPath().getName();
    String path = f.getPath().toString();
    System.out.println(dir + "----" + name + "  path:" + path);
    System.out.println(f.getAccessTime());
    System.out.println(f.getBlockSize());
    System.out.println(f.getGroup());
    System.out.println(f.getLen());
    System.out.println(f.getModificationTime());
    System.out.println(f.getOwner());
    System.out.println(f.getPermission());
    System.out.println(f.getReplication());
    // getSymlink() throws an IOException for paths that are not symbolic links
    if (f.isSymlink()) {
        System.out.println(f.getSymlink());
    }
}
 
 
7. Find where a file is located in the HDFS cluster
 
// Requires: import org.apache.hadoop.fs.BlockLocation;
FileStatus fs = fileSystem.getFileStatus(new Path("/data"));
BlockLocation[] bls = fileSystem.getFileBlockLocations(fs, 0, fs.getLen());
for (int i = 0, h = bls.length; i < h; i++) {
    String[] hosts = bls[i].getHosts();
    // Print the first host that holds a replica of block i
    System.out.println("block_" + i + "_location:  " + hosts[0]);
}
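
How many block_i lines that loop prints depends on the file length and the block size. The plain-Java sketch below shows the arithmetic under a simplifying assumption: the file is cut into fixed-size blocks and only the last one may be shorter. BlockRangeSketch and blockRanges are hypothetical names, and the sizes in main are example values; on a real cluster they would come from FileStatus.getLen() and FileStatus.getBlockSize().

```java
import java.util.ArrayList;
import java.util.List;

public class BlockRangeSketch {

    // Returns {offset, length} pairs covering fileLen bytes in blocks of blockSize bytes.
    // Assumes fixed-size blocks where only the final block may be shorter.
    public static List<long[]> blockRanges(long fileLen, long blockSize) {
        if (fileLen < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLen must be >= 0 and blockSize > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileLen; offset += blockSize) {
            long length = Math.min(blockSize, fileLen - offset);
            ranges.add(new long[] {offset, length});
        }
        return ranges;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // example value only
        long fileLen = 300L * 1024 * 1024;   // example value only
        List<long[]> ranges = blockRanges(fileLen, blockSize);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("block_" + i + " offset=" + ranges.get(i)[0]
                    + " length=" + ranges.get(i)[1]);
        }
    }
}
```

With these example values the file spans three blocks: two full 128 MB blocks and a final 44 MB block.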
 
 
8. Get the host names of all DataNodes in the HDFS cluster
 
 
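// Requires: import org.apache.hadoop.hdfs.DistributedFileSystem;
//           import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
// The cast only succeeds when fs.defaultFS points at an HDFS cluster
DistributedFileSystem hdfs = (DistributedFileSystem) fileSystem;
DatanodeInfo[] dns = hdfs.getDataNodeStats();
for (int i = 0, h = dns.length; i < h; i++) {
    System.out.println("datanode_" + i + "_name:  " + dns[i].getHostName());
}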
 
 
 
 

--------------Key concepts for writing files to HDFS



http://blog.csdn.net/dyllove98/article/details/8592686




---------------core-site.xml  hdfs-site.xml



 
 -----------------------fs.default.name explained--------------------------
 
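 <!-- fs.default.name - A URI (protocol, host name, and port) that describes the NameNode of the cluster. Every machine in the cluster needs to know the NameNode's address.
 DataNodes first register with the NameNode so that their data can be used. Standalone client programs contact the NameNode through this URI to obtain the block lists of files.
 In Hadoop 2 and later this key is deprecated in favor of fs.defaultFS, the name used in the Java examples. -->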
 <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost:9000</value>
        </property>

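       <!-- hadoop.tmp.dir is a base setting that the Hadoop file system depends on; many other paths are derived from it. If hdfs-site.xml does not configure the NameNode and DataNode storage locations, they default to subdirectories of this path. -->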
  <property>
      <name>hadoop.tmp.dir</name>
       <value>/home/hdfs/tmp</value>
   </property>
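
The dependency on hadoop.tmp.dir comes from variable references in the default values. The plain-Java sketch below shows the idea with a single-pass ${name} substitution. It is only an illustration: TmpDirExpansionSketch and expand are hypothetical names, the two templates are assumed to be the Hadoop 1.x defaults for dfs.name.dir and dfs.data.dir, and Hadoop's own Configuration class performs this expansion itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TmpDirExpansionSketch {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} in value with props.get(name); unknown names are left as-is.
    // Single pass only: values that themselves contain ${...} are not expanded again.
    public static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("hadoop.tmp.dir", "/home/hdfs/tmp");
        // Assumed default templates (Hadoop 1.x style) for the NameNode and DataNode directories
        System.out.println(expand("${hadoop.tmp.dir}/dfs/name", props));
        System.out.println(expand("${hadoop.tmp.dir}/dfs/data", props));
    }
}
```

With hadoop.tmp.dir set as in the property above, the two lines printed are /home/hdfs/tmp/dfs/name and /home/hdfs/tmp/dfs/data, which is where the data would end up if the explicit directory settings were omitted.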
   
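 3.  Add the following to conf/hdfs-site.xml:
       <!-- dfs.replication - Determines how many copies of each file block the system keeps. For a real deployment it should be set to 3
       (there is no upper limit, but extra replicas may bring no benefit and take up more space). Fewer than three replicas may hurt data reliability (a system failure could cause data loss).
       The value 1 below is suitable only for a single-node test setup. -->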
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
 
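          <!-- dfs.data.dir - The local file system path where a DataNode stores its data. The path does not have to be identical on every DataNode,
          because each machine's environment may differ, but configuring the same path everywhere makes administration simpler. By default
          it lies under hadoop.tmp.dir, which is suitable only for testing because data there can easily be lost, so this value is best overridden.
dfs.name.dir - The local file system path where the NameNode stores the Hadoop file system metadata. This value matters only to the NameNode; DataNodes do not use it.
The warning above about /tmp-style paths applies here as well; in production it is best overridden. -->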
           <property>
             <name>dfs.name.dir</name>
             <value>/home/hdfs/name</value>
        </property>
       <property>
        <name>dfs.data.dir</name>
        <value>/home/hdfs/data</value>
   </property>
 
 
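 <!-- Works around: org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="tmp":root:supergroup:rwxr-xr-x.
When a job is submitted through the Hadoop plugin for Eclipse, it is written to HDFS as the user DrWho by default, into the HDFS directory /user/hadoop.
Because DrWho has no write permission on that directory, the exception is thrown. One fix is to open up the permissions of the hadoop directory
with the command: $ hadoop fs -chmod 777 /user/hadoop
Setting dfs.permissions to false, as below, turns permission checking off altogether and should be limited to test clusters. -->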
               <property> 
                   <name>dfs.permissions</name>
                   <value>false</value>
                   <description>
                      If "true", enable permission checking in HDFS. If "false", permission checking is turned off,
                      but all other behavior is unchanged. Switching from one parameter value to
                      the other does not change the mode, owner or group of files or directories.
                   </description>
 
        </property>
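
The exception message in the comment above can be read mechanically: the inode is owned by root, belongs to supergroup, and has mode rwxr-xr-x, so anyone who is neither root nor in supergroup falls into the "other" class, which has no w bit. The plain-Java sketch below models that check. It is a simplified illustration with hypothetical names (PermissionCheckSketch, allowed); it ignores the superuser bypass, ACLs, and parent-directory checks that a real HDFS permission check also involves.

```java
import java.util.Collections;
import java.util.Set;

public class PermissionCheckSketch {

    // Returns true if the user may perform access ('r', 'w', or 'x') on an inode with the
    // given owner, group, and 9-character mode string such as "rwxr-xr-x".
    // Simplified model: owner class first, then group class, otherwise "other".
    public static boolean allowed(String user, Set<String> userGroups,
                                  String owner, String group, String mode, char access) {
        int bit = "rwx".indexOf(access);
        if (bit < 0 || mode.length() != 9) {
            throw new IllegalArgumentException("access must be r, w, or x; mode must have 9 characters");
        }
        int base;
        if (user.equals(owner)) {
            base = 0; // owner bits
        } else if (userGroups.contains(group)) {
            base = 3; // group bits
        } else {
            base = 6; // other bits
        }
        return mode.charAt(base + bit) == access;
    }

    public static void main(String[] args) {
        Set<String> noGroups = Collections.emptySet();
        // The case from the exception message: Administrator wants WRITE on root:supergroup:rwxr-xr-x
        System.out.println(allowed("Administrator", noGroups, "root", "supergroup", "rwxr-xr-x", 'w'));
        // The owner has the w bit
        System.out.println(allowed("root", noGroups, "root", "supergroup", "rwxr-xr-x", 'w'));
        // After chmod 777 every class has the w bit
        System.out.println(allowed("Administrator", noGroups, "root", "supergroup", "rwxrwxrwx", 'w'));
    }
}
```

This prints false, true, true: the write is denied for Administrator under rwxr-xr-x and allowed once the mode is opened up, which is what the chmod 777 command achieves without switching permission checking off.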
 
   4.  Add the following to conf/mapred-site.xml:
<!-- mapred.job.tracker - The host (or IP) and port of the JobTracker. -->
      <property>
       <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
</property>
 
 
 