I. Overview and Design Goals
A distributed file system stores each file in multiple replicas, so that when one node goes down the file can still be read from a replica on another node, improving reliability. That is the traditional design, but it has drawbacks:
1) A file, no matter how large, is stored whole on a single node, so data processing is hard to parallelize, that node becomes a network bottleneck, and large-scale data processing is impractical;
2) The storage load is unbalanced, and per-node utilization is low.
What is HDFS?
Hadoop implements a distributed file system, the Hadoop Distributed File System, HDFS for short.
It originated from Google's GFS paper.
Design goals of HDFS:
A very large distributed file system
Runs on ordinary, inexpensive hardware
Easy to scale
Architecture:
A file is split into multiple blocks
blocksize: 128 MB (the 2.x default)
A 130 MB file ==> 2 blocks: 128 MB and 2 MB
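To make the split arithmetic concrete, a minimal standalone sketch (plain Java; the sizes are the illustrative values above):

public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                  // 128 MB, the HDFS 2.x default
        long fileSize = 130L * 1024 * 1024;                   // the 130 MB file from the example
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division ==> 2 blocks
        long tail = fileSize % blockSize;                     // 2 MB; the last block only occupies what it needs
        System.out.println(blocks + " blocks, last block = "
                + (tail == 0 ? blockSize : tail) + " bytes");
    }
}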
NN (NameNode):
1) Responds to client requests
2) Manages the file system metadata
DN (DataNode):
1) Stores the data blocks (Blocks) that make up users' files
2) Periodically sends heartbeats to the NN, reporting itself, all of its block information, and its health
A typical deployment runs a single NameNode, with every other machine in the cluster running a DataNode.
For real production environments, it is recommended to deploy the NameNode and DataNodes on separate nodes.
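To see this division of labor from the client side, here is a hedged sketch using FileSystem#getFileBlockLocations, a metadata-only call answered by the NameNode (the address, user, and file path are placeholders matching the setup below):

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.56.102:8020"),
                new Configuration(), "root");
        FileStatus status = fs.getFileStatus(new Path("/hdfsapi/test/a.txt")); // placeholder path
        // Each entry is one block; getHosts() names the DataNodes holding a replica of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}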
II. Single-Machine Pseudo-Distributed Cluster Setup
Environment: CentOS 7
1. Install the JDK
(omitted)
2. Install SSH and set up passwordless login
sudo yum install openssh-server openssh-clients
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
3. Install Hadoop
1) Download from the official site; I chose the third-party commercial CDH build, hadoop-2.6.0-cdh5.7.0.
2) Extract: tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C ~/app/
4. Edit the configuration files
etc/hadoop/core-site.xml:
<configuration>
    <property>
        <!-- Address (HDFS URI) the NameNode listens on -->
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.56.102:8020</value>
    </property>
    <property>
        <!-- Base directory for HDFS data; the default lives under /tmp, which is wiped on reboot -->
        <name>hadoop.tmp.dir</name>
        <value>/root/app/tmp</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <!-- Only one DataNode in a pseudo-distributed cluster, so keep a single replica -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
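These two files configure any client or daemon that has them on its classpath. The same keys can also be set programmatically on a Configuration object; a minimal sketch (values repeated from above, for illustration only):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.56.102:8020"); // where the NameNode listens
        conf.set("dfs.replication", "1");                       // one replica: there is only one DataNode
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.56.102:8020"), conf, "root");
        System.out.println("connected to " + fs.getUri());
        fs.close();
    }
}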
5. Start HDFS
Format the file system (only needed the first time):
cd bin
./hadoop namenode -format
cd sbin
./start-dfs.sh
Verify that it is running:
jps
Jps
SecondaryNameNode
DataNode
NameNode
Or check in a browser: http://192.168.56.102:50070 (the NameNode web UI)
6. Stop HDFS
cd sbin
./stop-dfs.sh
III. Operating HDFS Files with the Java API
Create a Java project with IDEA + Maven
Add the HDFS-related dependencies in pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.imooc.hadoop</groupId>
    <artifactId>hadoop-train</artifactId>
    <version>1.0</version>
    <name>hadoop-train</name>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    </properties>

    <!-- CDH artifacts are not in Maven Central; add Cloudera's repository -->
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
            <plugins>
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>3.0.0</version>
                </plugin>
                <!-- see http://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.7.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.20.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-install-plugin</artifactId>
                    <version>2.5.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-deploy-plugin</artifactId>
                    <version>2.8.2</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>
HDFSApp.java
package com.cracker.hadoop.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

/**
 * Hadoop HDFS Java API operations
 */
public class HDFSApp {

    public static final String HDFS_PATH = "hdfs://192.168.56.102:8020";

    FileSystem fileSystem = null;
    Configuration configuration = null;
    /**
     * Create a directory on HDFS
     * @throws Exception
     */
    @Test
    public void mkdir() throws Exception {
        fileSystem.mkdirs(new Path("/hdfsapi/test"));
    }
    /**
     * Create a file and write to it
     */
    @Test
    public void create() throws Exception {
        FSDataOutputStream output = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
        output.write("hello hadoop".getBytes());
        output.flush();
        output.close();
    }
    /**
     * Print the contents of an HDFS file
     */
    @Test
    public void cat() throws Exception {
        FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
        IOUtils.copyBytes(in, System.out, 1024);
        in.close();
    }
    /**
     * Rename a file
     */
    @Test
    public void rename() throws Exception {
        Path oldPath = new Path("/hdfsapi/test/a.txt");
        Path newPath = new Path("/hdfsapi/test/b.txt");
        fileSystem.rename(oldPath, newPath);
    }
    /**
     * Upload a local file to HDFS
     */
    @Test
    public void copyFromLocalFile() throws Exception {
        Path localPath = new Path("/Users/chen/Downloads/hello2.txt");
        Path hdfsPath = new Path("/hdfsapi/test");
        fileSystem.copyFromLocalFile(localPath, hdfsPath);
    }
    /**
     * Upload a (large) local file to HDFS, printing progress
     */
    @Test
    public void copyFromLocalFileWithProgress() throws Exception {
        InputStream in = new BufferedInputStream(
                new FileInputStream(
                        new File("/Users/chen/Downloads/hive.tar.gz")));
        FSDataOutputStream output = fileSystem.create(new Path("/hdfsapi/test/hive1.0.tar.gz"),
                new Progressable() {
                    public void progress() {
                        System.out.print("."); // one dot per progress callback
                    }
                });
        IOUtils.copyBytes(in, output, 4096);
        in.close();
        output.close();
    }
    /**
     * Download a file from HDFS to the local file system
     */
    @Test
    public void copyToLocalFile() throws Exception {
        Path localPath = new Path("/Users/chen/Downloads/h.txt");
        Path hdfsPath = new Path("/hdfsapi/test/b.txt");
        fileSystem.copyToLocalFile(hdfsPath, localPath);
    }
    /**
     * List all files in a directory
     */
    @Test
    public void listFiles() throws Exception {
        FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/hdfsapi/test"));
        for (FileStatus fileStatus : fileStatuses) {
            String isDir = fileStatus.isDirectory() ? "directory" : "file";
            short replication = fileStatus.getReplication();
            long len = fileStatus.getLen();
            String path = fileStatus.getPath().toString();
            System.out.println(isDir + "\t" + replication + "\t" + len + "\t" + path);
        }
    }
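    // A hedged companion sketch: listStatus does not recurse, but FileSystem#listFiles(path, true)
    // walks subdirectories and returns files only, via a RemoteIterator so very large
    // directories are not loaded into memory all at once.
    /**
     * Recursively list all files under a directory
     */
    @Test
    public void listFilesRecursive() throws Exception {
        RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(new Path("/hdfsapi/test"), true);
        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println(file.getPath() + "\t" + file.getLen());
        }
    }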
    /**
     * Delete a path (recursive = true also removes non-empty directories)
     */
    @Test
    public void delete() throws Exception {
        fileSystem.delete(new Path("/hdfsapi/test"), true);
    }
    @Before
    public void setUp() throws Exception {
        System.out.println("HDFSApp.setUp");
        configuration = new Configuration();
        // Connect as user "root"; otherwise the local OS user is used for permission checks
        fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "root");
    }

    @After
    public void tearDown() throws Exception {
        fileSystem.close();
        fileSystem = null;
        configuration = null;
        System.out.println("HDFSApp.tearDown");
    }
}
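Each @Test method can be run on its own from IDEA, or through the Surefire plugin declared in the pom, for example mvn -Dtest=HDFSApp#mkdir test (the Class#method syntax is supported by the Surefire version pinned above). Keep in mind that delete() removes the /hdfsapi/test directory the other tests work against.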