MapReduce Programming

hadoop_fly

Category: hadoop

Article link: https://blog.csdn.net/hadoop_fly/article/details/42290693

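I. MapReduce programming steps:

1. Read the input file from HDFS and parse it into key-value pairs. The key is the byte offset of each line and the value is the content of that line.

2. Override the map function to implement your own business logic, transforming the input key-value pairs into the output you need.

3. Partition, sort, group, and combine (a local reduce) the map output.

4. Copy the output of the map tasks, partition by partition, to the corresponding reduce nodes, and merge and sort the key-value pairs received there.

5. Override the reduce function to do the required processing.

6. Write the reduce output to HDFS.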
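
The six steps above can be traced in a small in-memory simulation. The sketch below is plain Java and does not use the Hadoop API: the class name WordCountSketch and its helper methods are made up for illustration, word counting is only an assumed example job, and the partition formula is the usual hash-modulo scheme. Sorting and grouping are modelled with a TreeMap per partition.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the MapReduce steps (illustrative, not Hadoop code).
public class WordCountSketch {

    // Step 2: map one (offset, line) record to a list of (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Step 3: pick a partition from the key's hash (hash modulo reducer count).
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Step 5: reduce all values of one key to their sum.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Drives steps 1 to 6 and returns one sorted result map per partition.
    public static List<TreeMap<String, Integer>> run(List<String> lines, int numReducers) {
        // One buffer per partition; TreeMap keeps keys sorted and groups values by key.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            shuffled.add(new TreeMap<>());
        }

        long offset = 0; // Step 1: the key of each record is the line's offset.
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                int p = partition(kv.getKey(), numReducers);
                shuffled.get(p)
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline, assuming single-byte characters
        }

        // Steps 5 and 6: reduce each group and collect the output per partition.
        List<TreeMap<String, Integer>> results = new ArrayList<>();
        for (TreeMap<String, List<Integer>> part : shuffled) {
            TreeMap<String, Integer> reduced = new TreeMap<>();
            part.forEach((k, vs) -> reduced.put(k, reduce(k, vs)));
            results.add(reduced);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello hadoop", "hello mapreduce"), 2));
    }
}
```

In a real job the framework performs the partitioning, sorting, and copying; only map and reduce (and optionally the partitioner and combiner) are written by the user.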

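II. When the input or output of a job uses a custom data type, that type must implement Hadoop's serialization interfaces: a key type must implement WritableComparable (which combines Writable and Comparable), and a value type must implement Writable.

Hadoop's built-in types correspond to Java types as follows:

LongWritable => long
IntWritable => int
Text => String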
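
To make the key/value requirement concrete, here is a minimal sketch of a custom key type. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface, WritableLike, that mirrors the two methods of Hadoop's Writable (write(DataOutput) and readFields(DataInput)); in a real job the class would implement org.apache.hadoop.io.WritableComparable instead. The names CustomKeySketch, WritableLike, and UserTimeKey, and the two fields, are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CustomKeySketch {

    // Local stand-in mirroring the two methods of Hadoop's Writable interface.
    public interface WritableLike {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical key: ordered by userId, then by timestamp.
    public static class UserTimeKey implements WritableLike, Comparable<UserTimeKey> {
        private long userId;
        private long timestamp;

        // No-arg constructor: an instance is created first, then filled by readFields.
        public UserTimeKey() {}

        public UserTimeKey(long userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(userId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            userId = in.readLong();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(UserTimeKey other) {
            int c = Long.compare(userId, other.userId);
            return c != 0 ? c : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof UserTimeKey)) {
                return false;
            }
            UserTimeKey k = (UserTimeKey) o;
            return userId == k.userId && timestamp == k.timestamp;
        }

        @Override
        public int hashCode() {
            return 31 * Long.hashCode(userId) + Long.hashCode(timestamp);
        }
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    public static UserTimeKey roundTrip(UserTimeKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            UserTimeKey copy = new UserTimeKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UserTimeKey key = new UserTimeKey(42L, 1000L);
        System.out.println(roundTrip(key).equals(key));
    }
}
```

A key needs compareTo because keys are sorted during the shuffle; equals and hashCode matter because partitioning is usually based on the key's hash. A value type only needs the two serialization methods.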

III. Distributed file system

1. Reading data from a Hadoop URL

InputStream in = null;
try {
    in = new URL("hdfs://local/file/").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

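2. Printing a file from Hadoop to standard output using a URLStreamHandler

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// URLCat is an illustrative class name.
public class URLCat {
    static {
        // Teach java.net.URL the hdfs:// scheme. This can be set only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws MalformedURLException, IOException {
        InputStream in = null;
        try {
            // args[0] is the URL of the file, for example hdfs://host/path
            in = new URL(args[0]).openStream();
            // Copy with a 4096-byte buffer; false leaves the streams open after the copy.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}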

There are several ways to obtain a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user)
throws IOException

To obtain a local file system instance:

public static LocalFileSystem getLocal(Configuration conf) throws IOException

Once you have a FileSystem instance, call one of its open methods to get an input stream:

public FSDataInputStream open(Path f) throws IOException // uses a default buffer size of 4 KB
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

3. Printing a file from Hadoop to standard output using FileSystem directly

String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}

FSDataInputStream

FileSystem.open() returns an FSDataInputStream. FSDataInputStream supports random access, so you can read from any part of the stream.

public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable, Closeable,
ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead {}

public interface Seekable {
/**
* Seek to the given offset from the start of the file.
* The next read() will be from that location. Can't
* seek past the end of the file.
*/
void seek(long pos) throws IOException;

/**
* Return the current offset from the start of the file
*/
long getPos() throws IOException;

/**
* Seeks a different copy of the data. Returns true if
* found a new source, false otherwise.
*/
@InterfaceAudience.Private
boolean seekToNewSource(long targetPos) throws IOException;
}

4. Printing a file twice with FileSystem by using seek

String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
    in.seek(0); // go back to the start of the file
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}

FSDataInputStream also implements the PositionedReadable interface:

public interface PositionedReadable {
/**
* Read upto the specified number of bytes, from a given
* position within a file, and return the number of bytes read. This does not
* change the current offset of a file, and is thread-safe.
*/
public int read(long position, byte[] buffer, int offset, int length)
throws IOException;

/**
* Read the specified number of bytes, from a given
* position within a file. This does not
* change the current offset of a file, and is thread-safe.
*/
public void readFully(long position, byte[] buffer, int offset, int length)
throws IOException;

/**
* Read number of bytes equal to the length of the buffer, from a given
* position within a file. This does not
* change the current offset of a file, and is thread-safe.
*/
public void readFully(long position, byte[] buffer) throws IOException;
}

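The read() method reads up to length bytes from the given position in the file into buffer, starting at the given offset in the buffer. The return value is the number of bytes actually read; callers should check it, because it may be less than length. The readFully() methods read exactly length bytes into the buffer (or buffer.length bytes for the version that takes only a byte array); if the end of the file is reached first, an EOFException is thrown.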
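
The difference between seek-based reads and positioned reads can be tried locally with the JDK alone. The sketch below is an analogy, not Hadoop code: it uses java.nio's FileChannel, whose position(long) plays the role of seek(), position() the role of getPos(), and read(ByteBuffer, long) the role of a positioned read that leaves the current offset untouched. The class PositionedReadSketch, its methods, and the temporary file contents are assumptions made for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// JDK-only analogy for Seekable and PositionedReadable (illustrative, not Hadoop code).
public class PositionedReadSketch {

    // Like PositionedReadable.read: one attempt, may return fewer bytes than length, or -1 at EOF.
    public static int read(FileChannel ch, long position, byte[] buffer, int offset, int length)
            throws IOException {
        return ch.read(ByteBuffer.wrap(buffer, offset, length), position);
    }

    // Like PositionedReadable.readFully: loops until length bytes are read or EOF is hit.
    public static void readFully(FileChannel ch, long position, byte[] buffer, int offset, int length)
            throws IOException {
        int done = 0;
        while (done < length) {
            int n = read(ch, position + done, buffer, offset + done, length - done);
            if (n < 0) {
                throw new EOFException("end of file reached before " + length + " bytes were read");
            }
            done += n;
        }
    }

    // Seeks to offset 6, does a positioned read at offset 0, and reports
    // "<bytes read>|<current offset afterwards>".
    public static String demo() {
        try {
            Path file = Files.createTempFile("positioned", ".txt");
            try {
                Files.write(file, "hello hadoop".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    ch.position(6); // analogous to seek(6)
                    byte[] buf = new byte[5];
                    readFully(ch, 0, buf, 0, buf.length); // positioned read at offset 0
                    return new String(buf, StandardCharsets.US_ASCII) + "|" + ch.position();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because a positioned read does not move the current offset, several threads can share one stream and read different regions without coordinating their seeks.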

5. Writing data

public FSDataOutputStream create(Path f) throws IOException

public class FSDataOutputStream extends DataOutputStream
implements Syncable, CanSetDropBehind {}

6. Copying a local file to HDFS

String localf = args[0];
String des = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localf));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(des), conf);
// Create the file at the destination path; progress() is called back as data is written.
OutputStream out = fs.create(new Path(des), new Progressable() {
    @Override
    public void progress() {
        System.out.println("...");
    }
});
// Copy with a 4096-byte buffer; true closes both streams when the copy finishes.
IOUtils.copyBytes(in, out, 4096, true);

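7. Anatomy of a file read

When a client reads a file, it calls open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem. DistributedFileSystem calls the namenode over RPC to get the locations of the blocks (the datanodes that hold them) and their sizes. It then returns an FSDataInputStream to the client for reading data. FSDataInputStream extends DataInputStream and wraps a DFSInputStream, which manages the I/O with the datanodes and the namenode. The client then calls read() on the stream:

public int read(long position, byte[] buffer, int offset, int length)
throws IOException {
return ((PositionedReadable)in).read(position, buffer, offset, length);
}

DFSInputStream, which has stored the datanode locations for the first few blocks of the file, connects to the closest datanode that holds the first block. Data is streamed from that datanode back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream closes the connection to that datanode and connects to the datanode holding the next block. As reading proceeds, it asks the namenode for the datanode locations of further blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.

If DFSInputStream encounters an error while communicating with a datanode during a read, it tries the next datanode that holds the block and remembers the failed datanode so that it does not connect to it again. DFSInputStream also verifies the data it receives. If it finds a corrupted block, it reports the block to the namenode before it tries to read a replica from another datanode.

A key strength of this design is that the client exchanges data directly with the datanodes, while the namenode only guides the client to the most suitable datanode for each block.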
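
The failover behaviour described above can be modelled in a few lines. The following is a toy model, not the DFSInputStream implementation: the class ReplicaReadSketch, the datanode names, and the isHealthy predicate (a stand-in for "connecting and reading succeeded") are all hypothetical, and the replica list is simply assumed to arrive sorted closest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Toy model of replica selection with failover during a read (not Hadoop code).
public class ReplicaReadSketch {

    // Datanodes that failed earlier; they are skipped for later blocks.
    private final Set<String> deadNodes = new HashSet<>();
    // Every datanode that was actually contacted, in order (kept for inspection).
    private final List<String> attempts = new ArrayList<>();

    // Reads one block and returns the datanode that served it.
    // replicas lists the datanodes holding the block, closest first.
    public String readBlock(List<String> replicas, Predicate<String> isHealthy) {
        for (String node : replicas) {
            if (deadNodes.contains(node)) {
                continue; // remembered as failed, do not connect again
            }
            attempts.add(node);
            if (isHealthy.test(node)) {
                return node;
            }
            deadNodes.add(node); // remember the failure and try the next replica
        }
        throw new IllegalStateException("no live replica for this block");
    }

    public List<String> attempts() {
        return new ArrayList<>(attempts);
    }

    public static void main(String[] args) {
        ReplicaReadSketch reader = new ReplicaReadSketch();
        Predicate<String> healthy = node -> !node.equals("dn1"); // assume dn1 is down
        System.out.println(reader.readBlock(Arrays.asList("dn1", "dn2", "dn3"), healthy));
        System.out.println(reader.readBlock(Arrays.asList("dn1", "dn3"), healthy));
        System.out.println(reader.attempts());
    }
}
```

In this model the second block never contacts dn1, because the failure seen while reading the first block was remembered.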

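8. Anatomy of a file write

When a client writes data to HDFS, it first creates the file by calling create(), which is actually the create() method of DistributedFileSystem. DistributedFileSystem makes an RPC call to the namenode to create a new file in the file system namespace; at this point no blocks are associated with it. The namenode first checks that the file does not already exist and that the client has permission to create it. If both checks pass, the file is created; otherwise creation fails and an exception is thrown.

DistributedFileSystem returns an FSDataOutputStream to the client, and the client starts writing. FSDataOutputStream extends DataOutputStream and wraps a DFSOutputStream. As the client writes, DFSOutputStream splits the data into packets and places them on an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to choose suitable datanodes to store the block. The chosen datanodes form a pipeline; with the replication level set to three, the pipeline has three nodes. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the next datanode. DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by the datanodes, called the ack queue. A packet is removed from the ack queue only when every datanode in the pipeline has acknowledged it.

If a datanode fails while the client is writing to it, the pipeline is first closed, and the packets in the ack queue are added to the front of the data queue. The current block on the healthy datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode is deleted if that datanode recovers later. When the client has finished writing data, it calls close().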
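
The interplay of the data queue and the ack queue can be sketched as follows. This is a simplified single-threaded model, not the DFSOutputStream code: the class WriteQueuesSketch and its method names are hypothetical, packets are represented by integers, and acknowledgements are assumed to arrive in send order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of the data queue and ack queue used during a write (not Hadoop code).
public class WriteQueuesSketch {

    private final Deque<Integer> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    private final Deque<Integer> ackQueue = new ArrayDeque<>();  // packets sent but not yet acknowledged

    // Client side: a new packet is appended to the data queue.
    public void enqueue(int packet) {
        dataQueue.addLast(packet);
    }

    // DataStreamer role: send the next packet down the pipeline and wait for its ack.
    public void sendNext() {
        Integer packet = dataQueue.pollFirst();
        if (packet != null) {
            ackQueue.addLast(packet);
        }
    }

    // All datanodes in the pipeline acknowledged the oldest outstanding packet.
    public void ackOldest() {
        ackQueue.pollFirst();
    }

    // Pipeline failure: unacknowledged packets go back to the FRONT of the data queue,
    // in their original order, so they are resent before anything newer.
    public void onPipelineFailure() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.pollLast());
        }
    }

    public List<Integer> pendingData() {
        return new ArrayList<>(dataQueue);
    }

    public List<Integer> pendingAcks() {
        return new ArrayList<>(ackQueue);
    }

    public static void main(String[] args) {
        WriteQueuesSketch q = new WriteQueuesSketch();
        for (int p = 1; p <= 5; p++) {
            q.enqueue(p);
        }
        q.sendNext();
        q.sendNext();
        q.sendNext();
        q.ackOldest();
        q.onPipelineFailure();
        System.out.println(q.pendingData() + " " + q.pendingAcks());
    }
}
```

Putting the unacknowledged packets at the front rather than the back keeps the resend order identical to the original send order, so no packet is lost or reordered after the pipeline is rebuilt.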