2021SC@SDUSC
1.FileSystem.open()
Similar to reading a local file with Java IO, reading an HDFS file starts by creating an input stream on the file; in Hadoop the FileSystem.open() method is used to create that input stream.
public static void readFile(String filePath) throws IOException {
  // getFileSystem() is a helper (not shown here) that returns a FileSystem for the given path
  FileSystem fs = getFileSystem(filePath);
  InputStream in = null;
  try {
    // open() returns an FSDataInputStream positioned at the start of the file
    in = fs.open(new Path(filePath));
    // Copy the stream to stdout with a 4 KB buffer; do not close the streams here
    IOUtils.copyBytes(in, System.out, 4096, false);
  } catch (Exception e) {
    System.out.println(e.getMessage());
  } finally {
    IOUtils.closeStream(in);
  }
}
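readFile() relies on a getFileSystem() helper that is not part of this excerpt. A minimal sketch of what such a helper could look like (the name and body here are assumptions for illustration, not Hadoop API):
// Hypothetical helper assumed by readFile(): build a FileSystem from the URI of the given path
public static FileSystem getFileSystem(String filePath) throws IOException {
  Configuration conf = new Configuration();
  return FileSystem.get(URI.create(filePath), conf);
}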
Creating the FileSystem:
public static void main(String[] args) throws Exception {
  String local = "D:\\word2.txt";
  String dest = "hdfs://192.168.80.131:9000/user/root/input/word2.txt";
  Configuration cfg = new Configuration();
  // Obtain a FileSystem for the HDFS URI, acting as user "root"
  FileSystem fs = FileSystem.get(URI.create(dest), cfg, "root");
  // Upload the local file to HDFS
  fs.copyFromLocalFile(new Path(local), new Path(dest));
  fs.close();
}
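The three-argument FileSystem.get(URI, Configuration, String) connects as the given user ("root" here), which avoids permission errors when the local OS user differs from the HDFS user. Once main() has uploaded the file, it can be read back with the readFile() method shown above, for example:
// Read back the file that main() just uploaded and print it to stdout
readFile("hdfs://192.168.80.131:9000/user/root/input/word2.txt");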
Entering the open() method:
This method returns an FSDataInputStream object.
public FSDataInputStream open(Path f) throws IOException {
  // Use io.file.buffer.size from the configuration, falling back to 4096 bytes
  return open(f, getConf().getInt("io.file.buffer.size", 4096));
}
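The 4096-byte value is only a fallback; the buffer size actually used comes from io.file.buffer.size in the client configuration and can be overridden, for example (the 8192 value here is just an illustration):
Configuration conf = new Configuration();
// Override the default 4 KB buffer used by open(Path)
conf.setInt("io.file.buffer.size", 8192);
FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.80.131:9000/"), conf);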
Entering open(Path f, int bufferSize):
This is an abstract method; for HDFS it is implemented by DistributedFileSystem, as sketched after the declaration below.
public abstract FSDataInputStream open(Path f, int bufferSize)
throws IOException;
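For HDFS the concrete implementation is DistributedFileSystem. In Hadoop 2.x its override looks roughly like the following sketch (details vary slightly between versions; FileSystemLinkResolver handles symlink resolution), and this is where the call is delegated to the DFSClient held in the dfs field:
@Override
public FSDataInputStream open(Path f, final int bufferSize)
    throws IOException {
  statistics.incrementReadOps(1);
  Path absF = fixRelativePart(f);
  return new FileSystemLinkResolver<FSDataInputStream>() {
    @Override
    public FSDataInputStream doCall(final Path p)
        throws IOException, UnresolvedLinkException {
      // Delegate to DFSClient.open() and wrap the returned DFSInputStream
      final DFSInputStream dfsis =
          dfs.open(getPathName(p), bufferSize, verifyChecksum);
      return dfs.createWrappedInputStream(dfsis);
    }
    @Override
    public FSDataInputStream next(final FileSystem fs, final Path p)
        throws IOException {
      return fs.open(p, bufferSize);
    }
  }.call(f);
}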
Entering DFSClient.open(String src, int buffersize, boolean verifyChecksum), which the DistributedFileSystem override reaches through its dfs field:
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
    throws IOException, UnresolvedLinkException {
  checkOpen();
  // Get block info from namenode
  TraceScope scope = getPathTraceScope("newDFSInputStream", src);
  try {
    return new DFSInputStream(this, src, verifyChecksum);
  } finally {
    scope.close();
  }
}
Entering the DFSInputStream constructor:
The constructor calls openInfo(), a thread-safe method whose job is to fetch, from the namenode, the block information of the file being opened.
public class DFSInputStream extends FSInputStream {
  ...
  DFSInputStream(DFSClient dfsClient, String src, boolean verifyChecksum
      ) throws IOException, UnresolvedLinkException {
    this.dfsClient = dfsClient;
    this.verifyChecksum = verifyChecksum;
    this.src = src;
    synchronized (infoLock) {
      this.cachingStrategy = dfsClient.getDefaultReadCachingStrategy();
    }
    // Fetch the file's block locations from the namenode
    openInfo();
  }
Entering openInfo():
If fetching the block information fails (the length of the last block comes back as -1), the method retries up to 3 more times. The actual fetching is done by fetchLocatedBlocksAndGetLastBlockLength().
void openInfo() throws IOException, UnresolvedLinkException {
  synchronized(infoLock) {
    lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
    int retriesForLastBlockLength = dfsClient.getConf().retryTimesForGetLastBlockLength;
    while (retriesForLastBlockLength > 0) {
      // Getting last block length as -1 is a special case. When cluster
      // restarts, DNs may not report immediately. At this time partial block
      // locations will not be available with NN for getting the length. Lets
      // retry for 3 times to get the length.
      if (lastBlockBeingWrittenLength == -1) {
        DFSClient.LOG.warn("Last block locations not available. "
            + "Datanodes might not have reported blocks completely."
            + " Will retry for " + retriesForLastBlockLength + " times");
        waitFor(dfsClient.getConf().retryIntervalForGetLastBlockLength);
        lastBlockBeingWrittenLength = fetchLocatedBlocksAndGetLastBlockLength();
      } else {
        break;
      }
      retriesForLastBlockLength--;
    }
    if (retriesForLastBlockLength == 0) {
      throw new IOException("Could not obtain the last block locations.");
    }
  }
}
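fetchLocatedBlocksAndGetLastBlockLength() itself is not quoted above. In outline (a simplified sketch; details differ between versions), it asks the namenode for the file's block list via dfsClient.getLocatedBlocks(src, 0), caches it in locatedBlocks, and determines the length of a last block that is still being written, returning -1 when no datanode has reported that block yet:
// Simplified sketch of DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength()
private long fetchLocatedBlocksAndGetLastBlockLength() throws IOException {
  // Ask the namenode (via DFSClient) for the block locations of the file
  final LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0);
  if (newInfo == null) {
    throw new IOException("Cannot open filename " + src);
  }
  locatedBlocks = newInfo;

  // If the last block is still under construction, its length must be
  // obtained from the datanodes rather than from the namenode.
  long lastBlockBeingWrittenLength = 0;
  if (!locatedBlocks.isLastBlockComplete()) {
    final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
    if (last != null) {
      if (last.getLocations().length == 0) {
        // No datanode has reported the block yet: return -1 so that
        // openInfo() retries (the case discussed in the comment above)
        return last.getBlockSize() == 0 ? 0 : -1;
      }
      lastBlockBeingWrittenLength = readBlockLength(last);
    }
  }
  return lastBlockBeingWrittenLength;
}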
Entering getLocatedBlocks(String src, long start), which fetchLocatedBlocksAndGetLastBlockLength() reaches via dfsClient.getLocatedBlocks(src, 0):
public LocatedBlocks getLocatedBlocks(String src, long start)
    throws IOException {
  // Delegate to the three-argument overload, using the configured prefetch size as the length
  return getLocatedBlocks(src, start, dfsClientConf.prefetchSize);
}
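The three-argument overload is the one that finally calls callGetBlockLocations(); roughly (as in Hadoop 2.7, where a trace scope wraps the call, much like in DFSClient.open() above):
public LocatedBlocks getLocatedBlocks(String src, long start, long length)
    throws IOException {
  TraceScope scope = getPathTraceScope("getBlockLocations", src);
  try {
    return callGetBlockLocations(namenode, src, start, length);
  } finally {
    scope.close();
  }
}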
Entering callGetBlockLocations(ClientProtocol namenode, String src, long start, long length):
static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length)
    throws IOException {
  try {
    // Remote call on the namenode proxy object
    return namenode.getBlockLocations(src, start, length);
  } catch (RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
        FileNotFoundException.class,
        UnresolvedPathException.class);
  }
}
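Here namenode is a ClientProtocol proxy, so getBlockLocations() is executed as an RPC on the NameNode. The corresponding method is declared in the ClientProtocol interface roughly as follows:
// From the ClientProtocol interface (client <-> namenode RPC protocol)
@Idempotent
public LocatedBlocks getBlockLocations(String src, long offset, long length)
    throws AccessControlException, FileNotFoundException,
           UnresolvedLinkException, IOException;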