Hadoop Platform and Application Framework
by University of California, San Diego
Common: libraries and utilities
YARN: a resource-management and scheduling platform that enhances the power of a Hadoop compute cluster.
MapReduce: a programming model for large-scale data processing.
HDFS: Hadoop Distributed File System
YARN, Tez, and Spark: all are frameworks
YARN: essentially the basic execution engine in the next generation of Hadoop
HBase and other apps: work through YARN
1. Introduction to HDFS:
HDFS Design Concept:
• Scalable distributed filesystem
• Distribute data on local disks on several nodes
• Low cost commodity hardware
HDFS Design Factors:
• Hundreds/thousands of nodes => need to handle node/disk failures
• Portability across heterogeneous hardware/software
• Handle large data sets
• High throughput
Approach to meet HDFS design goals:
• Simplified coherency model – write once read many.
• Data Replication – helps handle hardware failures
• Move computation close to data
• Relax POSIX requirements – increase throughput
2. HDFS Architecture and Configuration:
Summary of HDFS Architecture
• Single NameNode - a master server that manages the file system namespace and regulates access to files by clients.
• Multiple DataNodes – typically one per node in the cluster. Functions:
  • Manage storage
  • Serve read/write requests from clients
  • Create, delete, and replicate blocks based on instructions from the NameNode
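The division of labor above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `ToyNameNode`, its methods, and the round-robin placement are hypothetical and are not Hadoop's API (real HDFS uses a rack-aware placement policy). It shows the NameNode holding only metadata (which DataNodes hold each block) and instructing re-replication when a DataNode fails.

```python
class ToyNameNode:
    """Hypothetical stand-in for the NameNode: stores metadata only, never file data."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)   # names of live DataNodes
        self.replication = replication     # target number of copies per block
        self.block_map = {}                # (filename, block index) -> DataNode names
        self._cursor = 0                   # round-robin position (a simplification)

    def add_file(self, filename, num_blocks):
        """Record where each block of a new file is stored."""
        n = len(self.datanodes)
        copies = min(self.replication, n)
        for idx in range(num_blocks):
            replicas = [self.datanodes[(self._cursor + k) % n] for k in range(copies)]
            self.block_map[(filename, idx)] = replicas
            self._cursor = (self._cursor + 1) % n

    def locate(self, filename, idx):
        """Tell a client which DataNodes to contact for a given block."""
        return self.block_map[(filename, idx)]

    def handle_failure(self, dead):
        """Drop a failed DataNode and re-replicate the blocks it held."""
        self.datanodes.remove(dead)
        for replicas in self.block_map.values():
            if dead in replicas:
                replicas.remove(dead)
                candidates = [d for d in self.datanodes if d not in replicas]
                if candidates:
                    replicas.append(candidates[0])


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("log.txt", 2)
print(nn.locate("log.txt", 0))
nn.handle_failure("dn2")
print(nn.locate("log.txt", 0))
```

In this sketch the client asks the NameNode only for locations; the actual block reads and writes would go to the DataNodes directly.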
Performance Envelope of HDFS:
• Able to determine number of blocks for a given file size
• Key HDFS and system components impacted by block size
• Impact of small files on HDFS and system
Default block size is 64 MB (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB).
Example: 10 GB = 10 x 1024 MB, so number of blocks = 10 x 1024 / 64 = 160 blocks.
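The block arithmetic, and the small-files effect listed above, can be checked with a short sketch. The helper names `num_blocks` and `namenode_objects` are hypothetical, and counting one metadata object per file plus one per block is a simplifying assumption used only to compare the two cases.

```python
MB = 1024 * 1024
GB = 1024 * MB


def num_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Blocks needed for one file; the last block may be only partly filled."""
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def namenode_objects(file_sizes, block_size_bytes=64 * MB):
    """Assumed metadata load: one entry per file plus one entry per block."""
    return sum(1 + num_blocks(size, block_size_bytes) for size in file_sizes)


one_large_file = [10 * GB]
many_small_files = [1 * MB] * (10 * 1024)   # the same 10 GB as 10,240 files of 1 MB

print(num_blocks(10 * GB))                  # 160
print(num_blocks(10 * GB, 128 * MB))        # 80
print(namenode_objects(one_large_file))     # 161
print(namenode_objects(many_small_files))   # 20480
```

The same 10 GB stored as 1 MB files needs 10,240 blocks instead of 160, since every file occupies at least one block entry, which is why many small files put pressure on the NameNode.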
3. Read/Write Process in HDFS: