Hadoop First Touch

Yarn

Elements

  • Resource Manager: manages the use of resources across the cluster.
  • Node Manager: runs on every node in the cluster to launch and monitor containers.
    • Container: executes an application-specific process with a constrained set of resources (CPU, memory, and so on). A container may be a Unix process or a Linux cgroup.

Application

Application Startup

  • The client contacts the Resource Manager and asks it to start the Application Master.
  • The Resource Manager finds a Node Manager to launch the Application Master.
  • The Application Master determines what to do next, asking the Resource Manager for resources to run processes when needed.
    • Communication between the client, the master, and the processes has to be handled by the application itself.
  • Resource requests can be made at any time.
    • Up front - Spark.
    • Dynamically, as the need arises.
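
The same startup flow can be driven programmatically through the YARN client API. Below is a minimal, hedged sketch (not from the original notes): the application name, the AM launch command my-app-master.sh, and the resource sizes are made-up placeholders, and real applications would also set up local resources and environment for the AM container.

// Sketch: ask the ResourceManager to start an ApplicationMaster.
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitApp {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // 1. Client contacts the ResourceManager for a new application.
    YarnClientApplication app = yarn.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("first-touch-demo");          // placeholder name

    // 2. Describe the container that will run the ApplicationMaster;
    //    the RM picks a NodeManager to launch it.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("./my-app-master.sh"), // hypothetical AM command
        null, null, null);
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

    // 3. Submit; from here the AM negotiates further containers itself.
    yarn.submitApplication(ctx);
    yarn.stop();
  }
}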

Application Lifespan

  • One application per job - MapReduce;
  • One application per workflow containing multiple jobs - Spark;
  • One long-running application shared by different users, where the application acts as a coordinator - for example, the Impala daemon.

Scheduler

  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler

Hadoop I/O

Data Integrity

  • CRC-32 (32-bit cyclic redundancy check) for error detection - ChecksumFileSystem;
  • CRC-32C is an enhancement of CRC-32 and is the variant HDFS uses;
    • dfs.bytes-per-checksum (default = 512 bytes, with a 4-byte checksum), i.e. every 512 bytes of raw data are covered by a 4-byte checksum;
    • Write
      • Client -> raw data -> (raw + checksum) data -> sent to the 1st Data Node
      • 1st Data Node -> (raw + checksum) data -> 2nd Data Node
      • Each Data Node -> verifies the raw data using the checksum data
    • Read
      • Data Node -> (raw + checksum) data -> client
      • Client -> verifies the raw data using the checksum data
    • A background daemon checks the data on the Data Nodes regularly to guard against bit rot;
    • Disable Checksum
      • FileSystem.setVerifyChecksum(false)
      • FileSystem.open()
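
A minimal sketch of the two calls above: reading a file with client-side checksum verification turned off, e.g. to salvage a corrupt file. The path /sample.log is just a placeholder.

// Sketch: open an HDFS file without verifying checksums on read.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Skip client-side checksum verification for subsequent reads.
    fs.setVerifyChecksum(false);

    try (FSDataInputStream in = fs.open(new Path("/sample.log"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}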

Compression

  • Compression Method/Codec: DEFLATE, gzip, bzip2, LZO, LZ4, Snappy;
  • Map task
    • The compression method of the INPUT FILE can be determined from its file EXTENSION, and the map task decompresses it accordingly.
    • To compress OUTPUT FILE :
      • mapreduce.output.fileoutputformat.compress=true
      • mapreduce.output.fileoutputformat.compress.codec=compression_codec
      • mapreduce.output.fileoutputformat.compress.type
        • RECORD: compress individual records, e.g. in a sequence file;
        • BLOCK: compress groups of records at a time
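
The same three properties can be set through the Java job API. A sketch assuming a Snappy codec and sequence-file output; the job name is a placeholder.

// Sketch: enable block-compressed job output via the MapReduce API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutput {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compressed-output");

    // mapreduce.output.fileoutputformat.compress=true
    FileOutputFormat.setCompressOutput(job, true);
    // mapreduce.output.fileoutputformat.compress.codec=...
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    // mapreduce.output.fileoutputformat.compress.type=BLOCK (sequence-file output only)
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
  }
}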

Serialization

The process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. It is used for:

  • Interprocess communication
  • Persisting data

Writable

Hadoop uses its own serialization format, Writables, which is compact and fast but not easy to extend or to use from languages other than Java. Avro is designed to overcome some of the limitations of Writables, and there are other serialization frameworks supported by Hadoop as well.
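
A minimal custom Writable sketch showing the two methods the interface requires; the PageView class and its fields are made up for illustration.

// Sketch: a custom Writable that serializes two fields in a fixed order.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageView implements Writable {
  private long timestamp;
  private int httpStatus;

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize the fields to the byte stream.
    out.writeLong(timestamp);
    out.writeInt(httpStatus);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize in exactly the order they were written.
    timestamp = in.readLong();
    httpStatus = in.readInt();
  }
}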

File-Based Data Structure

A persistent data structure for binary key-value pairs.

Sequence File

  • Header: magic number, version, key/value class, compression details, user-defined metadata, sync marker
  • Record: Record length, Key length, key, value (or compressed value)
    • Keys are not compressed (with record compression, only values are);
    • Compression method
      • Record compression
      • Block compression: compress multiple records at a time (i.e. a block)
        • io.seqfile.compress.blocksize 
  • Sync markers: placed at intervals throughout the file so a reader can resynchronize with record boundaries (e.g. when splitting the file).
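
A sketch of writing a block-compressed sequence file with the Writer.Option API; the path and record contents are placeholders.

// Sketch: write Text/IntWritable records into a block-compressed SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/demo.seq");   // placeholder path

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      for (int i = 0; i < 100; i++) {
        // Each append writes one record: record length, key length, key, value.
        writer.append(new Text("key-" + i), new IntWritable(i));
      }
    }
  }
}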

Map File

A sorted sequence file with an index to permit lookup by key.
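
A sketch of a keyed lookup, assuming a map file directory /tmp/demo.map with Text keys and IntWritable values was written elsewhere (e.g. with MapFile.Writer).

// Sketch: random lookup by key against a MapFile (index + sorted data file).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    try (MapFile.Reader reader =
             new MapFile.Reader(new Path("/tmp/demo.map"), conf)) {
      IntWritable value = new IntWritable();
      // The in-memory index locates the nearest entry; the data file is then
      // scanned forward to the exact key.
      reader.get(new Text("key-42"), value);
      System.out.println(value);
    }
  }
}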

Other Files

  • Avro file: objects stored in Avro are described by a schema;
  • Column-oriented file: rows are split into row groups, then stored column by column;
    • Hive's RCFiles, Hive's ORCFiles, Parquet
    • Saves time by skipping reads of unnecessary columns
    • Requires more memory to buffer a whole row split.
    • Not suitable for streaming writes.

Reference

http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

  • Avro
    • Usage :
      • Data Serialization
      • Data Exchange (RPC)
    • Data/File Format (see the sketch after this list) :
      • Schema (JSON)
      • Data (binary) -> compressed
      • The data includes sync markers that allow splitting (for big files)
      • Allows schema evolution (adding/removing fields)
        • Yet existing data remains unchanged, i.e. old fields are still available to previous consumers;
  • Parquet
    • Columnar data file format.
    • Problems it addresses :
      • Storage and I/O.
    • Good for analysis, which is usually max/min/avg over certain columns rather than a whole-table row-by-row scan.
    • Good for compression - the type of each column is known.
    • Good for optimization - the type of each column is known - eliminates unnecessary encode/decode work - saves CPU power.
  • Flume
  • Sqoop
  • Crunch
    • The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
    • Makes the creation of MapReduce jobs easier.
    • Structure
      • Source / Original Format (DoFn)
      • Process Raw Data (Join)
      • Filter (invalid data) (FilterFn)
      • Group (Group By Key)
      • Persist Result / Target Format (DoFn)
    • Execution
      • Map : Source / Original Format (DoFn)
      • Reduce : Process Raw Data (Join)
      • Reduce : Filter (invalid data) (FilterFn)
      • Reduce : Group (Group By Key)
      • Reduce : Persist Result / Target Format (DoFn)
  • Spark
  • Zookeeper
  • Impala
  • Hue
  • Yarn
  • Cloudera
  • Tez (a faster execution engine)
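
The Avro sketch referenced above: a hedged example of the schema-plus-binary-data layout using Avro's generic API. The User schema and the users.avro path are made up for illustration.

// Sketch: embed a JSON schema in an Avro data file and append binary records.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // The schema goes into the file header; records follow in binary, with
    // sync markers between blocks so the file can be split.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}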


  • Hue (Hadoop User Experience) / Cloudera
    • Sits on top of different services/apps and exposes those services via a UI;
    • HBase, Oozie, Sqoop, Hive, Pig, HDFS.
    • Infrastructure
      • Front End
        • Mako Templates
        • JQuery
        • KnockoutJS
      • Backend
        • Python + Django driven
        • Thrift + Web Client for Hadoop Components
        • Spawning/CherryPy
      • SDK
        • Make your own App!
        • build/env/bin/hue create_desktop_app <name>
      • DB
        • SQLite, MySQL, PostgreSQL
  • Cascading - Workflow Abstraction
    • Middleware between Hadoop and other applications; a tool for the enterprise.
    • It addresses :
      • Staffing Bottleneck
      • System Integration
      • Operational Complexity
      • Test-Driven Development

  • Data Prep : Clean up the data (usually 80% of the work);

    ETL

PMML

Hadoop

CPU, Disk, Rack

Storage

Execution

Client and Jobs

Physical Architecture

Deployment

Map Group and Reduce

Word Count

Common Patterns

Advanced Patterns

Very Advanced

Trouble with MapReduce

Real World Apps

Not Thinking in MapReduce

In A Nut Shell

Language

  • Java, Scala, Clojure
  • Python, Ruby

Higher Level Language

  • Hive
  • Pig

Frameworks

  • Cascading, Crunch

DSLs

  • Scalding, Scrunch, Scoobi, Cascalog

Topics

Challenges: 5 Vs

  • Store the data and process the data;
  • Volume: due to 
    • Technology Advance: telephone - mobile phone - smart phone;
    • IoT 
    • Social Network:
    • 4.4 ZB today, 44 ZB anticipated by 2020;
  • Variety :
    • Structured
    • Semi-structured
    • Unstructured
  • Velocity :
  • Value: Data Mining
  • Veracity: dealing with inconsistency
    • missing data
    • development 

HDFS : Cluster

Name Node : metadata (blocks and their locations)

  • Master/Slave;

Data Node :

  • Data replication : default 3;

hadoop fs :

  • hadoop fs -ls
  • hadoop fs -put local_filename hdfs_filename
  • hadoop fs -get hdfs_filename local_filename
  • hadoop fs -mv hdfs_filename hdfs_newfilename
  • hadoop fs -rm hdfs_filename
  • hadoop fs -tail/cat/mkdir

MapReduce : sending the processing to the data vs. sending the data to the processing node;

  • Challenges with a hash-table (key-value) approach :
    • Too many keys could result in OOM;
    • It takes too much time;
  • Files broken down into chunks and processed in parallel;
  • Mappers generate Intermediate Records (Key, Value)
    • Shuffle : move Intermediate Records from Mapper to Reducer
    • Sort : sort the Records;
  • Reducers calculate the final result;
  • Partitioner
    • Decides which Reducer each piece of intermediate data goes to;
  • Trackers
    • JobTracker : splits a job into MapReduce tasks;
    • TaskTracker : daemon running on a data node;
  • Mapper : reads an input split record by record and emits the intermediate records (see the WordCount sketch below).
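
A WordCount sketch tying the pieces above together: the mapper emits intermediate (word, 1) records, the shuffle/sort moves and groups them by key, a combiner does local aggregation, and the reducer sums the counts. Input and output paths come from the command line.

// Sketch: classic WordCount with a combiner for local aggregation.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate (key, value) record
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}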

Ecosystem :

  • Core Hadoop
    • HDFS : Storage
    • MapReduce : Processing
  • SQL
    • Pig/Hive : SQL-style processing working through MapReduce against HDFS; the code is turned into MapReduce tasks; takes a long time to run; good for batch processing;
    • Impala : SQL working directly against HDFS; Low latency queries;
    • Apache Drill : SQL, combines a variety of data stores just by using a single query;
  • Processing : Yarn
  • Integration
    • Sqoop : RDBMS data migration into HDFS or the other way around.
    • Flume : Ingests data into HDFS as it is being generated
  • Others :
    • Mahout : Machine Learning
    • Hue : Graphical FrontEnd
    • Oozie : Job Scheduler (Workflow and Coordinator)
      • Workflow : sequential set of actions to be executed;
      • Coordinator : a job gets triggered when a certain condition is met (data availability or time)
    • HBase : works directly against HDFS; NoSQL database; can run without HDFS
    • Spark : in-memory, real-time computation; talks to HDFS directly or runs without HDFS;
    • Ambari : cluster manager - provisioning, managing, and monitoring Hadoop clusters.

Others :

  • Defensive Programming;
  • cat inputfile | mapper | sort | reducer 
  • hs mapper reducer inputdirectory outputdirectory;
  • Combiners
    • The mapper goes through the data;
    • A combiner is applied locally to each mapper's output (a mini reduce), cutting down the data moved in the shuffle;

Patterns

  • Filtering
    • Sampling Data
    • Top-N List
  • Summarization
    • Count/Max/Min/Avg ...
    • Statistics : Mean, Median, Std. Dev ...
    • Inverted Index
  • Structural
    • Combining Data Set

Yarn

ResourceManager: daemon running on master node;

  • Cluster-level resource manager
  • Long-lived; runs on high-quality hardware

NodeManager: daemon running on data node;

  • One per data node;
  • Monitors containers and their resource usage, reporting back to the ResourceManager;

ApplicationMaster :

  • One per application; short-lived;
  • Coordinates and manages MapReduce jobs
  • Negotiates with the ResourceManager to schedule tasks
  • The tasks are started by NodeManagers

Job History Server:

  • Maintains Information about submitted MapReduce jobs after their ApplicationMaster terminates;

Container :

  • Created by the NodeManager when requested;
  • Allocates a certain amount of resources (CPU, memory, etc.) on a slave node;

Client -> Scheduler (RM) + ApplicationsManager (RM) -> Create AM (NM) -> Create Tasks -> Request Container (AM -> RM)

Pig

Workflow

  • Pig Script ->
  • Pig Server / Grunt Shell ->
  • Parser ->
  • Optimizer ->
  • Compiler ->
  • Execution Engine ->
  • MapReduce -> HDFS

Components 

Pig Latin:

  • It is made up of a series of operations or transformations that are applied to the input data to produce output

Pig Execution: 

  • Script : contains Pig commands in a file (.pig)
  • Grunt : interactive shell for running Pig commands
  • Embedded : embedding Pig scripts in Java

Running Mode

  • MapReduce Mode : running Pig over Hadoop cluster;
    • command : pig
  • Local Mode : local file system;
    • command : pig -x local

Data Model

Tuple : row;

Bag : collection of tuples, which can contain all the fields or a few;

{(field1value, field2value, field3value, ...), (field1value, field2value, field3value, ...), ...}

Map : [Key#Value, Key#Value, Key#Value];

Atom : a single value - int, long, float, double, chararray, bytearray;

Operator

  • LOAD : load data from local FS or HDFS into Pig;
grunt> employee = LOAD '/hdfsFile'
                  USING PigStorage(',') -- storage function with ',' as delimiter
                  AS (ssn:chararray,
                      name:chararray,
                      department:chararray,
                      city:chararray);
  • STORE : save results to local FS or HDFS;
grunt> STORE emp_filter INTO '/pigresult';
  • FOREACH :
grunt> emp_foreach = foreach employee
                     generate name, department; -- projected columns
  • FILTER :
grunt> emp_filter = filter employee
                    by city == 'Shanghai'; -- condition
  • JOIN :
  • ORDER BY : 
grunt> emp_order = order employee
                   by ssn desc; -- column to sort by
  • DISTINCT : 
  • GROUP :
  • COGROUP : same as GROUP, but operates on multiple relations at once;
  • DUMP : display the result on the screen;

Example :

-- log_analysis.pig
-- run with: pig directory/log_analysis.pig
log = LOAD '/sample.log';

LEVELS = foreach log generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;

FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;

GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;

FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;

RESULT = order FREQUENCIES by COUNT desc;

DUMP RESULT;

Diagram

Cloudera


Reposted from: https://my.oschina.net/u/3551123/blog/1118873
