Yarn
Elements
- Resource Manager: manages the use of resources across the cluster.
- Node Manager: runs on every node in the cluster to launch and monitor containers.
- Container: executes an application-specific process with a constrained set of resources (CPU, memory, and so on). A container may be a Unix process or a Linux cgroup.
Application
Application Startup
- Client contacts the Resource Manager to start the Application Master.
- The Resource Manager finds a Node Manager to launch the Application Master.
- The Application Master decides what to do next and asks the Resource Manager for resources to run processes when needed.
- Communication between client, master, and processes has to be handled by the application itself.
- Resource requests can be made at any time:
- Up front: all resources requested at startup, e.g. Spark.
- Dynamic: resources requested as the need arises, e.g. MapReduce.
Application Lifespan
- One Application per Job - MapReduce;
- One Application per Workflow that contains multiple jobs - Spark;
- One long-running Application shared by different users - the Application acts as a Coordinator, for example, the Impala daemon.
Scheduler
- FIFO Scheduler
- Capacity Scheduler
- Fair Scheduler
Hadoop I/O
Data Integrity
- CRC-32 (32-bit cyclic redundancy check) for error detection - used by ChecksumFileSystem;
- CRC-32C is a variant of CRC-32, and is the one HDFS uses;
- dfs.bytes-per-checksum (default = 512 bytes; each checksum is 4 bytes), i.e. one 4-byte checksum per 512 bytes of raw data;
- Write
- Client -> Raw Data -> (Raw + Checksum) Data -> sent to 1st Data Node
- 1st Data Node -> (Raw + Checksum) Data -> 2nd Data Node
- The last Data Node in the pipeline verifies the Raw Data against the Checksum Data
- Read
- Data Node -> (Raw + Checksum) Data -> Client
- Client -> verify the Raw Data using Checksum Data
- A background daemon on each Data Node checks stored blocks regularly to guard against bit rot;
- Disable Checksum
- FileSystem.setVerifyChecksum(false)
- FileSystem.open()
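A minimal Java sketch of the two calls just above; the file path is hypothetical. With verification off, reading a corrupt block no longer raises a ChecksumException:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.setVerifyChecksum(false);                        // skip checksum verification on read
        Path file = new Path("/user/demo/part-00000");      // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream the (unverified) bytes
        }
    }
}

The shell equivalent is hadoop fs -get -ignoreCrc, which copies a file to the local filesystem without verifying its checksums.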
Compression
- Compression Method/Codec: DEFLATE, gzip, bzip2, LZO, LZ4, Snappy;
- Map task
- The compression codec of the INPUT FILE can be inferred from the file EXTENSION, and the Map task decompresses the input accordingly.
- To compress OUTPUT FILE :
- mapreduce.output.fileoutputformat.compress=true
- mapreduce.output.fileoutputformat.compress.codec=compression_codec
- mapreduce.output.fileoutputformat.compress.type
- RECORD: compress each record individually, e.g. in a sequence file;
- BLOCK: compress groups of records
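As a minimal sketch, the same three properties can also be set programmatically on a job through the new-API helper classes; the gzip codec here is just an illustrative choice:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputConfig {
    // Configure a job so that its output files are compressed.
    public static void configure(Job job) {
        // mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);
        // mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // mapreduce.output.fileoutputformat.compress.type=BLOCK (only meaningful for sequence-file output)
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}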
Serialization
The process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. It is used for:
- Interprocess communication
- Persisting data
Writeable
Hadoop uses its own serialization format, Writables, which is compact and fast, but not easy to extend or use from languages other than Java. Avro is designed to overcome some of the limitations of Writables, and Hadoop also supports several other serialization frameworks.
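A minimal sketch of how compact a Writable is: serializing an IntWritable by hand through its write() method yields exactly 4 bytes (the helper class and method names are mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableSizeDemo {
    // Serialize a Writable value into its raw byte representation.
    static byte[] serialize(IntWritable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        value.write(out);        // Writable.write(DataOutput)
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(serialize(new IntWritable(163)).length); // prints 4
    }
}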
File-Based Data Structure
A persistent data structure for binary key-value pairs.
Sequence File
- Header: magic number, version, key/value class, compression details, user-defined metadata, sync marker
- Record: Record length, Key length, key, value (or compressed value)
- Keys are not compressed;
- Compression method
- Record compression
- Block compression: compress multiple records at a time (i.e. a block)
- io.seqfile.compress.blocksize
- Sync marker: written into the stream at intervals so a reader can resynchronize with record boundaries, which also makes the file splittable (see the write sketch below).
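A minimal sketch of writing a block-compressed sequence file with the options-based createWriter API; the output path and record contents are made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");   // hypothetical output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }
    }
}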
Map File
A sorted sequence file with an index to permit lookup by key.
Other Files
- Avro File: objects stored in avro are described by a schema;
- Column-Oriented File: rows are split into row groups, then each group is stored column by column;
- Hive's RCFile, Hive's ORCFile, Parquet
- Saves time by skipping reads of unnecessary columns
- Requires more memory to buffer a whole row split.
- Not suitable for streaming writes.
Reference
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
- Avro
- Usage :
- Data Serialization
- Data Exchange (RPC)
- Data/File Format :
- Schema (JSON)
- Data (Binary) -> Compressed
- Data includes sync markers that allow splitting (for big files)
- Allows Schema Updates (Add/Remove Fields)
- Yet existing data remains unchanged, i.e. old fields are still available to previous consumers (see the write sketch below);
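A minimal sketch of writing an Avro data file from a parsed JSON schema; the Employee record and its fields are invented for illustration. Adding a new field later with a default value lets the old data still be read against the new schema:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // Schema is plain JSON; "Employee" and its fields are hypothetical.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("name", "Alice");
        rec.put("city", "Shanghai");

        // Data is written in binary after the schema, with sync markers between blocks.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("employees.avro"));
            writer.append(rec);
        }
    }
}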
- Parquet
- Columnar data file format.
- Problems it addresses :
- Storage and I/O efficiency.
- Good for analysis that typically computes max/min/avg over certain columns rather than scanning the whole table row by row.
- Good for compression - the type of each column is known.
- Good for optimization - the type of each column is known - unnecessary encode/decode work is eliminated, saving CPU power (see the sketch below).
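A minimal sketch of writing Parquet through its Avro binding (parquet-avro), reusing an Avro schema; the schema, path, and codec choice are illustrative, and the builder(Path) form shown is the classic Hadoop-path variant:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"salary\",\"type\":\"double\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/employees.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY) // column types are known, so compression works well
                     .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("name", "Alice");
            rec.put("salary", 100.0);
            writer.write(rec);
        }
    }
}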
- Flume
- Sqoop
- Crunch
- The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
- Making creation of MapReduce jobs easier.
- Structure
- Source / Original Format (DoFn)
- Process Raw Data (Join)
- Filter (invalid data) (FilterFn)
- Group (Group By Key)
- Persist Result / Target Format (DoFn)
- Execution
- Map : Source / Original Format (DoFn)
- Reduce : Process Raw Data (Join)
- Reduce : Filter (invalid data) (FilterFn)
- Reduce : Group (Group By Key)
- Reduce : Persist Result / Target Format (DoFn)
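A minimal sketch of that structure as a Crunch word-count pipeline: a DoFn parses the source, a FilterFn drops invalid (empty) records, count() performs the group-by-key, and writeTextFile persists the result. The input/output paths are placeholders.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

        // Source / original format: split lines into words (DoFn)
        PCollection<String> lines = pipeline.readTextFile("/tmp/input"); // hypothetical path
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // Filter out invalid (empty) records (FilterFn)
        PCollection<String> valid = words.filter(new FilterFn<String>() {
            @Override
            public boolean accept(String word) {
                return !word.isEmpty();
            }
        });

        // Group by key and count, then persist the result
        PTable<String, Long> counts = valid.count();
        pipeline.writeTextFile(counts, "/tmp/output"); // hypothetical path
        pipeline.done();
    }
}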
- Spark
- Zookeeper
- Impala
- Hue
- Yarn
- Cloudera
- Tez (a faster execution engine than classic MapReduce)
- Hue (Hadoop User Experience) / Cloudera
- Sits on top of different services/apps and exposes them via a UI;
- HBase, Oozie, Sqoop, Hive, Pig, HDFS.
- Infrastructure
- Front End
- Mako Templates
- JQuery
- KnockoutJS
- Backend
- Python + Django driven
- Thrift + Web Client for Hadoop Components
- Spawning/CherryPy
- SDK
- Make your own App!
- build/env/bin/hue create_desktop_app <name>
- DB
- SQLite, MySQL, PostgreSQL
- Cascading - Workflow Abstraction
- Middleware sitting between Hadoop and other applications; a tool for the enterprise.
- It addresses
- Staffing Bottleneck
- System Integration
- Operational Complexity
- Test-Driven Development
- Data Prep : Clean up the data (usually 80% of the work);
ETL
PMML
Hadoop
CPU, Disk, Rack
Storage
Execution
Client and Jobs
Physical Architecture
Deployment
Map Group and Reduce
Word Count
Common Patterns
Advanced Patterns
Very Advanced
Trouble with MapReduce
Real World Apps
Not Thinking in MapReduce
In A Nut Shell
Language
- Java, Scala, Clojure
- Python, Ruby
Higher Level Language
- Hive
- Pig
Frameworks
- Cascading, Crunch
DSLs
- Scalding, Scrunch, Scoobi, Cascalog
Topics
Challenges: 5 Vs
- Store the data and process the data;
- Volume, due to:
- Technology advances: telephone -> mobile phone -> smart phone;
- IoT
- Social networks
- 4.4 ZB today, 44 ZB anticipated by 2020;
- Variety :
- Structured
- Semi-structured
- Unstructured
- Velocity :
- Value: Data Mining
- Veracity: dealing with inconsistency
- missing data
- development
HDFS : Cluster
Name Node : meta data (block, location)
- Master/Slave;
Data Node :
- Data replication : default 3;
hadoop fs :
- hadoop fs -ls
- hadoop fs -put local_filename hdfs_filename
- hadoop fs -get hdfs_filename local_filename
- hadoop fs -mv hdfs_filename hdfs_newfilename
- hadoop fs -rm hdfs_filename
- hadoop fs -tail/cat/mkdir
MapReduce : sending the processing to the data vs. sending the data to the processing node;
- Challenges with a (key-value) hash table approach :
- Too many keys could result in OOM;
- Processing on a single machine takes too much time;
- Files broken down into chunks and processed in parallel;
- Mappers generate Intermediate Records (Key, Value)
- Shuffle : move Intermediate Records from Mapper to Reducer
- Sort : sort the Records;
- Reducer calculates the final result;
- Partitioner
- determines which Reducer each intermediate record is sent to;
- Trackers
- JobTracker : splits the job into Map and Reduce tasks;
- TaskTracker : daemon running on each data node that executes the Mapper/Reducer tasks;
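A minimal word-count sketch of that flow in the Java MapReduce API: the Mapper emits intermediate (word, 1) records, the framework shuffles and sorts them by key, and the Reducer sums them; the same reducer class is reused as a combiner. Input/output paths come from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: generates (word, 1) intermediate records
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word after shuffle & sort
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner pre-aggregates on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}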
Ecosystem :
- Core Hadoop
- HDFS : Storage
- MapReduce : Processing
- SQL
- Pig/Hive : SQL-like processing that works through MapReduce against HDFS; code is turned into MapReduce tasks; takes a long time to run; good for batch processing;
- Impala : SQL working directly against HDFS; Low latency queries;
- Apache Drill : SQL, combines a variety of data stores just by using a single query;
- Processing : Yarn
- Integration
- Sqoop : RDBMS data migration into HDFS or the other way around.
- Flume : Ingests data into HDFS as it is being generated
- Others :
- Mahout : Machine Learning
- Hue : Graphical FrontEnd
- Oozie : Job Scheduler (Workflow and Coordinator)
- Workflow : sequential set of actions to be executed;
- Coordinator : job gets triggered when a certain condition is met (data availability or time)
- HBase : NoSQL database working directly against HDFS; can also run without HDFS
- Spark : in-memory, real-time computation; talks to HDFS directly or runs without HDFS;
- Ambari : cluster manager - provision, managing and monitoring Hadoop clusters.
Others :
- Defensive Programming;
- cat inputfile | mapper | sort | reducer
- hs mapper reducer inputdirectory outputdirectory;
- Combiners
- run on the map side as the Mapper goes through the data, pre-aggregating its output before the shuffle;
Patterns
- Filtering
- Sampling Data
- Top-N List
- Summarization
- Count/Max/Min/Avg ...
- Statistics : Mean, Median, Std. Dev ...
- Inverted Index
- Structural
- Combining Data Set
Yarn
ResourceManager: daemon running on master node;
- Cluster level resource manager
- Long life, high-quality hardware
NodeManager: daemon running on data node;
- One per data node;
- Monitors containers and resource usage on its node;
ApplicationMaster :
- One per application; Short life;
- Coordinate and Manage MapReduce jobs
- Negotiate with ResourceManager to schedule tasks
- The tasks are started by NodeManager
Job History Server:
- Maintains Information about submitted MapReduce jobs after their ApplicationMaster terminates;
Container :
- Created by NodeManager when requested;
- Allocates a certain amount of resources (CPU, memory, etc.) on a slave node;
Client -> Scheduler (RM) + ApplicationsManager (RM) -> Create AM (NM) -> Create Tasks -> Request Container (AM -> RM)
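A minimal sketch of that flow from the client side using the YarnClient API; the application name, AM launch command, and resource sizes are placeholders:

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
        // 1. Client talks to the ResourceManager via YarnClient
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // 2. Ask the RM (ApplicationsManager) for a new application
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // 3. Describe the ApplicationMaster container: launch command + resources
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java -jar my-app-master.jar")); // hypothetical AM
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

        // 4. Submit; the RM asks a NodeManager to launch the AM,
        //    which then requests further containers for its tasks.
        yarnClient.submitApplication(ctx);
        yarnClient.stop();
    }
}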
Pig
Workflow
- Pig Script ->
- Pig Server / Grunt Shell ->
- Parser ->
- Optimizer ->
- Compiler ->
- Execution Engine ->
- MapReduce -> HDFS
Components
Pig Latin:
- It is made up of a series of operations or transformations that are applied to the input data to produce output.
Pig Execution:
- Script : Contains Pig Commands in a file (.pig)
- Grunt : interactive shell for running Pig commands
- Embedded : running Pig scripts from within Java
Running Mode
- MapReduce Mode : running Pig over Hadoop cluster;
- command : pig
- Local Mode : local file system;
- command : pig -x local
Data Model
Tuple : row;
Bag : collection of tuples, which can contain all the fields or a few;
{(field1value, field2value, field3value...), (field1value, field2value, field3value ...), ...}
Map : [Key#Value, Key#Value, Key#Value];
Atom : a single value, e.g. int, long, float, double, chararray, bytearray;
Operator
- LOAD : load data from local FS or HDFS into Pig;
grunt> employee = LOAD '/hdfsFile'
           USING PigStorage(',')        -- storage function with ',' as the field delimiter
           AS (ssn:chararray,
               name:chararray,
               department:chararray,
               city:chararray);
- STORE : save results to local FS or HDFS;
grunt> STORE emp_filter INTO '/pigresult';
- FOREACH :
grunt> emp_foreach = FOREACH employee
           GENERATE name, department;   -- columns to return
- FILTER :
grunt> emp_filter = FILTER employee
           BY city == 'Shanghai';       -- condition
- JOIN :
- ORDER BY :
grunt> emp_order = ORDER employee
           BY ssn DESC;                 -- column to sort by
- DISTINCT :
- GROUP :
- COGROUP : same as GROUP, but can operate on two or more relations;
- DUMP : display the result on the screen;
Example :
-- log_analysis.pig
-- run with: pig directory/log_analysis.pig
log = LOAD '/sample.log';
LEVELS = FOREACH log GENERATE REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) AS LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS BY LOGLEVEL IS NOT NULL;
GROUPEDLEVELS = GROUP FILTEREDLEVELS BY LOGLEVEL;
FREQUENCIES = FOREACH GROUPEDLEVELS GENERATE group AS LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) AS COUNT;
RESULT = ORDER FREQUENCIES BY COUNT DESC;
DUMP RESULT;
Diagram
Cloudera