《Hadoop: The Definitive Guide》读书笔记 -- Chapter 1 Meet Hadoop

Preface

Stripped to its core, the tools that Hadoop provides for working with big data are simple.If there is a common theme, it is about raising the level of abstraction -- to create building blocks for programmers who have lots of data to store and analyze, and who don't have time, the skill, or the inclination to become distributed system experts to build infrastructure on it.


Part I Hadoop Fundamentals

Chapter 1 Meet Hadoop

Data!

individual footprint to grow(Machine logs, RFID readers, sensor networks)

"More data usually beats better algorithm"


Data Storage 

There is a long time to read all data from one disk -- and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once.

read and write from parallel disks:

1. the chance of one will fail is fairly high. replication: RAID / HDFS

2. most analysis tasks need to be able to combine the data in some way. MapReduce

Hadoop provides: a reliable, scalable platform for storage and analysis


Querying All Your Data

MapReduce is a batch query processor


Beyond Batch

MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less.


"Hadoop" is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce.

HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access individual rows and batch operations for reading and writing data in bulk.

YARN(Yet Another Resource Negotiator), a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in Hadoop cluster.


Different processing patterns that work with Hadoop:[ROADMAP]

1. Interactive SQL: Impala(Daemon), Hive on Tez(Container Reuse)

2. Interactive Processing: Spark(Many algorithm in machine learning are interactive in nature)

3. Stream Processing: Storm, Spark Streaming, Samza run realtime, distributed computation on unbounded streams of data and emit results.

4. Search: Solr can run on Hadoop cluster


Comparison with Other Systems

Relational Database Management System

Why can't we use databases with lots of disks to do large-scale analysis?

seeking time is improving more slowly than transfer rate. B-Tree works well for small portion of data.


Grid Computing(good for compute insensitive jobs)

Problem when nodes need to access larger data volumes(hundreds of gigabytes, the points at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.(Data Locality)


Volunteer Computing




  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值