Reading Notes on <<Big Data: Principles and Best Practices of Scalable Realtime Data Systems>>

Chapter 1 A New Paradigm for Big Data

1.1 How this book is structured

focus on the principles behind Big Data systems => each topic is covered in a theory chapter followed by an illustration chapter

1.2 Scaling with a traditional database


original design: the application writes directly to a single relational database

problem: timeout errors when inserting into the database


solution: buffer updates in a queue and have a worker apply them to the database in batches
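
A minimal sketch of the queue-and-worker pattern, assuming an in-memory queue and an invented applyBatchToDatabase helper (not code from the book): request handlers enqueue updates, and a single background worker drains and applies them in batches.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingWorker {
    // Pending pageview updates produced by the web tier.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Request handlers call this instead of writing to the DB directly.
    public void recordPageview(String url) {
        queue.add(url);
    }

    // Single background worker: drain up to 1000 pending updates
    // and apply them in one database round trip.
    public void runWorker() throws InterruptedException {
        while (true) {
            List<String> batch = new ArrayList<>();
            batch.add(queue.take());   // block until at least one update arrives
            queue.drainTo(batch, 999); // grab whatever else is pending
            applyBatchToDatabase(batch);
        }
    }

    // Hypothetical stand-in for a real JDBC batch update.
    private void applyBatchToDatabase(List<String> batch) {
        System.out.println("flushing " + batch.size() + " updates");
    }
}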

problem: writes keep increasing, and the workload is still too heavy for a single database


solution: horizontal partitioning or sharding spreads the write load across multiple machines
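
A sketch of the usual hash-mod sharding scheme (my illustration, not the book's code). The shard a key maps to depends on the total shard count, which already hints at why resharding is painful:

public class ShardChooser {
    // All writes for a given key must land on the same shard.
    static int shardFor(String key, int numShards) {
        // Math.floorMod avoids negative results from hashCode().
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same key maps to different shards once numShards changes,
        // so growing from 4 to 5 shards means moving existing rows around.
        System.out.println(shardFor("http://example.com/page", 4));
        System.out.println(shardFor("http://example.com/page", 5));
    }
}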

problem: you keep having to reshard the database into more shards to keep up with the write load, and resharding is complex and error-prone


solution: Big Data?

Making your data immutable: with traditional databases, you'd be wary of using immutable data because of how fast such a dataset would grow. But because Big Data techniques can scale to so much data, you have the ability to design systems in different ways.


1.3 NoSQL is not a panacea

used in conjunction with one another, NoSQL tools let you produce scalable systems for arbitrary data problems with human-fault tolerance and a minimum of complexity


1.4 First principles

A data system answers questions based on information that was acquired in the past up to the present


definition of data system: 

query = function(all data)    [what about writing new data?]
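
As for writing new data: in this model a write is simply appending a new immutable fact to the dataset. A toy illustration of query = function(all data), using a pageview-count example of my own: the query is a pure function scanned over the entire dataset rather than a lookup into a pre-maintained counter.

import java.util.List;

public class QueryAsFunction {
    // query = function(all data): count the pageviews of one URL
    // by scanning every fact ever recorded.
    static long pageviews(List<String> allData, String url) {
        return allData.stream().filter(url::equals).count();
    }

    public static void main(String[] args) {
        // "Writing" new data would just append another element here.
        List<String> allData = List.of("/home", "/about", "/home");
        System.out.println(pageviews(allData, "/home")); // 2
    }
}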


1.5 Desired properties of a Big Data system

robustness and fault tolerance

low latency reads and updates

scalability

generalization

extensibility

ad hoc queries

minimal maintenance

debuggability


1.6 The problems with fully incremental architectures

Traditional architecture: use read/write databases and maintain the state in those databases incrementally as new data is seen

Complexity:

operational complexity

complexity of achieving eventual consistency

lack of human-fault tolerance (this argument feels a bit shaky to me)


1.7 Lambda Architecture

Batch Layer

responsibility: 1. stores master dataset; 2. computes arbitrary views

formula: batch view = function(all data)

implementation: Hadoop (HDFS, MapReduce)


Serving Layer

responsibility: 1. random access to batch views; 2. updated by batch layer

formula: NONE

implementation: ElephantDB


Speed Layer

responsibility: 1. compensates for the high latency of updates to the serving layer; 2. uses fast, incremental algorithms; 3. the batch layer eventually overrides the speed layer

formula: realtime view = function(realtime view, new data)

note: similar in spirit to the batch layer, but it looks only at recent data rather than all data at once

implementation: Cassandra, HBase, MongoDB, Voldemort, Riak, CouchDB


Messaging / Queueing Systems: Kafka

Realtime Computation System: Storm


Summary

batch view = function(all data)

realtime view = function(realtime view, new data)

query = function(batch view, realtime view)
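
A minimal sketch of the third formula, assuming the views are simple pageview counts (for counts, merging is just addition; real merge logic depends on the view type):

import java.util.Map;

public class LambdaQuery {
    // query = function(batch view, realtime view): the batch view covers
    // everything up to the last batch run, the realtime view covers
    // only what has arrived since.
    static long pageviews(Map<String, Long> batchView,
                          Map<String, Long> realtimeView,
                          String url) {
        return batchView.getOrDefault(url, 0L)
             + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batch = Map.of("/home", 10_000L);
        Map<String, Long> realtime = Map.of("/home", 42L);
        System.out.println(pageviews(batch, realtime, "/home")); // 10042
    }
}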


Complexity isolation: complexity is pushed into the speed layer, whose results are only temporary (the batch layer eventually overrides them)


Part 1 Batch Layer

Chapter 2 Data Model for Big Data

2.1 The properties of data

Information: general collection of knowledge relevant to your Big Data system. It's synonymous with the colloquial usage of the word data

Data: the information that can't be derived from anything else; it serves as the axioms from which everything else derives

Queries: questions you ask of your data

Views: information that has been derived from your data. They are built to assist with answering specific types of queries


Key Properties of Data

1. rawness (storing raw data is hugely valuable because you rarely know in advance all the questions you want answered)

2. immutability (Human-fault tolerance / simplicity )

3. perpetuity


2.2 The fact-based model for representing data

In the fact-based model, you deconstruct your data into fundamental units called facts.


Fact Properties

1. atomic

2. timestamped
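
A sketch of what an atomic, timestamped fact could look like as a plain value type (the field names are mine, for illustration):

// One atomic, timestamped statement about one entity, e.g.
// "user 123 set their location to 'Tokyo' at time T".
public record Fact(
    long userId,      // which entity the fact is about
    String property,  // e.g. "location"
    String value,     // e.g. "Tokyo"
    long timestampMs  // when the fact became true
) {}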


Benefits of the fact-based model

1. Is queryable at any time in its history

2. Tolerates human errors (recover by deleting the erroneous facts)

3. Handles partial information

4. Has the advantages of both normalized and denormalized forms (In lambda architecture, the master dataset is fully normalized)


2.3 Graph schemas

graph schemas: capture the structure of a dataset stored using the fact-based model.


Nodes: entities

Edges: relationships between nodes

Properties: information about entities


The need for an enforceable schema: it defines the structure of facts.

Implement an enforceable schema using a serialization framework. A serialization framework provides a language-neutral way to define nodes, edges, and properties.
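
A rough sketch of the node/edge/property split using plain Java records (type names are mine; the book defines these with a serialization framework such as Thrift instead):

public class GraphSchema {
    // Node: identifies an entity.
    record PersonNode(long personId) {}

    // Property: information about an entity.
    record PersonProperty(PersonNode person, String name, String value) {}

    // Edge: a relationship between two nodes.
    record FriendEdge(PersonNode a, PersonNode b) {}

    public static void main(String[] args) {
        PersonNode alice = new PersonNode(1);
        PersonNode bob = new PersonNode(2);
        System.out.println(new PersonProperty(alice, "location", "Tokyo"));
        System.out.println(new FriendEdge(alice, bob));
    }
}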


Chapter 3 Data Model for Big Data: Illustration

Thrift: cannot express validation constraints such as requiring a value to be non-negative


Chapter 4 Data Storage on The Batch Layer

topics:

Storage requirements for the master dataset

Distributed filesystems

Improving efficiency with vertical partitioning


4.1 Storage requirements for the master dataset

Write -- Efficient appends of new data: the only write operation is to add new pieces of data, so it must be easy and efficient to append a new set of data objects to the master dataset.

Write -- Scalable storage: the batch layer stores the complete dataset -- potentially terabytes or petabytes of data -- so it must be easy to scale the storage as your dataset grows.

Read -- Support for parallel processing: constructing the batch views requires computing functions on the entire master dataset, so the batch storage must support parallel processing to handle large amounts of data in a scalable manner (no need for random access).

Both -- Tunable storage and processing costs: storage costs money. You may choose to compress your data to help minimize your expense, but decompressing your data during computation can affect performance. The batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs.

Both -- Enforceable immutability: it's critical that you're able to enforce the immutable property of your master dataset. Computers are by nature mutable, so there will always be some way to mutate the stored data; the best you can do is put checks in place that disallow mutable operations and prevent bugs or other random errors from trampling over existing data.
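
A sketch of enforceable immutability at the filesystem level (the paths and helper are illustrative; the book does this on HDFS, e.g. via the Pail abstraction): new data only ever lands in fresh files, and existing files are never rewritten.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.UUID;

public class MasterDatasetStore {
    private final Path root;

    MasterDatasetStore(Path root) { this.root = root; }

    // The only write operation: append a batch of records as a
    // brand-new file. Existing files are never opened for writing.
    void append(List<String> records) throws IOException {
        Files.createDirectories(root);
        Path newFile = root.resolve(UUID.randomUUID() + ".data");
        // CREATE_NEW fails if the file already exists, ruling out
        // accidental overwrites of existing data.
        Files.write(newFile, records,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }
}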


Chapter 5 Data Storage on The Batch Layer: Illustration

HDFS

Pail


Chapter 6 Batch Layer

Recomputation algorithms vs. incremental algorithms
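
The contrast in one toy example (mine, not the book's): a recomputation algorithm rebuilds the view from scratch over all data, while an incremental algorithm folds new data into the existing view.

import java.util.List;

public class RecomputeVsIncremental {
    // Recomputation: batch view = function(all data).
    static long recompute(List<Long> allData) {
        return allData.stream().mapToLong(Long::longValue).sum();
    }

    // Incremental: new view = function(old view, new data).
    static long incremental(long oldView, List<Long> newData) {
        return oldView + newData.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(recompute(List.of(1L, 2L, 3L))); // 6
        System.out.println(incremental(6L, List.of(4L)));   // 10
    }
}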


6.6 Low-level nature of MapReduce

While MapReduce is a great primitive for batch computation -- providing you a generic, scalable, and fault-tolerant way to compute functions of large datasets -- it doesn't lend itself to particularly elegant code. MapReduce programs written manually tend to be long, unwieldy, and difficult to understand. (MapReduce is quite low-level; in some scenarios the code becomes complex and hard to understand.)


1. multistep computations are unnatural

2. joins are very complicated to implement manually

3. logical and physical execution are tightly coupled


6.7 Pipe diagrams: a higher-level way of thinking about batch computation


Chapter 7 Batch Layer: Illustration

JCascalog as a practical implementation of pipe diagrams

inputs and outputs are defined via an abstraction called a tap


7.2 Common Pitfalls of data-processing tools

1. custom languages

2. poorly composable abstractions


Chapter 8 An Example Batch Layer: Architecture and Algorithm

Chapter 9 An Example Batch Layer: Implementation


Part 2 Serving Layer

Chapter 10 Serving Layer



Part 3 Speed Layer

speed layer -> [synchronously / asynchronously]

asynchronously -> queues and streaming

two paradigms of stream processing -> [one-at-a-time / micro-batched]
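
A rough sketch of the two paradigms side by side (the blocking queue stands in for a real broker like Kafka; names and batch sizes are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class StreamParadigms {
    // One-at-a-time: handle each event as soon as it arrives (lowest latency).
    static void oneAtATime(BlockingQueue<String> stream) throws InterruptedException {
        while (true) {
            process(List.of(stream.take()));
        }
    }

    // Micro-batched: accumulate a small batch and process it as a unit,
    // trading some latency for simpler exactly-once style processing.
    static void microBatched(BlockingQueue<String> stream) throws InterruptedException {
        while (true) {
            List<String> batch = new ArrayList<>();
            batch.add(stream.take());
            stream.drainTo(batch, 99);
            process(batch);
        }
    }

    static void process(List<String> events) {
        System.out.println("processing " + events.size() + " event(s)");
    }
}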


Chapter 12 Realtime Views

speed layer is based on incremental computation instead of batch computation


core tasks: storing the realtime views and processing the incoming data stream so as to update those views.
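
A minimal sketch of both tasks together, using an in-memory map as a stand-in for a realtime store such as Cassandra:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RealtimeView {
    // Stores the realtime view: url -> pageviews seen since the last batch run.
    private final Map<String, Long> view = new ConcurrentHashMap<>();

    // Processes the incoming stream: each new fact updates the view
    // incrementally (realtime view = function(realtime view, new data)).
    public void onPageview(String url) {
        view.merge(url, 1L, Long::sum);
    }

    // Once the batch layer has absorbed this data, the corresponding
    // portion of the realtime view can simply be discarded.
    public void expireAfterBatchRun() {
        view.clear();
    }

    public long pageviews(String url) {
        return view.getOrDefault(url, 0L);
    }
}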

