Chapter 1 A New Paradigm for Big Data
1.1 How this Book is structured
focus on principles of Big Data problems => each topic split into theory and illustration chapters
1.2 Scaling with a traditional database
starting point: a single relational database handles all reads and writes
problem: timeout error on inserting to the database
solution: batch the updates using a queue and worker processes
problem: more and more writes, workload still too heavy for the database
solution: horizontal partitioning or sharding spreads the write load across multiple machines
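A minimal sketch of why resharding is painful (hypothetical keys and shard counts): with hash-based sharding, changing the number of shards reassigns most keys, so every reshard means moving large amounts of data by hand.

```python
# Sketch of hash-based horizontal partitioning (hypothetical key names
# and shard counts). Each key maps to a shard via a stable hash; note
# how changing the shard count reassigns most keys, which is why
# resharding is slow and error-prone.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Stable hash so the mapping is consistent across processes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved after resharding 4 -> 5 shards")
```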
problem: have to keep resharding the database into more shards to keep up with the write load, and the resharding process is manual and error-prone
solution: Big Data?
make your data immutable. With traditional databases you'd be wary of immutable data because of how fast such a dataset grows, but because Big Data techniques can scale to so much data, you have the ability to design systems in different ways.
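The mutable vs. immutable contrast above can be sketched as follows (hypothetical user-profile schema): a mutable design overwrites the old value, while an immutable design appends a new timestamped fact and keeps the full history.

```python
# Mutable vs. immutable data design (hypothetical fields).

# Mutable: updating destroys the previous value.
profile = {"user": "alice", "location": "NYC"}
profile["location"] = "SF"          # old value "NYC" is gone forever

# Immutable: each change is a new timestamped fact; nothing is overwritten.
facts = [
    {"user": "alice", "location": "NYC", "ts": 1},
    {"user": "alice", "location": "SF",  "ts": 2},
]

def current_location(facts, user):
    # Derive the current value from the full history (latest fact wins).
    user_facts = [f for f in facts if f["user"] == user]
    return max(user_facts, key=lambda f: f["ts"])["location"]

print(current_location(facts, "alice"))  # latest value, history preserved
```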
1.3 NoSQL is not a panacea
used in conjunction with one another, NoSQL tools can produce scalable systems for arbitrary data problems with human-fault tolerance and a minimum of complexity
1.4 First principles
A data system answers questions based on information acquired from the past up to the present
definition of data system:
query = function (all data) [how about write new data?]
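The "query = function(all data)" definition can be sketched in a few lines (hypothetical pageview dataset). It also suggests an answer to the bracketed question above: under this model a write is simply appending a new record to the data, which never changes the query function itself.

```python
# Sketch of query = function(all data), with a hypothetical pageview log.

all_data = [
    {"url": "/home",  "ts": 100},
    {"url": "/about", "ts": 105},
    {"url": "/home",  "ts": 110},
]

def pageviews(data, url):
    # A query is a pure function run over the entire dataset.
    return sum(1 for event in data if event["url"] == url)

print(pageviews(all_data, "/home"))   # 2

# "Writing new data" is just appending another immutable record:
all_data.append({"url": "/home", "ts": 120})
print(pageviews(all_data, "/home"))   # 3
```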
1.5 Desired properties of a Big Data system
robustness and fault tolerance
low latency reads and updates
scalability
generalization
extensibility
ad hoc queries
minimal maintenance
debuggability
1.6 The Problems of incremental architectures
Traditional Architecture: use read/write databases and maintain the state in those databases incrementally as new data is seen
Complexity:
operation
achieving eventual consistency
lack of human-fault tolerance (this argument feels somewhat shaky to me)
1.7 Lambda Architecture
Batch Layer
responsibility: 1. stores master dataset; 2. computes arbitrary views
formula: batch view = function(all data)
implementation: Hadoop, MapReduce, HDFS
Serving Layer
responsibility: 1. random access to batch views; 2. updated by batch layer
formula: NONE
implementation: ElephantDB
Speed Layer
responsibility: 1. compensates for the high latency of updates to the serving layer; 2. uses fast, incremental algorithms; 3. the batch layer eventually overrides the speed layer
formula: realtime view = function(realtime view, new data)
note: can be thought of as similar to the batch layer, but it looks only at recent data rather than all the data at once
implementation: Cassandra, HBase, MongoDB, Voldemort, Riak, CouchDB
Messaging / Queueing Systems: Kafka
Realtime Computation System: Storm
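The speed layer's formula, realtime view = function(realtime view, new data), can be sketched with a hypothetical pageview counter: each new record updates a small view incrementally, without touching the full dataset.

```python
# Sketch of the speed layer's incremental update:
#   realtime view = function(realtime view, new data)
# using a hypothetical url -> count view of data since the last batch run.

realtime_view = {}

def update(view, new_event):
    # Incremental: only the new event is examined, not the full dataset.
    url = new_event["url"]
    view[url] = view.get(url, 0) + 1
    return view

for event in [{"url": "/home"}, {"url": "/home"}, {"url": "/about"}]:
    realtime_view = update(realtime_view, event)

print(realtime_view)  # {'/home': 2, '/about': 1}
```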
Summary
batch view = function(all data)
realtime view = function(realtime view, new data)
query = function(batch view, realtime view)
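The three formulas above fit together at query time; a minimal sketch with hypothetical pageview counts shows how query = function(batch view, realtime view) merges the two views.

```python
# Sketch of query-time merging in the Lambda Architecture
# (hypothetical url -> count views).

batch_view = {"/home": 1000, "/about": 50}   # computed from all data
realtime_view = {"/home": 7}                 # computed from recent data only

def query(batch, realtime, url):
    # query = function(batch view, realtime view)
    return batch.get(url, 0) + realtime.get(url, 0)

print(query(batch_view, realtime_view, "/home"))   # 1007
print(query(batch_view, realtime_view, "/about"))  # 50
```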
Complexity Isolation
Part 1 Batch Layer
Chapter 2 Data Model for Big Data
2.1 The properties of data
Information: general collection of knowledge relevant to your Big Data system. It's synonymous with the colloquial usage of the word data
Data: refers to the information that can't be derived from anything else. Data serves as the axioms from which everything else derives
Queries: question you ask of your data
Views: information that has been derived from your data. They are built to assist with answering specific types of queries
Key Properties of Data
1. rawness (storing raw data is hugely valuable because you rarely know in advance all the questions you want answered)
2. immutability (Human-fault tolerance / simplicity )
3. perpetuity
2.2 The fact-based model for representing data
In the fact-based model, you deconstruct your data into fundamental units called facts.
Fact Properties
1. atomic
2. timestamped
Benefits of the fact-based model
1. Is queryable at any time in its history
2. Tolerates human errors (by deleting the erroneous fact)
3. Handles partial information
4. Has the advantages of both normalized and denormalized forms (In lambda architecture, the master dataset is fully normalized)
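The benefits above can be sketched with a few hypothetical facts: each fact is atomic and timestamped, and a human error is corrected by deleting the single erroneous fact rather than mutating good data.

```python
# Sketch of the fact-based model (hypothetical facts and field names).

facts = [
    {"id": 1, "user": "alice", "fact": ("location", "NYC"), "ts": 100},
    {"id": 2, "user": "alice", "fact": ("age", 250),        "ts": 101},  # bad entry
    {"id": 3, "user": "alice", "fact": ("age", 25),         "ts": 102},
]

# Tolerating human error: drop only the erroneous fact.
facts = [f for f in facts if f["id"] != 2]

def latest(facts, user, field):
    # The dataset is queryable at any point in its history;
    # here we take the most recent value of one field.
    matches = [f for f in facts if f["user"] == user and f["fact"][0] == field]
    return max(matches, key=lambda f: f["ts"])["fact"][1]

print(latest(facts, "alice", "age"))  # 25
```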
2.3 Graph schemas
graph schemas: capture the structure of a dataset stored using the fact-based model.
Nodes: entities
Edges: relationships between nodes
Properties: information about entities
The need for an enforceable schema: defines structure of fact.
Implement an enforceable schema using a serialization framework. A serialization framework provides a language-neutral way to define nodes, edges, and properties
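As a rough stand-in for what a serialization framework like Thrift does (the real thing defines schemas in a language-neutral IDL; the field names here are hypothetical), a Python sketch of an enforceable schema checks the structure of a fact at construction time, so malformed facts can't enter the master dataset.

```python
# Rough Python stand-in for an enforceable schema; field names hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: facts are immutable once created
class PageviewFact:
    user_id: str
    url: str
    timestamp: int

    def __post_init__(self):
        # Structural checks the schema enforces at construction time.
        if not isinstance(self.user_id, str) or not self.user_id:
            raise ValueError("user_id must be a non-empty string")
        if not isinstance(self.timestamp, int):
            raise ValueError("timestamp must be an integer")

fact = PageviewFact(user_id="alice", url="/home", timestamp=100)

try:
    PageviewFact(user_id="", url="/home", timestamp=100)
except ValueError as e:
    print("rejected:", e)
```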
Chapter 3 Data Model for Big Data: Illustration
Thrift: cannot express validation logic such as requiring a value to be non-negative
Chapter 4 Data Storage on The Batch Layer
topics:
Storage requirements for the master dataset
Distributed filesystems
Improving efficiency with vertical partitioning
4.1 Storage requirements for the master dataset
Operation | Requisite | Discussion
Write | Efficient appends of new data | The only write operation is to add new pieces of data, so it must be easy and efficient to append a new set of data objects to the master dataset
Write | Scalable storage | The batch layer stores the complete dataset -- potentially terabytes or petabytes of data. It must therefore be easy to scale the storage as your dataset grows
Read | Support for parallel processing | Constructing the batch views requires computing functions on the entire master dataset. The batch storage must consequently support parallel processing to handle large amounts of data in a scalable manner (no need for random access)
Both | Tunable storage and processing costs | Storage costs money. You may choose to compress your data to help minimize your expense, but decompressing your data during computation can affect performance. The batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs
Both | Enforceable immutability | It's critical that you're able to enforce the immutable property of your master dataset. Of course, computers by their very nature are mutable, so there will always be a way to mutate the data you store. The best you can do is put checks in place to disallow mutable operations. These checks should prevent bugs or other random errors from trampling over existing data
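The "enforceable immutability" requirement can be sketched with a hypothetical in-memory stand-in for batch storage: the only allowed write is appending new records, and overwrites or deletes raise errors so bugs can't trample existing data.

```python
# Sketch of enforceable immutability for the master dataset
# (hypothetical in-memory stand-in for a distributed filesystem).

class AppendOnlyStore:
    def __init__(self):
        self._records = []

    def append(self, record):
        # The sole supported write operation.
        self._records.append(record)

    def read_all(self):
        # Batch reads scan the whole dataset; no random access needed.
        return list(self._records)

    def overwrite(self, index, record):
        raise PermissionError("master dataset is immutable: no overwrites")

    def delete(self, index):
        raise PermissionError("master dataset is immutable: no deletes")

store = AppendOnlyStore()
store.append({"url": "/home", "ts": 100})
try:
    store.delete(0)
except PermissionError as e:
    print(e)
```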
Chapter 5 Data Storage on The Batch Layer: Illustration
HDFS
Pail
Chapter 6 Batch Layer
Recomputation vs Incremental
6.6 Low-level nature of MapReduce
While MapReduce is a great primitive for batch computation -- providing a generic, scalable, and fault-tolerant way to compute functions of large datasets -- it doesn't lend itself to particularly elegant code. MapReduce programs written manually tend to be long, unwieldy, and difficult to understand. (MapReduce is fairly low-level and ill-suited to some scenarios; in those cases the code becomes complex and hard to understand)
1. multistep computations are unnatural
2. joins are very complicated to implement manually
3. logical and physical execution tightly coupled
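A minimal word-count sketch of the MapReduce primitive (a pure-Python stand-in, not Hadoop) shows why it is generic but low-level: even this trivial job needs explicit map, shuffle, and reduce phases, and multistep computations require chaining several such jobs.

```python
# Pure-Python sketch of the MapReduce primitive (word count).
from collections import defaultdict

def map_phase(docs):
    # map: emit (word, 1) for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key (handled by the framework in real MapReduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 2
```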
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
Chapter 7 Batch Layer: Illustration
JCascalog as a practical implementation of pipe diagrams
inputs and outputs are defined via an abstraction called a tap
7.2 Common Pitfalls of data-processing tools
1. custom languages
2. poorly composable abstractions
Chapter 8 An Example Batch Layer: Architecture and Algorithm
Chapter 9 An Example Batch Layer: Implementation
Part 2 Serving Layer
Chapter 10 Serving Layer
Part 3 Speed Layer
speed layer -> [synchronously / asynchronously]
asynchronously -> queues and streaming
two paradigms of stream processing -> [one-at-a-time / micro-batched]
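The two paradigms can be contrasted with a small sketch over a hypothetical event stream: one-at-a-time processing handles each record as it arrives (lower latency), while micro-batching groups records into small fixed-size batches (higher throughput, easier exactly-once semantics).

```python
# Sketch contrasting one-at-a-time and micro-batched stream processing
# (hypothetical event stream and batch size).

events = list(range(10))
processed = []

# One-at-a-time: handle each event individually as it arrives.
for e in events:
    processed.append(e * 2)

# Micro-batched: handle events in fixed-size batches.
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
batch_results = [sum(b) for b in batches]  # one aggregate per batch

print(len(batches))   # 3 batches: sizes 4, 4, 2
print(batch_results)  # [6, 22, 17]
```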
Chapter 12 Realtime Views
speed layer is based on incremental computation instead of batch computation
responsibilities: storing the realtime views and processing the incoming data stream so as to update those views.