Chapter 1 A New Paradigm for Big Data
1.1 How this Book is structured
focus on principles of Big Data problems => each topic split into theory and illustration chapters
1.2 Scaling with a traditional database
starting point: a single relational database handles all reads and writes
problem: timeout error on inserting to the database
solution: batch the updates using a queue and worker processes
problem: more and more writes, workload still too heavy for the database
solution: horizontal partitioning or sharding spreads the write load across multiple machines
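A minimal sketch of why resharding is painful (hypothetical keys and shard counts): with hash-based sharding, changing the number of shards reassigns most keys, so every reshard means moving large amounts of data by hand.

```python
# Sketch of hash-based horizontal partitioning (hypothetical key names
# and shard counts). Each key maps to a shard via a stable hash; note
# how changing the shard count reassigns most keys, which is why
# resharding is slow and error-prone.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Stable hash so the mapping is consistent across processes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved after resharding 4 -> 5 shards")
```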
problem: have to keep resharding the database into more shards to keep up with the write load, and the resharding process is manual and error-prone
solution: Big Data?
make your data immutable. With traditional databases you'd be wary of immutable data because of how fast such a dataset grows, but because Big Data techniques can scale to so much data, you have the ability to design systems in different ways.
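The mutable vs. immutable contrast above can be sketched as follows (hypothetical user-profile schema): a mutable design overwrites the old value, while an immutable design appends a new timestamped fact and keeps the full history.

```python
# Mutable vs. immutable data design (hypothetical fields).

# Mutable: updating destroys the previous value.
profile = {"user": "alice", "location": "NYC"}
profile["location"] = "SF"          # old value "NYC" is gone forever

# Immutable: each change is a new timestamped fact; nothing is overwritten.
facts = [
    {"user": "alice", "location": "NYC", "ts": 1},
    {"user": "alice", "location": "SF",  "ts": 2},
]

def current_location(facts, user):
    # Derive the current value from the full history (latest fact wins).
    user_facts = [f for f in facts if f["user"] == user]
    return max(user_facts, key=lambda f: f["ts"])["location"]

print(current_location(facts, "alice"))  # latest value, history preserved
```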
1.3 NoSQL is not a panacea
used in conjunction with one another, NoSQL tools can produce scalable systems for arbitrary data problems with human-fault tolerance and a minimum of complexity
1.4 First principles
A data system answers questions based on information acquired from the past up to the present
definition of data system:
query = function (all data) [how about write new data?]
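The "query = function(all data)" definition can be sketched in a few lines (hypothetical pageview dataset). It also suggests an answer to the bracketed question above: under this model a write is simply appending a new record to the data, which never changes the query function itself.

```python
# Sketch of query = function(all data), with a hypothetical pageview log.

all_data = [
    {"url": "/home",  "ts": 100},
    {"url": "/about", "ts": 105},
    {"url": "/home",  "ts": 110},
]

def pageviews(data, url):
    # A query is a pure function run over the entire dataset.
    return sum(1 for event in data if event["url"] == url)

print(pageviews(all_data, "/home"))   # 2

# "Writing new data" is just appending another immutable record:
all_data.append({"url": "/home", "ts": 120})
print(pageviews(all_data, "/home"))   # 3
```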
1.5 Desired properties of a Big Data system
robustness and fault tolerance
low latency reads and updates
scalability
generalization
extensibility
ad hoc queries
minimal maintenance
debuggability
1.6 The Problems of incremental architectures
Traditional Architecture: use read/write databases and maintain the state in those databases incrementally as new data is seen
Complexity:
operation
achieving eventual consistency
lack of human-fault tolerance (this argument feels somewhat shaky to me)
1.7 Lambda Architecture
Batch Layer
responsibility: 1. stores master dataset; 2. computes arbitrary views
formula: batch view = function(all data)
implementation: Hadoop, MapReduce, HDFS
Serving Layer
responsibility: 1. random access to batch views; 2. updated by batch layer
formula: NONE
implementation: ElephantDB
Speed Layer
responsibility: 1. compensates for the high latency of updates to the serving layer; 2. uses fast, incremental algorithms; 3. the batch layer eventually overrides the speed layer
formula: realtime view = function(realtime view, new data)
note: can be thought of as similar to the batch layer, but it looks only at recent data rather than all the data at once
implementation: Cassandra, HBase, MongoDB, Voldemort, Riak, CouchDB
Messaging / Queueing Systems: Kafka
Realtime Computation System: Storm
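The speed layer's formula, realtime view = function(realtime view, new data), can be sketched with a hypothetical pageview counter: each new record updates a small view incrementally, without touching the full dataset.

```python
# Sketch of the speed layer's incremental update:
#   realtime view = function(realtime view, new data)
# using a hypothetical url -> count view of data since the last batch run.

realtime_view = {}

def update(view, new_event):
    # Incremental: only the new event is examined, not the full dataset.
    url = new_event["url"]
    view[url] = view.get(url, 0) + 1
    return view

for event in [{"url": "/home"}, {"url": "/home"}, {"url": "/about"}]:
    realtime_view = update(realtime_view, event)

print(realtime_view)  # {'/home': 2, '/about': 1}
```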
Summary
batch view = function(all data)
realtime view = function(realtime view, new data)
query = function(batch view, realtime view)
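The three formulas above fit together at query time; a minimal sketch with hypothetical pageview counts shows how query = function(batch view, realtime view) merges the two views.

```python
# Sketch of query-time merging in the Lambda Architecture
# (hypothetical url -> count views).

batch_view = {"/home": 1000, "/about": 50}   # computed from all data
realtime_view = {"/home": 7}                 # computed from recent data only

def query(batch, realtime, url):
    # query = function(batch view, realtime view)
    return batch.get(url, 0) + realtime.get(url, 0)

print(query(batch_view, realtime_view, "/home"))   # 1007
print(query(batch_view, realtime_view, "/about"))  # 50
```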
Complexity Isolation
Part 1 Batch Layer
Chapter 2 Data Model for Big Data
2.1 The properties of data
Information: general collection of knowledge relevant to your Big Data system. It's synonymous with the colloquial usage of the word data
Data: refers to the information that can't be derived from anything else. Data serves as the axioms from which everything else derives
Queries: question you ask of your data
Views: information that has been derived from your data. They are built to assist with answering specific types of queries
Key Properties of Data
1. rawness (storing raw data is hugely valuable because you rarely know in advance all the questions you want answered)
2. immutability (Human-fault tolerance / simplicity )
3. perpetuity
2.2 The fact-based model for representing data
In the fact-based model, you deconstruct your data into fundamental units called facts.
Fact Properties
1. atomic
2. timestamped
Benefits of the fact-based model
1. Is queryable at any time in its history
2. Tolerates human errors (by deleting the erroneous fact)
3. Handles partial information
4. Has the advantages of both normalized and denormalized forms (In lambda architecture, the master dataset is fully normalized)
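The benefits above can be sketched with a few hypothetical facts: each fact is atomic and timestamped, and a human error is corrected by deleting the single erroneous fact rather than mutating good data.

```python
# Sketch of the fact-based model (hypothetical facts and field names).

facts = [
    {"id": 1, "user": "alice", "fact": ("location", "NYC"), "ts": 100},
    {"id": 2, "user": "alice", "fact": ("age", 250),        "ts": 101},  # bad entry
    {"id": 3, "user": "alice", "fact": ("age", 25),         "ts": 102},
]

# Tolerating human error: drop only the erroneous fact.
facts = [f for f in facts if f["id"] != 2]

def latest(facts, user, field):
    # The dataset is queryable at any point in its history;
    # here we take the most recent value of one field.
    matches = [f for f in facts if f["user"] == user and f["fact"][0] == field]
    return max(matches, key=lambda f: f["ts"])["fact"][1]

print(latest(facts, "alice", "age"))  # 25
```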
2.3 Graph schemas
graph schemas: capture the structure of a dataset stored using the fact-based model.
Nodes: entities
Edges: relationships between nodes
Properties: information about entities
The need for an enforceable schema: defines structure of fact.
Implement an enforceable schema using a serialization framework. A serialization framework provides a language-neutral way to define nodes, edges, and properties
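As a rough stand-in for what a serialization framework like Thrift does (the real thing defines schemas in a language-neutral IDL; the field names here are hypothetical), a Python sketch of an enforceable schema checks the structure of a fact at construction time, so malformed facts can't enter the master dataset.

```python
# Rough Python stand-in for an enforceable schema; field names hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: facts are immutable once created
class PageviewFact:
    user_id: str
    url: str
    timestamp: int

    def __post_init__(self):
        # Structural checks the schema enforces at construction time.
        if not isinstance(self.user_id, str) or not self.user_id:
            raise ValueError("user_id must be a non-empty string")
        if not isinstance(self.timestamp, int):
            raise ValueError("timestamp must be an integer")

fact = PageviewFact(user_id="alice", url="/home", timestamp=100)

try:
    PageviewFact(user_id="", url="/home", timestamp=100)
except ValueError as e:
    print("rejected:", e)
```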
Chapter 3 Data Model for Big Data: Illustration
Thrift: cannot express validation logic such as requiring a value to be non-negative
Chapter 4 Data Storage on The Batch Layer
topics:
Storage requirements for the master dataset
Distributed filesystems
Improving efficiency with vertical partitioning
4.1 Storage requirements for the master dataset
Operation | Requisite | Discussion
Write | Efficient appends of new data | The only write operation is to add new pieces of data, so it must be easy and efficient to append a new set of data objects to the master dataset
Write | Scalable storage | The batch layer stores the complete dataset -- potentially terabytes or petabytes of data. It must therefore be easy to scale the storage as your dataset grows
Read | Support for parallel processing | Constructing the batch views requires computing functions on the entire master dataset. The batch storage must consequently support parallel processing to handle large amounts of data in a scalable manner (no need for random access)
Both | Tunable storage and processing costs | Storage costs money. You may choose to compress your data to help minimize your expense, but decompressing your data during computation can affect performance. The batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs
Both | Enforceable immutability | It's critical that you're able to enforce the immutable property of your master dataset. Of course, computers by their very nature are mutable, so there will always be a way to mutate the data you store. The best you can do is put checks in place to disallow mutable operations. These checks should prevent bugs or other random errors from trampling over existing data
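The "enforceable immutability" requirement can be sketched with a hypothetical in-memory stand-in for batch storage: the only allowed write is appending new records, and overwrites or deletes raise errors so bugs can't trample existing data.

```python
# Sketch of enforceable immutability for the master dataset
# (hypothetical in-memory stand-in for a distributed filesystem).

class AppendOnlyStore:
    def __init__(self):
        self._records = []

    def append(self, record):
        # The sole supported write operation.
        self._records.append(record)

    def read_all(self):
        # Batch reads scan the whole dataset; no random access needed.
        return list(self._records)

    def overwrite(self, index, record):
        raise PermissionError("master dataset is immutable: no overwrites")

    def delete(self, index):
        raise PermissionError("master dataset is immutable: no deletes")

store = AppendOnlyStore()
store.append({"url": "/home", "ts": 100})
try:
    store.delete(0)
except PermissionError as e:
    print(e)
```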
Chapter 5 Data Storage on The Batch Layer: Illustration
HDFS
Pail
Chapter 6 Batch Layer
Recomputation vs Incremental
6.6 Low-level nature of MapReduce
While MapReduce is a great primitive for batch computation -- providing a generic, scalable, and fault-tolerant way to compute functions of large datasets -- it doesn't lend itself to particularly elegant code. MapReduce programs written manually tend to be long, unwieldy, and difficult to understand. (MapReduce is fairly low-level and ill-suited to some scenarios; in those cases the code becomes complex and hard to understand)
1. multistep computations are unnatural
2. joins are very complicated to implement manually
3. logical and physical execution tightly coupled
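A minimal word-count sketch of the MapReduce primitive (a pure-Python stand-in, not Hadoop) shows why it is generic but low-level: even this trivial job needs explicit map, shuffle, and reduce phases, and multistep computations require chaining several such jobs.

```python
# Pure-Python sketch of the MapReduce primitive (word count).
from collections import defaultdict

def map_phase(docs):
    # map: emit (word, 1) for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key (handled by the framework in real MapReduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 2
```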
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
Chapter 7 Batch Layer: Illustration
JCascalog as a practical implementation of pipe diagrams
inputs and outputs are defined via an abstraction called a tap
7.2 Common Pitfalls of data-processing tools
1. custom languages
2. poorly composable abstractions
Chapter 8 An Example Batch Layer: Architecture and Algorithm
Chapter 9 An Example Batch Layer: Implementation
Part 2 Serving Layer
Chapter 10 Serving Layer
Part 3 Speed Layer
speed layer -> [synchronously / asynchronously]
asynchronously -> queues and streaming
two paradigms of stream processing -> [one-at-a-time / micro-batched]
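The two paradigms can be contrasted with a small sketch over a hypothetical event stream: one-at-a-time processing handles each record as it arrives (lower latency), while micro-batching groups records into small fixed-size batches (higher throughput, easier exactly-once semantics).

```python
# Sketch contrasting one-at-a-time and micro-batched stream processing
# (hypothetical event stream and batch size).

events = list(range(10))
processed = []

# One-at-a-time: handle each event individually as it arrives.
for e in events:
    processed.append(e * 2)

# Micro-batched: handle events in fixed-size batches.
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
batch_results = [sum(b) for b in batches]  # one aggregate per batch

print(len(batches))   # 3 batches: sizes 4, 4, 2
print(batch_results)  # [6, 22, 17]
```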
Chapter 12 Realtime Views
speed layer is based on incremental computation instead of batch computation
responsibilities: storing the realtime views and processing the incoming data stream so as to update those views.