1. The three Vs of big data
Volume: High volumes of data, ranging from dozens of terabytes to petabytes and beyond.
Variety: Data that's organized in multiple structures, ranging from raw text (which, from a computer's perspective, has little or no discernible structure — many people call this unstructured data) to log files (commonly referred to as being semistructured) to data ordered in strongly typed rows and columns (structured data). To make things even more confusing, some data sets include portions of all three kinds of data. (This is known as multistructured data.)
Velocity: Data that enters your organization and has some kind of value for a limited window of time — a window that usually shuts well before the data has been transformed and loaded into a data warehouse for deeper analysis (for example, financial securities ticker data, which may reveal a buying opportunity, but only for a short while). The higher the volumes of data entering your organization per second, the bigger your velocity challenge.
2. NoSQL Theories
NoSQL Data Stores
NoSQL data stores originally subscribed to the notion "Just Say No to SQL" (to paraphrase an anti-drug advertising campaign from the 1980s), and they were a reaction to the perceived limitations of (SQL-based) relational databases. It's not that these folks hated SQL; they were simply tired of forcing square pegs into round holes by using relational databases to solve problems they weren't designed for. A relational database is a powerful tool, but for some kinds of data (like key-value pairs or graphs) and some usage patterns (like extremely large-scale storage), a relational database just isn't practical. And when it comes to high-volume storage, relational databases can be expensive, both in terms of database license costs and hardware costs. (Relational databases are designed to work with enterprise-grade hardware.) So, with the NoSQL movement, creative programmers developed dozens of solutions for different kinds of thorny data storage and processing problems. These NoSQL databases typically provide massive scalability by way of clustering, and are often designed to enable high throughput and low latency.
REMEMBER The name NoSQL is somewhat misleading because many databases that fit the category do have SQL support (rather than "NoSQL" support). Think of its name instead as "Not Only SQL."
The NoSQL offerings available today can be broken down into four distinct categories, based on their design and purpose:
· Key-value stores: This offering provides a way to store any kind of data without having to use a schema. This is in contrast to relational databases, where you need to define the schema (the table structure) before any data is inserted. Since key-value stores don't require a schema, you have great flexibility to store data in many formats. In a key-value store, a row simply consists of a key (an identifier) and a value, which can be anything from an integer value to a large binary data string, as shown in the sketch after this list. Many implementations of key-value stores are based on Amazon's Dynamo paper.
· Column family stores: Here you have databases in which columns are grouped into column families and stored together on disk.
TECHNICAL STUFF Strictly speaking, many of these databases aren't column-oriented, because they're based on Google's BigTable paper, which stores data as a multidimensional sorted map. (For more on the role of Google's BigTable paper on database design, see Chapter 12.)
· Document stores: This offering relies on collections of similarly encoded and formatted documents to improve efficiencies. Document stores enable individual documents in a collection to include only a subset of fields, so only the data that's needed is stored. For sparse data sets, where many fields are often not populated, this can translate into significant space savings. By contrast, empty columns in relational database tables do take up space. Document stores also enable schema flexibility, because only the fields that are needed are stored, and new fields can be added as they appear. Again, this is in contrast to relational databases, where table structures are defined up front before data is stored, and changing columns is a tedious task that impacts the entire data set.
· Graph databases: Here you have databases that store graph structures — representations that show collections of entities (vertices or nodes) and their relationships (edges) with each other. These structures make graph databases extremely well suited for storing complex relationships, like the links between all known web pages. (For example, individual web pages are nodes, and the edges connecting them are links from one page to another.) Google, of course, is all over graph technology, and invented a graph processing engine called Pregel to power its PageRank algorithm. (And yes, there's a white paper on Pregel.) In the Hadoop community, there's an Apache project called Giraph (based on the Pregel paper), which is a graph processing engine designed to process graphs stored in HDFS.
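To make the key-value idea from the list above concrete, here's a minimal sketch in Python. The class and method names (TinyKeyValueStore, put, get) are made up for illustration and don't come from any particular NoSQL product; the point is simply that each row is just a key and an arbitrary value, so rows don't have to share a schema.

```python
# Minimal sketch of a schema-free key-value store (hypothetical names,
# not modeled on any real NoSQL product). Each "row" is just a key and an
# arbitrary value, so different rows can hold completely different structures.

class TinyKeyValueStore:
    def __init__(self):
        self._data = {}          # in-memory storage; real stores persist and distribute this

    def put(self, key, value):
        self._data[key] = value  # no schema check: any value shape is accepted

    def get(self, key, default=None):
        return self._data.get(key, default)

store = TinyKeyValueStore()
store.put("user:1001", {"name": "Alice", "email": "alice@example.com"})   # a document-like value
store.put("pageviews:2023-10-01", 48213)                                  # a plain integer
store.put("raw:log-line", "127.0.0.1 - - [01/Oct/2023] GET /index.html")  # unstructured text

print(store.get("user:1001"))
```

Contrast this with a relational table, where all three of those values would have to be shoehorned into one predefined column layout (or spread across three differently designed tables).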
REMEMBER The data storage and processing options available in Hadoop are in many cases implementations of the NoSQL categories listed here. Keeping these categories in mind will help you better evaluate the solutions that are available to you and see how Hadoop can complement traditional data warehouses.
ACID versus BASE Data Stores
One hallmark of relational database systems is something known as ACID compliance. As you might have guessed, ACID is an acronym — the individual letters, meant to describe a characteristic of individual database transactions, can be expanded as described in this list:
· Atomicity: The database transaction must completely succeed or completely fail. Partial success is not allowed.
· Consistency: During the database transaction, the RDBMS progresses from one valid state to another. The state is never invalid.
· Isolation: The client's database transaction must occur in isolation from other clients attempting to transact with the RDBMS.
· Durability: The data operation that was part of the transaction must be reflected in nonvolatile storage (computer memory that can retrieve stored information even when not powered – like a hard disk) and persist after the transaction successfully completes. Transaction failures cannot leave the data in a partially committed state.
Certain use cases for RDBMSs, like online transaction processing, depend on ACID-compliant transactions between the client and the RDBMS for the system to function properly. A great example of an ACID-compliant transaction is a transfer of funds from one bank account to another. This breaks down into two operations: the originating account shows a withdrawal, and the destination account shows a deposit. Obviously, these two operations have to be tied together in a single transaction to be valid, so that if either of them fails, the whole operation fails and both balances remain valid.
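To see what being "tied together" looks like in practice, here's a minimal sketch using Python's built-in sqlite3 module, which is an ACID-compliant relational engine. The table, account names, and balances are made up for the example; the point is that the withdrawal and the deposit share one transaction, so a failure rolls both back.

```python
# Minimal sketch of an atomic funds transfer using sqlite3 (an ACID-compliant
# relational engine in the Python standard library). Accounts and balances are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 500), ("savings", 200)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers a rollback of both updates
    except ValueError:
        pass  # the withdrawal and the deposit were rolled back together

transfer(conn, "checking", "savings", 100)   # succeeds: both balances change
transfer(conn, "checking", "savings", 9999)  # fails: neither balance changes
print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [('checking', 400), ('savings', 300)]
```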
Hadoop itself has no concept of transactions (or even records, for that matter), so it clearly isn't an ACID-compliant system. Thinking more specifically about data storage and processing projects in the entire Hadoop ecosystem (we tell you more about these projects later in this chapter), none of them is fully ACID-compliant, either. However, they do reflect properties that you often see in NoSQL data stores, so there is some precedent to the Hadoop approach.
One key concept behind NoSQL data stores is that not every application truly needs ACID-compliant transactions. Relaxing certain ACID properties (and moving away from the relational model) has opened up a wealth of possibilities, which have enabled some NoSQL data stores to achieve massive scalability and performance for their niche applications. Whereas ACID defines the key characteristics required for reliable transaction processing, the NoSQL world requires different characteristics to enable flexibility and scalability. These opposing characteristics are cleverly captured in the acronym BASE:
· Basically Available: The system is guaranteed to be available for querying by all users. (No isolation here.)
· Soft State: The values stored in the system may change because of the eventual consistency model, as described in the next bullet.
· Eventually Consistent: As data is added to the system, the system's state is gradually replicated across all nodes. For example, in Hadoop, when a file is written to HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system isn't consistent.
The acronym BASE is a bit contrived, as most NoSQL data stores don't completely abandon all the ACID characteristics — it's not really the polar opposite concept that the name implies, in other words. Also, the Soft State and Eventually Consistent characteristics amount to the same thing, but the point is that by relaxing consistency, the system can scale horizontally (across many nodes) and ensure availability.
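The following toy sketch illustrates the eventual consistency idea in plain Python. The node and method names are invented for illustration, and this is not how HDFS actually replicates blocks; it just shows that a write lands on one replica first, the others catch up later, and in between a read from a lagging replica returns stale (or missing) data.

```python
# Toy illustration of eventual consistency (invented names; not HDFS internals).
# A write is applied to one replica immediately; the other replicas catch up later.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []  # replication work that hasn't been applied yet

    def write(self, key, value):
        self.replicas[0].data[key] = value   # applied to the first replica immediately
        self.pending.append((key, value))    # the others are updated later

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)  # may be stale

    def replicate(self):
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore([Replica("node1"), Replica("node2"), Replica("node3")])
store.write("block-0001", "hello")
print(store.read("block-0001", 1))  # None -- node2 hasn't caught up yet (soft state)
store.replicate()
print(store.read("block-0001", 1))  # 'hello' -- the system is now consistent
```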
CAP Theory
TECHNICAL STUFF No discussion of NoSQL would be complete without mentioning the CAP theorem, which represents the three kinds of guarantees that architects aim to provide in their systems:
· Consistency: Similar to the C in ACID, all nodes in the system have the same view of the data at any time.
· Availability: The system always responds to requests.
· Partition tolerance: The system remains online if network problems occur between system nodes.
The CAP theorem states that in distributed networked systems, architects have to choose two of these three guarantees — you can't promise your users all three. That leaves you with the three possibilities shown in Figure 11-1:
· Systems using traditional relational technologies normally aren't partition tolerant, so they can guarantee consistency and availability. In short, if one part of such a system is offline, the whole system is offline.
· Systems where partition tolerance and availability are of primary importance can't guarantee consistency, because updates (that destroyer of consistency) can be made on either side of the partition. The key-value stores Dynamo and CouchDB and the column-family store Cassandra are popular examples of partition-tolerant and available (PA) systems.
· Systems where partition tolerance and consistency are of primary importance can't guarantee availability because the systems return errors until the partitioned state is resolved.
REMEMBER Hadoop-based data stores are considered CP systems (consistent and partition tolerant). With data stored redundantly across many slave nodes, outages to large portions (partitions) of a Hadoop cluster can be tolerated. Hadoop is considered consistent because it has a central metadata store (the NameNode), which maintains a single, consistent view of the data stored in the cluster. We can't say that Hadoop guarantees availability, because if the NameNode fails, applications cannot access any data in the cluster.
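Here's a small generic sketch of that CP trade-off in Python. It is not actual Hadoop or NameNode code (the class, metadata, and error names are invented); it just shows a store that would rather reject requests during a network partition than risk serving an inconsistent answer, which is precisely the availability it gives up.

```python
# Toy illustration of a CP (consistent, partition-tolerant) design choice.
# Invented names -- not actual Hadoop or NameNode behavior.

class PartitionError(Exception):
    pass

class CPStore:
    def __init__(self):
        self.metadata = {"block-0001": ["node1", "node2", "node3"]}
        self.partitioned = False  # True when cut off from the metadata authority

    def lookup(self, block_id):
        if self.partitioned:
            # A CP system returns an error rather than risk serving a stale
            # or inconsistent answer during a network partition.
            raise PartitionError("metadata unavailable; try again later")
        return self.metadata[block_id]

store = CPStore()
print(store.lookup("block-0001"))    # works while the cluster is healthy

store.partitioned = True
try:
    store.lookup("block-0001")
except PartitionError as err:
    print("request rejected:", err)  # consistency preserved at the cost of availability
```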