Big Data Storage and Management 10 -- Introduction to Cassandra- NoSQL Database

最新推荐文章于 2024-06-01 10:35:00 发布

There Uncle

最新推荐文章于 2024-06-01 10:35:00 发布

阅读量187

点赞数

分类专栏： CASSANDRA 文章标签：数据库 big data nosql

本文链接：https://blog.csdn.net/qq_32297631/article/details/122487849

版权

CASSANDRA 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

Databases Revisited

Scalability, Performance, Relaxed Consistency, Agility, Intricacy and Necessity (SPRAIN)

Databases Revisited (cont.)

Motivation of NoSQL

• Designed to handle large amount of data across multiple servers

• There is a lot of unorganized data out there

• Easy to implement and deploy

• Mimics traditional relational database systems, but with triggers and lightweight transactions

• Raw, simple data structures

CAP Theorem Revisited

Guarantee to fulfill two out of three conditions

o Consistency

o All clients have same view of data

o Availability

o Writeable in the face of node failure

o Partition Tolerance

o Processing can continue in the face of network failure (crashed router, broken network)

Cassandra: A Quick Introduction

Apache Cassandra is

• an open source

• distributed and decentralized / distributed storage system (database)

• a wide column store database

• for managing very large amounts of structured data spread out across the world

• It provides highly available service with no single point of failure

• The current version: 4.0

• It is scalable, fault-tolerant, and consistent

• It is a column-oriented database

• Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable

• Created at Facebook, it differs sharply from relational database management systems

• Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family” data model

• Cassandra is being used by some of the biggest companies such as Facebook, Twitter, Cisco, Rackspace, ebay, Twitter, Netflix, and more

Cassandra: A Brief History

• Cassandra was created to power the Facebook Inbox Search

• Facebook open-sourced Cassandra in 2008 and became an
Apache Incubator project

• In 2010, Cassandra graduated to a top-level project, regular update and releases followed

Cassandra: An Evolution

• Google Bigtable (2006)

• Consistency model: strong

• Data Model: sparse map

• Clones: Hbase, Hypertable

• Amazon Dynamo (2007)

• O(1) DHT- distributed hash table

• Consistency Model: Client tune-able

• Clones: Riak, Voldemort

Cassandra ~= Bigtable + Dynamo

Cassandra: General Design Features

Emphasis on performance over analysis

• Still has support for analysis tools such as Hadoop

Organization

• Rows are organized into tables

• First component of a table’s primary key is the partition key

• Rows are clustered by the remaining columns of the key

• Columns may be indexed separately from the primary key

• Tables may be created, dropped, altered at runtime without blocking queries Language

• CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)

Cassandra: Some Features

Elastic scalability − Cassandra is highly scalable; it allows to add more hardware to accommodate more customers and more data as per requirement.

Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.

Fast linear-scale performance − Cassandra is linearly scalable,i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.

Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.

Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).

Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency

Flexible, Familiar interface – Cassandra Query Language is quite similar to SQL. Relatively easy query language.

Cassandra: Some Highlights

Staged Event Architecture

• A general-purpose framework for high concurrency & load conditioning.

• Decomposes applications into stages separated by queues

Data Replication

• Configurable replication factor • Replica placement strategy

• Rack unaware →Simple Strategy

• Rack aware → Old Network Topology Strategy

• Data center shard → Network Topology Strategy

Peer-to-Peer Cluster

• Decentralized design

• Each node has the same role

• No single point of failure

• Avoids issues of master-slave DBMS’s

• No bottlenecking

Fault Tolerant/Durability

• Failures happen all the time with multiple nodes

• Replication

• Data is automatically replicated to multiple nodes

• Allows failed nodes to be immediately replaced

• Distribution of data to multiple data centers

• An entire data center can go down without data loss occurrin

Cassandra Architecture

• Partitioning

• How data is partitioned across nodes

• Replication

• How data is duplicated across nodes

• Cluster Membership

• How nodes are added, deleted to the cluster

Gossip Protocol

• Network Communication protocols inspired for real life rumour spreading.

• Periodic, Pairwise, inter-node communication.

• Low frequency communication ensures low cost.

• Random selection of peers.

• Example – Node A wish to search for pattern in data • Round 1 – Node A searches locally and then gossips with node B.

• Round 2 – Node A, B gossips with C and D.

• Round 3 – Nodes A, B, C and D gossips with 4 other nodes ……

• Round by round doubling makes protocol very robust.

Write Operation Stages

o Logging data in the Commit Log

o Writing data to the Mem-Table

o Flushing data from the Mem-Table

o Storing data on disk in SSTables

Write Operation

o Commit Log

o First place a write is recorded

o Crash recovery mechanism

o Write not successful until recorded in commit log

o Once reecorded in commit log, data is written to Mem-Table

o Mem-Table

o Data structure in memory

o Once Mem-Table size reaches a threshold, it is flushed (appended) to SSTable

o Several may exist at once (1 current, any others waiting to be flushed)

o First place read operations look for data

o SSTable

o Kept on disk

o Immutable once written

o Periodically compacted for performance

Cassandra Data Model

Data Model

o Table is a multi-dimensional map indexed by key (row key)

o Columns are grouped into Column Families

o 2 Types of Column Families

o Simple

o Super (nested Column Families)

o Each Column has

o Name

o Value

o Timestamp

Key-Value Model

o Cassandra is a column-oriented NoSQL system

o Column families: Sets of keyvalue pairs

o Column family as a table and keyvalue pairs as a row (using relational database analogy)

o A row is a collection of columns labeled with a name

Cassandra Row

o The value of a row is itself a sequence of key-value pairs

o Such nested key-value pairs are columns

o Key = column name

o A row must contain at least 1 column

Key Space

o A Key Space is a group of column families together

o It is only a logical grouping of column families and provides an isolated scope for names

Cassandra Query Language (CQL)

Interface Data Definition Language (DDL)

o Creating a keyspace - namespace of tables

CREATE KEYSPACE “Key Space Name”

WITH replication = {‘class’: ’Strategy name’, ‘replication_factor’: No of replicas};

o To use namespace: USE “Key Space Name”;

o Creating tables:

CREATE TABLE table name (

column 1 name data type PRIMARY KEY,

column 2 name data type,

column 1 name data type, )

Interface Data Manipulation Language (DML)

o Inserting data

INSERT INTO table name (, , …) VALUES (, , …);

o Querying tables

SELECT expression reads one or more records from Cassandra column family and returns a result-set of rows

SELECT * FROM table; SELECT column FROM table WHERE condition;

Some Considerations

o Cassandra is designed as a distributed database management system

o Use it when you have a lot of data spread across multiple servers

o Cassandra write performance is always excellent, but read performance depends on write patterns

o It is important to spend enough time to design proper schema around the query pattern

o Having a high-level understanding of some internals is a plus o Ensures a design of a strong application built atop Cassandra.

Cassandra: Advantages

o Perfect for time-series data
o High performance

o Decentralization

o Nearly linear scalability

o Replication support

o No single points of failure

o MapReduce support

Cassandra: Disadvantages

o No referential integrity

o No concept of JOIN

o Querying options for retrieving data are limited

o Sorting data is a design decision

o No GROUP BY

o No support for atomic operations

o if operation fails, changes can still occur

o First think about queries, then about data model