Big Data Storage and Management 10 -- Introduction to Cassandra, a NoSQL Database

Databases Revisited

Scalability, Performance, Relaxed Consistency, Agility, Intricacy and Necessity (SPRAIN)

Databases Revisited (cont.)

 Motivation of NoSQL

• Designed to handle large amounts of data across multiple servers

• There is a lot of unorganized data out there

• Easy to implement and deploy

• Mimics traditional relational database systems, but with triggers and lightweight transactions

• Raw, simple data structures

CAP Theorem Revisited

A distributed system can guarantee at most two of the following three conditions at the same time

o Consistency

      o All clients have same view of data

o Availability

      o Writeable in the face of node failure

o Partition Tolerance

     o Processing can continue in the face of network failure (crashed router, broken network)

Cassandra: A Quick Introduction

Apache Cassandra is

• an open source, distributed and decentralized storage system (database)

• a wide column store database

• for managing very large amounts of structured data spread out across the world

• It provides highly available service with no single point of failure

     • The current version: 4.0

• It is scalable, fault-tolerant, and tunably consistent

• It is a column-oriented database

• Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable

• Created at Facebook, it differs sharply from relational database management systems

• Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model

• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more

Cassandra: A Brief History

• Cassandra was created to power the Facebook Inbox Search

• Facebook open-sourced Cassandra in 2008, and it then became an Apache Incubator project

• In 2010, Cassandra graduated to a top-level Apache project; regular updates and releases followed

Cassandra: An Evolution

• Google Bigtable (2006)

      • Consistency model: strong

      • Data Model: sparse map

      • Clones: HBase, Hypertable

• Amazon Dynamo (2007)

     • O(1) DHT (distributed hash table)

     • Consistency model: client-tunable

     • Clones: Riak, Voldemort

Cassandra ~= Bigtable + Dynamo
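Client-tunable consistency in Dynamo-style systems is commonly summarized by a simple overlap rule: with N replicas, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to share at least one replica when R + W > N. The following is a minimal sketch of that arithmetic only; the function names and the choice of N = 3 are assumptions made for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


N = 3  # replication factor, assumed for illustration
print(quorum(N))
print(reads_see_latest_write(N, quorum(N), quorum(N)))  # quorum writes + quorum reads
print(reads_see_latest_write(N, 1, 1))                  # one replica each: faster, may be stale
```

Lower values of R and W favor availability and latency; higher values favor consistency, which is the trade-off the client gets to tune.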

Cassandra: General Design Features

Emphasis on performance over analysis

     • Still has support for analysis tools such as Hadoop

Organization

• Rows are organized into tables

• First component of a table’s primary key is the partition key

• Rows are clustered by the remaining columns of the key

• Columns may be indexed separately from the primary key

• Tables may be created, dropped, altered at runtime without blocking queries

Language

• CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve)
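The roles of the partition key and the clustering columns can be pictured with a small in-memory model. This is only a sketch under assumptions: the table, with a hypothetical primary key of (sensor_id, reading_time), and the helper names are invented for illustration, and a real cluster spreads partitions over many nodes.

```python
import bisect
from collections import defaultdict

# Hypothetical table with PRIMARY KEY (sensor_id, reading_time):
# sensor_id is the partition key, reading_time is the clustering column.
partitions = defaultdict(list)  # partition key -> rows kept sorted by clustering key


def insert_row(sensor_id: str, reading_time: int, value: float) -> None:
    """Place the row in its partition, keeping clustering order."""
    bisect.insort(partitions[sensor_id], (reading_time, value))


def read_range(sensor_id: str, start: int, end: int) -> list:
    """Return rows of one partition with start <= reading_time < end."""
    rows = partitions[sensor_id]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]


insert_row("s1", 30, 21.5)
insert_row("s1", 10, 20.0)
insert_row("s1", 20, 20.7)
insert_row("s2", 15, 18.2)
print(read_range("s1", 10, 30))
```

Because rows inside a partition are already ordered by the clustering column, a range read over one partition is a contiguous slice, which is why time-series data fits this layout well.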

Cassandra: Some Features

Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as required.

Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.

Fast linear-scale performance − Cassandra is linearly scalable, i.e., throughput increases as you add nodes to the cluster, so it maintains a quick response time.

Flexible data storage − Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

Easy data distribution − Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.

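Transaction support − Cassandra does not offer full ACID transactions in the relational sense. Writes are atomic and isolated within a single partition and durable once recorded in the commit log, consistency is tunable per operation, and lightweight transactions cover compare-and-set cases.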

Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs very fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

Flexible, Familiar interface – Cassandra Query Language is quite similar to SQL. Relatively easy query language.

Cassandra: Some Highlights

Staged Event Architecture

• A general-purpose framework for high concurrency & load conditioning.

• Decomposes applications into stages separated by queues
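The idea of stages separated by queues can be sketched with two worker threads connected by queues. This is a generic illustration of the pattern, not Cassandra's internal code; the stage functions (parse, then double) and all names are assumptions.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Run one stage: take items from inbox, apply work, pass results on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(work(item))


def run_pipeline(items):
    """Push items through two stages (parse, then double) joined by queues."""
    # Unbounded queues keep this sketch simple; a bounded queue would add
    # back-pressure, which is the load-conditioning part of the design.
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(int, q_in, q_mid)),
        threading.Thread(target=stage, args=(lambda x: x * 2, q_mid, q_out)),
    ]
    for w in workers:
        w.start()
    for item in items:
        q_in.put(item)
    q_in.put(STOP)
    for w in workers:
        w.join()
    results = []
    while True:
        out = q_out.get()
        if out is STOP:
            break
        results.append(out)
    return results


print(run_pipeline(["1", "2", "3"]))
```

Each stage can be given its own thread pool and queue limit, so a slow stage can be sized independently of the others.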

Data Replication

• Configurable replication factor

• Replica placement strategy

     • Rack unaware → SimpleStrategy

     • Rack aware → OldNetworkTopologyStrategy

     • Data center shard → NetworkTopologyStrategy

Peer-to-Peer Cluster

• Decentralized design

    • Each node has the same role

• No single point of failure

   • Avoids issues of master-slave DBMS’s

• No bottlenecking

Fault Tolerant/Durability

• Failures happen all the time with multiple nodes

• Replication

   • Data is automatically replicated to multiple nodes

   • Allows failed nodes to be immediately replaced

• Distribution of data to multiple data centers

    • An entire data center can go down without data loss occurring

Cassandra Architecture

• Partitioning

    • How data is partitioned across nodes

• Replication

    • How data is duplicated across nodes

• Cluster Membership

    • How nodes are added to or removed from the cluster
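Partitioning and replication can be pictured together as a token ring: each key is hashed to a position on the ring, the first node at or after that position stores it, and the next nodes clockwise hold the extra copies. The sketch below assumes one token per node, uses MD5 purely as a stand-in hash, and ignores racks and data centers; the class and node names are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is a stand-in hash for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor):
        self.replication_factor = replication_factor
        # Each node owns one token; sort so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str) -> list:
        """First node at or after the key's token, plus the next nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Adding a node inserts one more token into the sorted list, so only the keys between that token and its predecessor change owner, which is what keeps cluster membership changes cheap.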

 

               

Gossip Protocol

• Network communication protocol inspired by real-life rumour spreading.

• Periodic, Pairwise, inter-node communication.

• Low frequency communication ensures low cost.

• Random selection of peers.

• Example – Node A wishes to search for a pattern in data

• Round 1 – Node A searches locally and then gossips with node B.

• Round 2 – Nodes A and B gossip with C and D.

• Round 3 – Nodes A, B, C and D gossip with 4 other nodes ……

• Round by round doubling makes protocol very robust.
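The round-by-round spread can be simulated in a few lines. This is a simplified push-only sketch, assuming every informed node contacts one uniformly random peer per round. Because two nodes may pick the same peer, the informed set at most doubles each round rather than doubling exactly as in the idealized example above. The function name and parameters are illustrative.

```python
import random


def gossip_rounds(num_nodes: int, seed: int = 0) -> list:
    """Return the number of informed nodes after each round, starting with one."""
    rng = random.Random(seed)
    informed = {0}  # node 0 plays the role of node A
    history = [len(informed)]
    while len(informed) < num_nodes:
        newly_informed = set()
        for node in informed:
            peer = rng.randrange(num_nodes)  # random selection of a peer
            if peer != node:
                newly_informed.add(peer)
        informed |= newly_informed
        history.append(len(informed))
    return history


print(gossip_rounds(16))
```

The returned list shows how many nodes know the information after each round; its length minus one is the number of rounds needed to reach every node.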

 

 Write Operation Stages

o Logging data in the Commit Log

o Writing data to the Mem-Table

o Flushing data from the Mem-Table

o Storing data on disk in SSTables

Write Operation

o Commit Log

    o First place a write is recorded

   o Crash recovery mechanism

   o Write not successful until recorded in commit log

   o Once recorded in commit log, data is written to Mem-Table

 o Mem-Table

    o Data structure in memory

    o Once Mem-Table size reaches a threshold, it is flushed (appended) to SSTable

    o Several may exist at once (1 current, any others waiting to be flushed)

    o First place read operations look for data

o SSTable

    o Kept on disk

    o Immutable once written

    o Periodically compacted for performance
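The stages above can be tied together in a toy model of a single node. It is a sketch under stated assumptions: everything lives in memory, the flush threshold counts keys rather than bytes, and the class and method names are invented for illustration.

```python
class ToyNode:
    """Toy model of the write path: commit log -> Mem-Table -> SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit  # flush threshold in keys (assumed)
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # in-memory structure holding recent writes
        self.sstables = []    # immutable tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first
        self.memtable[key] = value            # 2. then update the Mem-Table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # 3-4. store the Mem-Table as a new immutable SSTable, sorted by key
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:               # Mem-Table is checked first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable to oldest
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)  # reaches the threshold and triggers a flush
node.write("a", 3)  # newer value stays in the Mem-Table for now
print(node.read("a"), node.read("b"))
```

Writes only append to the log and update memory, which is why they are cheap; reads may have to consult several SSTables, which is what periodic compaction reduces.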

Cassandra Data Model

Data Model

o Table is a multi-dimensional map indexed by key (row key)

o Columns are grouped into Column Families

o 2 Types of Column Families

    o Simple

    o Super (nested Column Families)

o Each Column has

     o Name

     o Value

     o Timestamp

Key-Value Model

o Cassandra is a column-oriented NoSQL system

o Column families: Sets of key-value pairs

   o Column family as a table and key-value pairs as a row (using relational database analogy)

o A row is a collection of columns labeled with a name

 

Cassandra Row

o The value of a row is itself a sequence of key-value pairs

o Such nested key-value pairs are columns

o Key = column name

o A row must contain at least 1 column

Key Space

o A Key Space is a group of column families together

o It is only a logical grouping of column families and provides an isolated scope for names
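The nesting described in this section (key space, column family, row, column) maps naturally onto nested dictionaries. The sketch below is an in-memory illustration only; the helper names and sample data are assumptions. It also shows the role of the column timestamp: when the same column is written twice, the value with the newest timestamp is the one kept.

```python
import time

# key space -> column family -> row key -> column name -> (value, timestamp)
keyspaces = {}


def put(keyspace, family, row_key, column, value, timestamp=None):
    """Store a column; the write with the newest timestamp wins."""
    ts = time.time() if timestamp is None else timestamp
    row = keyspaces.setdefault(keyspace, {}).setdefault(family, {}).setdefault(row_key, {})
    current = row.get(column)
    if current is None or ts >= current[1]:
        row[column] = (value, ts)


def get(keyspace, family, row_key, column):
    """Return the stored value, or None if the column is absent."""
    cell = keyspaces.get(keyspace, {}).get(family, {}).get(row_key, {}).get(column)
    return None if cell is None else cell[0]


put("shop", "users", "u1", "name", "Ann", timestamp=1)
put("shop", "users", "u1", "name", "Anna", timestamp=3)
put("shop", "users", "u1", "name", "Old", timestamp=2)  # older write is ignored
print(get("shop", "users", "u1", "name"))
```

Rows in the same column family need not share the same set of columns, since each row is just its own map of column names to values.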

Cassandra Query Language (CQL)

Interface Data Definition Language (DDL) 

o Creating a keyspace - namespace of tables

  CREATE KEYSPACE keyspace_name

   WITH replication = {'class': 'strategy_name', 'replication_factor': number_of_replicas};

o To use namespace:  USE keyspace_name;

o Creating tables:

CREATE TABLE table_name (

column_1_name data_type PRIMARY KEY,

column_2_name data_type,

column_3_name data_type );

Interface Data Manipulation Language (DML)

 o Inserting data

      INSERT INTO table_name (column_1_name, column_2_name, …) VALUES (value_1, value_2, …);

o Querying tables

A SELECT statement reads one or more records from a Cassandra column family and returns a result set of rows

SELECT * FROM table_name;

SELECT column_name FROM table_name WHERE condition;
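To make the templates above concrete, the sketch below fills them in for a hypothetical key space `demo` and table `users` and prints the resulting CQL statements. The helper functions, names, and sample values are assumptions for illustration. The strings are built by plain formatting only to show the statement shapes; an application would normally send statements through a client driver with bound parameters.

```python
def create_keyspace(name, strategy, replication_factor):
    """Build a CREATE KEYSPACE statement."""
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': '{strategy}', 'replication_factor': {replication_factor}}};"
    )


def create_table(name, columns, primary_key):
    """Build a CREATE TABLE statement from {column: type} and a primary-key column."""
    parts = [f"{col} {cql_type}" for col, cql_type in columns.items()]
    parts.append(f"PRIMARY KEY ({primary_key})")
    return f"CREATE TABLE {name} ({', '.join(parts)});"


statements = [
    create_keyspace("demo", "SimpleStrategy", 1),
    "USE demo;",
    create_table("users", {"user_id": "int", "name": "text", "city": "text"}, "user_id"),
    "INSERT INTO users (user_id, name, city) VALUES (1, 'Ann', 'Oslo');",
    "SELECT name FROM users WHERE user_id = 1;",
]
for cql in statements:
    print(cql)
```

The SELECT filters on `user_id`, the primary key, which is the kind of condition a table supports without additional indexes.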

Some Considerations

o Cassandra is designed as a distributed database management system

o Use it when you have a lot of data spread across multiple servers

o Cassandra write performance is always excellent, but read performance depends on write patterns

o It is important to spend enough time to design proper schema around the query pattern

o Having a high-level understanding of some internals is a plus

    o It helps you design a strong application built atop Cassandra.

Cassandra: Advantages

o Perfect for time-series data
o High performance

o Decentralization

o Nearly linear scalability

o Replication support

o No single points of failure

o MapReduce support

Cassandra: Disadvantages

o No referential integrity

o No concept of JOIN

o Querying options for retrieving data are limited

o Sorting data is a design decision

o No GROUP BY

o No atomic multi-row transactions with rollback

    o If an operation fails, some of its changes can still take effect

o First think about queries, then about data model

 

 

 
