NoSQL databases: Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Latest recommended article published on 2024-09-21 22:21:34

zhangxinrun_业余erlang

Categories: Erlang, Mass storage and algorithms. Tags: couchdb cassandra redis mongodb hbase nosql databases

Reposted from: http://hi.baidu.com/yandavid/blog/item/04f0d1952850ab52d1135e94.html

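Several important theories in the NoSQL world

1. The CAP theorem

The CAP theorem is without doubt the most important reason behind the technology shift from relational database systems to NoSQL systems.

The CAP theorem (Consistency, Availability, Partition tolerance) states that a distributed system can satisfy at most two of consistency, availability, and partition tolerance at the same time, never all three. There is therefore no point in spending time and effort trying to satisfy all three.

Put simply, the proof shows that once partition tolerance is guaranteed, consistency and availability cannot both be achieved: strong consistency costs availability, and high availability costs consistency. (Why must partition tolerance be guaranteed? Because as networked applications keep growing, partitioning data across machines has become a basic requirement.)

Proof: Brewer's CAP Theorem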

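2. Consistent hashing

Not much needs saying here; anyone who has used MC (memcached) should already be familiar with it. In short, nodes and keys are hashed onto the same ring and each key belongs to the first node found clockwise from it, so adding or removing a node remaps only the neighbouring keys.

[Figure: consistent hashing diagram]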
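As a minimal sketch of the idea, assuming MD5 as the ring hash and 100 virtual points per node (both arbitrary choices for illustration), a ring can be kept as a sorted list of positions. The class name `HashRing` and the node names are hypothetical and do not come from memcached or any particular client library.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 128-bit integer via MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual points per physical node
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        """Take all of the node's virtual points off the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def get(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.get(k) for k in keys}
    ring.add("cache-d")
    moved = sum(1 for k in keys if ring.get(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved after adding a node")
```

The point of the demo at the bottom is that only a fraction of keys change owner when a node joins, and every key that moves lands on the new node; with a plain `hash(key) % node_count` scheme most keys would move.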

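3. MapReduce

MapReduce splits a computation into two parts, Map and Reduce. Put simply, Map splits a large computation into pieces so that they can be processed in parallel, and Reduce combines the results of those parallel computations into a final output.

For a more detailed description, see Wikipedia: MapReduce

Google's MapReduce paper (PDF): MapReduce: Simplified Data Processing on Large Clusters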
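The Map and Reduce split described above can be sketched with the classic word-count example. This is a single-process illustration, not a distributed framework: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch, and the grouping step that a real framework performs between the two phases is written out by hand.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(chunks):
    # Each chunk could be mapped on a different machine; here we loop serially.
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

For instance, `word_count(["a b a", "b c"])` returns `{"a": 2, "b": 2, "c": 1}`. Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across machines.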

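4. Gossip

Gossip is a protocol used in P2P systems (not the currently popular Gossip Girl). In an N-node cluster, each node repeatedly exchanges data with other nodes, typically a few randomly chosen peers at a time, so that a change eventually reaches all of the other N-1 nodes and the data is synchronized. Gossip does not require a master in the cluster, spreads a change made on one node to every other node the way a virus spreads, and makes the cost of adding or removing a node close to zero.

For a more detailed description, see Wikipedia: Gossip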
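The virus-like spread can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the protocol of any particular database: rounds are synchronous, every informed node pushes to `fanout` peers chosen uniformly at random, and nodes never fail. The function name `gossip_rounds` and its parameters are hypothetical.

```python
import random


def gossip_rounds(n_nodes: int, fanout: int = 1, seed: int = 0) -> int:
    """Simulate push gossip and return the rounds needed to reach every node."""
    rng = random.Random(seed)
    informed = {0}                     # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        newly_informed = set()
        for node in informed:
            # Each informed node pushes to `fanout` random peers other than itself.
            peers = [p for p in range(n_nodes) if p != node]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (8, 64, 512):
        print(size, "nodes:", gossip_rounds(size), "rounds")
```

With a fanout of 1 the informed set can at most double per round, so at least log2(N) rounds are needed; the simulation shows how close random peer selection gets to that bound without any master coordinating the spread.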

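Although SQL databases have dominated for 15 years, that era is due to end; it is only a matter of time. With NoSQL now at its height, NoSQL products are flourishing, but each has its own characteristics, with strengths as well as scenarios it does not suit. This article analyzes Cassandra, MongoDB, CouchDB, Redis, Riak, and HBase from several angles.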

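CouchDB is written in Erlang, is released under the Apache license, and uses the HTTP/REST protocol. Its main strengths are data consistency and ease of use, and it allows multi-site deployments. CouchDB is best suited to applications whose data accumulates and changes only occasionally, such as CRM and CMS systems.

Redis is written in C/C++, is released under the BSD license, and uses a Telnet-like protocol. Its main strength is that it is extremely fast. Redis suits applications whose data set changes frequently, but it is memory-hungry, so the data set should mostly fit in memory. Typical uses include financial data such as stock prices, real-time analytics, real-time data collection, and real-time communication.

MongoDB is written in C++, is released under the AGPL (drivers: Apache), and uses a custom binary (BSON) protocol. MongoDB suits cases that need dynamic queries and where defining indexes works better than writing map/reduce functions. It is also a fit if you wanted CouchDB but your data changes so often that it would fill up the disks. MongoDB can be used for anything you would otherwise do with MySQL or PostgreSQL.

Cassandra is written in Java, is released under the Apache license, and uses a custom binary (Thrift) protocol. Cassandra suits workloads with more writes than reads, for example industries such as banking and finance that need real-time data analysis.

Riak is written in Erlang and C, with some JavaScript. It is released under the Apache license and uses the HTTP/REST protocol. Riak is highly fault tolerant and is very similar to Cassandra. It is a good choice when you need high scalability and fault tolerance, but multi-site deployment is a paid feature. Riak suits point-of-sale data collection, factory control systems, and other places where downtime is not acceptable.

HBase is written in Java, is released under the Apache license, and uses the HTTP/REST protocol. HBase supports billions of rows and millions of columns. If you love BigTable and need a database that offers random, real-time read/write access to your massive data, HBase is a good choice. HBase is currently used for the Facebook Messaging database.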

CouchDB

- Written in: Erlang
- Main point: DB consistency, ease of use
- License: Apache
- Protocol: HTTP/REST
- Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
- MVCC - write operations do not block reads
- Previous versions of documents are available
- Crash-only (reliable) design
- Needs compacting from time to time
- Views: embedded map/reduce
- Formatting views: lists & shows
- Server-side document validation possible
- Authentication possible
- Real-time updates via _changes (!)
- Attachment handling
- thus, CouchApps (standalone js apps)
- jQuery library included

Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.

For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.

Redis

- Written in: C/C++
- Main point: Blazing fast
- License: BSD
- Protocol: Telnet-like
- Disk-backed in-memory database, but since 2.0, it can swap to disk.
- Master-slave replication
- Simple keys and values, but complex operations like ZREVRANGEBYSCORE
- INCR & co (good for rate limiting or statistics)
- Has sets (also union/diff/inter)
- Has lists (also a queue; blocking pop)
- Has hashes (objects of multiple fields)
- Of all these databases, only Redis does transactions (!)
- Values can be set to expire (as in a cache)
- Sorted sets (high score table, good for range queries)
- Pub/Sub and WATCH on data changes (!)

Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).

For example: Stock prices. Analytics. Real-time data collection. Real-time communication.
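The note that INCR is "good for rate limiting" can be sketched as a fixed-window limiter. To keep the block runnable without a server, it does not talk to Redis at all: `TinyCounterStore` is a hypothetical in-memory stand-in written for this illustration that only imitates the behaviour of the INCR and EXPIRE commands, and `allow_request`, `limit`, and `window_seconds` are likewise illustrative names.

```python
import time


class TinyCounterStore:
    """In-memory stand-in for the two Redis commands used below (INCR, EXPIRE)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}    # key -> [count, expiry time or None]

    def _live(self, key):
        """Return the entry for key, dropping it first if it has expired."""
        entry = self._data.get(key)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[key]
            return None
        return entry

    def incr(self, key):
        """Increment the counter, creating it at 0 if missing, like INCR."""
        entry = self._live(key)
        if entry is None:
            entry = self._data[key] = [0, None]
        entry[0] += 1
        return entry[0]

    def expire(self, key, seconds):
        """Set a time-to-live on an existing key, like EXPIRE."""
        entry = self._live(key)
        if entry is not None:
            entry[1] = self._clock() + seconds


def allow_request(store, client_id, limit=5, window_seconds=60):
    """Fixed-window limiter: at most `limit` requests per window per client."""
    key = f"rate:{client_id}"
    count = store.incr(key)
    if count == 1:
        store.expire(key, window_seconds)   # start the window on the first hit
    return count <= limit
```

The design relies on two properties the feature list mentions: counters that increment atomically and values that can be set to expire. Once the key expires, the next request starts a fresh window with a count of 1.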

MongoDB

- Written in: C++
- Main point: Retains some friendly properties of SQL. (Query, index)
- License: AGPL (Drivers: Apache)
- Protocol: Custom, binary (BSON)
- Master/slave replication
- Queries are javascript expressions
- Run arbitrary javascript functions server-side
- Better update-in-place than CouchDB
- Sharding built-in
- Uses memory mapped files for data storage
- Performance over features
- After crash, it needs to repair tables
- Better durability coming in V1.8

Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.

For example: For all things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.

Cassandra

- Written in: Java
- Main point: Best of BigTable and Dynamo
- License: Apache
- Protocol: Custom, binary (Thrift)
- Tunable trade-offs for distribution and replication (N, R, W)
- Querying by column, range of keys
- BigTable-like features: columns, column families
- Writes are much faster than reads (!)
- Map/reduce possible with Apache Hadoop
- I admit being a bit biased against it, because of the bloat and complexity it has partly because of Java (configuration, seeing exceptions, etc)

Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

For example: Banking, financial industry (though not necessarily for financial transactions, but these industries are much bigger than that.) Writes are faster than reads, so one natural niche is real time data analysis.
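The "(N, R, W)" trade-off listed for Cassandra (and for Riak) refers to Dynamo-style quorums: N replicas hold each item, a write waits for W of them, and a read consults R of them. A read is guaranteed to see the latest acknowledged write when R + W > N, because any read set must then share at least one replica with any write set. The brute-force check below is only a sketch to make that rule concrete; `quorums_overlap` is a hypothetical helper, not part of either database's API.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3), (5, 2, 3)]:
        print(f"N={n} R={r} W={w}: overlap={quorums_overlap(n, r, w)}, "
              f"R+W>N={r + w > n}")
```

Lower R and W values make operations faster and more available at the cost of possibly stale reads, which is the tunable part of the trade-off.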

Riak

- Written in: Erlang & C, some Javascript
- Main point: Fault tolerance
- License: Apache
- Protocol: HTTP/REST
- Tunable trade-offs for distribution and replication (N, R, W)
- Pre- and post-commit hooks, for validation and security.
- Built-in full-text search
- Map/reduce in javascript or Erlang
- Comes in "open source" and "enterprise" editions

Best used: If you want something Cassandra-like (Dynamo-like), but no way you're gonna deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.

For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt.

HBase

(With the help of ghshephard)

- Written in: Java
- Main point: Billions of rows X millions of columns
- License: Apache
- Protocol: HTTP/REST (also Thrift)
- Modeled after BigTable
- Map/reduce with Hadoop
- Query predicate push down via server side scan and get filters
- Optimizations for real time queries
- A high performance Thrift gateway
- HTTP supports XML, Protobuf, and binary
- Cascading, hive, and pig source and sink modules
- Jruby-based (JIRB) shell
- No single point of failure
- Rolling restart for configuration changes and minor upgrades
- Random access performance is like MySQL

Best used: If you're in love with BigTable. :) And when you need random, realtime read/write access to your Big Data.

For example: Facebook Messaging Database (more general example coming soon)