Apache Cassandra: currently the only NoSQL database architecture built for big data, with a brief explanation of NoSQL and Big Data

lionzl

March 5, 2013

Please credit the source when reposting: www.alexclouds.net

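This is not clickbait. Let's start with Apache Cassandra, an open-source distributed database management system. It was originally developed at Facebook to store very large volumes of data and was later released as an open-source project.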

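Cassandra is now used by Netflix, eBay, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and others. What these companies have in common is a large volume of active data to manage. The largest known Cassandra deployment belongs to a single user and holds more than 300 TB of data across more than 400 machines.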

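Twitter engineer Ryan King once put it plainly on a blog: "The company's analytics, operations, and infrastructure teams are working together on Cassandra to build a large-scale real-time analytics product for use by both Twitter's back end and its customers."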

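When we first came across Cassandra two or three years ago, the ideas of cloud computing and big data had not yet taken hold. After version 0.6 came out, a few vendors in China ran dedicated tests of Cassandra's read and write performance. The tests were run by a Taobao team, and the results showed excellent write performance and relatively poor read performance. I have not seen those results updated for any later release. The current version is Apache Cassandra 1.2.2 (released on 2013-02-25), so the improvements since then should be substantial.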

Apache Cassandra has the following characteristics, which I summarize briefly:

1. Cassandra cluster scalability

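Cassandra is a hybrid non-relational database, similar to Google's BigTable. Its defining trait is that it is not a single database but a distributed network service made up of many database nodes. A write to Cassandra is replicated to other nodes, and a read is routed to one of the nodes to be served. Scaling a Cassandra cluster is fairly simple: just add nodes to the cluster.
Cassandra scales simply and transparently across multiple machines, which can be a cluster of inexpensive hardware, with no need for expensive servers or SAN storage. Just as important, the ability to transparently grow and shrink the cluster on demand lets a company make better use of the cloud's elasticity and add only as much computing power as a small workload needs. MySQL 5.6 cannot do this.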
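
To make the idea of replicated writes and routed reads concrete, here is a minimal Python sketch of a ring of nodes. It is a toy model under stated assumptions, not Cassandra's actual partitioner: the class name ToyRing, the MD5-based token, and the node names are all made up for illustration.

```python
import bisect
import hashlib


class ToyRing:
    """Toy ring of nodes. Illustrative only; not Cassandra's real partitioner."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        # Each node gets a ring position (token) derived from its name.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas_for(self, key):
        """Nodes that hold a copy of `key`: the next nodes clockwise on the ring."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_right(tokens, self._token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def add_node(self, node):
        """Scaling out is just placing one more node on the ring."""
        bisect.insort(self.ring, (self._token(node), node))


if __name__ == "__main__":
    ring = ToyRing(["node-a", "node-b", "node-c"], replication_factor=2)
    print(ring.replicas_for("user:1234"))  # a write goes to these nodes
    ring.add_node("node-d")
    print(ring.replicas_for("user:1234"))  # ownership may shift after scaling
```

A write would be sent to every node returned by replicas_for, and a read could be served by any one of them, which is why adding a node only moves responsibility for part of the key range.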

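2. Cassandra and MongoDB

Cassandra offers richer functionality than Dynamo (a distributed key-value store), but its support for rich data falls short of the document store MongoDB. MongoDB is an open-source product that sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database. Its data structures are very loose, using BSON, a JSON-like format, so it can store fairly complex data types. Supporting complex data types in Cassandra is more cumbersome. As a result, more and more customers are moving to MongoDB.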

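3. Cross-datacenter clustering

Cassandra's adoption keeps rising, and one reason is its ability to form a single cluster across multiple datacenters. Tunable consistency lets Cassandra replicate synchronously within a datacenter while replicating asynchronously to other datacenters. Master-based replication, by contrast, tends to add significant latency because of its constraints, and even multi-master replication in relational databases cannot solve this, because two-phase commit also needs multiple round trips. With masterless replication across datacenters, Cassandra can fail over an entire datacenter seamlessly and repair the out-of-sync machines once power or connectivity is restored.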
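
A common way to reason about tunable consistency is simple replica-count arithmetic: with N replicas, a quorum is floor(N/2) + 1, and a read overlaps the latest acknowledged write whenever R + W > N. The Python sketch below only performs that arithmetic; the function names and the datacenter layout (dc1, dc2 with three replicas each) are assumptions for illustration, and nothing here talks to Cassandra.

```python
def quorum(replicas):
    """Smallest majority among `replicas` copies."""
    return replicas // 2 + 1


def read_sees_latest_write(n, w, r):
    """True when any r replicas read must include one of the w that acknowledged."""
    return r + w > n


def acks_before_replying(replicas_per_dc, local_dc):
    """Acknowledgements awaited when only the local datacenter is synchronous.

    Replicas in the other datacenters still receive the write, but the
    client is not kept waiting for them.
    """
    return quorum(replicas_per_dc[local_dc])


if __name__ == "__main__":
    layout = {"dc1": 3, "dc2": 3}  # hypothetical replica counts
    print(acks_before_replying(layout, "dc1"))
    print(read_sees_latest_write(n=3, w=2, r=2))
    print(read_sees_latest_write(n=3, w=1, r=1))
```

Waiting only for a local majority keeps the cross-datacenter round trip out of the write latency, which is the contrast with two-phase commit drawn above.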

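4. Performance

As mentioned at the start of this article, the test results showed that Cassandra's write performance is excellent. B-tree based storage engines (InnoDB and MyISAM) are in fact a poor fit for both traditional disks and SSDs, because they generate many small random writes. Why, then, are relational databases still so dominant? Mostly because of SQL semantics, and those semantics carry a cost: an INSERT or UPDATE, for example, must first read the row to determine whether it exists. This is why MySQL's write performance drops as the data set grows. Even with a moderately sized data set, the effect gradually shows up under a mixed workload.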

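Data model structure

Cassandra's data model is a four- or five-dimensional model based on column families (Column Family). It borrows data structures and features from Amazon's Dynamo and Google's BigTable, and stores data using Memtables and SSTables. Before Cassandra writes data, it first records the write in a log (the CommitLog); the data is then written to the Memtable of the corresponding Column Family. A Memtable is an in-memory structure that keeps data sorted by key. When certain conditions are met, the Memtable's contents are flushed to disk in bulk and stored as an SSTable.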
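
The write path described above (CommitLog first, then Memtable, then a bulk flush to an SSTable) can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the class ToyStore, the flush_threshold trigger, and the read method are hypothetical, and the real storage engine is far more involved.

```python
class ToyStore:
    """Toy model of the CommitLog -> Memtable -> SSTable write path."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory rows, sorted by key when flushed
        self.sstables = []    # immutable, key-sorted snapshots ("on disk")
        self.flush_threshold = flush_threshold  # assumed flush condition

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush in bulk when full

    def flush(self):
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then the newest SSTable backwards."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    store = ToyStore(flush_threshold=2)
    store.write("b", "1")
    store.write("a", "2")  # reaches the threshold and triggers a flush
    store.write("a", "3")  # newer value stays in the memtable
    print(store.sstables, store.memtable, store.read("a"))
```

Note that a write never modifies an existing SSTable, so disk writes are sequential appends rather than the small random writes of a B-tree engine.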

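How to add keyspaces and columns is covered in detail below.

Basic concepts of the Cassandra data model:

1. Cluster: the Cassandra node instances; a cluster can contain multiple Keyspaces
2. Keyspace: a container for ColumnFamilies, equivalent to a schema or database in a relational database
3. ColumnFamily: a container for Columns, similar to the concept of a table in a relational database
4. SuperColumn: a special Column whose value can contain multiple Columns
5. Column: the most basic unit in Cassandra, made up of a name, a value, and a timestamp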
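
One way to picture these nested containers is as nested dictionaries: keyspace, then column family, then row key, then column name. The Python sketch below is only a mental model, not how Cassandra stores data; the helper set_column and the sample names are made up for illustration.

```python
import time


def set_column(cluster, keyspace, column_family, row_key, name, value,
               timestamp=None):
    """Store one column (name, value, timestamp) at its place in the hierarchy."""
    if timestamp is None:
        timestamp = int(time.time() * 1000000)  # microseconds, by assumption
    row = (cluster.setdefault(keyspace, {})
                  .setdefault(column_family, {})
                  .setdefault(row_key, {}))
    row[name] = {"name": name, "value": value, "timestamp": timestamp}
    return row[name]


if __name__ == "__main__":
    cluster = {}  # Cluster -> Keyspace -> ColumnFamily -> row key -> Column
    set_column(cluster, "DEMO", "Users", "1234", "name", "scott")
    set_column(cluster, "DEMO", "Users", "1234", "password", "tiger")
    print(cluster["DEMO"]["Users"]["1234"]["name"]["value"])
```

Counting the levels (keyspace, column family, row key, column name) gives the four dimensions mentioned earlier; a SuperColumn would add one more dictionary level between the row key and the column name, giving five.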

Writing and reading a user's information

Installation and deployment

Installation is straightforward, with almost no difficulties.

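1. Log directory: make sure /var/log/cassandra/ exists and is writable. Logging is configured in conf/log4j-server.properties, on this line:

      log4j.appender.R.File=/var/log/cassandra/system.log

2. User data: stored under data_file_directories (/var/lib/cassandra/data), which you can check in the configuration file conf/cassandra.yaml.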

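3. Memory: configured in conf/cassandra-env.sh, on the following lines, which can be adjusted to suit your environment:

#MAX_HEAP_SIZE="4G"

#HEAP_NEWSIZE="800M"

4. Start with bin/cassandra -f; stop with CTRL+C.

5. Run bin/cassandra-cli to enter the interactive command line. The following commands are available to try: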

bin/cassandra-cli

You should see the following prompt, if successful:

Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.0.7

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown]

You can access the online help with the 'help;' command. Commands are terminated with a semicolon (';') in the cli.

[default@unknown] help;

First, create a keyspace for your test.

[default@unknown] create keyspace DEMO;  
f53dff10-5bd8-11e1-0000-915a024292eb
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown]

Don't forget to add a semicolon (';') at end of the command.

Second, switch to the DEMO keyspace so that later commands apply to it:

[default@unknown] use DEMO;
Authenticated to keyspace: DEMO
[default@DEMO]

Third, create a Users column family:

[default@DEMO] create column family Users                
...     with key_validation_class = 'UTF8Type'    
...     and comparator = 'UTF8Type'               
...     and default_validation_class = 'UTF8Type';
[default@DEMO]

Now you can store data into Users column family:

[default@DEMO] set Users[1234][name] = scott;
Value inserted.
Elapsed time: 10 msec(s).
[default@DEMO] set Users[1234][password] = tiger;
Value inserted.
Elapsed time: 10 msec(s).
[default@DEMO]

You have inserted a row into the Users column family. The row key is '1234', and we set values for two columns in the row: 'name', and 'password'.

Now let's fetch the data you inserted:

[default@DEMO] get Users[1234];
=> (column=name, value=scott, timestamp=1350769161684000)
=> (column=password, value=tiger, timestamp=1350769245191000)

Returned 2 results.
Elapsed time: 67 msec(s).
[default@DEMO]

Cluster configuration

Creating a multinode cluster

The default cassandra.yaml provided with cassandra is great for getting up and running on a single node. However, it is inappropriate for use in a multi-node cluster. The configuration and process here are the simplest way to create a multi-node cluster, but may not be the best way in production deployments.

Preparing the first node

Two options need to be changed. The default cassandra.yaml uses the local loopback address as its listen (inter-node) and Thrift (client access) addresses:

listen_address: localhost

rpc_address: localhost

As the listen address is used for intra-cluster communication, it must be changed to a routable address so the other nodes can reach it. For example, assuming you have an Ethernet interface with address 192.168.1.1, you would change the listen address like so:

listen_address: 192.168.1.1

The Thrift interface can be configured using either a specified address, like the listen address, or using the wildcard 0.0.0.0, which causes cassandra to listen for clients on all available interfaces. Update it as either:

rpc_address: 192.168.1.1

Or perhaps this machine has a second NIC with ip 10.140.179.1 and so you split the traffic for the intra-cluster network traffic from the thrift traffic for better performance:

rpc_address: 10.140.179.1

If the DNS entry for your host is correct, it is safe to use a hostname instead of an IP address. Similarly, the seed information should be changed from the loopback address:

seeds:
  - 127.0.0.1

Becomes:

seeds:
  - 192.168.1.1

Once these changes are made, restart cassandra on this node, then use netstat (e.g. netstat -ant | grep 7000) to verify that cassandra is listening on port 7000 at the right address. Look for a line like this:

tcp4 0 0 192.168.1.1.7000 *.* LISTEN

If netstat still shows cassandra listening on 127.0.0.1.7000, then either the previous cassandra process was not properly killed or you are not editing the cassandra.yaml file cassandra is actually using.
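
As a complement to netstat, the same check can be made from another machine by simply trying to connect. The Python sketch below is a generic TCP reachability test, not a Cassandra tool; the function name is_listening is made up, and the address and port in the usage example are the ones used in this walkthrough.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing usable is listening.
        return False


if __name__ == "__main__":
    # 7000 is the inter-node port checked with netstat above.
    print(is_listening("192.168.1.1", 7000))
```

A successful connection only shows that something accepts connections on that address and port; netstat on the node itself remains the way to confirm which process and interface are involved.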

Preparing the remaining nodes

The other nodes in the ring will use a cassandra.yaml almost identical to the one on your first node, so use that configuration as the base for these changes rather than the default cassandra.yaml. The first change is to turn on automatic bootstrapping. This will cause the node to join the ring and attempt to take control of a range of the token space:

auto_bootstrap: true

From version 1.0 onward this option can be left alone; it defaults to true.

The second change is to the listen address, as it must also not be the loopback and cannot be the same as any other node. Assuming your second node has an Ethernet interface with the address 192.168.2.1, set its listen address with:

listen_address: 192.168.2.1

Finally, update the Thrift address to accept client connections, as with the first node, either with a specific address or the wildcard:

rpc_address: 192.168.2.1

Or:

rpc_address: 10.140.180.1

Note that you should leave the Seeds section of the configuration as is so the new nodes know to use the first node for bootstrapping. Once these changes are made, start cassandra on the new node and it will automatically join the ring, assign itself an initial token, and prepare itself to handle requests.

You can connect from Java, PHP, Ruby, or Python. I only know Python well; it looks roughly like the following, though it will need adjusting for your setup:

Using Cassandra from Python requires Thrift to generate the Python bindings: thrift --gen py interface/cassandra.thrift. Then import the generated modules in your Python code; they provide the methods needed to connect to Cassandra and to read and write data.

Connecting to Cassandra from Python, then writing and reading data:

				
from thrift import Thrift
from thrift.transport import TTransport
from thrift.transport import TSocket
# Import the module itself so TBinaryProtocol.TBinaryProtocolAccelerated resolves
from thrift.protocol import TBinaryProtocol
# Modules generated by: thrift --gen py interface/cassandra.thrift
from cassandra import Cassandra
from cassandra.ttypes import *
import time
import pprint


def main():
    socket = TSocket.TSocket("192.168.10.2", 9160)
    transport = TTransport.TBufferedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    pp = pprint.PrettyPrinter(indent=2)
    keyspace = "Keyspace1"
    column_path = ColumnPath(column_family="Standard1", column="age")
    key = "studentA"
    value = "18"
    # Thrift needs an integer timestamp; microseconds since the epoch is assumed here
    timestamp = int(time.time() * 1000000)
    try:
        # Open the connection
        transport.open()
        # Write the column (keyspace-first signature, as in the interface used here)
        client.insert(keyspace, key, column_path,
                      value, timestamp, ConsistencyLevel.ZERO)
        # Read back every column of the row
        column_parent = ColumnParent(column_family="Standard1")
        slice_range = SliceRange(start="", finish="")
        predicate = SlicePredicate(slice_range=slice_range)
        result = client.get_slice(keyspace, key, column_parent,
                                  predicate, ConsistencyLevel.ONE)
        pp.pprint(result)
    except Thrift.TException as tx:
        print('Thrift: %s' % tx.message)
    finally:
        # Close the connection
        transport.close()


if __name__ == '__main__':
    main()

That's all for now. Please credit www.alexclouds.net when reposting. Thanks!