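Continuing from the previous post, this is an overview of the remaining five database categories (the material comes from an English presentation prepared for a survey; this part has not yet been translated or expanded, and a follow-up post will fill it in).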
Key-value database
What :
a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash table.
Concept :
schema-free
Sharding;Distributed;CAP;BASE
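A minimal sketch of how sharding can route keys to nodes, assuming a consistent-hash ring. The names `HashRing`, `node_for` and the shard labels are hypothetical and only illustrate the idea; real stores each use their own partitioning scheme.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer (md5 is used only for its even spread)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring that assigns each key to exactly one shard."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed on the ring several times to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
placement = {k: ring.node_for(k) for k in ("user:1", "user:2", "session:42")}
```

The point of the ring over a plain `hash(key) % n` is that adding a shard only moves the keys that land on the new shard; every other key stays where it was.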
Why :
- Reduce application server CPU and memory pressure
- Reduce IO read operations and IO stress
- Relational databases are hard to extend, and changing a table structure is difficult
Feature :
- Flexible data modeling
- Fast write performance
- Fast query performance
Use for :
- Session
- Distributed cache(High frequency data)(CAP)(BASE)
- Distributed locks
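The session, cache and lock uses above all rest on two primitives: keys that expire, and "set only if absent". Here is a rough single-process sketch of both. `TTLStore` and its methods are hypothetical names, not the API of any store listed in this post, and a real distributed lock needs more care than this (network failures, clock drift).

```python
import time

_MISSING = object()  # sentinel so that None can be stored as a value


class TTLStore:
    """In-memory key-value store with per-key expiry (hypothetical, single process)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at or None)
        self._clock = clock

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return default
        return value

    def set_if_absent(self, key, value, ttl=None):
        """Store the value only if the key is missing or expired (lock acquire)."""
        if self.get(key, _MISSING) is not _MISSING:
            return False
        self.set(key, value, ttl)
        return True

    def delete_if_equals(self, key, value):
        """Delete the key only if it still holds our value (lock release)."""
        if self.get(key, _MISSING) == value:
            del self._data[key]
            return True
        return False
```

A session is then `store.set("session:abc", data, ttl=1800)`, and a lock is `store.set_if_absent("lock:job", owner_id, ttl=10)` paired with `store.delete_if_equals("lock:job", owner_id)`, so that a worker cannot release a lock it no longer owns.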
Comparison :
Storage : map,tree,…
Implement :
DynamoDB;Riak KV;Oracle NoSQL Database(table-style);
Oracle Berkeley DB(ACID);ArangoDB;FoundationDB
Memcached;Redis;Hazelcast(ACID)
LevelDB;RocksDB
Problem :
- Poor OLTP support
Document-oriented database
What :
a computer program designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.
Why(Comparison with Key-value) :
- Data transparency(extract metadata that the database engine uses for further optimization)
- Designed to offer a richer experience with modern programming techniques
Concept :
Collections;Document
Inverted index
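To make the inverted index concrete, here is a small sketch that indexes the string fields of a collection of documents and answers AND queries. The sample log documents and the function names are made up for illustration; a real search engine adds ranking, analyzers and on-disk structures.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(collection):
    """Map each term to the set of document ids whose string fields contain it."""
    index = defaultdict(set)
    for doc_id, doc in collection.items():
        for value in doc.values():
            if isinstance(value, str):
                for term in tokenize(value):
                    index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


logs = {
    1: {"level": "error", "message": "disk full on node A"},
    2: {"level": "info", "message": "node A restarted"},
    3: {"level": "error", "message": "timeout talking to node B", "retries": 3},
}
index = build_inverted_index(logs)
```

The documents do not share a fixed schema (only one has `retries`), yet the index still covers all of them, which is the property the search-engine and log use cases rely on.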
Use for :
- Search engines
- Log
Comparison :
Encoding type : XML;JSON;YAML;BSON;PDF;WORD;EXCEL
Implement :
MongoDB;DynamoDB;CouchDB
Elasticsearch
Terrastore;RavenDB(ACID);Thrudb
OrientDB
Column-oriented database
What :
a database management system (DBMS) that stores data tables by column rather than by row.
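A toy comparison of the two layouts, assuming every row has the same fields. The table contents and the helper `to_columns` are invented for illustration.

```python
rows = [
    {"id": 1, "city": "Berlin", "amount": 10.0},
    {"id": 2, "city": "Paris", "amount": 7.5},
    {"id": 3, "city": "Berlin", "amount": 2.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregating one field reads a single contiguous column here,
# while the row layout has to visit every whole record.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(row["amount"] for row in rows)
```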
Feature :
- Extremely high loading speed
- For large data sets rather than small data sets
- Efficient storage space utilization
- High compression rate
- High CPU and memory utilization
- No need for views
- Invisible index
- Efficient projection
- Efficient relational computing
- Efficient aggregation operations
- Efficient use of cache
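One reason for the high compression rate is that values within a single column tend to repeat, especially when the column is sorted. The sketch below uses run-length encoding as an example; this is one common technique picked for illustration, not a claim about what any particular product does.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def count_equal(runs, target):
    """Count matching rows directly on the compressed form."""
    return sum(count for value, count in runs if value == target)


city = ["Berlin"] * 4 + ["Paris"] * 3 + ["Rome"] * 2
runs = rle_encode(city)
```

Nine values shrink to three pairs, and `count_equal` shows that an aggregation can run on the compressed runs without decoding them first.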
Use for :
- Data warehouse (multiple data sources, multidimensional data statistics, large amount of stored data, more queries and fewer updates)
Implement :
HBase;Cassandra;Hypertable
Problem :
- Not suitable for scanning small amounts of data
- Not suitable for real-time operations with deletes and updates
Time-series database
What :
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
Without a TSDB, a history table in a relational database may be a suitable substitute.
Why :
- Large data size (the data history is important; workloads are mostly create and read, with few updates or deletes, hence "CRud")
- Availability(Time operation)
- Relational primary keys and foreign keys are a poor fit for time series data
Use for :
- Monitor
- Real-time data processing(IOT)
Feature :
- Write more and read less
- Sequential insert
- Update less
- Bulk delete
- Distributed
- Time index
- Time function
- Time accuracy
- Timeliness
- Data compression
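Several of these features (sequential insert, time index, bulk delete) can be seen in a tiny in-memory model. `Series` and its methods are hypothetical names for this sketch, and timestamps are assumed to be plain numbers.

```python
import bisect


class Series:
    """Append-mostly series kept sorted by timestamp (hypothetical in-memory model)."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, ts, value):
        # Sequential insert: new points always go at the end.
        if self._times and ts < self._times[-1]:
            raise ValueError("out-of-order write; this sketch accepts sequential inserts only")
        self._times.append(ts)
        self._values.append(value)

    def range(self, start, end):
        """Time index: return (ts, value) pairs with start <= ts < end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

    def drop_before(self, cutoff):
        """Bulk delete: discard every point older than cutoff and return the count."""
        idx = bisect.bisect_left(self._times, cutoff)
        del self._times[:idx]
        del self._values[:idx]
        return idx

    def __len__(self):
        return len(self._times)
```

Because the data is always sorted by time, both the range query and the retention delete are a binary search plus a slice, and there is no per-row update path at all.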
Storage engine :
HDFS, HBase, Cassandra, LevelDB,…
Implement :
- InfluxDB,OpenTSDB,KDB+,Prometheus,KairosDB,HiTSDB
- Druid,Pinot
- RRDtool, Graphite
Focus :
- Scalability, Query language, Downsampling, High compression ratio
- OLAP
- Visualization
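Downsampling, mentioned in the focus list above, can be sketched as averaging raw points into fixed-width time buckets. The function name, the sample data and the choice of the mean as the aggregate are assumptions for illustration; timestamps are assumed to be integer seconds.

```python
def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = {}
    for ts, value in points:
        bucket = ts - ts % bucket_seconds  # start of the bucket this point falls in
        total, count = sums.get(bucket, (0.0, 0))
        sums[bucket] = (total + value, count + 1)
    return [(bucket, total / count) for bucket, (total, count) in sorted(sums.items())]


raw = [(0, 1.0), (20, 3.0), (59, 5.0), (60, 10.0), (130, 4.0)]
per_minute = downsample(raw, 60)
```

Five raw points become three one-minute averages, which is how older data can be kept at lower resolution for far less storage.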
Graph database
What :
a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data.
Feature :
- Better at handling relationships than relational databases
- Graph theory algorithms (shortest path calculation, centrality calculation,…)
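As an illustration of the kind of algorithm meant here, the sketch below finds a fewest-hops path with breadth-first search on an unweighted graph. The second function assumes the simplest centrality measure, degree centrality. The sample graph and function names are invented.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list of a fewest-hops path, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk the predecessor chain back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}
```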
Use for :
- OLTP
- OLAP
- Search engine
- Social network
- Financial fraud prevention
- Risk early warning
- Business relationship
- Knowledge graph (emerging, high level of business abstraction, few production deployments outside social networks and search engines, more attempts in the financial field)
Storage :
- Native graph (adjacency list, adjacency matrix) (good at traversal)
- Non-native graph (MySQL,…)(good at non-traversal operation)
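The difference between the two storage styles shows up in a two-hop traversal. In this sketch the native side is a plain adjacency list, and the non-native side is an edge table queried with one self-join per hop. SQLite (from the Python standard library) stands in for MySQL here purely so the example is runnable; the table name and the sample edges are assumptions.

```python
import sqlite3

edges = [("ann", "bob"), ("bob", "dan"), ("ann", "cat"), ("dan", "eve")]

# Native style: adjacency list, each node points straight at its neighbours.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)
    adjacency.setdefault(dst, [])


def two_hops_native(node):
    """Follow outgoing edges twice by direct lookup."""
    return sorted({far for near in adjacency[node] for far in adjacency[near]})


# Non-native style: edges as rows in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)


def two_hops_relational(node):
    """Follow outgoing edges twice with a self-join on the edge table."""
    rows = conn.execute(
        "SELECT DISTINCT e2.dst FROM edge e1 "
        "JOIN edge e2 ON e2.src = e1.dst "
        "WHERE e1.src = ? ORDER BY e2.dst",
        (node,),
    ).fetchall()
    return [dst for (dst,) in rows]
```

Both give the same answer, but each extra hop costs the relational version another join, while the adjacency list only follows one more pointer per node. Lookups that are not traversals, such as filtering edges by an attribute, are where the table form is more comfortable.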
Processing engine :
Cassovary,Pegasus,Giraph
Comparison :
Implement :
Neo4j(Neo4j Bloom),AllegroGraph
FlockDB,GraphDB
Problem :
- Not well suited to large-scale data
- Not suited to data stored in binary form
- Not suitable for large amounts of event-driven data
- Super node problem
- Distributed large graph problem