数据库领域的一些学习材料

之前林仕鼎曾整理过系统架构领域的学习资料,这几天Spark核心团队成员辛湜(Reynold Xin)公开了他整理的一份数据库学习资料列表,很有价值,Hacker News上引起了不少讨论。简要编译如下。大家如有补充,请评论。

基础与算法

关系数据库

经典的系统设计

列式数据库

列式存储和面向列的查询引擎对于分析型负荷即OLAP至关重要,已有15年历史(最早的MonetDB论文发表于1999年),到现在几乎所有商业数据仓库都有列式引擎了。

数据并行计算

  • MapReduce: Simplified Data Processing on Large Clusters (2004): MapReduce is both a programming model (borrowed from an old concept in functional programming) and a system at Google for distributed data-intensive computation. The programming model is so simple yet expressive enough to capture a wide range of programming needs. The system, coupled with the model, is fault-tolerant and scalable. It is probably fair to say that half of the academia are now working on problems heavily influenced by MapReduce.

  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012): This is the research paper behind the Spark cluster computing project at Berkeley. Spark exposes a distributed memory abstraction called RDD, which is an immutable collection of records distributed across a cluster's memory. RDDs can be transformed using MapReduce style computations. The RDD abstraction can be orders of magnitude more efficient for workloads that exhibit strong temporal locality, e.g. query processing and iterative machine learning. Spark is an example of why it is important to separate the MapReduce programming model from its execution engine.

  • Shark: SQL and Rich Analytics at Scale (2013): Describes the Shark system, which is the SQL engine built on top of Spark. More importantly, the paper discusses why previous SQL on Hadoop/MapReduce query engines were slow.

  • Spanner (2012): Spanner is "a scalable, multi-version, globally distributed, and synchronously replicated database". The linchpin that allows all this functionality is the TrueTime API which lets Spanner order events between nodes without having them communicate. There is some speculation that the TrueTime API is very similar to a vector clock but each node has to store less data. Sadly, a paper on TrueTime is promised, but hasn't yet been released.

  • Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (2007): Dryad is a programming model developed at Microsoft that enables large scale dataflow programming. "The fundamental difference between the [MapReduce and Dryad] is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations".

趋势(云计算,仓库规模计算和新硬件)

  • A View of Cloud Computing (2010): This is THE paper on Cloud Computing. This paper discusses the economics and obstacles of cloud computing (referring to the elasticity of resources, not the consumer-facing "cloud") from a technical perspective. The obstacles presented in this paper will impact design decisions for systems running in the cloud.

  • The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines: Google's Luiz André Barroso and Urs Hölzle explains the basics of data center hardware and software for warehouse-scale computing. There is an accompanying video. The video talks about the importance of cutting long-tail latency in massively parallel systems. The other key idea is the disaggregation of resources. Technologies such as GFS/HDFS already disaggregate disks because of high network bandwidth, but yet to see the same trend applying to DRAMs because that'd require low-latency networking.

  • CAP Twelve Years Later: How the "Rules" Have Changed (2012): The CAP theorem, proposed by Eric Brewer, asserts that any net­worked shared-data system can have only two of three desirable properties: Consistency, Availability, and Partition-Tolerance. A number of NoSQL stores reference CAP to justify their decision to sacrifice consistency. This is Eric Brewer's writeup on CAP in retrospective, explaining "'2 of 3' formulation was always misleading because it tended to oversimplify the tensions among properties."

杂项

扩展阅读

许多学校都有针对研究生的数据库阅读列表

  • Berkeley: http://www.eecs.berkeley.edu/GradAffairs/CS/Prelims/db.html
  • Brown: http://www.cs.brown.edu/courses/cs227/papers.html
  • Stanford: http://infolab.stanford.edu/db_pages/infoqual.html
  • Wisconsin: http://www.cs.wisc.edu/sites/default/files/db.reading.pdf
  • Joseph Hellerstein的Berkeley数据库研究生课程阅读列表,比本列表更全面
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值