The Pathologies of Big Data


ACM is the Association for Computing Machinery; ACM Queue is a bimonthly computing magazine founded and published by the ACM in 2003. Steve Bourne helped establish the magazine while serving as ACM president and chair of its editorial board. Queue is produced by computing professionals for computing professionals. It is available only in electronic form, by subscription over the Internet. Some articles published in Queue also appear in the "Practitioner" section of the ACM's monthly journal, Communications of the ACM. (See Wikipedia.)

This article can be found in ACM Queue. Baidu Scholar lists it as:
Jacobs, Adam. The pathologies of big data [J]. Queue, 2009, 7(6): 10.
(Baidu Scholar: The Pathologies of Big Data)


The Pathologies of Big Data

Scale up your datasets enough and all your apps will come undone. What are the typical problems, and where do the bottlenecks generally surface?
Adam Jacobs, 1010data Inc.

What is “big data” anyway? Gigabytes? Terabytes? Petabytes? A brief personal memory may provide some perspective. In the late 1980s at Columbia University I had the chance to play around with what at the time was a truly enormous “disk”: the IBM 3850 MSS (Mass Storage System). The MSS was actually a fully automatic robotic tape library and associated staging disks to make random access, if not exactly instantaneous, at least fully transparent. In Columbia’s configuration, it stored a total of around 100 GB. It was already on its way out by the time I got my hands on it, but in its heyday, the early to mid-1980s, it had been used to support access by social scientists to what was unquestionably “big data” at the time: the entire 1980 U.S. Census database.2

There was, presumably, no other practical way to provide the researchers with ready access to a dataset that large—at close to $40,000 per gigabyte,3 a 100-GB disk farm would have been far too expensive, and requiring the operators to manually mount and dismount thousands of 40-MB tapes would have slowed progress to a crawl, or at the very least severely limited the kinds of questions that could be asked about the census data.

A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store. The U.S. Census database included many different datasets of varying sizes, but let’s simplify a bit: 100 gigabytes is enough to store at least the basic demographic information—age, sex, income, ethnicity, language, religion, housing status, and location, packed in a 128-bit record—for every living human being on the planet. This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be considered “big data”? It depends, of course, on what you’re trying to do with it. Certainly, you could store it on $10 worth of disk. More importantly, any competent programmer could in a few hours write a simple, unoptimized application on a $500 desktop PC with minimal CPU and RAM that could crunch through that dataset and return answers to simple aggregation queries such as “what is the median age by sex for each country?” with perfectly reasonable performance.

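The 128-bit demographic record described above can be bit-packed along these lines. This is a minimal Python sketch; the field widths (7-bit age, 1-bit sex, 8-bit country) come from the article, but the exact layout within the record is an assumption for illustration, not the author's actual format.

```python
# Pack age (7 bits), sex (1 bit), and country (8 bits) into one integer.
# Field layout (age in the low bits, then sex, then country) is assumed.

def pack_record(age: int, sex: int, country: int) -> int:
    """Pack age (0-127), sex (0-1), and country (0-255) into one integer."""
    assert 0 <= age < 128 and sex in (0, 1) and 0 <= country < 256
    return (country << 8) | (sex << 7) | age

def unpack_record(record: int) -> tuple:
    """Recover (age, sex, country) from a packed record."""
    return record & 0x7F, (record >> 7) & 1, (record >> 8) & 0xFF

packed = pack_record(34, 1, 203)
assert unpack_record(packed) == (34, 1, 203)
```

The remaining bits of the 128-bit record would hold the other fields (income, language, religion, housing status, location) in the same fashion.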
To demonstrate this, I tried it, with fake data, of course—namely, a file consisting of 6.75 billion 16-byte records containing uniformly distributed random data (figure 1). Since a seven-bit age field allows a maximum of 128 possible values, one bit for sex allows only two (we’ll assume there were no NULLs), and eight bits for country allows up to 256 (the UN has 192 member states), we can calculate the median age by using a counting strategy: simply create 65,536 buckets—one for each combination of age, sex, and country—and count how many records fall into each. We find the median age by determining, for each sex and country group, the cumulative count over the 128 age buckets: the median is the bucket where the count reaches half of the total. In my tests, this algorithm was limited primarily by the speed at which data could be fetched from disk: a little over 15 minutes for one pass through the data at a typical 90-megabyte-per-second sustained read speed,9 shamefully underutilizing the CPU the whole time.

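The counting strategy described above maps naturally to code. The sketch below is an illustrative Python version (the article does not give an implementation): it fills the 65,536 buckets, one per combination of 256 countries x 2 sexes x 128 ages, then walks the 128 age buckets in order until the cumulative count reaches half the group total.

```python
from collections import defaultdict

# One bucket per (country, sex, age) combination: 256 * 2 * 128 = 65,536.
counts = defaultdict(int)

def tally(records):
    """First pass over the data: count records falling into each bucket."""
    for age, sex, country in records:
        counts[(country, sex, age)] += 1

def median_age(country, sex):
    """Scan the 128 age buckets cumulatively; the median is the bucket
    where the running count first reaches half of the group total."""
    total = sum(counts[(country, sex, a)] for a in range(128))
    cumulative = 0
    for a in range(128):
        cumulative += counts[(country, sex, a)]
        if cumulative * 2 >= total:
            return a

# Tiny worked example with hypothetical records, not the real census data:
tally([(20, 0, 1), (30, 0, 1), (40, 0, 1)])
assert median_age(1, 0) == 30
```

Because the buckets are indexed by age, no sorting of the input is required; a single sequential pass over the file suffices, which is why the run is disk-bound rather than CPU-bound.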
figure 1

In fact, our table of “all the people in the world” will fit in the memory of a single, $15,000 Dell server with 128-GB RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this “big data,” particularly in a world where a single research site, the LHC (Large Hadron Collider) at CERN (European Organization for Nuclear Research), is expected to produce 150,000 times as much raw data each year.10

For many commonly used applications, however, our hypothetical 6.75-billion-row dataset would in fact pose a significant challenge. I tried loading my fake 100-GB world census into a commonly used enterprise-grade database system (PostgreSQL6) running on relatively hefty hardware (an eight-core Mac Pro workstation with 20-GB RAM and two terabytes of RAID 0 disk) but had to abort the bulk load process after six hours as the database storage had already reached many times the size of the original binary dataset, and the workstation’s disk was nearly full. (Part of this, of course, was a result of the “unpacking” of the data. The original file stored fields bit-packed rather than as distinct integer fields, but subsequent tests revealed that the database was using three to four times as much storage as would be necessary to store each field as a 32-bit integer. This sort of data “inflation” is typical of a traditional RDBMS and shouldn’t necessarily be seen as a problem, especially to the extent that it is part of a strategy to improve performance. After all, disk space is relatively cheap.)

I was successfully able to load subsets consisting of up to 1 billion rows of just three columns: country (eight bits, 256 possible values), age (seven bits, 128 possible values), and sex (one bit, two values). This was only 2 percent of the raw data, although it ended up consuming more than 40 GB in the DBMS. I then tested the following query, essentially the same computation as the left side of figure 1:

SELECT country,age,sex,count(*) FROM people GROUP BY country,age,sex;

This query ran in a matter of seconds on small subsets of the data, but execution time increased rapidly as the number of rows grew past 1 million (figure 2). Applied to the entire billion rows, the query took more than 24 hours, suggesting that PostgreSQL was not scaling gracefully to this “big” dataset, presumably because of a poor choice of algorithm for the given data and query. Invoking the DBMS’s built-in EXPLAIN facility revealed the problem: while the query planner chose a reasonable hash table-based aggregation strategy for small tables, on larger tables it switched to sorting by grouping columns—a viable, if suboptimal strategy given a few million rows, but a very poor one when facing a billion. PostgreSQL tracks statistics such as the minimum and maximum value of each column in a table (and I verified that it had correctly identified the ranges of all three columns), so it could have chosen a hash-table strategy with confidence. It’s worth noting, however, that even had the table’s statistics not been known, on a billion rows it would take far less time to do an initial scan and determine the distributions than to embark on a full-table sort.

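The hash-table aggregation strategy the planner should have kept for the large table amounts to one sequential pass with a bounded-size hash table. A minimal Python sketch of the idea (this is an illustration of the strategy, not PostgreSQL's implementation):

```python
from collections import Counter

def group_count(rows):
    """Hash-based GROUP BY country, age, sex: a single sequential pass,
    O(n) time, with at most 256 * 128 * 2 = 65,536 live hash entries,
    regardless of how many rows are scanned."""
    counts = Counter()
    for country, age, sex in rows:
        counts[(country, age, sex)] += 1
    return counts

rows = [(1, 30, 0), (1, 30, 0), (2, 45, 1)]
assert group_count(rows)[(1, 30, 0)] == 2
```

A sort-based GroupAggregate, by contrast, must first order all n rows (O(n log n), with heavy nonsequential I/O once the sort spills to disk), which is why it collapses at a billion rows while the hash strategy does not.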
figure 2

PostgreSQL’s difficulty here was in analyzing the stored data, not in storing it. The database didn’t blink at loading or maintaining a database of a billion records; presumably there would have been no difficulty storing the entire 6.75-billion-row, 10-column table had I had sufficient free disk space.

Here’s the big truth about big data in traditional databases: it’s easier to get the data in than out. Most DBMSs are designed for efficient transaction processing: adding, updating, searching for, and retrieving small amounts of information in a large database. Data is typically acquired in a transactional fashion: imagine a user logging into a retail Web site (account data is retrieved; session information is added to a log), searching for products (product data is searched for and retrieved; more session information is acquired), and making a purchase (details are inserted in an order database; user information is updated). A fair amount of data has been added effortlessly to a database that—if it’s a large site that has been in operation for a while—probably already constitutes “big data.”

There is no pathology here; this story is repeated in countless ways, every second of the day, all over the world. The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it—and naturally we want the answer in seconds or minutes! The pathologies of big data are primarily those of analysis. This may be a slightly controversial assertion, but I would argue that transaction processing and data storage are largely solved problems. Short of LHC-scale science, few enterprises generate data at such a rate that acquiring and storing it pose major challenges today.

In business applications, at least, data warehousing is ordinarily regarded as the solution to the database problem (data goes in but doesn’t come out). A data warehouse has been classically defined as “a copy of transaction data specifically structured for query and analysis,”4 and the general approach is commonly understood to be bulk extraction of the data from an operational database, followed by reconstitution in a different database in a form that is more suitable for analytical queries (the so-called “extract, transform, load,” or sometimes “extract, load, transform” process). Merely saying, “We will build a data warehouse” is not sufficient when faced with a truly huge accumulation of data. How must data be structured for query and analysis, and how must analytical databases and tools be designed to handle it efficiently? Big data changes the answers to these questions, as traditional techniques such as RDBMS-based dimensional modeling and cube-based OLAP (online analytical processing) turn out to be either too slow or too limited to support asking the really interesting questions about warehoused data. To understand how to avoid the pathologies of big data, whether in the context of a data warehouse or in the physical or social sciences, we need to consider what really makes it “big.”


Dealing with Big Data

Data means “things given” in Latin—although we tend to use it as a mass noun in English, as if it denotes a substance—and ultimately, almost all useful data is “given” to us either by nature, as a reward for careful observation of physical processes, or by other people, usually inadvertently (consider logs of Web hits or retail transactions, both common sources of big data). As a result, in the real world, data is not just a big set of random numbers; it tends to exhibit predictable characteristics. For one thing, as a rule, the largest cardinalities of most datasets—specifically, the number of distinct entities about which observations are made—are small compared with the total number of observations.

This is hardly surprising. Human beings are making the observations, or being observed as the case may be, and there are no more than 6.75 billion of them at the moment, which sets a rather practical upper bound. The objects about which we collect data, if they are of the human world—Web pages, stores, products, accounts, securities, countries, cities, houses, phones, IP addresses—tend to be fewer in number than the total world population. Even in scientific datasets, a practical limit on cardinalities is often set by such factors as the number of available sensors (a state-of-the-art neurophysiology dataset, for example, might reflect 512 channels of recording5) or simply the number of distinct entities that humans have been able to detect and identify (the largest astronomical catalogs, for example, include several hundred million objects8).

What makes most big data big is repeated observations over time and/or space. The Web log records millions of visits a day to a handful of pages; the cellphone database stores time and location every 15 seconds for each of a few million phones; the retailer has thousands of stores, tens of thousands of products, and millions of customers but logs billions and billions of individual transactions in a year. Scientific measurements are often made at a high time resolution (thousands of samples a second in neurophysiology, far more in particle physics) and really start to get huge when they involve two or three dimensions of space as well; fMRI neuroimaging studies can generate hundreds or even thousands of gigabytes in a single experiment. Imaging in general is the source of some of the biggest big data out there, but the problems of large image data are a topic for an article by themselves; I won’t consider them further here.

The fact that most large datasets have inherent temporal or spatial dimensions, or both, is crucial to understanding one important way that big data can cause performance problems, especially when databases are involved. It would seem intuitively obvious that data with a time dimension, for example, should in most cases be stored and processed with at least a partial temporal ordering to preserve locality of reference as much as possible when data is consumed in time order. After all, most nontrivial analyses will involve at the very least an aggregation of observations over one or more contiguous time intervals. One is more likely, for example, to be looking at the purchases of a randomly selected set of customers over a particular time period than of a “contiguous range” of customers (however defined) at a randomly selected set of times.

The point is even clearer when we consider the demands of time-series analysis and forecasting, which aggregate data in an order-dependent manner (e.g., cumulative and moving-window functions, lead and lag operators, etc.). Such analyses are necessary for answering most of the truly interesting questions about temporal data, broadly: “What happened?” “Why did it happen?” “What’s going to happen next?”

The prevailing database model today, however, is the relational database, and this model explicitly ignores the ordering of rows in tables.1 Database implementations that follow this model, eschewing the idea of an inherent order on tables, will inevitably end up retrieving data in a nonsequential fashion once it grows large enough that it no longer fits in memory. As the total amount of data stored in the database grows, the problem only becomes more significant. To achieve acceptable performance for highly order-dependent queries on truly large data, one must be willing to consider abandoning the purely relational database model for one that recognizes the concept of inherent ordering of data down to the implementation level. Fortunately, this point is slowly starting to be recognized in the analytical database sphere.

Not only in databases, but also in application programming in general, big data greatly magnifies the performance impact of suboptimal access patterns. As dataset sizes grow, it becomes increasingly important to choose algorithms that exploit the efficiency of sequential access as much as possible at all stages of processing. Aside from the obvious point that a 10:1 increase in processing time (which could easily result from a high proportion of nonsequential accesses) is far more painful when the units are hours than when they are seconds, increasing data sizes mean that data access becomes less and less efficient. The penalty for inefficient access patterns increases disproportionately as the limits of successive stages of hardware are exhausted: from processor cache to memory, memory to local disk, and—rarely nowadays!—disk to off-line storage.

On typical server hardware today, completely random memory access on a range much larger than cache size can be an order of magnitude or more slower than purely sequential access, but completely random disk access can be five orders of magnitude slower than sequential access (figure 3). Even state-of-the-art solid-state (flash) disks, although they have much lower seek latency than magnetic disks, can differ in speed by roughly four orders of magnitude between random and sequential access patterns. The results for the test shown in figure 3 are the number of four-byte integer values read per second from a 1-billion-long (4 GB) array on disk or in memory; random disk reads are for 10,000 indices chosen at random between one and 1 billion.

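The kind of measurement behind figure 3 can be reproduced in miniature. The sketch below is a scaled-down, memory-only approximation in Python; pure-Python loop overhead narrows the gap considerably compared with the article's C-level numbers, so treat the printed ratio as illustrative only.

```python
import array
import random
import time

# Read every element of an array of 4-byte integers once sequentially
# and once in shuffled order. (The article's test used a 1-billion-long
# array and measured disk as well as memory; this is memory-only and
# much smaller so that it runs in seconds.)

n = 1_000_000
data = array.array('i', range(n))

def timed_sum(indices):
    """Return (elapsed seconds, sum of the elements visited)."""
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return time.perf_counter() - start, total

seq_time, seq_total = timed_sum(range(n))
order = list(range(n))
random.shuffle(order)
rnd_time, rnd_total = timed_sum(order)

assert seq_total == rnd_total  # same elements, different visit order
print(f"random vs sequential access in RAM: {rnd_time / seq_time:.2f}x")
```

On real hardware with working sets far larger than cache, and especially once the array lives on disk, the same access-pattern change produces the orders-of-magnitude differences the article reports.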
figure 3

A further point that’s widely underappreciated: in modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk. Note that random reads from disk are more than 150,000 times slower than sequential access; SSD improves on this ratio by less than one order of magnitude. In a very real sense, all of the modern forms of storage improve only in degree, not in their essential nature, upon that most venerable and sequential of storage media: the tape.

The huge cost of random access has major implications for analysis of large datasets (whereas it is typically mitigated by various kinds of caching when data sizes are small). Consider, for example, joining large tables that are not both stored and sorted by the join key—say, a series of Web transactions and a list of user/account information. The transaction table has been stored in time order, both because that is the way the data was gathered and because the analysis of interest (tracking navigation paths, say) is inherently temporal. The user table, of course, has no temporal dimension.

As records from the transaction table are consumed in temporal order, accesses to the joined user table will be effectively random—at great cost if the table is large and stored on disk. If sufficient memory is available to hold the user table, performance will be improved by keeping it there. Because random access in RAM is itself expensive, and RAM is a scarce resource that may simply not be available for caching large tables, the best solution when constructing a large database for analytical purposes (e.g., in a data warehouse) may, surprisingly, be to build a fully denormalized table—that is, a table including each transaction along with all user information that is relevant to the analysis (figure 4). Denormalizing a 10-million-row, 10-column user information table onto a 1-billion-row, four-column transaction table adds substantially to the size of data that must be stored (the denormalized table is more than three times the size of the original tables combined). If data analysis is carried out in timestamp order but requires information from both tables, then eliminating random look-ups in the user table can improve performance greatly. Although this inevitably requires much more storage and, more importantly, more data to be read from disk in the course of the analysis, the advantage gained by doing all data access in sequential order is often enormous.

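The trade-off can be made concrete with a toy schema (the table shapes and field names below are hypothetical, not the article's actual data):

```python
# Denormalization trade-off: pre-joining user attributes onto each
# transaction row trades storage for purely sequential access.

users = {101: ("alice", "US"), 102: ("bob", "FR")}  # user_id -> attributes

transactions = [  # stored in timestamp order: (timestamp, user_id, action)
    (1, 101, "view"), (2, 102, "view"), (3, 101, "buy"),
]

# Normalized analysis: every transaction row triggers a lookup into the
# user table -- effectively random access if that table lives on disk.
normalized = [(ts, users[uid], action) for ts, uid, action in transactions]

# Denormalized table: user attributes are copied onto every transaction
# once, up front, so any later timestamp-order scan is purely sequential.
denormalized = [(ts, name, country, action)
                for ts, uid, action in transactions
                for name, country in [users[uid]]]

assert denormalized[2] == (3, "alice", "US", "buy")
```

The denormalized table is larger (user attributes are repeated on every transaction), but every byte of it can be consumed in a single sequential scan.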
figure 4


Hard Limits

Another major challenge for data analysis is exemplified by applications with hard limits on the size of data they can handle. Here, one is dealing mostly with the end-user analytical applications that constitute the last stage in analysis. Occasionally the limits are relatively arbitrary; consider the 256-column, 65,536-row bound on worksheet size in all versions of Microsoft Excel prior to the most recent one. Such a limit might have seemed reasonable in the days when main RAM was measured in megabytes, but it was clearly obsolete by 2007 when Microsoft updated Excel to accommodate up to 16,384 columns and 1 million rows. Enough for anyone? Excel is not targeted at users crunching truly huge datasets, but the fact remains that anyone working with a 1-million-row dataset (a list of customers along with their total purchases for a large chain store, perhaps) is likely to face a 2-million-row dataset sooner or later, and Excel has placed itself out of the running for the job.

In designing applications to handle ever-increasing amounts of data, developers would do well to remember that hardware specs are improving too, and keep in mind the so-called ZOI (zero-one-infinity) rule, which states that a program should “allow none of foo, one of foo, or any number of foo.”11 That is, limits should not be arbitrary; ideally, one should be able to do as much with software as the hardware platform allows.

Of course, hardware—chiefly memory and CPU limitations—is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R.7 Memory-bound applications naturally exhibit higher performance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you’re out of luck. On most hardware platforms, there’s a much harder limit on memory expansion than disk expansion: the motherboard has only so many slots to fill.

The problem often goes further than this, however. Like most other aspects of computer hardware, maximum memory capacities increase with time; 32 GB is no longer a rare configuration for a desktop workstation, and servers are frequently configured with far more than that. There is no guarantee, however, that a memory-bound application will be able to use all installed RAM. Even under modern 64-bit operating systems, many applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces—this often translates into a 2- or 3-GB working set limitation.

Finally, even where a 64-bit binary is available—removing the absolute address space limitation—all too often relics from the age of 32-bit code still pervade software, particularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 2^31 - 1, or about 2 billion rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world census example ends up being too big for R to handle.

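The arithmetic behind that limit is worth making explicit:

```python
# The signed 32-bit length limit mentioned above, worked out explicitly.
INT32_MAX = 2**31 - 1          # 2,147,483,647: the historical row limit

world_census_rows = 6_750_000_000

assert INT32_MAX == 2147483647
assert world_census_rows > INT32_MAX   # the census table overflows a 32-bit index
print(world_census_rows / INT32_MAX)   # ~3.14: more than triple the limit
```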

Distributed Computing as a Strategy for Big Data

Any given computer has a series of absolute and practical limits: memory size, disk size, processor speed, and so on. When one of these limits is exhausted, we lean on the next one, but at a performance cost: an in-memory database is faster than an on-disk one, but a PC with 2-GB RAM cannot store a 100-GB dataset entirely in memory; a server with 128-GB RAM can, but the data may well grow to 200 GB before the next generation of servers with twice the memory slots comes out.

The beauty of today’s mainstream computer hardware, though, is that it’s cheap and almost infinitely replicable. Today it is much more cost-effective to purchase eight off-the-shelf, “commodity” servers with eight processing cores and 128 GB of RAM each than it is to acquire a single system with 64 processors and a terabyte of RAM. Although the absolute numbers will change over time, barring a radical change in computer architectures, the general principle is likely to remain true for the foreseeable future. Thus, it’s not surprising that distributed computing is the most successful strategy known for analyzing very large datasets.

Distributing analysis over multiple computers has significant performance costs: even with gigabit and 10-gigabit Ethernet, both bandwidth (sequential access speed) and latency (thus, random access speed) are several orders of magnitude worse than RAM. At the same time, however, the highest-speed local network technologies have now surpassed most locally attached disk systems with respect to bandwidth, and network latency is naturally much lower than disk latency.

在多台计算机上进行分布式分析会带来显著的性能成本:即使使用 1 Gb 和 10 Gb 以太网,带宽(顺序访问速度)和延迟(进而随机访问速度)也比 RAM 差几个数量级。但与此同时,最高速的本地网络技术在带宽上已经超过了大多数本地连接的磁盘系统,而网络延迟自然也比磁盘延迟低得多。

As a result, the performance cost of storing and retrieving data on other nodes in a network is comparable to (and in the case of random access, potentially far less than) the cost of using disk. Once a large dataset has been distributed to multiple nodes in this way, however, a huge advantage can be obtained by distributing the processing as well—so long as the analysis is amenable to parallel processing.

因此,在网络中其他节点上存储和检索数据的性能成本已与使用磁盘相当(在随机访问的情况下,甚至可能远低于磁盘)。然而,一旦把大型数据集以这种方式分布到多个节点上,只要分析过程适合并行处理,再把处理过程也分布出去,就能获得巨大的优势。

Much has been and can be said about this topic, but in the context of a distributed large dataset, the criteria are essentially related to those discussed earlier: just as maintaining locality of reference via sequential access is crucial to processes that rely on disk I/O (because disk seeks are expensive), so too, in distributed analysis, processing must include a significant component that is local in the data—that is, does not require simultaneous processing of many disparate parts of the dataset (because communication between the different processing domains is expensive). Fortunately, most real-world data analysis does include such a component. Operations such as searching, counting, partial aggregation, record-wise combinations of multiple fields, and many time-series analyses (if the data is stored in the correct order) can be carried out on each computing node independently.

关于这个主题已有很多论述,但在分布式大数据集的语境下,判断标准本质上与前面讨论过的相关:正如通过顺序访问保持引用局部性对依赖磁盘 I/O 的处理至关重要(因为磁盘寻道开销高昂),在分布式分析中,处理过程也必须包含一个相当大的、对数据而言是局部的部分,也就是说,不需要同时处理数据集中许多互不相邻的部分(因为不同处理域之间的通信开销高昂)。幸运的是,现实世界中的大多数数据分析确实包含这样的部分。诸如搜索、计数、部分聚合、多个字段的逐记录组合,以及许多时间序列分析(如果数据按正确顺序存储)之类的操作,都可以在每个计算节点上独立执行。
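译者注:下面用一小段 Python 示意"局部操作"为何天然适合分布式执行。这里的"节点"只是本地列表,并非任何真实分布式框架的 API,仅用于演示:计数在每个分片上独立完成,节点间只需传回一个数字。

```python
# 示意性草图:用本地列表模拟分布在各"节点"上的数据分片,
# 说明计数这类操作可以在每个节点上独立执行。

def local_count(shard, predicate):
    """在单个节点上独立完成的局部计数,无需访问其他分片。"""
    return sum(1 for x in shard if predicate(x))

# 假设数据被分到 4 个"节点"上
shards = [[1, 5, 9], [2, 6], [7, 8, 3], [4]]

# 每个节点只需向汇总节点传回一个整数
partial = [local_count(s, lambda x: x > 4) for s in shards]
total = sum(partial)  # 跨节点通信量:每节点一个数
```

搜索、部分聚合等操作同理:昂贵的扫描留在本地,网络上只流动极小的中间结果。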

Furthermore, where communication between nodes is required, it often occurs after data has been extensively aggregated; consider, for example, taking an average of billions of rows of data stored on multiple nodes. Each node is required to communicate only two values—a sum and a count—to the node that produces the final result. Not every aggregation can be computed so simply, as a global aggregation of local sub-aggregations (consider the task of finding a global median, for example, instead of a mean), but many of the important ones can, and there are distributed algorithms for other, more complicated tasks that minimize communication between nodes.

此外,在确实需要节点间通信的场合,通信往往发生在数据已被大量聚合之后。例如,考虑对存储在多个节点上的数十亿行数据求平均值:每个节点只需向产生最终结果的节点传递两个值(总和与计数)。并非每种聚合都能这样简单地由局部子聚合得到全局聚合(例如求全局中位数而非平均值),但许多重要的聚合可以;对于其他更复杂的任务,也存在能把节点间通信降到最低的分布式算法。
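译者注:文中"平均值可以、中位数不行"的对比,可以用一段小 Python 草图直接验证(纯演示,数据为虚构):

```python
# 平均值:每个节点只需传回 (sum, count) 两个值即可合成全局结果;
# 而"各节点局部中位数的中位数"一般并不等于全局中位数。
shards = [[1, 2, 3], [4, 5, 6, 7], [100]]

# 局部子聚合 -> 全局聚合
pairs = [(sum(s), len(s)) for s in shards]            # 每节点两个数
global_mean = sum(p[0] for p in pairs) / sum(p[1] for p in pairs)

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

naive = median([median(s) for s in shards])           # 局部中位数的中位数
true_median = median([x for s in shards for x in s])  # 集中计算的真值
# naive 为 5.5,而真正的全局中位数是 4.5:两者并不相等
```

这正是为什么全局中位数需要专门的分布式算法,而不能靠简单的两级聚合。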

Naturally, distributed analysis of big data comes with its own set of “gotchas.” One of the major problems is nonuniform distribution of work across nodes. Ideally, each node will have the same amount of independent computation to do before results are consolidated across nodes. If this is not the case, then the node with the most work will dictate how long we must wait for the results, and this will obviously be longer than we would have waited had work been distributed uniformly; in the worst case, all the work may be concentrated in a single node and we will get no benefit at all from parallelism.

自然,大数据的分布式分析也有它自己的"陷阱"。主要问题之一是工作在各节点间分布不均。理想情况下,在跨节点合并结果之前,每个节点应当有相同数量的独立计算要做。如果不是这样,那么工作量最大的节点将决定我们要等待多久才能拿到结果,而这显然比工作均匀分布时的等待时间更长;在最坏的情况下,所有工作可能都集中在单个节点上,我们将从并行中得不到任何好处。
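译者注:"总耗时由最忙的节点决定"可以用几行 Python 算一算(假设每单位工作耗时相同,数字为虚构):

```python
# 并行执行的总耗时近似为各节点工作量的最大值。
work_total = 800  # 总工作量(单位任意)

balanced = [100] * 8          # 均匀分到 8 个节点
skewed = [730] + [10] * 7     # 绝大部分工作落在一个节点

elapsed_balanced = max(balanced)  # 100:接近理想的 8 倍加速
elapsed_skewed = max(skewed)      # 730:其余 7 个节点大部分时间闲置

speedup_skewed = work_total / elapsed_skewed  # 约 1.1 倍,几乎没有并行收益
```

倾斜分配下 8 个节点只换来约 1.1 倍的加速,这就是负载不均的代价。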

Whether this is a problem or not will tend to be determined by how the data is distributed across nodes; unfortunately, in many cases this can come into direct conflict with the imperative to distribute data in such a way that processing at each node is local. Consider, for example, a dataset that consists of 10 years of observations collected at 15-second intervals from 1,000 sensor sites. There are more than 20 million observations for each site; and, because the typical analysis would involve time-series calculations—say, looking for unusual values relative to a moving average and standard deviation—we decide to store the data ordered by time for each sensor site (figure 5), distributed over 10 computing nodes so that each one gets all the observations for 100 sites (a total of 2 billion observations per node). Unfortunately, this means that whenever we are interested in the results of only one or a few sensors, most of our computing nodes will be totally idle. Whether the rows are clustered by sensor or by time stamp makes a big difference in the degree of parallelism with which different queries will execute.

这是否会成为问题,往往取决于数据如何在节点间分布;不幸的是,在许多情况下,这可能与"以使每个节点的处理都是局部的方式来分布数据"这一要求直接冲突。例如,考虑这样一个数据集:从 1000 个传感器站点以 15 秒为间隔收集了 10 年的观测值,每个站点的观测值超过 2000 万条。由于典型的分析会涉及时间序列计算(比如寻找相对于移动平均值和标准差的异常值),我们决定按每个传感器站点内部以时间排序的方式存储数据(图5),并分布到 10 个计算节点上,使每个节点获得 100 个站点的全部观测值(每节点共 20 亿条观测)。不幸的是,这意味着只要我们只关心一个或少数几个传感器的结果,大多数计算节点就会完全闲置。行是按传感器聚类还是按时间戳聚类,会对不同查询的并行执行程度产生很大影响。
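译者注:两种分片方式对并行度的影响,可以用一个极简的分片函数示意(站点数 1000、节点数 10 取自文中例子;函数本身是为演示虚构的):

```python
# 比较两种分片方式下,查询"单个传感器的全部历史"会动用多少个节点。
NODES, SITES, YEARS = 10, 1000, 10

def node_by_sensor(site_id, year):
    # 按传感器分片:每个节点负责 100 个站点的全部历史
    return site_id // (SITES // NODES)

def node_by_time(site_id, year):
    # 按时间分片:每个节点负责一年(年份编号 0..9)的全部站点
    return year

# 查询:站点 42 的 10 年数据分别落在哪些节点上?
touched_sensor = {node_by_sensor(42, y) for y in range(YEARS)}
touched_time = {node_by_time(42, y) for y in range(YEARS)}
# 按传感器分片:只有 1 个节点在干活,其余 9 个闲置;
# 按时间分片:10 个节点全部参与。
```

反过来,若查询的是"去年所有站点的数据",两种分片的优劣正好对调,这正是文中所说的两难。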

Figure 5(图5)

We could, of course, store the data ordered by time, one year per node, so that each sensor site is represented in each node (we would need some communication between successive nodes at the beginning of the computation to “prime” the time-series calculations). This approach also runs into difficulty if we suddenly need an intensive analysis of the past year’s worth of data. Storing the data both ways would provide optimal efficiency for both kinds of analysis—but the larger the dataset, the more likely it is that two copies would be simply too much data for the available hardware resources.

我们当然也可以按时间排序存储数据,每个节点存一年,这样每个节点中都有每个传感器站点的数据(在计算开始时,相邻节点之间需要少量通信来"预热"时间序列计算)。但如果我们突然需要对过去一年的数据做密集分析,这种方式同样会遇到困难。两种方式各存一份数据,可以同时为这两类分析提供最佳效率,但数据集越大,两份副本就越有可能超出可用硬件资源的承受能力。

Another important issue with distributed systems is reliability. Just as a four-engine airplane is more likely to experience an engine failure in a given period than a craft with two of the equivalent engines, so too is it 10 times more likely that a cluster of 10 machines will require a service call. Unfortunately, many of the components that get replicated in clusters—power supplies, disks, fans, cabling, etc.—tend to be unreliable. It is, of course, possible to make a cluster arbitrarily resistant to single-node failures, chiefly by replicating data across the nodes. Happily, there is perhaps room for some synergy here: data replicated to improve the efficiency of different kinds of analyses, as above, can also provide redundancy against the inevitable node failure. Once again, however, the larger the dataset, the more difficult it is to maintain multiple copies of the data.

分布式系统的另一个重要问题是可靠性。正如一架四引擎飞机在给定时段内发生引擎故障的可能性高于只装两台同型引擎的飞机,一个由 10 台机器组成的集群需要维修的可能性也要高出约 10 倍。不幸的是,许多在集群中被成倍复制的组件(电源、磁盘、风扇、线缆等)往往并不可靠。当然,主要通过跨节点复制数据,可以让集群对单节点故障具有任意强的抵抗力。令人高兴的是,这里或许存在一些协同的空间:如上文所述,为提高不同分析效率而复制的数据,同时也能为不可避免的节点故障提供冗余。不过,还是那句话,数据集越大,维护数据的多个副本就越困难。
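译者注:文中"高出约 10 倍"背后是一个简单的概率事实:若单台机器在某时段内出故障的概率为 p,则 N 台机器中至少一台出故障的概率为 1-(1-p)^N,在 p 很小时近似等于 N·p。用 Python 验证一下(p=0.01 是随手假设的数字):

```python
# 集群规模与"至少一台机器需要维修"的概率之间的关系。
def prob_any_failure(p, n):
    """n 台独立机器中至少一台出故障的概率。"""
    return 1 - (1 - p) ** n

p = 0.01                        # 假设的单机故障概率
single = prob_any_failure(p, 1)
cluster = prob_any_failure(p, 10)
ratio = cluster / single        # 略低于 10;p 越小越接近 10
```

这也解释了为什么集群越大,数据冗余(跨节点复制)就越不是可选项而是必选项。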


A Meta-definition

I have tried here to provide an overview of a few of the issues that can arise when analyzing big data: the inability of many off-the-shelf packages to scale to large problems; the paramount importance of avoiding suboptimal access patterns as the bulk of processing moves down the storage hierarchy; and replication of data for storage and efficiency in distributed processing. I have not yet answered the question I opened with: what is “big data,” anyway?

我在这里试图概述分析大数据时可能出现的几个问题:许多现成软件包无法扩展到大规模问题;随着大部分处理沿存储层次结构下移,避免次优访问模式(suboptimal access patterns)至关重要;以及在分布式处理中为了存储和效率而进行的数据复制。我还没有回答开头提出的那个问题:到底什么是"大数据"?

I will take a stab at a meta-definition: big data should be defined at any point in time as “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time.” In the early 1980s, it was a dataset that was so large that a robotic “tape monkey” was required to swap thousands of tapes in and out. In the 1990s, perhaps, it was any data that transcended the bounds of Microsoft Excel and a desktop PC, requiring serious software on Unix workstations to analyze. Nowadays, it may mean data that is too large to be placed in a relational database and analyzed with the help of a desktop statistics/visualization package—data, perhaps, whose analysis requires massively parallel software running on tens, hundreds, or even thousands of servers.

我来尝试给出一个元定义:在任何时点,大数据都应被定义为"其规模迫使我们超越当时流行的、久经考验的方法的数据"。在 1980 年代初,它是大到需要一个机器人 "tape monkey" 来装卸数千盘磁带的数据集。在 1990 年代,它也许是任何超出了 Microsoft Excel 和台式 PC 能力范围、需要在 Unix 工作站上用正经软件才能分析的数据。如今,它可能意味着数据大到无法放进关系数据库、无法借助桌面统计/可视化软件包来分析,其分析或许需要在数十台、数百台甚至数千台服务器上运行大规模并行软件。

In any case, as analyses of ever-larger datasets become routine, the definition will continue to shift, but one thing will remain constant: success at the leading edge will be achieved by those developers who can look past the standard, off-the-shelf techniques and understand the true nature of the hardware resources and the full panoply of algorithms that are available to them.

无论如何,随着对越来越大的数据集的分析日趋常规,这个定义将继续变化,但有一点不会变:在最前沿取得成功的,将是那些能够超越标准的现成技术、理解硬件资源的真实特性以及可供自己使用的全部算法的开发人员。


References

  1. Codd, E. F. 1970. A relational model for large shared data banks. Communications of the ACM 13(6): 377-387.
  2. IBM 3850 Mass Storage System; http://www.columbia.edu/acis/history/mss.html.
  3. IBM Archives: IBM 3380 direct access storage device; http://www-03.ibm.com/ibm/history/exhibits/storage/storage_3380.html.
  4. Kimball, R. 1996. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. New York: John Wiley & Sons.
  5. Litke, A. M., et al. 2004. What does the eye tell the brain? Development of a system for the large-scale recording of retinal output activity. IEEE Transactions on Nuclear Science 51(4): 1434-1440.
  6. PostgreSQL: The world’s most advanced open source database; http://www.postgresql.org.
  7. The R Project for Statistical Computing; http://www.r-project.org.
  8. Sloan Digital Sky Survey; http://www.sdss.org.
  9. Throughput and Interface Performance. Tom’s Winter 2008 Hard Drive Guide; http://www.tomshardware.com/reviews/hdd-terabyte-1tb,2077-11.html.
  10. WLCG (Worldwide LHC Computing Grid); http://lcg.web.cern.ch/LCG/public/.
  11. Zero-One-Infinity Rule; http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html.

论文作者

Adam Jacobs is senior software engineer at 1010data Inc., where, among other roles, he leads the continuing development of Tenbase, the company’s ultra-high-performance analytical database engine. He has more than 10 years of experience with distributed processing of big datasets, starting in his earlier career as a computational neuroscientist at Weill Medical College of Cornell University (where he holds the position of Visiting Fellow) and at UCLA. He holds a Ph.D. in neuroscience from UC Berkeley and a B.A. in linguistics from Columbia University.

