Book Review of “Big Data: A Revolution That Will Transform How We Live, Work and Think”

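According to Cukier and Mayer-Schönberger, datafication requires three profound changes in how we handle data. The first, which they call n=all, means collecting and using large amounts of data rather than settling for small samples, as statisticians have done until now. “When collecting data was too costly and processing it was cumbersome and time-consuming, the sample was a savior. Modern sampling is based on the idea that, within a certain margin of error, one can infer something about the whole population from the data of a small group, as long as the sample is chosen at random.”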

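Sampling requires deciding in advance how the data will be used, so that you can design an appropriate sample. This works well for questions about the population as a whole, but it is much less useful when you drill down into smaller groups, because you may not have enough data to do so reliably. And if you change your mind about what insights you want from the data, you usually have to draw a new sample. All of these problems disappear when you can collect and store all of the data rather than a sample, that is, when n=all.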
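The subgroup problem can be made concrete with the textbook margin-of-error formula for a proportion estimated from a simple random sample. The sketch below is only illustrative: the sample sizes (1,000 respondents overall, 50 in a subgroup), the 95% confidence level, and the function name `margin_of_error` are assumptions chosen for the example, not figures from the book.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); p = 0.5 is the worst case.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical survey: 1,000 respondents overall, 50 of whom fall in a small subgroup.
overall = margin_of_error(1000)
subgroup = margin_of_error(50)

print(f"whole sample (n=1000): +/- {overall:.1%}")
print(f"subgroup     (n=50):   +/- {subgroup:.1%}")
```

Under these assumptions the estimate for the whole sample is good to about 3 percentage points, while the estimate for the 50-person subgroup is only good to about 14, which is why a sample designed for one question often cannot answer a narrower one.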

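The next change requires accepting messy data rather than clean, carefully curated data. “In a growing number of cases, a little inaccuracy is acceptable, because the benefits of using vast amounts of data of varying quality outweigh the costs of using smaller amounts of very exact data ... When there was not that much data around, researchers had to make sure that the figures they took the trouble to collect were as accurate as possible. Tapping vast amounts of data means that we can now allow some inaccuracies to slip in (provided the dataset is not completely incorrect), in return for the insight that a massive body of data provides.”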

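I find the last major change, the shift from causation to correlation, particularly interesting. As the authors put it: “Big data helps answer what, not why, and often that’s good enough.” Or at least it is good enough in the early stages of an empirical science, when we are looking for ways to predict future events and behaviors and do not need a good model or theory that explains why things happen. Those models and theories may come later, though sometimes they never arrive at all.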

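For example, at the MIT CIO Symposium, MIT professor Dimitris Bertsimas took part in the panel “The Reality of Big Data,” moderated by Professor Brynjolfsson. He described his recent research: analyzing decades of cancer treatment data in the hope of improving the life expectancy and quality of life of cancer patients at a reasonable cost. Together with three of his students, he developed models that use patients’ profile data and data on the chemotherapy drugs and dosages they received to predict the odds of survival and death. Their paper, “An Analytics Approach to Designing Clinical Trials for Cancer,” shows that it is possible to predict the outcomes of future clinical trials from past data, even when the exact drug combination being predicted has never been tested in a clinical trial, and even when the reasons why that particular combination works are unknown.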

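“Using big data sometimes means giving up the quest for why in exchange for knowing what ... This means people are beginning to give up on understanding the deeper reasons behind how the world works and are instead simply learning about the associations among phenomena and using them to get things done,” write Cukier and Mayer-Schönberger. “Of course, knowing the causes behind things is desirable. The problem is that causes are often extremely hard to find, and many times, when we think we have identified them, it is nothing more than a self-congratulatory illusion. Behavioral economics has shown that humans are conditioned to see causes even where none exist. So we need to be particularly on guard to prevent our cognitive biases from deluding us; sometimes, we just have to let the data speak.”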

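“In a world where data increasingly shapes decisions, what purpose will remain for people, for intuition, and for going against the facts?” the authors ask in their closing section. “If everyone appeals to the data and harnesses big data tools, what may become the point of differentiation is the unpredictable: the human element of instinct, risk-taking, accident, and even error. If so, there will be a special need to carve out a place for the human: to reserve space for intuition, common sense, and serendipity, so that they are not crowded out by data and one-size-fits-all answers ... However dazzling the power of big data appears, its seductive glimmer must never blind us to its inherent imperfections. We must adopt this technology with an appreciation not just of its power but also of its limitations.”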

===================================================================

By Daniel Castro · May 31, 2013
IT Matters · Tagged: big data, book review, data

There have been a number of attempts to chronicle exactly what “big data” is and why anyone should care. Last year’s The Human Face of Big Data by Rick Smolan and Jennifer Erwitt focused on telling the personal stories behind big data (and accompanied these stories with some great photographs). The year before, James Gleick wrote The Information: A History, A Theory, A Flood, which chronicled how information (and not just big data) has changed our world. The latest entrant is Big Data: A Revolution That Will Transform How We Live, Work and Think by Viktor Mayer-Schönberger and Kenneth Cukier, which focuses heavily on explaining some of the more interesting impacts of living in a big data world. (Personally, I’m still not a fan of the term big data because 1) the term scares off people who think this is equivalent to “Big Oil” and 2) the term underrepresents the innovation happening around “small” data. But since this is the term used in the book, I’ll stick with it for this review.)

The first part of this book provides a fairly compelling vision of how big data is changing how we use data. Unlike some technology proponents who simply ignore the past, Mayer-Schönberger and Cukier make a point to highlight that the use of data itself is not new, but that information technology (IT) has made it possible to collect and analyze data on a scale not seen in the past. The authors explore three main changes they see arising from big data. First, we will have significantly more data available than in the past. This means that we will be able to approach N = all for some datasets rather than just using population samples. Second, as we increasingly quantify the world, we will have more measurement error in our data, but that is okay because with much larger datasets the messiness of data becomes less important. Third, we will focus much less on understanding causation (“why”) and more on understanding correlation (“what”). (For a detailed look at this last point, see Chris Anderson’s essay “The End of Theory.”)

While these chapters are interesting, Mayer-Schönberger and Cukier are at their best later in the book when they describe the economic consequences of big data, both in terms of how data is creating economic value and how data is disrupting many industries. Unlike other economic resources, the value from data is not exhausted after its initial use. Instead, data can be reused an unlimited number of times, either directly or by combining it with additional information. In addition, “data exhaust” that would have been discarded in the past can now be put to practical use, such as Google using typos entered by users in its search engine to create a better spell check program.

This is a crucial point. It is not always possible to know how data will be used when it is collected, and even if some uses are identified, the value of big data comes from its reuse. Policymakers stuck in the old way of thinking want to impose data minimization requirements, which would effectively create a “use once” policy for data. Instead, to take advantage of data-driven economic value, we need policies that allow and encourage responsible reuse of data.

Mayer-Schönberger and Cukier offer one of the best metaphors for the new type of thinking that we need around data. Using a normal camera, a photographer must decide when taking a photo where to focus the lens. In contrast, plenoptic cameras, like the new Lytro camera, capture light field information and allow photographers to change the focus of a picture after the picture has been taken. Like photographers, most data users have been stuck having to decide how to use data at the outset. But with increasingly lower costs for collection, storage and processing, users are now free to explore possible uses after collecting it.

The authors also discuss the new value chain created by companies involved in big data. They identify three primary value propositions: those providing data, those providing the skills, such as the technology and the analytics, and those providing business opportunities. One of their more interesting insights is that new business models are being created to take advantage of data opportunities that do not fit into existing organizations. For example, health insurers formed the non-profit Health Care Cost Institute to combine data sets for research that they could not perform individually. Similarly, UPS spun off its internal data analytics unit because it could provide substantially more value if it had access to data from UPS’s competitors, but this would never happen if it remained part of the parent company. The authors argue that most of the value will be in the data part of the value chain, but that it isn’t there now. Unfortunately, such an assertion is impossible to prove or disprove. We are still in the early stages of assigning value to data, both at the macro-economic level and the firm level. Government statistics agencies need to include more than just goods and services if they want to accurately measure the data economy (Mike Mandel has written a thoughtful piece on this exact point).

While the authors also carve out a chapter to explore the “dark side” of big data, including privacy and misuse, they mostly avoid the overwrought handwringing that typically characterizes writing on this subject. And they recognize that much of the big data revolution does not involve personal data. With regard to personal data, my primary criticism is that they unfairly dismiss de-identification techniques, mostly relying on the critiques leveled by Paul Ohm, while ignoring the shortcomings of his work described by individuals such as Jane Yakowitz or the continued advancement of differential privacy research. They also get wrapped up in a surprisingly lengthy discussion of the risk of criminal profiling similar to what was seen in the movie Minority Report, where individuals were arrested for crimes before they were actually committed. While perhaps an interesting thought experiment, the authors provide little evidence that this is anything but a far-fetched science-fiction nightmare.

Overall, the book is an enjoyable read if for nothing else than some of the great nuggets of big data trivia that show just how much data has changed. For example, Mayer-Schönberger and Cukier report that the Sloan Digital Sky Survey generated 140 terabytes of information in about 10 years; its successor, the Large Synoptic Survey Telescope in Chile, will generate as much every 5 days. In addition, the way they handle the risks section of their book bodes well for the future of data: it seems the more people come to understand it, the fewer concerns they have.
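To put the sky survey comparison in perspective, a quick back-of-the-envelope calculation compares the two data rates. This sketch takes the figures quoted above at face value and assumes a 365-day year; the variable names are purely illustrative.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap years

# Figures as quoted in the review: 140 TB in about 10 years vs. 140 TB every 5 days.
sdss_tb, sdss_days = 140, 10 * DAYS_PER_YEAR
lsst_tb, lsst_days = 140, 5

sdss_rate = sdss_tb / sdss_days  # terabytes per day
lsst_rate = lsst_tb / lsst_days  # terabytes per day
speedup = lsst_rate / sdss_rate

print(f"Sloan Digital Sky Survey: {sdss_rate:.3f} TB/day")
print(f"Large Synoptic Survey Telescope: {lsst_rate:.1f} TB/day")
print(f"ratio: about {speedup:.0f}x")
```

On these figures, the newer telescope would collect data roughly 730 times faster than its predecessor.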

Photo credit: Chatham House