Original title: Information Platforms and the Rise of the Data Scientist

Author: Jeff Hammerbacher



Jeff Hammerbacher is a data scientist, former head of the Facebook Data team, and a co-founder of Cloudera. Before joining Cloudera, he was an Entrepreneur in Residence at Accel Partners.



Facebook's Information Platform and Business Intelligence


Our first attempt at an offline repository of information involved a Python script for farming queries out to Facebook’s tier of MySQL servers and a daemon process, written in C++, for processing our event logs in real time. When the scripts worked as planned, we collected about 10 gigabytes a day. I later learned that this aspect of our system is commonly termed the “ETL” process, for “Extract, Transform, and Load.”

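A minimal sketch of that kind of nightly ETL pass might look like the following. All host, column, and event names here are hypothetical stand-ins, and the `fetch_rows` stub takes the place of a real MySQL client query:

```python
# Hedged sketch of a nightly ETL pass: extract rows from each source
# database, apply a light transform, and load them into a staging file.
import csv
from datetime import date

def fetch_rows(host):
    """Placeholder for querying one MySQL server in the tier."""
    return [(1, "profile_view"), (2, "message_send")]  # (user_id, event)

def run_etl(hosts, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for host in hosts:
            for user_id, event in fetch_rows(host):              # Extract
                row = (date.today().isoformat(), host, user_id, event)  # Transform
                writer.writerow(row)                             # Load

run_etl(["db001", "db002"], "daily_extract.csv")
```

In a real deployment the extract step would fan out across thousands of servers and the staging file would feed a bulk loader, but the three-phase shape is the same.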


Once our Python scripts and C++ daemon had siphoned the data from Facebook’s source systems, we stuffed the data into a MySQL database for offline querying. We also had some scripts and queries that ran over the data once it landed in MySQL to aggregate it into more useful representations. It turns out that this offline database for decision support is better known as a “Data Warehouse.”



Finally, we had a simple PHP script to pull data from the offline MySQL database and display summaries of the information we had collected to internal users. For the first time, we were able to answer some important questions about the impact of certain site features on user activity. Early analyses looked at maximizing growth through several channels: the layout of the default page for logged-out users, the source of invitations, and the design of the email contact importer. In addition to analyses, we started to build simple products using historical data, including an internal project to aggregate features of sponsored group members that proved popular with brand advertisers.



I didn’t realize it at the time, but with our ETL framework, Data Warehouse, and internal dashboard, we had built a simple “Business Intelligence” system.



A Business Intelligence System

In a 1958 paper in the IBM Systems Journal, Hans Peter Luhn describes a system for “selective dissemination” of documents to “action points” based on the “interest profiles” of the individual action points. The author demonstrates shocking prescience. The title of the paper is “A Business Intelligence System,” and it appears to be the first use of the term “Business Intelligence” in its modern context.



In addition to the dissemination of information in real time, the system was to allow for “information retrieval”—search—to be conducted over the entire document collection. Luhn’s emphasis on action points focuses the role of information processing on goal completion. In other words, it’s not enough to just collect and aggregate data; an organization must improve its capacity to complete critical tasks because of the insights gleaned from the data. He also proposes “reporters” to periodically sift the data and selectively move information to action points as needed.


The field of Business Intelligence has evolved over the five decades since Luhn’s paper was published, and the term has come to be more closely associated with the management of structured data. Today, a typical business intelligence system consists of an ETL framework pulling data on a regular basis from an array of data sources into a Data Warehouse, on top of which sits a Business Intelligence tool used by business analysts to generate reports for internal consumption. How did we go from Luhn’s vision to the current state of affairs?



E. F. Codd first proposed the relational model for data in 1970, and IBM had a working prototype of a relational database management system (RDBMS) by the mid-1970s. Building user-facing applications was greatly facilitated by the RDBMS, and by the early 1980s, their use was proliferating.



In 1983, Teradata sold the first relational database designed specifically for decision support to Wells Fargo. A few years later, in 1986, Ralph Kimball founded Red Brick Systems to build databases for the same market. Solutions were developed using Teradata and Red Brick’s offerings, but it was not until 1991 that the first canonical text on data warehousing was published.



Bill Inmon’s Building the Data Warehouse (Wiley) is a coherent treatise on data warehouse design and includes detailed recipes and best practices for building data warehouses. Inmon advocates constructing an enterprise data model after careful study of existing data sources and business goals.



In 1995, as Inmon’s book grew in popularity and data warehouses proliferated inside enterprise data centers, The Data Warehouse Institute (TDWI) was formed. TDWI holds conferences and seminars and remains a critical force in articulating and spreading knowledge about data warehousing. That same year, data warehousing gained currency in academic circles when Stanford University launched its WHIPS research initiative.



A challenge to the Inmon orthodoxy came in 1996 when Ralph Kimball published The Data Warehouse Toolkit (Wiley). Kimball advocated a different route to data warehouse nirvana, beginning by throwing out the enterprise data model. Instead, Kimball argued that different business units should build their own data “marts,” which could then be connected with a “bus.” Further, instead of using a normalized data model, Kimball advocated the use of dimensional modeling, in which the relational data model was manhandled a bit to fit the particular workload seen by many data warehouse implementations.



As data warehouses grow over time, it is often the case that business analysts would like to manipulate a small subset of data quickly. Often this subset of data is parameterized by a few “dimensions.” Building on these observations, the CUBE operator was introduced in 1997 by a group of Microsoft researchers, including Jim Gray. The new operator enabled fast querying of small, multidimensional data sets.

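The semantics of CUBE can be illustrated in a few lines of plain Python: it computes one aggregate for every subset of the grouping dimensions, with a rolled-up marker (conventionally `ALL`) on the dimensions left out of each subset. This is only an illustrative sketch of the operator's output, not how a database executes it:

```python
# Illustrative sketch of the CUBE operator: one aggregate per subset of
# the grouping dimensions, with "ALL" marking the rolled-up axes.
from itertools import combinations
from collections import defaultdict

def cube(rows, dims, measure):
    """rows: list of dicts; dims: dimension columns; measure: numeric column."""
    results = defaultdict(int)
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] if d in subset else "ALL" for d in dims)
                results[key] += row[measure]
    return dict(results)

rows = [
    {"country": "CA", "year": 2006, "signups": 3},
    {"country": "NO", "year": 2006, "signups": 2},
    {"country": "CA", "year": 2007, "signups": 5},
]
totals = cube(rows, ["country", "year"], "signups")
# totals[("ALL", "ALL")] -> 10, totals[("CA", "ALL")] -> 8
```

Precomputing these subset aggregates is what makes subsequent slicing by any combination of dimensions fast.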


Both dimensional modeling and the CUBE operator were indications that, despite its success for building user-facing applications, the relational model might not be best for constructing an Information Platform. Further, the document and the action point, not the table, were at the core of Luhn’s proposal for a business intelligence system. On the other hand, an entire generation of engineers had significant expertise in building systems for relational data processing.

With a bit of history at our back, let’s return to the challenges at Facebook.



The Death and Rebirth of a Data Warehouse

At Facebook, we were constantly loading more data into, and running more queries over, our MySQL data warehouse. Having only run queries over the databases that served the live site, we were all surprised at how long a query could run in our data warehouse. After some discussion with seasoned data warehousing veterans, I realized that it was normal to have queries running for hours and sometimes days, due to query complexity, massive data volumes, or both.




One day, as our database was nearing a terabyte in size, the mysqld daemon process came to a sudden halt. After some time spent on diagnostics, we tried to restart the database. Upon initiating the restart operation, we went home for the day.



When I returned to work the next morning, the database was still recovering. To get a consistent view of data that’s being modified by many clients, a database server maintains a persistent list of all edits called the “redo log” or the “write-ahead log.” If the database server is unceremoniously killed and restarted, it will reread the recent edits from the redo log to get back up to speed. Given the size of our data warehouse, the MySQL database had quite a bit of recovery to catch up on. It was three days before we had a working data warehouse again.
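The mechanics of that recovery can be sketched in a few lines. This toy key-value store appends every edit to a log before applying it, and rebuilds its state by replaying the log on restart; it is a deliberately minimal sketch, not how MySQL's InnoDB log actually works:

```python
# Minimal write-ahead log sketch: every edit is appended to the log
# before it touches the in-memory table, so a crashed process can
# replay the log on restart to recover its state.
import json
import os

LOG = "redo.log"

def apply_edit(table, key, value, log=True):
    if log:
        with open(LOG, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable before the edit is acknowledged
    table[key] = value

def recover():
    """Rebuild state by replaying the redo log, as a restarted server does."""
    table = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                edit = json.loads(line)
                apply_edit(table, edit["key"], edit["value"], log=False)
    return table
```

Recovery time grows with the amount of logged work to replay, which is why a terabyte-scale instance can take days to come back.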



We made the decision at that point to move our data warehouse to Oracle, whose database software had better support for managing large data sets. We also purchased some expensive high-density storage and a powerful Sun server to run the new data warehouse.



During the transfer of our processes from MySQL to Oracle, I came to appreciate the differences between supposedly standard relational database implementations. The bulk import and export facilities of each database used completely different mechanisms. Further, the dialect of SQL supported by each was different enough to force us to rewrite many of our queries. Even worse, the Python client library for Oracle was unofficial and a bit buggy, so we had to contact the developer directly.



After a few weeks of elbow grease, we had the scripts rewritten to work on the new Oracle platform. Our nightly processes were running without problems, and we were excited to try out some of the tools from the Oracle ecosystem. In particular, Oracle had an ETL tool called Oracle Warehouse Builder (OWB) that we hoped could replace our handwritten Python scripts. Unfortunately, the software did not expect the sheer number of data sources we had to support: at the time, Facebook had tens of thousands of MySQL databases from which we collected data each night. Not even Oracle could help us tackle our scaling challenges on the ETL side, but we were happy to have a running data warehouse with a few terabytes of data.



And then we turned on clickstream logging: our first full day sent 400 gigabytes of unstructured data rushing over the bow of our Oracle database. Once again, we cast a skeptical eye on our data warehouse.



Beyond the Data Warehouse

According to IDC, the digital universe will expand to 1,800 exabytes by 2011. The vast majority of that data will not be managed by relational databases. There's an urgent need for data management systems that can extract information from unstructured data in concert with structured data, but there is little consensus on the way forward.



Natural language data in particular is abundant, rich with information, and poorly managed by a data warehouse. To manage natural language and other unstructured data, often captured in document repositories and voice recordings, organizations have looked beyond the offerings of data warehouse vendors to various new fields, including one known as enterprise search.



While most search companies built tools for navigating the collection of hyperlinked documents known as the World Wide Web, a few enterprise search companies chose to focus on managing internal document collections. Autonomy Corporation, founded in 1996 by Cambridge University researchers, leveraged Bayesian inference algorithms to facilitate the location of important documents.



Fast Search and Transfer (FAST) was founded in 1997 in Norway with more straightforward keyword search and ranking at the heart of its technology.



Two years later, Endeca was founded with a focus on navigating document collections using structured metadata, a technique known as “faceted search.” Google, seeing an opportunity to leverage its expertise in the search domain, introduced an enterprise search appliance in 2000.



In a few short years, enterprise search has grown into a multibillion-dollar market segment that is almost totally separate from the data warehouse market. Endeca has some tools for more traditional business intelligence, and some database vendors have worked to introduce text mining capabilities into their systems, but a complete, integrated solution for structured and unstructured enterprise data management remains unrealized.



Both enterprise search and data warehousing are technical solutions to the larger problem of leveraging the information resources of an organization to improve performance. As far back as 1944, MIT professor Kurt Lewin proposed “action research” as a framework that uses “a spiral of steps, each of which is composed of a circle of planning, action, and fact-finding about the result of the action.”



A more modern approach to the same problem can be found in Peter Senge’s “Learning Organization” concept, detailed in his book The Fifth Discipline (Broadway Business).



Both management theories rely heavily upon an organization’s ability to adapt its actions after reflecting upon information collected from previous actions. From this perspective, an Information Platform is the infrastructure required by a Learning Organization to ingest, process, and generate the information necessary for implementing the action research spiral.



Having now looked at structured and unstructured data management, let’s get back to the Facebook story.



On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours. It was clear we’d need to aggregate our log files outside of the database and store only the summary information for later querying.


Luckily, a top engineer from a large web property had recently joined our team and had experience processing clickstream data at web scale. In just a few weeks, he built a parallelized log processing system called Cheetah that was able to process a day of clickstream data in two hours. There was much rejoicing.


Despite our success, Cheetah had some drawbacks: first, after processing the clickstream data, the raw data was stored in archival storage and could not be queried again.



In addition, Cheetah pulled the clickstream data from a shared NetApp filer with limited read bandwidth. The “schema” for each logfile was embedded in the processing scripts rather than stored in a format that could be queried.



We did not collect progress information, and we scheduled Cheetah jobs using a basic Unix utility called cron, so no sophisticated load-sharing logic could be applied. Most importantly, however, Cheetah was not open source.



We had a small team and could not afford the resources required to develop, maintain, and train new users to use our proprietary system.



The Apache Hadoop project, started in late 2005 by Doug Cutting and Mike Cafarella, was a top candidate to replace Cheetah. Named after the stuffed elephant of Doug’s son, the Hadoop project aimed to implement Google’s distributed file system and MapReduce technologies under the Apache 2.0 license. Yahoo! hired Doug Cutting in January 2006 and devoted significant engineering resources to developing Hadoop.


In April 2006, the software was able to sort 1.9 terabytes in 47 hours using 188 servers. Although Hadoop’s design improved on Cheetah’s in several areas, the software was too slow for our needs at that time. By April 2008, however, Hadoop was able to sort 1 terabyte in 209 seconds using 910 servers. With the improved performance numbers in hand, I was able to convince our operations team to stick three 500-gigabyte SATA drives in the back of 60 unused web servers, and we went forward with our first Hadoop cluster at Facebook.


Initially, we started streaming a subset of our logs into both Hadoop and Cheetah. The enhanced programmability of Hadoop coupled with the ability to query the historical data led to some interesting projects. One application involved scoring all directed pairs of interacting users on Facebook to determine their affinity; this score could then be used for search and News Feed ranking. After some time, we migrated all Cheetah workflows to Hadoop and retired the old system. Later, the transactional database collection processes were moved to Hadoop as well.



With Hadoop, our infrastructure was able to accommodate unstructured and structured data analysis at a massive scale. As the platform grew to hundreds of terabytes and thousands of jobs per day, we learned that new applications could be built and new questions could be answered simply because of the scale at which we were now able to store and retrieve data.



When Facebook opened registration to all users, the user population grew at disproportionately rapid rates in some countries. At the time, however, we were not able to perform granular analyses of clickstream data broken out by country. Once our Hadoop cluster was up, we were able to reconstruct how Facebook had grown rapidly in places such as Canada and Norway by loading all of our historical access logs into Hadoop and writing a few simple MapReduce jobs.



Every day, millions of semi-public conversations occur on the walls of Facebook users. One internal estimate put the size of the wall post corpus at 10 times the size of the blogosphere! Before Hadoop, however, the contents of those conversations remained inaccessible for data analysis.


In 2007, a summer intern with a strong interest in linguistics and statistics, Roddy Lindsay, joined the Data team. Using Hadoop, Roddy was able to single-handedly construct a powerful trend analysis system called Lexicon that continues to process terabytes of wall post data every night; you can see the results for yourself at https://facebook.com/lexicon.



Having the data from disparate systems stored in a single repository proved critical for the construction of a reputation scoring system for Facebook applications.



Soon after the launch of the Facebook Platform in May of 2007, our users were inundated with requests to add applications. We quickly realized that we would need a tool to separate the useful applications from those the users perceived as spam.



Using data collected from the API servers, user profiles, and activity data from the site itself, we were able to construct a model for scoring applications that allowed us to allocate invitations to the applications deemed most useful to users.



The Unreasonable Effectiveness of Data

In a recent paper, a trio of Google researchers distilled what they have learned from trying to solve some of machine learning’s most difficult challenges.



When discussing the problems of speech recognition and machine translation, they state that, “invariably, simple models and a lot of data trump more elaborate models based on less data.”



 I don’t intend to debate their findings; certainly there are domains where elaborate models are successful. Yet based on their experiences, there does exist a wide class of problems for which more data and simple models are better.



At Facebook, Hadoop was our tool for exploiting the unreasonable effectiveness of data. For example, when we were translating the site into other languages, we tried to target users who spoke a specific language to enlist their help in the translation task.



One of our Data Scientists, Cameron Marlow, crawled all of Wikipedia and built character trigram frequency counts per language. Using these frequency counts, he built a simple classifier that could look at a set of wall posts authored by a user and determine his spoken language.

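The trigram approach is simple enough to sketch end to end. The tiny "corpora" below are stand-ins for the per-language Wikipedia dumps, and the scoring rule is a deliberately crude overlap count rather than whatever Cameron's classifier actually used:

```python
# Hedged sketch of character-trigram language identification: build a
# trigram frequency profile per language, then score a text against
# each profile and pick the best-matching language.
from collections import Counter

def trigrams(text):
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(corpora):
    """corpora: {language: sample text} -> {language: trigram Counter}"""
    return {lang: trigrams(text) for lang, text in corpora.items()}

def classify(text, profiles):
    counts = trigrams(text)
    def score(profile):
        return sum(counts[t] * profile[t] for t in counts)
    return max(profiles, key=lambda lang: score(profiles[lang]))

profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "no": "den raske brune reven hopper over den late hunden og katten",
})
print(classify("the dog and the fox", profiles))  # prints "en"
```

With profiles built from full Wikipedia dumps instead of toy strings, even a classifier this simple becomes surprisingly accurate: an instance of the "simple models, lots of data" approach discussed below.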


Using this classifier, we were able to actively recruit users into our translation program in a targeted fashion. Both Facebook and Google use natural language data in many applications; see Chapter 14 of this book for Peter Norvig’s exploration of the topic.



The observations from Google point to a third line of evolution for modern business intelligence systems: in addition to managing structured and unstructured data in a single system, they must scale to store enough data to enable the “simple models, lots of data” approach to machine learning.



New Tools and Applied Research

Most of the early users of the Hadoop cluster at Facebook were engineers with a taste for new technologies. To make the information accessible to a larger fraction of the organization, we built a framework for data warehousing on top of Hadoop called Hive.



Hive includes a SQL-like query language with facilities for embedding MapReduce logic, as well as table partitioning, sampling, and the ability to handle arbitrarily serialized data.



The last feature was critical, as the data collected into Hadoop was constantly evolving in structure; allowing users to specify their own serialization format allowed us to pass the problem of specifying structure for the data to those responsible for loading the data into Hive.
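The "schema on read" idea behind pluggable serialization can be sketched with a reader that applies a caller-supplied deserializer, so whoever loads the data owns the description of its structure. The function and field names here are illustrative, not Hive's actual SerDe interface:

```python
# Sketch of schema on read: the table reader applies whatever
# deserializer the data's owner registered, so structure is specified
# by the team loading the data, not by the warehouse.
def read_table(lines, deserialize):
    return [deserialize(line) for line in lines]

def tab_separated(line):
    """One team's format: tab-separated user_id and action."""
    user_id, action = line.rstrip("\n").split("\t")
    return {"user_id": int(user_id), "action": action}

raw = ["7\tprofile_view\n", "9\tmessage_send\n"]
rows = read_table(raw, tab_separated)
# rows[0] -> {"user_id": 7, "action": "profile_view"}
```

Because the deserializer travels with the data's owners, the warehouse itself never has to be updated when a log format evolves.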



In addition, a simple UI for constructing Hive queries, called HiPal, was built. Using the new tools, non-engineers from marketing, product management, sales, and customer service were able to author queries over terabytes of data.



After several months of internal use, Hive was contributed back to Hadoop as an official subproject under the Apache 2.0 license and continues to be actively developed.



In addition to Hive, we built a portal for sharing charts and graphs called Argus (inspired by IBM’s work on the Many Eyes project), a workflow management system called Databee, a framework for writing MapReduce scripts in Python called PyHive, and a storage system for serving structured data to end users called Cassandra (now available as open source in the Apache Incubator).



As the new systems stabilized, we ended up with multiple tiers of data managed by a single Hadoop cluster. All data from the enterprise, including application logs, transactional databases, and web crawls, was regularly collected in raw form into the Hadoop distributed file system (HDFS).



Thousands of nightly Databee processes would then transform some of this data into a structured form and place it into the directory of HDFS managed by Hive. Further aggregations were performed in Hive to generate reports served by Argus.



Additionally, within HDFS, individual engineers maintained “sandboxes” under their home directories against which prototype jobs could be run.



At its current capacity, the cluster holds nearly 2.5 PB of data, and new data is added at a rate of 15 TB per day. Over 3,000 MapReduce jobs are run every day, processing 55 terabytes of data. To accommodate the different priorities of jobs that are run on the cluster, we built a job scheduler to perform fair sharing of resources over multiple queues.

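One simple form of fair sharing can be sketched as follows: each scheduling step hands the next task slot to the queue that has received the least service so far. This is a toy illustration of the policy, not the scheduler we actually built:

```python
# Hedged sketch of fair sharing across queues: each slot goes to the
# queue that has been served the least so far, so a burst of jobs in
# one queue cannot starve the others.
from collections import deque

def fair_schedule(queues, slots):
    """queues: {name: deque of jobs}. Returns jobs in the order they run."""
    served = {name: 0 for name in queues}
    order = []
    for _ in range(slots):
        candidates = [n for n in queues if queues[n]]
        if not candidates:
            break
        name = min(candidates, key=lambda n: served[n])
        order.append(queues[name].popleft())
        served[name] += 1
    return order

queues = {
    "reports": deque(["r1", "r2", "r3"]),
    "adhoc": deque(["a1"]),
}
print(fair_schedule(queues, 4))  # prints ['r1', 'a1', 'r2', 'r3']
```

A production scheduler would additionally weight queues by priority and account for running-job resource usage, but the least-served-first core is the same.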


In addition to powering internal and external reports, a/b testing pipelines, and many different data-intensive products and services, Facebook’s Hadoop cluster enabled some interesting applied research projects.



One longitudinal study conducted by Data Scientists Itamar Rosenn and Cameron Marlow set out to determine what factors were most critical in predicting long-term user engagement.



We used our platform to select a sample of users, trim outliers, and generate a large number of features for use in several least-angle regressions against different measures of engagement. Some features we were able to generate using Hadoop included various measures of friend network density and user categories based on profile features.



Another internal study to understand what motivates content contribution from new users was written up in the paper “Feed Me: Motivating Newcomer Contribution in Social Network Sites,” published at the 2009 CHI conference.



A more recent study from the Facebook Data team looks at how information flows through the Facebook social graph; the study is titled “Gesundheit! Modeling Contagion through Facebook News Feed,” and has been accepted for the 2009 ICWSM conference.



Every day, evidence is collected, hypotheses are tested, applications are built, and new insights are generated using the shared Information Platform at Facebook. Outside of Facebook, similar systems were being constructed in parallel.



MAD Skills and Cosmos

In “MAD Skills: New Analysis Practices for Big Data,” a paper from the 2009 VLDB conference, the analysis environment at Fox Interactive Media (FIM) is described in detail.



Using a combination of Hadoop and the Greenplum database system, the team at FIM has built a familiar platform for data processing in isolation from our work at Facebook.



The paper’s title refers to three tenets of the FIM platform: Magnetic, Agile, and Deep. “Magnetic” refers to the desire to store all data from the enterprise, not just the structured data that fits into the enterprise data model.



Along the same lines, an “Agile” platform should handle schema evolution gracefully, enabling analysts to work with data immediately and evolve the data model as needed. “Deep” refers to the practice of performing more complex statistical analyses over data.



In the FIM environment, data is separated into staging, production, reporting, and sandbox schemas within a single Greenplum database, quite similar to the multiple tiers inside of Hadoop at Facebook described earlier.



Separately, Microsoft has published details of its data management stack. In papers titled “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks” and “SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets,” Microsoft describes an information platform remarkably similar to the one we had built at Facebook.



 Its infrastructure includes a distributed file system called Cosmos and a system for parallel data processing called Dryad; it has even invented a SQL-like query language called SCOPE.



Three teams working with three separate technology stacks have evolved similar platforms for processing large amounts of data. What’s going on here? By decoupling the requirements of specifying structure from the ability to store data and innovating on APIs for data retrieval, the storage systems of large web properties are starting to look less like databases and more like data spaces.



Information Platforms As Dataspaces

Anecdotally, similar petabyte-scale platforms exist at companies such as Yahoo!, Quantcast, and Last.fm. These platforms are not quite data warehouses, as they’re frequently not using a relational database or any traditional data warehouse modeling techniques.



They’re not quite enterprise search systems, as only some of the data is indexed and they expose far richer APIs. And they’re often used for building products and services in addition to traditional data analysis workloads.



Similar to the brain and the library, these shared platforms for data processing serve as the locus of their organization’s efforts to ingest, process, and generate information, and with luck, they hasten their organization’s pace of learning from empirical data.



In the database community, there has been some work to transition the research agenda from purely relational data management to a more catholic system for storage and querying of large data sets called a “dataspace.”



In “From Databases to Dataspaces: A New Abstraction for Information Management,” the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.



I’d contend that the Information Platforms we’ve described are real-world examples of dataspaces: single storage systems for managing petabytes of structured and unstructured data from all parts of an organization that expose a variety of data access APIs for engineering, analysis, and reporting.



Given the proliferation of these systems in industry, I’m hopeful that the database community continues to explore the theoretical foundations and practical implications of dataspaces.



An Information Platform is the critical infrastructure component for building a Learning Organization. The most critical human component for accelerating the learning process and making use of the Information Platform is taking the shape of a new role: the Data Scientist.





The Data Scientist

In a recent interview, Hal Varian, Google’s chief economist, highlighted the need for employees able to extract information from the Information Platforms described earlier. As Varian puts it, “find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”



At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team.



The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion.



 To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”



In the financial services domain, large data stores of past market activity are built to serve as the proving ground for complex new models developed by the Data Scientists of their domain, known as Quants. Outside of industry, I’ve found that grad students in many scientific domains are playing the role of the Data Scientist.



One of our hires for the Facebook Data team came from a bioinformatics lab where he was building data pipelines and performing offline data analysis of a similar kind. The well-known Large Hadron Collider at CERN generates reams of data that are collected and pored over by graduate students looking for breakthroughs.



Recent books such as Davenport and Harris’s Competing on Analytics (Harvard Business School Press, 2007), Baker’s The Numerati (Houghton Mifflin Harcourt, 2008), and Ayres’s Super Crunchers (Bantam, 2008) have emphasized the critical role of the Data Scientist across industries in enabling an organization to improve over time based on the information it collects.





In conjunction with the research community’s investigation of dataspaces, further definition for the role of the Data Scientist is needed over the coming years. By better articulating the role, we’ll be able to construct training curricula, formulate promotion hierarchies, organize conferences, write books, and fill in all of the other trappings of a recognized profession.



In the process, the pool of available Data Scientists will expand to meet the growing need for expert pilots for the rapidly proliferating Information Platforms, further speeding the learning process across all organizations.




When faced with the challenge of building an Information Platform at Facebook, I found it helpful to look at how others had attempted to solve the same problem across time and problem domains.



As an engineer, my initial approach was directed by available technologies and appears myopic in hindsight. The biggest challenge was keeping focused on the larger problem of building the infrastructure and human components of a Learning Organization rather than specific technical systems, such as data warehouses or enterprise search systems.



I’m certain that the hardware and software employed to build an Information Platform will evolve rapidly, and the skills required of a Data Scientist will change at the same rate.



Staying focused on the goal of making the learning process move faster will benefit both organizations and science. The future belongs to the Data Scientist!




