GraphLab: Taking Big Data Analytics from Idea to Production


mishidemudong


GraphLab: Big Data Analytics Scaled From Inspiration to Production

Q&A with Carlos Guestrin, CEO of GraphLab

The AWS Startup Spotlight features startups all over the world building innovative, disruptive businesses on top of cloud infrastructure.

GraphLab, Inc., is the company behind a complete platform for using scalable machine learning to build big data analytics products. Companies like Zillow, Adobe, Zynga, Pandora, Bosch, and ExxonMobil rely on GraphLab to turn big data inspiration into predictive applications in production, in the form of recommender systems, fraud detection systems, and sentiment and social network analyzers, among other applications and services.

Carlos Guestrin is the CEO and cofounder of GraphLab and the Amazon Professor of Machine Learning at the University of Washington. A world-recognized leader in the field of machine learning, Carlos was named one of the 2008 “Brilliant 10” by Popular Science magazine, received the 2009 IJCAI Computers and Thought Award for his contributions to Artificial Intelligence, and garnered a Presidential Early Career Award for Scientists and Engineers (PECASE).

What is machine learning and how has it evolved over the past 10 years?

Machine learning is a science that advances the idea that computers can be programmed to “learn” from patterns in data and use that knowledge as a basis for making highly accurate predictions and decisions in an automated way. Over the past decade, we have seen machine learning manifested in applications that enable self-driving cars, online stores that recommend products we’re likely to buy, targeted marketing, and credit card fraud detection, to name a few. The variety and volume of data now available has put machine learning at the forefront of investment, because it promises to transform our “big data” into insights that improve all areas of life and business.

Can you share the story behind GraphLab? How did it come to be?

GraphLab began in 2008 at Carnegie Mellon University under my stewardship and that of my co-founding doctoral and post-doctoral students. The team had been working on the application of machine learning for advanced graph analysis and required more scalable tools to implement the work they were publishing and sharing. The tools they built became so popular that a modest workshop to discuss them drew over 300 participants, 10 times more than anticipated. This outcome pointed to both an unmet need and a well-designed product platform. What the team had done was leverage Amazon EC2 and significant advances in graph representation, asynchronous communication, and scheduling to achieve orders-of-magnitude performance gains over alternative systems for graph analysis.

Fast forward to 2012, a time during which my wife Emily, also a Computer Science professor, and I were considering new job opportunities. We had been speaking with universities out east when Jeff Bezos stepped in to help with the UW recruiting effort. The Amazon founder and Chief Executive met with us both and subsequently established two endowed professorships in machine learning to help fund our salaries.

With that, I became the new Amazon Professor of Machine Learning for UW and moved to the PNW, bringing with me some of my talented students and aspirations of making a significant impact in the emerging area of big data analytics. A year later, with funding from Madrona Ventures and NEA, GraphLab the company was born, and a short year after that, in March 2014, GraphLab’s first commercial offering, GraphLab Create™, was released in beta form.

Tell us about GraphLab Create and how it simplifies big data analysis.

Transforming raw data into insights and building predictive applications is a laborious and complex process today. Such endeavors require data scientists or similarly knowledgeable software engineers, plus an array of disparate and complex tools to gather, clean, model, analyze, and ultimately present those insights to some store or application. In many instances the process is made lengthier and more expensive by the need to reimplement the prototype as code that can be used in a production environment. This situation leaves many data scientists hamstrung by a lack of programming experience and organizations unable to derive value from their data.

GraphLab makes a platform that provides all of the tools a data scientist needs to go from an inspiring idea to a production-worthy data product quickly. Current users report that GraphLab Create helps them be immensely more productive and deliver value more quickly while requiring less programming expertise and fewer personnel.

And it supports predictive applications that can be deployed on AWS?

The journey from raw data to business-transforming predictive analytics often starts with a data scientist, a laptop, and a prototype, but it invariably requires a critical proof-of-concept stage to test the model at scale, likely in the cloud. Here, for many, the journey is cut short: although scaling to AWS is easy enough, reimplementing a prototype as production-ready code is not easy for many data scientists.

That’s where GraphLab comes in. GraphLab Create can be run entirely in the AWS environment. Data scientists can take their GraphLab-built prototype from a laptop to AWS in seconds by changing a single line of code. Data sets of any size as well as models can be loaded and accessed from Amazon S3. GraphLab also provides tools for deploying, monitoring, and optimizing data pipelines and predictive services across AWS clusters.
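The "single line of code" idea can be illustrated with a small sketch. This is not GraphLab Create's API: the `Environment` class, the `LOCAL` and `AWS` targets, the bucket name, and the `run` helper are all hypothetical names invented for this illustration, and nothing here contacts AWS. It only shows, under those assumptions, how a job written once can be bound to a different execution target by changing one assignment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Rating = Tuple[str, str, float]  # (user, item, score)


@dataclass(frozen=True)
class Environment:
    """Hypothetical execution target; not part of any real GraphLab API."""
    name: str
    data_root: str  # where input data would be read from


# Hypothetical targets: a laptop directory and a placeholder S3 bucket.
LOCAL = Environment(name="local", data_root="file:///tmp/data")
AWS = Environment(name="aws", data_root="s3://example-bucket/data")


def top_items(ratings: List[Rating], k: int = 2) -> List[str]:
    """Toy 'model': rank items by mean rating, ties broken by item name."""
    scores: Dict[str, List[float]] = {}
    for _user, item, score in ratings:
        scores.setdefault(item, []).append(score)
    means = {item: sum(s) / len(s) for item, s in scores.items()}
    return sorted(means, key=lambda item: (-means[item], item))[:k]


def run(env: Environment, job: Callable[[List[Rating]], List[str]],
        ratings: List[Rating]) -> dict:
    """Bind the job to a target and run it.

    A real platform would ship the job to `env`; this sketch simply calls
    it in-process and records which target was selected.
    """
    return {
        "environment": env.name,
        "source": f"{env.data_root}/ratings.csv",
        "result": job(ratings),
    }


ratings = [("u1", "a", 5.0), ("u2", "a", 3.0), ("u1", "b", 4.5), ("u3", "c", 2.0)]

env = LOCAL  # the single line to change: env = AWS
report = run(env, top_items, ratings)
print(report)
```

The design point is that the job (`top_items`) never refers to where it runs, so the prototype and the scaled-out version share the same code and differ only in the target they are handed.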

What are some common use cases? How are people using GraphLab Create?

The most common uses of GraphLab Create span a variety of disciplines:

Retail: recommender systems and pricing prediction (e.g., airfare)
Financial services: fraud detection through behavior and transaction analysis
Biomedicine: disease prediction by medical records analysis, personalized drug design
Telecommunications: prediction of customer churn
Social network analysis: identification of key network and community influencers
Marketing and media: sentiment analysis, targeting

Are only enterprise companies using predictive applications?

No, not at all. State and local governments use GraphLab to analyze citizen sentiment and pinpoint which areas of local infrastructure need the most immediate attention. Biomedical research teams have used GraphLab to analyze clinical notes in the prediction of patient propensity toward a particular disease. Sensor networks of all types help provide valuable data whose analysis can make for safer air and rail travel. Generally, governments, research organizations, and health and services providers are all mirroring the desires of industry to put their data to work in improving the effectiveness of their processes and people.

So it’s important for early-stage companies to be considering data science? When is the right time for a startup to think about big data?

Data science and data-driven decision making are key considerations for companies of every size. Large companies are updating their dated customer recommender systems to take advantage of more advanced predictive techniques that include real-time inputs, not just purchase histories. Text and sentiment analysis of surveys and comment fields is helping boost customer satisfaction and reduce churn. Similarly, early-stage companies that have data analytics–based business models are likely to begin life with data scientists and the application of machine learning at the core of their prototypes. This is particularly true of firms in the sales and marketing as well as media and advertising verticals that are innovating in customer targeting, acquisition, and retention. Other newer startup categories for which data science is central include those creating highly specialized predictive services customized for a specific vertical and application, for instance predicting financial waste in health care, supply chain optimization, or insurance claim fraud detection.

What all of these firms, large and small, have in common is lots of data but few data science resources and very limited compute. This is where the true power of GraphLab with AWS comes into view. With the scaling barrier removed, big data finally moves past hype to a very real source of inspired products.

Ten years from now how do you predict machine learning will be used to drive big data insights?

In a decade, machine learning will be accessible to many more people than the data scientists and skilled engineers who are the most productive with it today. Business analysts and line-of-business owners, for instance, will come to rely on predictive services for near real-time access to conditions that affect profit. Service providers in the government, health care, and private sectors will be able to customize products to the needs of individuals. It is also likely that awareness of machine learning and its impact on data-driven decision making will become mainstream enough for nontechnical persons to understand its value as a significant differentiator of products and services.

What’s next for GraphLab?

GraphLab is on a journey to democratize machine learning and aims to be instrumental in actualizing the ten-year vision discussed. In the short term, we are looking forward to making version 1.0 of our flagship product, GraphLab Create, generally available on October 15. With this initial commercial offering, the power of data science will be delivered to the hands of every organization, and many more big data aspirations will make their way to production by means of GraphLab.

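GraphLab provides a complete platform that lets organizations use scalable machine learning to build big data analytics products. Its customers include Zillow, Adobe, Zynga, Pandora, Bosch, and ExxonMobil, which use it to turn big data ideas into predictive applications running in production, such as recommender systems, fraud detection systems, and sentiment and social network analyzers.

Carlos Guestrin is the cofounder and CEO of GraphLab and the Amazon Professor of Machine Learning at the University of Washington. An internationally recognized leader in machine learning, Carlos was named one of Popular Science magazine’s 2008 “Brilliant 10”, received the IJCAI Computers and Thought Award for his contributions to AI, and is a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE).

This article is based on a Q&A with GraphLab CEO Carlos Guestrin for the AWS Startup Spotlight.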

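Q: What is machine learning, and how has it developed over the past 10 years?

Carlos Guestrin: Machine learning is a science built on the idea that computers can learn from patterns in data and use what they learn as a basis for making accurate predictions and decisions automatically. Over the past 10 years, machine learning has been put to use in self-driving cars, product recommendations in online stores, targeted marketing, and credit card fraud detection. Because it promises to turn “big data” into insights that improve life and business, the variety and volume of data now available have made machine learning a popular area of investment.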

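Q: Can you share the story behind GraphLab? How did it come about?

Carlos Guestrin: GraphLab began in 2008 at Carnegie Mellon University, under my guidance and together with my doctoral and postdoctoral students, who became cofounders. The team had been applying machine learning to advanced graph analysis and needed more scalable tools for that work. Once built, the tools attracted so much attention that a modest workshop drew more than 300 participants, 10 times the number expected. The turnout pointed to both an unmet need and a well-designed platform. The team had used Amazon EC2 together with significant advances in graph representation, asynchronous communication, and scheduling to achieve orders-of-magnitude performance gains over comparable graph analysis systems.

In 2012, my wife, also a computer science professor, and I were considering new jobs. Jeff Bezos, Amazon’s founder and chief executive, stepped in to help the University of Washington recruit us: he met with us both and then established two endowed professorships in machine learning. We then moved to the PNW, and I brought with me some talented students and the ambition to make a real impact in the emerging field of big data analytics. A year later, with funding from Madrona Ventures and NEA, GraphLab the company was founded, and in March 2014 it released its first commercial offering, GraphLab Create™, in beta.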

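Q: Can you tell us about GraphLab Create and how it simplifies big data analysis?

Carlos Guestrin: Today, turning raw data into insights and building a predictive application is still laborious and complex. It takes data scientists or equally knowledgeable software engineers, along with a large set of complex tools to gather, clean, model, and analyze the data and present the results to a store or application. In many cases the process becomes longer and more expensive because the prototype has to be reimplemented as code that can run in production. As a result, many data scientists without programming experience are held back, and organizations struggle to extract value from their data.

GraphLab provides a platform that lets data scientists turn an idea into a production-ready product quickly. GraphLab users report that GraphLab Create makes them far more productive and lets them deliver value quickly with less programming expertise and fewer people.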

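Q: Does it support predictive applications deployed on AWS?

Carlos Guestrin: The path from raw data to business-transforming predictive analytics usually starts with a data scientist, a laptop, and a prototype, and it always requires a critical proof-of-concept stage that tests the model at scale. For many, the journey stops there: scaling out on AWS is easy enough, but for many data scientists reimplementing the prototype as production code remains very hard.

That is where GraphLab comes in. GraphLab Create can run entirely on AWS. By changing a single line of code, data scientists can move a GraphLab-built prototype from their laptop to AWS. Data sets and models of any size can be loaded and accessed from Amazon S3, and GraphLab also provides tools for deploying, monitoring, and optimizing data pipelines and predictive services across AWS clusters.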

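Q: Can you share some common use cases? How are people using GraphLab Create?

Carlos Guestrin: Common uses of GraphLab Create span a variety of fields:

Retail: recommender systems and price prediction (for example, airfare)

Financial services: fraud detection through behavior and transaction analysis

Biomedicine: disease prediction through medical records analysis, personalized drug design

Telecommunications: prediction of customer churn

Social network analysis: identification of key network and community influencers

Marketing and media: sentiment analysis, targeting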

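Q: Are predictive applications used only by enterprises?

Carlos Guestrin: Not at all. State and local governments use GraphLab to analyze citizen sentiment and determine which areas of local infrastructure need the most immediate attention. Biomedical research teams use GraphLab to analyze clinical notes and predict a patient’s propensity toward a particular disease. Sensor networks of all kinds provide valuable data whose analysis can help make air and rail travel safer. In general, governments, research institutions, and health and service providers all want to put their data to work to improve how effectively they operate.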

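Q: So should early-stage companies take data science seriously? At what stage should a startup start thinking about big data?

Carlos Guestrin: Data science and data-driven decision making matter for companies of every size. Large companies are updating their dated customer recommender systems with more advanced predictive techniques that use real-time inputs, not just purchase histories. Text and sentiment analysis of surveys and comment fields helps improve customer satisfaction and reduce churn. Likewise, early-stage companies with business models based on data analytics are likely to start out with data scientists and machine learning at the core of their prototypes, especially in sales, marketing, media, and advertising. There are also newer startups with data science at their core that build highly specialized predictive services for a specific vertical or application, such as predicting financial waste in health care, supply chain optimization, or insurance claim fraud detection.

What all these companies, large and small, have in common is a lot of data but few data science resources and limited compute. This is where GraphLab combined with AWS shows its strength: with the scaling barrier removed, big data moves past the hype and becomes a real source of products.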

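Q: Ten years from now, how do you expect machine learning to drive big data insights?

Carlos Guestrin: In ten years, machine learning will be in the hands of many more people than the data scientists and experienced engineers who are most productive with it today. Business analysts and line-of-business owners, for example, will rely on predictive services for near real-time insight into conditions that affect profit, and service providers in government, health care, and the private sector will be able to tailor products to individual needs. Nontechnical people, too, will come to recognize the value that machine learning and data-driven decision making add as a differentiator of products and services.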

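Q: What is next for GraphLab?

Carlos Guestrin: GraphLab is working to democratize machine learning and aims to help realize the ten-year vision described above. In the near term, we are preparing version 1.0 of our flagship product, GraphLab Create, which will be generally available on October 15. With this first commercial release, we will put the power of data science in the hands of every organization, and we expect to see many more big data ambitions reach production.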
