10 Common Hadoop-able Problems Webinar: Ten Scenarios Where Hadoop Fits

10 Common Hadoop-able Problems Webinar - Presentation Transcript

  1. 10 Common Hadoop-able Problems August 5, 2010
  2. Topics • Introduction • 10 Common Hadoop-able Problems • Summary • Questions Copyright 2010 Cloudera Inc. All rights reserved
  3. Today’s speaker - Jeff Hammerbacher • hammer@cloudera.com • Studied Mathematics at Harvard • Worked as a Quant on Wall Street • Conceived, built, and led Data team at Facebook • Nearly 30 amazing engineers and data scientists • Several open source projects and research papers • Founder of Cloudera • Chief Scientist • Also, check out the book “Beautiful Data”
  4. What is Hadoop? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license) • Scalable data processing engine • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage • MapReduce: fault-tolerant distributed processing • Key value • Flexible -> store data without a schema and add it later as needed • Affordable -> cost / TB at a fraction of traditional options • Broadly adopted -> a large and active ecosystem • Proven at scale -> dozens of petabyte-plus implementations in production today
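As a rough illustration of the MapReduce model named on this slide, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases applied to word counting. It is not Hadoop's API and does nothing distributed or fault-tolerant; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input; a real job would read blocks of files from HDFS.
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster the map calls would run in parallel on the nodes holding the input blocks, and the framework would handle the grouping between phases; the sketch only shows the shape of the computation.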
  5. Cloudera’s Distribution for Hadoop, version 3: The industry’s leading Hadoop distribution • Components: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, ZooKeeper • Open source – 100% Apache licensed • Simplified – Component versions & dependencies managed for you • Integrated – All components & functions interoperate through standard APIs • Reliable – Patched with fixes from future releases to improve stability • Supported – Employs project founders and committers for >70% of components
  6. How does Cloudera know which problems are Hadoop-able? • Talking to 1000s of users • Supporting 100s of implementations • Experience putting Hadoop into production with customers across a range of industries
  7. Summary – 10 Common Hadoop-able Problems 1. Modeling true risk 2. Customer churn analysis 3. Recommendation engine 4. Ad targeting 5. PoS transaction analysis 6. Analyzing network data to predict failure 7. Threat analysis 8. Trade surveillance 9. Search quality 10. Data “sandbox”
  8. What is common across Hadoop-able problems? Nature of the data • Complex data • Multiple data sources • Lots of it Nature of the analysis • Batch processing • Parallel execution • Spread data over a cluster of servers and take the computation to the data
  9. What Analysis is Possible With Hadoop? • Text mining • Index building • Graph creation and analysis • Pattern recognition • Collaborative filtering • Prediction models • Sentiment analysis • Risk assessment
  10. Benefits of Analyzing With Hadoop • Previously impossible/impractical to do this analysis • Analysis conducted at lower cost • Analysis conducted in less time • Greater flexibility
  11. Topics • Introduction • 10 Common Hadoop-able Problems • Summary • Questions
  12. 1. Modeling True Risk
  13. 1. Modeling True Risk Solution with Hadoop • Source, parse and aggregate disparate data sources to build comprehensive data picture • e.g. credit card records, call recordings, chat sessions, emails, banking activity • Structure and analyze • Sentiment analysis, graph creation, pattern recognition Typical Industry • Financial Services (Banks, Insurance)
  14. 2. Customer Churn Analysis
  15. 2. Customer Churn Analysis Solution with Hadoop • Rapidly test and build behavioral model of customer from disparate data sources • Structure and analyze with Hadoop • Traversing • Graph creation • Pattern recognition Typical Industry • Telecommunications, Financial Services
  16. 3. Recommendation Engine
  17. 3. Recommendation Engine Solution with Hadoop • Batch processing framework • Allow execution in parallel over large datasets • Collaborative filtering • Collecting ‘taste’ information from many users • Utilizing information to predict what similar users like Typical Industry • Ecommerce, Manufacturing, Retail
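The collaborative filtering idea on this slide (collect taste information from many users, then predict what similar users like) can be sketched with simple item co-occurrence counts. The slide does not say which algorithm is used, so this is only one possible approach; the names `cooccurrence` and `recommend` and the sample purchase history are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Set


def cooccurrence(history: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """Count, for each item, how often every other item shares a user with it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user: str, history: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Rank items the user has not seen by co-occurrence with items they have."""
    counts = cooccurrence(history)
    seen = history[user]
    scores: Counter = Counter()
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical purchase history keyed by user.
    history = {
        "ann": {"book", "lamp"},
        "bob": {"book", "lamp", "desk"},
        "cat": {"book", "desk"},
        "dan": {"lamp", "pen"},
    }
    print(recommend("ann", history))
```

The pair counting is the part that parallelizes naturally: each user's history can be processed independently and the partial counts summed afterwards, which is the kind of batch work the slide assigns to Hadoop.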
  18. 4. Ad Targeting
  19. 4. Ad Targeting Solution with Hadoop • Data analysis can be conducted in parallel, reducing processing times from days to hours • With Hadoop, as data volumes grow the only expansion cost is hardware • Add more nodes without a degradation in performance Typical Industry • Advertising
  20. 5. Point of Sale Transaction Analysis
  21. 5. Point of Sale Transaction Analysis Solution with Hadoop • Batch processing framework • Allow execution in parallel over large datasets • Pattern recognition • Optimizing over multiple data sources • Utilizing information to predict demand Typical Industry • Retail
  22. 6. Analyzing Network Data to Predict Failure
  23. 6. Analyzing Network Data to Predict Failure Solution with Hadoop • Take the computation to the data • Expand the range of indexing techniques from simple scans to more complex data mining • Better understand how the network reacts to fluctuations • See how anomalies previously thought to be discrete may, in fact, be interconnected • Identify leading indicators of component failure Typical Industry • Utilities, Telecommunications, Data Centers
  24. 7. Threat Analysis
  25. 7. Threat Analysis Solution with Hadoop • Parallel processing over huge datasets • Pattern recognition to identify anomalies (i.e., threats) Typical Industry • Security • Financial Services • General: spam fighting, click fraud
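To make "pattern recognition to identify anomalies" concrete, here is a minimal Python sketch that counts events per source and flags sources whose count sits far above the average. The slide names no specific technique, so the z-score rule, the `flag_anomalies` name, the default threshold of 1.5, and the sample events are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Iterable, List


def flag_anomalies(sources: Iterable[str], threshold: float = 1.5) -> List[str]:
    """Return sources whose event count is more than `threshold` population
    standard deviations above the mean count across all sources."""
    counts = Counter(sources)
    values = list(counts.values())
    if len(values) < 2:
        return []  # nothing to compare against
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every source behaves identically
    return sorted(s for s, n in counts.items() if (n - mu) / sigma > threshold)


if __name__ == "__main__":
    # Hypothetical event log: one entry per request, labeled by source.
    events = ["a"] * 10 + ["b"] * 12 + ["c"] * 11 + ["d"] * 9 + ["e"] * 100
    print(flag_anomalies(events))
```

A z-score is sensitive to the outliers it is trying to find, so a production system would likely use something more robust; the per-source counting step, however, is the same group-and-aggregate pattern that runs in parallel over large logs.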
  26. 8. Trade Surveillance
  27. 8. Trade Surveillance Solution with Hadoop • Batch processing framework • Allow execution in parallel over large datasets • Pattern recognition • Detect trading anomalies and harmful behavior Typical Industry • Financial services • Regulatory bodies
  28. 9. Search Quality
  29. 9. Search Quality Solution with Hadoop • Analyzing search attempts in conjunction with structured data • Pattern recognition • Browsing pattern of users performing searches in different categories Typical Industry • Web • Ecommerce
  30. 10. Data “Sandbox”
  31. 10. Data “Sandbox” Solution with Hadoop • With Hadoop an organization can “dump” all this data into an HDFS cluster • Then use Hadoop to start trying out different analyses on the data • See patterns or relationships that allow the organization to derive additional value from data Typical Industry • Common across all industries
  32. Topics • Introduction • 10 Common Hadoop-able Problems • Summary • Questions
  33. Summary – 10 Common Hadoop-able Problems 1. Modeling true risk 2. Customer churn analysis 3. Recommendation engine 4. Ad targeting 5. PoS transaction analysis 6. Analyzing network data to predict failure 7. Threat analysis 8. Trade surveillance 9. Search quality 10. Data “sandbox”
  34. Who is Cloudera? • Enterprise software & services company providing the industry’s leading Hadoop-based data management platform • Founding team came from large Web companies • Products: Cloudera Enterprise & Cloudera’s Distribution for Hadoop • All necessary packages, matched, tested and supported • Tools to support production use of Hadoop • The leading distribution for the enterprise • Contributors and committers • Fixing, patching and adding features
  35. Hear More Examples @ Hadoop World 2010 http://www.cloudera.com/company/press-center/hadoop-world-nyc/ • 2nd annual event focused on practical applications of Hadoop • Date: October 12, 2010 • Location: Hilton New York • Confirmed speakers include a keynote from Tim O’Reilly, founder of O’Reilly Media • Pre- and post-conference training available for Hadoop and related projects • 36 business- and technical-focused sessions
  36. Questions?
