Senior Machine Learning Scientist at Robert Half

It's a great question and merits some elaboration. The short answer is that Hadoop and Spark are not even an apples-to-apples comparison. Let me illustrate through my own personal experience.


1. Say a company wants to get into the big data ecosystem. A typical entry point that enables success is to convert its existing aggregations and reporting to run on NoSQL.

2. Also, let's assume two user types in this company:
[a] Information/Data Analyst (can write relational-database SQL pretty well)
[b] Developer who can write Java/Python and SQL well

3. The starting point is data. So the first step is to, say, replicate the entire DB into NoSQL on a daily basis. What this means in practice is that, every day, the tables are extracted from the RDB and landed on HDFS as flat files; every developer, including me, has their own personal flavor of how to do the extract and transport (Sqoop, Pentaho, or plain data extracts plus bash and SCP, whatever works). A sketch of one such flavor follows.
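For concreteness, here is a minimal sketch of one such daily extract, written with Spark's JDBC reader rather than Sqoop or Pentaho (any of them gets the same job done). The JDBC URL, credentials, table name (orders), and HDFS path are all hypothetical placeholders.

```scala
// Minimal sketch of a daily RDB -> HDFS extract using Spark's JDBC reader.
// The JDBC URL, credentials, table name and HDFS path are placeholders.
import org.apache.spark.sql.SparkSession

object DailyExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-rdb-extract")
      .getOrCreate()

    // Pull one RDB table over JDBC...
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://rdb-host:5432/enterprise")
      .option("dbtable", "orders")
      .option("user", "etl_user")
      .option("password", sys.env("RDB_PASSWORD"))
      .load()

    // ...and land it on HDFS as flat files, overwriting yesterday's copy.
    orders.write
      .mode("overwrite")
      .csv("hdfs:///data/raw/orders")

    spark.stop()
  }
}
```

In practice you would loop this over every table you care about (or drive the list from the RDB's information schema) and kick the job off from your daily scheduler.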

4. But to put these flat-file extracts from the DB tables somewhere, you need a Hadoop ecosystem. You have several choices. I have delivered multiple Hadoop/Hive projects using the Cloudera CDH distribution, and I have confidence in Cloudera. So you get a few EC2 instances, use Cloudera Manager to set up the Hadoop ecosystem, and it comes with Hive and Impala as well.

5. Next you define Hive table DDLs and point the LOCATION of each table to the directory on HDFS where its flat files are (a sketch of such a DDL follows below). So, in theory, if you have X tables in your RDB, I would start with X tables in Hive.
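To make that concrete, here is a sketch of the kind of external-table DDL I mean, with hypothetical column names; it simply describes the flat-file layout and points LOCATION at the HDFS directory from step 3. You would normally run the same statement in the Hive shell, beeline, or Hue; it is issued here through a Hive-enabled SparkSession only to keep these sketches in one language.

```scala
// Sketch of an external Hive table over the flat files landed in step 3.
// Column names and the HDFS path are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-ddl")
  .enableHiveSupport() // talk to the Hive metastore
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    STRING,
    amount      DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs:///data/raw/orders'
""")
```

Because the table is EXTERNAL, dropping it never deletes the underlying extracts, which is exactly what you want for a replicated layer.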

6. At this stage of your journey into the big data ecosystem, you can open the gates of your big data system to the [a] Information/Data Analysts, and they can run ad-hoc queries using Impala (in-memory and blazing fast) or Hive (more powerful functions available) and schedule daily reports, etc. (Oozie is my choice for DAG workflows and scheduling; a sketch of the kind of report query they would schedule follows below). You may want to configure security on the Hive tables using Sentry. I use Presto as well; it sits very nicely on HDFS and uses the Hive metastore (like Impala).
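As an illustration, this is the sort of daily-report query an analyst might schedule against the replicated tables; the table and column names are the hypothetical ones from the DDL sketch above. In practice the analyst would type the SQL into the Impala shell or Hue and schedule it with Oozie; it is wrapped in Spark here only to keep the sketches in a single language.

```scala
// Sketch of a daily revenue report over the replicated `orders` table.
// Table and column names are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("daily-report")
  .enableHiveSupport()
  .getOrCreate()

val dailyRevenue = spark.sql("""
  SELECT to_date(order_ts) AS order_day,
         COUNT(*)          AS orders_placed,
         SUM(amount)       AS revenue
  FROM   orders
  GROUP  BY to_date(order_ts)
  ORDER  BY order_day
""")

dailyRevenue.show(30)
```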

7. With this ecosystem in place, the developers can start working on one or more of the following:
- Write Spark Streaming code (I prefer Scala) or Hadoop MapReduce code in Java to process logs and other data into, say, TSV or CSV that can be loaded into newly created Hive tables. These tables built from log files can then be used for any correlations with the Hive tables replicated from the enterprise DB (a sketch of this flow follows this list).
- Build datasets from the above logs and tables to feed machine learning software such as Spark MLlib or H2O (by 0xdata).
- Spark SQL can talk nicely with the Hive metastore, so if you already have Hive tables, you can write SQL-ish code using Spark and get your answers.
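Here is a batch sketch of the log-to-Hive flow from the first and third bullets (a streaming version would apply the same logic with Spark Streaming). The log path, the tab-separated layout (timestamp, customer_id, page), and the table names are hypothetical; the join target is the `orders` table replicated from the enterprise DB in the earlier steps, and the result is the sort of dataset you could hand to Spark MLlib or H2O.

```scala
// Sketch: parse raw web logs, persist them as a Hive table, and correlate
// them with a table replicated from the enterprise DB. Paths, the log layout
// and table names are hypothetical.
import org.apache.spark.sql.SparkSession

object LogsToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("logs-to-hive")
      .enableHiveSupport() // lets Spark SQL see the Hive metastore tables
      .getOrCreate()
    import spark.implicits._

    // Parse tab-separated web logs into a typed DataFrame.
    val logs = spark.read
      .option("sep", "\t")
      .csv("hdfs:///data/raw/weblogs")
      .toDF("event_ts", "customer_id", "page")
      .withColumn("customer_id", $"customer_id".cast("bigint"))

    // Persist the parsed logs as a new Hive table.
    logs.write.mode("overwrite").saveAsTable("web_events")

    // Correlate the log-derived table with a table replicated from the RDB;
    // the output is a per-customer feature set you could feed to MLlib or H2O.
    val activityVsOrders = spark.sql("""
      SELECT w.customer_id,
             COUNT(DISTINCT w.page)     AS pages_visited,
             COUNT(DISTINCT o.order_id) AS orders_placed
      FROM   web_events w
      LEFT JOIN orders o ON o.customer_id = w.customer_id
      GROUP  BY w.customer_id
    """)

    activityVsOrders.show(20)
    spark.stop()
  }
}
```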

Sorry for the long answer to a short question! Hope it helps.
