Kaggle now has 100K data scientists, but what's a data scientist?

zxjor91

Reposted from:
https://gigaom.com/2013/07/11/kaggle-now-has-100k-data-scientists-but-whats-a-data-scientist/

Data science competition platform Kaggle has reached the 100,000-member milestone just over three years after launching, the company announced on its blog Thursday morning. That could mean either a lot of people picked up the skills for the decade’s hottest job in a hurry, or a lot of people realized they might already have them. I’d say it’s a little of both.

You have to give credit where credit is due, though: Given the field we’re talking about, that’s a lot of people any way you slice it. Assuming there are a few million people across the world who can even loosely call themselves data scientists, getting 100,000 of them to congregate on a single platform is pretty impressive. Part of this has to do with Kaggle making it easier for people interested in data science to actually test their skills on real-world data.

You can see the effects of Kaggle’s popularity in the evolution of the company. It’s still best known for hosting public data mining and predictive analytics competitions on behalf of companies and other institutions, but the company has quietly grown into a full-fledged business. Top competitors are competing in invite-only private competitions that mean bigger prizes for everyone involved. I was surprised to bring up the Kaggle homepage and see its Connect service (an evolution of an earlier service called Prospect), for putting customers directly in touch with those top competitors, front and center with the competitions playing second fiddle.

Define “data scientist”

Of course, “data scientist” is a term that emerged from the morass of terminological confusion that is “big data,” something that becomes pretty clear when you ask someone what a data scientist is or does. I’ve heard the core data science competencies explained as SQL, statistics, predictive modeling and programming, probably in Python. Those sound reasonable, but many would be quick to add to the list things like Hadoop/MapReduce, machine learning, visualization and perhaps a good, old-fashioned Ph.D. in mathematics, physics, computer science or something equally quantitative.

IBMer Swami Chandrasekaran built a great subway-style map of the optimal data scientist skillset, which you can see on his blog.

And those are just the technological prerequisites. A lot of people trumpet the importance of domain expertise, business acumen, creativity and storytelling, too. A data scientist can’t just be good with numbers (those people are called statisticians or analysts) but also needs to understand the business; know why certain data and results are or aren’t important to it; be able to find new datasets and build new products around them; and then be able to explain all this to the C-suite in plain English.

That’s a tall order. I’m pretty sure there are a handful of these people in the world, and I’ve met most of them.

Eric Huls of Allstate Insurance, Jeremy Howard of Kaggle, and Ryan Kim of GigaOM at Structure:Data 2012
Kaggle’s Jeremy Howard (left) at Structure: Data 2012 (c) 2012 Pinar Ozger. pinar@pinarozger.com
The good news is that all those characteristics are probably overkill, necessary for the cream of the crop trying to do really advanced things, but not required to effect positive change in more pedestrian environments. They’re certainly not all necessary for success on a platform like Kaggle, where the business problem is often well understood and competitors are just trying to optimize it through predictive models. Many top competitors no doubt have some serious statistical and data analysis backgrounds — even if they’re not up on the latest techniques — but some competition winners have been college kids with a little coding experience and Coursera’s introductory Machine Learning course.

This isn’t because data science or predictive competitions are easy, but rather because we’re at the precipice of a big change. It’s easier than ever to learn what you need thanks to online courses and coding programs; easy enough to learn and access tools such as R and Hadoop — especially with the advent of cloud computing; and easy enough to hone your skills on platforms like Kaggle or Topcoder. This pedigree might not land anyone an engineering job at Google, but it’s probably enough to be dangerous (in a good way) in a lot of other places.

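[IT168 feature] The term “data scientist” was first proposed in 2009 by Nathan Yau, who described it as an engineer who applies scientific methods and data mining tools to find new insights in data. A data scientist combines the roles of technologist and quantitative analyst. Compared with the traditional quantitative analyst, the difference is this: the analyst usually works with a company’s internal data to support management decisions, while the data scientist focuses more on customer-facing data to create products and processes with distinctive features and to deliver meaningful value-added services to customers.

This customer-facing nature means that most data scientists work in a company’s product development or marketing departments, or report to the chief technology officer. So which core competencies does a data scientist need? Technology reporter Derrick Harris describes some of the skills a data scientist should have in his article.

He notes that when you ask someone what a data scientist is or does, it quickly becomes clear that “data scientist” is a term that emerged from the terminological confusion surrounding “big data.” The core data science competencies have been described as SQL, statistics, predictive modeling and programming, probably in Python, which all sounds reasonable. But more items are quickly added to the list: Hadoop/MapReduce, machine learning, visualization, and even a traditional background in mathematics, physics, computer science or a similar field.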

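The items listed above are only the technical requirements. Many people argue that domain expertise, business acumen, creativity and communication skills are just as important. A data scientist cannot merely be good with numbers (those people are called statisticians or analysts) but must also understand the business: which data and results are actually meaningful to it; how to find new datasets and build new products around them; and how to make the C-level executives understand all of it.

That is a tall order, and very few such people exist in the world. Kaggle recently announced on its blog that, three years after its founding, the data science competition platform has reached 100,000 members. That could mean that a lot of people easily picked up the hottest skill of the decade, or that they realized they already had it.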

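Even if a few million people worldwide can loosely call themselves data scientists, getting 100,000 of them to gather on a single platform is quite impressive; these people, all interested in data science, use Kaggle to test their skills. Kaggle’s competition business is now quite mature: it hosts the best-known public data mining and predictive analytics competitions on behalf of companies and other institutions. As more companies bring their needs to Kaggle, every participant has a better chance of winning a prize. Kaggle’s homepage now features its Connect service (which evolved from an earlier service called Prospect), which uses real-time rankings to put customers directly in touch with the top competitors.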

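Top data scientists are not required to bring about positive change in their environment, but they do need to attempt truly advanced things. Nor do they have to gather on a platform like Kaggle in order to succeed; such a platform simply helps people solve business problems better, or lets competitors try to optimize a business through predictive models. Many top competitors no doubt have backgrounds in statistics and data analysis, even if they are not using the latest techniques. But some competition winners have been college students with very little coding experience who took Coursera’s introductory machine learning course.

This is not because data science or predictive competitions are easy, but because the competitions have changed a great deal, and Kaggle has helped lower the barrier to entry to data science. It is easier than ever to find useful online courses and coding programs; easier to learn and access tools such as R and Hadoop, made simpler still by the arrival of cloud computing; and there are more opportunities to hone your skills on platforms like Kaggle or Topcoder. This may not make you a Google engineer, but it is enough to help you land a well-paid job elsewhere.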

Reposted from: http://www.moozhi.com/topic/show/543a1527775c309e74b70d8f

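Starting from the map

I believe many people have seen this map; its author is Swami Chandrasekaran.

What I want to cover here is how to complete as much of this map as possible through MOOCs. In other words, which MOOCs relate to the knowledge on it, so that by taking courses you can gradually work toward becoming a data scientist.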

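Fundamentals

The fundamentals are mostly mathematical foundations.

For matrices and linear algebra, take Coding the Matrix (Brown University).
For hashing, binary trees and big-O notation, take Data Structures (Tsinghua University).
For relational algebra, JSON and XML, take Introduction to Databases (Stanford University).
For setting up a basic data science environment, take The Data Scientist’s Toolbox (Johns Hopkins University).

These four courses cover most of the fundamentals.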

Matrices & Linear Algebra Fundamentals
Hash Functions, Binary Tree, O(n)
Relational Algebra, DB Basics
Inner, Outer, Cross, Theta Join
CAP Theorem
Tabular Data
Data Frames & Series
Sharding
OLAP
Multidimensional Data Model
ETL
Reporting Vs BI Vs Analytics
JSON & XML
NoSQL
Regex
Vendor Landscape
Env Setup
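
The join types named in the list above (inner, outer, cross) can be tried out without installing a database server. The following is a minimal sketch using Python's built-in sqlite3 module; the users and entries tables, their columns and their rows are invented purely for illustration and do not come from any of the courses.

```python
import sqlite3

# Hypothetical toy tables, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entries (user_id INTEGER, contest TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO entries VALUES (1, 'titanic'), (1, 'digits'), (2, 'titanic');
""")


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Inner join: only users that have at least one matching entry.
inner = run("""
    SELECT u.name, e.contest FROM users u
    JOIN entries e ON e.user_id = u.id
    ORDER BY u.name, e.contest
""")

# Left outer join: every user, with NULL (None) where no entry matches.
outer = run("""
    SELECT u.name, e.contest FROM users u
    LEFT JOIN entries e ON e.user_id = u.id
    ORDER BY u.name, e.contest
""")

# Cross join: every pairing of a user with an entry (3 x 3 rows).
cross = run("SELECT u.name, e.contest FROM users u CROSS JOIN entries e")

print(inner)
print(outer)
print(len(cross))
```

The user with no entries disappears from the inner join but survives the left outer join with a None in the contest column, which is usually the quickest way to remember the difference.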

Statistics

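There are a great many statistics courses, but they boil down to three areas: the basics of probability, statistical theory, and applied statistics.

For probability, you can take National Taiwan University's course Probability (National Taiwan University).
You can also take MIT's probability course, though it is harder: Intro to Probability (MIT).

Some statistics courses also include a little basic probability, since the two go hand in hand. For statistics, see:
Intro to Statistics (UC Berkeley)
Data Analysis and Statistical Inference (Duke University)
Mathematical Biostatistics Boot Camp 1 (Johns Hopkins University)

These courses cover the vast majority of the statistics you need.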

Pick a Dataset(UCI Repo)
Descriptive Statistics(mean, median, range, SD, Var)
Exploratory Data Analysis
Histograms
Percentiles & Outliers
Probability Theory
Bayes Theorem
Random Variables
Cumul Dist Fn(CDF)
Continuous Distributions (Normal, Poisson, Gaussian)
Skewness
ANOVA
Prob Den Fn(PDF)
Central Limit Theorem
Monte Carlo Method
Hypothesis Testing
p-Value
Chi2 Test
Estimation
Confid Int(CI)
MLE
Kernel Density Estimate
Regression
Covariance
Correlation
Pearson Coeff
Causation
Least Squares Fit
Euclidean Distance
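
Several items in the list above (descriptive statistics, the Pearson coefficient, a least-squares fit) are small enough to compute by hand. Here is a rough sketch using only Python's standard library; the hours and score samples are made-up numbers, and the helper names pearson and least_squares are illustrative rather than taken from any course.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up sample: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

summary = {
    "mean": statistics.fmean(score),
    "median": statistics.median(score),
    "range": max(score) - min(score),
    "sd": statistics.stdev(score),      # sample standard deviation
    "var": statistics.variance(score),  # sample variance
}
r = pearson(hours, score)
slope, intercept = least_squares(hours, score)

print(summary)
print(round(r, 4), round(slope, 2), round(intercept, 2))
```

Note that stdev and variance here are the sample versions (dividing by n - 1); the statistics module also offers pstdev and pvariance for the population versions.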

Programming

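Programming here mainly means R and Python, the two languages data scientists use most.

There are many Python courses to choose from.
You could start with Introduction to Computer Science and Programming Using Python (MIT).
You could also consider An Introduction to Interactive Programming in Python (Rice University).
In fact, Coding the Matrix (Brown University), listed under Fundamentals, also includes an introduction to Python.

There are even more R courses:
R Language (Johns Hopkins University)
Intro to Data Science (University of Washington)
Data Analysis and Statistical Inference (Duke University) also works as a good introduction to R.
Getting and Cleaning Data (Johns Hopkins University) covers a lot about using R to obtain and process data.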

Python Basics
Working in Excel
R Setup, R studio
Variables
Vectors
Matrices
Arrays
Factors
Lists
Data Frames
Reading CSV Data
Reading Raw Data
Subsetting Data
Manipulate Data Frames
Functions
Factor Analysis
Install Pkgs

Machine Learning

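For machine learning, the first recommendation is Andrew Ng's Machine Learning (Stanford University).
Next is Professor Lin's Machine Learning Foundations (National Taiwan University),
or Yaser S. Abu-Mostafa's Learning from Data. Professor Abu-Mostafa and Professor Lin have a teacher-student relationship, and the two courses follow essentially the same syllabus, though Abu-Mostafa's covers a bit more.

For comprehensiveness, Udacity's machine learning trilogy is probably the most complete: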
Machine Learning: Supervised Learning (Georgia Tech)
Machine Learning: Unsupervised Learning (Georgia Tech)
Machine Learning: Reinforcement Learning (Georgia Tech)

What is ML?
Numerical Var
Categorical Var
Supervised Learning
Unsupervised Learning
Concepts, Inputs & Attributes
Training & Test Data
Classifier
Prediction
Lift
Overfitting
Bias & Variance
Trees & Classification
Classification Rate
Decision Trees
Boosting
Naive Bayes Classifiers
K-Nearest Neighbour
Logistic Regression
Ranking
Linear Regression
Perceptron
Hierarchical Clustering
K-means Clustering
Neural Networks
Sentiment Analysis
Collaborative Filtering
Tagging
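
To make a few of the listed ideas concrete (training and test data, a classifier, classification rate, K-Nearest Neighbour), here is a minimal pure-Python sketch. The two-feature toy points and their labels are invented for illustration, and the function names are hypothetical; a real project would more likely use an established library.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, point, k=3):
    """Predict a label by majority vote among the k nearest training rows.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda row: euclidean(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classification_rate(train, test, k=3):
    """Fraction of held-out test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Invented toy data: two well-separated clusters labelled "a" and "b".
train = [
    ((1.0, 1.1), "a"), ((1.2, 0.9), "a"), ((0.8, 1.0), "a"),
    ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.9, 3.0), "b"),
]
test = [((1.1, 1.0), "a"), ((3.0, 3.0), "b")]

print(knn_predict(train, (1.1, 1.0)))
print(classification_rate(train, test))
```

Keeping the test rows out of the training list is the point of the split: the classification rate is only meaningful on data the classifier has not seen.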

Text Mining / NLP

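MOOCs on natural language understanding are relatively scarce.
Coursera's Natural Language Processing may not run another session, though one can still hope.
Intro to Data Science (University of Washington) briefly touched on very simple NLP ideas such as bag-of-words.
For the Support Vector Machines item here, see the machine learning courses above; both Ng's and Yaser's courses cover it.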

Corpus
Named Entity Recognition
Text Analysis
UIMA
Term Document Matrix
Term Frequency & Weight
Support Vector Machines
Association Rules
Market Basket Analysis
Feature Extraction
Using Mahout
Using Weka
Using NLTK
Classify Text
Vocabulary Mapping
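
The term-document matrix and term frequency items above can be illustrated with nothing more than a word counter. This is a simplified sketch: the tokenizer just lowercases and splits on whitespace, and the two example documents are invented, so it stands in for what tools such as NLTK or tm do far more carefully.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def term_frequency(doc):
    """Relative frequency of each term within a single document."""
    tokens = tokenize(doc)
    return {term: n / len(tokens) for term, n in Counter(tokens).items()}


# Invented example documents.
docs = ["data science is fun", "big data is big"]
vocab, matrix = term_document_matrix(docs)

for term, row in zip(vocab, matrix):
    print(term, row)
print(term_frequency(docs[1]))
```

Each row is a term and each column a document, which is the bag-of-words representation mentioned earlier: word order is discarded and only counts remain.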

Visualization

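The Johns Hopkins Data Science Specialization intersperses some plotting techniques throughout, concentrated in the course Exploratory Data Analysis (Johns Hopkins University).
Data Analysis and Statistical Inference (Duke University) also covers some plotting; together they handle most situations where you need a chart.
As for the Decision Tree concept, it is covered in Machine Learning: Supervised Learning (Georgia Tech), and also in some statistics-related courses such as Intro to Data Science (University of Washington).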

Data Exploration in R(Hist, Boxplot etc)
Uni, Bi & Multivariate Viz
ggplot2
Histogram & Pie(Uni)
Tree & Tree Map
Scatter Plot (Bi)
Line Charts (Bi)
Spatial Charts
Survey Plot
Timeline
Decision Tree
D3.js
infoVis
IBM ManyEyes
Tableau

Big Data

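Intro to Data Science (University of Washington) covers the basic concepts of MapReduce.
The recently launched Mining Massive Datasets is very highly rated and recommended.
Udacity's Intro to Hadoop and MapReduce is probably more practical, but it is fairly short and not very detailed.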

Map Reduce Fundamentals
Hadoop Components
HDFS
Data Replication Principles
Setup Hadoop (IBM/Cloudera/HortonWorks)
Name & Data Nodes
Job & Task Tracker
M/R Programming
Sqoop: Loading Data in HDFS
Flume, Scribe: For Unstruct Data
SQL with Pig
DWH with Hive
Scribe, Chukwa For Weblog
Using Mahout
Zookeeper Avro
Storm: Hadoop Realtime
RHadoop, RHIPE
rmr
Cassandra
MongoDB, Neo4j
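
The MapReduce fundamentals at the top of this list are easier to grasp with the classic word-count example. The sketch below simulates the map, shuffle and reduce phases in a single Python process; it is only a conceptual model with hypothetical function names, not Hadoop's actual API, and it ignores distribution, fault tolerance and HDFS entirely.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big deal"]))
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network, but the contract between the phases is the same as in this toy version.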

Data Ingestion

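This part is hard to pin down; in my view it depends heavily on experience, and no MOOC covers it directly.
Getting and Cleaning Data (Johns Hopkins University) touches on some of the ways data can be obtained.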

Summary of Data Formats
Data Discovery
Data Sources & Acquisition
Data Integration
Data Fusion
Transformation & Enrichment
Data Survey
Google OpenRefine
How much Data
Using ETL

Data Munging

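Getting and Cleaning Data (Johns Hopkins University) mentions some data processing.
Machine Learning (Stanford University) also covers data normalization and feature extraction.
Most of this material can be picked up by completing the machine learning courses listed above.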

Dimensionality & Numerosity Reduction
Normalization
Data Scrubbing
Handling Missing Values
Unbiased Estimators
Binning Sparse Values
Feature Extraction
Denoising
Sampling
Stratified Sampling
Principal Component Analysis
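
Two of the items above, handling missing values and normalization, are simple enough to sketch directly. The example below fills gaps with the column mean and then rescales to the range 0 to 1; both are just one common choice among several, the ages list is invented, and the function names are illustrative.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Invented column with one missing value.
ages = [20, None, 40, 30]
filled = impute_mean(ages)
scaled = min_max(filled)

print(filled)
print(scaled)
```

Mean imputation is easy but shrinks the variance of the column, so it is worth checking how many values were missing before relying on it.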

Toolbox

The road ahead is long... very, very long...

MS Excel w/ Analysis Toolpak
Java, Python
R, Rstudio, Rattle
Weka, Knime, RapidMiner
Hadoop Dist of Choice
Spark, Storm
Flume, Scribe, Chukwa
Nutch, Talend, Scraperwiki
Webscraper, Flume, Sqoop
tm, RWeka, NLTK
RHIPE
D3.js, ggplot2, Shiny
IBM Languageware
Cassandra, MongoDB
