Before you begin, you need Multivariable Calculus, Linear Algebra, and Python.
If your math background is up to multivariable calculus and linear algebra, you'll have enough background to understand almost all of the probability / statistics / machine learning for the job.
Multivariate Calculus: https://www.quora.com/What-are-the-best-resources-for-mastering-multivariable-calculus
Numerical Linear Algebra / Computational Linear Algebra / Matrix Algebra: Linear Algebra
Multivariate calculus is useful for some parts of machine learning and a lot of probability. Linear / Matrix algebra is absolutely necessary for a lot of concepts in machine learning.
You also need some programming background to begin, preferably in Python. Most other things in this guide can be learned on the job (like random forests, pandas, A/B testing), but you can't get away without knowing how to program!
Python is the most important language for a data scientist to learn. Check out
for some reasoning behind that.
To learn Python, check out How can I learn to program in Python?
Plug Yourself Into the Community
Check out Meetup to find some that interest you! Attend an interesting talk, learn about data science live, and meet data scientists and other aspiring data scientists!
Start reading data science blogs and following influential data scientists!
Set up your tools
- Install Python, IPython, and related libraries (guide)
- Install R and RStudio (I would say that R is the second most important language. It's good to know both Python and R)
- Install Sublime Text
Learn to use your tools
Learn Probability and Statistics
Be sure to go through a course that involves heavy application in R or Python.
Complete Harvard's Data Science Course
This course is developed in part by a fellow Quora user, Professor Joe Blitzstein.
Intro to the class, Lectures and Slides, 2013 Assignments, 2014 Assignments, 2013 Labs
- Intro to Python, Numpy, Matplotlib (Homework 0) (solutions)
- Poll aggregation, web scraping, plotting, model evaluation, and forecasting (Homework
- Data prediction, manipulation, and evaluation (Homework 2)
- Predictive modeling, model calibration, sentiment analysis (Homework 3)
- Recommendation engines, using MapReduce (Homework 4)
- Network visualization and analysis (Homework 5) (solutions)
Do most of Kaggle's Getting Started and Playground Competitions
I would NOT recommend doing any of the prize-money competitions. They usually have datasets that are too large, complicated, or annoying, and are not good for learning.
Start by learning scikit-learn, playing around, reading through tutorials and forums at Data Science London + Scikit-learn for a simple, synthetic, binary classification task.
Next, play around some more and check out the tutorials for Titanic: Machine Learning from Disaster with a slightly more complicated binary classification task (with categorical variables, missing values, etc.)
Afterwards, try some multi-class classification with Forest Cover Type Prediction
Now, try a regression task, Bike Sharing Demand, that involves incorporating timestamps.
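As one illustration of what "incorporating timestamps" can mean, here is a minimal sketch that splits a raw timestamp string into numeric features a regression model could use. The `YYYY-MM-DD HH:MM:SS` format, the function name, and the feature names are all assumptions made for illustration. They are not taken from the competition's documentation, so check the actual data before relying on them.

```python
from datetime import datetime


def timestamp_features(raw, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into simple numeric features.

    The format string and the feature names are illustrative assumptions.
    """
    ts = datetime.strptime(raw, fmt)
    return {
        "hour": ts.hour,                       # 0-23, captures daily cycles
        "weekday": ts.weekday(),               # 0 = Monday ... 6 = Sunday
        "month": ts.month,                     # 1-12, captures seasonality
        "is_weekend": int(ts.weekday() >= 5),  # 1 on Saturday or Sunday
    }


# Hypothetical example row.
features = timestamp_features("2011-01-01 08:00:00")
print(features)
```

Features like these let a model pick up on rush hours or weekend effects that a single raw timestamp value would hide.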
Try out some natural language processing with Sentiment Analysis on Movie Reviews
Finally, try out any of the other knowledge-based competitions that interest you!
A/B testing is just a rebranded version of what pharmaceutical companies have been doing for decades. Learn more about A/B testing here: The Ultimate Guide To A/B Testing - Smashing Magazine
Visualization - I would recommend picking up ggplot2 in R to make simple yet beautiful graphics, checking out The Visual Display of Quantitative Information ($), and just browsing DataIsBeautiful • /r/dataisbeautiful for ideas and inspiration.
User Behavior - This set of blog posts looks useful and interesting - This Explains Everything » User Behavior
Do Side Projects
Code in Public
Create public GitHub repositories, make a blog, and post your work, side projects, Kaggle solutions, insights, and thoughts! This helps you gain visibility, build a portfolio for your resume, and connect with other people working on the same tasks.
Check out more specific versions of this question:
Think like a Data Scientist
In addition to the concrete steps I listed above to develop the skillset of a data scientist, I include seven challenges below so you can learn to think like a data scientist and develop the right attitude to become one.
(1) Satiate your curiosity through data
As a data scientist you write your own questions and answers.
Data scientists are naturally curious about the data that they're looking at, and are creative with ways to approach and solve whatever problem needs to be solved. Much of data science is not the analysis itself, but discovering an interesting question and figuring out how to answer it.
Here are two great examples:
Challenge: Think of a problem or topic you're interested in and answer it with data!
(2) Read news with a skeptical eye
Much of the contribution of a data scientist (and why it's really hard to replace a data scientist with a machine) is that a data scientist will tell you what's important and what's spurious. This persistent skepticism is healthy in all sciences, and is especially necessary in a fast-paced environment where it's too easy to let a spurious result be misinterpreted.
You can adopt this mindset yourself by reading news with a critical eye. Many news articles have inherently flawed main premises.
Try these two articles. Sample answers are available in the comments.
Easier: You Love Your iPhone. Literally.
Harder: Who predicted Russia’s military intervention?
Challenge: Do this every day when you encounter a news article. Comment on the article and point out the flaws.
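To see how easily a spurious result can appear, here is a small simulation sketch. It correlates a random target with many purely random "features" and counts how many clear a conventional significance cutoff by chance alone. The sample size, the number of features, and the cutoff of 0.361 (roughly the two-sided 5% threshold for a correlation with 30 samples) are assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)
n_samples, n_features = 30, 200

# Both the target and every feature are pure noise, so no real relationship exists.
target = [random.gauss(0, 1) for _ in range(n_samples)]
correlations = [
    pearson([random.gauss(0, 1) for _ in range(n_samples)], target)
    for _ in range(n_features)
]

# |r| > 0.361 is roughly the two-sided 5% significance cutoff for n = 30.
threshold = 0.361
spurious = [r for r in correlations if abs(r) > threshold]
print(f"{len(spurious)} of {n_features} pure-noise features look 'significant'")
```

If you test enough hypotheses, some will look convincing by accident, which is one reason a single striking correlation in a news article deserves a second look.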
(3) See data as a tool to improve consumer products
Visit a consumer internet product (probably one that you know doesn't do extensive A/B testing already), and then think about its main funnel. Do they have a checkout funnel? Do they have a signup funnel? Do they have a virality mechanism? Do they have an engagement funnel?
Go through the funnel multiple times and hypothesize about different ways it could do better to increase a core metric (conversion rate, shares, signups, etc.). Design an experiment to verify if your suggested change can actually change the core metric.
Challenge: Share it with the feedback email for the consumer internet site!
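Once such an experiment has run, one common way to check whether the change moved a conversion rate is a two-proportion z-test. The sketch below is a minimal version under stated assumptions: the visitor and conversion counts are made up, the function name is hypothetical, and a real experiment also needs a sample size fixed in advance and a properly randomized split.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 visitors per arm, control (A) vs. variant (B).
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference is unlikely to be noise, but it says nothing about whether the lift is large enough to matter for the business.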
(4) Think like a Bayesian
To think like a Bayesian, avoid the base rate fallacy. This means that to form new beliefs you must incorporate both newly observed information AND prior information formed through intuition and experience.
You check your dashboard and see that user engagement numbers are significantly down today. Which of the following is most likely?
1. Users are suddenly less engaged
2. Feature of site broke
3. Logging feature broke
Even though explanation #1 completely explains the drop, #2 and #3 should be considered more likely because they have a much higher prior probability.
You're in senior management at Tesla, and five of Tesla's Model S's have caught fire in the last five months. Which is more likely?
1. Manufacturing quality has decreased and Teslas should now be deemed unsafe.
2. Safety has not changed and fires in Tesla Model S's are still much rarer than their counterparts in gasoline cars.
While #1 is an easy explanation (and great for media coverage), your prior should be strong on #2 because of your regular quality testing. However, you should still be seeking information that can update your beliefs on #1 versus #2 (and still find ways to improve safety). Question for thought: what information should you seek?
Challenge: Identify the last time you committed the base rate fallacy. Avoid committing the fallacy from now on.
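The dashboard example above can be made concrete with Bayes' rule, where the posterior is proportional to the prior times the likelihood. The numbers in this sketch are invented purely for illustration. They assume the three explanations are the only candidates, that real disengagement would always produce the drop (likelihood 1.0), and that the two breakage explanations start out far more plausible.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: posterior is proportional to prior * likelihood."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Hypothetical prior plausibility of each explanation before seeing today's drop.
priors = {
    "users suddenly less engaged": 0.05,
    "feature of site broke": 0.45,
    "logging feature broke": 0.50,
}
# Hypothetical probability of seeing a drop this large under each explanation.
likelihoods = {
    "users suddenly less engaged": 1.0,
    "feature of site broke": 0.7,
    "logging feature broke": 0.7,
}

beliefs = posterior(priors, likelihoods)
for hypothesis, probability in sorted(beliefs.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.2f}")
```

Even with a perfect likelihood, the first explanation stays the least probable because its prior is so low, which is exactly the point of avoiding the base rate fallacy.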
(5) Know the limitations of your tools
“Knowledge is knowing that a tomato is a fruit, wisdom is not putting it in a fruit salad.” - Miles Kington
Knowledge is knowing how to perform an ordinary linear regression, wisdom is realizing how rarely it applies cleanly in practice.
Knowledge is knowing five different variations of K-means clustering, wisdom is realizing how rarely actual data can be cleanly clustered, and how poorly K-means clustering can work with too many features.
Knowledge is knowing a vast range of sophisticated techniques, but wisdom is being able to choose the one that will provide the most impact for the company in a reasonable amount of time.
You may develop a vast range of tools while you go through your Coursera or EdX courses, but your toolbox is not useful until you know which tools to use.
Challenge: Apply several tools to a real dataset and discover the tradeoffs and limitations of each tool. Which tools worked best, and can you figure out why?
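One commonly cited reason distance-based methods such as K-means struggle with too many features is that distances "concentrate": in high dimensions the nearest and farthest points end up almost equally far away. The sketch below illustrates that effect on uniformly random points. It is not a K-means implementation, and the point counts, dimensions, and function name are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (farthest - nearest) / nearest distance from one random query
    point to n_points uniformly random points in the unit hypercube."""
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)
contrast_by_dims = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for dims, contrast in contrast_by_dims.items():
    print(f"{dims:>5} features: relative contrast = {contrast:.2f}")
```

As the contrast shrinks, "closest cluster center" becomes a less meaningful notion, which is one motivation for reducing or selecting features before clustering.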
(6) Teach a complicated concept
How does Richard Feynman distinguish which concepts he understands and which concepts he doesn't?
Feynman was a truly great teacher. He prided himself on being able to devise ways to explain even the most profound ideas to beginning students. Once, I said to him, "Dick, explain to me, so that I can understand it, why spin one-half particles obey Fermi-Dirac statistics." Sizing up his audience perfectly, Feynman said, "I'll prepare a freshman lecture on it." But he came back a few days later to say, "I couldn't do it. I couldn't reduce it to the freshman level. That means we don't really understand it." - David L. Goodstein, Feynman's Lost Lecture: The Motion of Planets Around the Sun
What distinguished Richard Feynman was his ability to distill complex concepts into comprehensible ideas. Similarly, what distinguishes top data scientists is their ability to cogently share their ideas and explain their analyses.
Check out Edwin Chen's answers to these questions for examples of cogently explained technical concepts:
Challenge: Teach a technical concept to a friend or on a public forum, like Quora or YouTube.
(7) Convince others about what's important
Perhaps even more important than a data scientist's ability to explain their analysis is their ability to communicate the value and potential impact of the actionable insights.
Certain tasks of data science will be commoditized as data science tools become better and better. New tools will make obsolete certain tasks such as writing dashboards, unnecessary data wrangling, and even specific kinds of predictive modeling.
However, the need for a data scientist to extract and communicate what's important will never be made obsolete. With increasing amounts of data and potential insights, companies will always need data scientists (or people in data science-like roles) to triage all that can be done and prioritize tasks based on impact.
The data scientist's role in the company is to serve as the ambassador between the data and the company. The success of a data scientist is measured by how well they can tell a story and make an impact. Every other skill is amplified by this ability.
Challenge: Tell a story with statistics. Communicate the important findings in a dataset. Make a convincing presentation that your audience cares about.
Any feedback on this post is appreciated - in the comments, as a suggested edit, or in a private message.
If you liked this material, please consider following:
1) Me! William Chen
2) My personal blog, Storytelling with Statistics
3) Learn Data Science, where I am curating material on Quora that is relevant for anyone seeking to become a data scientist!