About this course
- This course covers the basic ideas behind machine learning/prediction
- Study design - training vs. test sets
- Conceptual issues - out of sample error, ROC curves
- Practical implementation - the caret package
- What this course depends on
- The Data Scientist's Toolbox
- R Programming
- What would be useful
- Exploratory analysis
- Reporting Data and Reproducible Research
- Regression models
Who predicts?
- Local governments -> pension payments
- Google -> whether you will click on an ad
- Amazon -> what movies you will watch
- Insurance companies -> what your risk of death is
- Johns Hopkins -> who will succeed in their programs
Why predict? Glory!
http://www.zimbio.com/photos/Chris+Volinsky
Why predict? Riches!
http://www.heritagehealthprize.com/c/hhp
Why predict? For sport!
Why predict? To save lives!
http://www.oncotypedx.com/en-US/Home
A useful (if a bit advanced) book
The elements of statistical learning
A useful package
http://caret.r-forge.r-project.org/
Machine learning (more advanced material)
https://www.coursera.org/course/ml
Even more resources
- List of machine learning resources on Quora
- List of machine learning resources from Science
- Advanced notes from MIT open courseware
- Advanced notes from CMU
- Kaggle - machine learning competitions
The central dogma of prediction
What can go wrong
http://www.sciencemag.org/content/343/6176/1203.full.pdf
Components of a predictor
question -> input data -> features -> algorithm -> parameters -> evaluation
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
Start with a general question
Can I automatically detect emails that are SPAM and those that are not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
http://rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
Dear Jeff,
Can you send me your address so I can send you the invitation?
Thanks,
Ben
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
Dear Jeff,
Can you send me your address so I can send you the invitation?
Thanks,
Ben
Frequency of 'you' $= 2/17 \approx 0.118$ (the word "you" appears twice among the 17 words of the email above)
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
make address all num3d our over remove internet order mail receive will people report addresses
1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00
2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14
3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75
4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00
5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00
6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00 0.00
free business email you credit your font num000 money hp hpl george num650 lab labs telnet
1 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00 0.00 0 0 0 0 0 0 0
2 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0 0 0 0 0 0 0
3 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0 0 0 0 0 0 0
4 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0
5 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0
6 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0 0 0 0 0 0 0
num857 data num415 num85 technology num1999 parts pm direct cs meeting original project re edu
1 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
2 0 0 0 0 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00
3 0 0 0 0 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06
4 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
5 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
6 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar
1 0 0 0.00 0.000 0 0.778 0.000
2 0 0 0.00 0.132 0 0.372 0.180
3 0 0 0.01 0.143 0 0.276 0.184
4 0 0 0.00 0.137 0 0.137 0.000
5 0 0 0.00 0.135 0 0.135 0.000
6 0 0 0.00 0.223 0 0.000 0.000
charHash capitalAve capitalLong capitalTotal type
1 0.000 3.756 61 278 spam
2 0.048 5.114 101 1028 spam
3 0.010 9.821 485 2259 spam
4 0.000 3.537 40 191 spam
5 0.000 3.537 40 191 spam
6 0.000 3.000 15 54 spam
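The table above is the standard R print of the first few emails in the spam dataset. A minimal sketch of how to load it (the data ship with the kernlab package):

# Load the spam dataset distributed with the kernlab package
library(kernlab)
data(spam)

head(spam)        # word/character frequencies plus the spam/nonspam label, as printed above
dim(spam)         # 4601 emails, 58 columns
table(spam$type)  # counts of nonspam and spam emails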
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
Our algorithm
- Find a cutoff value $C$.
- If the frequency of 'your' $> C$, predict "spam"; otherwise predict "nonspam".
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
SPAM Example
question -> input data -> features -> algorithm -> parameters -> evaluation
prediction nonspam spam
nonspam 0.4590 0.1017
spam 0.1469 0.2923
Accuracy $\approx 0.459 + 0.292 = 0.751$
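A minimal sketch of how the table and accuracy above can be computed; the cutoff $C = 0.5$ is an assumption chosen for illustration:

# Classify each email by comparing the frequency of 'your' to a cutoff C
library(kernlab)
data(spam)

C <- 0.5                                                # assumed cutoff for illustration
prediction <- ifelse(spam$your > C, "spam", "nonspam")

# Proportion table of predicted vs. true class
table(prediction, spam$type) / length(spam$type)

# Accuracy = proportion on the diagonal
mean(prediction == spam$type)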
Relative order of importance
question > data > features > algorithms
An important point
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
John Tukey
Garbage in = Garbage out
question -> input data -> features -> algorithm -> parameters -> evaluation
- May be easy (movie ratings -> new movie ratings)
- May be harder (gene expression data -> disease)
- Depends on what is a "good prediction".
- Often more data > better models
- The most important step!
Features matter!
question -> input data -> features -> algorithm -> parameters -> evaluation
Properties of good features
- Lead to data compression
- Retain relevant information
- Are created based on expert application knowledge
Common mistakes
- Trying to automate feature selection
- Not paying attention to data-specific quirks
- Throwing away information unnecessarily
May be automated with care
question -> input data -> features -> algorithm -> parameters -> evaluation
http://arxiv.org/pdf/1112.6209v5.pdf
Algorithms matter less than you'd think
question -> input data -> features -> algorithm -> parameters -> evaluation
http://arxiv.org/pdf/math/0606441.pdf
Issues to consider
http://strata.oreilly.com/2013/09/gaining-access-to-the-best-machine-learning-methods.html
Prediction is about accuracy tradeoffs
- Interpretability versus accuracy
- Speed versus accuracy
- Simplicity versus accuracy
- Scalability versus accuracy
Interpretability matters
http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf
Scalability matters
http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
In sample versus out of sample
In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.
Key ideas
- Out of sample error is what you care about
- In sample error $<$ out of sample error
- The reason is overfitting
- Matching your algorithm to the data you have
In sample versus out of sample errors
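The two prediction rules below are built on a small subsample of the spam data, referred to as smallSpam. A minimal sketch of how such a subsample might be drawn (the seed and the size of 10 emails are assumptions consistent with the 10-email tables below):

# Draw a small subsample of the spam data to build (and overfit) a rule on
library(kernlab)
data(spam)

set.seed(333)                                        # assumed seed for reproducibility
smallSpam <- spam[sample(nrow(spam), size = 10), ]   # 10 randomly chosen emails
table(smallSpam$type)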
Prediction rule 1
- capitalAve $>$ 2.7 = "spam"
- capitalAve $<$ 2.40 = "nonspam"
- capitalAve between 2.40 and 2.45 = "spam"
- capitalAve between 2.45 and 2.7 = "nonspam"
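A sketch of this rule as an R function (the exact handling of the interval boundaries is an assumption):

# Rule 1: a convoluted rule tuned to the quirks of smallSpam
rule1 <- function(x) {
  prediction <- rep(NA, length(x))
  prediction[x > 2.7] <- "spam"
  prediction[x < 2.40] <- "nonspam"
  prediction[x >= 2.40 & x <= 2.45] <- "spam"
  prediction[x > 2.45 & x <= 2.7] <- "nonspam"
  prediction
}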
Apply Rule 1 to smallSpam
nonspam spam
nonspam 5 0
spam 0 5
Prediction rule 2
- capitalAve $>$ 2.40 = "spam"
- capitalAve $\leq$ 2.40 = "nonspam"
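And a sketch of the simpler rule:

# Rule 2: a single cutoff on capitalAve
rule2 <- function(x) {
  ifelse(x > 2.40, "spam", "nonspam")
}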
Apply Rule 2 to smallSpam
nonspam spam
nonspam 5 1
spam 0 4
Apply to complete spam data
Rule 1:
        nonspam spam
nonspam    2141  588
spam        647 1225
Rule 2:
        nonspam spam
nonspam    2224  642
spam        564 1171
Look at accuracy
Rule 1: 3366 correct predictions (accuracy $\approx 0.7316$)
Rule 2: 3395 correct predictions (accuracy $\approx 0.7379$)
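A sketch of how these tables and accuracies are computed from the two rule functions defined above; this should reproduce the counts above, up to the exact handling of the rule boundaries:

# Apply both rules to the complete spam data
table(rule1(spam$capitalAve), spam$type)
table(rule2(spam$capitalAve), spam$type)

# Count and rate of correct predictions for each rule
sum(rule1(spam$capitalAve) == spam$type)     # number correct, rule 1
sum(rule2(spam$capitalAve) == spam$type)     # number correct, rule 2
mean(rule1(spam$capitalAve) == spam$type)    # accuracy, rule 1
mean(rule2(spam$capitalAve) == spam$type)    # accuracy, rule 2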
What's going on?
Overfitting
- Data have two parts
- Signal
- Noise
- The goal of a predictor is to find signal
- You can always design a perfect in-sample predictor
- You capture both signal + noise when you do that
- Predictor won't perform as well on new samples
http://en.wikipedia.org/wiki/Overfitting
Prediction study design
- Define your error rate
- Split data into:
- Training, Testing, Validation (optional)
- On the training set pick features
- Use cross-validation
- On the training set pick prediction function
- Use cross-validation
- If no validation
- Apply 1x to test set (i.e., use the test set only once)
- If validation
- Apply to test set and refine
- Apply 1x to validation
Know the benchmarks
http://www.heritagehealthprize.com/c/hhp/leaderboard
A benchmark is a baseline level of performance to compare your predictions against.
Study design
http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf
The "probe" set is the portion of the training data held out for intermediate evaluation in this design.
Used by the professionals
Avoid small sample sizes
- Suppose you are predicting a binary outcome
- Diseased/healthy
- Click on ad/not click on ad
- One classifier is flipping a coin
- Probability of perfect classification is approximately:
- $\left(\frac{1}{2}\right)^{sample \; size}$
- $n = 1$ flipping coin 50% chance of 100% accuracy
- $n = 2$ flipping coin 25% chance of 100% accuracy
- $n = 10$ flipping coin 0.10% chance of 100% accuracy
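A one-line check of these probabilities in R:

# Probability that a coin-flip classifier labels all n binary outcomes correctly
n <- c(1, 2, 10)
round(0.5^n, 4)   # 0.5000 0.2500 0.0010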
Rules of thumb for prediction study design
- If you have a large sample size
- 60% training
- 20% test
- 20% validation
- If you have a medium sample size
- 60% training
- 40% testing
- If you have a small sample size
- Do cross validation
- Report caveat of small sample size
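A minimal sketch of a 60/40 training/testing split with the caret package (the spam data and the seed are assumptions for illustration):

# 60% training / 40% testing split, stratified on the outcome
library(caret)
library(kernlab)
data(spam)

set.seed(32343)                                                  # assumed seed
inTrain  <- createDataPartition(y = spam$type, p = 0.60, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]
dim(training); dim(testing)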
Some principles to remember
- Set the test/validation set aside and don't look at it
- In general randomly sample training and test
- Your data sets must reflect structure of the problem
- If predictions evolve with time split train/test in time chunks (called backtesting in finance)
- All subsets should reflect as much diversity as possible
- Random assignment does this
- You can also try to balance by features - but this is tricky
Basic terms
In general, Positive = identified and negative = rejected. Therefore:
True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected
Medical testing example:
True positive = Sick people correctly diagnosed as sick
False positive = Healthy people incorrectly identified as sick
True negative = Healthy people correctly identified as healthy
False negative = Sick people incorrectly identified as healthy.
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
Key quantities
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
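For reference, the standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):

$$Sensitivity = \frac{TP}{TP + FN} \qquad Specificity = \frac{TN}{TN + FP}$$
$$Positive\ predictive\ value = \frac{TP}{TP + FP} \qquad Negative\ predictive\ value = \frac{TN}{TN + FN}$$
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$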
Key quantities as fractions
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
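The same quantities written as fractions (conditional probabilities), with $+/-$ for the test result and $D/D^c$ for disease status:

$$Sensitivity = P(+ \mid D) \qquad Specificity = P(- \mid D^c)$$
$$PPV = P(D \mid +) \qquad NPV = P(D^c \mid -)$$
$$Accuracy = P(D)\,P(+ \mid D) + P(D^c)\,P(- \mid D^c)$$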
Screening tests
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
General population
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
General population as fractions
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
At risk subpopulation
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
At risk subpopulation as fraction
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
Key public health issue
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
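A worked example of the issue, assuming a test with 99% sensitivity and 99% specificity (the figures are assumptions for illustration). In a general population with 0.1% prevalence,

$$PPV = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.09,$$

so only about 9% of people who test positive actually have the disease. In an at-risk subpopulation with 10% prevalence,

$$PPV = \frac{0.99 \times 0.1}{0.99 \times 0.1 + 0.01 \times 0.9} \approx 0.92.$$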
For continuous data
Mean squared error (MSE):
$$\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2$$
Root mean squared error (RMSE):
$$\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}$$
Common error measures
- Mean squared error (or root mean squared error)
- Continuous data, sensitive to outliers
- Median absolute deviation
- Continuous data, often more robust
- Sensitivity (recall)
- If you want few missed positives
- Specificity
- If you want few negatives called positives
- Accuracy
- Weights false positives/negatives equally
- Concordance
- One example is kappa
- Predictive value of a positive (precision)
- When you are screening and prevalence is low
Why a curve?
- In binary classification you are predicting one of two categories
- Alive/dead
- Click on ad/don't click
- But your predictions are often quantitative
- Probability of being alive
- Prediction on a scale from 1 to 10
- The cutoff you choose gives different results; for example, in the SPAM example a 'your' frequency above 0.5 was called spam, but with a cutoff of 0.3 some emails would be classified differently
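A minimal sketch of how two different cutoffs on the 'your' frequency trade sensitivity against specificity (the cutoffs 0.3 and 0.5 are assumptions for illustration):

# Sensitivity and specificity of the 'your'-frequency rule at two cutoffs
library(kernlab)
data(spam)

for (C in c(0.3, 0.5)) {
  pred <- ifelse(spam$your > C, "spam", "nonspam")
  sens <- mean(pred[spam$type == "spam"] == "spam")         # true positive rate
  spec <- mean(pred[spam$type == "nonspam"] == "nonspam")   # true negative rate
  cat(sprintf("cutoff = %.1f  sensitivity = %.3f  specificity = %.3f\n", C, sens, spec))
}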
ROC curves
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
An example
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Area under the curve
- AUC = 0.5: random guessing
- AUC = 1: perfect classifier
- In general AUC of above 0.8 considered "good"
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
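A sketch of how an ROC curve and its AUC might be computed, assuming the pROC package and using the 'your' frequency as the score:

# ROC curve and AUC for the 'your'-frequency score on the spam data
library(pROC)
library(kernlab)
data(spam)

rocObj <- roc(response = spam$type, predictor = spam$your, levels = c("nonspam", "spam"))
plot(rocObj)     # sensitivity vs. specificity across all cutoffs
auc(rocObj)      # area under the curve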
What is good?
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Study design
http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf
Key idea
- Accuracy on the training set (resubstitution accuracy) is optimistic
- A better estimate comes from an independent set (test set accuracy)
- But we can't use the test set when building the model or it becomes part of the training set
- So we estimate the test set accuracy with the training set.
Cross-validation
Approach:
- Use the training set
- Split it into training/test sets
- Build a model on the training set
- Evaluate on the test set
- Repeat and average the estimated errors
Used for:
- Picking variables to include in a model
- Picking the type of prediction function to use
- Picking the parameters in the prediction function
- Comparing different predictors
Random subsampling
K-fold
Leave one out
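A minimal sketch of how these three resampling schemes can be set up with the caret package (the fold counts, split proportion, and seed are assumptions):

# Construct resampling indices for random subsampling, k-fold CV, and leave-one-out CV
library(caret)
library(kernlab)
data(spam)

set.seed(32323)                                                       # assumed seed
subsamples <- createDataPartition(spam$type, p = 0.75, times = 10)    # 10 random 75% training splits
folds      <- createFolds(spam$type, k = 10)                          # 10 non-overlapping folds
loo        <- createFolds(spam$type, k = nrow(spam))                  # one fold per observation

sapply(folds, length)    # fold sizes are roughly equal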
Considerations
- For time series data, the data must be used in "chunks"
- For k-fold cross validation
- Larger k = less bias, more variance
- Smaller k = more bias, less variance
- Random sampling must be done without replacement
- Random sampling with replacement is the bootstrap
- Underestimates of the error
- Can be corrected, but it is complicated (0.632 Bootstrap)
- If you cross-validate to pick predictors, you must estimate errors on independent data.
A successful predictor
Polling data
Weighting the data
http://www.fivethirtyeight.com/2010/06/pollster-ratings-v40-methodology.html
Key idea
To predict X use data related to X
Key idea
To predict player performance use data about player performance
Key idea
To predict movie preferences use data about movie preferences
Key idea
To predict hospitalizations use data about hospitalizations
Not a hard rule
To predict flu outbreaks use Google searches
http://www.google.org/flutrends/
Looser connection = harder prediction
Data properties matter
Unrelated data is the most common mistake
http://www.nejm.org/doi/full/10.1056/NEJMon1211064