
About this course

  • This course covers the basic ideas behind machine learning/prediction
    • Study design - training vs. test sets
    • Conceptual issues - out of sample error, ROC curves
    • Practical implementation - the caret package
  • What this course depends on
    • The Data Scientist's Toolbox
    • R Programming
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research
    • Regression models

Who predicts?

  • Local governments -> pension payments
  • Google -> whether you will click on an ad
  • Amazon -> what movies you will watch
  • Insurance companies -> what your risk of death is
  • Johns Hopkins -> who will succeed in their programs

Why predict? Glory!

Why predict? Riches!

Why predict? For sport!

Why predict? To save lives!

A useful (if a bit advanced) book

The elements of statistical learning

A useful package

Machine learning (more advanced material)

Even more resources

The central dogma of prediction

What can go wrong

Components of a predictor

question -> input data -> features -> algorithm -> parameters -> evaluation

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Start with a general question

Can I automatically detect which emails are SPAM and which are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Dear Jeff,

Can you send me your address so I can send you the invitation?



Frequency of 'you' $= 2/17 = 0.118$

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
library(kernlab); data(spam); head(spam)
  make address  all num3d  our over remove internet order mail receive will people report addresses
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64   0.00   0.00      0.00
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79   0.65   0.21      0.14
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45   0.12   0.00      1.75
4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00 0.00   0.00   0.00      0.00
  free business email  you credit your font num000 money hp hpl george num650 lab labs telnet
1 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00  0.00  0   0      0      0   0    0      0
2 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43  0.43  0   0      0      0   0    0      0
3 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16  0.06  0   0      0      0   0    0      0
4 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
5 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
6 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00  0.00  0   0      0      0   0    0      0
  num857 data num415 num85 technology num1999 parts pm direct cs meeting original project   re  edu
1      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2      0    0      0     0          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3      0    0      0     0          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar
1     0          0          0.00            0.000                 0           0.778      0.000
2     0          0          0.00            0.132                 0           0.372      0.180
3     0          0          0.01            0.143                 0           0.276      0.184
4     0          0          0.00            0.137                 0           0.137      0.000
5     0          0          0.00            0.135                 0           0.135      0.000
6     0          0          0.00            0.223                 0           0.000      0.000
  charHash capitalAve capitalLong capitalTotal type
1    0.000      3.756          61          278 spam
2    0.048      5.114         101         1028 spam
3    0.010      9.821         485         2259 spam
4    0.000      3.537          40          191 spam
5    0.000      3.537          40          191 spam
6    0.000      3.000          15           54 spam

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
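plot(density(spam$your[spam$type=="nonspam"]),
     col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
[Figure: density of the frequency of 'your' for nonspam (blue) and spam (red) emails]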

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Our algorithm

  • Find a value $C$.
  • If the frequency of 'your' $> C$, predict "spam"; otherwise predict "nonspam"
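One way to "find a value C" is to try a grid of candidate cutoffs and keep the one with the highest accuracy. The sketch below is illustrative only: it uses simulated frequencies rather than the real spam data, and the names `accuracy_at`, `candidates`, and `bestC` are made up for this example.

```r
# Simulated stand-in for the frequency of 'your' (NOT the real spam data):
# spam messages are assumed to use the word more often on average.
set.seed(1)
n <- 500
type <- rep(c("nonspam", "spam"), each = n)
your <- c(rexp(n, rate = 4), rexp(n, rate = 1))

# Accuracy of the rule "predict spam when your > C" for one cutoff C
accuracy_at <- function(C) {
  prediction <- ifelse(your > C, "spam", "nonspam")
  mean(prediction == type)
}

# Try a grid of candidate cutoffs and keep the most accurate one
candidates <- seq(0, 3, by = 0.05)
acc <- sapply(candidates, accuracy_at)
bestC <- candidates[which.max(acc)]
bestC
max(acc)
```

Because C is tuned on the same data it is scored on, the accuracy reported here is an in-sample figure.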

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
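plot(density(spam$your[spam$type=="nonspam"]),
     col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
abline(v=0.5,col="black")
[Figure: the same densities with a vertical line at the cutoff 0.5]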

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
prediction <- ifelse(spam$your > 0.5,"spam","nonspam")
table(prediction,spam$type)/length(spam$type)

prediction nonspam   spam
   nonspam  0.4590 0.1017
   spam     0.1469 0.2923

Accuracy $\approx 0.459 + 0.292 = 0.751$

Relative order of importance

question > data > features > algorithms

An important point

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John Tukey

Garbage in = Garbage out

question -> input data  -> features -> algorithm -> parameters -> evaluation
  1. May be easy (movie ratings -> new movie ratings)
  2. May be harder (gene expression data -> disease)
  3. Depends on what is a "good prediction".
  4. Often more data > better models
  5. The most important step!

Features matter!

question -> input data -> features  -> algorithm -> parameters -> evaluation

Properties of good features

  • Lead to data compression
  • Retain relevant information
  • Are created based on expert application knowledge

Common mistakes

  • Trying to automate feature selection
  • Not paying attention to data-specific quirks
  • Throwing away information unnecessarily

May be automated with care

question -> input data -> features  -> algorithm -> parameters -> evaluation

Algorithms matter less than you'd think

question -> input data -> features -> algorithm -> parameters -> evaluation

Issues to consider

Prediction is about accuracy tradeoffs

  • Interpretability versus accuracy
  • Speed versus accuracy
  • Simplicity versus accuracy
  • Scalability versus accuracy

Interpretability matters

Scalability matters

In sample versus out of sample

In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.

Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.

Key ideas

  1. Out of sample error is what you care about
  2. In sample error $<$ out of sample error
  3. The reason is overfitting
    • Matching your algorithm to the data you have

In sample versus out of sample errors

library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1],size=10),]
spamLabel <- (smallSpam$type=="spam")*1 + 1
plot(smallSpam$capitalAve,col=spamLabel)
[Figure: capitalAve for the 10 sampled emails, colored by type]

Prediction rule 1

  • capitalAve $>$ 2.7 = "spam"
  • capitalAve $<$ 2.40 = "nonspam"
  • capitalAve between 2.40 and 2.45 = "spam"
  • capitalAve between 2.45 and 2.7 = "nonspam"

Apply Rule 1 to smallSpam

rule1 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.7] <- "spam"
  prediction[x < 2.40] <- "nonspam"
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
table(rule1(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    0
  spam          0    5

Prediction rule 2

  • capitalAve $>$ 2.80 = "spam"
  • capitalAve $\leq$ 2.80 = "nonspam"

Apply Rule 2 to smallSpam

rule2 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.8] <- "spam"
  prediction[x <= 2.8] <- "nonspam"
  return(prediction)
}
table(rule2(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    1
  spam          0    4

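Apply to complete spam data

table(rule1(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2141  588
  spam        647 1225

table(rule2(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2224  642
  spam        564 1171

mean(rule1(spam$capitalAve)==spam$type)
[1] 0.7316
mean(rule2(spam$capitalAve)==spam$type)
[1] 0.7379

Look at accuracy

sum(rule1(spam$capitalAve)==spam$type)
[1] 3366
sum(rule2(spam$capitalAve)==spam$type)
[1] 3395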

What's going on?

  • Data have two parts
    • Signal
    • Noise
  • The goal of a predictor is to find signal
  • You can always design a perfect in-sample predictor
  • You capture both signal + noise when you do that
  • Predictor won't perform as well on new samples
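The points above can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are simulated (one noisy feature, not the spam data), and `simulate`, `memorize`, and `simple` are hypothetical helpers. The memorizing predictor copies the label of the closest training point, so it is perfect in sample by construction.

```r
# Simulated data (an assumption, not the spam data): one noisy feature
set.seed(42)
simulate <- function(n) {
  type <- sample(c("nonspam", "spam"), n, replace = TRUE)
  x <- rnorm(n, mean = ifelse(type == "spam", 1, 0))
  data.frame(x = x, type = type, stringsAsFactors = FALSE)
}
train <- simulate(50)
test  <- simulate(1000)

# Memorizing predictor: copy the label of the closest training point.
# On the training set every point is its own closest point.
memorize <- function(x) {
  sapply(x, function(v) train$type[which.min(abs(train$x - v))])
}
# Simple predictor: a single cutoff halfway between the two group means
simple <- function(x) ifelse(x > 0.5, "spam", "nonspam")

inMemorize  <- mean(memorize(train$x) == train$type)
outMemorize <- mean(memorize(test$x)  == test$type)
inSimple    <- mean(simple(train$x) == train$type)
outSimple   <- mean(simple(test$x)  == test$type)
c(inMemorize = inMemorize, outMemorize = outMemorize,
  inSimple = inSimple, outSimple = outSimple)
```

The memorizing rule fits the noise as well as the signal, so its accuracy typically drops on the new sample, while the simple rule tends to perform about the same in and out of sample.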

Prediction study design

  1. Define your error rate
  2. Split data into:
    • Training, Testing, Validation (optional)
  3. On the training set pick features
    • Use cross-validation
  4. On the training set pick prediction function
    • Use cross-validation
  5. If no validation
    • Apply 1x to test set (that is, use it only once)
  6. If validation
    • Apply to test set and refine
    • Apply 1x to validation

Know the benchmarks


Study design


Used by the professionals

Avoid small sample sizes

  • Suppose you are predicting a binary outcome
    • Diseased/healthy
    • Click on ad/not click on ad
  • One classifier is flipping a coin
  • Probability of perfect classification is approximately:
    • $\left(\frac{1}{2}\right)^{sample \; size}$
    • $n = 1$ flipping coin 50% chance of 100% accuracy
    • $n = 2$ flipping coin 25% chance of 100% accuracy
    • $n = 10$ flipping coin 0.10% chance of 100% accuracy
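The percentages above follow directly from the formula. A minimal check, where `pPerfect` is just an illustrative variable name:

```r
# Chance that n independent fair coin flips all match the true labels
n <- c(1, 2, 10)
pPerfect <- (1/2)^n
data.frame(n = n, percent = round(100 * pPerfect, 2))
```

For $n = 10$ the exact value is $1/1024$, which rounds to the 0.10% quoted above.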

Rules of thumb for prediction study design

  • If you have a large sample size
    • 60% training
    • 20% test
    • 20% validation
  • If you have a medium sample size
    • 60% training
    • 40% testing
  • If you have a small sample size
    • Do cross validation
    • Report caveat of small sample size
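As a rough sketch of the large-sample rule of thumb, the rows of a data set can be shuffled once and cut into 60/20/20 pieces. The row count `n` and the index names below are hypothetical; only base R is used.

```r
set.seed(123)
n <- 1000  # hypothetical number of rows in the full data set

# Shuffle the row indices once, then cut them into three disjoint pieces
shuffled <- sample(n)
nTrain <- round(0.6 * n)
nTest  <- round(0.2 * n)
trainIdx <- shuffled[1:nTrain]
testIdx  <- shuffled[(nTrain + 1):(nTrain + nTest)]
validIdx <- shuffled[(nTrain + nTest + 1):n]

c(train = length(trainIdx), test = length(testIdx),
  validation = length(validIdx))
```

Each index vector can then be used to subset a data frame, for example `myData[trainIdx, ]` for a hypothetical `myData`.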

Some principles to remember

  • Set the test/validation set aside and don't look at it
  • In general randomly sample training and test
  • Your data sets must reflect structure of the problem
    • If predictions evolve with time, split train/test in time chunks (called backtesting in finance)
  • All subsets should reflect as much diversity as possible
    • Random assignment does this
    • You can also try to balance by features - but this is tricky
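For data that evolve with time, a split in time chunks might look like the sketch below. The `daily` data frame and its columns are invented for illustration; the point is only that every training row precedes every test row.

```r
# Hypothetical daily series; the column names are illustrative only
set.seed(7)
daily <- data.frame(day = as.Date("2020-01-01") + 0:99,
                    value = cumsum(rnorm(100)))
daily <- daily[order(daily$day), ]

# Train on the earliest 60% of days, test on the later 40%
nTrain <- round(0.6 * nrow(daily))
trainChunk <- daily[1:nTrain, ]
testChunk  <- daily[(nTrain + 1):nrow(daily), ]

# TRUE when no test day comes before a training day
max(trainChunk$day) < min(testChunk$day)
```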

Basic terms

In general, positive = identified and negative = rejected. Therefore:

True positive = correctly identified

False positive = incorrectly identified

True negative = correctly rejected

False negative = incorrectly rejected

Medical testing example:

True positive = Sick people correctly diagnosed as sick

False positive = Healthy people incorrectly identified as sick

True negative = Healthy people correctly identified as healthy

False negative = Sick people incorrectly identified as healthy
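The four counts combine into the usual summary quantities. The sketch below uses the counts from the earlier full-data confusion table (2224, 642, 564, 1171), assuming that rows are predictions, columns are true labels, and "spam" plays the role of the positive class. The formulas are the standard definitions; the variable names are chosen for this example.

```r
# Counts from the earlier full-data table, assuming rows = prediction,
# columns = truth, and "spam" = positive
TP <- 1171  # predicted spam, truly spam
FP <- 564   # predicted spam, truly nonspam
TN <- 2224  # predicted nonspam, truly nonspam
FN <- 642   # predicted nonspam, truly spam

sensitivity <- TP / (TP + FN)   # fraction of true positives that are found
specificity <- TN / (TN + FP)   # fraction of true negatives that are rejected
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value
accuracy    <- (TP + TN) / (TP + FP + TN + FN)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, accuracy = accuracy), 4)
```

The accuracy computed this way agrees with the 0.7379 reported for the same table.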

Key quantities

Key quantities as fractions

Screening tests

General population

General population as fractions

At risk subpopulation

At risk subpopulation as fraction

Key public health issue


For continuous data

Mean squared error (MSE):

$$\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2$$

Root mean squared error (RMSE):

$$\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}$$
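Both formulas translate to one line of R each. This is a minimal sketch; the function names `mse` and `rmse` and the example vectors are made up for illustration.

```r
# Mean squared error and its square root, following the formulas above
mse  <- function(prediction, truth) mean((prediction - truth)^2)
rmse <- function(prediction, truth) sqrt(mse(prediction, truth))

# Hypothetical predictions and true values
prediction <- c(2.5, 0.0, 2.0, 8.0)
truth      <- c(3.0, -0.5, 2.0, 7.0)

mse(prediction, truth)   # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
rmse(prediction, truth)  # sqrt(0.375)
```

RMSE is on the same scale as the outcome, which often makes it easier to interpret than MSE.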

Common error measures

  1. Mean squared error (or root mean squared error)
    • Continuous data, sensitive to outliers
  2. Median absolute deviation
    • Continuous data, often more robust
  3. Sensitivity (recall)
    • If you want few missed positives
  4. Specificity
    • If you want few negatives called positives
  5. Accuracy
    • Weights false positives/negatives equally
  6. Concordance
  7. Predictive value of a positive (precision)
    • When you are screening and prevalence is low

Why a curve?

  • In binary classification you are predicting one of two categories
    • Alive/dead
    • Click on ad/don't click
  • But your predictions are often quantitative
    • Probability of being alive
    • Prediction on a scale from 1 to 10
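  • The cutoff you choose gives different results. In the earlier section, for example, a frequency above 0.5 was classified as spam; with a cutoff of 0.3 the classification of a given email may change.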
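To make the effect of the cutoff concrete, the sketch below scores the same quantitative predictions at two cutoffs. The probabilities and labels are invented for illustration, and `rates` is a hypothetical helper.

```r
# Hypothetical predicted probabilities of "spam" and the true labels
prob  <- c(0.05, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90)
truth <- c("nonspam", "nonspam", "spam", "nonspam",
           "spam", "spam", "nonspam", "spam")

# Sensitivity and specificity of the rule "predict spam when prob > cutoff"
rates <- function(cutoff) {
  predicted <- ifelse(prob > cutoff, "spam", "nonspam")
  c(cutoff = cutoff,
    sensitivity = mean(predicted[truth == "spam"] == "spam"),
    specificity = mean(predicted[truth == "nonspam"] == "nonspam"))
}

rbind(rates(0.3), rates(0.5))
```

In this toy example, lowering the cutoff from 0.5 to 0.3 raises sensitivity from 0.5 to 1 and lowers specificity from 0.75 to 0.5. Tracing that tradeoff over all cutoffs is what produces a curve.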

ROC curves

An example

Area under the curve

  • AUC = 0.5: random guessing
  • AUC = 1: perfect classifier
  • In general, an AUC above 0.8 is considered "good"
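The two reference values can be reproduced with a small base R function. This sketch uses the rank-based (Mann-Whitney) form of the AUC, which equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name `auc` and the toy inputs are assumptions for illustration, not part of any package used in this course.

```r
# AUC from ranks: probability that a random positive outranks a random negative
auc <- function(score, positive) {
  r  <- rank(score)       # average ranks, so tied scores count one half
  n1 <- sum(positive)     # number of positives
  n0 <- sum(!positive)    # number of negatives
  (sum(r[positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Scores that separate the classes perfectly
perfect <- auc(c(0.1, 0.2, 0.8, 0.9), c(FALSE, FALSE, TRUE, TRUE))
# A constant score carries no information
uninformative <- auc(c(0.5, 0.5, 0.5, 0.5), c(FALSE, TRUE, FALSE, TRUE))
c(perfect = perfect, uninformative = uninformative)
```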

What is good?

Study design

Key idea

  1. Accuracy on the training set (resubstitution accuracy) is optimistic
  2. A better estimate comes from an independent set (test set accuracy)
  3. But we can't use the test set when building the model or it becomes part of the training set
  4. So we estimate the test set accuracy with the training set.



  1. Use the training set

  2. Split it into training/test sets

  3. Build a model on the training set

  4. Evaluate on the test set

  5. Repeat and average the estimated errors
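The five steps above can be sketched as a k-fold loop in base R. Everything here is an assumption made for illustration: the data are simulated, the model is a simple linear regression, and `k = 5` is an arbitrary choice.

```r
set.seed(2024)
# Simulated training data (an assumption): y depends linearly on x plus noise
n <- 100
training <- data.frame(x = runif(n))
training$y <- 2 * training$x + rnorm(n, sd = 0.5)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

foldRmse <- sapply(1:k, function(i) {
  subTrain <- training[fold != i, ]       # build on the other k - 1 folds
  subTest  <- training[fold == i, ]       # evaluate on the held-out fold
  fit <- lm(y ~ x, data = subTrain)
  sqrt(mean((predict(fit, newdata = subTest) - subTest$y)^2))
})

foldRmse
mean(foldRmse)  # averaged estimate of the out-of-sample RMSE
```

The original test set is never touched in this loop; only the training set is split and re-split.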

Used for:

  1. Picking variables to include in a model

  2. Picking the type of prediction function to use

  3. Picking the parameters in the prediction function

  4. Comparing different predictors

Random subsampling


Leave one out


  • For time series data, the data must be used in "chunks"
  • For k-fold cross validation
    • Larger k = less bias, more variance
    • Smaller k = more bias, less variance
  • Random sampling must be done without replacement
  • Random sampling with replacement is the bootstrap
    • Underestimates the error
    • Can be corrected, but it is complicated (0.632 Bootstrap)
  • If you cross-validate to pick predictors, you must estimate errors on independent data.
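The difference between sampling without and with replacement is easy to see in a short sketch. The sample size `n` is arbitrary; the last line uses the standard result that a bootstrap sample of size $n$ is expected to contain a fraction $1 - (1 - 1/n)^n \approx 0.632$ of the distinct original rows, which is commonly given as the origin of the 0.632 in the correction's name.

```r
set.seed(99)
n <- 1000
withoutReplacement <- sample(n, size = n, replace = FALSE)
withReplacement    <- sample(n, size = n, replace = TRUE)  # a bootstrap sample

fracWithout <- length(unique(withoutReplacement)) / n  # every row appears once
fracWith    <- length(unique(withReplacement)) / n     # some rows repeat
expected    <- 1 - (1 - 1/n)^n                         # close to 0.632

c(without = fracWithout, with = fracWith, expected = expected)
```

Because repeated rows can land in both the "training" and "evaluation" parts of a resample, the resulting error estimate tends to be too optimistic.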

A successful predictor

Polling data

Weighting the data

Key idea

To predict X use data related to X

Key idea

To predict player performance use data about player performance

Key idea

To predict movie preferences use data about movie preferences

Key idea

To predict hospitalizations use data about hospitalizations

Not a hard rule

To predict flu outbreaks use Google searches

Looser connection = harder prediction

Data properties matter

Unrelated data is the most common mistake







