【Python】Section 5: Classification and Logistic Regression from HarvardX

1. Classification and kNN

1.1 Classification

Classification is the problem of predicting a discrete label: for example, predicting whether a patient has the value "Yes" or "No" in the AHD (heart disease) column.

Age | RestBP | Thal       | ... | AHD
63  | 145    | Fixed      | ... | No
67  | 108    | Normal     | ... | Yes
... | ...    | ...        | ... | ...
41  | 130    | Reversable | ... | No

 

A key example of classification is medical diagnosis. We can use one (or more) of the features in the data to try and predict whether AHD will be true or false. This can help doctors find patients most at risk of developing heart disease. Similar to regression, we can use any number of features as the inputs to our model.

Our model's response/target variables in a classification problem are categorical. Some examples of categorical targets are:

  • The binary target of whether a patient has heart disease
  • The three-class target of a self-driving car scanning a traffic light to determine whether the light is green, yellow, or red
  • The five-class target of a weather-prediction model that is trying to determine whether tomorrow's description will be Rain, Snow, Cloudy, Sunny, or Partly Cloudy
  • Grouping customers into similar "types"

In the following section we will focus on two algorithms for classification:

  • k-Nearest Neighbors classification (similar to the kNN algorithm for regression)
  • Logistic regression

1.2 kNN for Classification and Regression

The k-Nearest Neighbors (kNN) algorithm works for both regression and classification. In both cases, a kNN model makes predictions on new data points using the k most similar points from our training dataset.

kNN for Regression

For each value of k, our output value is the average of the k nearest neighbors' outputs:

\hat{f}\left(x_0\right)=\frac{1}{k}\sum_{i\in N_0}y_i ,\qquad \text{where } N_0 \text{ is the set of the } k \text{ nearest neighbors of } x_0
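As a concrete illustration, here is a minimal NumPy sketch of this averaging rule; the names X_train, y_train, and x_new and the choice k=3 are illustrative, not part of the course code:

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.sqrt(np.sum((X_train - x_new)**2, axis=1))
    # indices of the k nearest neighbors (the set N_0)
    nearest = np.argsort(dists)[:k]
    # the prediction is the average of the neighbors' outputs
    return np.mean(y_train[nearest])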

kNN for Classification

We classify a data point x_0 based on the labels of its neighbors

P\left( Y=j\ |\ X=x_0 \right) = \frac{1}{k}\sum_{i\in N_0}I\left(y_i=j\right)

N_0 is defined as the set of the k nearest neighbors to x_0.

I is the indicator function, equal to 1 when neighbor i has the label j and equal to 0 otherwise.

This formula builds a probability distribution for the class as the relative frequency of the classes in the set of neighbors N_0.

For example, if there are k=50 neighbors of x_0, and 10 of the neighbors have label j, then the probability that x_0 has that same label j is given by

P(Y=j|X=x_0)=\frac{10}{50}=0.2
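The same relative-frequency calculation is easy to write down in NumPy; neighbor_labels below is a hypothetical array holding the labels of the k nearest neighbors of x_0:

import numpy as np

neighbor_labels = np.array(["No", "Yes", "No", "No", "Yes"])   # labels of the k nearest neighbors
k = len(neighbor_labels)
classes, counts = np.unique(neighbor_labels, return_counts=True)
probs = counts / k                      # P(Y = j | X = x_0) for each class j
prediction = classes[np.argmax(probs)]  # predict the most frequent class ("No" here: 3/5 vs 2/5)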

Changing the number of neighbors k: how smooth should our model be?

Notice that the behavior of kNN for classification is similar to kNN for regression: as we increase k, we tend to see a smoother pattern in our model.

Here's an example: a set of 100 data points fit with four different k values (1, 10, 70, and 100). The fitted curve starts off very jagged at k=1, becomes smoother at k=10 and smoother still at k=70, and is just a flat line at k=100.

How do we identify similar data points when the data has many features?

Here we walk through an example of standardizing a dataset so that we can get a better "distance" measurement between multi-dimensional data that includes a mix of quantitative and categorical features. In Python, you can use sklearn.preprocessing.StandardScaler to do this.

Suppose you are working for a movie ticket purchasing website. Users create an account for your website through an existing social media app. Your job is to build a model that can predict a user's favorite movie genre. You have access to some survey data your company recently collected:

feature                | data type
year born              | integer
new user               | boolean, True for new users, False otherwise
# social media friends | integer
favorite genre         | categorical with 4 options: Comedy, Action, Romance, and SciFi

Using this data, you can build a k-nearest neighbors algorithm to try and predict, for other users not included in the dataset, what their favorite genre is.

Original unscaled data

Year Born | New/Existing User | # Friends | Favorite Genre
1998      | 0                 | 312       | Action
1992      | 1                 | 65        | SciFi
2001      | 1                 | 1923      | Comedy
1987      | 0                 | 203       | Romance
1974      | 1                 | 767       | Romance
2000      | 0                 | 54        | Action

Consider a user born in 1990, who is an existing user with 1000 friends. We represent this new user as

User = [1990, 0, 1000]

and compute a Euclidean distance measurement with our existing users in our dataset. For example, the distance between User and our first row of data is given by

D\left(\left[1990,\ 0,\ 1000\right],\ \left[1998,\ 0,\ 312\right]\right)=\ \sqrt{\left(1990-1998\right)^2+\left(0 - 0\right)^2+\left(1000 - 312\right)^2}

Year Born | New/Existing User | # Friends | Favorite Genre | Euclidean Distance to User
1998      | 0                 | 312       | Action         | 688.05
1992      | 1                 | 65        | SciFi          | 935.00
2001      | 1                 | 1923      | Comedy         | 923.07
1987      | 0                 | 203       | Romance        | 797.00
1974      | 1                 | 767       | Romance        | 233.55
2000      | 0                 | 54        | Action         | 946.05

Now we can predict User's favorite genre using the k=3 nearest neighbors. The nearest neighbors are rows 1, 4, and 5. We assign a 2/3 probability that User's favorite genre is Romance based on the two neighbors who prefer Romance, and a 1/3 probability that the favorite genre is Action based on the one neighbor who prefers Action, and therefore we would predict the label Romance.
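The distances in the table, and the choice of neighbors, can be reproduced with a short NumPy sketch (the variable names are illustrative):

import numpy as np

data = np.array([[1998, 0,  312],
                 [1992, 1,   65],
                 [2001, 1, 1923],
                 [1987, 0,  203],
                 [1974, 1,  767],
                 [2000, 0,   54]], dtype=float)
user = np.array([1990, 0, 1000], dtype=float)

# Euclidean distance from User to each row of the dataset
dists = np.sqrt(np.sum((data - user)**2, axis=1))
print(np.round(dists, 2))          # matches the distance column above (up to rounding)
print(np.argsort(dists)[:3] + 1)   # [5 1 4]: the three nearest rows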

But what could be going wrong here? Our distance measurement is flawed: most of the distance comes from the difference in the number of friends, so it under-reacts to the difference in year born. We have not given any reason why our distance measurement should depend more on one feature than another, and in the absence of such an explicit reason, the distance measure should treat all features equally.

Scaled data using standard scaler

One way to get a balanced distance between data, regardless of the specific features, is to "standardize".

Standardization

Standardizing means applying this formula:

x_i\ \rightarrow\ \frac{x_i-\text{mean}\left(x_i\right)}{\text{stdev}\left(x_i\right)}

to each feature x_i across all the rows in the dataset, where the mean and standard deviation are computed per feature over the whole dataset.
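As a quick check of this formula, here is a small sketch (illustrative only) that standardizes the "year born" column from the table above and applies the same transformation to User's year of birth:

import numpy as np

year_born = np.array([1998, 1992, 2001, 1987, 1974, 2000], dtype=float)
mean, stdev = year_born.mean(), year_born.std()   # population stdev (ddof=0)
print((1998 - mean) / stdev)   # about 0.638, the first scaled "year born" below
print((1990 - mean) / stdev)   # about -0.213, the scaled "year born" for User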

Scaled User: [-0.213, -1.0, 0.679]

Scaled Year Born | Scaled New/Existing User | Scaled # Friends | Favorite Genre | Euclidean Distance to Scaled User
0.638            | -1.0                     | -0.368           | Action         | 1.349
0.0              | 1.0                      | -0.744           | SciFi          | 2.464
0.958            | 1.0                      | 2.083            | Comedy         | 2.710
-0.532           | -1.0                     | -0.534           | Romance        | 1.254
-1.915           | 1.0                      | 0.324            | Romance        | 2.650
0.851            | -1.0                     | -0.761           | Action         | 1.790

Now we can predict User's favorite genre using the k=3 nearest neighbors. This time, the nearest neighbors are rows 1, 4, and 6. We assign a 2/3 probability that User's favorite genre is Action based on the two neighbors who prefer Action, and a 1/3 probability that the favorite genre is Romance based on the one neighbor who prefers Romance, and therefore we would predict the label Action.

Is this a better distance function, and is it better that this time our model labeled Action instead of Romance? Well, without the actual labels for the data we cannot know for certain. But this time, with scaled data, our model is recognizing the similarities/differences in age at the same level of importance as the similarities/differences in the # of friends. So we can at least have more confidence that our model will be learning more directly from patterns in the data.
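For reference, here is a sketch of the same workflow using scikit-learn's StandardScaler together with a k=3 nearest-neighbor classifier. The variable names are illustrative; on this toy data it should reproduce the scaled values and the Action prediction above (up to rounding).

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1998, 0,  312],
              [1992, 1,   65],
              [2001, 1, 1923],
              [1987, 0,  203],
              [1974, 1,  767],
              [2000, 0,   54]], dtype=float)
y = np.array(["Action", "SciFi", "Comedy", "Romance", "Romance", "Action"])
user = np.array([[1990, 0, 1000]], dtype=float)

scaler = StandardScaler().fit(X)        # learn each column's mean and stdev
X_scaled = scaler.transform(X)          # standardize the training rows
user_scaled = scaler.transform(user)    # standardize the new user the same way

knn = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, y)
print(knn.predict(user_scaled))         # expected: ['Action']
print(knn.predict_proba(user_scaled))   # class probabilities (2/3 Action, 1/3 Romance)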

2. Logistic Regression

To learn our logistic model through optimization:

  1. We compute the negative log-likelihood function, which serves as the loss function of our model.
  2. We compute the derivative of the negative log-likelihood function for gradient descent optimization.

2.1 Computing the log-likelihood function

We compute how well our logistic model fits the data by comparing its output probabilities with the labels for each data point, using a likelihood function.

The likelihood function for our predicted probability p_i (a value between 0 and 1) and the true label y_i (either 0 or 1) is given by:

\mathcal{L}\left(p_i\ |\ Y_i\right)=P\left(Y_i=y_i\right)=p_i^{y_i}\left(1-p_i\right)^{\left(1-y_i\right)}

We use this formula because when the label y_i is 1, the likelihood is p_i , and when the label y_i is 0, the likelihood is 1 - p_i.
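A tiny numeric sketch of this behavior (the probabilities here are made up purely for illustration):

def likelihood(p_i, y_i):
    # p_i**y_i * (1 - p_i)**(1 - y_i) collapses to p_i when y_i = 1,
    # and to 1 - p_i when y_i = 0
    return p_i**y_i * (1 - p_i)**(1 - y_i)

print(likelihood(0.8, 1))   # 0.8 -- a confident, correct prediction has high likelihood
print(likelihood(0.8, 0))   # about 0.2 -- a confident, wrong prediction has low likelihood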

To get our total likelihood, we multiply the likelihoods of all our data points (using the capital Greek letter Pi, \prod_i, to represent a product over all i):

\mathcal{L} \left( p|Y\right)\ =\prod_i\mathcal{L}\left(p_i|Y_i\right)

However, an easier function to optimize is the log-likelihood.

Review from previous section

The log function is monotonically increasing, so the parameters that optimize the log-likelihood also optimize the total likelihood. And since the log-likelihood is a sum instead of a product, it is computationally easier to work with. To turn the procedure into a minimization instead of a maximization, we minimize the negative log-likelihood instead of maximizing the log-likelihood.

Our negative log-likelihood is

-\log\left(\mathcal{L}\left(p|Y\right)\right)

We simplify this expression by converting the log of the product over data points into a sum of logs:

-\log(\mathcal{L}(p|Y))= -\log(\Pi_i(\mathcal{L}(p_i|Y_i)))=-\Sigma_i(\log(\mathcal{L}(p_i|Y_i)))

We simplify again by substituting the formula for each per-point likelihood and splitting the log of the product into a sum of two logs:

-\Sigma_i(\log(\mathcal{L}(p_i|Y_i))) = -\Sigma_i(\log(p_i^{y_i}\left(1-p_i\right)^{\left(1-y_i\right)})) = -\Sigma_i(\log(p_i^{y_i})+\log((1-p_i)^{1-y_i}))

Finally, we bring the exponents inside the logs out front as multiplying factors:

-\Sigma_i(\log(p_i^{y_i})+\log((1-p_i)^{1-y_i})) = -\Sigma_i\left(y_i\log\left(p_i\right)+\left(1-y_i\right)\log\left(1-p_{i}\right)\right)

Our loss function will be the average log-likelihood of our data points instead of the total log-likelihood, so that we can equally compare the log-likelihoods when the number of data points changes.

Loss\left(p|Y\right)= -\frac{1}{N}\Sigma_i\left(y_i\log\left(p_i\right)+\left(1-y_i\right)\log\left(1-p_{i}\right)\right)

Here is a NumPy implementation of this loss function which is vectorized, meaning you can use it for any number of beta parameters in a vector called beta. Our work so far has assumed that \beta = [\beta_0, \beta_1] for a one-dimensional dataset, X_\text{train} . But the code below would also work for \beta = [\beta_0, \dots, \beta_{10}] for a ten-dimensional X_\text{train}.

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def loss(X, y, beta, N):
    # X : inputs (including a column of ones for the intercept), y : labels,
    # beta : model parameters, N : # datapoints
    p_hat = sigmoid(np.dot(X, beta))   # predicted probabilities
    return -(1/N)*np.sum(y*np.log(p_hat) + (1 - y)*np.log(1 - p_hat))

2.2 Computing the derivative of the log-likelihood function for gradient descent optimization

To optimize our logistic regression model, we want to find beta parameters to minimize the above equation for negative log-likelihood. This corresponds to finding a model whose output probabilities most closely match the data labels.

To compute our derivative, we first calculate the derivative of loss with respect to the probabilities p_i, then calculate the derivative of the probabilities with respect to our beta parameters. Using the chain rule, we will derive a simple formula for the derivative of our loss.

Calculating the derivative of the loss with respect to our model probabilities

Since the derivative of log(x) is 1/x, the derivative of our loss function with respect to the probabilities p_i is

\frac{d\ \text{Loss}\left(p|Y\right)}{d\ p}=-\frac{1}{N}\Sigma_i(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i})

The derivative of p_i with respect to our Beta parameters (using the chain rule) is

\frac{dp_i}{d\beta}=\frac{d}{d\beta}\left(\frac{1}{1+e^{-\beta x_i}}\right)=\frac{-1}{\left(1+e^{-\beta x_i}\right)^2}\left(e^{-\beta x_i}\right)\left(-x_i\right)=\frac{e^{-\beta x_i}}{\left(1+e^{-\beta x_i}\right)^2}\,x_i

To simplify this equation, we can re-write it directly in terms of the probabilities p_i:

\frac{dp_i}{d\beta}=\frac{1}{1+e^{-\beta x_i}}\cdot\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\,x_i=p_i\left(1-p_i\right)x_i

Calculating the derivative of the loss with respect to our model parameters

Due to the chain rule, the derivative of our loss is

\frac{d\ \text{Loss}\left(p|Y\right)}{d\beta}= \frac{d\ \text{Loss}(p|Y)}{dp}\frac{dp}{d\beta}=\Sigma_i[\frac{d\ \text{Loss}\left(p_i|Y_i\right)}{d\ p_i}\frac{dp_i}{d\beta}]

=-\frac{1}{N}\Sigma_i[(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i})p_i(1-p_i)x_i]

=-\frac{1}{N}\Sigma_i[(y_i(1-p_i)-(1-y_i)p_i)x_i]

This final sum of products can be re-written as a dot product in the following way, making our Python implementation simpler:

=-\frac{1}{N}[Y(1-p)-(1-Y)p]\cdot X

Here is a NumPy implementation of the derivative of our loss function.

def d_loss(X, y, beta, N):
    # X : inputs, y : labels, beta : model parameters, N : # datapoints
    p_hat = sigmoid(np.dot(X, beta))   # predicted probabilities
    return np.dot(X.T, -(1/N)*(y*(1 - p_hat) - (1 - y)*p_hat))
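As a quick sanity check (not part of the original code), we can compare d_loss against a finite-difference approximation of loss on small random data; X here is assumed to include a column of ones for the intercept:

rng = np.random.default_rng(0)
N = 20
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept column + one feature
y = rng.integers(0, 2, size=N).astype(float)            # random 0/1 labels
beta = rng.normal(size=2)

eps = 1e-6
numeric = np.zeros(2)
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    # central finite difference of the loss along parameter j
    numeric[j] = (loss(X, y, beta + step, N) - loss(X, y, beta - step, N)) / (2 * eps)

print(np.allclose(numeric, d_loss(X, y, beta, N)))   # should print True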

To use gradient descent, we start with an initial guess for our beta parameters and then update them in small steps in the direction that fits the data better, that is, the direction that decreases the negative log-likelihood. After each step we can compute the loss to track the progress of the optimization.

step_size = 0.5
n_iter = 500
N = len(y_train)     # assumes X_train (with a column of ones) and y_train are already defined
beta = np.zeros(2)   # initial guess for beta_0 and beta_1 (one feature plus intercept)
losses = []
for _ in range(n_iter):
    beta = beta - step_size * d_loss(X_train, y_train, beta, N)
    losses.append(loss(X_train, y_train, beta, N))

Turning probabilities into discrete classifications

The model below can act as a classifier for the heart disease dataset, in which the two classes overlap substantially. The height of the orange model curve represents our model's probability that a patient with that maxHR has heart disease. Notice that for a maximum heart rate close to the mean of our dataset, such as 130, it is genuinely unclear whether the patient has heart disease, and in this case our model predicts close to a 50% probability of heart disease. On the other hand, if the max heart rate is 190, our model predicts close to a 10% probability of heart disease, and if the max heart rate is 85, it predicts close to a 90% probability of heart disease.

A graph as described above.

Turning model probabilities into concrete classifications (for example, declaring that a patient is a YES, NO, or MAYBE on whether they have heart disease) requires choosing a classification threshold. A common threshold is 50% (predict YES for p > 50% and NO for p < 50%), but for certain applications it may make sense to choose a different threshold.

Let p be the probability output by our model. Because the classes in our data overlap significantly, it may be sensible to classify a patient as YES when p \geq 0.9, as NO when p \leq 0.1, and as MAYBE for probabilities in between.
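Here is a minimal sketch of applying such thresholds to an array of model probabilities; the cutoffs 0.9 and 0.1 follow the example above, and p_hat is a hypothetical array of predicted probabilities (for instance, sigmoid(np.dot(X, beta)) from the fitted model):

import numpy as np

def classify(p_hat, lower=0.1, upper=0.9):
    # YES above the upper cutoff, NO below the lower cutoff, MAYBE in between
    return np.where(p_hat >= upper, "YES", np.where(p_hat <= lower, "NO", "MAYBE"))

print(classify(np.array([0.95, 0.40, 0.05])))   # ['YES' 'MAYBE' 'NO']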
