【Python】Section 5: Classification and Logistic Regression from HarvardX

1. Classification and kNN

1.1 Classification

Classification is the problem of predicting a discrete label: for example, predicting whether a patient has the value "Yes" or "No" in the AHD (heart disease) column.

Age | RestBP | Thal       | ... | AHD
63  | 145    | Fixed      | ... | No
67  | 108    | Normal     | ... | Yes
... | ...    | ...        | ... | ...
41  | 130    | Reversable | ... | No

 

A key example of classification is medical diagnosis. We can use one (or more) of the features in the data to try and predict whether AHD will be true or false. This can help doctors find patients most at risk of developing heart disease. Similar to regression, we can use any number of features as the inputs to our model.

Our model's response/target variables in a classification problem are categorical. Some examples of categorical targets are:

  • The binary target of whether a patient has heart disease
  • The three-class target of a self-driving car scanning a traffic light to determine whether the light is green, yellow, or red
  • The five-class target of a weather-prediction model that is trying to determine whether tomorrow's description will be Rain, Snow, Cloudy, Sunny, or Partly Cloudy
  • Grouping customers into similar "types"

In the following section we will focus on two algorithms for classification:

  • k-Nearest Neighbors classification (similar to the kNN algorithm for regression)
  • Logistic regression

1.2 kNN for Classification and Regression

The k-Nearest Neighbors (kNN) algorithm works for both regression and classification. In both cases, a kNN model makes predictions on new data points using the k most similar points from our training dataset.

kNN for Regression

For each value of k, our output value is the average of the k nearest neighbors' outputs:

\hat{f}\left(x_0\right)=\frac{1}{k}\sum_{i\in N_0}y_i ,\qquad \text{where } N_0 \text{ is the set of the } k \text{ nearest neighbors of } x_0
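As a concrete illustration, here is a minimal NumPy sketch of this averaging rule; the names X_train, y_train, and x_new and the choice k=3 are illustrative, not part of the course code:

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.sqrt(np.sum((X_train - x_new)**2, axis=1))
    # indices of the k nearest neighbors (the set N_0)
    nearest = np.argsort(dists)[:k]
    # the prediction is the average of the neighbors' outputs
    return np.mean(y_train[nearest])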

kNN for Classification

We classify a data point x_0 based on the labels of its neighbors

P\left( Y=j\ |\ X=x_0 \right) = \frac{1}{k}\sum_{i\in N_0}I\left(y_i=j\right)

N_0 is defined as the set of the k nearest neighbors to x_0.

I is the indicator function, equal to 1 when neighbor i has the label j and equal to 0 otherwise.

This formula builds a probability distribution for the class as the relative frequency of the classes in the set of neighbors N_0.

For example, if there are k=50 neighbors of x_0, and 10 of the neighbors have label j, then the probability that x_0 has that same label j is given by

P(Y=j|X=x_0)=\frac{10}{50}=0.2
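The same relative-frequency calculation is easy to write down in NumPy; neighbor_labels below is a hypothetical array holding the labels of the k nearest neighbors of x_0:

import numpy as np

neighbor_labels = np.array(["No", "Yes", "No", "No", "Yes"])   # labels of the k nearest neighbors
k = len(neighbor_labels)
classes, counts = np.unique(neighbor_labels, return_counts=True)
probs = counts / k                      # P(Y = j | X = x_0) for each class j
prediction = classes[np.argmax(probs)]  # predict the most frequent class ("No" here: 3/5 vs 2/5)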

Changing the number of neighbors k: how smooth should our model be?

Notice that the behavior of kNN for classification is similar to kNN for regression: as we increase k, we tend to see a smoother pattern in our model.

Here's an example: a set of 100 data points fit with four different k values (1, 10, 70, and 100). The fitted curve starts off very jagged at k=1, becomes smoother at k=10 and smoother still at k=70, and is just a flat line at k=100.

How do we identify similar data points when the data has many features?

Here we walk through an example of standardizing a dataset so that we can get a better "distance" measurement between multi-dimensional data that includes a mix of quantitative and categorical features. In Python, you can use sklearn.preprocessing.StandardScaler to do this.

Suppose you are working for a movie ticket purchasing website. Users create an account for your website through an existing social media app. Your job is to build a model that can predict a user's favorite movie genre. You have access to some survey data your company recently collected:

feature                | data type
year born              | integer
new user               | boolean, True for new users, False otherwise
# social media friends | integer
favorite genre         | categorical with 4 options: Comedy, Action, Romance, and SciFi

Using this data, you can build a k-nearest neighbors algorithm to try and predict, for other users not included in the dataset, what their favorite genre is.

Original unscaled data

Year Born | New/Existing User | # Friends | Favorite Genre
1998      | 0                 | 312       | Action
1992      | 1                 | 65        | SciFi
2001      | 1                 | 1923      | Comedy
1987      | 0                 | 203       | Romance
1974      | 1                 | 767       | Romance
2000      | 0                 | 54        | Action

Consider a user born in 1990, who is an existing user with 1000 friends. We represent this new user as

User = [1990, 0, 1000]

and compute a Euclidean distance measurement with our existing users in our dataset. For example, the distance between User and our first row of data is given by

D\left(\left[1990,\ 0,\ 1000\right],\ \left[1998,\ 0,\ 312\right]\right)=\ \sqrt{\left(1990-1998\right)^2+\left(0 - 0\right)^2+\left(1000 - 312\right)^2}

Year Born | New/Existing User | # Friends | Favorite Genre | Euclidean Distance to User
1998      | 0                 | 312       | Action         | 688.05
1992      | 1                 | 65        | SciFi          | 935.00
2001      | 1                 | 1923      | Comedy         | 923.07
1987      | 0                 | 203       | Romance        | 797.00
1974      | 1                 | 767       | Romance        | 233.55
2000      | 0                 | 54        | Action         | 946.05

Now we can predict User's favorite genre using the k=3 nearest neighbors. The nearest neighbors are rows 1, 4, and 5. We assign a 2/3 probability that User's favorite genre is Romance based on the two neighbors who prefer Romance, and a 1/3 probability that the favorite genre is Action based on the one neighbor who prefers Action, and therefore we would predict the label Romance.
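The distances in the table, and the choice of neighbors, can be reproduced with a short NumPy sketch (the variable names are illustrative):

import numpy as np

data = np.array([[1998, 0,  312],
                 [1992, 1,   65],
                 [2001, 1, 1923],
                 [1987, 0,  203],
                 [1974, 1,  767],
                 [2000, 0,   54]], dtype=float)
user = np.array([1990, 0, 1000], dtype=float)

# Euclidean distance from User to each row of the dataset
dists = np.sqrt(np.sum((data - user)**2, axis=1))
print(np.round(dists, 2))          # matches the distance column above (up to rounding)
print(np.argsort(dists)[:3] + 1)   # [5 1 4]: the three nearest rows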

But what could be going wrong here? Our distance measurement is flawed: most of the distance comes from the difference in the number of friends, so it under-reacts to the difference in year born. We have not given any reason why our distance measurement should depend more on one feature than another, and in the absence of such an explicit reason, the distance measure should treat all features equally.

Scaled data using standard scaler

One way to get a balanced distance between data, regardless of the specific features, is to "standardize".

Standardization

Standardizing means applying this formula:

x_i\ \rightarrow\ \frac{x_i-\text{mean}\left(x_i\right)}{\text{stdev}\left(x_i\right)}

to each feature x_i across all the rows in the dataset, where the mean and standard deviation are computed per feature over the whole dataset.
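As a quick check of this formula, here is a small sketch (illustrative only) that standardizes the "year born" column from the table above and applies the same transformation to User's year of birth:

import numpy as np

year_born = np.array([1998, 1992, 2001, 1987, 1974, 2000], dtype=float)
mean, stdev = year_born.mean(), year_born.std()   # population stdev (ddof=0)
print((1998 - mean) / stdev)   # about 0.638, the first scaled "year born" below
print((1990 - mean) / stdev)   # about -0.213, the scaled "year born" for User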

Scaled User: [-0.213, -1.0, 0.679]

Scaled Year Born | Scaled New/Existing User | Scaled # Friends | Favorite Genre | Euclidean Distance to Scaled User
0.638            | -1.0                     | -0.368           | Action         | 1.349
0.0              | 1.0                      | -0.744           | SciFi          | 2.464
0.958            | 1.0                      | 2.083            | Comedy         | 2.710
-0.532           | -1.0                     | -0.534           | Romance        | 1.254
-1.915           | 1.0                      | 0.324            | Romance        | 2.650
0.851            | -1.0                     | -0.761           | Action         | 1.790

Now we can predict User's favorite genre using the k=3 nearest neighbors. This time, the nearest neighbors are rows 1, 4, and 6. We assign a 2/3 probability that User's favorite genre is Action based on the two neighbors who prefer Action, and a 1/3 probability that the favorite genre is Romance based on the one neighbor who prefers Romance, and therefore we would predict the label Action.

Is this a better distance function, and is it better that this time our model labeled Action instead of Romance? Well, without the actual labels for the data we cannot know for certain. But this time, with scaled data, our model is recognizing the similarities/differences in age at the same level of importance as the similarities/differences in the # of friends. So we can at least have more confidence that our model will be learning more directly from patterns in the data.
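For reference, here is a sketch of the same workflow using scikit-learn's StandardScaler together with a k=3 nearest-neighbor classifier. The variable names are illustrative; on this toy data it should reproduce the scaled values and the Action prediction above (up to rounding).

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1998, 0,  312],
              [1992, 1,   65],
              [2001, 1, 1923],
              [1987, 0,  203],
              [1974, 1,  767],
              [2000, 0,   54]], dtype=float)
y = np.array(["Action", "SciFi", "Comedy", "Romance", "Romance", "Action"])
user = np.array([[1990, 0, 1000]], dtype=float)

scaler = StandardScaler().fit(X)        # learn each column's mean and stdev
X_scaled = scaler.transform(X)          # standardize the training rows
user_scaled = scaler.transform(user)    # standardize the new user the same way

knn = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, y)
print(knn.predict(user_scaled))         # expected: ['Action']
print(knn.predict_proba(user_scaled))   # class probabilities (2/3 Action, 1/3 Romance)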

2. Logistic Regression

To learn our logistic model through optimization:

  1. We compute the negative log-likelihood function, which serves as the loss function of our model.
  2. We compute the derivative of the negative log-likelihood function for gradient descent optimization.

2.1 Computing the log-likelihood function

We compute how well our logistic model fits the data by comparing its output probabilities with the labels for each data point, using a likelihood function.

The likelihood function for our predicted probability p_i (a value between 0 and 1) and the true label y_i (either 0 or 1) is given by:

\mathcal{L}\left(p_i\ |\ Y_i\right)=P\left(Y_i=y_i\right)=p_i^{y_i}\left(1-p_i\right)^{\left(1-y_i\right)}

We use this formula because when the label y_i is 1, the likelihood is p_i , and when the label y_i is 0, the likelihood is 1 - p_i.
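A tiny numeric sketch of this behavior (the probabilities here are made up purely for illustration):

def likelihood(p_i, y_i):
    # p_i**y_i * (1 - p_i)**(1 - y_i) collapses to p_i when y_i = 1,
    # and to 1 - p_i when y_i = 0
    return p_i**y_i * (1 - p_i)**(1 - y_i)

print(likelihood(0.8, 1))   # 0.8 -- a confident, correct prediction has high likelihood
print(likelihood(0.8, 0))   # about 0.2 -- a confident, wrong prediction has low likelihood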

To get our total likelihood, we multiply the likelihoods of all our data points (using the capital Greek letter Pi, \prod_i, to represent a product over all i):

\mathcal{L} \left( p|Y\right)\ =\prod_i\mathcal{L}\left(p_i|Y_i\right)

However, an easier function to optimize is the log-likelihood.

Review from previous section

The log function is monotonically increasing, so the parameters that optimize the log-likelihood also optimize the total likelihood. And since the log-likelihood is a sum instead of a product, it is computationally easier to work with. To turn the procedure into a minimization instead of a maximization, we minimize the negative log-likelihood instead of maximizing the log-likelihood.

Our negative log-likelihood is

-\log\left(\mathcal{L}\left(p|Y\right)\right)

We simplify this expression by converting the log of the product over data points into a sum of logs:

-\log(\mathcal{L}(p|Y))= -\log(\Pi_i(\mathcal{L}(p_i|Y_i)))=-\Sigma_i(\log(\mathcal{L}(p_i|Y_i)))

We simplify again by substituting the formula for each per-point likelihood and splitting the log of the product into a sum of two logs:

-\Sigma_i(\log(\mathcal{L}(p_i|Y_i))) = -\Sigma_i(\log(p_i^{y_i}\left(1-p_i\right)^{\left(1-y_i\right)})) = -\Sigma_i(\log(p_i^{y_i})+\log((1-p_i)^{1-y_i}))

Finally, we bring the exponents inside the logs out front as multiplying factors:

-\Sigma_i(\log(p_i^{y_i})+\log((1-p_i)^{1-y_i})) = -\Sigma_i\left(y_i\log\left(p_i\right)+\left(1-y_i\right)\log\left(1-p_{i}\right)\right)

Our loss function will be the average log-likelihood of our data points instead of the total log-likelihood, so that we can equally compare the log-likelihoods when the number of data points changes.

Loss\left(p|Y\right)= -\frac{1}{N}\Sigma_i\left(y_i\log\left(p_i\right)+\left(1-y_i\right)\log\left(1-p_{i}\right)\right)

Here is a NumPy implementation of this loss function which is vectorized, meaning you can use it for any number of beta parameters in a vector called beta. Our work so far has assumed that \beta = [\beta_0, \beta_1] for a one-dimensional dataset, X_\text{train} . But the code below would also work for \beta = [\beta_0, \dots, \beta_{10}] for a ten-dimensional X_\text{train}.

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def loss(X, y, beta, N):
    # X : inputs (including a column of ones for the intercept), y : labels,
    # beta : model parameters, N : # datapoints
    p_hat = sigmoid(np.dot(X, beta))   # predicted probabilities
    return -(1/N)*np.sum(y*np.log(p_hat) + (1 - y)*np.log(1 - p_hat))

2.2 Computing the derivative of the log-likelihood function for gradient descent optimization

To optimize our logistic regression model, we want to find beta parameters to minimize the above equation for negative log-likelihood. This corresponds to finding a model whose output probabilities most closely match the data labels.

To compute our derivative, we first calculate the derivative of loss with respect to the probabilities p_i, then calculate the derivative of the probabilities with respect to our beta parameters. Using the chain rule, we will derive a simple formula for the derivative of our loss.

Calculating the derivative of the loss with respect to our model probabilities

Since the derivative of log(x) is 1/x, the derivative of our loss function with respect to the probabilities p_i is

\frac{d\ \text{Loss}\left(p|Y\right)}{d\ p}=-\frac{1}{N}\Sigma_i(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i})

The derivative of p_i with respect to our Beta parameters (using the chain rule) is

\frac{dp_i}{d\beta}=\frac{d}{d\beta}\left(\frac{1}{1+e^{-\beta x_i}}\right)=\frac{-1}{\left(1+e^{-\beta x_i}\right)^2}\left(e^{-\beta x_i}\right)\left(-x_i\right)=\frac{e^{-\beta x_i}}{\left(1+e^{-\beta x_i}\right)^2}\,x_i

To simplify this equation, we can re-write it directly in terms of the probabilities p_i:

\frac{dp_i}{d\beta}=\frac{1}{1+e^{-\beta x_i}}\cdot\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\,x_i=p_i\left(1-p_i\right)x_i

Calculating the derivative of the loss with respect to our model parameters

Due to the chain rule, the derivative of our loss is

\frac{d\ \text{Loss}\left(p|Y\right)}{d\beta}= \frac{d\ \text{Loss}(p|Y)}{dp}\frac{dp}{d\beta}=\Sigma_i[\frac{d\ \text{Loss}\left(p_i|Y_i\right)}{d\ p_i}\frac{dp_i}{d\beta}]

=-\frac{1}{N}\Sigma_i[(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i})p_i(1-p_i)x_i]

=-\frac{1}{N}\Sigma_i[(y_i(1-p_i)-(1-y_i)p_i)x_i]

This final sum of products can be re-written as a dot product in the following way, making our Python implementation simpler:

=-\frac{1}{N}[Y(1-p)-(1-Y)p]\cdot X

Here is a NumPy implementation of the derivative of our loss function.

def d_loss(X, y, beta, N):
    # X : inputs, y : labels, beta : model parameters, N : # datapoints
    p_hat = sigmoid(np.dot(X, beta))   # predicted probabilities
    return np.dot(X.T, -(1/N)*(y*(1 - p_hat) - (1 - y)*p_hat))
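As a quick sanity check (not part of the original code), we can compare d_loss against a finite-difference approximation of loss on small random data; X here is assumed to include a column of ones for the intercept:

rng = np.random.default_rng(0)
N = 20
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept column + one feature
y = rng.integers(0, 2, size=N).astype(float)            # random 0/1 labels
beta = rng.normal(size=2)

eps = 1e-6
numeric = np.zeros(2)
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    # central finite difference of the loss along parameter j
    numeric[j] = (loss(X, y, beta + step, N) - loss(X, y, beta - step, N)) / (2 * eps)

print(np.allclose(numeric, d_loss(X, y, beta, N)))   # should print True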

To use gradient descent, we start with an initial guess for our beta parameters and then update them in small steps in the direction that fits the data better, that is, the direction that decreases the negative log-likelihood. After each step we can compute the loss to track the progress of the optimization.

step_size = 0.5
n_iter = 500
N = len(y_train)     # assumes X_train (with a column of ones) and y_train are already defined
beta = np.zeros(2)   # initial guess for beta_0 and beta_1 (one feature plus intercept)
losses = []
for _ in range(n_iter):
    beta = beta - step_size * d_loss(X_train, y_train, beta, N)
    losses.append(loss(X_train, y_train, beta, N))

Turning probabilities into discrete classifications

The model below can act as a classifier for the heart disease dataset, in which the two classes overlap substantially. The height of the orange model curve represents our model's probability that a patient with that maxHR has heart disease. Notice that for a maximum heart rate close to the mean of our dataset, such as 130, it is genuinely unclear whether the patient has heart disease, and in this case our model predicts close to a 50% probability of heart disease. On the other hand, if the max heart rate is 190, our model predicts close to a 10% probability of heart disease, and if the max heart rate is 85, it predicts close to a 90% probability of heart disease.

A graph as described above.

Turning model probabilities into concrete classifications (for example, declaring that a patient is a YES, NO, or MAYBE on whether they have heart disease) requires choosing a classification threshold. A common threshold is 50% (predict YES for p > 50% and NO for p < 50%), but for certain applications it may make sense to choose a different threshold.

Let p be the probability output by our model. Because the classes in our data overlap significantly, it may be sensible to classify a patient as YES when p \geq 0.9, as NO when p \leq 0.1, and as MAYBE for probabilities in between.
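Here is a minimal sketch of applying such thresholds to an array of model probabilities; the cutoffs 0.9 and 0.1 follow the example above, and p_hat is a hypothetical array of predicted probabilities (for instance, sigmoid(np.dot(X, beta)) from the fitted model):

import numpy as np

def classify(p_hat, lower=0.1, upper=0.9):
    # YES above the upper cutoff, NO below the lower cutoff, MAYBE in between
    return np.where(p_hat >= upper, "YES", np.where(p_hat <= lower, "NO", "MAYBE"))

print(classify(np.array([0.95, 0.40, 0.05])))   # ['YES' 'MAYBE' 'NO']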
