模型表达(Model Representation)
我们通过一个例子作为开始,这个例子就是我们之前的预测房价的例子。我们要使用一个数据集,数据集包含某地的住房价格。假设你有一朋友正想出售自己面积为1250平方英尺的房子,你要告诉他这房子可以卖多少钱。这时你就需构建一个模型,从这个数据集来看这或许是条直线,从图中的原谅绿你可以对你朋友说这房子可以卖22万美元左右。
房屋价格数据集
这就是监督学习的一个例子,其这所以称之为监督学习是因为对于每个数据而言,我们预先给出了“正确答案”,即告诉我们根基我们所拥有的数据,我们可以得出房屋的实际价格。
对于这个例子更为具体而言。这是一个回归问题,回归一词指的是我们根据已有的数据预测出一个正确的输出值。更进一步来说,在监督学习中我们都会拥有一个数据集,这个数据集称为训练集。我们将在以后用如下标记描述回归问题:
- m:number of training examples
- x:"input" variable / features
- y:"output" variable / "target" variable
除此之外,我们使用(x, y)来表示一个训练样本(训练样本:训练过程中使用的数据称为“训练数据”,其中每一个样本称为“训练样本”。),其中我们用(x(i), y(i))来表示某一个训练样本。
我们通常使用h(hypothesis)表示一个函数。在预测房价的例子中,x对应房屋面积,y对应房屋价格,根据监督学习我们可以构建一个函数表达式h:
hθ(x) = θ0 + θ1x
我们将这种可以构建出线性函数的模型称为线性回归模型。在这里只有一个特征变量x,因此我们称这种单一变量模型为单变量线性回归。
补充笔记
Model Representation
To establish notation for future use, we'll use x(i) to denote the "input" variables (living area in this example), also called input features, and y(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x(i), y(i)) is called a training example, and the dataset that we'll be using to learn -- a list of m training examples (x(i), y(i)); i = 1, ... , m -- is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = R (Real Number ).
To describle the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h: X -> Y so that h(x) is a "good" predictor for the corresponding value of y. For histirical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on a small number of discrete of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
代价函数(Cost Function)
在这之前我们介绍了如下这个表达式:
hθ(x) = θ0 + θ1x
表达式中的θ0和 θ1这些参数,我们将其(θi)称为模型参数。我们选择不同的参数值,将得到不同的假设函数。因此,在我们预测房价的例子中,我们通过选择参数值尽可能正确地预测y的值。若我们想要取得θ0和 θ1的值来使得h(x)与y之间的差最小化,我们要做的就是尽量减少假设函数的输出值与房子真实价格之间的差的平方,即使得代价函数最小,用数学表达式可表示为:
补充笔记
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.
Cost Function - Intuition Ⅰ
If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.
Our objective is to get the best possible line. The best possible line will be such so that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass though all the points of our training data set. In such a case, the value of J(θ0, θ1) will be 0. The following example shows the ideal situation where we have a cost function of 0.
When θ1 = 1, we get a slope of 1 which goes which goes through every single data point in our model. Conversely, when θ1 = 0.5, we see the vertical distance from our fit to the data points increase.
This increase our cost function to 0.58. Plotting several other points yields to the following graph:
Thus as a goal, we should try to minimize the cost function. In this case, θ1 = 1 is our global minimum.
Cost Function - Intuition Ⅱ
A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.
Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ0, θ1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:
When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.
The graph above minimizes the cost function as much as possible and consequently, the result of θ1 and θ0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the inner most 'circle'.