Programming Exercise 1: Linear Regression
Python version: 3.6
Environment: Anaconda Jupyter Notebook
Link: ex1data1.txt, ex1data2.txt, and the programming assignment ex1.pdf (lab guide)
Extraction code: i7co
2 Linear regression with multiple variables
For the course notes on this chapter, see: 4. Multivariate Linear Regression, Gradient Descent, and the Normal Equation
2.1 Inspecting the data
The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
%matplotlib inline
# IPython built-in magic: renders plots inline so plt.show() can be omitted; it is not supported outside IPython/Jupyter
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid",color_codes=True)
df = pd.read_csv('ex1data2.txt', header=None, names=['size', 'bedrooms', 'price'])  # lowercase names, matching the output shown
df.head()  # view the first five rows
| | size | bedrooms | price |
|---|---|---|---|
| 0 | 2104 | 3 | 399900 |
| 1 | 1600 | 3 | 329900 |
| 2 | 2400 | 3 | 369000 |
| 3 | 1416 | 2 | 232000 |
| 4 | 3000 | 4 | 539900 |
df.describe()  # view summary statistics for each column
| | size | bedrooms | price |
|---|---|---|---|
| count | 47.000000 | 47.000000 | 47.000000 |
| mean | 2000.680851 | 3.170213 | 340412.659574 |
| std | 794.702354 | 0.760982 | 125039.899586 |
| min | 852.000000 | 1.000000 | 169900.000000 |
| 25% | 1432.000000 | 3.000000 | 249900.000000 |
| 50% | 1888.000000 | 3.000000 | 299900.000000 |
| 75% | 2269.000000 | 4.000000 | 384450.000000 |
| max | 4478.000000 | 5.000000 | 699900.000000 |
df.info()  # view column dtypes, non-null counts, and memory usage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 3 columns):
size 47 non-null int64
bedrooms 47 non-null int64
price 47 non-null int64
dtypes: int64(3)
memory usage: 1.2 KB
2.2 Feature Normalization
When a problem has multiple features, they should all be on a similar scale. Feature scaling achieves this and helps gradient descent converge faster.
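To make the convergence claim concrete, the sketch below runs batch gradient descent on the five sample rows shown by `df.head()`, once on the raw features and once on standardized ones (using the z-score scaling this section describes). The helper names (`cost`, `gradient_descent`, `with_intercept`), the learning rate 0.1, and the 20 iterations are illustrative assumptions, not values from the exercise.

```python
import numpy as np

# First five rows of ex1data2.txt, as shown by df.head(): size, bedrooms, price
data = np.array([
    [2104, 3, 399900],
    [1600, 3, 329900],
    [2400, 3, 369000],
    [1416, 2, 232000],
    [3000, 4, 539900],
], dtype=float)

def cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha / len(y) * (X.T @ (X @ theta - y))
    return theta

def with_intercept(features):
    """Prepend a column of ones for the intercept term."""
    return np.column_stack([np.ones(len(features)), features])

alpha, iters = 0.1, 20  # illustrative values, not from the exercise

# Raw features: size is in the thousands, bedrooms in single digits.
X_raw, y_raw = with_intercept(data[:, :2]), data[:, 2]
raw_start = cost(X_raw, y_raw, np.zeros(3))
raw_end = cost(X_raw, y_raw, gradient_descent(X_raw, y_raw, alpha, iters))

# Standardized columns (mean 0, sample standard deviation 1).
scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
X_std, y_std = with_intercept(scaled[:, :2]), scaled[:, 2]
std_start = cost(X_std, y_std, np.zeros(3))
std_end = cost(X_std, y_std, gradient_descent(X_std, y_std, alpha, iters))

print("raw features: cost went up?", raw_end > raw_start)
print("scaled features: cost went down?", std_end < std_start)
```

With the same learning rate, the updates on the raw features overshoot because the size column dominates the gradient, while the standardized features let the cost decrease steadily. A much smaller learning rate would also stabilize the raw run, at the price of very slow progress along the bedrooms direction.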
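The simplest approach is z-score standardization, also called mean normalization, which standardizes each feature using the mean and standard deviation of the original data. After the transformation every feature has mean 0 and standard deviation 1. Note that this only shifts and rescales the data; it does not by itself make the data normally distributed. The transformation is $x_n = \frac{x_n - \mu_n}{s_n}$, where $\mu_n$ is the mean and $s_n$ is the standard deviation of feature $n$.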
df = (df - df.mean()) / df.std()  # z-score every column; pandas std() uses the sample standard deviation (ddof=1)
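As a quick check of the formula, the sketch below standardizes the five sample rows from `df.head()` and confirms that each column ends up with mean 0 and standard deviation 1. Storing `mu` and `sigma` is an assumption about how one would later scale a new input with the training statistics, and the 1650 sq ft, 3-bedroom house is a hypothetical example. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`), whereas `numpy.std` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# First five rows of ex1data2.txt, as shown by df.head()
sample = pd.DataFrame({
    "size": [2104, 1600, 2400, 1416, 3000],
    "bedrooms": [3, 3, 3, 2, 4],
    "price": [399900, 329900, 369000, 232000, 539900],
})

# Keep the statistics so that new inputs can be scaled the same way later.
mu = sample.mean()
sigma = sample.std()  # pandas default: sample standard deviation (ddof=1)

normalized = (sample - mu) / sigma

# After z-score scaling every column has mean ~0 and standard deviation ~1.
print(normalized.mean().round(6))
print(normalized.std().round(6))

# Scale a hypothetical new house with the stored training statistics.
new_house = pd.Series({"size": 1650, "bedrooms": 3})
new_scaled = (new_house - mu[["size", "bedrooms"]]) / sigma[["size", "bedrooms"]]
print(new_scaled)
```

Reusing the training-set `mu` and `sigma` for new inputs keeps them on the same scale the model was fitted on; recomputing the statistics from the new data would shift the inputs relative to the learned parameters.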