[Machine Learning, Chapter 2] An End-to-End Introductory Linear Regression Project

Latest recommended article published on 2021-11-20 20:00:00

带带二师兄

Article tags: machine learning, linear regression, artificial intelligence

Article link: https://blog.csdn.net/qq_42523037/article/details/120917586

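For conference papers, start practicing with CCF-C venues. There are two suggested approaches:

One is scenario transfer: take a general method, move it to an application scenario, and give it a distinctive angle.

The other is domain transfer: borrowing ideas back and forth among CV, NLP, and DM is a routine that works, for example the all-purpose contrastive learning.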

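This project walks through the basic steps of machine learning.
0. Look at the big picture
1. Get the data
2. Explore and visualize the data to gain insights
3. Prepare the data for machine learning algorithms
4. Select and train a model
5. Fine-tune the model
6. Launch, monitor, and maintain the system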

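Working with real data

Common open data repositories:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php)
2. Kaggle datasets: https://www.kaggle.com/datasets
3. A handy data portal site: https://dataportals.org/
4. Data from outside China: https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex
5. Data from China: https://www.zhihu.com/question/20033475/answer/1811190627

This chapter uses the California housing prices dataset.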

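Framing the problem

This is a typical supervised learning task, because labeled training examples are given. It is also a typical regression task, because you are asked to predict a value. More specifically, it is a multiple regression problem, because the system uses multiple features to make a prediction. If the dataset is huge, batch learning has to be split across multiple servers, in which case it is worth learning MapReduce and Spark.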

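Selecting a performance measure

A typical performance measure for regression problems is the root mean square error (RMSE). It measures the standard deviation of the prediction errors (the square root of their variance). For example, an RMSE of 50,000 means that about 68% of the system's predictions fall within $50,000 of the actual value and about 95% fall within $100,000, provided the errors are roughly normally distributed around zero. For a normal distribution, about 68% of values fall within 1 $\sigma$, 95% within 2 $\sigma$, and 99.7% within 3 $\sigma$.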
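The 68-95-99.7 rule quoted above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `fraction_within` is made up for this illustration, and the result only applies when the errors really are normally distributed.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # For a normal distribution, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```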

$\sigma$ = RMSE(X,h) = $\sqrt{ \frac{1}{m} \sum_{i=1}^m (h(x^i)-y^i)^2}$

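m: the number of instances in the dataset.
$x^i$: the vector of all feature values of the i-th instance in the dataset; $y^i$ is its label.
For example, take an instance with longitude -118.29, latitude 33.91, population 1,416, median income $38,372, and median house value $156,400 (ignoring the other features for now):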

$x^i$ = $\left(\begin{matrix} -118.29 \\ 33.91 \\ 1416 \\ 38372 \\ \end{matrix}\right)$

$y^i$ = 156400

Matrix X = $\left(\begin{matrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^{2000})^T \end{matrix}\right)$ = $\left( \begin{matrix} -118.29 & 33.91 & 1416 & 38372 \\ \vdots & \vdots & \vdots & \vdots \\ \end{matrix}\right)$

h is the system's prediction function. The prediction $h(x^i)$ is written $\hat{y}^i$ and read "y-hat". $\textbf{X}$ denotes a matrix and $\textbf{x}$ a vector.

Mean absolute error: MAE( $\textbf{X},h)$ = $\frac{1}{m} \sum_{i=1}^m |h(x^i)-y^i|$
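Both measures are only a few lines of NumPy. The sketch below is illustrative: the function names `rmse` and `mae` and the four label/prediction pairs are made up for the example and do not come from the housing dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Hypothetical labels and predictions (in dollars), for illustration only
y_true = [156400.0, 200000.0, 120000.0, 310000.0]
y_pred = [150000.0, 210000.0, 125000.0, 300000.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))  # 7850.0
```

RMSE is never smaller than MAE on the same errors, which is a quick sanity check when implementing either one by hand.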

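RMSE (the root of a sum of squares) corresponds to the $l_2$ norm.
MAE (a sum of absolute values) corresponds to the $l_1$ norm.
The higher the norm index, the more it focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE. When outliers are very rare (as in a bell-shaped curve), RMSE performs very well and is generally preferred.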
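To see how a higher norm index concentrates on the largest value, here is a small sketch with a made-up error vector containing a single outlier. It uses `np.linalg.norm` with different `ord` values; the numbers are illustrative assumptions, not results from the housing data.

```python
import numpy as np

# Hypothetical prediction errors: four small ones and one outlier
errors = np.array([10.0, 10.0, 10.0, 10.0, 100.0])

l1 = np.linalg.norm(errors, ord=1)         # sum of absolute values
l2 = np.linalg.norm(errors, ord=2)         # square root of the sum of squares
l4 = np.linalg.norm(errors, ord=4)         # fourth root of the sum of fourth powers
l_inf = np.linalg.norm(errors, ord=np.inf)  # largest absolute value

for name, value in [("l1", l1), ("l2", l2), ("l4", l4), ("l_inf", l_inf)]:
    print(f"{name}: {value:.2f}")
```

As the index grows, the norm approaches the size of the outlier alone, so the small errors contribute less and less.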

Getting the data

Creating the workspace

Start Jupyter. A Jupyter server is now running in the terminal, listening on port 8888.

import os
import tarfile
import urllib.request  # standard library in Python 3, replaces six.moves.urllib

DOWNLOAD_ROOT = "http://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then extract housing.csv next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


fetch_housing_data()


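import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    # Read the extracted CSV file into a DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


housing = load_housing_data()
housing.head(5)      # first five rows
housing.info()       # column types and non-null counts
housing.describe()   # summary statistics of the numerical attributes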

%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

import numpy as np

def split_train_test(data,test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data)*test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices],data.iloc[test_indices]

import hashlib
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256*test_ratio

def split_train_test_by_id(data,test_ratio, id_column, hash = hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_:test_set_check(id_,test_ratio,hash))
    return data.loc[~in_test_set],data.loc[in_test_set]

housing_with_id = housing.reset_index()
train_set,test_set = split_train_test_by_id(housing_with_id,0.2,"index")
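The point of splitting by a hash of the identifier is that an instance never moves between the training set and the test set when the dataset grows. The sketch below illustrates this on made-up data. It swaps in `zlib.crc32` for the MD5 hash used above (an assumption made to keep the example short), and the names `is_in_test_set` and `split_by_id` are hypothetical.

```python
import zlib
import numpy as np
import pandas as pd

def is_in_test_set(identifier, test_ratio):
    # Hash the id to a 32-bit unsigned integer; ids whose hash falls in the
    # lowest test_ratio fraction of the range go to the test set.
    id_bytes = int(identifier).to_bytes(8, "little", signed=True)
    return zlib.crc32(id_bytes) < test_ratio * 2**32

def split_by_id(data, test_ratio, id_column):
    in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]

# Made-up dataset, then the same dataset after 200 new rows were appended
old = pd.DataFrame({"index": range(1000), "value": np.arange(1000) * 2.0})
new = pd.DataFrame({"index": range(1200), "value": np.arange(1200) * 2.0})

old_train, old_test = split_by_id(old, 0.2, "index")
new_train, new_test = split_by_id(new, 0.2, "index")

old_test_ids = set(old_test["index"])
new_test_ids = set(new_test["index"])

# Every instance that was in the old test set is still in the new test set,
# and no old training instance has leaked into the test set.
print(old_test_ids <= new_test_ids)
print({i for i in new_test_ids if i < 1000} == old_test_ids)
```

A purely random split such as `split_train_test` above does not have this property unless the random seed and the row order are both kept fixed.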

Creating a test set

Exploring and visualizing the data to gain insights

Data cleaning

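There are three ways to handle missing values:

1. Drop the instances (rows) that have missing values
2. Drop the whole attribute
3. Set the missing values to some value (here, the median)

housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)     # option 3
Each call returns a new object, so assign the result if you want to keep it. Remember to save the median: you will need it later to fill missing values in the test set and in new data.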
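The three options can be compared side by side on a tiny DataFrame. This is a sketch on made-up values; only the column names follow the housing dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing total_bedrooms value
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that have a missing value
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
dropped_col = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of the known values
median = toy["total_bedrooms"].median()
filled = toy.copy()
filled["total_bedrooms"] = filled["total_bedrooms"].fillna(median)

print(len(dropped_rows))             # 3
print(list(dropped_col.columns))     # ['total_rooms']
print(median)                        # 190.0
print(filled["total_bedrooms"].tolist())
```

None of the three calls modifies `toy` itself, which is why the result of option 3 is assigned back to a column.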

scikit-learn provides SimpleImputer to handle missing values.

from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(strategy = "median")

The median can only be computed on numerical attributes, so we need a copy of the data without the text attribute ocean_proximity.

housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)  # fit() fits the imputer instance to the training data

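The imputer simply computes the median of each attribute and stores the result in its statistics_ instance variable. To make sure no attribute is left with missing values, we apply the imputer to all numerical attributes.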

imputer.statistics_
housing_num.median().values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X,columns = housing_num.columns)
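The reason for fitting the imputer once and keeping it is that the medians learned from the training data are reused on other data. The sketch below shows this with two made-up DataFrames standing in for a training set and a test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test data with missing values
train = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})
test = pd.DataFrame({
    "total_rooms": [np.nan, 1627.0],
    "total_bedrooms": [np.nan, 280.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)             # medians are learned from the training data only
print(imputer.statistics_)     # medians: 1467.0 and 159.5

# The test set is filled with the training medians, not its own
test_tr = pd.DataFrame(imputer.transform(test),
                       columns=test.columns, index=test.index)
print(test_tr)
```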

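Handling text and categorical attributes

Earlier we left out the text attribute ocean_proximity. We can convert its text labels to numbers.
scikit-learn provides LabelEncoder to convert such an attribute to integers (it is designed for target labels; OrdinalEncoder is its counterpart for input features).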

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
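As an alternative for input features, here is a minimal sketch with `OrdinalEncoder`, which is a substitute for the `LabelEncoder` call above, not the method used in this article. The five category values are a made-up sample.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample of the ocean_proximity column (2D input is required)
housing_cat = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR OCEAN"]
})

encoder = OrdinalEncoder()
housing_cat_encoded = encoder.fit_transform(housing_cat)

# Categories are sorted, and each value is replaced by its position in that list
print(encoder.categories_[0])
print(housing_cat_encoded.ravel())  # [2. 1. 0. 1. 3.]
```

The integer codes follow the sorted order of the category names, so they carry no real meaning about distance to the ocean.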

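One-hot encoding solves the problem that integer codes imply an order and a distance between categories that do not actually exist.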

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
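In recent scikit-learn versions, `OneHotEncoder` also accepts string categories directly, so the integer-encoding step can be skipped. This is a sketch on a made-up sample of category values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of the ocean_proximity column
housing_cat = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]
})

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # SciPy sparse matrix

# Convert to a dense array only for display; each row has exactly one 1
dense = housing_cat_1hot.toarray()
print(encoder.categories_[0])
print(dense)
```

The sparse output stores only the positions of the ones, which saves memory when an attribute has many categories.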

LabelBinarizer converts text categories to one-hot vectors in a single step.

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot

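Custom transformers

Whitespace matters in Python: indentation defines which statements belong to a class or a function.
A custom transformer needs three methods: fit() (returning self), transform(), and fit_transform(). The last one comes for free when the class inherits from TransformerMixin.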

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of the attributes in the housing data
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

# These two lines belong at module level, not inside the class body
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
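What the two base classes contribute can be seen on a smaller example. The transformer below, `RatioAdder`, is a hypothetical class written only for this illustration: `TransformerMixin` supplies `fit_transform()`, and `BaseEstimator` supplies `get_params()` and `set_params()` as long as the constructor takes explicit keyword arguments.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends the ratio of two columns."""
    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]

# Made-up values: total rooms and households for two districts
X = np.array([[880.0, 126.0], [7099.0, 1138.0]])

adder = RatioAdder()
X_extra = adder.fit_transform(X)   # inherited from TransformerMixin
print(X_extra.shape)               # (2, 3)
print(adder.get_params())          # inherited from BaseEstimator
```

Exposing options such as `add_bedrooms_per_room` as constructor arguments is what later lets a hyperparameter search switch them on and off automatically.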

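Feature scaling

If the numerical input attributes have very different scales, machine learning algorithms tend to perform poorly, so the data should be scaled. There are two common scaling methods:

Min-max scaling
Standardization
Min-max scaling:
Values are shifted and rescaled to range from 0 to 1, by subtracting the minimum and dividing by the difference between the maximum and the minimum.
Standardization:
First subtract the mean, then divide by the standard deviation, so that the resulting distribution has zero mean and unit variance.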
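Both methods are available in scikit-learn as `MinMaxScaler` and `StandardScaler`. The sketch below applies them to a made-up column and checks the results against the formulas described above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single attribute with one large value
X = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

minmax = MinMaxScaler().fit_transform(X)
standard = StandardScaler().fit_transform(X)

# The same results computed by hand
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(minmax, manual_minmax))      # True
print(np.allclose(standard, manual_standard))  # True
print(minmax.ravel())
print(standard.ravel())
```

Min-max scaling squeezes the four small values close to 0 because of the single large value, while standardization is much less affected by it.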

Transformation pipelines
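One possible way to chain the preparation steps from the previous sections is sketched below. It is an assumption about how such a pipeline could look, not code from this article: it combines `Pipeline` and `ColumnTransformer` from scikit-learn and runs on a made-up DataFrame, with `sparse_threshold=0` set so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names follow the housing dataset
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "<1H OCEAN"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Apply the numerical pipeline and the one-hot encoder to their own columns
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
], sparse_threshold=0)

housing_prepared = full_pipeline.fit_transform(toy)
print(housing_prepared.shape)  # (4, 5): 2 scaled columns + 3 one-hot columns
```

Calling `fit_transform()` on the training set and only `transform()` on the test set keeps the medians, scaling statistics, and category list fixed to what was learned from the training data.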
