Study notes on the book Hands-On_Machine_Learning_with_Scikit-Learn-Keras-and-TensorFlow-2nd-Edition-Aurelien-Geron

Book Note:

Title: Hands-On_Machine_Learning_with_Scikit-Learn-Keras-and-TensorFlow-2nd-Edition-Aurelien-Geron


CHAPTER 1 The machine learning landscape

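01 Types of machine learning systems:
  1. Supervised learning: the training data includes labels; typical tasks are classification and regression (predicting a target value)

  2. Unsupervised learning: the training data is unlabeled; a typical task is clustering

  3. Semi-supervised learning

  4. Reinforcement learning: the system acts, receives feedback (rewards or penalties), and uses that feedback to learn the best strategy

  5. Batch learning (offline learning):

    1. A batch learning system cannot learn new data incrementally; to take new data into account, retrain the model on the full dataset (old and new data)
    2. If the dataset is very large, retraining takes so much computing power that it can become impractical
    3. Incremental learning can address this
  6. Online learning (incremental learning):

    1. The system is trained step by step by feeding it data instances one after another
    2. Online learning is also a good choice when computing resources are limited: once the system has learned from a new data instance, it can discard it (saving space)
    3. Learning rate (how fast the system adapts to changing data): a higher value means faster adaptation but also faster forgetting of old data; a lower value means more inertia, slower learning, and less sensitivity to outliers in new data
    4. If bad data is fed to the system, its performance will degrade. Monitor the system closely and, if needed, switch learning off (and revert to a working state)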
  7. Instance-Based Versus Model-Based Learning:

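    1. Instance-based learning: the system memorizes the training examples, then generalizes to new instances by comparing them with the learned examples using a similarity measure

    2. Model-based learning:

      1. Before the model can be used, its parameter values must be defined: $\theta_0$ and $\theta_1$

      2. To find out which parameter values make the model perform best, specify a performance measure: a utility function (or fitness function) that measures how good the model is, or a cost function that measures how bad it is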
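
The online learning and model-based learning notes above can be tied together in a small sketch: a linear model with two parameters, $\theta_0$ and $\theta_1$, trained one instance at a time. This is only an illustration under stated assumptions: the data stream is synthetic (generated from y = 2 + 3x), the squared error is used as the cost function, and the function names (`online_update`, `train_online`, `synthetic_stream`) are hypothetical, not taken from the book.

```python
import random


def predict(theta0, theta1, x):
    """Linear model with two parameters: y_hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def online_update(theta0, theta1, x, y, learning_rate):
    """One incremental step on a single instance.

    Moves the parameters a small step against the gradient of the squared
    error (y_hat - y)**2; the instance can be discarded afterwards.
    """
    error = predict(theta0, theta1, x) - y
    theta0 -= learning_rate * 2 * error
    theta1 -= learning_rate * 2 * error * x
    return theta0, theta1


def train_online(stream, learning_rate=0.05):
    """Feed instances one after another, updating the parameters each time."""
    theta0, theta1 = 0.0, 0.0
    for x, y in stream:
        theta0, theta1 = online_update(theta0, theta1, x, y, learning_rate)
    return theta0, theta1


def synthetic_stream(n, seed=42):
    """Made-up, noise-free data: y = 2 + 3x with x drawn from [0, 1)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.random()
        yield x, 2.0 + 3.0 * x


theta0, theta1 = train_online(synthetic_stream(5000), learning_rate=0.05)
print(theta0, theta1)  # both should end up close to the true values 2 and 3
```

With a larger `learning_rate` the parameters follow recent instances more closely (faster adaptation, faster forgetting); with a smaller one they move more slowly and a single outlier shifts them less.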

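02 Main challenges of machine learning:
  1. Insufficient quantity of training data

  2. The unreasonable effectiveness of data: "for complex problems, data matters more than algorithms"

  3. Nonrepresentative training data: sampling bias

  4. Poor-quality data

  5. Irrelevant features

  6. Overfitting the training data: the model is too complex relative to the amount and noisiness of the training data. Possible remedies:

    1. Choose a model with fewer parameters
    2. Reduce the number of attributes in the training data, or constrain the model
    3. Gather more training data
    4. Reduce the noise in the training data (for example, remove outliers)
  7. Underfitting the training data

  8. You cannot just train a model and hope it generalizes to new cases; it has to be evaluated and fine-tuned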

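03 Testing and validating:
  1. Evaluating a model: use a test set (the error on the test set estimates how well the model generalizes)

  2. Applying regularization to avoid overfitting: holdout validation (for hyperparameter tuning)

    A hyperparameter is a configuration option or parameter that is set manually before the model is trained

    1. Hold out part of the training set (the validation set)
    2. Train the candidate models on the reduced training set (the full training set minus the validation set)
    3. Select the model that performs best on the validation set, then retrain it on the full training set -> final model
    4. The validation set must not be too small, or the evaluations will be imprecise and the wrong model may be selected
    5. Solution: perform repeated cross-validation using many small validation sets; each model is evaluated once per validation set (drawback: long training time)
  3. Data mismatch: representative pictures are scarce

    1. Put half of the representative pictures in the validation set and half in the test set
    2. Preprocess the training data so that it looks more like the representative data
    3. If the model overfits: simplify or regularize it, get more data, and clean up the training data
  4. No Free Lunch theorem:

    1. If you make no assumptions at all about the data, there is no reason to prefer one model over any other

    2. In practice, make reasonable assumptions about the data and evaluate only a few reasonable models

      1. For simple tasks: evaluate linear models with various levels of regularization
      2. For complex tasks: evaluate various neural networks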
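
The holdout validation steps listed above can be sketched with NumPy alone. This is a minimal illustration, not the book's code: the data is synthetic (a noisy quadratic), the hyperparameter being tuned is assumed to be the degree of a polynomial model, and the 60/20/20 split is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: a noisy quadratic relationship between x and y.
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.5, size=200)

# Shuffle once, then split into training, validation, and test sets.
indices = rng.permutation(len(x))
train_idx, val_idx, test_idx = indices[:120], indices[120:160], indices[160:]


def mse(coeffs, idx):
    """Mean squared error of a fitted polynomial on the instances in idx."""
    return float(np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2))


# The hyperparameter (polynomial degree) is fixed before each training run.
candidate_degrees = [1, 2, 5]
val_errors = {}
for degree in candidate_degrees:
    # Train on the reduced training set, evaluate on the validation set.
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    val_errors[degree] = mse(coeffs, val_idx)

# Keep the candidate with the lowest validation error.
best_degree = min(val_errors, key=val_errors.get)

# Retrain the selected model on the full training set (training + validation),
# then estimate the generalization error once on the untouched test set.
full_train_idx = np.concatenate([train_idx, val_idx])
final_coeffs = np.polyfit(x[full_train_idx], y[full_train_idx], best_degree)
test_error = mse(final_coeffs, test_idx)
print(best_degree, test_error)
```

The test set is used only once, at the very end; using it to choose between degrees would make the reported error too optimistic.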

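CHAPTER 2 End-to-End Machine Learning Project

Goal: get the data -> visualize it -> prepare the data -> train a model -> fine-tune the model -> present the solution -> launch, monitor, and maintain

01 Working with real data

It is best to experiment with real-world data rather than artificial datasets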

  1. Popular open data repositories

    1. UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/
    2. Kaggle datasets: https://www.kaggle.com/datasets
    3. Amazon's AWS datasets: https://registry.opendata.aws/
  2. Meta portals (they list open data repositories)

    1. Data Portals: http://dataportals.org/
    2. OpenDataMonitor: http://opendatamonitor.eu/
    3. Quandl: http://quandl.com/
  3. Other pages that list many popular open data repositories

    1. Wikipedia’s list of ML datasets: https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
    2. Quora.com: https://www.quora.com/
    3. The datasets subreddit: https://www.reddit.com/r/datasets
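02 Look at the big picture

Goal: use California census data to build a model of housing prices (for each district the data includes metrics such as population, median income, and median housing price)

Model: learn from this data and predict the median housing price in any district, given all the other metrics

1. Frame the problem
  1. Understand the problem first; this determines which algorithms to choose

  2. Pipelines:

    1. Each component is fairly self-contained and runs asynchronously
    2. The interface between components is the data store
    3. If a component breaks down, the downstream components can keep running normally for a while by using its last output
    4. Without proper monitoring, a broken component can go unnoticed; the data gets stale and the performance of the whole system drops
  3. Frame the problem (the answer for this section's example follows each question):

    1. Supervised or unsupervised: supervised, because each instance comes with a label (the median housing price)
    2. Reinforcement learning, classification, or regression: regression, because the task is to predict a value
    3. Batch learning or online learning: batch learning, because there is no continuous flow of data coming into the system
    4. If the data were huge, the batch learning work could be split across multiple servers (using the MapReduce technique), or an online learning technique could be used instead.
2. Select a performance measure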
  1. A typical performance measure for regression problems is the root mean square error (RMSE): it gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors
    $$\mathrm{RMSE}(\mathbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h\left(\mathbf{x}^{(i)}\right)-y^{(i)}\right)^2}$$
    where $m$ is the number of instances in the dataset on which the RMSE is measured, $\mathbf{x}^{(i)}$ is the vector of all feature values of the $i$-th instance, and $y^{(i)}$ is the label of the $i$-th instance

  2. If the first district has longitude -118.29, latitude 33.91, 1,416 inhabitants, a median income of 38,372 USD, and a median house value of 156,400 USD, then:

    $$\mathbf{x}^{(1)}=\begin{pmatrix}-118.29\\33.91\\1{,}416\\38{,}372\end{pmatrix}\quad\text{and}\quad y^{(1)}=156{,}400$$

  3. $\mathbf{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset, one row per instance:
    $$\mathbf{X}=\begin{pmatrix}\left(\mathbf{x}^{(1)}\right)^\mathsf{T}\\\left(\mathbf{x}^{(2)}\right)^\mathsf{T}\\\vdots\\\left(\mathbf{x}^{(1999)}\right)^\mathsf{T}\\\left(\mathbf{x}^{(2000)}\right)^\mathsf{T}\end{pmatrix}=\begin{pmatrix}-118.29&33.91&1{,}416&38{,}372\\\vdots&\vdots&\vdots&\vdots\end{pmatrix}$$

  4. $h$ is the system's prediction function (the hypothesis)

    Given an instance's feature vector $\mathbf{x}^{(i)}$, it outputs the predicted value $\hat{y}^{(i)}=h\left(\mathbf{x}^{(i)}\right)$. $\mathrm{RMSE}(\mathbf{X},h)$ is the cost function measured on the set of examples using the hypothesis $h$
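
As a quick check of the RMSE formula above, here is a small sketch in plain Python. Only the first label (156,400) comes from the example in these notes; the other labels and all predictions are made-up numbers, and `rmse` is a hypothetical helper name.

```python
import math


def rmse(predictions, labels):
    """Root mean square error between predictions h(x^(i)) and labels y^(i)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    m = len(labels)
    squared_errors = [(y_hat - y) ** 2 for y_hat, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / m)


labels = [156_400.0, 200_000.0, 90_000.0]       # y^(1) from the notes; the rest are made up
predictions = [158_400.0, 190_000.0, 91_000.0]  # made-up outputs of a hypothesis h
print(rmse(predictions, labels))
```

The individual errors here are 2,000, 10,000, and 1,000. Because errors are squared before averaging, the single 10,000 error dominates the result, which is what "a higher weight given to large errors" means.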

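3. Check the assumptions: confirm with the team responsible for the downstream component whether it needs classification or regression; this must not be discovered only at the end
03 Get the data

Source of the data and code (Jupyter notebooks): https://github.com/ageron/handson-ml2

1. Create the workspace

Jupyter, NumPy, pandas, Matplotlib, Scikit-Learn

2. Download the data (the code is not explained in detail here; if anything is unclear, ask ChatGPT)
  1. Download the data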
import os
import tarfile
import urllib.request

# raw.githubusercontent.com is the domain GitHub uses to serve raw file contents. A request to this domain returns the file itself rather than the GitHub web page.
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


if __name__ == '__main__':
    fetch_housing_data(HOUSING_URL, HOUSING_PATH)
  2. Load the data into a pandas DataFrame
# %matplotlib inline # only in a Jupyter notebook
import pandas as pd
import os
import matplotlib.pyplot as plt

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def load_house_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


if __name__ == '__main__':
    housing = load_house_data()
    print(housing.info())  # info() prints its summary itself and returns None, so print() also shows "None"
    pd.set_option('display.max_columns', None)
    print(housing["ocean_proximity"].value_counts())
    print(housing.describe().to_string())

    housing.hist(bins=50, figsize=(9, 8))
    plt.show()
# Output of housing.info()
[20640 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64 # Note: only 20433 values are non-null, so 207 are NaN and need handling
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
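
The note on `total_bedrooms` says the 207 missing values need handling. Two common options are sketched below on a tiny made-up DataFrame (the column names match the `info()` output above, but the numbers are invented for illustration). Which option is appropriate for the real dataset is a judgment call that these notes do not settle.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the housing data.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Count the missing values in the column.
missing_count = int(sample["total_bedrooms"].isna().sum())

# Option 1: drop the rows where the value is missing.
dropped = sample.dropna(subset=["total_bedrooms"])

# Option 2: fill the missing values with the column median.
median = sample["total_bedrooms"].median()
filled = sample.assign(total_bedrooms=sample["total_bedrooms"].fillna(median))

print(missing_count, len(dropped), median)
```

If the median is used, it should be computed on the training set only and reused later to fill missing values in the test set and in new data.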
  3. Create a test set
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))

    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]

    return data.iloc[train_indices], data.iloc[test_indices]

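Drawback: this method generates a different random test set on every run, so over time the system gets to see the whole dataset

Solutions
1. Save the test set on the first run and load it on later runs
2. Set the random number generator's seed (`np.random.seed(42)`) so that the same shuffled indices are generated every time
Both of these approaches break the next time an updated dataset is fetched
3. Use each instance's identifier to decide whether it goes into the test set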
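
One way to implement solution 3 is to hash each identifier and put the instance in the test set when the hash falls in the lowest `test_ratio` share of the hash range. The sketch below uses CRC32 from the standard library and assumes each instance has a stable, unique identifier; the function names are hypothetical and the identifiers are just the integers 0 to 999.

```python
from zlib import crc32


def is_in_test_set(identifier, test_ratio):
    """True if the identifier's CRC32 hash falls in the lowest test_ratio share of the 32-bit range."""
    return crc32(str(identifier).encode("utf-8")) < test_ratio * 2**32


def split_by_id(identifiers, test_ratio):
    """Split identifiers into (train_ids, test_ids) based only on each identifier's hash."""
    train_ids, test_ids = [], []
    for identifier in identifiers:
        if is_in_test_set(identifier, test_ratio):
            test_ids.append(identifier)
        else:
            train_ids.append(identifier)
    return train_ids, test_ids


# Split an initial dataset, then an updated one that has 500 extra instances.
train_ids, test_ids = split_by_id(range(1000), 0.2)
train_ids_2, test_ids_2 = split_by_id(range(1500), 0.2)

# Instances from the first split stay in the same set after the update.
print(set(test_ids) <= set(test_ids_2), set(train_ids) <= set(train_ids_2))
```

Because the decision depends only on the identifier, the split stays stable across runs and across dataset updates, as long as identifiers never change and new data gets new identifiers.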
