Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chapter 2 (1)

This post looks at how to quickly inspect and understand the data in a machine learning project, covering the composition of the dataset, the handling of missing values, and the analysis of attribute types. Using the California housing price dataset, it walks through the summary statistics and histograms, which reveal the preprocessing the data will need. It then covers how to split the data into a training set and a test set, both by random sampling and by stratified sampling, and shows how Scikit-Learn's train_test_split function applies to the task.

A quick look at the data, and exploring how to split the dataset

Being well organized, the first thing you do is pull out your machine learning project checklist:

  1. Frame the problem (what the problem is, how to solve it, and which performance metric to use)
  2. Get the data
  3. Explore the data
  4. Prepare the data
  5. Compare the performance of different models
  6. Fine-tune your models and combine them into a great solution
  7. Present your solution
  8. Launch, monitor, and maintain your system

Let's start with California housing price prediction (the dataset can be downloaded from GitHub).

Take a quick look at what the dataset contains:

import pandas as pd

# Load the dataset and peek at the first five rows
housing = pd.read_csv('housing.csv')
housing.head()

housing.info()

Notice that total_bedrooms has only 20433 non-null values, 207 fewer than the 20640 rows; those entries are missing. Every attribute except ocean_proximity is numerical; that column is probably a categorical attribute:

housing['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
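
As an aside, a quick way to confirm the missing-value count mentioned above (a minimal sketch; isnull and sum are standard pandas calls):

housing.isnull().sum()  # total_bedrooms should report 207 missing entries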

Display the houses' geographic location information
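
The original figure is not reproduced here; a minimal sketch of such a plot (assuming the standard longitude and latitude columns and matplotlib):

import matplotlib.pyplot as plt

# Scatter plot of district locations; a low alpha makes high-density areas stand out
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
plt.show()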

Now look at the summary statistics of the dataset:

housing.describe()

Note that the null values of total_bedrooms are simply ignored (the count is 20433, not 20640).

Another quick way to get a feel for the data you are dealing with is to plot a histogram of each numerical attribute:

%matplotlib inline  # Jupyter magic so plots render inline in the notebook
import matplotlib.pyplot as plt

# hist() (not hists) draws one histogram per numerical attribute
housing.hist(bins=50, figsize=(20, 15))
plt.show()

  1. median_income does not look like it is expressed in US dollars. After checking with the team that collected the data, you learn that the values were scaled and capped: anything above 15 was capped at 15 and anything below 0.5 at 0.5. Working with preprocessed attributes is common in machine learning and not necessarily a problem, but you should try to understand how the data was computed.
  2. The plots show that median_house_value and housing_median_age were capped the same way. The cap on the former matters more, because it is your target attribute (the label): your machine learning algorithm will never predict prices above that limit. You need to check with your client team whether this is a problem. If they tell you they need predictions beyond the limit, you have two main options (see the sketch after this list):
    a. Collect proper values for the districts whose labels were capped
    b. Remove those districts from the training set
  3. The attributes have very different scales; we will discuss feature scaling later.
  4. Finally, many of the histograms are tail-heavy: they extend much farther to the right of the median than to the left. That can make it harder for some machine learning algorithms to detect patterns. We will try transforming these attributes to give them more bell-shaped distributions.
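
A minimal sketch of option b, removing the capped districts. It assumes the cap equals the column's maximum value, which is an assumption you should verify against the histogram:

# Drop districts whose median_house_value sits at the cap (assumed to be the max)
cap = housing['median_house_value'].max()
housing_uncapped = housing[housing['median_house_value'] < cap]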

Splitting the training set and the test set

The simplest way to split is purely at random:

import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle the row indices, then carve off the first test_ratio fraction
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)  # slice bounds must be integers
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

Call the function we just defined:

trn, tst = split_train_test(housing, 0.2)
print(len(trn), 'train +', len(tst), 'test')
16512 train + 4128 test

However, every time you run the program it generates a different training set and test set. One fix is to save the split on the first run and reload it afterwards.
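
Another common fix (a minimal sketch; np.random.seed is the standard NumPy call) is to fix the random seed before shuffling, so the same permutation comes out on every run:

import numpy as np

np.random.seed(42)  # makes np.random.permutation reproducible across runs
trn, tst = split_train_test(housing, 0.2)

Both fixes break down as soon as the dataset is updated, which motivates the identifier-based approach below.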

A more robust solution is to use each instance's identifier to decide whether it goes in the test set (assuming instances have a unique and immutable identifier).

import hashlib

def test_set_check(identifier, test_ratio, hash):
    # Keep an instance in the test set iff the last byte of its hash is
    # below 256 * test_ratio (about 20% of possible values for a 0.2 ratio)
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Unfortunately, the housing data has no identifier column, so here we use the row index as an ID:

housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

If you use the row index as a unique identifier, you must make sure that new data only ever gets appended to the end of the dataset and that no row is ever deleted. If that cannot be guaranteed, you can build the identifier from the most stable features instead. For example, a district's latitude and longitude are guaranteed to stay stable for a few hundred years, so you can combine them into an ID:

housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')
print(len(train_set), 'train +', len(test_set), 'test')
16267 train + 4373 test

Scikit-Learn provides several functions to split a dataset into subsets in various ways. The simplest is train_test_split, which does much the same thing as the split_train_test function above, plus a couple of extra features.

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), 'trn + ', len(test_set), 'tst')

16512 trn +  4128 tst
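
One of those extras is the random_state seed used above; another is that train_test_split can split several row-aligned datasets in one call with identical shuffled indices. A minimal sketch, where labels is a hypothetical DataFrame with one row per district:

# `labels` is hypothetical here; the housing data keeps its label in-place
train_X, test_X, train_y, test_y = train_test_split(
    housing, labels, test_size=0.2, random_state=42)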

So far we have assumed the sample size is very large. When the dataset is not that big, purely random sampling runs the risk of sampling bias, and you should stratify the sampling on a suitable attribute. Suppose, for instance, that experts tell you median_income is very important for predicting prices; we can stratify on that attribute:

# Bucket median_income into income categories, capping everything above 5 at 5
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True)

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]
print(len(strat_train_set), 'trn + ', len(strat_test_set), 'tst')
16512 trn +  4128 tst
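
To check that the stratification worked, compare the income_cat proportions in the stratified test set against the full dataset (a minimal sketch; the two distributions should be nearly identical):

housing['income_cat'].value_counts() / len(housing)
strat_test_set['income_cat'].value_counts() / len(strat_test_set)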

Finally, the income_cat column we added should be removed so the data is back in its original state:

for set_ in (strat_train_set, strat_test_set):  # avoid shadowing the built-in `set`
    set_.drop(['income_cat'], axis=1, inplace=True)