二、《Hands-On Machine Learning with Scikit-Learn and TensorFlow》一个完整的机器学习项目

本文以一个完整的机器学习项目为例,讲解如何使用Scikit-Learn和TensorFlow处理加州房价预测问题。涉及获取数据、数据分析、特征工程、模型选择与训练、性能评估等步骤,旨在模拟实际工作流程。
摘要由CSDN通过智能技术生成

 

本章中,你会假装作为被一家地产公司刚刚雇佣的数据科学家,完整地学习一个案例项目。
下面是主要步骤:
1. 项目概述。
2. 获取数据。
3. 发现并可视化数据,发现规律。
4. 为机器学习算法准备数据。
5. 选择模型,进行训练。
6. 微调模型。
7. 给出解决方案。
8. 部署、监控、维护系统。


2.1 使用真实数据集练手

本章选择了加州房价数据集,代码可以从https://github.com/ageron/handson-ml获取。

2.2 分析整体情况

我们的目的是使用加州的人口普查数据,建立模型,预测加州各区域的房价中位数。

训练数据的特征包括加州各区域的人口、收入中位数、房价中位数等。

2.2.1 问题构建(Frame the Problem,提出问题,给出框架,提出假设)

首先问清楚老板的商业目的,以及当前的解决方案(如果有的话)。比如,了解到了当前方案的错误率大概15%,我们就要奋斗目标了。

接下来就可以分析,这是一个有监督学习、无监督学习、还是增强学习?分类任务还是回归任务,或者别的什么?应该使用批量学习(batch learning)还是在线学习(online learning)?

如果数据量巨大,可以使用MapReduce技术,将数据分给多个服务器处理。也可是使用在线学习。

2.2.2 选择性能衡量指标(Select a Performance Measure) 

回归问题典型的衡量指标选择均方根误差(Root Mean Square Error,RMSE),它揭示了预测值的标准偏差(standard deviation)。例如,RMSE 等于 50000,意味着,68% 的系统预测值位于实际值的 50000 美元以内,95% 的预测值位于实际的 100000 美元以内(一个特征通常都符合高斯分布,即满足 “68-95-99.7”规则:大约68%的值落在 1σ  内,95% 的值落在 2σ  内,99.7%的值落在 3σ  内,这里的 σ  等于50000)。公式 2-1 展示了计算 RMSE 的方法。

 

有时候样本中存在很多离群点(outlier) ,我们可能就会使用绝对误差(Mean Absolute Error,MAE)。

 

RMSE和MAE都是向量距离的度量方式(预测值向量和目标值向量)。向量的距离,也可以为称为向量的模(norm),有

When most people hearMachine Learning,” they picture a robot: a dependable butler or a deadly Terminator depending on who you ask. But Machine Learning is not just a futuristic fantasy, it’s already here. In fact, it has been around for decades in some specialized applications, such as Optical Character Recognition (OCR). But the first ML application that really became mainstream, improving the lives of hundreds of millions of people, took over the world back in the 1990s: it was the spam filter. Not exactly a self-aware Skynet, but it does technically qualify as Machine Learning (it has actually learned so well that you seldom need to flag an email as spam anymore). It was followed by hundreds of ML applications that now quietly power hundreds of products and features that you use regularly, from better recommendations to voice search. Where does Machine Learning start and where does it end? What exactly does it mean for a machine to learn something? If I download a copy of Wikipedia, has my computer really “learned” something? Is it suddenly smarter? In this chapter we will start by clarifying what Machine Learning is and why you may want to use it. Then, before we set out to explore the Machine Learning continent, we will take a look at the map and learn about the main regions and the most notable landmarks: supervised versus unsupervised learning, online versus batch learning, instance-based versus model-based learning. Then we will look at the workflow of a typical ML project, discuss the main challenges you may face, and cover how to evaluate and fine-tune a Machine Learning system. This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. It will be a high-level overview (the only chapter without much code), all rather simple, but you should make sure everything is crystal-clear to you before continuing to the rest of the book. So grab a coffee and let’s get started!
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值