《Hands-On Machine Learning》Study Notes: 2.3 Get the Data

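This post is part of my study notes on 《Hands-On Machine Learning》. It covers how to get the data: downloading it, loading it with Pandas, and taking a quick look at the data structure. It then explains why creating a test set matters for avoiding data snooping bias, and how hash-based splitting and stratified sampling produce a representative test set. Finally, it stresses the key role the test set split plays in a machine learning project.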

End-to-End Machine Learning Project

Get the Data

Download the Data

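You could download the data file in a browser and extract the CSV file by hand, but it is better to write a function that does it. This matters most when the data changes over time: with a function you can fetch the latest data whenever and wherever you need it.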

import os
import tarfile
import urllib.request


DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"
HOUSING_LOCAL_PATH = r"E:\Hands-On ML data"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_LOCAL_PATH):
    # Create the local directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)

    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the tgz file from the URL
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Open the archive, extract it, and close it automatically on exit
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

Calling fetch_housing_data() downloads housing.tgz from the network and extracts housing.csv from it.
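As a variation on the download step, here is a minimal sketch that skips the download when the archive is already on disk. The helper name fetch_and_extract and its force parameter are my own assumptions, not from the book. To keep the demo runnable offline, it builds a tiny stand-in archive in a temporary directory and fetches it through a file:// URL instead of the real GitHub address.

```python
import os
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest_dir, archive_name="housing.tgz", force=False):
    """Download a .tgz into dest_dir (unless already cached) and extract it.

    Hypothetical helper for illustration; returns True if a download happened.
    """
    os.makedirs(dest_dir, exist_ok=True)
    tgz_path = os.path.join(dest_dir, archive_name)
    downloaded = force or not os.path.exists(tgz_path)
    if downloaded:
        urllib.request.urlretrieve(url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as tgz:
        tgz.extractall(path=dest_dir)
    return downloaded


# Offline demo: a stand-in archive with made-up content, served via file://
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "src")
    os.makedirs(src_dir)
    csv_path = os.path.join(src_dir, "housing.csv")
    with open(csv_path, "w") as f:
        f.write("longitude,latitude\n-122.23,37.88\n")
    archive = os.path.join(src_dir, "housing.tgz")
    with tarfile.open(archive, "w:gz") as tgz:
        tgz.add(csv_path, arcname="housing.csv")

    url = Path(archive).as_uri()
    dest = os.path.join(tmp, "dest")
    first = fetch_and_extract(url, dest)   # archive missing, so it is fetched
    second = fetch_and_extract(url, dest)  # archive cached, download skipped
    extracted = os.path.exists(os.path.join(dest, "housing.csv"))
```

Caching like this saves bandwidth, but it works against the goal of always getting the latest data, so pass force=True when the source may have changed.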
Load the Data with Pandas

import pandas as pd

def load_housing_data(housing_path = HOUSING_LOCAL_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

This function calls pandas's read_csv to load housing.csv and returns the result as a DataFrame.
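Once the data is in a DataFrame, a few standard pandas calls give a quick look at its structure. The sketch below runs on a tiny in-memory CSV so that it works without the download. The rows are illustrative only, and the column names are assumptions about what housing.csv contains.

```python
import io

import pandas as pd

# Illustrative rows only; the column names are assumed, not read from the real file
sample_csv = io.StringIO(
    "longitude,latitude,median_house_value,ocean_proximity\n"
    "-122.23,37.88,452600.0,NEAR BAY\n"
    "-122.22,37.86,358500.0,NEAR BAY\n"
    "-118.30,34.05,,<1H OCEAN\n"
)
housing = pd.read_csv(sample_csv)

# First rows of the table
print(housing.head())
# Column dtypes and non-null counts (prints directly to stdout)
housing.info()

# Number of rows and columns
n_rows, n_cols = housing.shape
# Count of missing values in one numeric column
missing_values = int(housing["median_house_value"].isna().sum())
# How often each category appears in a text column
proximity_counts = housing["ocean_proximity"].value_counts()
print(proximity_counts)
```

With the real dataset, the same calls would be applied to the DataFrame returned by load_housing_data().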
