[Lizard Book, Chapter 2] -- Downloading and Loading the Data

Jizhi_Zhang

Last modified: 2024-02-26 16:17:12

Category: Machine Learning. Tags: linux, ubuntu, python

First published: 2024-02-26 14:50:42

Article link: https://blog.csdn.net/m0_71291382/article/details/136240756

I. Function code for downloading and loading the data

# modules needed to fetch the data
import os
import tarfile
import urllib.request

# module needed to load the data
import pandas as pd


# where to find the data
ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
# local directory: the housing folder inside the datasets folder
PATH = os.path.join("datasets", "housing")
URL = ROOT + "datasets/housing/housing.tgz"

# define the functions

# download housing.tgz and extract it into housing_path
def fetch_data(housing_url=URL, housing_path=PATH):
    # create the directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    # local path of the housing.tgz file
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # urlretrieve is explained in detail after this code
    urllib.request.urlretrieve(housing_url, tgz_path)
    # the with block closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


# return a pandas DataFrame object containing all the data
def load_data(housing_path=PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

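II. Code notes

1. urllib.request.urlretrieve

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

Purpose: copy the network resource identified by the URL to a local file.

If the URL points to a local file, the object is not copied unless filename is supplied.

The function returns a tuple (filename, headers):

filename is the local file name under which the resource can be found.

headers is whatever the info() method of the object returned by urlopen() returned (for a remote object).

It raises the same exceptions as urlopen().

The second argument, if given, specifies the local path to copy the resource to. If it is omitted, the resource is saved to a temporary file with a generated name.

The third argument is a callable. If it is supplied, it is called once when the network connection is established and once more after each block is read. It receives three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the resource in bytes (this can be -1 when the size is not reported).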
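The callback behaviour can be tried out with a small sketch. The helper name make_progress_hook and the 20,000-byte test file are made up for illustration, and the demo copies a local file through a file:// URL so that it runs without network access; with a real download only the URL would change.

```python
import tempfile
import urllib.request
from pathlib import Path


def make_progress_hook(log):
    """Return a reporthook that appends (bytes_so_far, total_size) to log."""
    def hook(block_count, block_size, total_size):
        # block_count * block_size can overshoot on the last block, so cap it
        # when the total size is known (it is -1 when no size is reported).
        received = block_count * block_size
        if total_size >= 0:
            received = min(received, total_size)
        log.append((received, total_size))
    return hook


# Demo on a local file:// URL (an assumption made to avoid network access).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"x" * 20000)
    target = Path(tmp) / "copy.bin"

    progress = []
    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), reporthook=make_progress_hook(progress)
    )
    copied = target.read_bytes()
```

The first entry in progress is recorded before any data is read (zero bytes received), and the last one is recorded after the final block.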

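2. extractall

The extractall called in fetch_data is a method of the tarfile.TarFile object returned by tarfile.open(). It unpacks the members of the archive into a directory.

TarFile.extractall(path=".", members=None, *, numeric_owner=False, filter=None)

The parameters that matter here are:

path: the directory to extract into (the current working directory by default)
members: an optional subset of the archive's members; by default every member is extracted

It should not be confused with the pandas string method of the same name, Series.str.extractall(pat, flags=0). There, pat is a string holding a regular expression with capture groups and flags is an integer of re module flags. extract returns only the first match, whereas extractall returns all matches, and its return value is always a DataFrame.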
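Since fetch_data calls extractall on a tar archive, here is a minimal sketch of that call on a throwaway archive. The file contents are a made-up toy sample, and the filter="data" argument is only passed when the running Python version provides tarfile.data_filter, which is an assumption about the interpreter rather than something the original code does.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Build a tiny .tgz archive containing one toy CSV file.
    csv_file = tmp / "housing.csv"
    csv_file.write_text("longitude,latitude\n-122.23,37.88\n")
    archive = tmp / "housing.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(csv_file, arcname="housing.csv")

    # Extract every member of the archive into out_dir.
    out_dir = tmp / "extracted"
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(path=out_dir, filter="data")
        else:
            tar.extractall(path=out_dir)

    extracted_text = (out_dir / "housing.csv").read_text()
```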

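III. How to call the functions

Save the functions in a .py file (download_data.py here) and place it in the same folder as the .py file that needs to call them.

In the calling file, import the module by its file name without the .py extension; every function in it can then be called through the module name.

The code is as follows: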

import download_data

download_data.fetch_data()
housing = download_data.load_data()
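The same-folder import works because Python searches the directories listed in sys.path, and the folder of the script being run is placed first in that list. The sketch below imitates this with a temporary folder; toy_module and greet are hypothetical names used only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Write a tiny module file into the folder.
    Path(folder, "toy_module.py").write_text("def greet():\n    return 'hello'\n")

    # Put the folder on the module search path, as Python does for the
    # folder of the script being run.
    sys.path.insert(0, folder)
    try:
        importlib.invalidate_caches()
        toy_module = importlib.import_module("toy_module")
        greeting = toy_module.greet()
    finally:
        sys.path.remove(folder)
```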

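IV. Looking at the data structure

1. The head function:

Shows rows starting from the top; the number of rows to show goes in the parentheses (the default is 5).

housing.head(10)  # show the first ten rows

2. The info function:

Gives a quick description of the data: the total number of rows, and each attribute's data type and number of non-null values.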

housing.info()

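The dataset has 20,640 instances, but total_bedrooms has only 20,433 non-null values, which means that 207 districts are missing this value.

Every attribute is numerical (float64) except ocean_proximity, whose data type is object.

Looking at the first few rows shows that ocean_proximity holds text and that its values repeat, so it is probably a categorical attribute. value_counts counts how many rows fall into each category:

housing["ocean_proximity"].value_counts()

There are 5 categories: <1H OCEAN (less than one hour from the ocean), INLAND, NEAR OCEAN, NEAR BAY, and ISLAND.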
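Both checks can be reproduced on a small made-up DataFrame. The values below are toy data, not rows taken from housing.csv; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy data with one missing value and a repeated category.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, None, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "<1H OCEAN"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows in each category.
category_counts = toy["ocean_proximity"].value_counts()
```

isna().sum() gives the missing count directly, which saves subtracting the non-null count reported by info() from the total number of rows.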

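3. The describe function:

Summarizes the numerical attributes:

housing.describe()

Note: null values are ignored, so, for example, the count of total_bedrooms is 20,433 rather than 20,640.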
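A quick way to see that null values are skipped is to run describe on a toy column with one missing entry (the numbers are made up):

```python
import pandas as pd

# Three rows, one of which is missing.
toy = pd.DataFrame({"total_bedrooms": [100.0, None, 300.0]})

summary = toy.describe()
count = summary.loc["count", "total_bedrooms"]
mean = summary.loc["mean", "total_bedrooms"]
```

Here count is 2 and mean is 200, so the missing row contributes to neither statistic.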

4. Plotting histograms:

import matplotlib.pyplot as plt
housing.hist(bins = 5, figsize = (20,15))
plt.show()