李沐测试15-California房价预测测试_california 房价 csv-CSDN博客

本文链接：https://blog.csdn.net/Rrrrrr900/article/details/137282934

声明：本笔记非常主观，笔者也在学习过程中

1.首先，先在官网处下载好California房价文件。网址：https://www.kaggle.com/competitions/california-house-prices/data

2.不建议通过python直接下载数据集，因为会出现在pandas.read_csv()的时候出现诸如“Error tokenizing data”等其他问题。

3.如坚持使用python下载，出现“Error tokenizing data.”但直接尝试设置delimiter=‘\t’，会出现数据丢失

数据的下载和处理

# 读取本地下载好的加州房价文件zip

import pandas as pd  
import zipfile  
import os  
import tempfile  
  
# 定义ZIP文件的路径  
zip_filepath = "C:\\Users\\user\\Downloads\\california-house-prices.zip"  
# 创建一个临时目录来解压文件  
with tempfile.TemporaryDirectory() as tempdir:  
    # 打开ZIP文件  
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:  
        # 解压train.csv和test.csv到临时目录  
        zip_ref.extract('train.csv', tempdir)  
        zip_ref.extract('test.csv', tempdir)  
          
        # 构建CSV文件的完整路径  
        train_csv_path = os.path.join(tempdir, 'train.csv')  
        test_csv_path = os.path.join(tempdir, 'test.csv')  
          
        # 使用pandas读取CSV文件  
        train_df = pd.read_csv(train_csv_path)  
        test_df = pd.read_csv(test_csv_path)  
          
        # 打印数据的前几行以验证数据是否已正确加载  
        print("Train Data:")  
        print(train_df.head())  
        print("\nTest Data:")  
        print(test_df.head())

Train Data:
   Id            Address  Sold Price  \
0   0        540 Pine Ln   3825000.0   
1   1     1727 W 67th St    505000.0   
2   2     28093 Pine Ave    140000.0   
3   3  10750 Braddock Dr   1775000.0   
4   4  7415 O Donovan Rd   1175000.0   

                                             Summary          Type  \
0  540 Pine Ln, Los Altos, CA 94022 is a single f...  SingleFamily   
1  HURRY, HURRY.......Great house 3 bed and 2 bat...  SingleFamily   
2  'THE PERFECT CABIN TO FLIP!  Strawberry deligh...  SingleFamily   
3  Rare 2-story Gated 5 bedroom Modern Mediterran...  SingleFamily   
4  Beautiful 200 acre ranch land with several pas...    VacantLand   

   Year built                                       Heating  \
0      1969.0  Heating - 2+ Zones, Central Forced Air - Gas   
1      1926.0                                   Combination   
2      1958.0                                    Forced air   
3      1947.0                                       Central   
4         NaN                                           NaN   

                                             Cooling  \
0    Multi-Zone, Central AC, Whole House / Attic Fan   
1  Wall/Window Unit(s), Evaporative Cooling, See ...   
2                                                NaN   
3                                        Central Air   
4                                                NaN   

                                         Parking     Lot  ...  \
0             Garage, Garage - Attached, Covered     1.0  ...   
1                       Detached Carport, Garage  4047.0  ...   
2                                       0 spaces  9147.0  ...   
3  Detached Carport, Driveway, Garage - Two Door     NaN  ...   
4                                       0 spaces     NaN  ...   

                                Parking features  Tax assessed value  \
0             Garage, Garage - Attached, Covered            886486.0   
1                       Detached Carport, Garage            505000.0   
2                                            NaN             49627.0   
3  Detached Carport, Driveway, Garage - Two Door           1775000.0   
4                                            NaN                 NaN   

   Annual tax amount   Listed On  Listed Price  Last Sold On Last Sold Price  \
0            12580.0  2019-10-24     4198000.0           NaN             NaN   
1             6253.0  2019-10-16      525000.0    2019-08-30        328000.0   
2              468.0  2019-08-25      180000.0           NaN             NaN   
3            20787.0  2019-10-24     1895000.0    2016-08-30       1500000.0   
4                NaN  2019-06-07     1595000.0    2016-06-27        900000.0   

          City    Zip  State  
0    Los Altos  94022     CA  
1  Los Angeles  90047     CA  
2   Strawberry  95375     CA  
3  Culver City  90230     CA  
4      Creston  93432     CA  

[5 rows x 41 columns]

Test Data:
      Id                    Address  \
0  47439        3034 N Coolidge Ave   
1  47440         565 Kenilworth Ave   
2  47441        3028 N Coolidge Ave   
3  47442  3022 N North Coolidge Ave   
4  47443           2515 Admiral Cir   

                                             Summary          Type  \
0  Live within steps to the scenic views on the L...  SingleFamily   
1                 duplex fixer. Input for comps only  SingleFamily   
2  Live within steps to the scenic views on the L...  SingleFamily   
3  Live within steps to the scenic views on the L...  SingleFamily   
4  This beautiful, spacious home built in 2017 is...  SingleFamily   

   Year built      Heating      Cooling                       Parking  \
0      2020.0      Central  Central Air              Tandem Uncovered   
1      1924.0  Natural Gas         None                      Detached   
2      2020.0      Central  Central Air              Tandem Uncovered   
3      2020.0      Central  Central Air              Tandem Uncovered   
4      2017.0   Forced Air  Central Air  Garage Door Opener, Attached   

       Lot Bedrooms  ...              Parking features  Tax assessed value  \
0    940.0        2  ...              Tandem Uncovered                 NaN   
1  10018.8        3  ...                      Detached            521977.0   
2    940.0        2  ...              Tandem Uncovered                 NaN   
3    940.0        2  ...              Tandem Uncovered            442800.0   
4   2613.6        4  ...  Garage Door Opener, Attached            965282.0   

   Annual tax amount   Listed On  Listed Price Last Sold On Last Sold Price  \
0                NaN  2020-11-06      799900.0   2020-07-01        819000.0   
1             7494.0  2014-04-04      479950.0   2020-11-03         15000.0   
2                NaN  2020-11-12      839900.0          NaN             NaN   
3             5370.0  2020-11-06      809900.0   2020-09-21        810000.0   
4            12912.0  2020-12-02     1095000.0   2019-12-27       1041000.0   

          City    Zip State  
0   Dodgertown  90090    CA  
1  San Leandro  94577    CA  
2  Los Angeles  90039    CA  
3   Dodgertown  90090    CA  
4      Hayward  94545    CA  

[5 rows x 40 columns]

访问数据：shape和删除id

train_data=train_df
test_data=test_df

print(train_data.shape)
print(test_data.shape)

(47439, 41)
(31626, 40)

print(train_data.iloc[0:4, [0, 1, 2, 3, 4, 5, 6, -3, -2, -1]])

   Id            Address  Sold Price  \
0   0        540 Pine Ln   3825000.0   
1   1     1727 W 67th St    505000.0   
2   2     28093 Pine Ave    140000.0   
3   3  10750 Braddock Dr   1775000.0   

                                             Summary          Type  \
0  540 Pine Ln, Los Altos, CA 94022 is a single f...  SingleFamily   
1  HURRY, HURRY.......Great house 3 bed and 2 bat...  SingleFamily   
2  'THE PERFECT CABIN TO FLIP!  Strawberry deligh...  SingleFamily   
3  Rare 2-story Gated 5 bedroom Modern Mediterran...  SingleFamily   

   Year built                                       Heating         City  \
0      1969.0  Heating - 2+ Zones, Central Forced Air - Gas    Los Altos   
1      1926.0                                   Combination  Los Angeles   
2      1958.0                                    Forced air   Strawberry   
3      1947.0                                       Central  Culver City   

     Zip State  
0  94022    CA  
1  90047    CA  
2  95375    CA  
3  90230    CA

提取 train和 test特征部分，并非删除id

pd.concat这行代码在Python的pandas库中用于合并两个数据集（train_data和test_data）的特征部分，但排除了id列（通常假设id列是每行的索引或唯一标识符，并且不作为模型的特征）。

让我们逐步解析这行代码：

train_data.iloc[:, 1:-1]:

iloc 是基于整数位置的索引，用于选择数据。

表示选择所有的行。
1:-1 表示选择从第二列（索引为1）开始到倒数第二列的所有列（排除最后一列）。这通常是因为第一列是id列，而最后一列是目标变量（例如房价），在特征工程中通常不作为特征。
test_data.iloc[:, 1:]:

同样地，iloc 用于选择数据。

表示选择所有的行。
1: 表示选择从第二列开始到最后一列的所有列。在测试集中，通常没有目标变量，所以只需要排除id列。
pd.concat(…):

pd.concat 是pandas中用于合并两个或多个DataFrame对象的函数。
在这里，它将train_data的特征部分和test_data的特征部分合并在一起。
结果是一个新的DataFrame，其中包含了train_data和test_data的所有特征列（除了id列和目标变量列）。

# 在加州房价的数据集中，房价需要在最后一列，但是原始文件房价在第2列（从0开始）,在excel中将其移到最后一列，否则和test数据集出现不一致
# 因此在all_features数据集中，需要排除第0列id，第-1列 Price和第-2列State（全都是CA）
# 因为后续的哑变量处理出现内存不足，因此在此处就需要对特征进行筛选。
all_features = pd.concat((train_data.iloc[:, train_data.columns!='Sold Price'], test_data.iloc[:, 1:]))

all_features.info()  # 特征总览

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79065 entries, 0 to 31625
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Id                           47439 non-null  float64
 1   Address                      79065 non-null  object 
 2   Summary                      78226 non-null  object 
 3   Type                         79065 non-null  object 
 4   Year built                   77123 non-null  float64
 5   Heating                      67552 non-null  object 
 6   Cooling                      63956 non-null  object 
 7   Parking                      77389 non-null  object 
 8   Lot                          56076 non-null  float64
 9   Bedrooms                     74467 non-null  object 
 10  Bathrooms                    73655 non-null  float64
 11  Full bathrooms               66137 non-null  float64
 12  Total interior livable area  75187 non-null  float64
 13  Total spaces                 77398 non-null  float64
 14  Garage spaces                77398 non-null  float64
 15  Region                       79063 non-null  object 
 16  Elementary School            70572 non-null  object 
 17  Elementary School Score      70330 non-null  float64
 18  Elementary School Distance   70572 non-null  float64
 19  Middle School                50788 non-null  object 
 20  Middle School Score          50786 non-null  float64
 21  Middle School Distance       50788 non-null  float64
 22  High School                  71891 non-null  object 
 23  High School Score            71281 non-null  float64
 24  High School Distance         71890 non-null  float64
 25  Flooring                     57138 non-null  object 
 26  Heating features             66517 non-null  object 
 27  Cooling features             62432 non-null  object 
 28  Appliances included          55716 non-null  object 
 29  Laundry features             59083 non-null  object 
 30  Parking features             72437 non-null  object 
 31  Tax assessed value           72742 non-null  float64
 32  Annual tax amount            71856 non-null  float64
 33  Listed On                    79065 non-null  object 
 34  Listed Price                 79065 non-null  float64
 35  Last Sold On                 49520 non-null  object 
 36  Last Sold Price              49520 non-null  float64
 37  City                         79065 non-null  object 
 38  Zip                          79065 non-null  int64  
 39  State                        79065 non-null  object 
dtypes: float64(18), int64(1), object(21)
memory usage: 24.7+ MB

print(train_data.columns)

Index(['Id', 'Address', 'Sold Price', 'Summary', 'Type', 'Year built',
       'Heating', 'Cooling', 'Parking', 'Lot', 'Bedrooms', 'Bathrooms',
       'Full bathrooms', 'Total interior livable area', 'Total spaces',
       'Garage spaces', 'Region', 'Elementary School',
       'Elementary School Score', 'Elementary School Distance',
       'Middle School', 'Middle School Score', 'Middle School Distance',
       'High School', 'High School Score', 'High School Distance', 'Flooring',
       'Heating features', 'Cooling features', 'Appliances included',
       'Laundry features', 'Parking features', 'Tax assessed value',
       'Annual tax amount', 'Listed On', 'Listed Price', 'Last Sold On',
       'Last Sold Price', 'City', 'Zip', 'State'],
      dtype='object')

自变量与因变量的相关性

#查看自变量与因变量的相关性
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,4))
abs(train_data.corr()['Sold Price']).sort_values(ascending=False).plot.bar()
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 [Text(0, 0.0, '0.0'),
  Text(0, 0.2, '0.2'),
  Text(0, 0.4, '0.4'),
  Text(0, 0.6000000000000001, '0.6'),
  Text(0, 0.8, '0.8'),
  Text(0, 1.0, '1.0'),
  Text(0, 1.2000000000000002, '1.2')])

在这里插入图片描述

选择相关性较大的特征继续进行训练，其余的被排除，此处我们选择>0.4(较为明显的分界线)：Annual tax amount，Tax assessed value，Last Sold
Price，Listed Price，Full bathrooms，Bathrooms。

数据清洗

通过画出上一节中分析出的相关性高的特征的值的分布，通过分析剔除异常的值。

#异常值处理
import seaborn as sns
figure=plt.figure()
sns.pairplot(x_vars=['Annual tax amount','Tax assessed value','Last Sold Price','Listed Price','Full bathrooms','Bathrooms'],
             y_vars=['Sold Price'],data=train_data,dropna=True)
plt.show()

<Figure size 640x480 with 0 Axes>

在这里插入图片描述

#删除异常值
train_data = train_data.drop(train_data[(train_data['Last Sold Price']>8*1e7) &
                                        (train_data['Sold Price']<4*1e7)].index)
 
train_data = train_data.drop(train_data[(train_data['Full bathrooms']>15) &
                                        (train_data['Sold Price']<5*1e7)].index)
 
train_data = train_data.drop(train_data[(train_data['Bathrooms']>20) &
                                        (train_data['Sold Price']<2*1e7)].index)
# 图像展示
figure=plt.figure()
sns.pairplot(x_vars=['Annual tax amount','Tax assessed value','Last Sold Price','Listed Price','Full bathrooms','Bathrooms'],
             y_vars=['Sold Price'],data=train_data,dropna=True)
plt.show()

<Figure size 640x480 with 0 Axes>

在这里插入图片描述

数据预处理：标准化数据集

1.所有缺失值替换成相应特征的平均值

2.将所有特征值放在共同的尺度上，即标准化数据集，均值0，方差1

数据集中有字符串应该如何处理？

numeric_features = all_features.dtypes[all_features.dtypes != ‘object’].index

all_features.dtypes: 获取all_features中每列的数据类型。

all_features.dtypes != ‘object’: 判断每列的数据类型是否不是’object’（即非字符串类型，通常可以认为是数值型）。

all_features.dtypes[all_features.dtypes != ‘object’].index: 获取所有非字符串类型（数值型）列的索引。

numeric_features: 存储这些数值型列的索引。

all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))

all_features[numeric_features]: 选择all_features中所有的数值型列。

lambda x: (x - x.mean()) / (x.std()): 这是一个匿名函数（lambda函数），用于计算z-score标准化。对于每一列x，它首先计算x的均值（x.mean()）和标准差（x.std()），然后用x中的每一个值减去均值并除以标准差，得到标准化的值。

all_features[numeric_features].apply(…): 使用apply方法将上述的lambda函数应用到all_features中所有的数值型列上，实现z-score标准化。

总结：这段代码首先找出all_features中所有的数值型特征，然后对这些特征进行z-score标准化处理。

# 若无法获得测试数据，则可根据训练数据计算均值和标准差
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# 在标准化数据之后，所有均值消失，因此我们可以将缺失值设置为0
all_features[numeric_features] = all_features[numeric_features].fillna(0)

独热编码处理

在进行one-hot之前先观察object类型的特征有哪些，并且他们共有多少种不同的类别。

for in_object in all_features.dtypes[all_features.dtypes=='object'].index:
    print(in_object.ljust(20),len(all_features[in_object].unique()))

Address              78875
Summary              77775
Type                 174
Heating              2660
Cooling              911
Parking              9913
Bedrooms             278
Region               1259
Elementary School    3568
Middle School        809
High School          922
Flooring             1740
Heating features     1763
Cooling features     596
Appliances included  11290
Laundry features     3031
Parking features     9695
Listed On            2815
Last Sold On         6949
City                 1122
State                2

jupyter notebook运行内存溢出解决方法？

24/4/1
需要对all_features的特征进行筛选，筛选的条件可参考如下：

1.只保留真正需要的列。
2.NaN太多的可删除，即信息较少的
3.删除重复或冗杂的信息的列
4.进行主成分分析PCA进行初筛

实际上，当内存不足无法直接执行pd.get_dummies()时候，可选择用SPSS软件进行哑变量操作，再把处理后数据搬回来（自行判断选择）

但在本题中，笔者选择直接扩展jupyter notebook的运行容量，参考http://t.csdnimg.cn/tHMb8 ，路径是"C:\anaconda3\envs\d2l\Lib\site-packages\tornado\iostream.py"，在记事本中打开，需以管理员身份才可以修改该文件。

24/4/2
仍需要筛选重要特征。

最终选取了State、Type、Bedrooms加入用于训练的特征，并进行了one-hot。太多类别会导致数据量激增，冗余。

# 选择特征列
features = list(numeric_features)
features.extend(['State','Type',"Bedrooms"])

all_features = all_features[features[1:]]
# 原本第一列是Id，去掉

print(all_features.shape)
#“Dummy_na=True”将“na”（缺失值）视为有效的特征值，并为其创建指示符特征
all_features = pd.get_dummies(all_features, dummy_na=True)
print(all_features.shape)

(79065, 21)
(79065, 474)

get_dummies的作用?

get_dummies是将拥有不同值的变量转换为0/1数值
比如，特征MSZoning里面有两个不同的离散值RL和RM
那么这一步将去掉MSZoning特征，并新加两个特征MSZoning_RL和MSZoning_RM,其值为0或者1.
如果一个样本在原来MSZoning里的值为RL，那么MSZoning_RL=1且MSZoning_RM=0.
————————————————
原文链接：https://blog.csdn.net/weixin_48632275/article/details/124359335

总结：

正确的数据预处理能减少占用电脑运行内存，避免冗余。这里的预处理做了异常值处理、特征归一化、空缺值填补和独热编码处理。

Value转换为张量用于训练

通过values属性，我们可以从pandas格式中提取NumPy格式，并将其转换为张量表示用于训练。

n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
    train_data['Sold Price'].values.reshape(-1, 1), dtype=torch.float32)  # 列名出现空格的修改

MLP模型预测

import numpy as np
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l

loss = nn.MSELoss()
in_features = train_features.shape[1]

def get_net():
    net = nn.Sequential(nn.Linear(in_features,1))
    return net

def log_rmse(net, features, labels):
    # 为了在取对数时进一步稳定该值，将小于1的值设置为1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # 这里使用的是Adam优化算法
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr = learning_rate,
                                 weight_decay = weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls

def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'折{i + 1}，训练log rmse{float(train_ls[-1]):f}, '
              f'验证log rmse{float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-折验证: 平均训练log rmse: {float(train_l):f}, '
      f'平均验证log rmse: {float(valid_l):f}')
## 训练函数借助了Adam优化器，它的优点是在一开始对学习率不那么敏感

折1，训练log rmse0.817052, 验证log rmse0.647753
折2，训练log rmse0.803511, 验证log rmse0.717971
折3，训练log rmse0.794341, 验证log rmse0.927424
折4，训练log rmse0.829598, 验证log rmse0.917323
折5，训练log rmse0.788533, 验证log rmse0.970247
5-折验证: 平均训练log rmse: 0.806607, 平均验证log rmse: 0.836143

在这里插入图片描述

提交submission文件

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
    print(f'训练log rmse：{float(train_ls[-1]):f}')
    # 将网络应用于测试集。
    preds = net(test_features).detach().numpy()
    # 将其重新格式化以导出到Kaggle
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)

train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)

训练log rmse：0.798724

在这里插入图片描述