声明:本笔记非常主观,笔者也在学习过程中
1.首先,先在官网处下载好California房价文件。网址:https://www.kaggle.com/competitions/california-house-prices/data
2.不建议通过python直接下载数据集,因为会出现在pandas.read_csv()的时候出现诸如“Error tokenizing data”等其他问题。
3.如坚持使用python下载,出现“Error tokenizing data.”但直接尝试设置delimiter=‘\t’,会出现数据丢失
数据的下载和处理
# 读取本地下载好的加州房价文件zip
import pandas as pd
import zipfile
import os
import tempfile
# 定义ZIP文件的路径
zip_filepath = "C:\\Users\\user\\Downloads\\california-house-prices.zip"
# 创建一个临时目录来解压文件
with tempfile.TemporaryDirectory() as tempdir:
# 打开ZIP文件
with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
# 解压train.csv和test.csv到临时目录
zip_ref.extract('train.csv', tempdir)
zip_ref.extract('test.csv', tempdir)
# 构建CSV文件的完整路径
train_csv_path = os.path.join(tempdir, 'train.csv')
test_csv_path = os.path.join(tempdir, 'test.csv')
# 使用pandas读取CSV文件
train_df = pd.read_csv(train_csv_path)
test_df = pd.read_csv(test_csv_path)
# 打印数据的前几行以验证数据是否已正确加载
print("Train Data:")
print(train_df.head())
print("\nTest Data:")
print(test_df.head())
Train Data:
Id Address Sold Price \
0 0 540 Pine Ln 3825000.0
1 1 1727 W 67th St 505000.0
2 2 28093 Pine Ave 140000.0
3 3 10750 Braddock Dr 1775000.0
4 4 7415 O Donovan Rd 1175000.0
Summary Type \
0 540 Pine Ln, Los Altos, CA 94022 is a single f... SingleFamily
1 HURRY, HURRY.......Great house 3 bed and 2 bat... SingleFamily
2 'THE PERFECT CABIN TO FLIP! Strawberry deligh... SingleFamily
3 Rare 2-story Gated 5 bedroom Modern Mediterran... SingleFamily
4 Beautiful 200 acre ranch land with several pas... VacantLand
Year built Heating \
0 1969.0 Heating - 2+ Zones, Central Forced Air - Gas
1 1926.0 Combination
2 1958.0 Forced air
3 1947.0 Central
4 NaN NaN
Cooling \
0 Multi-Zone, Central AC, Whole House / Attic Fan
1 Wall/Window Unit(s), Evaporative Cooling, See ...
2 NaN
3 Central Air
4 NaN
Parking Lot ... \
0 Garage, Garage - Attached, Covered 1.0 ...
1 Detached Carport, Garage 4047.0 ...
2 0 spaces 9147.0 ...
3 Detached Carport, Driveway, Garage - Two Door NaN ...
4 0 spaces NaN ...
Parking features Tax assessed value \
0 Garage, Garage - Attached, Covered 886486.0
1 Detached Carport, Garage 505000.0
2 NaN 49627.0
3 Detached Carport, Driveway, Garage - Two Door 1775000.0
4 NaN NaN
Annual tax amount Listed On Listed Price Last Sold On Last Sold Price \
0 12580.0 2019-10-24 4198000.0 NaN NaN
1 6253.0 2019-10-16 525000.0 2019-08-30 328000.0
2 468.0 2019-08-25 180000.0 NaN NaN
3 20787.0 2019-10-24 1895000.0 2016-08-30 1500000.0
4 NaN 2019-06-07 1595000.0 2016-06-27 900000.0
City Zip State
0 Los Altos 94022 CA
1 Los Angeles 90047 CA
2 Strawberry 95375 CA
3 Culver City 90230 CA
4 Creston 93432 CA
[5 rows x 41 columns]
Test Data:
Id Address \
0 47439 3034 N Coolidge Ave
1 47440 565 Kenilworth Ave
2 47441 3028 N Coolidge Ave
3 47442 3022 N North Coolidge Ave
4 47443 2515 Admiral Cir
Summary Type \
0 Live within steps to the scenic views on the L... SingleFamily
1 duplex fixer. Input for comps only SingleFamily
2 Live within steps to the scenic views on the L... SingleFamily
3 Live within steps to the scenic views on the L... SingleFamily
4 This beautiful, spacious home built in 2017 is... SingleFamily
Year built Heating Cooling Parking \
0 2020.0 Central Central Air Tandem Uncovered
1 1924.0 Natural Gas None Detached
2 2020.0 Central Central Air Tandem Uncovered
3 2020.0 Central Central Air Tandem Uncovered
4 2017.0 Forced Air Central Air Garage Door Opener, Attached
Lot Bedrooms ... Parking features Tax assessed value \
0 940.0 2 ... Tandem Uncovered NaN
1 10018.8 3 ... Detached 521977.0
2 940.0 2 ... Tandem Uncovered NaN
3 940.0 2 ... Tandem Uncovered 442800.0
4 2613.6 4 ... Garage Door Opener, Attached 965282.0
Annual tax amount Listed On Listed Price Last Sold On Last Sold Price \
0 NaN 2020-11-06 799900.0 2020-07-01 819000.0
1 7494.0 2014-04-04 479950.0 2020-11-03 15000.0
2 NaN 2020-11-12 839900.0 NaN NaN
3 5370.0 2020-11-06 809900.0 2020-09-21 810000.0
4 12912.0 2020-12-02 1095000.0 2019-12-27 1041000.0
City Zip State
0 Dodgertown 90090 CA
1 San Leandro 94577 CA
2 Los Angeles 90039 CA
3 Dodgertown 90090 CA
4 Hayward 94545 CA
[5 rows x 40 columns]
访问数据:shape和删除id
train_data=train_df
test_data=test_df
print(train_data.shape)
print(test_data.shape)
(47439, 41)
(31626, 40)
print(train_data.iloc[0:4, [0, 1, 2, 3, 4, 5, 6, -3, -2, -1]])
Id Address Sold Price \
0 0 540 Pine Ln 3825000.0
1 1 1727 W 67th St 505000.0
2 2 28093 Pine Ave 140000.0
3 3 10750 Braddock Dr 1775000.0
Summary Type \
0 540 Pine Ln, Los Altos, CA 94022 is a single f... SingleFamily
1 HURRY, HURRY.......Great house 3 bed and 2 bat... SingleFamily
2 'THE PERFECT CABIN TO FLIP! Strawberry deligh... SingleFamily
3 Rare 2-story Gated 5 bedroom Modern Mediterran... SingleFamily
Year built Heating City \
0 1969.0 Heating - 2+ Zones, Central Forced Air - Gas Los Altos
1 1926.0 Combination Los Angeles
2 1958.0 Forced air Strawberry
3 1947.0 Central Culver City
Zip State
0 94022 CA
1 90047 CA
2 95375 CA
3 90230 CA
提取 train和 test特征部分,并非删除id
pd.concat这行代码在Python的pandas库中用于合并两个数据集(train_data和test_data)的特征部分,但排除了id列(通常假设id列是每行的索引或唯一标识符,并且不作为模型的特征)。
让我们逐步解析这行代码:
train_data.iloc[:, 1:-1]:
-
iloc 是基于整数位置的索引,用于选择数据。
-
表示选择所有的行。
1:-1 表示选择从第二列(索引为1)开始到倒数第二列的所有列(排除最后一列)。这通常是因为第一列是id列,而最后一列是目标变量(例如房价),在特征工程中通常不作为特征。
test_data.iloc[:, 1:]:
同样地,iloc 用于选择数据。
-
表示选择所有的行。
1: 表示选择从第二列开始到最后一列的所有列。在测试集中,通常没有目标变量,所以只需要排除id列。
pd.concat(…):
pd.concat 是pandas中用于合并两个或多个DataFrame对象的函数。
在这里,它将train_data的特征部分和test_data的特征部分合并在一起。
结果是一个新的DataFrame,其中包含了train_data和test_data的所有特征列(除了id列和目标变量列)。
# 在加州房价的数据集中,房价需要在最后一列,但是原始文件房价在第2列(从0开始),在excel中将其移到最后一列,否则和test数据集出现不一致
# 因此在all_features数据集中,需要排除第0列id,第-1列 Price和第-2列State(全都是CA)
# 因为后续的哑变量处理出现内存不足,因此在此处就需要对特征进行筛选。
all_features = pd.concat((train_data.iloc[:, train_data.columns!='Sold Price'], test_data.iloc[:, 1:]))
all_features.info() # 特征总览
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79065 entries, 0 to 31625
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 47439 non-null float64
1 Address 79065 non-null object
2 Summary 78226 non-null object
3 Type 79065 non-null object
4 Year built 77123 non-null float64
5 Heating 67552 non-null object
6 Cooling 63956 non-null object
7 Parking 77389 non-null object
8 Lot 56076 non-null float64
9 Bedrooms 74467 non-null object
10 Bathrooms 73655 non-null float64
11 Full bathrooms 66137 non-null float64
12 Total interior livable area 75187 non-null float64
13 Total spaces 77398 non-null float64
14 Garage spaces 77398 non-null float64
15 Region 79063 non-null object
16 Elementary School 70572 non-null object
17 Elementary School Score 70330 non-null float64
18 Elementary School Distance 70572 non-null float64
19 Middle School 50788 non-null object
20 Middle School Score 50786 non-null float64
21 Middle School Distance 50788 non-null float64
22 High School 71891 non-null object
23 High School Score 71281 non-null float64
24 High School Distance 71890 non-null float64
25 Flooring 57138 non-null object
26 Heating features 66517 non-null object
27 Cooling features 62432 non-null object
28 Appliances included 55716 non-null object
29 Laundry features 59083 non-null object
30 Parking features 72437 non-null object
31 Tax assessed value 72742 non-null float64
32 Annual tax amount 71856 non-null float64
33 Listed On 79065 non-null object
34 Listed Price 79065 non-null float64
35 Last Sold On 49520 non-null object
36 Last Sold Price 49520 non-null float64
37 City 79065 non-null object
38 Zip 79065 non-null int64
39 State 79065 non-null object
dtypes: float64(18), int64(1), object(21)
memory usage: 24.7+ MB
print(train_data.columns)
Index(['Id', 'Address', 'Sold Price', 'Summary', 'Type', 'Year built',
'Heating', 'Cooling', 'Parking', 'Lot', 'Bedrooms', 'Bathrooms',
'Full bathrooms', 'Total interior livable area', 'Total spaces',
'Garage spaces', 'Region', 'Elementary School',
'Elementary School Score', 'Elementary School Distance',
'Middle School', 'Middle School Score', 'Middle School Distance',
'High School', 'High School Score', 'High School Distance', 'Flooring',
'Heating features', 'Cooling features', 'Appliances included',
'Laundry features', 'Parking features', 'Tax assessed value',
'Annual tax amount', 'Listed On', 'Listed Price', 'Last Sold On',
'Last Sold Price', 'City', 'Zip', 'State'],
dtype='object')
自变量与因变量的相关性
#查看自变量与因变量的相关性
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
abs(train_data.corr()['Sold Price']).sort_values(ascending=False).plot.bar()
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
[Text(0, 0.0, '0.0'),
Text(0, 0.2, '0.2'),
Text(0, 0.4, '0.4'),
Text(0, 0.6000000000000001, '0.6'),
Text(0, 0.8, '0.8'),
Text(0, 1.0, '1.0'),
Text(0, 1.2000000000000002, '1.2')])
选择相关性较大的特征继续进行训练,其余的被排除,此处我们选择>0.4(较为明显的分界线):Annual tax amount,Tax assessed value,Last Sold
Price,Listed Price,Full bathrooms,Bathrooms。
数据清洗
通过画出上一节中分析出的相关性高的特征的值的分布,通过分析剔除异常的值。
#异常值处理
import seaborn as sns
figure=plt.figure()
sns.pairplot(x_vars=['Annual tax amount','Tax assessed value','Last Sold Price','Listed Price','Full bathrooms','Bathrooms'],
y_vars=['Sold Price'],data=train_data,dropna=True)
plt.show()
<Figure size 640x480 with 0 Axes>
#删除异常值
train_data = train_data.drop(train_data[(train_data['Last Sold Price']>8*1e7) &
(train_data['Sold Price']<4*1e7)].index)
train_data = train_data.drop(train_data[(train_data['Full bathrooms']>15) &
(train_data['Sold Price']<5*1e7)].index)
train_data = train_data.drop(train_data[(train_data['Bathrooms']>20) &
(train_data['Sold Price']<2*1e7)].index)
# 图像展示
figure=plt.figure()
sns.pairplot(x_vars=['Annual tax amount','Tax assessed value','Last Sold Price','Listed Price','Full bathrooms','Bathrooms'],
y_vars=['Sold Price'],data=train_data,dropna=True)
plt.show()
<Figure size 640x480 with 0 Axes>
数据预处理:标准化数据集
1.所有缺失值替换成相应特征的平均值
2.将所有特征值放在共同的尺度上,即标准化数据集,均值0,方差1
数据集中有字符串应该如何处理?
numeric_features = all_features.dtypes[all_features.dtypes != ‘object’].index
all_features.dtypes: 获取all_features中每列的数据类型。
all_features.dtypes != ‘object’: 判断每列的数据类型是否不是’object’(即非字符串类型,通常可以认为是数值型)。
all_features.dtypes[all_features.dtypes != ‘object’].index: 获取所有非字符串类型(数值型)列的索引。
numeric_features: 存储这些数值型列的索引。
all_features[numeric_features] = all_features[numeric_features].apply(lambda x: (x - x.mean()) / (x.std()))
all_features[numeric_features]: 选择all_features中所有的数值型列。
lambda x: (x - x.mean()) / (x.std()): 这是一个匿名函数(lambda函数),用于计算z-score标准化。对于每一列x,它首先计算x的均值(x.mean())和标准差(x.std()),然后用x中的每一个值减去均值并除以标准差,得到标准化的值。
all_features[numeric_features].apply(…): 使用apply方法将上述的lambda函数应用到all_features中所有的数值型列上,实现z-score标准化。
总结:这段代码首先找出all_features中所有的数值型特征,然后对这些特征进行z-score标准化处理。
# 若无法获得测试数据,则可根据训练数据计算均值和标准差
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
lambda x: (x - x.mean()) / (x.std()))
# 在标准化数据之后,所有均值消失,因此我们可以将缺失值设置为0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
独热编码处理
在进行one-hot之前先观察object类型的特征有哪些,并且他们共有多少种不同的类别。
for in_object in all_features.dtypes[all_features.dtypes=='object'].index:
print(in_object.ljust(20),len(all_features[in_object].unique()))
Address 78875
Summary 77775
Type 174
Heating 2660
Cooling 911
Parking 9913
Bedrooms 278
Region 1259
Elementary School 3568
Middle School 809
High School 922
Flooring 1740
Heating features 1763
Cooling features 596
Appliances included 11290
Laundry features 3031
Parking features 9695
Listed On 2815
Last Sold On 6949
City 1122
State 2
jupyter notebook运行内存溢出解决方法?
24/4/1
需要对all_features的特征进行筛选,筛选的条件可参考如下:
1.只保留真正需要的列。
2.NaN太多的可删除,即信息较少的
3.删除重复或冗杂的信息的列
4.进行主成分分析PCA进行初筛
实际上,当内存不足无法直接执行pd.get_dummies()时候,可选择用SPSS软件进行哑变量操作,再把处理后数据搬回来(自行判断选择)
但在本题中,笔者选择直接扩展jupyter notebook的运行容量,参考http://t.csdnimg.cn/tHMb8 ,路径是"C:\anaconda3\envs\d2l\Lib\site-packages\tornado\iostream.py",在记事本中打开,需以管理员身份才可以修改该文件。
24/4/2
仍需要筛选重要特征。
最终选取了State、Type、Bedrooms加入用于训练的特征,并进行了one-hot。太多类别会导致数据量激增,冗余。
# 选择特征列
features = list(numeric_features)
features.extend(['State','Type',"Bedrooms"])
all_features = all_features[features[1:]]
# 原本第一列是Id,去掉
print(all_features.shape)
#“Dummy_na=True”将“na”(缺失值)视为有效的特征值,并为其创建指示符特征
all_features = pd.get_dummies(all_features, dummy_na=True)
print(all_features.shape)
(79065, 21)
(79065, 474)
get_dummies的作用?
get_dummies是将拥有不同值的变量转换为0/1数值
比如,特征MSZoning里面有两个不同的离散值RL和RM
那么这一步将去掉MSZoning特征,并新加两个特征MSZoning_RL和MSZoning_RM,其值为0或者1.
如果一个样本在原来MSZoning里的值为RL,那么MSZoning_RL=1且MSZoning_RM=0.
————————————————
原文链接:https://blog.csdn.net/weixin_48632275/article/details/124359335
总结:
正确的数据预处理能减少占用电脑运行内存,避免冗余。这里的预处理做了异常值处理、特征归一化、空缺值填补和独热编码处理。
Value转换为张量用于训练
通过values属性,我们可以 从pandas格式中提取NumPy格式,并将其转换为张量表示用于训练。
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
train_data['Sold Price'].values.reshape(-1, 1), dtype=torch.float32) # 列名出现空格的修改
MLP模型预测
import numpy as np
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l
loss = nn.MSELoss()
in_features = train_features.shape[1]
def get_net():
net = nn.Sequential(nn.Linear(in_features,1))
return net
def log_rmse(net, features, labels):
# 为了在取对数时进一步稳定该值,将小于1的值设置为1
clipped_preds = torch.clamp(net(features), 1, float('inf'))
rmse = torch.sqrt(loss(torch.log(clipped_preds),
torch.log(labels)))
return rmse.item()
def train(net, train_features, train_labels, test_features, test_labels,
num_epochs, learning_rate, weight_decay, batch_size):
train_ls, test_ls = [], []
train_iter = d2l.load_array((train_features, train_labels), batch_size)
# 这里使用的是Adam优化算法
optimizer = torch.optim.Adam(net.parameters(),
lr = learning_rate,
weight_decay = weight_decay)
for epoch in range(num_epochs):
for X, y in train_iter:
optimizer.zero_grad()
l = loss(net(X), y)
l.backward()
optimizer.step()
train_ls.append(log_rmse(net, train_features, train_labels))
if test_labels is not None:
test_ls.append(log_rmse(net, test_features, test_labels))
return train_ls, test_ls
def get_k_fold_data(k, i, X, y):
assert k > 1
fold_size = X.shape[0] // k
X_train, y_train = None, None
for j in range(k):
idx = slice(j * fold_size, (j + 1) * fold_size)
X_part, y_part = X[idx, :], y[idx]
if j == i:
X_valid, y_valid = X_part, y_part
elif X_train is None:
X_train, y_train = X_part, y_part
else:
X_train = torch.cat([X_train, X_part], 0)
y_train = torch.cat([y_train, y_part], 0)
return X_train, y_train, X_valid, y_valid
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
batch_size):
train_l_sum, valid_l_sum = 0, 0
for i in range(k):
data = get_k_fold_data(k, i, X_train, y_train)
net = get_net()
train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
weight_decay, batch_size)
train_l_sum += train_ls[-1]
valid_l_sum += valid_ls[-1]
if i == 0:
d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
legend=['train', 'valid'], yscale='log')
print(f'折{i + 1},训练log rmse{float(train_ls[-1]):f}, '
f'验证log rmse{float(valid_ls[-1]):f}')
return train_l_sum / k, valid_l_sum / k
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
weight_decay, batch_size)
print(f'{k}-折验证: 平均训练log rmse: {float(train_l):f}, '
f'平均验证log rmse: {float(valid_l):f}')
## 训练函数借助了Adam优化器,它的优点是在一开始对学习率不那么敏感
折1,训练log rmse0.817052, 验证log rmse0.647753
折2,训练log rmse0.803511, 验证log rmse0.717971
折3,训练log rmse0.794341, 验证log rmse0.927424
折4,训练log rmse0.829598, 验证log rmse0.917323
折5,训练log rmse0.788533, 验证log rmse0.970247
5-折验证: 平均训练log rmse: 0.806607, 平均验证log rmse: 0.836143
提交submission文件
def train_and_pred(train_features, test_features, train_labels, test_data,
num_epochs, lr, weight_decay, batch_size):
net = get_net()
train_ls, _ = train(net, train_features, train_labels, None, None,
num_epochs, lr, weight_decay, batch_size)
d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
print(f'训练log rmse:{float(train_ls[-1]):f}')
# 将网络应用于测试集。
preds = net(test_features).detach().numpy()
# 将其重新格式化以导出到Kaggle
test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
submission.to_csv('submission.csv', index=False)
train_and_pred(train_features, test_features, train_labels, test_data,
num_epochs, lr, weight_decay, batch_size)
训练log rmse:0.798724