机器学习实践：Multiple Features Of One Input

最新推荐文章于 2022-06-14 09:00:12 发布

JinBerg

最新推荐文章于 2022-06-14 09:00:12 发布

阅读量700

点赞数

分类专栏：机器学习实践文章标签：机器学习

本文链接：https://blog.csdn.net/JinBerg/article/details/56686851

版权

机器学习实践专栏收录该内容

1 篇文章 0 订阅

订阅专栏

这个数据集合是在华盛顿大学公开课中下载的，是华盛顿州king county一段时间的房屋交易详情。用到一半才知道这只是某连续12个月的。

启动graphlab

import graphlab as gl

加载数据

sale = gl.SFrame('kc_house_data.gl')

sale.show(view='Line Chart', x='date', y='price')

Canvas is accessible via web browser at the URL: http://localhost:42164/index.html Opening Canvas in default web browser.

数据预处理

打算只用时间和平均成交价格学习一下，所以先把这两个量整理一下

data = sale.select_columns(['id', 'date', 'price'])

data.head((5))

id	date	price
7129300520	2014-10-13 00:00:00+00:00	221900.0
6414100192	2014-12-09 00:00:00+00:00	538000.0
5631500400	2015-02-25 00:00:00+00:00	180000.0
2487200875	2014-12-09 00:00:00+00:00	604000.0
1954400510	2015-02-18 00:00:00+00:00	510000.0

[5 rows x 3 columns]

观察到date是datetime类型，而且我只想用到年份和月份，所以剔除掉不需要的部分,之后还要划分训练集和测试集

from datetime import date
data['date'] = data['date'].apply(lambda x:date(x.year, x.month, 1))
data.head((5))

id	date	price
7129300520	2014-10-01 00:00:00	221900.0
6414100192	2014-12-01 00:00:00	538000.0
5631500400	2015-02-01 00:00:00	180000.0
2487200875	2014-12-01 00:00:00	604000.0
1954400510	2015-02-01 00:00:00	510000.0

[5 rows x 3 columns]

train_data, test_data = data.random_split(0.8, 0)

按照年月分组计算每组的平均值

 import graphlab.aggregate as agg

mean_price_month = train_data.groupby(key_columns=['date'], operations={'mean_price':agg.MEAN('price')})
mean_price_month.head((5))

date	mean_price
2014-10-01 00:00:00	536548.106171
2015-03-01 00:00:00	546654.241562
2014-09-01 00:00:00	522656.637439
2014-08-01 00:00:00	535107.794344
2014-12-01 00:00:00	522391.405724

[5 rows x 2 columns]

画出来图像看看

import matplotlib.pyplot as plt
%matplotlib inline

#首先需要根据时间顺序对表进行排序
mean_price_month = mean_price_month.sort(['date'])
mean_price_month

date	mean_price
2014-05-01 00:00:00	549571.932862
2014-06-01 00:00:00	556720.479011
2014-07-01 00:00:00	546336.921229
2014-08-01 00:00:00	535107.794344
2014-09-01 00:00:00	522656.637439
2014-10-01 00:00:00	536548.106171
2014-11-01 00:00:00	523900.917319
2014-12-01 00:00:00	522391.405724
2015-01-01 00:00:00	523216.857143
2015-02-01 00:00:00	500171.623

[13 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

plt.plot(mean_price_month['date'], mean_price_month['mean_price'], '.')

这里写图片描述

有点尴尬，本来以为数据有好几年的，结果只有一年的，实在是尴尬。硬着头皮按照之前的想法做吧

内建的模型拟合

处理datetime类型不方便，转换成整数类型，从2014-05开始累计

mean_price_month['mon_count'] = (mean_price_month['date'].apply(lambda x:x.year)-2014)*12 + mean_price_month['date'].apply(lambda x: x.month)-5
mean_price_month.head((5))

date	mean_price	mon_count
2014-05-01 00:00:00	549571.932862	0
2014-06-01 00:00:00	556720.479011	1
2014-07-01 00:00:00	546336.921229	2
2014-08-01 00:00:00	535107.794344	3
2014-09-01 00:00:00	522656.637439	4

[5 rows x 3 columns]

model1 = gl.regression.create(mean_price_month, target='mean_price', features=['mon_count'])
plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.',
         mean_price_month['mon_count'], model1.predict(mean_price_month), '-')

Boosted trees regression:

--------------------------------------------------------

Number of examples          : 13

Number of features          : 1

Number of unpacked features : 1

+-----------+--------------+--------------------+---------------+

| Iteration | Elapsed Time | Training-max_error | Training-rmse |

+-----------+--------------+--------------------+---------------+

| 1         | 0.000348     | 415092.625000      | 388280.468750 |

| 2         | 0.000616     | 307045.250000      | 280394.250000 |

| 3         | 0.000691     | 229096.781250      | 202668.718750 |

| 4         | 0.000758     | 172862.531250      | 146742.203125 |

| 5         | 0.000824     | 132293.531250      | 106596.460938 |

| 6         | 0.000914     | 103025.906250      | 77906.953125  |

+-----------+--------------+--------------------+---------------+

这里写图片描述

然后试试内建的线性回归方法

model2 = gl.linear_regression.create(mean_price_month, target='mean_price', features=['mon_count'])
plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.',
         mean_price_month['mon_count'], model2.predict(mean_price_month), '-')

Linear regression:

--------------------------------------------------------

Number of examples          : 13

Number of features          : 1

Number of unpacked features : 1

Number of coefficients    : 2

Starting Newton Method

--------------------------------------------------------

+-----------+----------+--------------+--------------------+---------------+

| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |

+-----------+----------+--------------+--------------------+---------------+

| 1         | 2        | 0.000243     | 37346.856866       | 18013.646218  |

+-----------+----------+--------------+--------------------+---------------+

SUCCESS: Optimal solution found.

这里写图片描述

梯度下降法来实现自己的模型

将价格随着月份的变化看作是由两部分组成的，首先是一个线性部分，然后是一个周期性变化的三角函数部分

y i = w 0 + w 1 x i + w 2 sin (2 π x i / 12 - Φ) + ϵ i

$y_i = w_0+w_1x_i+w_2\sin{(2 \pi x_i/12-\Phi)}+\epsilon _i$
即

y i = w 0 + w 1 x i + w 2 sin (2 π x i / 12) + w 3 cos (2 π x i / 12) + ϵ i

$y_i = w_0+w_1x_i+w_2\sin{(2 \pi x_i/12)}+w_3\cos{(2 \pi x_i/12)}+\epsilon _i$
參差平方和

R S S = \sum i = 1 N [y i - (w 0 + w 1 x i + w 2 sin (2 π x i / 12) + w 3 cos (2 π x i / 12))] 2

$RSS = \sum_{i=1}^N [y_i-(w_0+w_1x_i+w_2\sin{(2 \pi x_i/12)}+w_3\cos{(2 \pi x_i/12)})]^2$
求出梯度
为方便表示，令

A=yi−(w0+w1xi+w2sin(2πxi/12)+w3cos(2πxi/12)) $\mathcal{A} = y_i-(w_0+w_1x_i+w_2\sin{(2 \pi x_i/12)}+w_3\cos{(2 \pi x_i/12)})$

\nabla R S S = ⎡ ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ - 2 \sum A - 2 \sum A x i - 2 \sum A sin (π x i / 6) - 2 \sum A cos (π x i / 6) ⎤ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥

$\nabla RSS= \left[ \begin{array}{c} -2\sum \mathcal{A}\\ -2\sum \mathcal{A} x_i\\ -2\sum \mathcal{A} \sin (\pi x_i/6)\\ -2\sum \mathcal{A} \cos (\pi x_i/6) \end{array} \right]$

首先计算 $\hat y_i$

from math import *
import numpy as np
def get_haty(W, X):
    w0, w1, w2, w3 = W
    return w0 + w1*X + w2*np.sin(X*pi/6) + w3*np.cos(X*pi/6)

计算RSS

def get_RSS(W, X, Y):
    return sum((Y-get_haty(W,X))*((Y-get_haty(W,X))))

计算RSS的梯度

def get_RSS_grad(W, X, Y):
    a = -2*sum(Y - get_haty(W, X))
    b = -2*sum((Y - get_haty(W, X))*X)
    c = -2*sum((Y - get_haty(W, X))*np.sin(pi*X/6))
    d = -2*sum((Y - get_haty(W, X))*np.cos(pi*X/6))
    return np.array([a, b, c, d])

迭代计算参数

def get_W(ini_W, X, Y, stop_grad, step, max_ite=1000000):
    W = ini_W
    ite = 0
    grad = get_RSS_grad(W, X, Y)
    RSS = [get_RSS(W, X, Y)]
    while sum(grad*grad) > stop_grad*stop_grad and ite < max_ite:
        ite = ite+1
        W = W - step*grad#如果必要可以给step加上一个越来越大的分母来控制学习率
        grad = get_RSS_grad(W, X, Y)
        RSS.append(get_RSS(W, X, Y))
    return W, RSS, ite

W = np.array([520000, 3000, -21000, 21000])
plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.',
         mean_price_month['mon_count'], get_haty(W,mean_price_month['mon_count']), '-')

这里写图片描述

W, RSS, ite = get_W(W, np.array(mean_price_month['mon_count']), np.array(mean_price_month['mean_price']), 0.1, 0.0015 )
#学习率和初始位置的选择很重要，花费了较长的时间
print ite         
print RSS[-100:]
plt.plot(mean_price_month['mon_count'], mean_price_month['mean_price'], '.',
         mean_price_month['mon_count'], get_haty(W,mean_price_month['mon_count']), '-')

0
[1484680834.0960073]

这里写图片描述
从上边的图形中可以看出新模型对训练集的拟合是非常好的

在测试集上实验

train_data, test_data = data.random_split(0.8, 0)
test_data = test_data.groupby(key_columns=['date'], operations={'mean_price':agg.MEAN('price')})
test_data = test_data.sort('date')
test_data.head((5))

date	mean_price
2014-05-01 00:00:00	542100.963173
2014-06-01 00:00:00	563056.417234
2014-07-01 00:00:00	538206.346793
2014-08-01 00:00:00	541864.872396
2014-09-01 00:00:00	557384.857567

[5 rows x 2 columns]

test_data['mon_count'] = (test_data['date'].apply(lambda x:x.year)-2014)*12 + test_data['date'].apply(lambda x: x.month)-5

plt.plot(test_data['mon_count'], test_data['mean_price'], '.',
         test_data['mon_count'], get_haty(W, test_data['mon_count']), '-')
print get_RSS(W, np.array(test_data['mon_count']), np.array(test_data['mean_price']))

2498013785.02

这里写图片描述
这里看模型表现的不怎么样，主要原因应该是数据太少了，而且用的是每个月份所有房子的平均价格（不管大小、区位、档次等因素），如果有类似单位面积价格的话更好一些。另外，如果数据量大，每个月都有各种各样的房子成交（类似于训练集的数据量），那么表现也会好一些吧。

总结

数据时间跨度太小
选房子的总价做预测不是很合理，如果有单位面积价格可能更好
数据量较小

JinBerg

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
机器学习实践：Multiple Features Of One Input

这个数据集合是在华盛顿大学公开课中下载的，是华盛顿州king county一段时间的房屋交易详情。用到一半才知道这只是某连续12个月的。启动graphlabimport graphlab as gl加载数据sale = gl.SFrame('kc_house_data.gl')sale.show(view='Line Chart', x='date', y='price') Canvas
复制链接

扫一扫

专栏目录