Predicting Dongguan House Prices with a sklearn Random Forest


反反反反反反气旋


Article link: https://blog.csdn.net/weixin_41432124/article/details/105803121

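== My thesis defense is over, so I am writing down what I think was a highlight of my graduation project ==
The house price prediction follows this flow:
First the collected data is processed so that every field is numeric. Then a random forest model is fitted on it, and finally new data is passed in to get a prediction.
The raw data looks roughly like this (partial view):
[Image: sample of the raw data]
Before cleaning, I compute the correlation coefficients between the data fields. The more strongly a field correlates with the price, the more it influences it, which helps the model score later on.
Computing the correlation coefficients: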

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://root:root@localhost:3306/houseinfo?charset=utf8")
sql = "select * from forecast_data_copy"
house_data = pd.read_sql(sql, engine)
print(house_data.corr())  # call corr() and inspect the correlation matrix
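
The full matrix from corr() is hard to read when all you want to know is which fields matter for the price. Here is a minimal sketch that ranks fields by their correlation with a target column. The sample rows, the helper name rank_by_correlation, and the choice of unit_price as target are assumptions for illustration; the real data comes from the MySQL table.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the real table.
sample = pd.DataFrame({
    "area": [60.0, 80.0, 100.0, 120.0, 140.0],
    "built": [2001, 2015, 2008, 2019, 2012],
    "unit_price": [12000.0, 16000.0, 20000.0, 24000.0, 28000.0],
})


def rank_by_correlation(df, target):
    """Absolute Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")  # corr() only makes sense for numeric columns
    corr = numeric.corr()[target].drop(target)    # drop the target's correlation with itself
    return corr.abs().sort_values(ascending=False)


ranking = rank_by_correlation(sample, "unit_price")
print(ranking)
```

Keep in mind that Pearson correlation only measures linear relationships, and the category codes produced later (town, orientation and so on) are arbitrary numbers, so a low coefficient for those fields does not mean they are useless to a tree-based model.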

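Next, the data needs to be converted into this form:
[Image: the data after conversion to numeric codes]
The text values from the database are mapped to specific numeric codes. The code is as follows: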

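def format_data():
    # Read the raw second-hand listings (ershoufang_info) from MySQL
    engine = create_engine("mysql+pymysql://root:root@localhost:3306/houseinfo?charset=utf8")
    sql = "select * from ershoufang_info"
    house_data = pd.read_sql(sql, engine)
    # Map each text field to a numeric code; a label missing from a dict becomes NaN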
    house_data[u'decoration'] = house_data[u'decoration'].map(
        {'毛坯': 1001, '简装修': 1002, '精装修': 1003, '其他': 1000, '': 1000})
    house_data[u'house_range'] = house_data[u'house_range'].map(
        {'1室': 2001, '2室': 2002, '3室': 2003, '4室': 2004, '5室': 2005,
         '6室': 2006, '7室': 2007, '8室': 2008, '9室': 2009, '车位室': 2000,
         '0室': 2000, '未知': 2000, '': 2000})
    house_data[u'town'] = house_data[u'town'].map(
        {'南城区': 4000, '万江区': 5000, '石碣镇': 101000, '石龙镇': 102000, '茶山镇': 103000,
         '石排镇': 104000,
         '企石镇': 105000, '横沥镇': 106000, '桥头镇': 107000, '谢岗镇': 108000, '大岭山镇': 118000,
         '常平镇': 110000, '寮步镇': 111000, '长安镇': 119000, '大朗镇': 113000, '黄江镇': 114000,
         '清溪镇': 115000, '望牛墩镇': 127000, '东城区': 3000, '莞城区': 6000,
         '东坑镇': 109000, '塘厦镇': 116000, '凤岗镇': 117000, '虎门镇': 121000, '厚街镇': 122000,
         '沙田镇': 123000, '道滘镇': 124000, '洪梅镇': 125000, '麻涌镇': 126000, '中堂镇': 128000,
         '高埗镇': 129000, '樟木头镇': 112000, '松山湖': 401000})
    house_data[u'built'] = house_data[u'built'].map(
        {'未知': 0,
         '2019': 2019, '2018': 2018, '2017': 2017, '2016': 2016, '2015': 2015,
         '2014': 2014, '2013': 2013, '2012': 2012, '2011': 2011, '2010': 2010,
         '2009': 2009, '2008': 2008, '2007': 2007, '2006': 2006, '2005': 2005,
         '2004': 2004, '2003': 2003, '2002': 2002, '2001': 2001,
         '2000': 2000, '1999': 1999, '1998': 1998, '1997': 1997, '1996': 1996,
         '1995': 1995, '1994': 1994, '1993': 1993, '1992': 1992, '1991': 1991,
         '1990': 1990, '1989': 1989, '1980': 1980})
    house_data[u'orientation'] = house_data[u'orientation'].map(
        {'三面单边': 3001, '西南': 3002, '东南': 3003, '南': 3004, '东北': 3005,
         '西北': 3006, '南北': 3007, '北': 3008, '西': 3009, '东': 30010,
         '未知': 3000, '东西': 30011, '': 3000})
    house_data[u'floor'] = house_data[u'floor'].map(
        {'地下室': 4000, '低楼层': 4001, '中楼层': 4002, '高楼层': 4003, '未知': 4000, '': 4000})
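    # Join the first two '/'-separated parts into one integer (e.g. '2020/01' -> 202001)
    house_data[u'finsh_data'] = house_data[u'finsh_data'].apply(lambda x: int(x.split('/')[0]+x.split('/')[1]))
    print(house_data[u'finsh_data'])
    # Drop the columns that are not used as features
    house_data = house_data.drop(['built_range', 'use_year', 'price_tag', 'cycle', 'id',
                                  'total_price'], axis=1)
    # Show which columns are entirely null (a sign that a mapping matched nothing)
    print(house_data.isnull().all())
    # Write the cleaned table back to MySQL
    house_data.to_sql(name='clear_sql', con=engine, if_exists='replace', index=False)
    return house_data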
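
One thing to watch with Series.map and a dict: any label that is not a key in the dict turns into NaN (for example, a build year such as '1985', which is not listed in the built mapping), and those rows are dropped later by dropna(). Here is a minimal sketch of an alternative that falls back to a default code instead. This is a different technique from the one used above, and the sample rows and helper names are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; '豪华装修' and '1985' are deliberately absent from the mappings.
raw = pd.DataFrame({
    "decoration": ["毛坯", "精装修", "豪华装修", ""],
    "built": ["2015", "1985", "未知", "1999"],
    "finsh_data": ["2020/01", "2019/11", "2020/03", "2018/07"],
})

DECORATION_CODES = {"毛坯": 1001, "简装修": 1002, "精装修": 1003, "其他": 1000, "": 1000}


def encode_with_default(series, codes, default):
    """Map labels to codes; labels missing from `codes` get `default` instead of NaN."""
    return series.map(codes).fillna(default).astype(int)


def encode_year(series):
    """Parse year strings as integers; anything non-numeric (e.g. '未知') becomes 0."""
    return pd.to_numeric(series, errors="coerce").fillna(0).astype(int)


def encode_year_month(series):
    """Join the first two '/'-separated parts into one integer, e.g. '2020/01' -> 202001."""
    parts = series.str.split("/")
    return (parts.str[0] + parts.str[1]).astype(int)


encoded = pd.DataFrame({
    "decoration": encode_with_default(raw["decoration"], DECORATION_CODES, 1000),
    "built": encode_year(raw["built"]),
    "finsh_data": encode_year_month(raw["finsh_data"]),
})
print(encoded)
```

Whether a fallback code or dropping the row is better depends on how many rows are affected. Parsing the year numerically also avoids having to list every year by hand.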

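Next comes building the prediction model. This part is fairly simple: load the DataFrame and split it into a training set and a test set. For background, I recommend the short introductions to random forests on Bilibili; 5-10 minutes is enough.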

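import joblib
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def RF():    # random forest
    data = format_data()
    # Labels that were missing from the mapping dicts are NaN; check, drop those rows, check again
    print(data.isnull().any())
    data.dropna(inplace=True)
    print(data.isnull().any())
    # Features: every column except the target
    df_data = data.drop([u'unit_price'], axis=1)
    # Target column
    df_target = data[u'unit_price']
    # Split the data into a training set and a test set.
    # A common ratio is 0.2 for the test set and 0.8 for the training set.
    # The training parts are returned first, then the test parts.
    x_train, x_test, y_train, y_test = train_test_split(df_data, df_target, test_size=0.2)
    # max_features must not exceed the number of feature columns
    model = RandomForestRegressor(n_estimators=100, max_features=9)
    model.fit(x_train, y_train)
    predicted = model.predict(x_test)
    # Save the model as a pkl file for later use
    joblib.dump(model, 'RF.pkl')
    # Evaluate with explained variance, one of the common regression metrics.
    # The best possible score is 1; the closer to 1, the more reliable the model.
    # (accuracy_score is a classification metric, so it is not used here.)
    score = metrics.explained_variance_score(y_test, predicted)
    return score

def main():
    # Train once and print the score (this is explained variance, not MSE)
    print('RF explained variance: ', RF())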
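
The function above reports a single score. Here is a minimal sketch that prints several common regression metrics side by side, run on synthetic data so it works without the database. The data-generating formula, the random seeds, and the choice of these four metrics are assumptions for illustration, not results from the real housing data.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the housing table: columns are (area, build year).
X = rng.uniform(low=[40.0, 1990.0], high=[150.0, 2020.0], size=(300, 2))
# Made-up price: linear in both features plus Gaussian noise.
y = 150.0 * X[:, 0] + 50.0 * (X[:, 1] - 1990.0) + rng.normal(0.0, 500.0, size=300)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)
predicted = model.predict(x_test)

scores = {
    "mse": metrics.mean_squared_error(y_test, predicted),    # lower is better
    "mae": metrics.mean_absolute_error(y_test, predicted),   # lower is better
    "r2": metrics.r2_score(y_test, predicted),                # best is 1
    "explained_variance": metrics.explained_variance_score(y_test, predicted),  # best is 1
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Fixing random_state makes the split and the forest reproducible, which is handy when comparing parameter settings. Explained variance ignores any constant bias in the predictions, so it is never lower than R2 on the same data.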

The last step is to pass in new data and call the saved model.

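import joblib
import pandas as pd

def test_RF():
    # One listing, already encoded with the numeric codes from format_data().
    # Note: 3004 is not among the house_range codes (2000-2009) defined in format_data().
    shuju = pd.DataFrame([{'town': 121000, 'house_range': 3004, 'floor': 4002, 'area': 98.0, 'built': 2015, 'orientation': 3004,
                           'decoration': 1000, 'finsh_data': 202001}], columns=['town', 'house_range', 'floor', 'area', 'built',
                                                                      'orientation', 'decoration', 'finsh_data'])
    print(shuju)
    # Load the saved file. The model was already trained in the previous step,
    # so here we only pass in data and it returns a value based on that training.
    clf = joblib.load('RF.pkl')
    print(clf)
    price = clf.predict(shuju)
    print(price)


def main():
    test_RF()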

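This can be put to use in a real website project~
[Image: the prediction feature on the website]
This is just a small record for myself; there may well be mistakes in the usage or the data handling, haha.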
