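1. Data preparation
Required data: 深圳罗湖二手房信息.csv. The file is attached to this article and can be downloaded from the top of the page.

Save the CSV file locally. I put it in the same directory as the code.
The column names are kept in Chinese because the code refers to them: 房屋编码 (listing ID), 小区 (community), 朝向 (orientation), 房屋单价 (unit price), 参考首付 (reference down payment), 参考总价 (reference total price), 经度 (longitude), 纬度 (latitude).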

房屋编码,小区,朝向,房屋单价,参考首付,参考总价,经度,纬度
605093949,大望新平村,南北,5434,15,50,114.180964,22.603698
605768856,通宝楼,南北,3472,7.5,25,114.179298,22.56691
606815561,罗湖区罗芳村,南北,5842,15.6,52,114.158869,22.547223
605147285,兴华苑,南北,3829,10.8,36,114.15804,22.554343
606030866,京基东方都会,西南,47222,51,170,114.149243,22.55437
605610283,水库新村,南北,5897,13.8,46,114.1454697,22.57018661
601250774,水库新村,南北,8295,21.9,73,114.1454697,22.57018661
605525982,水库新村,南北,6145,17.7,59,114.1454697,22.57018661
606810540,新天地名居,南,51282,60,200,114.1407852,22.55086327
599540811,翠岭苑,南北,11160,30,100,114.1373593,22.59192018
606693036,松泉公寓,东,38557,60,200,114.1354523,22.58370781
606348908,钻石时代,南,45833,49.5,165,114.135184,22.54832501
605140018,东门E公馆,南,11891,31.5,105,114.134773,22.56072914
606590991,美园,北,51923,40.5,135,114.1346817,22.54956245
596462998,京基东方华都,南,62500,60,200,114.1343842,22.55811119
594298847,京基东方华都,东,52631,60,200,114.1343842,22.55811119
605560931,长丰苑,西北,38888,42,140,114.1342954,22.54779651
596278133,长丰苑,南,43023,55.5,185,114.1342954,22.54779651
605665139,金丽豪苑,东,95238,60,200,114.1338196,22.56984901
604613670,金丽豪苑,南,80000,60,200,114.1338196,22.56984901
606637625,愉天小区,北,66574,57.9,193,114.1337433,22.57199097
606637625,愉天小区,北,66574,57.9,193,114.1337433,22.57199097
606252043,雅园公寓,南北,8928,15,50,114.1337363,22.55559386
602026329,雅园公寓,东南,5714,12,40,114.1337363,22.55559386
602117545,东门168,东西,52173,36,120,114.1334229,22.55591774
606660644,东门168,西南,61000,36.6,122,114.1334229,22.55591774
599004917,阳光新干线家园,南,48725,58.47,194.9,114.1334152,22.54417419
605769039,培峰苑,南北,3835,8.4,28,114.1322949,22.59471036
605769102,培峰苑,南北,3958,11.4,38,114.1322949,22.59471036
605769039,培峰苑,南北,3835,8.4,28,114.1322949,22.59471036
605906134,金色都汇,东,48888,52.8,176,114.131928,22.546667
604329044,缤纷时代家园,南,63879,49.5,165,114.1311035,22.55740738
603204276,嘉湖新都,东南,89523,56.4,188,114.1310808,22.57252346
606779885,嘉湖新都,南,64516,60,200,114.1310808,22.57252346
605628024,嘉湖新都,南,66000,59.4,198,114.1310808,22.57252346
604870821,湖润大厦,南北,5058,12.9,43,114.1304169,22.55123329
605702262,湖润大厦,南北,4545,12,40,114.1304169,22.55123329
590392825,东门天下,东南,55714,58.5,195,114.1286697,22.55532455
603513631,田贝花园,东南,9911,27,90,114.1281815,22.57090759
606616471,田贝花园,南北,9693,28.5,95,114.1281815,22.57090759
606616471,田贝花园,南北,9693,28.5,95,114.1281815,22.57090759
598334198,田贝花园,南北,8363,13.8,46,114.1281815,22.57090759
599340816,田贝花园,南北,9552,26.7,89,114.1281815,22.57090759
599044788,银座金钻,南,41666,52.5,175,114.128142,22.547827
604872669,置地逸轩,北,50000,58.5,195,114.1273651,22.54327393
606625129,置地逸轩,南,50000,58.5,195,114.1273651,22.54327393
601093683,罗湖村,西北,5113,13.5,45,114.125588,22.541119
604870556,罗湖村,西北,4772,12.6,42,114.125588,22.541119
606482810,罗湖村,南北,4545,13.5,45,114.125588,22.541119
606355577,罗湖村,南北,5842,15.6,52,114.125588,22.541119
601897164,海丰苑,西南,38000,57,190,114.1245873,22.54687355
605532790,罗湖1号大楼,南北,4777,12.9,43,114.123972,22.546023
605279416,罗湖1号大楼,南北,6588,16.8,56,114.123972,22.546023
606729036,友谊大厦,南,36923,57.6,192,114.1237106,22.54456711
597559191,金田大厦,南北,5582,15.24,50.8,114.1215574,22.54521646
606578375,虹桥星座,东,47878,47.4,158,114.1205521,22.5762043
600682443,虹桥星座,东,47878,47.4,158,114.1205521,22.5762043
601845711,田心村,南北,5370,17.4,58,114.119935,22.573407
598296969,时尚新居,南,41818,41.4,138,114.1188431,22.5744648
603983611,祥福雅居,南,48387,45,150,114.1188049,22.57196808
606543980,西湖大厦,东,30411,51,170,114.1164398,22.56119537
606535099,风格名苑,东,55806,51.9,173,114.1159592,22.55708122
595068607,幸福华府,南北,26388,57,190,114.1143646,22.5544281
606799436,武警七支队大院,南北,3960,12,40,114.111282,22.557374
594102300,新闻大厦,南,6052,13.8,46,114.1093938,22.54749824
606719083,星湖花园三期,西南,41578,47.4,158,114.1058044,22.57372665
603105329,武警家属大院,东南,4444,12,40,114.1005318,22.57568425
605322083,恒通花园,南北,4245,13.5,45,114.097673,22.570293
605244548,恒通花园,南,4128,9.66,32.2,114.097673,22.570293
601116785,恒通花园,南北,3773,12,40,114.097673,22.570293
598258845,三九花园,南,5833,12.6,42,114.0895386,22.57707977
594221866,三九花园,南,5681,15,50,114.0895386,22.57707977
606700179,城市春天,南北,3571,7.5,25,114.083405,22.5395049
603950517,皇御苑,东北,59701,54,180,114.0817954,22.53139307
605232094,晨晖家园,南,54285,57,190,114.0676249,22.52550815
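2. Read the data and take a first look at its distribution
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Jupyter notebook magic: render plots inline
%matplotlib inline
data = pd.read_csv('深圳罗湖二手房信息.csv',engine='python',encoding='GBK')
data

# Plot each listing at its coordinates: marker size follows the unit price,
# color follows the reference total price
plt.scatter(x=data['经度'],y=data['纬度'],
            s=data['房屋单价']/500,
            c=data['参考总价'],
            cmap='Reds',
            alpha=0.8)
plt.grid(visible=True,linestyle='--')
data.head()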

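3. Distribution analysis
Distribution analysis studies the distribution characteristics and distribution type of the data. The basic statistics differ between quantitative and qualitative data.
The main ones are the range, the frequency distribution, and the class width and number of classes.
3.1 Range
- The smaller the range, the more stable (less spread out) the data
- The range is simply the maximum minus the minimum. The example below uses the reference total price (参考总价) and the reference down payment (参考首付)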
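As a rough sketch of how the range, the number of classes, and the class width relate: with k equal-width classes, the class width is the range divided by k. The values below are the 参考总价 entries of the first six rows of the CSV; the variable names and the choice k = 5 are mine, for illustration only.

```python
# Reference total prices (参考总价) of the first six rows of the CSV
prices = [50, 25, 52, 36, 170, 46]

k = 5                                     # number of classes (arbitrary, for illustration)
value_range = max(prices) - min(prices)   # range = max - min
width = value_range / k                   # class width for k equal-width classes

# The k + 1 class boundaries, from the minimum up to the maximum
edges = [min(prices) + i * width for i in range(k + 1)]
print(value_range, width, edges)
```

Fewer classes give wider intervals and a coarser picture; more classes give narrower intervals, some of which may end up empty.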
def d_range(df,*cols):
    # Return the range (max - min) of each requested column, in order
    krange=[]
    for col in cols:
        crange = df[col].max() - df[col].min()
        krange.append(crange)
    return krange
key1 = '参考总价'
key2 = '参考首付'
dr = d_range(data,key1,key2)
print("Range of %s: %f" %(key1,dr[0]))
print("Range of %s: %f" %(key2,dr[1]))
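The same ranges can also be computed without an explicit loop, since `max()` and `min()` work column-wise on a DataFrame. This is only an alternative sketch; the small DataFrame below is built from the first four rows of the CSV so that the block runs on its own.

```python
import pandas as pd

# First four rows of the CSV, two columns only
df = pd.DataFrame({
    '参考总价': [50, 25, 52, 36],
    '参考首付': [15, 7.5, 15.6, 10.8],
})

cols = ['参考总价', '参考首付']
# Column-wise max minus column-wise min: a Series indexed by column name
ranges = df[cols].max() - df[cols].min()
print(ranges)
```

The result is indexed by column name, so `ranges['参考总价']` replaces the positional `dr[0]`.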

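3.2 Frequency distribution
A histogram makes it easy to see the frequency distribution of the reference total price.
plt.hist(data[key1],bins=8,edgecolor='black')

But what if we want to look at each interval separately? A histogram alone does not give us that.
- Use the pandas cut function to split the values into intervals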
gcut = pd.cut(data[key1],10,right = False)
gcut
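To make the effect of `right=False` concrete, here is a small sketch on made-up values (not taken from the CSV). With an integer number of bins, `pd.cut` builds equal-width intervals and slightly extends the outer edge, so the extreme value still falls inside a bin. `right=False` gives left-closed intervals `[a, b)`, while the default gives right-closed intervals `(a, b]`.

```python
import pandas as pd

s = pd.Series([25, 50, 100, 200])   # made-up values for illustration

left_closed = pd.cut(s, 2, right=False)   # intervals of the form [a, b)
right_closed = pd.cut(s, 2)               # default right=True: intervals (a, b]

print(left_closed.cat.categories)
print(right_closed.cat.categories)
print(left_closed.value_counts(sort=False))
```

Either way every value lands in exactly one interval. The choice only decides which interval a value sitting exactly on a boundary belongs to.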

- The value_counts() method of a pandas Series shows how many values fall into each interval
gcut.value_counts(sort=False)

To attach the interval computed above to every row, we can add a new column to data.
The result returned by cut is available through its values attribute.
data['%s分组区间' % key1] = gcut.values
data

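- We can store the statistics in a new table. First save the counts from gcut.value_counts(sort=False)
- From the counts, build the frequency and the cumulative frequency, and use pandas table styling to draw bars inside those cells
- Use the pandas apply() function with a lambda expression to format the frequency and cumulative frequency as percentages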
gcut_count=gcut.value_counts(sort=False)
r_zj = pd.DataFrame(gcut_count)
r_zj.rename(columns={gcut_count.name:'频数'},inplace=True)
r_zj['频率'] = r_zj['频数']/r_zj['频数'].sum()
r_zj['频率%'] = r_zj['频率'].apply(lambda x:'%.2f%%' % (x*100))
r_zj['累计频率'] = r_zj['频率'].cumsum()
r_zj['累计频率%'] = r_zj['累计频率'].apply(lambda x:'%.2f%%' % (x*100))
r_zj.style.bar(subset=['频率','累计频率'])

The x-axis shows the interval, the y-axis shows the frequency, and each bar is annotated with its count.
fig = plt.figure(figsize = (12,2))
ax = plt.subplot(111)
# Left edge of each interval, used as the x position of its bar
lefts = [i.left for i in gcut_count.keys()]
ax.bar(x=lefts,height=r_zj['频率'],color='r',alpha=0.8,width=10)
ax.grid(visible=True,linestyle='--')
ax.set_xticks(ticks=lefts,labels=[str(i) for i in gcut_count.keys()],rotation=90)
for a,b,s in zip(lefts,r_zj['频率'],r_zj['频数']):
    ax.text(a,b,'count: %s'%s,ha='center',va='baseline')
ax.set_xlabel("Interval")
ax.set_ylabel("Frequency")

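- In the same way, we can compute frequencies for a qualitative field. In our example the house orientation (朝向) is qualitative. The code is exactly the same; only the data being analyzed changes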
cx_g=data['朝向'].value_counts(sort=True)
r_cx = pd.DataFrame(cx_g)
r_cx.rename(columns={cx_g.name:'频数'},inplace=True)
r_cx['频率'] = r_cx['频数']/r_cx['频数'].sum()
r_cx['频率%'] = r_cx['频率'].apply(lambda x:'%.2f%%' % (x*100))
r_cx['累计频率'] = r_cx['频率'].cumsum()
r_cx['累计频率%'] = r_cx['累计频率'].apply(lambda x:'%.2f%%' % (x*100))
r_cx.style.bar(subset=['频率','累计频率'])
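As a shortcut for the frequency column, `value_counts(normalize=True)` returns relative frequencies directly. The sketch below uses the orientation of the first eight rows of the CSV so that it runs on its own; it is an alternative to the division by the total count above, not a replacement for the full table.

```python
import pandas as pd

# Orientation (朝向) of the first eight rows of the CSV
orient = pd.Series(['南北', '南北', '南北', '南北', '西南', '南北', '南北', '南北'],
                   name='朝向')

freq = orient.value_counts(normalize=True)   # relative frequencies, most common first
cum = freq.cumsum()                          # cumulative frequency
print(freq)
print(cum)
```

The last cumulative frequency is always 1, which is a quick sanity check on the table.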

fig = plt.figure(figsize = (12,8))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.bar(x=cx_g.keys(),height=r_cx['频率'],color='r',alpha=0.8)
ax1.grid(visible=True,linestyle='--')
ax1.set_title("Bar chart of house orientation frequencies")
ax2.pie(r_cx['频数'],labels=r_cx.index,autopct='%.2f%%',shadow=True)
plt.axis('equal')
