houseprice_analysis: Guangzhou Housing Rent-to-Sale Ratio Analysis (Part 2)

Article link: https://blog.csdn.net/weixin_44216391/article/details/107633831

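Continued from the previous post, houseprice_analysis: Guangzhou Housing Rent-to-Sale Ratio Analysis (Part 1):
https://blog.csdn.net/weixin_44216391/article/details/106457799

This post is again the complete, unpolished code. Once the whole analysis is finished, and if time and mood allow, I will distill the key working code from these three long-winded posts into a new, shorter, more readable one.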


"""
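From the above, to compare rental and sale prices, the information the two datasets share is:
salehouse: district and block (e.g. 番禺-新塘北), community name (e.g. 锦绣天伦花园), size (e.g. 90.58平米), building year (e.g. 2015年建), total price (e.g. 178万), unit price (e.g. 19652元/平米)
lendhouse: district and block (e.g. 黄埔-科学城), community name (e.g. 沙湾新村), size (e.g. 18㎡), monthly rent (e.g. 1000 元/月)
For lendhouse, the rental type (whole unit or shared) also has to be taken into account.

After cleaning and merging the data, what we ultimately want is:
For the same community: sale price per square metre / rental price per square metre. This can be grouped and stratified by building age, district, and block.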
"""

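The target metric described above can be sketched on a few hand-made rows. This is only an illustration under stated assumptions: the numbers are invented, and the column names `name`, `unit_price`, `rent` and `area` are placeholders rather than the columns of the scraped data.

```python
import pandas as pd

# Hypothetical, hand-made listings (not the scraped data).
sale = pd.DataFrame({
    "name": ["A", "A", "B"],
    "unit_price": [20000.0, 22000.0, 30000.0],   # yuan per square metre
})
rent = pd.DataFrame({
    "name": ["A", "B", "B"],
    "rent": [2100.0, 3000.0, 6000.0],            # yuan per month
    "area": [70.0, 50.0, 100.0],                 # square metres
})

# Monthly rent per square metre for every rental listing.
rent["rent_per_sqm"] = rent["rent"] / rent["area"]

# Average both sides per community, then align them on the community name.
ratio = pd.concat(
    [sale.groupby("name")["unit_price"].mean(),
     rent.groupby("name")["rent_per_sqm"].mean()],
    axis=1, join="inner",
)

# Sale price per square metre divided by monthly rent per square metre.
ratio["sale_to_rent"] = ratio["unit_price"] / ratio["rent_per_sqm"]
print(ratio)
```

Grouping by district or block instead of community would only change the `groupby` key.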
好好的周日下午，阳台外下起了淅淅沥沥的，大雨。先刷CBAP去了，未完待续。——2020.5.31，以上耗时，约摸1小时。

"""
Next, extract the information needed from both datasets.

（A）lendhouse
lendhouse.标题 —— 整租·万科里享家 3室2厅 南
lendhouse.content__list--item--des —— 黄埔-科学城-沙湾新村\n        /\n        18㎡\n        /南        /\n          4室2厅2卫        \n          /\n          低楼层                        （16层）
lendhouse.content__list--item-price —— 2200 元/月


（B）salehouse
salehouse.positionInfo —— 锦绣天伦花园 - 新塘北
salehouse.houseInfo —— 3室2厅 | 90.58平米 | 南 | 精装 | 中楼层(共32层) | 2015年建 | 塔楼
salehouse.totalPrice —— 178万
salehouse.unitPrice —— 单价19652元/平米

"""

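One way to pull the fields out of a `content__list--item--des` string like the sample above is to split on "/" first and only then collapse the whitespace inside each piece. The helper below is a rough sketch, not the approach used in the code that follows; `parse_des` and the keys of the returned dict are made-up names, and it assumes a well-formed row has exactly five "/"-separated pieces and a three-part location.

```python
import re

def parse_des(des):
    """Return the cleaned fields of one description string, or None if malformed."""
    # Split on "/" first, then collapse the whitespace runs inside each piece.
    parts = [re.sub(r"\s+", " ", p).strip() for p in des.split("/")]
    location = parts[0].split("-")
    if len(parts) != 5 or len(location) != 3:
        return None  # e.g. promotional rows that start with "仅剩4间"
    district, block, name = location
    # The last piece looks like "低楼层 （16层）" after whitespace collapsing.
    floor_level, _, total_floors = parts[4].partition(" ")
    return {
        "district": district, "block": block, "name": name,
        "area": parts[1], "direction": parts[2], "housetype": parts[3],
        "floor_level": floor_level, "total_floors": total_floors,
    }

sample = ("黄埔-科学城-沙湾新村\n        /\n        18㎡\n        /南        /\n"
          "          4室2厅2卫        \n          /\n"
          "          低楼层                        （16层）")
print(parse_des(sample))
```

Applied row by row (for example with `Series.apply`), this would avoid splitting on single spaces, at the cost of returning `None` for rows that do not follow the usual layout.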

```python
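import pandas as pd  # assumed to be imported already in Part 1, together with the lendhouse and salehouse DataFrames
# import re  # x.re.split fails with: 'str' object has no attribute 're' (re is a module, not a str attribute), so str.split is used below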
lendhouse_title_split = pd.DataFrame((x.split(r"·") for x in lendhouse['标题']),
                                     index=lendhouse.index)
lendhouse_title_split2 = pd.DataFrame((x.split(r" ") for x in lendhouse_title_split[1]))

lendhouse_title_split.columns=['lendstyle','title1','title2']
lendhouse_title_split2.columns=['name','housetype','direction','extra']

print(lendhouse_title_split.head(),"\n")
print(lendhouse_title_split2.head())

  lendstyle                         title1 title2
0        整租                   万科里享家 3室2厅 南   None
1        合租                    沙湾新村 4居室 南卧   None
2        整租                  雅苑青年公馆 1室1厅 北   None
3        独栋  魔尔公寓 魔尔公寓广州华侨新村店 精装套房可短租 1室1厅   None
4        整租                 华景新城绿茵居 2室1厅 南   None 

      name    housetype direction extra
0    万科里享家         3室2厅         南  None
1     沙湾新村          4居室        南卧  None
2   雅苑青年公馆         1室1厅         北  None
3     魔尔公寓  魔尔公寓广州华侨新村店   精装套房可短租  1室1厅
4  华景新城绿茵居         2室1厅         南  None

# Merge lendhouse_title_split and lendhouse_title_split2

lendhouse_title=pd.merge(lendhouse_title_split.iloc[:,0],lendhouse_title_split2.iloc[:,:4],
                         right_index=True, left_index=True)
print("State of lendhouse_title:\n",lendhouse_title.head())

lendhouse2 = pd.merge(lendhouse_title,lendhouse.iloc[:,1:],right_index=True, left_index=True)
print("\nState of lendhouse2:")
lendhouse2.head()
# At this point the lendhouse title column has been split, cleaned, and merged.

State of lendhouse_title:
   lendstyle     name    housetype direction extra
0        整租    万科里享家         3室2厅         南  None
1        合租     沙湾新村          4居室        南卧  None
2        整租   雅苑青年公馆         1室1厅         北  None
3        独栋     魔尔公寓  魔尔公寓广州华侨新村店   精装套房可短租  1室1厅
4        整租  华景新城绿茵居         2室1厅         南  None

State of lendhouse2:

	lendstyle	name	housetype	direction	extra	content__list--item--des	content__list--item-price	brand
0	整租	万科里享家	3室2厅	南	None	黄埔-科学城-万科里享家\n /\n 78㎡\n ...	2200 元/月	NaN
1	合租	沙湾新村	4居室	南卧	None	黄埔-科学城-沙湾新村\n /\n 18㎡\n /...	1000 元/月	安屋
2	整租	雅苑青年公馆	1室1厅	北	None	番禺-石碁-雅苑青年公馆\n /\n 61㎡\n ...	1580 元/月	链家
3	独栋	魔尔公寓	魔尔公寓广州华侨新村店	精装套房可短租	1室1厅	仅剩4间\n /\n 56㎡\n ...	4000-5500 元/月	魔尔公寓
4	整租	华景新城绿茵居	2室1厅	南	None	天河-华景新城-华景新城绿茵居\n /\n 62㎡\n ...	2200 元/月	NaN

# Another way to merge after splitting. Tested and working; kept here for reference.
# lendhouse_title_split.drop(['title1','title2'], axis=1, inplace=True)
# lendhouse_title_split2.drop(['extra'], axis=1, inplace=True)
# lendhouse.drop(['标题'], axis=1, inplace=True)

# lendhouse_title_split2 = pd.merge(lendhouse_title_split,lendhouse_title_split2,right_index=True, left_index=True)
# lendhouse2 = pd.merge(lendhouse,lendhouse_title_split2,right_index=True, left_index=True)
# lendhouse2.head()

print("Next, split the lendhouse address column content__list--item--des:")
# Sample: '黄埔-科学城-沙湾新村\n        /\n        18㎡\n        /南        /\n          4室2厅2卫        \n          /\n          低楼层                        （16层）'
lendhouse_content_split = pd.DataFrame((x.split("-") for x in lendhouse['content__list--item--des']),
                                     index=lendhouse.index,
                                     columns=['district','area','name2','extra'])
lendhouse_content_split.head()

Next, split the lendhouse address column content__list--item--des:

	district	area	name2	extra
0	黄埔	科学城	万科里享家\n /\n 78㎡\n /南 ...	None
1	黄埔	科学城	沙湾新村\n /\n 18㎡\n /南 ...	None
2	番禺	石碁	雅苑青年公馆\n /\n 61㎡\n /北 ...	None
3	仅剩4间\n /\n 56㎡\n ...	None	None	None
4	天河	华景新城	华景新城绿茵居\n /\n 62㎡\n /南 ...	None

# lendhouse_content_split2 = pd.DataFrame((x.split(r" ") for x in lendhouse_content_split[1]))  # KeyError: 1 (the columns are named, so there is no column labelled 1)
# lendhouse_content_split2 = pd.DataFrame((x.split(r" ") for x in lendhouse_content_split['name2']))  # AttributeError: 'NoneType' object has no attribute 'split' (some rows have None in 'name2')
# lendhouse_title_split2 = pd.DataFrame((x.split(r" ") for x in lendhouse_title_split[1]))  # This earlier statement ran fine and gave the right result; kept here for comparison.
print("Tried three ways of splitting lendhouse_content_split['name2'] on spaces, none of which worked,\nso split lendhouse['content__list--item--des'] again instead, first on the space character:")
lendhouse_content_split2 = pd.DataFrame(x.split(" ") for x in lendhouse['content__list--item--des'])
lendhouse_content_split2.head()

Tried three ways of splitting lendhouse_content_split['name2'] on spaces, none of which worked,
so split lendhouse['content__list--item--des'] again instead, first on the space character:

	0	8	...	89	90	91	92	93	94	95	96	97	98
0	黄埔-科学城-万科里享家\n	/\n	...						（34层）	None	None	None	None
1	黄埔-科学城-沙湾新村\n	/\n	...						（16层）	None	None	None	None
2	番禺-石碁-雅苑青年公馆\n	/\n	...						（5层）	None	None	None	None
3	仅剩4间\n		...	None	None	None	None	None	None	None	None	None	None
4	天河-华景新城-华景新城绿茵居\n	/\n	...						（9层）	None	None	None	None

5 rows × 99 columns

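# Judging by the result above, there are far too many spaces: the split produced 99 columns. It looks like the spaces inside each cell should be removed first, with "/" used as the delimiter.

# [Attempt 1: failed]
# def replace(s):
# 	import re
# 	return re.sub(' ','',s)
# a = lendhouse['content__list--item--des']
# a = replace(a)   # TypeError: expected string or bytes-like object (a is a Series, not a str)
# a.head()

# [Attempt 2: failed]
# import re
# a = lendhouse['content__list--item--des']
# rows = a.shape[0]   # prints 3003 rows
# print(rows)
# for i in rows:   # TypeError: 'int' object is not iterable (range(rows) would be needed)
#     re.sub( '\s+', ' ', a[i]).strip()  # This would also replace all tabs, newlines, and other whitespace characters.
# a.head()

# Never mind the number of columns: view them in batches, find the columns that are needed, and discard the rest.
# Note: the slices below skip columns 20, 40, 60 and 80.
print(lendhouse_content_split2.iloc[:,:20].head(2),"\n")
print(lendhouse_content_split2.iloc[:,21:40].head(2),"\n")
print(lendhouse_content_split2.iloc[:,41:60].head(2),"\n")
print(lendhouse_content_split2.iloc[:,61:80].head(2),"\n")
print(lendhouse_content_split2.iloc[:,81:100].head(2),"\n")

# So the columns that appear to carry information are [0,18,24,42,70,94]. Next, extract just these columns to confirm.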

                0 1 2 3 4 5 6 7    8 9 10 11 12 13 14 15     16 17 18 19
0  黄埔-科学城-万科里享家\n                /\n                      78㎡\n         
1   黄埔-科学城-沙湾新村\n                /\n                      18㎡\n          

  21 22 23  24 25 26 27 28 29 30 31   32 33 34 35 36 37 38 39
0           /南                       /\n                     
1           /南                       /\n                      

  41      42 43 44 45 46 47 48 49  50 51 52 53 54 55 56 57 58 59
0     3室2厅1卫                       \n                           
1     4室2厅2卫                       \n                            

  61 62 63 64 65 66 67 68 69   70 71 72 73 74 75 76 77 78 79
0                             中楼层                           
1                             低楼层                            

  81 82 83 84 85 86 87 88 89 90 91 92 93     94    95    96    97    98
0                                         （34层）  None  None  None  None
1                                         （16层）  None  None  None  None

print(lendhouse_content_split2.iloc[:,[0,18,24,42,70,94]].head(2),"\n")
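# Column 18 turns out not to contain the area, so the area is not in column 18. Narrow the range and look again:
print(lendhouse_content_split2.iloc[:,5:20].head(2),"\n")
# Clearer this time: it should be column 16. Print column 16 to confirm:
print(lendhouse_content_split2.iloc[:,[0,16,24,42,70,94]].head(2))
# Now the area shows up. Done.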

               0  18  24      42   70     94
0  黄埔-科学城-万科里享家\n     /南  3室2厅1卫  中楼层  （34层）
1   黄埔-科学城-沙湾新村\n     /南  4室2厅2卫  低楼层  （16层） 

  5 6 7    8 9 10 11 12 13 14 15     16 17 18 19
0        /\n                      78㎡\n         
1        /\n                      18㎡\n          

               0      16  24      42   70     94
0  黄埔-科学城-万科里享家\n  78㎡\n  /南  3室2厅1卫  中楼层  （34层）
1   黄埔-科学城-沙湾新村\n  18㎡\n  /南  4室2厅2卫  低楼层  （16层）

# The extracted columns still need some cleanup: for statistics and tidiness, the "\n" and "/" characters should be removed.

lendhouse_content_split3 = lendhouse_content_split2.iloc[:,[0,16,24,42,70,94]]
lendhouse_content_split3.columns=['location_name','area','direction','housetype','stair_type','stairs']
print("Before replace:\n",lendhouse_content_split3.head(2),"\n")

# lendhouse_content_split3 = lendhouse_content_split3.map(lambda x: x.replace("/n",""))   
# AttributeError: 'DataFrame' object has no attribute 'map'

lendhouse_content_split3 = lendhouse_content_split3.replace("/n","")
print("First replace:\n",lendhouse_content_split3.head(2)) 
# The replacement did not take effect: the printed result still contains the newline.
# Two reasons: "/n" is not the newline escape "\n", and DataFrame.replace without regex=True only replaces cells whose entire value matches.
# The first column, location_name, still needs to be split; do that first below.

Before replace:
     location_name   area direction housetype stair_type stairs
0  黄埔-科学城-万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1   黄埔-科学城-沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层） 

First replace:
     location_name   area direction housetype stair_type stairs
0  黄埔-科学城-万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1   黄埔-科学城-沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）

# lendhouse_content_split4 = pd.DataFrame(x.split("-") for x in lendhouse_content_split3[0])  # KeyError: 0 (kept here for comparison)

lendhouse_content_split4 = pd.DataFrame(x.split("-") for x in lendhouse_content_split3['location_name'])
lendhouse_content_split4.columns=['district','板块','name','none1']
lendhouse_content_split4.head()

	district	板块	name	none1
0	黄埔	科学城	万科里享家\n	None
1	黄埔	科学城	沙湾新村\n	None
2	番禺	石碁	雅苑青年公馆\n	None
3	仅剩4间\n	None	None	None
4	天河	华景新城	华景新城绿茵居\n	None

# Merge lendhouse_content_split3 and lendhouse_content_split4
lendhouse_content_split5 = pd.merge(lendhouse_content_split4.iloc[:,:3],lendhouse_content_split3.iloc[:,1:6],
                         right_index=True, left_index=True)
print("State of lendhouse_content:\n",lendhouse_content_split5.head())

# Next, find a way to remove the "\n" and "/" characters.

State of lendhouse_content:
   district    板块       name   area direction housetype stair_type stairs
0       黄埔   科学城    万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔   科学城     沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）
2       番禺    石碁   雅苑青年公馆\n  61㎡\n        /北    1室1厅1卫        中楼层   （5层）
3   仅剩4间\n  None       None                        /\n       None   None
4       天河  华景新城  华景新城绿茵居\n  62㎡\n        /南    2室1厅1卫        低楼层   （9层）

# lendhouse_content_split5['area'] = lendhouse_content_split5['area'].replace("\n","")
lendhouse_content_split5 = lendhouse_content_split5.replace("\r\n","")
print("First replace:\n",lendhouse_content_split5.head(2))

# lendhouse_content_split5['direction'] = lendhouse_content_split5['direction'].replace("/","")
lendhouse_content_split5 = lendhouse_content_split5.replace("/","")
print("\nSecond replace:\n",lendhouse_content_split5.head(2))

# The replace function still has no effect. Next, see whether the substring before or after a specific character can be extracted directly as the new content.
# lendhouse_content_split5.to_excel(total_path+"\\lendhouse_content_split5"+".xlsx", encoding='utf-8', index=False, header=True)

First replace:
   district   板块     name   area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）

Second replace:
   district   板块     name   area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）

# import re
# # lendhouse_content_split5['direction'] = re.findall(r'/*', lendhouse_content_split5['direction']) 
# # # The line above raises: error: nothing to repeat at position 0
# # print("First re.findall:\n",lendhouse_content_split5.head(2))

# lendhouse_content_split5['area'] = re.findall(r'*\n', lendhouse_content_split5['area'])  
# # The line above raises: error: nothing to repeat at position 0
# print("\nSecond re.findall:\n",lendhouse_content_split5.head(2))

# lendhouse_content_split5['area'] = lendhouse_content_split5['area'].replace("\n","")
lendhouse_content_split5 = lendhouse_content_split5.replace("\r\n","")
lendhouse_content_split5 = lendhouse_content_split5.replace("\n","")
print("First replace:\n",lendhouse_content_split5.head(2))

# lendhouse_content_split5['direction'] = lendhouse_content_split5['direction'].replace("/","")
lendhouse_content_split5 = lendhouse_content_split5.replace("/","")
print("\nSecond replace:\n",lendhouse_content_split5.head(2))

First replace:
   district   板块     name   area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）

Second replace:
   district   板块     name   area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）
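
# Aside (a hedged sketch on a toy frame, not the post's data):
# DataFrame.replace("x", "") only replaces cells whose whole value equals "x".
# To remove a substring inside each cell, pass regex=True so that the pattern
# is applied within the strings. The names toy, whole_cell and substring are
# made up for this illustration.
import pandas as pd

toy = pd.DataFrame({"name": ["万科里享家\n", "沙湾新村\n"], "direction": ["/南", "/南"]})
whole_cell = toy.replace("\n", "")                 # no cell equals "\n", so nothing changes
substring = toy.replace(r"[\n/]", "", regex=True)  # strips newlines and slashes inside cells
print(substring)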

# Tried replace("\n","") on a plain string; it works without problems, as shown below:
a="万科里享家\n沙湾新村"
print("original:\n",a)
print("-----")
b=a.replace("\n","")
print("updated:\n",b)

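# So the earlier replacements may have failed because lendhouse_content_split5 is a DataFrame rather than a string?
# Next, try applying the replacement element by element in lendhouse_content_split5.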

original:
 万科里享家
沙湾新村
-----
updated:
 万科里享家沙湾新村

# Following this link, it still does not fully work: 《去掉dataframe的换行符》https://blog.csdn.net/lvluobo/article/details/103527148
lendhouse_content_split5['name'] = lendhouse_content_split5['name'].replace("\n","").replace("\r","").replace("\r\n","")
print("First replace:\n",lendhouse_content_split5.head(2))

lendhouse_content_split5['area'] = lendhouse_content_split5['area'].apply(
    lambda x:x.replace('\n', '').replace('\r', '').replace("\r\n",""))
print("Second replace:\n",lendhouse_content_split5.head(2))

lendhouse_content_split5['direction'] = lendhouse_content_split5['direction'].apply(lambda x:x.replace('/', '').replace('\n', '').replace('\r', ''))
print("Third replace:\n",lendhouse_content_split5.head(2))

# The apply calls above removed "/" from ['direction'] and the newline from ['area'],
# but the newline "\n" in ['name'] is still there: Series.replace, like DataFrame.replace, only matches whole cell values.

First replace:
   district   板块     name   area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡\n        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡\n        /南    4室2厅2卫        低楼层  （16层）
Second replace:
   district   板块     name area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡        /南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡        /南    4室2厅2卫        低楼层  （16层）
Third replace:
   district   板块     name area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家\n  78㎡         南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村\n  18㎡         南    4室2厅2卫        低楼层  （16层）

# Following this link, success! 《python如何去掉换行符》https://www.py.cn/jishu/jichu/12333.html

lendhouse_content_split5['name'] = lendhouse_content_split5['name'].str.rstrip()  
# Tested:
# (1) If the statement above is run after the attempts in the previous cell (《去掉dataframe的换行符》https://blog.csdn.net/lvluobo/article/details/103527148),
# no newlines are left in any column of lendhouse_content_split5, not only ['name'] (the apply calls there had already cleaned ['area'] and ['direction']).
# (2) If the statement above is run without those earlier attempts,
# it removes the newline only from the ['name'] column, not from the ['area'] column.

# lendhouse_content_split5 = lendhouse_content_split5.str.rstrip() 
# # Tried this after the ['name'] newline removal succeeded, but it raises AttributeError: 'DataFrame' object has no attribute 'str'

print("Newlines removed successfully:\n",lendhouse_content_split5.head(2))

Newlines removed successfully:
   district   板块   name area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家  78㎡         南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村  18㎡         南    4室2厅2卫        低楼层  （16层）

# In summary, starting again from the merge that builds lendhouse_content_split5, removing "\n" and "/" takes only three statements (one per column):

lendhouse_content_split5 = pd.merge(lendhouse_content_split4.iloc[:,:3],lendhouse_content_split3.iloc[:,1:6],
                         right_index=True, left_index=True)
lendhouse_content_split5['direction'] = lendhouse_content_split5['direction'].apply(lambda x:x.replace('/', ''))
lendhouse_content_split5['name'] = lendhouse_content_split5['name'].str.rstrip() 
lendhouse_content_split5['area'] = lendhouse_content_split5['area'].str.rstrip() 
print(r'Removed the “\n” and “/” characters:',"\n",lendhouse_content_split5.head(2))    # Try removing the "r" prefix from the string to see how the output differs.

import re
# lendhouse_content_split5['area'] = lendhouse_content_split5['area'].apply(lambda x:x.re.sub("/D","",str))  
# AttributeError: 'str' object has no attribute 're'

# lendhouse_content_split5['stairs'] = lendhouse_content_split5['stairs'].re.sub("/D","")
# # AttributeError: 'Series' object has no attribute 're'

# lendhouse_content_split5['area'] = lendhouse_content_split5['area'].apply(lambda x:x.re.findall("\d+",ss)[0])  
# # AttributeError: 'str' object has no attribute 're'

# Following this link, try again: 
# https://blog.csdn.net/weixin_42132740/article/details/106860487 Python 《如何把DataFrame(Excel表格或CSV表格)中的某一列中的数字提取出来，生成新的一列。（正则表达式/Replace）》
# .str exposes the string methods of the column
# .replace(r'[^0-9]', '') removes every non-digit character, so only the digits remain
# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
lendhouse_content_split5['area']  = lendhouse_content_split5['area'].str.replace(r'[^0-9]', '', regex=True)
lendhouse_content_split5['stairs']  = lendhouse_content_split5['stairs'].str.replace(r'[^0-9]', '', regex=True)

# With the units stripped and only digits left, record the units in the column names for clarity
lendhouse_content_split5.rename(columns={'area':'area/㎡','stairs':'stairs/层'},inplace=True)
print("\nNext, strip the area unit and extract the floor count:\n",lendhouse_content_split5.head(2))

# lendhouse_content_split5.columns
print("\nFirst two samples of the housetype column:\n",lendhouse_content_split5['housetype'][0:2])

# Add a column for the number of rooms in the layout
# lendhouse_content_split5['housetpye2']=lendhouse_content_split5['housetype'].apply(lambda x:x[0])   # IndexError: string index out of range (x[0] fails on an empty string)
lendhouse_content_split5['housetpyesize']=lendhouse_content_split5['housetype'].apply(lambda x:x[0:1]) # slicing with [0:1] works even when the string is empty
print("\nAdd a column for the number of rooms:\n")
lendhouse_content_split5.head(2)

# At this point the lendhouse_content cleanup is complete; only the merge remains.

Removed the “\n” and “/” characters: 
   district   板块   name area direction housetype stair_type stairs
0       黄埔  科学城  万科里享家  78㎡         南    3室2厅1卫        中楼层  （34层）
1       黄埔  科学城   沙湾新村  18㎡         南    4室2厅2卫        低楼层  （16层）

Next, strip the area unit and extract the floor count:
   district   板块   name area/㎡ direction housetype stair_type stairs/层
0       黄埔  科学城  万科里享家     78         南    3室2厅1卫        中楼层       34
1       黄埔  科学城   沙湾新村     18         南    4室2厅2卫        低楼层       16

First two samples of the housetype column:
 0    3室2厅1卫
1    4室2厅2卫
Name: housetype, dtype: object

Add a column for the number of rooms:

	district	板块	name	area/㎡	direction	housetype	stair_type	stairs/层	housetpyesize
0	黄埔	科学城	万科里享家	78	南	3室2厅1卫	中楼层	34	3
1	黄埔	科学城	沙湾新村	18	南	4室2厅2卫	低楼层	16	4

print("The lendhouse_content cleanup is complete; only the merge remains. Below is the merged lendhouse3\n")

lendhouse3 = pd.merge(lendhouse2.drop(["content__list--item--des"],axis=1),lendhouse_content_split5,
                         right_index=True, left_index=True)
lendhouse3

The lendhouse_content cleanup is complete; only the merge remains. Below is the merged lendhouse3

	lendstyle	name_x	housetype_x	direction_x	extra	content__list--item-price	brand	district	板块	name_y	area/㎡	direction_y	housetype_y	stair_type	stairs/层	housetpyesize
0	整租	万科里享家	3室2厅	南	None	2200 元/月	NaN	黄埔	科学城	万科里享家	78	南	3室2厅1卫	中楼层	34	3
1	合租	沙湾新村	4居室	南卧	None	1000 元/月	安屋	黄埔	科学城	沙湾新村	18	南	4室2厅2卫	低楼层	16	4
2	整租	雅苑青年公馆	1室1厅	北	None	1580 元/月	链家	番禺	石碁	雅苑青年公馆	61	北	1室1厅1卫	中楼层	5	1
3	独栋	魔尔公寓	魔尔公寓广州华侨新村店	精装套房可短租	1室1厅	4000-5500 元/月	魔尔公寓	仅剩4间\n	None	None			/\n	None	None	/
4	整租	华景新城绿茵居	2室1厅	南	None	2200 元/月	NaN	天河	华景新城	华景新城绿茵居	62	南	2室1厅1卫	低楼层	9	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2998	合租	华景新城芳满庭园	5居室	北卧	None	1430 元/月	自如	天河	华景新城	华景新城芳满庭园	7	北	5室1厅2卫	中楼层	36	5
2999	整租	西城花园	3室1厅	南	None	2300 元/月	NaN	番禺	市桥	西城花园	96	南	3室1厅1卫	中楼层	8	3
3000	合租	梅花村	3居室	南卧	None	2090 元/月	自如	越秀	东风东	梅花村	11	南	3室1厅1卫	中楼层	8	3
3001	合租	广州雅居乐花园雅悦庭	4居室	西南卧	None	1330 元/月	自如	番禺	雅居乐	广州雅居乐花园雅悦庭	10	西南	4室1厅2卫	低楼层	9	4
3002	整租	林和邨	2室1厅	南	None	4200 元/月	恋家公寓	天河	林和	林和邨	60	南	2室1厅1卫	高楼层	50	2

3003 rows × 16 columns
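
# Aside (a hedged sketch on toy frames, not the post's data):
# merging on the index keeps both copies of any column name present on both
# sides and adds _x / _y suffixes, which is where name_x / name_y above come from.
# One option is to bring over only the columns the left frame does not already
# have; whether the left or the right copy is the better one to keep is a
# judgment call that depends on the data. left_toy, right_toy, new_cols and
# merged_toy are made-up names.
import pandas as pd

left_toy = pd.DataFrame({"name": ["a", "b"], "price": [1, 2]})
right_toy = pd.DataFrame({"name": ["a", "b"], "area": [10, 20]})
new_cols = right_toy.columns.difference(left_toy.columns)   # only 'area' is new
merged_toy = left_toy.join(right_toy[new_cols])              # index-aligned, no suffixes
print(merged_toy.columns.tolist())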

print("Next, extract the monthly rent figure from ['content__list--item-price']:\n")
lendhouse3.rename(columns={'content__list--item-price':'lendprice元/月'},inplace=True)
lendhouse3['lendprice元/月'] = lendhouse3['lendprice元/月'].str.replace(r'元/月', '')
print(lendhouse3['lendprice元/月'].head(2))

lendhouse3.head(2)

Next, extract the monthly rent figure from ['content__list--item-price']:

0    2200 
1    1000 
Name: lendprice元/月, dtype: object

	lendstyle	name_x	housetype_x	direction_x	extra	lendprice元/月	brand	district	板块	name_y	area/㎡	direction_y	housetype_y	stair_type	stairs/层	housetpyesize
0	整租	万科里享家	3室2厅	南	None	2200	NaN	黄埔	科学城	万科里享家	78	南	3室2厅1卫	中楼层	34	3
1	合租	沙湾新村	4居室	南卧	None	1000	安屋	黄埔	科学城	沙湾新村	18	南	4室2厅2卫	低楼层	16	4
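
# Aside (a hedged sketch on toy values, not the post's data):
# the cleaned columns are still strings (dtype object). pd.to_numeric with
# errors="coerce" converts them and turns anything unparseable, such as the
# range "4000-5500 " or an empty string, into NaN. price_toy and price_num
# are made-up names.
import pandas as pd

price_toy = pd.Series(["2200 ", "1000 ", "4000-5500 ", ""])
price_num = pd.to_numeric(price_toy.str.strip(), errors="coerce")
print(price_num)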

# Next, clean salehouse
salehouse2 = salehouse.copy()  # copy, so that the in-place renames below leave salehouse itself untouched
salehouse2.head(2)

	标题	positionInfo	houseInfo	followInfo	totalPrice	unitPrice
0	地铁口锦绣天伦花园精致小三房南向视野开阔	锦绣天伦花园 - 新塘北	3室2厅 \| 90.58平米 \| 南 \| 精装 \| 中楼层(共32层) \| 2015年建 \| 塔楼	63人关注 / 5个月以前发布	178万	单价19652元/平米
1	恒大山水城南向3房，位置好。总价低	恒大山水城 - 中新镇	3室2厅 \| 99平米 \| 东南 \| 精装 \| 高楼层(共18层) \| 2008年建 \| 板塔结合	421人关注 / 7个月以前发布	135万	单价13637元/平米

# Strike while the iron is hot: first extract the numbers that carry units
print("Extract the unit-price and total-price figures from ['totalPrice'] and ['unitPrice']:\n")
salehouse2.rename(columns={'totalPrice':'totalPrice（万）'},inplace=True)
salehouse2['totalPrice（万）'] = salehouse2['totalPrice（万）'].str.replace(r'万', '')
salehouse2.rename(columns={'unitPrice':'unitPrice元/平'},inplace=True)
salehouse2['unitPrice元/平'] = salehouse2['unitPrice元/平'].str.replace(r'元/平米', '')
salehouse2['unitPrice元/平'] = salehouse2['unitPrice元/平'].str.replace(r'单价', '')
print(salehouse2[['totalPrice（万）','unitPrice元/平']].head(2))

salehouse2.head(2)

Extract the unit-price and total-price figures from ['totalPrice'] and ['unitPrice']:

  totalPrice（万） unitPrice元/平
0           178        19652
1           135        13637

	标题	positionInfo	houseInfo	followInfo	totalPrice（万）	unitPrice元/平
0	地铁口锦绣天伦花园精致小三房南向视野开阔	锦绣天伦花园 - 新塘北	3室2厅 \| 90.58平米 \| 南 \| 精装 \| 中楼层(共32层) \| 2015年建 \| 塔楼	63人关注 / 5个月以前发布	178	19652
1	恒大山水城南向3房，位置好。总价低	恒大山水城 - 中新镇	3室2厅 \| 99平米 \| 东南 \| 精装 \| 高楼层(共18层) \| 2008年建 \| 板塔结合	421人关注 / 7个月以前发布	135	13637

# Next: split the [标题], [positionInfo], and [houseInfo] columns