After the earlier small exercises, today I'm taking on a relatively more complex little project. I recently saw a news story claiming Shenzhen housing prices are in a cliff-style fall, with the average price dropping by a whole 46 yuan per month... So I decided to try scraping real for-sale listing data from the web and, with a bit of data analysis, build a small decision aid for friends looking to buy a home in Shenzhen.
First, a quick Baidu search for the top 3 property-listing sites (with some skepticism toward Baidu's paid-ranking results $_$).
After some screening, I decided to scrape listing-price data from three sites: Lianjia (链家), Qfang (Q房网), and Fang.com (房天下).
First, the code for scraping Lianjia:
from bs4 import BeautifulSoup
import requests
import csv
from requests.exceptions import RequestException

def get_one_page(page):
    url = "https://sz.lianjia.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'sz.lianjia.com',
        'Referer': 'https://www.lianjia.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'ershoufang/' + 'pg' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        # if the request itself failed there is no response object, so report the exception
        print("error: " + str(e))
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, area, average price, and the detail-page link
    for item in soup.select('li.clear'):
        detailed_info = item.select('div.houseInfo')[0].text
        community_name = detailed_info.split('|')[0].strip()
        area = detailed_info.split('|')[2].strip()
        average_price = item.select('div.unitPrice span')[0].text
        detailed_url = item.select('a')[0].get('href')
        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))

def main():
    get_one_page(2)

if __name__ == '__main__':
    main()
The test output looks like this:
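The csv module is imported above but never actually used; since the goal is later analysis, the printed fields could be persisted to a file instead of just echoed to the console. A minimal sketch, where the sample row and the lianjia.csv filename are made-up placeholders rather than real scraped data:

```python
import csv

# Hypothetical rows in the same order the scraper prints them:
# (community name, area, average price, detail-page URL)
rows = [
    ("某小区", "89.5平米", "55000元/平米", "https://sz.lianjia.com/ershoufang/xxx.html"),
]

# newline="" prevents the csv module from writing blank lines on Windows
with open("lianjia.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["community", "area", "price", "url"])
    writer.writerows(rows)
```

Appending each row inside the scraping loop (mode "a") would let all three sites feed one file for the analysis step.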
Next, the basic code for scraping Qfang:
from bs4 import BeautifulSoup
import requests
import csv
import re
from requests.exceptions import RequestException

def get_one_page(page):
    url = "https://shenzhen.qfang.com/sale/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host': 'shenzhen.qfang.com',
        'Referer': 'https://www.qfang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'f' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        print("error: " + str(e))
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, area, average price, and the detail-page link
    price_list = []
    for item in soup.select('div.show-price'):
        average_price = item.select('p')[0].text
        price_list.append(average_price)
    index = 0
    for item in soup.select('div.show-detail'):
        detailed_url = 'https://shenzhen.qfang.com/sale' + item.select('a')[0].get('href')
        # While scraping the area I found missing data: some listings keep it in
        # the 4th span tag and others in the 5th, so concatenate both and pull
        # the area out with a regex
        regex = re.compile('(.*?)平米')
        result = item.select('span')[3].text + item.select('span')[4].text
        area = re.findall(regex, result)[0]
        community_name = item.find_all(target='_blank')[0].text.split(' ')[0]
        average_price = price_list[index]
        index += 1
        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))

def main():
    get_one_page(1)

if __name__ == '__main__':
    main()
The test output looks like this:
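The span-concatenation trick for the area field is easy to sanity-check in isolation. A quick sketch with made-up span texts (not real Qfang markup):

```python
import re

# Made-up span texts: the area string may sit in either the 4th or the
# 5th span, so the scraper concatenates both before matching.
result = "89.5平米" + "南北"        # e.g. area in the 4th span
regex = re.compile('(.*?)平米')
matches = re.findall(regex, result)
print(matches[0])  # -> 89.5

# A digit-only group is more robust when other text precedes the area,
# e.g. when the area is in the 5th span and the 4th span is a floor label:
strict = re.findall(r'(\d+\.?\d*)平米', "低层" + "89.5平米")
print(strict[0])  # -> 89.5
```

Note that when the area is in the 5th span, the lazy `(.*?)` group also swallows the 4th span's text, so the stricter digit-only pattern in the second example is the safer choice.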
Finally, the basic code for scraping Fang.com:

from bs4 import BeautifulSoup
import requests
import csv
from requests.exceptions import RequestException
import re

def get_one_page(page):
    url = "http://esf.sz.fang.com/house/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'esf.sz.fang.com',
        'Referer': 'https://www.fang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'i3' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        print("error: " + str(e))
        return
    # Extract the fields with a regex:
    # regax = re.compile('
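The extraction regex itself is left unfinished above. For completeness, here is a hedged sketch of what regex-based extraction generally looks like; the HTML fragment, class names, and pattern below are invented for illustration and do not reflect esf.sz.fang.com's actual markup:

```python
import re

# Invented listing fragment for illustration; the real page markup will differ.
html = '''
<dd>
  <p class="title"><a href="/chushou/3_123.htm" target="_blank">某小区 3室2厅</a></p>
  <div class="area">89平米</div>
  <div class="price">520万</div>
</dd>
'''

# re.S lets '.' span newlines, so one pattern can walk the whole fragment
pattern = re.compile(
    '<a href="(.*?)".*?>(.*?)</a>.*?'
    '<div class="area">(.*?)平米</div>.*?'
    '<div class="price">(.*?)万</div>',
    re.S)

listings = re.findall(pattern, html)
for url, title, area, price in listings:
    print(url, title, area, price)
```

Each tuple from `findall` carries the four captured groups in order, which maps cleanly onto the same tab-separated print used for the other two sites.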