After the earlier small exercises, today I'm taking on a relatively more complex little project. I recently saw a news story claiming Shenzhen housing prices are in a cliff-style fall, with the average price dropping by a whole 46 yuan per month... So I decided to try scraping real for-sale listing data from the web and, with a bit of data analysis, build a small decision aid for friends looking to buy a home in Shenzhen.
First, a quick Baidu search for the top 3 property-listing sites (with some skepticism toward Baidu's paid-ranking results $_$).
After some screening, I decided to scrape listing-price data from three sites: Lianjia (链家), Qfang (Q房网), and Fang.com (房天下).
First, the code for scraping Lianjia:
from bs4 import BeautifulSoup
import requests
import csv
from requests.exceptions import RequestException

def get_one_page(page):
    url = "https://sz.lianjia.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'sz.lianjia.com',
        'Referer': 'https://www.lianjia.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'ershoufang/' + 'pg' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        # if the request itself failed there is no response object, so report the exception
        print("error: " + str(e))
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, area, average price, and the detail-page link
    for item in soup.select('li.clear'):
        detailed_info = item.select('div.houseInfo')[0].text
        community_name = detailed_info.split('|')[0].strip()
        area = detailed_info.split('|')[2].strip()
        average_price = item.select('div.unitPrice span')[0].text
        detailed_url = item.select('a')[0].get('href')
        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))

def main():
    get_one_page(2)

if __name__ == '__main__':
    main()
The test output looks like this:
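The csv module is imported above but never actually used; since the goal is later analysis, the printed fields could be persisted to a file instead of just echoed to the console. A minimal sketch, where the sample row and the lianjia.csv filename are made-up placeholders rather than real scraped data:

```python
import csv

# Hypothetical rows in the same order the scraper prints them:
# (community name, area, average price, detail-page URL)
rows = [
    ("某小区", "89.5平米", "55000元/平米", "https://sz.lianjia.com/ershoufang/xxx.html"),
]

# newline="" prevents the csv module from writing blank lines on Windows
with open("lianjia.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["community", "area", "price", "url"])
    writer.writerows(rows)
```

Appending each row inside the scraping loop (mode "a") would let all three sites feed one file for the analysis step.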
Next, the basic code for scraping Qfang:
from bs4 import BeautifulSoup
import requests
import csv
import re
from requests.exceptions import RequestException

def get_one_page(page):
    url = "https://shenzhen.qfang.com/sale/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host': 'shenzhen.qfang.com',
        'Referer': 'https://www.qfang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'f' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        print("error: " + str(e))
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, area, average price, and the detail-page link
    price_list = []
    for item in soup.select('div.show-price'):
        average_price = item.select('p')[0].text
        price_list.append(average_price)
    index = 0
    for item in soup.select('div.show-detail'):
        detailed_url = 'https://shenzhen.qfang.com/sale' + item.select('a')[0].get('href')
        # While scraping the area I found missing data: some listings keep it in
        # the 4th span tag and others in the 5th, so concatenate both and pull
        # the area out with a regex
        regex = re.compile('(.*?)平米')
        result = item.select('span')[3].text + item.select('span')[4].text
        area = re.findall(regex, result)[0]
        community_name = item.find_all(target='_blank')[0].text.split(' ')[0]
        average_price = price_list[index]
        index += 1
        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))

def main():
    get_one_page(1)

if __name__ == '__main__':
    main()
The test output looks like this:
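The span-concatenation trick for the area field is easy to sanity-check in isolation. A quick sketch with made-up span texts (not real Qfang markup):

```python
import re

# Made-up span texts: the area string may sit in either the 4th or the
# 5th span, so the scraper concatenates both before matching.
result = "89.5平米" + "南北"        # e.g. area in the 4th span
regex = re.compile('(.*?)平米')
matches = re.findall(regex, result)
print(matches[0])  # -> 89.5

# A digit-only group is more robust when other text precedes the area,
# e.g. when the area is in the 5th span and the 4th span is a floor label:
strict = re.findall(r'(\d+\.?\d*)平米', "低层" + "89.5平米")
print(strict[0])  # -> 89.5
```

Note that when the area is in the 5th span, the lazy `(.*?)` group also swallows the 4th span's text, so the stricter digit-only pattern in the second example is the safer choice.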
Finally, the basic code for scraping Fang.com:

from bs4 import BeautifulSoup
import requests
import csv
from requests.exceptions import RequestException
import re

def get_one_page(page):
    url = "http://esf.sz.fang.com/house/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'esf.sz.fang.com',
        'Referer': 'https://www.fang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    new_url = url + 'i3' + str(page)
    try:
        response = requests.get(new_url, headers=headers)
    except RequestException as e:
        print("error: " + str(e))
        return
    # Extract the fields with a regex:
    # regax = re.compile('
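The extraction regex itself is left unfinished above. For completeness, here is a hedged sketch of what regex-based extraction generally looks like; the HTML fragment, class names, and pattern below are invented for illustration and do not reflect esf.sz.fang.com's actual markup:

```python
import re

# Invented listing fragment for illustration; the real page markup will differ.
html = '''
<dd>
  <p class="title"><a href="/chushou/3_123.htm" target="_blank">某小区 3室2厅</a></p>
  <div class="area">89平米</div>
  <div class="price">520万</div>
</dd>
'''

# re.S lets '.' span newlines, so one pattern can walk the whole fragment
pattern = re.compile(
    '<a href="(.*?)".*?>(.*?)</a>.*?'
    '<div class="area">(.*?)平米</div>.*?'
    '<div class="price">(.*?)万</div>',
    re.S)

listings = re.findall(pattern, html)
for url, title, area, price in listings:
    print(url, title, area, price)
```

Each tuple from `findall` carries the four captured groups in order, which maps cleanly onto the same tab-separated print used for the other two sites.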