Python Web Scraping: 58.com Rental Listings
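I recently got a call from my landlord saying the rent will go up by 200 yuan starting next month. At first I thought I had misheard, but I double-checked and yes, it really is going up by 200. That is far more than I expected: in past years the rent rose by at most 50 yuan around the New Year, never by much. This time it is over the top: 200/950 ≈ 0.21, a 21% increase in one go. Many neighbors could not handle that kind of rent pressure and have already moved out. So I decided to look for a new place on 58.com!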
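First, the library we need is requests_html. If you do not have it installed, you can install it with pip: pip install requests_html
Let's start by analyzing the page source of 58.com. The XPath expression we will use is: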
xpath('//div[@class="des"]/h2/a/text()|//div[@class="money"]/b/text()')
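We mainly want the values of these two elements: the title of the rental listing and the price of the place.
58.com (58同城) has anti-scraping measures in place, so you must send a browser User-Agent. Press F12 in the browser and copy all of the request header information: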
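To make the two branches of that XPath concrete, here is a minimal sketch that runs them against a made-up snippet. The markup in SAMPLE is hypothetical and much simpler than the real 58.com page, and it uses the standard library's ElementTree rather than requests_html. ElementTree does not support `text()` or the `|` union, so each branch is queried separately and the text is read from the matched elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real 58.com page is larger and may differ.
SAMPLE = """
<ul>
  <li>
    <div class="des"><h2><a href="#"> Bright one-bedroom near the metro </a></h2></div>
    <div class="money"><b>1500</b></div>
  </li>
  <li>
    <div class="des"><h2><a href="#"> Shared flat, single room </a></h2></div>
    <div class="money"><b>950</b></div>
  </li>
</ul>
"""


def select_titles_and_prices(markup):
    """Return (titles, prices) matched by the two branches of the XPath."""
    root = ET.fromstring(markup)
    # Branch 1 of the expression: //div[@class="des"]/h2/a/text()
    titles = [a.text.strip() for a in root.findall(".//div[@class='des']/h2/a")]
    # Branch 2 of the expression: //div[@class="money"]/b/text()
    prices = [b.text.strip() for b in root.findall(".//div[@class='money']/b")]
    return titles, prices


titles, prices = select_titles_and_prices(SAMPLE)
print(titles)
print(prices)
```

With the `|` union, a full XPath engine would return both sets of text nodes in a single list, in document order.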
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'f=n; commontopbar_new_city_info=4%7C%E6%B7%B1%E5%9C%B3%7Csz; id58=c5/njVuDxAqjH5OoCwE9Ag==; 58tj_uuid=51b41c17-7643-497b-8033-83b5aceb1a74; myfeet_tooltip=end; gr_user_id=b79faef8-92d8-4245-9f03-5e8ff56780b8; als=0; _ga=GA1.2.1544465273.1535445840; Hm_lvt_3013163ef40dcfa5b06ea83e8a1a797f=1535445860; wmda_uuid=0b663303d148f60fcd4d0256cf10284f; xxzl_deviceid=VdSgZJDpbzgP4ZSkU%2BPGCv5ExYaxjPP8jFUgkuOdga288mvtD9bDS%2F6EMIeAvXa%2B; Hm_lvt_3f405f7f26b8855bc0fd96b1ae92db7e=1535966739; wmda_visited_projects=%3B1409632296065%3B2385390625025; f=n; sessionid=4ac06013-82df-4ef9-850e-bd04a68f0bfb; mcity=sz; cancelbut_city=true; 58home=sz; city=sz; commontopbar_new_city_info=4%7C%E6%B7%B1%E5%9C%B3%7Csz; commontopbar_ipcity=sz%7C%E6%B7%B1%E5%9C%B3%7C0; _gid=GA1.2.793405159.1536132807; ipcity=sz%7C%u6DF1%u5733; Hm_lvt_e2d6b2d0ec536275bb1e37b421085803=1535445897,1536132833; Hm_lpvt_e2d6b2d0ec536275bb1e37b421085803=1536132833; GA_GTID=0d4000fc-0071-2824-757d-d1a7ace48f37; final_history=35297673109560%2C31329621877689%2C35145569746859%2C35250016832811; bai=16.; UM_distinctid=165a8de8c3bc6-0c548f018b02cc-3c604504-100200-165a8de8c4377; spm=u-LscBIm_2J9tMeMj.psy_111; utm_source=link; new_uv=5; init_refer=; wmda_session_id_2385390625025=1536136490143-ea72f913-60b4-9c7b; new_session=0; xzfzqtoken=8rEhCRCweZtJQraSUTcpLI%2FLkuFKPLzAC9JWw2OZepOzkBIgGBEuPuvDGuPD6AyAin35brBb%2F%2FeSODvMgkQULA%3D%3D; defraudName=defraud; ppStore_fingerprint=E0D010CC512D5EC95743B55994AAB96DD2BB0115C1E0B710%EF%BC%BF1536136615477',
'DNT': '1',
'Host': 'sz.58.com',
'Referer': 'http://sz.58.com/chuzu/?utm_source=link&spm=u-LscBIm_2J9tMeMj.psy_111&PGTID=0d100000-0000-4303-c40e-43b07349cb6e&ClickID=2',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
Without the request headers you will find that nothing gets scraped at all, so be sure to keep this in mind!
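Typing the dictionary by hand is tedious, so one option is to paste the header block copied from the browser into a string and convert it. The sketch below is only an illustration: `parse_raw_headers` is a hypothetical helper, not part of requests_html, and RAW is a shortened sample of the headers shown above.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'
        if not line or line.startswith(':'):
            continue
        # Split at the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Shortened sample; paste the full block from the F12 network panel here.
RAW = """
Accept-Language: zh-CN,zh;q=0.9
Host: sz.58.com
Referer: http://sz.58.com/chuzu/
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
"""

headers = parse_raw_headers(RAW)
print(headers['Host'])
```

The resulting dictionary can be passed as the `headers` argument of a request in the same way as the hand-written one.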
Here is the code. It is for reference only, and corrections are welcome:
from requests_html import HTMLSession

session = HTMLSession()


def fang_58():
    # The request headers are required
    headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'f=n; commontopbar_new_city_info=4%7C%E6%B7%B1%E5%9C%B3%7Csz; id58=c5/njVuDxAqjH5OoCwE9Ag==; 58tj_uuid=51b41c17-7643-497b-8033-83b5aceb1a74; myfeet_tooltip=end; gr_user_id=b79faef8-92d8-4245-9f03-5e8ff56780b8; als=0; _ga=GA1.2.1544465273.1535445840; Hm_lvt_3013163ef40dcfa5b06ea83e8a1a797f=1535445860; wmda_uuid=0b663303d148f60fcd4d0256cf10284f; xxzl_deviceid=VdSgZJDpbzgP4ZSkU%2BPGCv5ExYaxjPP8jFUgkuOdga288mvtD9bDS%2F6EMIeAvXa%2B; Hm_lvt_3f405f7f26b8855bc0fd96b1ae92db7e=1535966739; wmda_visited_projects=%3B1409632296065%3B2385390625025; f=n; sessionid=4ac06013-82df-4ef9-850e-bd04a68f0bfb; mcity=sz; cancelbut_city=true; 58home=sz; city=sz; commontopbar_new_city_info=4%7C%E6%B7%B1%E5%9C%B3%7Csz; commontopbar_ipcity=sz%7C%E6%B7%B1%E5%9C%B3%7C0; _gid=GA1.2.793405159.1536132807; ipcity=sz%7C%u6DF1%u5733; Hm_lvt_e2d6b2d0ec536275bb1e37b421085803=1535445897,1536132833; Hm_lpvt_e2d6b2d0ec536275bb1e37b421085803=1536132833; GA_GTID=0d4000fc-0071-2824-757d-d1a7ace48f37; final_history=35297673109560%2C31329621877689%2C35145569746859%2C35250016832811; bai=16.; UM_distinctid=165a8de8c3bc6-0c548f018b02cc-3c604504-100200-165a8de8c4377; spm=u-LscBIm_2J9tMeMj.psy_111; utm_source=link; new_uv=5; init_refer=; wmda_session_id_2385390625025=1536136490143-ea72f913-60b4-9c7b; new_session=0; xzfzqtoken=8rEhCRCweZtJQraSUTcpLI%2FLkuFKPLzAC9JWw2OZepOzkBIgGBEuPuvDGuPD6AyAin35brBb%2F%2FeSODvMgkQULA%3D%3D; defraudName=defraud; ppStore_fingerprint=E0D010CC512D5EC95743B55994AAB96DD2BB0115C1E0B710%EF%BC%BF1536136615477',
'DNT': '1',
'Host': 'sz.58.com',
'Referer': 'http://sz.58.com/chuzu/?utm_source=link&spm=u-LscBIm_2J9tMeMj.psy_111&PGTID=0d100000-0000-4303-c40e-43b07349cb6e&ClickID=2',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
    url = 'http://sz.58.com/szlhxq/chuzu/?utm_source=link&spm=u-LscBIm_2J9tMeMj.psy_111&PGTID=0d3090a7-0000-42b2-96fc-2ddafd27ab25&ClickID=2'
    response = session.get(url, headers=headers)
    page = response.html
    # Extract the listing titles and prices
    results = page.xpath('//div[@class="des"]/h2/a/text()|//div[@class="money"]/b/text()')
    for text in results:
        print(text.strip())


if __name__ == '__main__':
    fang_58()
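The script prints titles and prices as one flat stream. If you would rather have each title next to its price, one possible approach is to run the two XPath branches as separate queries and zip the results. This is a sketch under that assumption: `pair_listings` is a hypothetical helper, and the sample lists stand in for real query results.

```python
def pair_listings(titles, prices):
    """Zip stripped titles with stripped prices, dropping empty strings."""
    clean_titles = [t.strip() for t in titles if t.strip()]
    clean_prices = [p.strip() for p in prices if p.strip()]
    return list(zip(clean_titles, clean_prices))


# In a real run these lists would come from two separate queries, e.g.
#   html.xpath('//div[@class="des"]/h2/a/text()')
#   html.xpath('//div[@class="money"]/b/text()')
# where html stands for the parsed page (the response's .html attribute).
titles = ['\n  Bright one-bedroom near the metro  ', ' Shared flat, single room ']
prices = ['1500', '950']

for title, price in pair_listings(titles, prices):
    print(f'{price}\t{title}')
```

Note that `zip` stops at the shorter list, so if a listing is missing its price the pairs after it would be misaligned. Selecting each listing's container first and reading title and price inside it would be more robust.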
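Running it gives the result shown in the screenshot. This only scrapes a single page of data. If you need multiple pages, a while loop will do the job! I am short on time, so I will not cover it here. Try writing it yourself, since that is how you make progress!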
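For readers who want a starting point, here is a minimal sketch of such a while loop. Several things in it are assumptions rather than facts from this post: the `pn<N>/` page suffix is a guessed URL pattern that you should check against the site's own pagination links, and `fake_fetch` is a stand-in that lets the loop run offline. A real fetcher would request the URL with the headers above and return the XPath results.

```python
import time


def page_url(base, page):
    """Build the URL of a page; the 'pn<N>/' suffix is an assumed pattern."""
    return base if page == 1 else f'{base}pn{page}/'


def crawl_pages(fetch, base, max_pages=3, delay=1.0):
    """Call fetch(url) page by page until a page is empty or max_pages is hit."""
    results = []
    page = 1
    while page <= max_pages:
        items = fetch(page_url(base, page))
        if not items:
            break  # an empty page is treated as the end of the listings
        results.extend(items)
        page += 1
        time.sleep(delay)  # pause between requests to go easy on the site
    return results


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url):
    pages = {
        'http://sz.58.com/chuzu/': ['listing A', 'listing B'],
        'http://sz.58.com/chuzu/pn2/': ['listing C'],
    }
    return pages.get(url, [])


print(crawl_pages(fake_fetch, 'http://sz.58.com/chuzu/', max_pages=5, delay=0))
```

Passing the fetcher in as an argument keeps the loop logic separate from the network code, which makes it easy to test before pointing it at the real site.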