Scraping and Parsing Chongqing House Price Trends (2015-2019) with Python

This post walks through using a Python scraper to fetch and parse sold-listing data for Chongqing's Jiangbei district on Lianjia, covering 2015 to 2019. The author first defines a `spider` function to download the page source, then uses `spider_detail` to parse the HTML and extract each listing's name, floor area, layout, price, and other key fields, and finally writes the rows to a CSV file. The code also includes exception handling and a delay between requests to avoid being blocked for requesting too frequently.
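Before the full listing, here is a minimal, self-contained sketch of the fetch-and-throttle idea the summary describes. The function name and header value are illustrative only; the author's actual `spider` function follows below.

```
import time
import requests

def fetch_page(url):
    # Illustrative fetcher: send a browser-like User-Agent and pause after
    # each request so the target site is not hit too frequently.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    time.sleep(5)  # throttle before the caller fetches the next page
    return response.text
```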

```
#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Created on 2019-11-24

@author: Admin
'''

import requests
from lxml import etree
import time
import csv

'''
Function: spider
Purpose:  fetch the target page and return its HTML source as text
Param:    url -- target URL
'''


def spider(url):
    try:
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            # Session cookie copied from the author's browser; it will expire and should be replaced with your own.
            'cookie': 'TY_SESSION_ID=150d5f1d-3be9-47b7-8728-f5b0673e307d; lianjia_uuid=22c2fd7c-bd33-4b52-b13c-0455483c8c53; _smt_uid=5dda86a6.5152533e; UM_distinctid=16e9d9dfc04451-098d8fb5ad92f6-e353165-1fa400-16e9d9dfc05a07; _ga=GA1.2.1829982433.1574602409; digv_extends=%7B%22utmTrackId%22%3A%2221583074%22%7D; _jzqa=1.3521694123893513000.1574602407.1574773117.1575120474.3; _jzqc=1; _jzqckmp=1; _gid=GA1.2.1091277813.1575120477; CNZZDATA1255849584=948253718-1574601020-https%253A%252F%252Fwww.baidu.com%252F%7C1575116340; CNZZDATA1254525948=3090229-1574602304-https%253A%252F%252Fwww.baidu.com%252F%7C1575120323; _qzjc=1; CNZZDATA1255604082=2128363916-1574597104-https%253A%252F%252Fwww.baidu.com%252F%7C1575119427; lianjia_ssid=923a34dd-a281-4f27-8dd2-5acf42342745; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1574602407,1574773116,1575120687; _jzqy=1.1574602407.1575120687.3.jzqsr=baidu|jzqct=%E9%87%8D%E5%BA%86%E6%88%BF%E7%BD%91.jzqsr=baidu; select_city=500000; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216e9d9dfd30306-0d0ad150c956ec-e353165-2073600-16e9d9dfd319bc%22%2C%22%24device_id%22%3A%2216e9d9dfd30306-0d0ad150c956ec-e353165-2073600-16e9d9dfd319bc%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_utm_source%22%3A%22baidu%22%2C%22%24latest_utm_medium%22%3A%22pinzhuan%22%2C%22%24latest_utm_campaign%22%3A%22sousuo%22%2C%22%24latest_utm_content%22%3A%22biaotimiaoshu%22%2C%22%24latest_utm_term%22%3A%22biaoti%22%7D%7D; CNZZDATA1255633284=795194134-1574597808-https%253A%252F%252Fwww.baidu.com%252F%7C1575120759; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1575121032; _qzja=1.2113280281.1574602406885.1574773116776.1575120645619.1575120907050.1575121031859.0.0.0.46.3; _qzjb=1.1575120645619.11.0.0.0; _qzjto=11.1.0; _jzqb=1.15.10.1575120474.1; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiZTZiZTc4OGQ0MGZlZDJlZmZhNjRmMGYyNmJhNGM5NDBlZWZkMmM5N2Y0MzU2MmYyYzY2ZjQwMzZhNTk1MTI1M2ZiZjc2OGU3ODEzODBhZTYzMzNiOWZkZTExNTBkMmIxY2FhMzlmMzZmMTM5NGU0YmEwYmY5OTdlNDI5NmRiYTVjYzA5NmNkY2JkZmZkNWRmZmVhZWU1MDFjZjU0NTgyOTU0ZTkxZmVkZjhhYmI3ODc1YjJlNjA2Yzk3ZWRhNDJlYWUxZTBiZGJlMjBkNjQ2MWRkZDU3ZDRkOTE5ZTM0NDUwNDZjODNiZjE5ZGI3MzQ1MjU1YWFmNmRkZWJhZDJkZDNjMjk2MGFjNzIxNGY2YWY2Y2JkOWM5ZDcxYTU5N2FhMzMwMTJjNzNlNGEwNjhiOGI3MzEzMzIyYzM0NmVkM2ZcIixcImtleV9pZFwiOlwiMVwiLFwic2lnblwiOlwiNDNlZDc3OGJcIn0iLCJyIjoiaHR0cHM6Ly9jcS5saWFuamlhLmNvbS9jaGVuZ2ppYW8vIiwib3MiOiJ3ZWIiLCJ2IjoiMC4xIn0=',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'upgrade-insecure-requests': '1',
        }
        response = requests.get(url=url, headers=header)
        # print(response.text)
        return response.text
    except:
        print('failed to spider the target site, please check if the url is correct or the connection is available!')


'''
Function: spider_detail
Purpose:  parse the HTML source and extract the listing attributes
Param:    url -- target URL
'''


def spider_detail(url):
    response_text = spider(url)
    sel = etree.HTML(response_text)
    for house_num in range(1, 31):  # each result page lists 30 transactions
        try:
            # The title link reads "name layout area平米"; split it on spaces.
            house_info = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[1]/a/text()'
                                   % house_num)[0].strip().split(' ')
            house_name = house_info[0]
            house_mode = house_info[1]
            house_area = house_info[2].strip('平米')  # drop the "square metres" suffix

            house_prim_money = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[4]/span[2]/span[1]/text()'
                                         % house_num)[0].strip()
            house_sale_time = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[2]/div[2]/text()'
                                        % house_num)[0].strip().split('.')[0]
            house_price = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[3]/div[3]/span/text()'
                                    % house_num)[0].strip().strip("单价").strip("元/平米")  # strip the "unit price"/"yuan per sq m" labels
            house_totle = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[2]/div[3]/span/text()'
                                    % house_num)[0].strip()
            house_url = sel.xpath('/html/body/div[5]/div[1]/ul/li[%d]/div/div[1]/a/@href'
                                  % house_num)[0].strip()
            house_data = [house_name, house_area, house_mode,
                          house_sale_time, house_prim_money, house_price, house_totle, house_url]
            save_csv(house_data)

        except Exception as e:
            print(e)
            print('parameter error')


'''
Function: save_csv
Purpose:  append one row of data to the CSV file
Param:    house_data -- list of fields for one listing
'''


def save_csv(house_data):
    try:
        with open('E:/chongqing/cq_chengjiao_jiangbei_year.csv', 'a', encoding='utf-8-sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(house_data)
    except:
        print('write csv error!')


'''
Function: get_all_urls
Purpose:  generate all page URLs as an iterator
Param:    page_number -- total number of pages to crawl
Returns:  a generator yielding one URL per page
'''


def get_all_urls(page_number):
    if type(page_number) == type(1) and page_number > 0:  # guard against bad input
        for page in range(1, page_number + 1):
            url = 'https://cq.lianjia.com/chengjiao/jiangbei/pg' + str(page) + 'l3a4a5/'
            yield url
    else:
        print('page_number is incorrect!')


# Write the CSV header row first.
save_csv(['house_name', 'house_area', 'house_mode',
          'house_sale_time', 'house_prim_money', 'house_price', 'house_totle', 'house_url'])

for url in get_all_urls(100):
    try:
        time.sleep(5)  # pause between pages to avoid hammering the site
        spider_detail(url)
    except Exception as e:
        print(e)
        print('An error occurred while spidering Chongqing house prices!')
```
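The script above only collects raw transaction rows; turning them into the 2015-2019 trend the title refers to still requires aggregating the CSV. Below is a minimal sketch of one way to do that with pandas. It assumes the column names and file path written by the script; the yearly-mean summary is an illustrative choice, not something the original post specifies.

```
import pandas as pd

# Load the CSV produced by the scraper (path taken from the script above).
df = pd.read_csv('E:/chongqing/cq_chengjiao_jiangbei_year.csv')

# house_sale_time keeps only the part of the deal date before the first '.',
# which should be the year; house_price is the unit price in yuan per sq m.
df['year'] = pd.to_numeric(df['house_sale_time'], errors='coerce')
df['unit_price'] = pd.to_numeric(df['house_price'], errors='coerce')

# Average unit price per year, 2015-2019, as one simple view of the trend.
trend = (df[df['year'].between(2015, 2019)]
         .groupby('year')['unit_price']
         .mean()
         .round(0))
print(trend)
```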

OK, here is a simple tutorial on scraping a web page with Python and parsing the data:

1. Decide on the target site and the information to extract

First, decide which site to crawl and which data to pull out. You can use Python's requests library to send an HTTP request and fetch the HTML source, then parse the document with the BeautifulSoup library to get the target data. For example, to scrape article titles and links from the CSDN blog homepage, open the page, view its source, and find the HTML tags that contain the titles and links.

2. Send an HTTP request to fetch the HTML source

Next, use requests to send an HTTP request and get the HTML source.

```
import requests

url = 'https://blog.csdn.net/'
response = requests.get(url)
html = response.text
```

3. Parse the HTML document to extract the target data

Use BeautifulSoup to parse the HTML and pick out the data.

```
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('div', class_='title')
for title in titles:
    link = title.find('a').get('href')
    title_text = title.find('a').text.strip()
    print(title_text, link)
```

In the code above, `find_all` locates every div tag whose class attribute is "title"; inside each div, the first a tag provides the link and the title text.

4. Complete code

```
import requests
from bs4 import BeautifulSoup

url = 'https://blog.csdn.net/'
response = requests.get(url)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('div', class_='title')
for title in titles:
    link = title.find('a').get('href')
    title_text = title.find('a').text.strip()
    print(title_text, link)
```

That is all there is to a simple Python scraper that fetches a page and parses the data. Note that when crawling a site you should respect its crawling policy to avoid having your IP banned.
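One concrete way to follow that last piece of advice is to check the site's robots.txt before crawling. Here is a minimal sketch using the standard-library urllib.robotparser; the CSDN URL is just the example used above, so adapt it to whichever site you crawl.

```
from urllib import robotparser

# Fetch and parse the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url('https://blog.csdn.net/robots.txt')
rp.read()

url = 'https://blog.csdn.net/'
# can_fetch reports whether the given user agent may crawl the URL.
if rp.can_fetch('*', url):
    print('allowed to crawl:', url)
else:
    print('robots.txt disallows crawling:', url)
```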