Scraping Lianjia second-hand housing listings with a Python crawler (xpath)

俞泰鑫

Article link: https://blog.csdn.net/god_yutaixin/article/details/103111546

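Requirements

Scrape each listing's community name, layout (rooms and halls), area, decoration status (unfinished or furnished), floor, year built, building type (slab or tower block), total price and price per square metre, and put them into a dict.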

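Workflow

Check whether the data you want is present in the page source (that is, confirm it is static content rather than rendered by JavaScript)
Find the pattern in the page URLs. The pattern is:
Page n: https://sh.lianjia.com/ershoufang/pgn/ (for example, page 2 is https://sh.lianjia.com/ershoufang/pg2/)
Write the xpath expressions
Right-click and inspect the page elements: each li child of the node <ul class="sellListContent" log-mod="list"> holds one listing
In Chrome, test an xpath expression that matches the listing nodes
The expression here is: //ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]
It returns a list of node objects: [li1, li2, li3, ...]
Notes:
1. If ads or other unwanted entries are matched, filter them out with a condition in the xpath expression.
2. It is normal for an xpath to match in the browser but not in the program.
This almost always means the browser ran some JavaScript that rearranged the page structure (the browser evaluates the xpath against the HTML after JS rendering, while the program sees only the raw response).
When this happens, look at the raw response source, find a keyword in it, copy the surrounding markup, run it through an online HTML formatter, and inspect the structure and content. In short, do not fully trust the xpath that works in the browser; treat the actual response source as the reference and rewrite the xpath expression against it.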
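Loop over the li nodes with a for loop and extract the data for each listing
Write the xpath expressions from the page's HTML (inside the loop, prefix each one with . so it is evaluated relative to the current li node)
1. Community name: //div[@class="positionInfo"]/a[1]/text()
2. Region: //div[@class="positionInfo"]/a[2]/text()
3. House details: //div[@class="houseInfo"]/text()
  This yields a single string: "layout + area + orientation + decoration + floor + year built + building type"
  e.g. 2室一厅 | 67.7平米 | 南北 | 毛坯 | 低楼层(共6楼) |1997年建 | 板楼
  After split('|'):
  1. Layout: houseinfo[0]
  2. Area: houseinfo[1]
  3. Orientation: houseinfo[2]
  4. Decoration: houseinfo[3]
  5. Floor: houseinfo[4]
  6. Year built: houseinfo[5]
  7. Building type: houseinfo[6]
4. Total price: //div[@class="totalPrice"]/span/text()
5. Unit price: //div[@class="unitPrice"]/span/text()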
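
The split step above can be sketched as a small standalone helper. This is a minimal sketch rather than the article's own code: the names parse_house_info and FIELDS are hypothetical, and it assumes the houseInfo text always uses | as the separator and ends the year field with the two characters 年建, as in the example string.

```python
# Hypothetical helper: turn the houseInfo string into a dict of seven fields.
FIELDS = ("housetype", "area", "direction", "decoration", "level", "time", "type")


def parse_house_info(text):
    """Split the '|'-separated houseInfo string; return all-None on a bad shape."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != len(FIELDS):
        # Unexpected number of fields: store None for every key.
        return dict.fromkeys(FIELDS)
    record = dict(zip(FIELDS, parts))
    # Assumption: the year field looks like "1997年建"; drop the last two characters.
    record["time"] = record["time"][:-2]
    return record


sample = "2室一厅 | 67.7平米 | 南北 | 毛坯 | 低楼层(共6楼) |1997年建 | 板楼"
record = parse_house_info(sample)
print(record)
```

Checking the field count before indexing means a listing with a different layout produces None values instead of an IndexError or silently shifted columns.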

Full code

import random
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class LianjiaSpider(object):
    def __init__(self):
        self.url = 'https://sh.lianjia.com/ershoufang/pg{}/'

    def parse_html(self, url):
        headers = {'User-Agent': UserAgent().random}
        # Try each page up to three times. On success, leave the loop and move on
        # to the next page; after three failures, give up on this page.
        for i in range(3):
            try:
                # requests: send the request and read the response body. Decoding
                # .content explicitly is the safer option; the request times out
                # after 3 seconds.
                response = requests.get(url=url, headers=headers, timeout=3)
                html = response.content.decode('utf-8', 'ignore')
                self.get_data(html)
                break
            except Exception as e:
                print('Retry')
                print(e)

    def get_data(self, html):
        # lxml: build the parse tree
        p = etree.HTML(html)
        # Match the listing nodes; the result is a list of node objects: [li1, li2, li3]
        li_list = p.xpath(
            '//ul[@class="sellListContent"]'
            '/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]'
        )
        # Loop over the list, pull each field out of each listing, and put it in a dict
        for li in li_list:
            # Use a fresh dict per listing so earlier listings are not overwritten
            item = {}
            # Note: when calling xpath on a node from the loop, the expression must
            # start with . so it searches under the current node, not the whole document

            # Community name (xpath returns a list)
            name_list = li.xpath('.//div[@class="positionInfo"]/a[1]/text()')
            # If nothing matched, store None instead of indexing an empty list
            item['name'] = name_list[0].strip() if name_list else None

            # Region
            region = li.xpath('.//div[@class="positionInfo"]/a[2]/text()')
            item['region'] = region[0].strip() if region else None

            # House details
            # houseinfo = ["layout | area | orientation | decoration | floor | year | type"]
            houseinfo = li.xpath('.//div[@class="houseInfo"]/text()')
            # Be careful: only use the data if something matched and the split
            # produces exactly 7 fields; otherwise set every field to None
            parts = houseinfo[0].split('|') if houseinfo else []
            if len(parts) == 7:
                item['housetype'] = parts[0].strip()
                item['area'] = parts[1].strip()
                item['direction'] = parts[2].strip()
                item['decoration'] = parts[3].strip()
                item['level'] = parts[4].strip()
                # Slice off the last two characters ("年建") to keep only the year
                item['time'] = parts[5].strip()[:-2]
                item['type'] = parts[6].strip()
            else:
                item['housetype'] = item['area'] = item['direction'] = \
                    item['decoration'] = item['level'] = item['time'] = \
                    item['type'] = None

            # Total price
            total_list = li.xpath('.//div[@class="totalPrice"]/span/text()')
            item['total'] = total_list[0].strip() if total_list else None
            # Price per square metre
            unit_list = li.xpath('.//div[@class="unitPrice"]/span/text()')
            item['unit'] = unit_list[0].strip() if unit_list else None

            print(item)

    def run(self):
        for page in range(1, 101):
            url = self.url.format(page)
            self.parse_html(url)
            time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    spider = LianjiaSpider()
    spider.run()
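
The retry behaviour in parse_html can also be pulled out into a small helper, which makes it easy to check without touching the network. This is a minimal sketch under assumptions, not part of the original spider: fetch_with_retry and flaky_fetch are hypothetical names, and fetch stands for any callable that returns the page text or raises on failure.

```python
def fetch_with_retry(fetch, url, attempts=3):
    """Call fetch(url) up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            # Returning on success is what stops further retries.
            return fetch(url)
        except Exception as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
    return None


# Stand-in fetcher for illustration: fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "https://sh.lianjia.com/ershoufang/pg1/")
print(result, len(calls))
```

In the spider, fetch could be a function wrapping requests.get(url, headers=headers, timeout=3), with the result passed to get_data only when it is not None. Keeping parsing outside the retry loop also means a parsing error does not trigger a second download of the same page.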
