Dynamic Web Scraping with Selenium

Latest recommended article published on 2024-08-13 20:32:54

RonnieღC

Tags: Python, web scraping

Article link: https://blog.csdn.net/Chenrong1009/article/details/95088919

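Dynamic Web Scraping

Last time I scraped the titles of Douban's Top 250 books from a static page. This time, following the same book, I look at how to scrape dynamic web pages.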

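Introduction to Dynamic Web Pages

The difference between dynamic and static pages is this: a static page has all of its displayed content in the HTML source, while a dynamic page typically uses AJAX to exchange data with the server in the background, so part of the page can be updated without reloading the whole page.
AJAX stands for Asynchronous JavaScript And XML. It makes web applications faster and lighter, avoids re-downloading repeated page content, and saves bandwidth, but it also makes scraping more complicated.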

How to Scrape Dynamic Web Pages

There are two ways to scrape the content of a dynamic page loaded with AJAX:

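1. Find the real data address with the browser's Inspect tool
Use Chrome's "Inspect" feature to examine the page and find the real address of the data. Clicking the Network tab shows every file the browser receives from the web server; this process is known as "packet capture". This method often runs into problems. For example, some sites encrypt their requests to prevent scraping, which makes the called address hard to find through "Inspect".
2. Simulate a browser with Selenium
This method relies on the browser's rendering engine: the browser itself parses the HTML, applies the CSS styles, and executes the JavaScript while displaying the page. The scraper drives the browser through each page automatically and collects the data along the way, which turns scraping a dynamic page into scraping a static one.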
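
To make the first method more concrete: once the real data address has been found in the Network tab, the response is often JSON and can be parsed without a browser. The sketch below is only an illustration. The payload and its field names (`listings`, `name`, `price`) are made up for this example and are not Airbnb's actual response format; `extract_listings` is likewise a name chosen here.

```python
import json

# Hypothetical response body, standing in for what an AJAX endpoint might return.
# The structure and keys are assumptions for illustration only.
SAMPLE_RESPONSE = '''
{
  "listings": [
    {"name": "Cozy loft", "price": 218},
    {"name": "Riverside flat", "price": 305}
  ]
}
'''


def extract_listings(body):
    """Parse a JSON response body and return a list of (name, price) tuples."""
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data.get("listings", [])]


if __name__ == "__main__":
    for name, price in extract_listings(SAMPLE_RESPONSE):
        print(name, price)
```

In a real scraper the body would come from an HTTP request to the address found during packet capture, and the keys would have to match whatever that endpoint returns.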

Installing Selenium

Like other Python libraries, Selenium can be installed with pip:

pip install selenium

The installation succeeded if the output contains "Successfully".

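Selenium Example: Airbnb Short-Term Rental Data

Goal: get the name, price, number of reviews, house type, number of beds, and number of guests for the first 10 pages of short-term rentals in Changsha, Hunan.
Page address: https://www.airbnb.cn/s/homes?refinement_paths[]=%2Fhomes&query=长沙&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE&s_tag=R2PBwazh
Open the Airbnb page listing the top 200 short-term rentals in Changsha, click "Inspect", and look at where the data sits, as shown in the figure:
[Figure: all the data for one listing]
The selector for all the data of one listing is: div._gig1e7
Within that data, the selector for the price is: div._18gk84h
[Figure: location of the price]
The review count, listing name, and house type can be located the same way. They are summarized in the table below: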

Data	Element	Class
All data for one listing	div	_gig1e7
Price	div	_18gk84h
Number of reviews	div	_13o4q7nw
Name	div	_qhtkbey
House type	span	_fk7kh10
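
The table can be carried into code as a small lookup so that each selector is written only once. This is just one possible arrangement: the dictionary name `SELECTORS` and the helper `css` are names chosen for this sketch, while the tags and class names are copied from the table.

```python
# Tag and class for each field, copied from the table above.
SELECTORS = {
    "listing": ("div", "_gig1e7"),
    "price": ("div", "_18gk84h"),
    "reviews": ("div", "_13o4q7nw"),
    "name": ("div", "_qhtkbey"),
    "details": ("span", "_fk7kh10"),
}


def css(field):
    """Build a CSS selector such as 'div._gig1e7' for the given field."""
    tag, class_name = SELECTORS[field]
    return f"{tag}.{class_name}"


if __name__ == "__main__":
    for field in SELECTORS:
        print(field, "->", css(field))
```

Class names like these are generated by the site and may change at any time, so keeping them in one place makes them easier to update.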

With the selectors in hand, Selenium can fetch the data on the first Airbnb page. The code is as follows:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Search URL for Changsha listings (first page)
url = 'https://www.airbnb.cn/s/homes?refinement_paths%5B%5D=%2Fhomes&query=%E9%95%BF%E6%B2%99&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE&s_tag=R2PBwazh'

# Start Chrome, load the page, and give the AJAX content time to render
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)

# Each div._gig1e7 holds all the data for one listing.
# Selenium 4 uses find_elements(By.CSS_SELECTOR, ...) in place of
# the older find_elements_by_css_selector(...) helpers.
rent_list = driver.find_elements(By.CSS_SELECTOR, 'div._gig1e7')
for eachhouse in rent_list:
    # Number of reviews
    comment = eachhouse.find_element(By.CSS_SELECTOR, 'div._13o4q7nw').text
    # Price, with the "每晚" / "价格" labels and line breaks removed
    price = eachhouse.find_element(By.CSS_SELECTOR, 'div._18gk84h').text
    price = price.replace("每晚", "").replace("价格", "").replace("\n", "")
    # Listing name
    name = eachhouse.find_element(By.CSS_SELECTOR, 'div._qhtkbey').text
    # Other details, separated by " · " (house type first, then beds)
    details = eachhouse.find_element(By.CSS_SELECTOR, 'span._fk7kh10').text
    house_type = details.split(" · ")[0]
    bed_number = details.split(" · ")[1]
    print(comment, price, name, house_type, bed_number)

# Close the browser when finished
driver.quit()
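
The text clean-up inside the loop can also be pulled out into small functions that are easy to check without a browser. The sketch below assumes the detail text looks like "整套公寓 · 1室1卫1床" and the price text like "价格\n￥218\n每晚". Both sample strings are assumptions for illustration, and the function names are invented here. Unlike indexing with `[1]` directly, `parse_details` returns an empty string when the second part is missing instead of raising `IndexError`.

```python
def clean_price(raw):
    """Strip the '每晚' / '价格' labels and line breaks from a price string."""
    return raw.replace("每晚", "").replace("价格", "").replace("\n", "").strip()


def parse_details(raw):
    """Split 'house type · beds' text into (house_type, bed_number).

    A missing second part comes back as an empty string rather than
    raising IndexError.
    """
    parts = [part.strip() for part in raw.split(" · ")]
    house_type = parts[0]
    bed_number = parts[1] if len(parts) > 1 else ""
    return house_type, bed_number


if __name__ == "__main__":
    # Sample strings are made up for illustration.
    print(clean_price("价格\n￥218\n每晚"))
    print(parse_details("整套公寓 · 1室1卫1床"))
    print(parse_details("整套公寓"))
```

Inside the scraping loop these would be called on the `.text` of the price and details elements.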

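The output looks like this:
[Figure: results of scraping the first page]
This only fetches a single page, but the goal is the first 10 pages. Opening the second page shows that the address has changed to: https://www.airbnb.cn/s/homes?refinement_paths[]=%2Fhomes&section_offset=6&items_offset=18&s_tag=mt59xV_D
The address of the third page is: https://www.airbnb.cn/s/homes?refinement_paths[]=%2Fhomes&section_offset=6&items_offset=36&s_tag=mt59xV_D
The only difference is items_offset, which grows by 18 per page (18 for page 2, 36 for page 3). Adding a loop over that offset fetches the first 10 pages, so the code becomes: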
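
The page-to-offset rule can be written as a small helper. This is a sketch under stated assumptions: the function name `build_page_url` and the constant names are chosen here, and the query parameters and their values are copied from the URLs shown earlier in this article, on the assumption that page 1 corresponds to an offset of 0.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.airbnb.cn/s/homes"
ITEMS_PER_PAGE = 18  # page 2 uses items_offset=18, page 3 uses 36


def build_page_url(page):
    """Return the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "refinement_paths[]": "/homes",
        "query": "长沙",
        "place_id": "ChIJxWQcnvM1JzQRgKbxoZy75bE",
        "s_tag": "mt59xV_D",
        "section_offset": 6,
        "items_offset": (page - 1) * ITEMS_PER_PAGE,
    }
    # urlencode percent-encodes the brackets, the slash, and the Chinese query.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for page in range(1, 4):
        url = build_page_url(page)
        offset = parse_qs(urlsplit(url).query)["items_offset"][0]
        print(page, offset, url)
```

Building the query with `urlencode` avoids hand-escaping characters such as `[]` and the Chinese city name.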

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome once and reuse it for every page
driver = webdriver.Chrome()
for i in range(0, 10):
    # items_offset grows by 18 per page: 0, 18, 36, ...
    url = 'https://www.airbnb.cn/s/homes?refinement_paths%5B%5D=%2Fhomes&toddlers=0&query=%E9%95%BF%E6%B2%99&s_tag=qevSKrvy&section_offset=6&items_offset=' + str(i * 18) + '&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE'
    driver.get(url)
    time.sleep(3)

    # Each div._gig1e7 holds all the data for one listing.
    # Selenium 4 uses find_elements(By.CSS_SELECTOR, ...) in place of
    # the older find_elements_by_css_selector(...) helpers.
    rent_list = driver.find_elements(By.CSS_SELECTOR, 'div._gig1e7')
    for eachhouse in rent_list:
        # Number of reviews
        comment = eachhouse.find_element(By.CSS_SELECTOR, 'div._13o4q7nw').text
        # Price, with the "每晚" / "价格" labels and line breaks removed
        price = eachhouse.find_element(By.CSS_SELECTOR, 'div._18gk84h').text
        price = price.replace("每晚", "").replace("价格", "").replace("\n", "")
        # Listing name
        name = eachhouse.find_element(By.CSS_SELECTOR, 'div._qhtkbey').text
        # Other details, separated by " · " (house type first, then beds)
        details = eachhouse.find_element(By.CSS_SELECTOR, 'span._fk7kh10').text
        house_type = details.split(" · ")[0]
        bed_number = details.split(" · ")[1]
        print(comment, price, name, house_type, bed_number)

# Close the browser when finished
driver.quit()

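The output now covers the first 10 pages of Airbnb listings in the Changsha area:
[Figure: run output]
Because of my own laziness, this took me far too long to finish. I'm off to reflect on that.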