A detailed example: deep-crawling Douban with Selenium
The example scrapes Douban Events (豆瓣同城): concert events in Beijing over the most recent week.
1. Use the Edge browser as the driver
driver = webdriver.Edge()
2. Send a request to the target URL
Use the browser's F12 developer tools to find the site's request URL, request method, and content type.
driver.get() issues the request:
driver.get("https://beijing.douban.com/events/week-1002")
3. Collect each event's URL
Inspecting the page shows a ul list containing multiple li tags.
The href attribute of the a tag inside each li is that event's URL,
which leads to the event's detail page.
find_elements() fetches all the li items:
li_list = driver.find_elements(By.XPATH, "//ul[@class='events-list events-list-pic100 events-list-psmall']/li")
Iterate over li_list and use get_attribute() to read each href:
url_list = [li.find_element(By.XPATH, "div/a").get_attribute("href") for li in li_list]
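The same href-extraction idea can be sketched without a browser, using only the standard library's html.parser on a hypothetical snippet shaped like Douban's list (the markup below is illustrative, not the real page):

```python
from html.parser import HTMLParser

# Made-up HTML mimicking the ul > li > div > a structure of the listing page.
SAMPLE = """
<ul class="events-list events-list-pic100 events-list-psmall">
  <li><div><a href="https://www.douban.com/event/1/">A</a></div></li>
  <li><div><a href="https://www.douban.com/event/2/">B</a></div></li>
</ul>
"""

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

parser = HrefCollector()
parser.feed(SAMPLE)
print(parser.hrefs)
```

Selenium does the same traversal for you: the relative XPath "div/a" plus get_attribute("href") replaces the hand-written parser.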
4. Scrape the page elements
Send a request to each collected event URL.
Iterate over url_list and request each URL:
for url in url_list:
    driver.get(url)
Use F12 to inspect the page elements.
For example, the event's title, time, location, price, and attendee count all sit inside div class="event-info".
find_element() fetches each target element.
The extracted text is cleaned of newlines and hard (non-breaking) spaces:
concert_name = driver.find_element(By.XPATH, "//div[@class='event-info']/h1").text
concert_datetime = driver.find_element(By.XPATH, "//div[@class='event-info']/div[1]/ul/li").text
concert_location = driver.find_element(By.XPATH, "//div[@class='event-info']/div[2]/span[2]").text
concert_location_ = concert_location.replace("\n", "")
concert_price = driver.find_element(By.XPATH, "//div[@class='event-info']/div[3]/span").text
concert_price_ = concert_price.replace("\n", "").replace("\xa0", "")
concert_count = driver.find_element(By.XPATH, "//div[@class='event-info']/div[6]").text
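The cleanup step can be checked in isolation. The sample string below is made up to mimic the newlines and non-breaking spaces ("\xa0") that the price span's .text typically contains:

```python
# Hypothetical raw text, shaped like what .text might return for the price span.
raw_price = "\n50元\xa0起\n"

# Same cleanup as in the scraper: strip newlines, then hard (non-breaking) spaces.
cleaned = raw_price.replace("\n", "").replace("\xa0", "")
print(cleaned)  # 50元起
```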
5. Scrape every page
Clicking Next changes the start parameter at the end of the URL from start=0 to start=10, so each page holds ten events.
Adding one outer loop scrapes every page, formatting the start value into the request URL:
for i in range(0, 51, 10):
    driver.get(f"https://beijing.douban.com/events/week-1002?start={i}")
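Generating the listing URLs up front makes the start arithmetic easy to verify: range(0, 51, 10) yields six pages with start = 0, 10, ..., 50.

```python
BASE = "https://beijing.douban.com/events/week-1002"

# One listing URL per page; start counts events, ten per page.
page_urls = [f"{BASE}?start={i}" for i in range(0, 51, 10)]

print(len(page_urls))  # 6
print(page_urls[0])    # https://beijing.douban.com/events/week-1002?start=0
print(page_urls[-1])   # https://beijing.douban.com/events/week-1002?start=50
```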
The complete code follows.
sleep() forces a wait for the page to finish loading, so the scrape does not pick up incomplete data.
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
# Use the Edge browser as the driver
driver = webdriver.Edge()
# Scrape every page (ten events per page)
for i in range(0, 51, 10):
    # Request the listing page for this value of start
    driver.get(f"https://beijing.douban.com/events/week-1002?start={i}")
    sleep(3)
    # Fetch the li items of the events list
    li_list = driver.find_elements(By.XPATH, "//ul[@class='events-list events-list-pic100 events-list-psmall']/li")
    # Extract each event's URL
    url_list = [li.find_element(By.XPATH, "div/a").get_attribute("href") for li in li_list]
    # Request each event's detail page
    for url in url_list:
        driver.get(url)
        sleep(0.1)
        # Parse the data: name, datetime, location, price, count
        try:
            concert_name = driver.find_element(By.XPATH, "//div[@class='event-info']/h1").text
            concert_datetime = driver.find_element(By.XPATH, "//div[@class='event-info']/div[1]/ul/li").text
            concert_location = driver.find_element(By.XPATH, "//div[@class='event-info']/div[2]/span[2]").text
            concert_location_ = concert_location.replace("\n", "")
            concert_price = driver.find_element(By.XPATH, "//div[@class='event-info']/div[3]/span").text
            concert_price_ = concert_price.replace("\n", "").replace("\xa0", "")
            concert_count = driver.find_element(By.XPATH, "//div[@class='event-info']/div[6]").text
        except Exception as e:
            print(e)
        else:
            print(
                f"{concert_name}\nTime: {concert_datetime}\nLocation: {concert_location_}\n{concert_price_}\n{concert_count}\n")