Target page: https://www.xib.com.cn/per-product/index.html#/per-product/product-info
- Importing libraries and creating the browser object
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
driver = webdriver.Chrome()
This initializes the browser object and assigns it to driver; from here on we call methods on driver to simulate browser actions. In Selenium, get() returns as soon as the page frame has finished loading, so reading page_source at that moment may give you a page that is not yet fully rendered. We therefore import the time library and sleep for a while to make sure the target nodes have loaded.
Selenium offers three main wait strategies: forced waits (used directly in this example), explicit waits, and implicit waits. For the concepts and implementations, see the article “Python selenium —— 一定要会用selenium的等待,三种等待方式解读”.
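The idea behind an explicit wait, polling a condition until it holds or a timeout expires, can be sketched in plain Python. This is a generic helper for illustration, not Selenium's own API:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or raises TimeoutError. This mirrors the
    polling loop that Selenium's WebDriverWait performs internally.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Hypothetical usage with Selenium (assumes `driver` and `By` are available):
# rows = wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "tbody tr"))
```

With Selenium itself the same effect comes from `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "tbody tr")))`, which polls in exactly this way and is usually preferable to a fixed `time.sleep()`.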
- Maximizing the window and scrolling the page
When scraping the target table on this page, two problems come up first:
(1) the target table takes quite a long time to load;
(2) the table sits at the bottom of the page, so we must scroll down before all of its rows become visible. How do we make Selenium scroll the page to the bottom automatically while it drives the browser? The following code does it:
driver.get("https://www.xib.com.cn/per-product/index.html#/per-product/product-info")
driver.maximize_window()  # maximize the window
sladedown = driver.find_element(By.CSS_SELECTOR, "#app > div > div > div > div:nth-child(6) > div > div.el-table__header-wrapper > table > thead > tr")  # CSS locator of the table's header row
driver.execute_script("arguments[0].scrollIntoView();", sladedown)  # scroll until the header row sits at the top of the viewport
driver.execute_script("arguments[0].scrollIntoView(false);", sladedown)  # after scraping a page, scroll back up, since the Next Page button is above the table
- Scraping one page of data in a loop
Use the browser's developer tools to inspect the page source and find the CSS locator for each column we want to scrape. The target columns are product code (产品代码), product name (产品名称), own/agency (自有/代销), product type (产品类型), and public/private (公募/私募); their locators are:
产品代码: '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_1.is-center > div'
产品名称: '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_2.is-center > div'
自有/代销: '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_3.is-center > div'
产品类型: '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_4.is-center > div'
公募/私募: '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_5.is-center > div'
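Since the five locators differ only in the row number and the column index, they can be generated rather than written out by hand. A small helper (hypothetical; the selector template is copied from the locators above):

```python
# shared prefix of every cell locator in the product table
BASE = ('#app > div > div > div > div:nth-child(6) > div > '
        'div.el-table__body-wrapper.is-scroll-left > table > tbody')

def cell_selector(row, column):
    """CSS locator for the cell at 1-based (row, column) of the product table."""
    return f'{BASE} > tr:nth-child({row}) > td.el-table_1_column_{column}.is-center > div'

# e.g. the product-code cell of the third row: cell_selector(3, 1)
```

With this helper, the inner scraping loop can fetch all five columns with `driver.find_element(By.CSS_SELECTOR, cell_selector(i, col))` for `col` in 1..5 instead of repeating five near-identical string concatenations.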
- Simulating a click on the “Next Page” button
After scraping one page, we simulate a user clicking “Next Page” to move on to the next page. The following code does it:
button = driver.find_element(By.CSS_SELECTOR, "#app > div > div > div > div:nth-child(5) > div > button.btn-next")
button.click()
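The complete code below simply clicks a fixed number of times; a more robust loop would stop when the last page is reached. On Element UI paginators the “Next Page” button typically carries a disabled attribute on the last page (an assumption worth verifying in the developer tools for this site). A sketch of the check, written so it can be exercised without a browser:

```python
def is_last_page(next_button):
    """Return True if the Next Page button is disabled (i.e. we are on the last page)."""
    # Selenium's get_attribute() returns None when the attribute is absent
    return next_button.get_attribute("disabled") is not None

class FakeButton:
    """Stand-in for a Selenium WebElement, for demonstration only."""
    def __init__(self, disabled):
        self._disabled = disabled
    def get_attribute(self, name):
        return "true" if (name == "disabled" and self._disabled) else None
```

In the scraper itself this would become `if is_last_page(button): break` just before `button.click()`, which also answers the “how to detect the last page automatically” question left open in the code comments.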
- Complete code
from selenium import webdriver
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
product_list = []
driver.get("https://www.xib.com.cn/per-product/index.html#/per-product/product-info")
driver.maximize_window()  # maximize the window
sladedown = driver.find_element(By.CSS_SELECTOR, "#app > div > div > div > div:nth-child(6) > div > div.el-table__header-wrapper > table > thead > tr")  # CSS locator of the table's header row
for page in range(1, 96):  # 95 pages in total
    time.sleep(5)  # wait for the table to load; question 1: could an explicit wait replace this?
    # scroll down so the table is in view
    driver.execute_script("arguments[0].scrollIntoView();", sladedown)  # scroll until the header row sits at the top of the viewport
    time.sleep(3)
    for i in range(1, 11):  # 10 rows per page
        order = str(i)
        idlactor = '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_1.is-center > div'
        productid = driver.find_element(By.CSS_SELECTOR, idlactor)
        productid = "\t" + str(productid.text)  # the "\t" prefix keeps leading zeros of product codes when the CSV is opened in Excel, which otherwise strips them
        namelactor = '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_2.is-center > div'
        productname = driver.find_element(By.CSS_SELECTOR, namelactor)
        productname = str(productname.text)
        daixiaolactor = '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_3.is-center > div'
        daixiao = driver.find_element(By.CSS_SELECTOR, daixiaolactor)
        daixiao = str(daixiao.text)
        typelactor = '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_4.is-center > div'
        producttype = driver.find_element(By.CSS_SELECTOR, typelactor)
        producttype = str(producttype.text)
        gongmulactor = '#app > div > div > div > div:nth-child(6) > div > div.el-table__body-wrapper.is-scroll-left > table > tbody > tr:nth-child(' + order + ') > td.el-table_1_column_5.is-center > div'
        gongmu = driver.find_element(By.CSS_SELECTOR, gongmulactor)
        gongmu = str(gongmu.text)
        product_list.append({'产品代码': productid, '产品名称': productname, '自有/代销': daixiao,
                             '产品类型': producttype, '公募/私募': gongmu})
    driver.execute_script("arguments[0].scrollIntoView(false);", sladedown)  # scroll back up after scraping the page, since the Next Page button is above
    # click Next Page; question 2: how to detect the last page automatically?
    time.sleep(2)
    button = driver.find_element(By.CSS_SELECTOR, "#app > div > div > div > div:nth-child(5) > div > button.btn-next")
    button.click()
df = pd.DataFrame(product_list)
columns = ['产品代码', '产品名称', '自有/代销', '产品类型', '公募/私募']
df.to_csv('G:/网络爬虫/爬虫实例之银行数据爬取/result.csv', encoding='utf-8', columns=columns)  # write the results to CSV
driver.quit()
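The tab-prefix trick used for the product codes can be checked without a browser: write a code that starts with 0 to CSV and confirm the tab survives, so a spreadsheet will treat the field as text and keep the leading zero. A minimal sketch (pandas is assumed, as in the script; the column name mirrors the one above):

```python
import io
import pandas as pd

# a product code with a leading zero, tab-prefixed as in the scraper
df = pd.DataFrame([{'产品代码': '\t0123'}])
buf = io.StringIO()
df.to_csv(buf, index=False)
out = buf.getvalue()  # the "\t0123" field is written with the tab intact
```

An alternative when reading the data back into Python is `pd.read_csv(..., dtype=str)`, which keeps codes as strings without any prefix, though it does not help when the file is opened directly in Excel.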