While teaching myself web scraping, I went through the progression urllib → requests → regular expressions → XPath and Beautiful Soup → inspecting Ajax requests to parse dynamic sites → Selenium automation for dynamic sites → the Scrapy framework. After hands-on practice, my view is that for small-scale scraping jobs Selenium is the most convenient option, and once you have the earlier foundations in place it is very easy to pick up.
I use the Chrome browser for all of my Selenium-driven scraping. Below is a summary of the things I use most often while scraping, for your reference:
1.1 Deploying Selenium
Deploying Selenium takes two steps.
1.1.1 Install the Python package
pip install selenium
1.1.2 Install ChromeDriver
- Enter chrome://version/ in the Chrome address bar to check your current browser version
- Open the chromedriver download page
- Find the folder matching your version number and download the zip package for your operating system
- Copy chromedriver.exe into the Application folder of your Chrome installation, e.g. C:\Program Files\Google\Chrome\Application, and add this path to the PATH environment variable
- Copy chromedriver.exe into the root directory of your Python installation
1.1.3 Test
After completing the steps above, press Win+R to open a command prompt and run chromedriver. If it starts up and prints its version and listening-port information, the configuration succeeded.
Once deployment succeeds, you can use Selenium from Python.
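As an extra sanity check on the Python side, a minimal script like the one below should open Chrome, load a page, and print its title; the URL is just a placeholder, any reachable page works:
# minimal smoke test: start Chrome, load a page, print its title
from selenium import webdriver

driver = webdriver.Chrome()        # fails here if chromedriver cannot be found
driver.get('https://example.com')  # placeholder URL, any reachable page works
print(driver.title)
driver.quit()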
1.2 Using Selenium
1.2.1 Starting Selenium
First, starting Selenium. Selenium can launch Chrome in four modes:
- chrome normal mode: normal mode; Selenium launches Chrome with a visible browser window, just like normal browsing. I usually use this mode while writing and testing the scraper
- chrome headless mode: headless mode; Selenium launches Chrome without showing a browser window. I usually use this mode for the actual scraping run once the code is finished
- chrome special port mode: special port mode; useful for sites with troublesome login verification. I use it less often; the details are explained below
- chrome download pdf mode: PDF download mode; use this when the scrape needs to download PDF files hosted on the site
Below is the method I wrapped up; feel free to call it directly if you need it:
# package import
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# start driver function
def start_driver(mode):
    # if want to start driver with chrome normal mode
    if mode == 'chrome_normal':
        # get start driver
        driver = webdriver.Chrome()
        # set driver window size, the size parameters can be set as you like
        driver.set_window_size(300, 300)
        return driver
    # if want to start driver with chrome headless mode
    elif mode == 'chrome_headless':
        # setting chrome option with chrome headless mode
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        # get start driver
        driver = webdriver.Chrome(options=chrome_options)
        return driver
    # if want to start driver with chrome special port mode
    elif mode == 'chrome_special_port':
        # setting chrome option with chrome special port mode
        chrome_options = Options()
        # attach to a Chrome instance already listening on this debugging port
        chrome_options.debugger_address = '127.0.0.1:9090'
        # get start driver
        driver = webdriver.Chrome(options=chrome_options)
        return driver
    # if want to start driver with download pdf mode
    elif mode == 'chrome_download_pdf':
        # setting chrome option with chrome download pdf mode
        options = webdriver.ChromeOptions()
        options.add_experimental_option('prefs', {
            # change default directory for downloads
            "download.default_directory": "E:/pdf_download",
            # auto download the file without a save-as prompt
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            # do not show PDFs in Chrome's viewer; download them instead
            "plugins.always_open_pdf_externally": True
        })
        # get start driver
        driver = webdriver.Chrome(options=options)
        return driver
Note on using chrome special port mode: say I want Selenium to drive the Chrome browser opened on port 127.0.0.1:9090. First press Win+R to open a command prompt and run chrome --remote-debugging-port=9090. In the browser that opens, log in and do whatever else is needed to reach a suitable starting state for your scraper, and do not close the browser. Then call the special port startup method above, which attaches the driver to that running browser, and write and test your program.
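For reference, a typical call passes one of the mode strings defined above:
# visible browser for writing and debugging the scraper
driver = start_driver('chrome_normal')
# or attach to the already-logged-in Chrome on port 9090
# driver = start_driver('chrome_special_port')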
1.2.2 Opening a URL with Selenium
Below is the method I wrote for opening a URL with Selenium. It takes the started driver, the wait time wait_second, and the url to open. If the page has not finished loading within the wait time, a TimeoutException is raised; the handler then stops the page load so the script can continue with whatever has already rendered.
# package import
from selenium.common.exceptions import TimeoutException

# open url in chrome
def open_url(driver, wait_second, url):
    # raise TimeoutException if the page takes longer than wait_second to load
    driver.set_page_load_timeout(wait_second)
    try:
        # open this url in chrome
        driver.get(url)
    except TimeoutException:
        # stop the pending page load and continue with what has rendered
        driver.execute_script('window.stop()')
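A typical call, again with a placeholder URL:
# give the page at most 10 seconds to load
driver = start_driver('chrome_normal')
open_url(driver, 10, 'https://example.com')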
1.2.3 Getting nodes with Selenium
1.2.3.1 Getting a single node
This method takes four parameters: the locating strategy, the started driver, the wait time wait_second, and the XPath expression or id/class name that locates the node. It returns the target node; you can read the text it contains via element.text, or fetch an attribute value via element.get_attribute('attribute name').
# package import
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

# element located function
def make_element_located(input_type, driver, wait_second, locate_str):
    # if want to locate by xpath
    if input_type == 'xpath':
        # get element
        element = WebDriverWait(driver, wait_second).until(
            ec.presence_of_element_located((By.XPATH, locate_str)))
        return element
    # if want to locate by id
    elif input_type == 'id':
        # get element
        element = WebDriverWait(driver, wait_second).until(
            ec.presence_of_element_located((By.ID, locate_str)))
        return element
    # if want to locate by class
    elif input_type == 'class':
        # get element
        element = WebDriverWait(driver, wait_second).until(
            ec.presence_of_element_located((By.CLASS_NAME, locate_str)))
        return element
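For example, to wait for a page's first h1 heading and read its text (the XPath expressions here are purely illustrative):
# wait up to 10 seconds for the first <h1>, then read its text
title_element = make_element_located('xpath', driver, 10, '//h1')
print(title_element.text)
# attributes work the same way, e.g. a link target
link = make_element_located('xpath', driver, 10, '//a')
print(link.get_attribute('href'))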
1.2.3.2 Getting a list of nodes
This method takes the same four parameters: the locating strategy, the started driver, the wait time wait_second, and the XPath expression or id/class name that locates the target list of nodes. It returns a list of nodes; you can iterate over the list and then call the method from 1.2.3.1 to get the target node.
# elements located function
def make_elements_located(input_type, driver, wait_second, locate_str):
    # if want to locate by xpath
    if input_type == 'xpath':
        # get elements
        elements = WebDriverWait(driver, wait_second).until(
            ec.presence_of_all_elements_located((By.XPATH, locate_str)))
        return elements
    # if want to locate by id
    elif input_type == 'id':
        # get elements
        elements = WebDriverWait(driver, wait_second).until(
            ec.presence_of_all_elements_located((By.ID, locate_str)))
        return elements
    # if want to locate by class
    elif input_type == 'class':
        # get elements
        elements = WebDriverWait(driver, wait_second).until(
            ec.presence_of_all_elements_located((By.CLASS_NAME, locate_str)))
        return elements
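A typical call iterates over the returned list; the class name 'item' here is hypothetical:
# wait for all nodes with class 'item' and print each one's text
items = make_elements_located('class', driver, 10, 'item')
for item in items:
    print(item.text)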
1.2.4 Opening a new tab with Selenium
page_level is an integer index into driver.window_handles, similar to a Python list index. For example, if the browser already has two tabs, the newly opened tab is the third handle, so pass page_level 2 to switch to it.
# open new label page (tab) function
def open_new_label_page(driver, page_level):
    # open a new blank tab
    driver.execute_script('window.open()')
    # switch the driver to the new tab's handle
    driver.switch_to.window(driver.window_handles[page_level])
1.2.5 Closing a tab with Selenium
page_level is again an index into driver.window_handles, similar to a Python list index; note that the handle list is re-indexed after the close. For example, if the browser has two tabs and you close the newest one, pass page_level 0 to switch back to the remaining first tab.
# close label page (tab) function
def close_label_page(driver, page_level):
    # close the current tab (e.g. a detail page)
    driver.close()
    # switch back to the tab at page_level among the remaining handles
    driver.switch_to.window(driver.window_handles[page_level])
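Putting 1.2.4 and 1.2.5 together, a common pattern is to open a detail page in a new tab, scrape it, then close it and return to the list page. A sketch, assuming the list page is the only open tab (handle index 0) and the URL is a placeholder:
# open the detail page in a new second tab (handle index 1)
open_new_label_page(driver, 1)
open_url(driver, 10, 'https://example.com/detail')
# ... scrape the detail page here ...
# close the detail tab and switch back to the list page
close_label_page(driver, 0)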
These are roughly the basic methods I rely on when scraping website data with Selenium. Selenium has other, more advanced features as well, but I will not go into them here.