Selenium can be used in CI/CD to monitor how a web page behaves, and it can also be used to scrape data. Below is a walkthrough of scraping with Selenium; I use a Selenium-like Firefox extension to generate the Selenium code.
1. Install and open Firefox, then open the menu in the top-right corner and click "Add-ons" to open the Add-ons page.
2. Search for "Katalon Recorder". I am not using "Selenium IDE" here because "Selenium IDE" cannot export code; "Katalon Recorder" is essentially "Selenium IDE" + "Code Export".
3. After installation, the extension icon appears in the top-right corner.
4. Click the extension icon to start recording and generating code.
5. Click the "Record" button to start recording your actions; a Test Suite is generated automatically. As an example, let's run a Baidu search and then open one of the result sites.
6. Note that after typing the query you click the "百度一下" (search) button, or simply press Enter; either way the action is recorded. Once all the steps are done, click the "Stop" button.
7. Click the "Export" button and pick the code format you want in the pop-up window; here I choose Python.
8. Open your own code editor and run the program. Before it can run, though, two things need to be done:
1) Install the selenium package. It used to be that you could not simply pip install it and had to download and install it manually, but now pip works. See the official page for instructions: https://pypi.org/project/selenium/
2) If you run Python on Windows, you also need to download two driver executables into the Scripts subdirectory of your Python installation so that Firefox and Chrome are supported: geckodriver.exe for Firefox and chromedriver.exe for Chrome (see the sketch after the download links for an alternative to the Scripts directory). Download links:
geckodriver.exe: https://github.com/mozilla/geckodriver/releases
chromedriver.exe: http://npm.taobao.org/mirrors/chromedriver/
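Once the package and a driver are in place, a quick sanity check is worthwhile. This is a minimal sketch, not part of the generated code: the version print confirms the pip install, and the executable_path argument (Selenium 3 API) lets you keep the driver somewhere other than the Scripts directory; "C:/tools/chromedriver.exe" is a placeholder path, adjust it to your machine.

import selenium
from selenium import webdriver

print(selenium.__version__)  # confirms the pip install worked

# executable_path is the Selenium 3 way to point at a driver that is not
# on PATH; the path below is a placeholder.
driver = webdriver.Chrome(executable_path="C:/tools/chromedriver.exe")
driver.get("https://www.baidu.com/")
print(driver.title)
driver.quit()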
9. In my own tests the Firefox driver always failed, while Chrome worked without problems.
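If you want to check which driver launches on your machine before running the full export, a small smoke test like the sketch below will tell you (it assumes both driver executables are already on your PATH):

from selenium import webdriver

# Try each browser in turn and report which driver actually starts.
for name, browser_class in [("Chrome", webdriver.Chrome), ("Firefox", webdriver.Firefox)]:
    try:
        driver = browser_class()
        driver.get("https://www.baidu.com/")
        print(name, "OK:", driver.title)
        driver.quit()
    except Exception as e:
        print(name, "failed:", e)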
The complete code is as follows:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re

class UntitledTestCase(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.katalon.com/"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_untitled_test_case(self):
        driver = self.driver
        driver.get("https://www.baidu.com/")
        driver.find_element_by_id("kw").click()
        driver.find_element_by_id("kw").clear()
        driver.find_element_by_id("kw").send_keys(u"人民网")
        driver.find_element_by_id("form").submit()
        driver.find_element_by_link_text(u"人民网_网上的人民日报").click()
        # ERROR: Caught exception [ERROR: Unsupported command [selectWindow | win_ser_1 | ]]

    def is_element_present(self, how, what):
        try:
            self.driver.find_element(by=how, value=what)
        except NoSuchElementException as e:
            return False
        return True

    def is_alert_present(self):
        try:
            self.driver.switch_to_alert()
        except NoAlertPresentException as e:
            return False
        return True

    def close_alert_and_get_its_text(self):
        try:
            alert = self.driver.switch_to_alert()
            alert_text = alert.text
            if self.accept_next_alert:
                alert.accept()
            else:
                alert.dismiss()
            return alert_text
        finally:
            self.accept_next_alert = True

    def tearDown(self):
        self.driver.quit()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
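One caveat: Katalon Recorder exports code against the Selenium 3 API. If you run a recent Selenium (the find_element_by_* helpers were removed in 4.3), the calls have to go through the By-style API instead. Here is a sketch of the same Baidu search rewritten that way, assuming Selenium 4 (where, from 4.6 on, webdriver.Chrome() can also fetch a driver by itself via Selenium Manager):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.baidu.com/")
box = driver.find_element(By.ID, "kw")
box.click()
box.clear()
box.send_keys(u"人民网")
driver.find_element(By.ID, "form").submit()
driver.find_element(By.LINK_TEXT, u"人民网_网上的人民日报").click()
driver.quit()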
10. Of course, your code does not have to keep the generated UnitTest shape. You can extract just the parts you need for scraping; let's use the search results on 人民网 (people.com.cn) as an example of how to get at the data.
# -*- coding: utf-8 -*-
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import json

######################################################################
# result structure
# result {
#     source: "人民网",
#     url: "http://search.people.com.cn/cnpeople/news/error.jsp",
#     keyword: "补贴",
#     news_list: [
#         {
#             subject: "寻求新一代人工智能发展突破口",
#             abstract: "[市场占有率全国前三的龙头企业,给予500万元资金奖励。对制定智...",
#             url: "http://ai.people.com.cn/n1/2018/1214/c422228-30467125.html",
#             pub_date: "2018-12-14 09:47:06",
#             content: "天津市政府近期发布的《新一代人工智能产业发展三年行动计划..."
#         },
#         ...
#     ]
# }
######################################################################

class MyCrawler:
    def __init__(self):
        # Create the webdriver. You can use Chrome or Firefox, but the
        # Firefox driver has problems with newer Firefox versions.
        # self.driver = webdriver.Chrome()
        # You have to download the executable for Chrome or Firefox;
        # alternatively, ChromeDriverManager can download it for us.
        self.driver = webdriver.Chrome(ChromeDriverManager().install())
        self.result = {}

    # Parse one specific website. Because every site has a different page
    # structure, you have to define the parsing yourself.
    def run(self):
        driver = self.driver
        self.result["source"] = u"人民网"
        self.result["url"] = "http://search.people.com.cn/cnpeople/news/error.jsp"
        self.result["keyword"] = u"补贴"
        self.result["news_list"] = []
        driver.get(self.result["url"])
        self.search_word(self.result["keyword"])
        self.parse_search_result()
        result_file_name = self.result["source"] + "_" + self.result["keyword"] + '.json'
        self.save_data(result_file_name)
        driver.close()

    # Type the keyword into the search box and submit the search.
    def search_word(self, keyword):
        driver = self.driver
        driver.find_element_by_xpath("//body").click()
        driver.find_element_by_id("keyword").click()
        driver.find_element_by_id("keyword").clear()
        driver.find_element_by_id("keyword").send_keys(keyword)
        driver.find_element_by_xpath("//img[@onclick=\"createParameter('/cnpeople/search.do','news')\"]").click()

    # Parse the search results for one keyword.
    def parse_search_result(self):
        driver = self.driver
        # Use try/except so the program does not crash once the
        # next-page link no longer exists.
        try:
            while True:
                self.parse_one_page()
                ### Remove this break if you want to scrape every result page ###
                break
                driver.find_element_by_link_text(u"下一页").click()
        except Exception as e:
            print(e)

    # Parse one page of search results.
    def parse_one_page(self):
        driver = self.driver
        search_result = driver.find_element_by_xpath("//*[@class='fr w800']")
        for ul_item in search_result.find_elements_by_tag_name("ul"):
            news = {}
            news_item = ul_item.find_elements_by_tag_name("li")
            news['subject'] = news_item[0].text
            news['abstract'] = news_item[1].text
            news['url'] = news_item[2].text.split()[0]
            news['pub_date'] = news_item[2].text.split()[1]
            news['content'] = self.parse_detail_page(news['url'])
            self.result["news_list"].append(news)
            ### Remove this break if you want to scrape every item on the current page ###
            break

    # Fetch the detail page of one news item and extract its content.
    def parse_detail_page(self, url):
        # Use the managed driver here too, so this works even when
        # chromedriver.exe is not on PATH.
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(url)
        content = ''
        # Different pages may use different tags for the article body,
        # so we have to try each of them in turn.
        try:
            content = driver.find_element_by_id("rwb_zw").text
        except Exception as e:
            try:
                content = driver.find_element_by_xpath("//*[@class='show_text']").text
            except Exception as e:
                try:
                    content = driver.find_element_by_xpath("//*[@class='gray box_text']").text
                except Exception as e:
                    pass
        driver.close()
        return content

    # Save the result to a file; you could also connect to a database
    # and store the data there instead.
    def save_data(self, file_name):
        with open(file_name, "w", encoding='utf-8') as fp:
            json.dump(self.result, fp, ensure_ascii=False)

if __name__ == "__main__":
    crawler = MyCrawler()
    crawler.run()
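Running the script produces 人民网_补贴.json (source + "_" + keyword, as built in run()). A quick way to inspect the output is to load it back:

import json

# The file name follows the source + "_" + keyword pattern used in run().
with open(u"人民网_补贴.json", encoding="utf-8") as fp:
    result = json.load(fp)

print(result["keyword"], len(result["news_list"]))
for news in result["news_list"]:
    print(news["pub_date"], news["subject"], news["url"])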