A Selenium-Based Web Crawler

Selenium is often used in CI/CD pipelines to monitor how a web page behaves, but it can also be used to crawl data. This post walks through scraping data with Selenium; I use a Selenium-style Firefox extension to generate the Selenium code.
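
Before getting into the recorder workflow, here is what driving a browser from Python looks like at its most basic (a minimal sketch; it assumes the chromedriver setup described in step 8 below):

from selenium import webdriver

driver = webdriver.Chrome()           # launch a Chrome window under Selenium's control
driver.get("https://www.baidu.com/")  # navigate the way a user would
print(driver.title)                   # the loaded page is now accessible from Python
driver.quit()                         # shut the browser down
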
1. Install and open Firefox, then open the menu in the upper-right corner and click "Add-ons" to open the Add-ons page


2. Search for "Katalon Recorder". I did not use "Selenium IDE" here because "Selenium IDE" cannot export code; "Katalon Recorder" is effectively "Selenium IDE" + "Code Export"

3. After installing it, the extension icon shows up in the upper-right corner

4. Click the extension icon to start recording your actions and generating code

5. Click the "Record" button to start recording your steps; a Test Suite is generated automatically. As an example, suppose we want to run a Baidu search and then open one of the result sites.


6. Note that after typing the query, you can click the "百度一下" button or simply press Enter; either way the action is recorded. Once all the steps are done, click the "Stop" button.

7. Click the Export button and choose the code format you want in the popup window; here we pick Python

8. Open your own code editor and try to run the program. Before the script can run, two things need to be in place:

1) Install the selenium package. It used to be impossible to install directly with pip install and had to be downloaded and installed manually, but now pip install works. See the official page for instructions: https://pypi.org/project/selenium/
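
To confirm the package is importable after installation, a trivial check:

import selenium
print(selenium.__version__)  # e.g. 3.141.0, depending on which release pip picked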

2) If you run Python on Windows, you also need to download two driver executables and drop them into the Scripts subdirectory of your Python installation (any directory on your PATH also works) so that the Firefox and Chrome drivers can be found: geckodriver.exe for Firefox and chromedriver.exe for Chrome. Download locations:

geckodriver.exe: https://github.com/mozilla/geckodriver/releases

chromedriver.exe: http://npm.taobao.org/mirrors/chromedriver/
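
If you would rather not put the driver on the PATH, Selenium 3.x also accepts an explicit driver location; the path below is a placeholder for wherever you saved the executable:

from selenium import webdriver

# Selenium 3.x style; Selenium 4 passes the path through a Service object instead
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")
driver.get("https://www.baidu.com/")
driver.quit()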

9. When I tried the Firefox driver it always failed; Chrome worked without problems.

The complete exported code is as follows:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re

class UntitledTestCase(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.katalon.com/"
        self.verificationErrors = []
        self.accept_next_alert = True
    
    def test_untitled_test_case(self):
        driver = self.driver
        driver.get("https://www.baidu.com/")
        driver.find_element_by_id("kw").click()
        driver.find_element_by_id("kw").clear()
        driver.find_element_by_id("kw").send_keys(u"人民网")
        driver.find_element_by_id("form").submit()
        driver.find_element_by_link_text(u"人民网_网上的人民日报").click()
        # ERROR: Caught exception [ERROR: Unsupported command [selectWindow | win_ser_1 | ]]
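        # ("selectWindow" switches focus to the tab opened by the click above;
        #  Katalon Recorder cannot export that command, so with plain Selenium
        #  you would call driver.switch_to.window(window_handle) yourself.)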
    
    def is_element_present(self, how, what):
        try: self.driver.find_element(by=how, value=what)
        except NoSuchElementException as e: return False
        return True
    
    def is_alert_present(self):
        try: self.driver.switch_to_alert()
        except NoAlertPresentException as e: return False
        return True
    
    def close_alert_and_get_its_text(self):
        try:
            alert = self.driver.switch_to_alert()
            alert_text = alert.text
            if self.accept_next_alert:
                alert.accept()
            else:
                alert.dismiss()
            return alert_text
        finally: self.accept_next_alert = True
    
    def tearDown(self):
        self.driver.quit()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
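
Note that this export uses the old find_element_by_* helpers from Selenium 3, which were removed in Selenium 4. If you are on a current Selenium release, the same recorded steps would look roughly like this (my own sketch of the Selenium 4 equivalents, not Katalon output):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ can fetch a matching driver via Selenium Manager
driver.get("https://www.baidu.com/")
search_box = driver.find_element(By.ID, "kw")
search_box.clear()
search_box.send_keys(u"人民网")
driver.find_element(By.ID, "form").submit()
driver.find_element(By.LINK_TEXT, u"人民网_网上的人民日报").click()
driver.quit()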

10. Of course, your own code does not have to keep the generated UnitTest shape. You can lift just the parts you need for crawling. Using the search results on 人民网 (People's Daily Online) as an example, here is how to actually pull the data.

# -*- coding: utf-8 -*-
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import json

######################################################################
# result structure:
# result = {
#     "source": "人民网",
#     "url": "http://search.people.com.cn/cnpeople/news/error.jsp",
#     "keyword": "补贴",
#     "news_list": [
#         {
#             "subject": "寻求新一代人工智能发展突破口",
#             "abstract": "[市场占有率全国前三的龙头企业,给予500万元资金奖励。对制定智...",
#             "url": "http://ai.people.com.cn/n1/2018/1214/c422228-30467125.html",
#             "pub_date": "2018-12-14 09:47:06",
#             "content": "天津市政府近期发布的《新一代人工智能产业发展三年行动计划...",
#         },
#         ...
#     ]
# }
######################################################################

class MyCrawler:
    def __init__(self):
        # Create the webdriver. Chrome and Firefox both work in principle,
        # but the Firefox driver has problems with recent Firefox versions.
        # self.driver = webdriver.Chrome()
        # Normally you must download the driver executable yourself;
        # webdriver_manager can fetch a matching chromedriver for us.
        self.driver = webdriver.Chrome(ChromeDriverManager().install())
        self.result = {}

    # Crawl one specific site. Every site has a different page structure,
    # so you have to define the parsing logic for it yourself.
    def run(self):
        driver = self.driver
        self.result["source"] = u"人民网"
        self.result["url"] = "http://search.people.com.cn/cnpeople/news/error.jsp"
        self.result["keyword"] = u"补贴"
        self.result["news_list"] = []

        driver.get(self.result["url"])
        self.search_word(self.result["keyword"])
        self.parse_search_result()
        result_file_name = self.result["source"] + "_" + self.result["keyword"] + '.json'
        self.save_data(result_file_name)
        driver.close()

    # Type a keyword into the search box and submit the search
    def search_word(self, keyword):
        driver = self.driver
        driver.find_element_by_xpath("//body").click()
        driver.find_element_by_id("keyword").click()
        driver.find_element_by_id("keyword").clear()
        driver.find_element_by_id("keyword").send_keys(keyword)
        driver.find_element_by_xpath("//img[@onclick=\"createParameter('/cnpeople/search.do','news')\"]").click()

    # Parse the search results for one keyword
    def parse_search_result(self):
        driver = self.driver
        # try/except keeps the program from crashing when the "next page" link no longer exists
        try:
            while True:
                self.parse_one_page()
                ### remove this break if you want to crawl every page of search results ###
                break
                driver.find_element_by_link_text(u"下一页").click()
        except Exception as e:
            print(e)

    # Parse one page of search results
    def parse_one_page(self):
        driver = self.driver
        search_result = driver.find_element_by_xpath("//*[@class='fr w800']")
        for ul_item in search_result.find_elements_by_tag_name("ul"):
            news = {}
            news_item = ul_item.find_elements_by_tag_name("li")
            news['subject'] = news_item[0].text
            news['abstract'] = news_item[1].text
            news['url'] = news_item[2].text.split()[0]
            news['pub_date'] = news_item[2].text.split()[1]
            news['content'] = self.parse_detail_page(news['url'])

            self.result["news_list"].append(news)
            ### remove this break if you want every item on the current page ###
            break

    # Fetch the full text of one news article
    def parse_detail_page(self, url):
        # open the detail page in a second browser so the search-result page stays loaded
        driver = webdriver.Chrome(ChromeDriverManager().install())
        driver.get(url)
        content = ''
        # Different article pages keep the body text in different containers,
        # so try each known locator in turn until one matches.
        locators = [(By.ID, "rwb_zw"),
                    (By.XPATH, "//*[@class='show_text']"),
                    (By.XPATH, "//*[@class='gray box_text']")]
        for how, what in locators:
            try:
                content = driver.find_element(how, what).text
                break
            except NoSuchElementException:
                continue
        driver.close()
        return content

    # Save the result to a JSON file; you could also write it into a database instead
    def save_data(self, file_name):
        with open(file_name, "w", encoding='utf-8') as fp:
            json.dump(self.result, fp, ensure_ascii=False) 


if __name__ == "__main__":
    crawler = MyCrawler()
    crawler.run()
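
Running the script produces 人民网_补贴.json in the working directory. A quick sanity check of the output, based on the result structure sketched above:

import json

with open(u"人民网_补贴.json", encoding="utf-8") as fp:
    result = json.load(fp)

print(result["keyword"])                  # 补贴
print(result["news_list"][0]["subject"])  # subject of the first article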
