Selenium + PhantomJS Tutorial

This post borrows from the previous Selenium-based post here. If you have heard of PhantomJS, would like to try it out, and are curious to see how it performs against other browsers such as Chrome, this post will help. In my experience, though, using the PhantomJS browser for web scraping doesn’t have many benefits compared to using Chrome or Firefox (unless you need to run your script on a server, in which case it’s your go-to). It is faster, though not by as much as you might hope, and I’ve found it to be much less reliable: it can randomly freeze on tasks that run smoothly on Chrome, despite extensive tweaking and troubleshooting. My current opinion is that it’s more trouble than it’s worth for web scraping purposes, but if you want to try it out for yourself, I hope you’ll find the tutorial below helpful.

If you aren’t familiar with it, PhantomJS is a browser much like Chrome or Firefox but with one important difference: it’s headless. This means that using PhantomJS doesn’t require an actual browser window to be open. To install the PhantomJS browser, go here and choose the appropriate download (I’ll assume Windows from here on out, though the process is similar on other operating systems). Unzip the downloaded file, named something like “phantomjs-2.1.1-windows.zip”, and that’s it: PhantomJS is installed. If you go into the unzipped folder, and then into the bin folder, you should find a file named “phantomjs.exe”. All we need to do now is reference that file’s path in our script to launch the browser.
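
As a small aid for the path step, here is a minimal sketch of a helper that looks for the executable inside the unzipped folder and, failing that, on the system PATH. The function name `find_phantomjs` and its `unzipped_dir` argument are hypothetical; they are not part of Selenium or PhantomJS.

```python
import shutil
from pathlib import Path


def find_phantomjs(unzipped_dir):
    """Return the path to the PhantomJS executable as a string, or None.

    Hypothetical helper: looks in <unzipped_dir>/bin first (the layout
    described above), then falls back to searching the system PATH.
    """
    bin_dir = Path(unzipped_dir) / "bin"
    # "phantomjs.exe" on Windows, plain "phantomjs" on other systems
    for name in ("phantomjs.exe", "phantomjs"):
        candidate = bin_dir / name
        if candidate.is_file():
            return str(candidate)
    # shutil.which returns None when nothing is found on PATH
    return shutil.which("phantomjs")
```

The returned string could then be used wherever the script expects the path to the executable, and a `None` result is an early hint that the folder is not where you thought it was.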

Here is the start of our script from last time:

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of the element 
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])
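
The all-or-nothing behaviour of `get_elements` can be checked without launching a browser. The sketch below repeats the same logic on plain Python objects; `FakeElement` and `texts_or_nan` are hypothetical names used only for illustration, and `float("nan")` stands in for NumPy’s `nan`.

```python
nan = float("nan")


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is used."""

    def __init__(self, text):
        self.text = text


def texts_or_nan(elements, expected=4):
    """Return each element's text, or a list of nan if the count is wrong."""
    # a partial match (for example 3 of 4 quarters) is treated as missing data
    if len(elements) != expected:
        return [nan] * expected
    return [e.text for e in elements]


full = [FakeElement(t) for t in ("Q1", "Q2", "Q3", "Q4")]
partial = full[:3]

print(texts_or_nan(full))
print(texts_or_nan(partial))
```

Returning a fixed-length list in both cases is what lets the later loop index positions 0 through 3 without checking whether the scrape succeeded.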

Now we launch the browser, referencing the PhantomJS executable:

## raw string so the backslashes in the Windows path are taken literally
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)

However, at least for me, simply launching the browser like this resulted in highly unreliable web scraping that would freeze at seemingly random times. To make a long story short, here is some revised code for launching the browser that I found improved performance.

dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
## raw string so the backslashes in the Windows path are taken literally
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path, 
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'], 
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
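
If page loads still stall occasionally, one option (not something the settings above do) is to wrap `browser.get` in a small retry loop. The following is only a sketch under that assumption: `get_with_retry` is a hypothetical helper, it works with any object that has a `get(url)` method, and the catch-all `except` would be better narrowed to the specific exception you actually see.

```python
import time


def get_with_retry(browser, url, attempts=3, pause=2.0):
    """Call browser.get(url), retrying if it raises.

    Returns the 1-based number of the attempt that succeeded and
    re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            browser.get(url)
            return attempt
        except Exception:  # assumption: any failure is worth one more try
            if attempt == attempts:
                raise
            # give the page or the network a moment before trying again
            time.sleep(pause)
```

A retry only helps if a stalled load eventually raises instead of hanging, so it would usually be paired with Selenium’s `browser.set_page_load_timeout(...)`. Whether that combination cures the freezes described above is not something this post has tested.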

Now I’ll let you compare the runtimes for PhantomJS and Chrome. The script below is set to run PhantomJS, so just paste it into your own IDE and run it. To test Chrome instead, comment out the PhantomJS launch section and remove the triple quotes around the Chrome launch section.

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of the element 
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])

start_time = time.time()

## launch the PhantomJS browser
###############################################################################
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
## raw string so the backslashes in the Windows path are taken literally
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path, 
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'], 
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
###############################################################################
"""
## launch the Chrome browser
###############################################################################    
my_path = r"C:\Users\gstanton\Downloads\chromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
###############################################################################
"""

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly" 
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    
    ## navigate to balance sheet quarterly page 
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    
    ## navigate to cash flow quarterly page 
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):  
        ## each company occupies 4 consecutive rows, so offset by 4 * i
        row = i * 4 + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
   
browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)

print(time.time() - start_time)
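
If you would rather collect both timings in one run than edit the script between runs, a small timing helper is one way to do it. This is a minimal sketch, not part of the original script: `timed` and the `results` dictionary are hypothetical names, and the workload in the demo is a stand-in for the scraping loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # recorded even if the block raises, so a crash still leaves a timing
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    results = {}
    with timed("stand-in workload", results):
        sum(range(100000))  # replace with the scraping loop for one browser
    for label, seconds in results.items():
        print("{}: {:.3f} s".format(label, seconds))
```

Wrapping the PhantomJS run and the Chrome run in separate `with timed(...)` blocks would put both durations in the same dictionary for a side-by-side comparison.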

Translated from: https://www.pybloggers.com/2016/11/selenium-phantomjs-tutorial/
