Scraping Financial Data with Selenium
Note: The following post is a significant step up in difficulty from the previous selenium-based post, Automate Your Browser: A Guided Selenium Adventure. Please see the start of that post for links on getting selenium set up if this is your first time using it. If you really do need financial data, there are likely easier ways to obtain it than scraping Nasdaq or Yahoo or Morningstar with selenium. Examples may include Quandl and Yahoo’s finance API, or perhaps building a scraper with scrapy and splash. And there are many proprietary (and expensive) databases out there that will provide such data. But in any case, I hope this post is helpful in demonstrating a few more of the practices involved in real-life webscraping. The full script is at the end of the post for your convenience.
One fine Monday morning, Todd is sipping a hot cup of decaf green tea, gazing out the office window in a state of Zen oneness as a Selenium script does his work for him. But just as he is on the brink of enlightenment, his boss, Mr. Peabody, bursts into his cubicle and barks, “TODD, quit daydreaming. I just got word from the CEO: we need quarterly financials on some of our competitors.” “Oh? What for?” “Some competitive analysis or something. We’ll be doing it on a regular basis. In any case, we need that data TODAY or YOU’RE FIRED!”
As Mr. Peabody stomps away, Todd lets out a sigh. His morning had been going so well, but now it seems he has to actually do some work. He decides, though, that if he's going to do work, he's going to do everything in his power to make sure he never has to do that work again. Brainstorming sources of financial data, Todd figures he could get it from nasdaq.com as easily as anywhere else. He navigates to the quarterly income statement of the first company on the list, Apple (ticker symbol: AAPL).
http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly
The first thing Todd notices is that the actual financial data table is being generated via JavaScript (look for the &lt;script&gt; tags in the HTML). This means that Python packages such as lxml and Beautiful Soup, which don't execute JavaScript, won't be much help here. Todd knows that selenium doesn't make for the fastest webscraper, but because he only needs data on 5 companies (Amazon, Apple, Facebook, IBM, Microsoft), he still decides to write up another quick selenium script.
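One quick way to verify that a table is rendered client-side is to fetch the page without a browser and check whether the data shows up in the static HTML. A minimal sketch of that check, assuming the requests and lxml packages are available (the xpath mirrors the one the scraper uses later):

import requests
from lxml import html

url = "http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly"
raw = requests.get(url).text

## parse the static HTML and apply the same kind of xpath the scraper uses later
tree = html.fromstring(raw)
cells = tree.xpath("//tbody/tr/th[text() = 'Total Revenue']/../td[contains(text(), '$')]")

## a JavaScript-rendered table leaves nothing here for lxml (or Beautiful Soup) to find
print(len(cells))   ## expect 0 if the table is built client-side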
To start, he knows he needs to make some imports, initialize a dataframe to store his scraped data in, and launch the browser.
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])
## launch the Chrome browser
my_path = "C:UsersgstantonDownloadschromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
Next, Todd thinks about how he's going to get from company page to company page. Observing the current page's URL, he sees that substituting in the company's ticker symbol and desired financial statement at the appropriate places should allow him to navigate to all the pages he needs, no simulated clicking required. He also sees a common pattern in the xpaths for the financial data he'll be scraping.
url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"
## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]
for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
The first thing he wants to grab is the company ticker symbol, just so he can verify he’s scraping the correct page.
for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except:
        company = nan
Notice the line for the assignment of the company variable. This tells the browser to check and see if the element is present, just as it normally would. If the element isn’t present, the browser will check again for the element every half second until the specified 10 seconds are up. Then it will throw an exception. This sort of precaution can be very useful for making your scrapers more reliable.
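For reference, both the polling interval and the exception type can be made explicit. A minimal sketch, reusing browser, company_xpath, and nan from the code above (poll_frequency and TimeoutException are part of selenium's public API):

from selenium.common.exceptions import TimeoutException

try:
    ## wait up to 10 seconds, re-checking for the element every 0.5 seconds (the default interval)
    wait = WebDriverWait(browser, 10, poll_frequency=0.5)
    company = wait.until(
        EC.presence_of_element_located((By.XPATH, company_xpath))).text
except TimeoutException:
    ## the element never appeared, so record a missing value instead of crashing
    company = nan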
Examining the xpaths for the rest of the financial info, Todd sees that he will be collecting data points in groups of 4 (one data point for each quarter). To account for the possibility that some data might be missing, and to efficiently extract the text from the web elements, Todd writes the following function to simplify the scraping code.
## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of the elements
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text
Todd then finishes the code to loop through each of the company symbols and get the quarterly financial data from each of the financial statements.
## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]
for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except:
        company = nan
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))
So for each iteration of the loop, Todd is collecting these data points. But he needs somewhere to store them. That’s where the pandas dataframe comes in. The following for loop ensures that the data is placed appropriately in the dataframe.
## fill the dataframe with the scraped data, 4 rows per company
for j in range(4):
    ## each company occupies 4 consecutive rows
    row = 4 * i + j
    df.loc[row, 'company'] = company
    df.loc[row, 'quarter'] = quarters[j]
    df.loc[row, 'quarter_ending'] = quarter_endings[j]
    df.loc[row, 'total_revenue'] = total_revenue[j]
    df.loc[row, 'gross_profit'] = gross_profit[j]
    df.loc[row, 'net_income'] = net_income[j]
    df.loc[row, 'total_assets'] = total_assets[j]
    df.loc[row, 'total_liabilities'] = total_liabilities[j]
    df.loc[row, 'total_equity'] = total_equity[j]
    df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
After remembering to close the browser and write his dataframe to a .csv file, Todd has his scraper. Kicking his feet back up on his desk, he breathes a sigh of relief and continues his deep meditations on the nature of being while selenium once again does his work for him.
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of the elements
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text
## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending',
                           'total_revenue', 'gross_profit', 'net_income',
                           'total_assets', 'total_liabilities', 'total_equity',
                           'net_cash_flow'])
## launch the Chrome browser
my_path = "C:UsersastantonDownloadschromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly"
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"
## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]
for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except:
        company = nan
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    ## navigate to balance sheet quarterly page
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    ## navigate to cash flow quarterly page
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))
    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        ## each company occupies 4 consecutive rows
        row = 4 * i + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
browser.quit()
## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)
Translated from: https://www.pybloggers.com/2016/11/scraping-financial-data-with-selenium/