How to scrape websites with Python and BeautifulSoup

by Justin Yek

There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it.

You need web scraping.

Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial, we’ll focus on its applications in the financial market, but web scraping can be used in a wide variety of situations.

If you’re an avid investor, getting closing prices every day can be a pain, especially when the information you need is found across several webpages. We’ll make data extraction easier by building a web scraper to retrieve stock indices automatically from the Internet.

Getting Started

We are going to use Python as our scraping language, together with a simple and powerful library, BeautifulSoup.

  • For Mac users, Python is pre-installed in OS X. Open up Terminal and type python --version. You should see that your Python version is 2.7.x.

  • For Windows users, please install Python through the official website.

Next we need to get the BeautifulSoup library using pip, a package management tool for Python.

In the terminal, type:

easy_install pip
pip install BeautifulSoup4

Note: If you fail to execute the above command line, try adding sudo in front of each line.

The Basics

Before we start jumping into the code, let’s understand the basics of HTML and some rules of scraping.

HTML tags

If you already understand HTML tags, feel free to skip this part.

<!DOCTYPE html>  
<html>  
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>

This is the basic syntax of an HTML webpage. Every <tag> serves a block inside the webpage:

  1. <!DOCTYPE html>: HTML documents must start with a type declaration.
  2. The HTML document is contained between <html> and </html>.
  3. The meta and script declarations of the HTML document are between <head> and </head>.
  4. The visible part of the HTML document is between the <body> and </body> tags.
  5. Title headings are defined with the <h1> through <h6> tags.
  6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.

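As a quick, hedged illustration of these tags (the tiny HTML fragment and its values below are invented, and we are jumping slightly ahead to the BeautifulSoup library we just installed), here is how you might pull the rows out of a table:

from bs4 import BeautifulSoup

# an invented fragment with a hyperlink and a two-row table
html = """
<a href="http://example.com">a hyperlink</a>
<table>
    <tr><td>Index</td><td>Price</td></tr>
    <tr><td>S&amp;P 500</td><td>1,234.56</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('tr'):                      # every <tr> table row
    cells = [td.text for td in row.find_all('td')]   # every <td> cell in that row
    print(cells)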

Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag and the value must be unique within the HTML document. The class attribute is used to define equal styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want.

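To see how these attributes help us locate data, here is a small, hedged sketch; the fragment, the id value quote-header, and the class value price are all made up for illustration:

from bs4 import BeautifulSoup

# an invented fragment: one unique id, two elements sharing a class
html = """
<h1 id="quote-header">A made-up index name</h1>
<span class="price">1,234.56</span>
<span class="price">7,890.12</span>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='quote-header').text)                 # an id is unique, so find() one element
print([s.text for s in soup.find_all(class_='price')])   # a class can repeat, so find_all() them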

For more information on HTML tags, id and class, please refer to W3Schools Tutorials.

Scraping Rules

  1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.

  2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice (see the sketch after this list).

  3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
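
Here is a minimal, hedged sketch of that one-request-per-second pacing, written against the same Python 2 / urllib2 setup the rest of this tutorial uses (the URL list is just the two Bloomberg pages used later as examples):

import time
import urllib2

# the same example pages used later in this tutorial
urls = ['http://www.bloomberg.com/quote/SPX:IND',
        'http://www.bloomberg.com/quote/CCMP:IND']

for url in urls:
    page = urllib2.urlopen(url)   # fetch one page
    html = page.read()            # ...do your parsing here...
    time.sleep(1)                 # pause one second before the next request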

Inspecting the Page

Let’s take one page from the Bloomberg Quote website as an example.

As someone following the stock market, we would like to get the index name (S&P 500) and its price from this page. First, right-click and open your browser’s inspector to inspect the webpage.

Try hovering your cursor on the price and you should be able to see a blue box surrounding it. If you click it, the related HTML will be selected in the browser console.

From the result, we can see that the price is inside a few levels of HTML tags, which is <div class="basic-quote"><div class="price-container up"><div class="price">.

Similarly, if you hover and click the name “S&P 500 Index”, it is inside <div class="basic-quote"> and <h1 class="name">.

Now we know the unique location of our data with the help of class tags.

Jump into the Code

Now that we know where our data is, we can start coding our web scraper. Open your text editor now!

First, we need to import all the libraries that we are going to use.

# import libraries
import urllib2
from bs4 import BeautifulSoup

Next, declare a variable for the url of the page.

# specify the url
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'

Then, make use of urllib2 to get the HTML of the page at the URL we declared.

# query the website and return the html to the variable ‘page’
page = urllib2.urlopen(quote_page)
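
Note that urllib2 only exists on the Python 2.7.x this tutorial assumes; on Python 3 the same call lives in urllib.request, roughly like this:

# Python 3 equivalent of the line above
from urllib.request import urlopen

page = urlopen(quote_page)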

Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

Now we have a variable, soup, containing the HTML of the page. Here’s where we can start coding the part that extracts the data.

Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content with find(). In this case, since the HTML class name is unique on this page, we can simply query <h1 class="name">.

# Take out the <div> of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})

After we have the tag, we can get the data by getting its text.

name = name_box.text.strip() # strip() is used to remove starting and trailing whitespace
print name

Similarly, we can get the price too.

# get the index price
price_box = soup.find('div', attrs={'class': 'price'})
price = price_box.text
print price

When you run the program, you should be able to see that it prints out the current price of the S&P 500 Index.

Export to Excel CSV

Now that we have the data, it is time to save it. The Excel Comma Separated Format is a nice choice. It can be opened in Excel so you can see the data and process it easily.

But first, we have to import the Python csv module and the datetime module to get the record date. Insert these lines to your code in the import section.

import csv
from datetime import datetime

At the bottom of your code, add the code for writing data to a csv file.

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])

Now if you run your program, you should be able to export an index.csv file, which you can then open with Excel, where you should see a line of data.

So if you run this program every day, you will be able to easily get the S&P 500 Index price without rummaging through the website!

Going Further (Advanced uses)

Multiple Indices

So scraping one index is not enough for you, right? We can try to extract multiple indices at the same time.

First, modify the quote_page into an array of URLs.

quote_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']

Then we change the data extraction code into a for loop, which will process the URLs one by one and store all the data in a variable data as a list of tuples.

# for loop
data = []
for pg in quote_page:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(pg)

    # parse the html using beautiful soup and store in variable `soup`
    soup = BeautifulSoup(page, 'html.parser')

    # Take out the <div> of name and get its value
    name_box = soup.find('h1', attrs={'class': 'name'})
    name = name_box.text.strip() # strip() is used to remove starting and trailing whitespace

    # get the index price
    price_box = soup.find('div', attrs={'class': 'price'})
    price = price_box.text

    # save the data in a tuple
    data.append((name, price))

Also, modify the saving section to save data row by row.

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    # The for loop
    for name, price in data:
        writer.writerow([name, price, datetime.now()])

Rerun the program and you should be able to extract two indices at the same time!

Advanced Scraping Techniques

BeautifulSoup is simple and great for small-scale web scraping. But if you are interested in scraping data at a larger scale, you should consider using these other alternatives:

  1. Scrapy, a powerful Python scraping framework (a minimal spider sketch follows this list)

  2. Try to integrate your code with some public APIs. The efficiency of data retrieval is much higher than scraping webpages. For example, take a look at the Facebook Graph API, which can help you get hidden data which is not shown on Facebook webpages.

  3. Consider using a database backend like MySQL to store your data when it gets too large.
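
To make the first option concrete, here is a minimal, hedged sketch of the same name/price extraction written as a Scrapy spider. The spider name, the CSS selectors, and the assumption of a reasonably recent Scrapy release are illustrative, not part of the original tutorial:

import scrapy

class IndexSpider(scrapy.Spider):
    # hypothetical spider that yields the same name/price pair as the code above
    name = 'index'
    start_urls = ['http://www.bloomberg.com/quote/SPX:IND',
                  'http://www.bloomberg.com/quote/CCMP:IND']

    def parse(self, response):
        # the CSS selectors mirror the classes we found while inspecting the page
        yield {
            'name': response.css('h1.name::text').get(),
            'price': response.css('div.price::text').get(),
        }

You would run it with something like scrapy runspider spider.py -o index.csv, and Scrapy writes the yielded items straight to a CSV file.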

Adopt the DRY Method

DRY stands for “Don’t Repeat Yourself”; try to automate your everyday tasks like this person. Some other fun projects to consider might be keeping track of your Facebook friends’ active time (with their consent of course), or grabbing a list of topics in a forum and trying out natural language processing (which is a hot topic for Artificial Intelligence right now)!

If you have any questions, please feel free to leave a comment below.

This article was originally published on Altitude Labs’ blog and was written by our software engineer, Leonard Mok. Altitude Labs is a software agency that specializes in personalized, mobile-first React apps.

Translated from: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
