How to Scrape Websites with Python 3

Web scraping is the process of extracting data from websites.

Before attempting to scrape a website, you should make sure that the provider allows it in their terms of service. You should also check to see whether you could use an API instead.

Massive scraping can put a server under a lot of stress, which can result in a denial of service. And you don't want that.

Who should read this?

This article is for advanced readers. It will assume that you are already familiar with the Python programming language.

At the very minimum you should understand list comprehensions, context managers, and functions. You should also know how to set up a virtual environment.

We'll run the code on your local machine to explore some websites. With some tweaks you could make it run on a server as well.

What you will learn in this article

By the end of this article, you will know how to download a webpage, parse it for interesting information, and put it into a usable format for further processing. This is also known as ETL (extract, transform, load).

This article will also explain what to do if that website is using JavaScript to render content (like React.js or Angular).

Prerequisites

Before I can start, I want to make sure we're ready to go. Please set up a virtual environment and install the following packages into it:

  • beautifulsoup4 (version 4.9.0 at time of writing)
  • requests (version 2.23.0 at time of writing)
  • wordcloud (version 1.17.0 at time of writing, optional)
  • selenium (version 3.141.0 at time of writing, optional)
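
For example, on a GNU/Linux system the setup might look something like this (the exact commands depend on your shell and operating system):

    python3 -m venv .venv
    source .venv/bin/activate
    pip install beautifulsoup4 requests wordcloud selenium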

You can find the code for this project in this git repository on GitHub.

For this example, we are going to scrape the Basic Law for the Federal Republic of Germany. (Don't worry, I checked their Terms of Service. They offer an XML version for machine processing, but this page serves as an example of processing HTML. So it should be fine.)

Step 1: Download the source

First things first: I create a file urls.txt holding all the URLs I want to download:

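The file simply lists one URL per line. It could look like this (the first entry is the article page used later in this post; the second is just an assumed example following the same pattern):

    https://www.gesetze-im-internet.de/gg/art_1.html
    https://www.gesetze-im-internet.de/gg/art_2.html
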
Next, I write a bit of Python code in a file called scraper.py to download the HTML of these files.

In a real scenario, this would be too expensive and you'd use a database instead. To keep things simple, I'll download the files into the same directory next to the URL store and use the last part of each URL as the filename.

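A first version of scraper.py might look roughly like this (a sketch; the details are my own reconstruction rather than the original code):

    from pathlib import Path

    import requests


    def download(url):
        # Use the last part of the URL as the local filename, e.g. "art_1.html".
        filename = url.rsplit("/", 1)[-1]
        print(f"Downloading {url} to {filename}")
        response = requests.get(url)
        response.raise_for_status()
        Path(filename).write_text(response.text, encoding="utf-8")


    if __name__ == "__main__":
        # urls.txt holds one URL per line.
        with open("urls.txt") as urls_file:
            for url in (line.strip() for line in urls_file if line.strip()):
                download(url)
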
By downloading the files, I can process them locally as much as I want without being dependent on a server. Try to be a good web citizen, okay?

Step 2: Parse the source

Now that I've downloaded the files, it's time to extract their interesting features. Therefore I go to one of the pages I downloaded, open it in a web browser, and hit Ctrl-U to view its source. Inspecting it will show me the HTML structure.

In my case, I figured I want the text of the law without any markup. The element wrapping it has an id of container. Using BeautifulSoup I can see that a combination of find and get_text will do what I want.

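In code, that boils down to something like the following sketch (the filename is just an example):

    from bs4 import BeautifulSoup

    with open("art_1.html", encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

    # The element with id="container" wraps the law text; get_text() drops the markup.
    law_text = soup.find(id="container").get_text()
    print(law_text)
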
Since I have a second step now, I'm going to refactor the code a bit by putting it into functions and adding a minimal CLI.

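A rough sketch of the refactored scraper.py might look like this (the structure and names are my own reconstruction; only the behaviour listed below is taken from the original):

    import sys
    from pathlib import Path

    import requests
    from bs4 import BeautifulSoup


    def download(url):
        """Download a URL and save it locally, named after the last part of the URL."""
        filename = url.rsplit("/", 1)[-1]
        response = requests.get(url)
        response.raise_for_status()
        Path(filename).write_text(response.text, encoding="utf-8")
        return filename


    def parse(filename):
        """Extract the plain law text from a downloaded HTML file and save it as .txt."""
        html = Path(filename).read_text(encoding="utf-8")
        soup = BeautifulSoup(html, "html.parser")
        text = soup.find(id="container").get_text()
        Path(filename).with_suffix(".txt").write_text(text, encoding="utf-8")
        return text


    def main():
        # Minimal CLI: no arguments -> full run over urls.txt;
        # "download <url>" -> download only; "parse <file>" -> parse only.
        args = sys.argv[1:]
        if not args:
            with open("urls.txt") as urls_file:
                for url in (line.strip() for line in urls_file if line.strip()):
                    parse(download(url))
        elif args[0] == "download":
            download(args[1])
        elif args[0] == "parse":
            parse(args[1])
        else:
            print("Usage: python scraper.py [download <url> | parse <file>]")


    if __name__ == "__main__":
        main()
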
Now I can run the code in three ways:

  1. Without any arguments to run everything (that is, download all URLs and extract them, then save to disk) via: python scraper.py

  2. With an argument of download and a URL to download: python scraper.py download https://www.gesetze-im-internet.de/gg/art_1.html. This will not process the file.

  3. With an argument of parse and a filepath to parse: python scraper.py parse art_1.html. This will skip the download step.

With that, there's one last thing missing.

Step 3: Format the source for further processing

Let's say I want to generate a word cloud for each article. This can be a quick way to get an idea about what a text is about. For this, install the package wordcloud and update the file like this:

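A sketch of that update could look like this (assuming the German stopword list has already been saved to a local file called stopwords-de.txt; the original fetches it from GitHub):

    from pathlib import Path

    from wordcloud import WordCloud


    def load(filename, text):
        """Generate a word cloud image next to the text file, using the same basename."""
        # stopwords-de.txt is assumed to hold the German stopword list, one word per
        # line; extend it with any additional words you want to exclude from the image.
        stopwords = set(Path("stopwords-de.txt").read_text(encoding="utf-8").split())
        cloud = WordCloud(stopwords=stopwords).generate(text)
        cloud.to_file(str(Path(filename).with_suffix(".png")))
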
What changed? For one, I downloaded a list of German stopwords from GitHub. This way, I can eliminate the most common words from the downloaded law text.

Then I create a WordCloud instance with the list of stopwords I downloaded and the text of the law. It is turned into an image with the same basename.

After the first run, I discover that the list of stopwords is incomplete. So I add additional words I want to exclude from the resulting image.

With that, the main part of web scraping is complete.

Bonus: What about SPAs?

SPAs, or Single Page Applications, are web applications where the whole experience is controlled by JavaScript, which is executed in the browser. As such, downloading the HTML file does not get us very far. What should we do instead?

We'll use the browser, with Selenium. Make sure to install a driver as well. Download the .tar.gz archive and unpack it into the bin folder of your virtual environment so Selenium can find it. That is the directory where you can find the activate script (on GNU/Linux systems).

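On a GNU/Linux system with Firefox, that could look roughly like this (assuming geckodriver; adjust the version and platform to the current release):

    wget https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz
    tar -xzf geckodriver-v0.26.0-linux64.tar.gz -C "$VIRTUAL_ENV/bin"
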
As an example, I am using the Angular website here. Angular is a popular SPA framework written in JavaScript and, for the time being, guaranteed to be controlled by it.

Since the code will be slower, I create a new file called crawler.py for it. The content looks like this:

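A minimal sketch of the extract step in crawler.py could look like this (only the behaviour described below is taken from the article; the transform and load steps would reuse the word-cloud code from scraper.py):

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    def extract(url):
        """Open the page in a real browser and copy the text of its <article> element."""
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            article = driver.find_element(By.TAG_NAME, "article")
            # Store the text under its URL so the transform step knows where it came from.
            return {url: article.text}
        finally:
            driver.quit()


    if __name__ == "__main__":
        print(extract("https://angular.io/"))
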
Here, Python is opening a Firefox instance, browsing the website and looking for an <article> element. It is copying over its text into a dictionary, which gets read out in the transform step and turned into a WordCloud during load.

When dealing with JavaScript-heavy sites, it is often useful to use Waits, and perhaps even to run execute_script to defer to JavaScript if needed.

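For example, an explicit wait for the <article> element might look like this (a sketch; the ten-second timeout is an arbitrary choice):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    driver.get("https://angular.io/")

    # Wait up to ten seconds for the <article> element to be present before reading it.
    article = WebDriverWait(driver, 10).until(
        expected_conditions.presence_of_element_located((By.TAG_NAME, "article"))
    )
    print(article.text)
    driver.quit()
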
Summary

Thanks for reading this far! Let's summarise what we've learned now:

  1. How to scrape a website with Python's requests package.
  2. How to translate it into a meaningful structure using beautifulsoup.
  3. How to further process that structure into something you can work with.
  4. What to do if the target page is relying on JavaScript.
Further reading

If you want to find more about me, you can follow me on Twitter or visit my website.

I'm not the first one who wrote about Web Scraping here on freeCodeCamp. Yasoob Khalid and Dave Gray also did so in the past.

Translated from: https://www.freecodecamp.org/news/webscraping-in-python/
