Web Scraping for Web Developers: a Concise Summary

Knowing one approach to web scraping may solve your problem in the short term, but every method has its own strengths and weaknesses. Being aware of them can save you time and help you solve a task more efficiently.

Numerous resources exist that will show you a single technique for extracting data from a web page. The reality is that multiple solutions and tools can be used for that.

What are your options to programmatically extract data from a web page?

What are the pros and cons of each approach?

How can you use cloud services to increase the degree of automation?

This guide is meant to answer these questions.

I assume you have a basic understanding of browsers in general, HTTP requests, the DOM (Document Object Model), HTML, CSS selectors, and Async JavaScript.

If these phrases sound unfamiliar, I suggest checking out those topics before you continue reading. The examples are implemented in Node.js, but hopefully you can transfer the theory to other languages if needed.

Static content

HTML source

Let’s start with the simplest approach.

If you are planning to scrape a web page, this is the first method to try. It requires a negligible amount of computing power and the least time to implement.

However, it only works if the HTML source code contains the data you are targeting. To check that in Chrome, right-click the page and choose View page source. Now you should see the HTML source code.

It’s important to note here that you won’t see the same code by using Chrome’s inspect tool, because it shows the HTML structure related to the current state of the page, which is not necessarily the same as the source HTML document you can get from the server.

Once you find the data here, write a CSS selector for the wrapping element so you have a reference later on.

To implement this, you can send an HTTP GET request to the URL of the page, and you will get back the HTML source code.

In Node, you can use a tool called CheerioJS to parse this raw HTML and extract the data using a selector. The code looks something like this:

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://example.com/';
const selector = '.example';

// Download the raw HTML, parse it, and print the text content
// of the elements matching the selector.
fetch(url)
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const data = $(selector);
    console.log(data.text());
  });

Dynamic content

In many cases, you can’t access the information from the raw HTML code, because the DOM is manipulated by some JavaScript executed in the background. A typical example of that is a SPA (Single Page Application), where the HTML document contains a minimal amount of information and the JavaScript populates it at runtime.
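
To see why the first approach fails here, consider this contrived sketch of an SPA-style document (a minimal, hypothetical example): the HTML source contains only an empty container, and the data exists in the DOM only after the script has run:

<!DOCTYPE html>
<html>
  <body>
    <!-- Empty in the HTML source; scraping the source finds nothing. -->
    <div id="app"></div>
    <script>
      // At runtime, JavaScript populates the container with the data.
      document.getElementById('app').textContent = 'Data loaded at runtime';
    </script>
  </body>
</html>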

In this situation, a solution is to build the DOM and execute the scripts located in the HTML source code, just like a browser does. After that, the data can be extracted from this object with selectors.

Headless browsers

This can be achieved by using a headless browser. A headless browser is almost the same thing as the normal one you are probably using every day, but without a user interface. It runs in the background, and you can control it programmatically instead of clicking with your mouse and typing with a keyboard.

A popular choice for a headless browser is Puppeteer. It is an easy-to-use Node library which provides a high-level API to control Chrome in headless mode. It can be configured to run non-headless, which comes in handy during development. The following code does the same thing as before, but it will work with dynamic pages as well:

const puppeteer = require('puppeteer');

async function getData(url, selector) {
  // Launch a headless Chrome instance and open a new tab.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Run the extraction inside the page context.
  const data = await page.evaluate(selector => {
    return document.querySelector(selector).innerText;
  }, selector);
  await browser.close();
  return data;
}

const url = 'https://example.com';
const selector = '.example';
getData(url, selector)
  .then(result => console.log(result));

Of course, you can do more interesting things with Puppeteer, so it is worth checking out the documentation. Here is a code snippet which navigates to a URL, takes a screenshot and saves it:

const puppeteer = require('puppeteer');

async function takeScreenshot(url, path) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path });
  await browser.close();
}

const url = 'https://example.com';
const path = 'example.png';
takeScreenshot(url, path);

As you can imagine, running a browser requires much more computing power than sending a simple GET request and parsing the response. Therefore, execution is relatively costly and slow. Not only that, but including a browser as a dependency makes the deployment package massive.

On the upside, this method is highly flexible. You can use it to navigate around pages, simulate clicks, mouse moves, and keyboard events, fill out forms, take screenshots or generate PDFs of pages, execute commands in the console, and select elements to extract their text content. Basically, everything that can be done manually in a browser is possible.
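
For instance, here is a minimal sketch of simulating user input with Puppeteer and saving the result as a PDF. The URL and the #search / #submit selectors are hypothetical placeholders; replace them with the ones belonging to your target page:

const puppeteer = require('puppeteer');

async function searchAndSave(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Hypothetical selectors; use the ones from the page you are scraping.
  await page.type('#search', 'web scraping');
  // Click and wait for the resulting navigation in parallel,
  // so a fast redirect isn't missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit')
  ]);
  // Save the resulting page as a PDF (works in headless mode).
  await page.pdf({ path: 'results.pdf' });
  await browser.close();
}

searchAndSave('https://example.com');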

Building just the DOM

You may think it’s a little bit of overkill to simulate a whole browser just for building a DOM. Actually, it is, at least under certain circumstances.

There is a Node library, called Jsdom, which will parse the HTML you pass it, just like a browser does. However, it isn’t a browser, but a tool for building a DOM from a given HTML source code, while also executing the JavaScript code within that HTML.

Thanks to this abstraction, Jsdom is able to run faster than a headless browser. If it’s faster, why not use it instead of headless browsers all the time?

Quote from the documentation:

People often have trouble with asynchronous script loading when using jsdom. Many pages load scripts asynchronously, but there is no way to tell when they’re done doing so, and thus when it’s a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation. … This can be worked around by polling for the presence of a specific element.

This solution is shown in the example below. It checks every 100 ms whether the element has appeared, and times out after 2 seconds.

It also often throws nasty error messages when some browser feature used by the page is not implemented by Jsdom, such as: “Error: Not implemented: window.alert…” or “Error: Not implemented: window.scrollTo…”. This issue can also be solved with some workarounds (virtual consoles).

Generally, it’s a lower-level API than Puppeteer’s, so you need to implement certain things yourself.

These things make it a little messier to use, as you will see in the example. Puppeteer solves all of this for you behind the scenes and makes it extremely easy to use. In exchange for this extra work, Jsdom offers a fast and lean solution.

Let’s see the same example as previously, but with Jsdom:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

async function getData(url, selector, timeout) {
  // Forward page errors to the console, but suppress Jsdom's
  // "Not implemented" errors.
  const virtualConsole = new jsdom.VirtualConsole();
  virtualConsole.sendTo(console, { omitJSDOMErrors: true });
  // Fetch the page, build the DOM and execute the scripts it contains.
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously",
    resources: "usable",
    virtualConsole
  });
  // Poll every 100 ms until the element appears or the timeout is reached.
  const data = await new Promise((resolve, reject) => {
    const started = Date.now();
    const timer = setInterval(() => {
      const element = dom.window.document.querySelector(selector);
      if (element) {
        resolve(element.textContent);
        clearInterval(timer);
      } else if (Date.now() - started > timeout) {
        reject(new Error("Timed out"));
        clearInterval(timer);
      }
    }, 100);
  });
  dom.window.close();
  return data;
}

const url = "https://example.com/";
const selector = ".example";
getData(url, selector, 2000).then(result => console.log(result));

Reverse engineering

Jsdom is a fast and lightweight solution, but it’s possible to simplify things even further.

Do we even need to simulate the DOM?

Generally speaking, the webpage you want to scrape consists of the same HTML and the same JavaScript, built with technologies you already know. So, if you find the piece of code from which the targeted data was derived, you can repeat the same operation to get the same result.

If we oversimplify things, the data you’re looking for can be:

  • part of the HTML source code (as we saw in the first paragraph),
  • part of a static file, referenced in the HTML document (for example a string in a JavaScript file),
  • a response for a network request (for example some JavaScript code sent an AJAX request to a server, which responded with a JSON string).

All of these data sources can be accessed with network requests. From our perspective, it doesn’t matter if the webpage uses HTTP, WebSockets or any other communication protocol, because all of them are reproducible in theory.

Once you locate the resource housing the data, you can send a similar network request to the same server as the original page does. As a result, you get the response containing the targeted data, which can be easily extracted with regular expressions, string methods, JSON.parse, and so on.
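
As a minimal sketch, suppose you discovered that the page loads its data from a JSON endpoint. The URL and field names below are hypothetical; they depend entirely on the page you are reverse engineering:

const fetch = require('node-fetch');

// Hypothetical endpoint; the real one is whatever the page itself calls.
const apiUrl = 'https://example.com/api/items?page=1';

fetch(apiUrl)
  .then(res => res.json())
  .then(json => {
    // Extract only the fields you care about
    // ("items" and "title" are assumed names).
    const titles = json.items.map(item => item.title);
    console.log(titles);
  });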

In simple terms, you can just take the resource where the data is located, instead of processing and loading all the rest. This way, the problem shown in the previous examples can be solved with a single HTTP request instead of controlling a browser or a complex JavaScript object.

This solution seems easy in theory, but most of the time it can be really time-consuming to carry out, and it requires some experience of working with web pages and servers.

A possible place to start researching is to observe the network traffic. A great tool for that is the Network tab in Chrome DevTools. You will see all outgoing requests with their responses (including static files, AJAX requests, and so on), so you can iterate through them and look for the data.

This can be even more sluggish if the response is modified by some code before being rendered on the screen. In that case, you have to find that piece of code and understand what’s going on.

As you can see, this solution may require way more work than the methods featured so far. On the other hand, once it’s implemented, it provides the best performance.

This chart shows the required execution time and the package size of this approach compared to Jsdom and Puppeteer:

These results aren’t based on precise measurements and can vary in every situation, but they show well the approximate difference between these techniques.

Cloud service integration

Let’s say you implemented one of the solutions listed so far. One way to execute your script is to power on your computer, open a terminal and execute it manually.

This can become annoying and inefficient very quickly, so it would be better if we could just upload the script to a server and have it execute the code on a regular basis, depending on how it’s configured.

This can be done by running an actual server and configuring some rules on when to execute the script. Servers shine when you keep observing an element on a page. In other cases, a cloud function is probably a simpler way to go.

Cloud functions are basically containers intended to execute the uploaded code when a triggering event occurs. This means you don’t have to manage servers; that’s done automatically by the cloud provider of your choice.

A possible trigger can be a schedule, a network request, or one of numerous other events. You can save the collected data in a database, write it into a Google sheet, or send it in an email. It all depends on your creativity.

Popular cloud providers are Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, and all of them have a function service (AWS Lambda, Google Cloud Functions, and Azure Functions, respectively).

They offer some amount of free usage every month, which your single script probably won’t exceed except in extreme cases, but please check the pricing before use.

If you are using Puppeteer, Google’s Cloud Functions is the simplest solution. Headless Chrome’s zipped package size (~130MB) exceeds AWS Lambda’s limit on maximum zipped size (50MB). There are some techniques to make it work with Lambda, but GCP functions support headless Chrome by default; you just need to include Puppeteer as a dependency in package.json.
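
Here is a minimal sketch of what such an HTTP-triggered Cloud Function could look like. The --no-sandbox flag and other runtime details are assumptions; check Google’s current documentation before relying on them:

const puppeteer = require('puppeteer');

// HTTP-triggered Google Cloud Function (illustrative sketch).
// The --no-sandbox flag is an assumption about the environment.
exports.scrape = async (req, res) => {
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const data = await page.evaluate(
    () => document.querySelector('.example').innerText
  );
  await browser.close();
  // Return the scraped data as the HTTP response.
  res.send(data);
};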

If you want to learn more about cloud functions in general, do some research on serverless architectures. Many great guides have already been written on this topic and most providers have an easy to follow documentation.

Summary

I know that every topic was a bit compressed. You probably can’t implement every solution with just this knowledge, but with the documentation and some custom research, it shouldn’t be a problem.

Hopefully, you now have a high-level overview of the techniques used for collecting data from the web, so you can dive deeper into each topic accordingly.

Translated from: https://www.freecodecamp.org/news/web-scraping-for-web-developers-a-concise-summary-3af3d0ca4069/
