Using .htaccess to Prevent Web Scraping

Web scraping, also known as content scraping, data scraping, web harvesting, or web data extraction, is a way of extracting data from websites, typically using a program that sends a number of HTTP requests, emulates human behaviour, receives the responses, and extracts the required data from them. Modern GUI-based web scrapers like Kimono let you perform this task without any programming knowledge.

If you face the problem of others scraping content from one of your websites, there are many ways of detecting web scrapers; Google Webmaster Tools and Feedburner are two tools that can help.

In this article, we will discuss a few ways to make the lives of these scrapers difficult, using .htaccess files in Apache.

An .htaccess (hypertext access) file is a plain text configuration file for web servers that overrides the global server settings for the directory where the file is placed. It can also be used in creative ways to prevent web scraping.

Before we discuss the specific methods, let me clear up one small fact: if something is publicly available, it can be scraped. The steps we discuss here can only make things more difficult, not impossible. But what do you do if someone is smart enough to bypass all of your filters? We have a solution for that too.

Getting Started with .htaccess

Since the use of .htaccess files involves Apache checking and reading all .htaccess files on every request, it is generally turned off by default. There are different procedures for enabling it on Ubuntu, OS X and Windows. Your .htaccess files will be interpreted by Apache only after you enable them; otherwise they will simply be ignored.

Next, in most of our use cases, we will be using the RewriteEngine of Apache, which is a part of the mod_rewrite module. If necessary, you could check out a detailed guide on how to set up mod_rewrite for Apache or a general guide on .htaccess.

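As a reference, the relevant parts of Apache's main configuration look something like the sketch below. The exact file (httpd.conf or apache2.conf) and the paths shown are assumptions that vary by platform, so treat this as an illustration rather than a drop-in snippet.

# In the main Apache configuration; the module path and the document
# root /var/www/html are assumed values -- adjust them for your setup.
LoadModule rewrite_module modules/mod_rewrite.so

# Allow .htaccess files under the document root to override settings
<Directory "/var/www/html">
    AllowOverride All
</Directory>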

Once you have completed these steps, you are ready to proceed with the solutions for dealing with content scrapers discussed here. If you haven't completed either of these steps successfully, Apache will ignore your .htaccess files or raise an error when you restart it after making changes.

Prevent Hotlinking

If someone scrapes your content, all of your inline HTML remains the same. This means that the links to the images that were part of your content (and most probably hosted on your domain) remain the same. If the scraper wishes to put the content on a different website, the images would still link back to the original source. This is called hotlinking. Hotlinking costs you bandwidth, because every time someone opens the scraper's site, your images are downloaded from your server.

You can prevent hotlinking by adding the following lines to your .htaccess file.

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$

# domains that can link to your content (images here)
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mysite\.com [NC]

# show no image when hotlinked
RewriteRule \.(jpg|png|gif)$ - [NC,F,L]

# Or show an alternate image
# RewriteRule \.(jpg|png|gif)$ http://mysite.com/forbidden_image.jpg [NC,R,L]

Some notes about the code:

  • Switching on RewriteEngine gives us the ability to redirect the user’s request.

  • RewriteCond specifies which requests should be redirected. %{HTTP_REFERER} is the variable that contains the URL of the page from which the request was made.

  • Then we match it against our own domain, mysite.com. We add (www\.)? to ensure that requests from both mysite.com and www.mysite.com are allowed. Similarly, our code covers http and https.

  • Next, we check if a jpg, png, or gif file was requested, and either show an error or redirect the request to an alternate image.

  • NC makes the match case-insensitive, F returns a 403 Forbidden response, R redirects the request, and L stops processing further rewrite rules.

  • Note that you should apply only one of the rules above (either the 403 error or the alternate image). This is because as soon as L is encountered, Apache will not apply any further rules. In the code example above, the alternate image method is commented out.

How Can Web Scrapers Bypass This?

One way for a web scraper to bypass such a hurdle is to download the images as it encounters them in the HTML code. In that case, the scraper can apply a regular expression to find the image URLs, download the images, and change the image links accordingly while storing the data in its own system.

Allow or Block Requests From Specific IP Addresses

If you happen to determine the origin of the web scraper's requests (usually given away by an unnaturally high number of requests from the same IP address), you can block requests from that IP address.

Order Deny,Allow
Deny from xxx.xxx.xxx.xxx

In the code above (and in other examples in this article) you would replace xxx.xxx.xxx.xxx with the IP address you want to block. If you are really paranoid about security, you could deny requests from all IP addresses and selectively allow from a whitelist of IP addresses:

Order Deny,Allow
Deny from all
# IP address whitelist
Allow from xx.xxx.xx.xx
Allow from xx.xxx.xx.xx
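
Note that the Order, Deny and Allow directives above use the older Apache 2.2 syntax (still available in Apache 2.4 through mod_access_compat). If your server runs Apache 2.4 or later, the same access control is normally written with the Require directive from mod_authz_core. A rough equivalent, not covered in the original article:

# Apache 2.4+ equivalents of the two examples above

# Block a single IP address and allow everyone else
<RequireAll>
    Require all granted
    Require not ip xxx.xxx.xxx.xxx
</RequireAll>

# Or allow only a whitelist of IP addresses (everyone else is denied)
# Require ip xx.xxx.xx.xx
# Require ip xx.xxx.xx.xx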

One use case for this technique (not related to web scraping) is blocking access to WordPress's wp-admin directory. In such a case, you would allow requests from your own IP address only, reducing the chances of someone hacking your site via wp-admin.

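For example, a minimal sketch of such a rule, placed in an .htaccess file inside the wp-admin directory (the IP address is a placeholder you would replace with your own):

# .htaccess inside wp-admin/: allow only your own IP address
Order Deny,Allow
Deny from all
Allow from xx.xxx.xx.xx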

How Can Web Scrapers Bypass This?

If a web scraper has access to proxies, it can distribute its requests across a list of IP addresses, so that no single IP address shows abnormal activity.

To explain: Let’s say someone is scraping your site from IP address 1.1.1.1. So you block 1.1.1.1 using .htaccess. Now, if the scraper has access to a proxy server 2.2.2.2, it routes its request through 2.2.2.2, so it appears to your server that the request is coming from 2.2.2.2. So, in spite of blocking 1.1.1.1, the scraper is still able to access the resource.

Thus, if the scraper has access to thousands of these proxies, it can remain undetected by sending only a small number of requests from each one.

Redirect Requests From an IP Address

Not only can you block an IP address, you can also redirect its requests to a different page:

# Redirect all requests from a matching IP address (or address range) to the home page
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com [R,L]

If you redirect them to a static site, chances are the scraper will figure this out. However, you can go one step further and do something a bit more innovative. For that, you need to understand how your content is scraped.

Web scraping is a systematic procedure. It involves studying URL patterns and sending requests to all possible pages of a website. If you are a WordPress user, for instance, the URL pattern is http://mysite.com/?p=[page_no], where the scraper increments page_no from 1 up to a large number.

What you could do is create a page especially for redirection, which redirects the request to one of a number of predefined pages:

# Send requests from the matching IP range to a dedicated redirection page
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com/redirection_page [R,L]

In the code above, “redirection_page” is the page that performs one of the subsequent predefined redirects. As a result, a running web scraping program gets redirected through a number of pages, and it is difficult for its operator to detect that the scraper has been identified.

Alternatively, “redirection_page” can redirect to another page, “redirection_page_1”, which then redirects back to “redirection_page”. This leads to a redirect loop, and a request bounces back and forth between the two pages indefinitely.

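A minimal sketch of such a loop, using mod_alias Redirect directives in your .htaccess file (the page names follow the hypothetical “redirection_page” naming used above):

# Requests from the suspect IP range land on /redirection_page via the
# RewriteRule above; these two rules then bounce them back and forth.
Redirect 302 /redirection_page http://mysite.com/redirection_page_1
Redirect 302 /redirection_page_1 http://mysite.com/redirection_page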

How Can Web Scrapers Bypass This?

A web scraper can check whether the request was redirected. If there is a redirect, it receives a 301 or 302 HTTP status code; if there is no redirect, it receives the normal 200 status code.

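One way to blunt this particular check, not discussed in the original article, is to serve a decoy page through an internal rewrite instead of an external redirect, so the scraper still receives a 200 status code. A rough sketch, where decoy_page.html is a hypothetical page you would create:

# Serve a decoy page via an internal rewrite (no R flag), so the
# response status stays 200; decoy_page.html is a hypothetical page.
RewriteCond %{REMOTE_ADDR} xxx\.xxx\.xxx\.
RewriteCond %{REQUEST_URI} !decoy_page\.html
RewriteRule .* /decoy_page.html [L]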

Matt Cutts to the Rescue

Matt Cutts is the head of the web spam team at Google. Part of his job is to be on the constant lookout for scraper sites. If he doesn't like your website, he can make it vanish from Google's search results. The recent Panda and Penguin updates to Google's search algorithm have affected a huge number of sites, including many scraper sites.

A webmaster can report scraper sites to Google using this form, providing the source of the content. If you produce original content, you would definitely be on the radar of web scrapers. Yet, if they re-publish your content, Google will make sure that they are omitted from its search results.

Translated from: https://www.sitepoint.com/using-htaccess-prevent-web-scraping/
