5 Tips to Build a Faster Web Crawler

Scraping a large amount of data requires a very fast web scraper. If you want to scrape 10 million items and your scraper gets 50 items per minute, it will need 10,000,000 ÷ 50 = 200,000 minutes, or roughly 139 days, to finish. That’s way too long!


This guide provides a structured approach to building a super-fast web scraper.


Now let me take you from this:


[Image: Scraper with average rate lower than 50 items/min]

To this:


[Image: Scraper with average rate greater than 2,000 items/min]

1. Setup

If you’re scraping in Python and want to go fast, there is only one library to use: Scrapy. This is a fantastic web scraping framework if you’re going to do any substantial scraping. BeautifulSoup, Requests, and Selenium are just too slow for large projects. If you aren’t familiar with Scrapy, I would recommend learning it and then revisiting this article later on.


There are now two things we need to do before we start scraping:


  1. Change your user agent. Your user agent tells servers who is accessing their website. By default, Scrapy tells servers that a bot is crawling their site. If you don’t change this setting, you are going to get banned in minutes. To change it, google “User agents” and set one of them equal to the USER_AGENT variable in settings.py. It is also possible to rotate your user agent. However, I’ve found this unnecessary. If you want to do this, just google how — I believe the setup is relatively easy.


  2. Choose and set up proxies for your crawler. When choosing proxies, you should consider the pricing structure of the proxies. Do you pay per GB of bandwidth? Do you pay per proxy? Do you pay per thread? For large projects, I always pay per thread and use StormProxies. For smaller projects, I’d recommend SmartProxy. They charge by GB of bandwidth and provide unlimited threads. Next, you want to set up your proxies, which can be done by creating a new middleware in the middleware.py file as shown below. This will set the proxies for all spiders in your project. You then need to add the middleware to your settings.py file. Alternatively, you can set proxies for each spider individually. There is a short video on how to do this below:


[Image: Middleware for proxies (all spiders)]
[Video: Setting proxies for each spider individually]
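For readers who want something concrete to start from, here is a minimal sketch of what that middleware and the related settings.py entries (including the USER_AGENT variable from item 1) might look like. The proxy URL, the user agent string, and the project path "myproject" are placeholders rather than values from the original post; substitute the credentials your proxy provider gives you.

```python
# middleware.py (sketch): route every request through a single proxy endpoint.
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder proxy URL; use the endpoint supplied by your provider.
        request.meta["proxy"] = "http://user:pass@proxy.example.com:8000"


# settings.py (sketch): set a browser-like user agent and register the middleware.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
DOWNLOADER_MIDDLEWARES = {
    "myproject.middleware.ProxyMiddleware": 350,
}
```

To set a proxy for a single spider instead, the same request.meta["proxy"] assignment can live in that spider's start_requests method, or the middleware can be enabled only for that spider via its custom_settings dictionary.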

2. Optimize Your Scraping Strategy

Work smarter, not harder.


This section is about the three scraping techniques that are going to make a huge difference to your speed.


Divide and conquer

If you are using a single large spider, split it into many smaller ones. You do this so that you can make use of Scrapyd (more details in Step 4). Scrapyd allows you to run many spiders simultaneously, whereas a plain scrapy crawl command runs only one spider at a time. Each smaller spider crawls part of what the large spider crawled. These mini-spiders should not overlap in the content they crawl, as that wastes time. If you split one spider into ten smaller ones, your scraping process will be around ten times faster (provided there are no other bottlenecks; see Step 5).

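As a rough illustration of the idea (a sketch, not code from the original article), the spider below splits a hypothetical paginated crawl across several identical spiders using a shard argument; each shard claims a disjoint set of pages, so the spiders never overlap.

```python
import scrapy


class ShardedSpider(scrapy.Spider):
    """One of N small spiders that together cover the same pages as one big spider."""
    name = "sharded"

    def __init__(self, shard=0, num_shards=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.shard = int(shard)
        self.num_shards = int(num_shards)

    def start_requests(self):
        # Hypothetical paginated results URL; each shard takes every Nth page.
        for page in range(1, 10001):
            if page % self.num_shards == self.shard:
                yield scrapy.Request(f"https://example.com/results?page={page}")

    def parse(self, response):
        # Extract items here.
        ...
```

Launching ten copies of this spider (shard=0 through shard=9) through Scrapyd is, in effect, splitting one large spider into ten non-overlapping smaller ones.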

Minimize the number of requests sent

Sending requests and waiting for responses is the slowest part of using a scraper. If you can reduce the number of requests sent, your scraper will be much faster. For example, if you are scraping prices and titles from an e-commerce site, then you don’t need to visit each item’s page. You can get all the data you need from the results page. If you have 30 items per page, then using this technique will make your scraper 30 times faster (it only has to send one request now instead of 30). Always be on the lookout for ways to reduce your number of requests. Below is a list of things you can try. If you can think of any others, please leave a comment.


Common ways to reduce requests:


  • Increase the number of results on the results page (e.g. from ten to 100).
  • Apply filters before scraping (e.g. price filters).
  • Use a general spider, not a CrawlSpider.
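To make the results-page technique above concrete, here is a small sketch of a spider that pulls every item from a listing page and then follows only the pagination link. The URL and CSS selectors are hypothetical and will differ on any real site.

```python
import scrapy


class ListingSpider(scrapy.Spider):
    name = "listing"
    # Hypothetical search URL that returns 100 results per page.
    start_urls = ["https://example.com/search?per_page=100"]

    def parse(self, response):
        # One request yields every item on the page; no per-item page visits.
        for card in response.css("div.product-card"):
            yield {
                "title": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Follow the pagination link only.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```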

Upload items to the database in batches

Another cause of a slow scraper is that people tend to scrape their data and then immediately add that data to their database. This is slow for two reasons. Firstly, processing in batches is always going to be faster than adding item by item. Secondly, with batching, you can make use of the many tools Python has to offer for batch uploading to databases. For example, the pandas library can be used to put your data into a dataframe and then upload that data to a SQL database. That is much faster! If you are interested in learning more, I highly recommend you read this article on batch uploading to SQL databases.

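As one possible implementation (a sketch, not the article's own code), the item pipeline below buffers scraped items and flushes them to a SQL database in batches using pandas. The database URL, table name, and batch size are placeholders to adapt to your project.

```python
import pandas as pd
from sqlalchemy import create_engine


class BatchSQLPipeline:
    def open_spider(self, spider):
        self.engine = create_engine("sqlite:///scraped.db")  # placeholder DB URL
        self.buffer = []
        self.batch_size = 1000

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # One bulk insert per batch instead of one INSERT per item.
        if self.buffer:
            pd.DataFrame(self.buffer).to_sql(
                "items", self.engine, if_exists="append", index=False
            )
            self.buffer = []

    def close_spider(self, spider):
        # Write whatever is still buffered when the spider finishes.
        self.flush()
```

Register it under ITEM_PIPELINES in settings.py like any other pipeline.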

3. Settings

1. CONCURRENT_REQUESTS

“The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader.” — Scrapy’s documentation


This is the number of simultaneous requests that your spider will send. You will want to experiment a little with different values and see which gives you the best scrape rate. A good place to start is 50. If you get a lot of timeout errors, then you have set this too high. Reduce by 10% and try again.


2. DOWNLOAD_TIMEOUT

“The amount of time (in secs) that the downloader will wait before timing out.” — Scrapy’s docs


This is how long the spider will wait for the response after sending a request before retrying. Set this too low and you will get endless timeout errors. Set this too high and your spider will be waiting around instead of retrying a request, wasting time and slowing you down. Start at 100 seconds and experiment to find the optimal value.


3. DOWNLOAD_DELAY

“The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website.” — Scrapy’s docs


This is how long your spider will wait between downloading responses. For maximum speed, set this to zero. If you get response codes 400, 403, or 502 consistently, then you are scraping too fast. Increase the download delay slightly and try again (a good starting point is 0.5).

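Pulling the three settings together, a settings.py might start from the values suggested above; these are starting points to tune for each site, not universal constants.

```python
# settings.py (sketch)
CONCURRENT_REQUESTS = 50   # reduce by ~10% if you see many timeout errors
DOWNLOAD_TIMEOUT = 100     # seconds before a request times out; experiment from here
DOWNLOAD_DELAY = 0         # raise toward 0.5 if you keep getting 400/403/502 responses
```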

4. Scrapyd

According to its documentation, Scrapyd is an application for deploying and running Scrapy spiders.


Scrapyd allows you to run multiple spiders simultaneously. This will enable us to improve the overall speed of the scraping process significantly. If you want spider deployment that’s free and easy to set up, use Scrapyd. The Scrapy Cluster docs include a number of alternatives, but I would still recommend Scrapyd.


To take advantage of Scrapyd:


  1. If your project contains one or more large spiders, split them up (as mentioned in Step 2 of this guide).
  2. Small spiders that don’t need to be split up can be run as they are.

The setup of Scrapyd can appear intimidating if you only read the docs, but in practice it is straightforward to understand and implement.


Note: It is possible to run all the spiders in a project with a single command. It takes a small amount of work to set up, but for projects with ten or more spiders, I’d recommend doing it. Learn how on Stack Overflow.

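As a sketch of what that can look like (assuming Scrapyd is running on its default port 6800 and a project, here called "myproject", has already been deployed with scrapyd-deploy), the script below asks Scrapyd for every spider in the project and schedules them all at once:

```python
import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"  # placeholder project name

# listspiders.json and schedule.json are standard Scrapyd API endpoints.
spiders = requests.get(
    f"{SCRAPYD}/listspiders.json", params={"project": PROJECT}
).json()["spiders"]

for spider in spiders:
    # Each call creates a separate job, so the spiders run simultaneously.
    resp = requests.post(
        f"{SCRAPYD}/schedule.json", data={"project": PROJECT, "spider": spider}
    )
    print(spider, resp.json().get("jobid"))
```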

5. Find and Remove Bottlenecks

You have followed this guide and your scraper is running at a respectable speed. There is now one final step to take your crawler from respectable to lightning-fast: Dealing with bottlenecks!



What is a bottleneck?


It is the limiting factor for the speed of your process. If addressed, it will give your process a significant speed boost up to the next bottleneck.


Dealing with bottlenecks is an iterative process that goes like this:


  1. You have a bottleneck slowing down your process.
  2. You find out what the bottleneck is.
  3. You address the bottleneck, and your process becomes faster.
  4. You have a new bottleneck.

This is the process you are going to be repeating and repeating (and repeating) until you’ve squeezed every last bit of speed from your scraper.


Below is a table containing some common scraping bottlenecks.


[Image: Table of common scraping bottlenecks and solutions (by the author)]

Conclusion

Well done. You’ve learned the ins and outs of building a rapid web scraper in Python. I hope you found this article useful and would love to hear any ideas you have. What projects are you working on? What do you like about coding/scraping? What’s your highest ever items/min rate?


Thanks for reading. As always, if you have any questions, just leave a comment.


Translated from: https://medium.com/better-programming/5-tips-to-build-a-faster-web-crawler-f2bbc90cf233
