
by Marcello Lins


How I built a serverless web crawler to mine Vancouver real estate data at scale

I recently moved from Rio de Janeiro, Brazil to Vancouver, Canada. The first thing that hits you right in the face, aside from the beautiful scenery, are the rental prices. Vancouver is currently ranked among the top 5 most expensive cities to live in the world. The rental price of a property is indicative of how expensive it is to actually own and mortgage that same property.


I decided to start a side project that could mine a decent number of housing listings and crunch the data. I wanted to come up with my own conclusions about the current real estate market in Vancouver. There’s a bunch of well-formatted data living on these listing websites, so why not go ahead and grab it? This is how this project was born.


This article will walk you through the architecture, costs, pros and cons, and more about the first crawler I’ve built using no servers at all. It lives 100% on the cloud, using only AWS (Amazon Web Services).


Wait, did you say “No Server”?

Sure enough, everything you run on the cloud is backed by servers at the end of the day. What I mean by serverless is that you won’t have to actually maintain any server or virtual machine yourself.


The trick is to build your architecture around cloud-native services such as AWS Lambda, DynamoDB, RDS MySQL and CloudWatch, then make them work together in a clever way.


Shall we start?


Project Architecture

In case you’re not familiar with these services, I will summarize them for you:


  • AWS Lambda: short-lived functions that run on the cloud.

    Whenever these are invoked or triggered, they spin up, run the code you wrote, and shut down as soon as they are done running. You only pay for the seconds each function actually spends doing something.

  • DynamoDB: fully managed NoSQL database on the cloud.

    You can feed it JSON records and they will be stored on servers you won’t have to maintain. You can scale your read and write throughput in seconds. As of early 2017, it supports a TTL (Time To Live) mechanism, which deletes your objects automatically once they reach their TTL.

  • RDS MySQL: fully managed RDS (Relational Database Service) MySQL database on the cloud.

    Scale up or down and take backups as you wish. AWS recently announced a Start and Stop feature, which lets you keep your instance stopped for up to 7 days in a row. While stopped, you pay only for the instance’s storage volume, not for its instance hours.

  • CloudWatch: monitors and logs your resources on the cloud.

    You get this for free, since every log message from Python on Lambda goes straight to a CloudWatch stream.
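To make these pieces concrete, here is a minimal sketch of what a Lambda entry point looks like in Python. The event shape below is hypothetical (in this project the real payloads come from DynamoDB streams), but the logging behaviour is real: anything printed inside the handler lands in a CloudWatch log stream automatically.

```python
def handler(event, context):
    # Lambda calls this function once per invocation. `event` carries the
    # trigger payload; `context` holds runtime metadata such as the
    # remaining execution time.
    listing_id = event.get("listing_id", "unknown")  # hypothetical field

    # print() output is shipped to CloudWatch for free.
    print(f"processing listing {listing_id}")
    return {"ok": True, "listing_id": listing_id}
```

The returned dict is what you would see in the Lambda console’s test output.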

Project Goals

When starting this project I had a few goals in mind. Then I started improvising as I went along. The ideal project to me would have to:


  • Be fully managed by AWS on the cloud and require no server
  • Be elastic, scaling up and down according to load
  • Be capable of processing tens of thousands of listings to start with
  • Be inexpensive

Costs Breakdown

You can rely on Lambda and CloudWatch for this project. They are free unless you’re running this constantly and non-stop. Then the bill will come.


For the storage layers of DynamoDB and RDS MySQL you will be paying under 3 bucks a month. You can stop your RDS database for up to 7 days in a row. And you can scale your DynamoDB tables down to 1 read + 1 write units when you’re not using them.


This brings your total costs to an estimate of $2.40 a month. Check my documentation for a more detailed breakdown.


The Journey

From start to finish, the whole project took me about 19 hours of work. Your mileage may vary according to your previous knowledge of AWS and Python. I was familiar with both, but not with the Dynamo and Lambda services.


The setup of Lambda functions takes time to get used to. It’s definitely sub-par compared with other AWS services when it comes to usability and metrics.


Once you get used to the whole Lambda development dance (edit Python files locally -> create a .zip package -> upload it to replace your Lambda function -> Save and Test), it gets better.


The integration with CloudWatch is definitely a plus. It is free, and comes in handy when you’re trying to understand why your lambda failed after that HTTP Request, or during that other loop you forgot to indent.


Making use of environment variables, adjusting function resources and timeouts, and enabling and disabling triggers for testing all work smoothly and blend in really well, without requiring you to redeploy your functions. Also, I’ve noticed that the spin-up of the Lambda functions is fast, with almost no noticeable delay. I assume they’re using some sort of smart-cached ECS under the hood, but I wouldn’t know.


Setting up DynamoDB tables couldn’t be easier. We’re talking about a one-prompt setup, where you only have to fill in 2 boxes: your table name and the partition key for your table. Configuring TTL for each table works fine, but you can’t enable and disable it repeatedly: AWS prevents you from toggling it on and off often, since TTL deletes your records without charging you for those operations.

Inserting DynamoDB records manually into each table for testing purposes works perfectly. Each insert or batch triggers the lambda functions with little to no delay. Tweaking each table’s capacity up and down with read and write units is a breeze, and the new configuration applies within a few seconds.

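For instance, a manual test insert with a TTL attribute might look like the sketch below. `put_item(Item=...)` is the real boto3 Table call, but the `table` object is injected here so the sketch runs without AWS, and the attribute name `ttl` is an assumption; it has to match whatever you configured as the table’s TTL attribute.

```python
import time

def put_listing(table, listing, ttl_hours=6):
    # Attach an epoch-seconds TTL attribute so the DynamoDB TTL collector
    # deletes the record automatically after the capture window.
    item = dict(listing)
    item["ttl"] = int(time.time()) + ttl_hours * 3600

    # `table` is expected to behave like a boto3 DynamoDB Table resource,
    # i.e. expose put_item(Item=...).
    table.put_item(Item=item)
    return item
```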

Getting to configure RDS MySQL is definitely easier than Lambda, but it has more steps than DynamoDB. You also get more options: you can pick instance type, volume size and type, redundancy, maintenance windows, and backup retention periods. Once you set it up, you’ll have your shiny MySQL instance ready to rock in about 10 minutes.


After the setup and test phase ended, I had a moment of contemplation as the listings were making their way into MySQL. I could sit back, relax and have a beer while the capture was happening. Or 3 beers. Take a nap? This thing is slow!


Rough Edges

Performance was never my goal. Tinkering with the technologies available and building something cool was. But I didn’t expect it to be this slow. In the end, it was able to capture around 11,000 listings every 6 hours, which translates to about one listing every ~2 seconds. I’ve written distributed crawlers with rates easily thirty-fold faster than this. They may not have been as exciting, though.


Each HTTP request for a page takes between 0.7 and 1.1 seconds to return on average. Factor in the time it takes to spin up each lambda container, plus connecting to MySQL across the wire and inserting each record, and you’re at about 2 seconds per listing. Each lambda receives a batch, or stream, of 5 DynamoDB records, and the average life-span of each parsing lambda was about 7 seconds.


A few possible optimizations would be to perform the HTTP requests for each batch in parallel and to do batch inserts into MySQL.

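The first of those optimizations can be sketched in a few lines. This is a minimal version assuming the handler already has the batch of listing URLs; the `fetch` callable is injected (a real one would wrap `urllib.request.urlopen`), so the pattern itself is independent of the network.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_batch(urls, fetch):
    # Run the per-URL fetch concurrently instead of sequentially.
    # With a batch of 5 and ~1 second per request, this turns ~5 seconds
    # of serial waiting into roughly the latency of the slowest request.
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        # map() preserves input order, so results line up with urls.
        return list(pool.map(fetch, urls))
```

On the MySQL side, the equivalent idea is one `executemany` insert per batch instead of one `execute` per record.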

Speaking of parallelism, the coldest bucket of water for me was the fact that Lambda will not scale horizontally very well. In my head, every stream inserted into Dynamo would immediately trigger one lambda function to process it. This meant that Lambda would always be catching up with the pace of inserts on Dynamo. So I would have tens of Lambda functions running at any given time, all in parallel, and beautiful. I was wrong.


What actually happens is that Lambda has a limit of concurrent executions that’s tied to how many shards the DynamoDB table has. Since my table had only one shard, there was only one Lambda function running at all times. So even though the inserts into one of the DynamoDB tables took a couple of minutes, the second layer of Lambda was being triggered slowly, one function after the other. There was an internal queue storing my Dynamo streams and feeding them to Lambda, serializing my execution instead of parallelizing it.


Every change to a DynamoDB table’s content will trigger the Lambda functions set to fire on it. The catch is that these changes may not only be inserts, but also updates, and some deletes triggered when the TTL collector kicks in and starts wiping your set-to-expire records. Luckily, each DynamoDB stream contains, for each record in the stream, an attribute that you can use to tell whether that object was inserted, updated or deleted. I was receiving everything, because there’s no way to configure Lambda otherwise, but only processing the inserts.

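That filtering amounts to a few lines at the top of the handler. `eventName`, `dynamodb` and `NewImage` are the actual fields DynamoDB places on each stream record; the MySQL step is elided here.

```python
def handler(event, context):
    # DynamoDB streams deliver INSERT, MODIFY and REMOVE events alike.
    # Keep only the freshly inserted records; TTL deletions arrive as
    # REMOVE events and would otherwise be re-processed.
    inserted = [
        record["dynamodb"]["NewImage"]
        for record in event.get("Records", [])
        if record.get("eventName") == "INSERT"
    ]
    # ... parse each NewImage and insert it into MySQL here ...
    return {"processed": len(inserted)}
```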

Pros and Cons

Pros:


  • Cheap
  • Fully managed / serverless
  • Bleeding-edge technology
  • Flexible infrastructure
  • If you find a bug, you can change your lambdas immediately to fix every following batch

Cons:


  • Slow
  • Once it starts, you can’t pause it and restart from where it left off
  • Only so much can be tweaked (code-wise changes only)
  • Testing specific parts requires you to constantly disable and enable Lambda triggers

Final Verdict

Despite the initial appeal, I wouldn’t recommend this architecture for anything that requires performance, or the flexibility to change the architecture easily and tweak more than just the code that’s running. But this setup is cheap, and for something small it works fine. It may not be the easiest to set up, but once you’re past that part, the maintenance is roughly zero.


I’ve had fun writing this and gluing all these pieces together to build this small Frankenstein, and I would do it again. It still checked all the boxes of my initial goals for this project, but yes, performance could be better.


In the end, I’ve managed to download data for over 40k listings by running this process a few times. With that in hand, I plan on writing the code to crunch this data, but, for now, it’s still a WIP.


I can only thank you if you’ve made it this far. I’ve put together a guide on how to set up your own AWS account. Since the code is open-source anyway, go hack it!


The code is open on GitHub if you want to check it out. The original article was posted on my blog. Stop by if you want to see what else I’m working on.


Feel free to reach out to me through any contact at my personal page, in case you have any questions or simply want to chat.


See you on the next one :)


Translated from: https://www.freecodecamp.org/news/how-to-build-a-scalable-crawler-on-the-cloud-that-can-mine-thousands-of-data-points-costing-less-a9825331eef5/
