Design web crawler
This is not a user-facing application, so the design will follow:
functional requirements -> non-functional requirements -> core entities -> API interface & data flow -> high-level design -> deep dive
Functional requirements
- Given a seed URL, crawl the web pages reachable from it
- Extract the text from each crawled page for future LLM analysis
Non-functional requirements
- Availability and fault tolerance for crawling (e.g. a webpage is down, a crawling service instance dies)
- Politeness: websites publish a robots.txt that limits how they may be crawled, and we must respect it
- Scalability: one full crawl round of 10B pages within 5 days
Core entities
- URL
- URL metadata
- text data
API Interface
Input: a seed URL
Output: crawled results stored in the DB for LLM analysis
Data Flow
- A seed URL enters the frontier, and we resolve its IP address via DNS
- We fetch the URL to download the webpage and store the raw HTML (blob storage; see the S3 deep dive below)
- We extract the text from the webpage
- We extract new URLs from the webpage and push them back into the frontier (a minimal sketch of this loop follows)
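A minimal single-threaded sketch of this loop, assuming `requests` and `BeautifulSoup` are available; an in-memory deque stands in for the distributed frontier queue used in the real design:

```python
# Sketch of the data flow: fetch page -> extract text -> extract URLs -> re-enqueue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 100) -> dict:
    frontier = deque([seed_url])   # stand-in for the URL frontier queue
    seen = {seed_url}              # URLs already enqueued
    pages = {}                     # url -> extracted text
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)   # DNS resolution + fetch
            resp.raise_for_status()
        except requests.RequestException:
            continue   # in the real design the queue's retry logic handles this
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)  # text for LLM analysis
        for a in soup.find_all("a", href=True):                # extract new URLs
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```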
Raw high-level design
Deep dive: high-level design
How do we satisfy the non-functional requirements?
Fault tolerance
- If a webpage goes down, the queue needs a retry mechanism, and it should still perform well.
Why use Amazon SQS?
SQS is a managed queue with configurable retry logic; retries can be spaced with exponential backoff, e.g. starting at 30s-1min and growing to 15min. If all retries fail, the message falls into a DLQ (dead-letter queue) for future analysis.
With Kafka you would likely have to write that relatively complex retry logic yourself.
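A sketch of wiring up the DLQ with boto3; the queue names and the maxReceiveCount retry budget of 5 are illustrative assumptions:

```python
# Sketch: create a frontier queue whose repeatedly-failed messages fall into a DLQ.
import json
import boto3

sqs = boto3.client("sqs")

dlq = sqs.create_queue(QueueName="crawl-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After maxReceiveCount failed receives, SQS moves the message to the DLQ.
sqs.create_queue(
    QueueName="crawl-frontier",
    Attributes={
        "VisibilityTimeout": "60",  # seconds a message stays invisible per receive
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```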
Some deep-dive notes on SQS:
When SQS delivers a message to one consumer, it becomes invisible to all other consumers (the visibility timeout). Once the message is processed, it is deleted; if processing does not complete in time, the message becomes visible again and is delivered to the next consumer.
Conclusion: use SQS.
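A sketch of a consumer that relies on this visibility behavior; extending the visibility timeout after each failure approximates exponential backoff. The queue name and the `fetch_and_store` crawl step are assumptions:

```python
# Sketch: consume URLs from the frontier; delete on success, and on failure
# push the next retry further out by raising the visibility timeout.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="crawl-frontier")["QueueUrl"]

def fetch_and_store(url: str) -> None:
    """Hypothetical crawl step: fetch the page, store HTML and text."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        try:
            fetch_and_store(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
        except Exception:
            attempt = int(msg["Attributes"]["ApproximateReceiveCount"])
            # Roughly double the delay each attempt, capped at 15 minutes.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=min(30 * 2 ** attempt, 900),
            )
```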
Why use S3 instead of storing the HTML directly in the DB?
HTML pages are fairly large, so blob storage is a better fit; the DB should only hold metadata and a pointer to the object.
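A sketch of that split, assuming a bucket named `crawl-pages` and a dict standing in for the URL metadata table; the DB row keeps only the S3 key:

```python
# Sketch: raw HTML goes to blob storage; the DB keeps metadata + a pointer.
import hashlib
import boto3

s3 = boto3.client("s3")
url_metadata = {}  # stand-in for the URL metadata table

def store_page(url: str, html: str) -> None:
    key = hashlib.sha256(url.encode()).hexdigest() + ".html"
    s3.put_object(Bucket="crawl-pages", Key=key, Body=html.encode("utf-8"))
    url_metadata[url] = {"s3_key": key, "size_bytes": len(html)}
```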
Politeness
Websites publish a robots.txt file.
robots.txt may include a Crawl-delay directive, and we need to respect that delay.
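Python's standard library can already parse robots.txt, including Crawl-delay; a minimal sketch, with an illustrative user agent and URLs:

```python
# Sketch: check robots.txt rules and crawl-delay before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "http://example.com/some/page"):
    delay = rp.crawl_delay("MyCrawler") or 1.0  # fall back to an assumed 1s
    # ...wait `delay` seconds between requests to this domain, then fetch
```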
To obey these rules we also need rate limiting. We can use Redis to store the number of requests sent by each crawler server, using a sliding-window algorithm to bound the requests sent per domain, as sketched below.
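A sliding-window sketch using one Redis sorted set per domain; the limit and window length are assumed parameters:

```python
# Sketch: sliding-window rate limit per domain, shared by all crawler
# servers through Redis.
import time
import redis

r = redis.Redis()

def allow_request(domain: str, limit: int = 10, window_s: int = 60) -> bool:
    key = f"rate:{domain}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop requests outside window
    pipe.zadd(key, {str(now): now})                # record this request
    pipe.zcard(key)                                # count requests in window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    return count <= limit
```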
Scalability
Some back-of-the-envelope math (partly omitted):
Assume one machine has 400 Gbps of network bandwidth:
400 Gbps / 8 bits/byte / 2 MB/page = 25,000 pages/second
Assume only 30% of that bandwidth is usable for downloading:
25,000 * 0.3 = 7,500 pages/second
10B pages / 7,500 pages/second ≈ 1.33M seconds ≈ 15.4 days on one machine
So we need roughly 4 machines to meet the 5-day requirement (15.4 / 5 ≈ 3.1, rounded up).
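The same estimate as executable arithmetic:

```python
# Back-of-the-envelope check of the numbers above.
bandwidth_bytes_s = 400e9 / 8                       # 400 Gbps -> 50 GB/s
pages_s = bandwidth_bytes_s / 2e6                   # 2 MB/page -> 25,000 pages/s
usable_pages_s = pages_s * 0.3                      # 30% utilization -> 7,500 pages/s
days_one_machine = 10e9 / usable_pages_s / 86400    # ~15.4 days
machines_needed = days_one_machine / 5              # ~3.1 -> round up to 4
print(days_one_machine, machines_needed)
```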
DNS constraints
Multiple DNS providers: We can use multiple DNS providers and round-robin between them. This can help distribute the load across multiple providers and reduce the risk of hitting rate limits.
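A sketch of round-robin resolution across providers, assuming dnspython; the provider IPs are public resolvers used purely for illustration:

```python
# Sketch: rotate DNS queries across multiple providers to spread load
# and avoid per-provider rate limits.
import itertools
import dns.resolver  # dnspython

providers = itertools.cycle(["8.8.8.8", "1.1.1.1", "9.9.9.9"])

def resolve(hostname: str) -> str:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [next(providers)]  # round-robin provider pick
    answer = resolver.resolve(hostname, "A")
    return answer[0].to_text()
```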
Efficiency of crawling
For example, http://example.com and http://www.example.com should not both be crawled. One approach is to add a hash index in the DB over page content: each time we fetch a page, we look its content hash up in the DB to see whether it has already been visited, and skip it if so.
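A sketch of the content-hash check, with a set standing in for the hash-indexed DB column:

```python
# Sketch: skip pages whose content hash has already been seen.
import hashlib

seen_hashes = set()

def is_duplicate(html: str) -> bool:
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True   # already crawled (e.g. example.com vs www.example.com)
    seen_hashes.add(digest)
    return False
```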
Corner case
A page's links can include the page itself (or form a cycle), so the DB needs a depth field per URL to keep it from being re-enqueued into the queue forever.
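A sketch of the depth guard at enqueue time; MAX_DEPTH is an assumed cutoff:

```python
# Sketch: carry a depth counter with each URL so self-links and cycles
# cannot loop forever.
MAX_DEPTH = 20

def enqueue(frontier, url: str, depth: int) -> None:
    if depth > MAX_DEPTH:
        return  # drop: too deep, likely a cycle
    frontier.append((url, depth))

# When extracting links from a page fetched at `depth`,
# enqueue each child URL with depth + 1.
```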