Design web crawler
This is not a user-facing application, so the design will follow:
functional requirements -> non-functional requirements -> core entities -> API interface & data flow -> high-level design -> deep dive
Functional requirements
- Given a seed URL, crawl the web pages reachable from it
- Extract the text from each crawled page for future LLM analysis
Non-functional requirements
- Availability and fault tolerance for crawling (e.g. a webpage is down, a crawling service instance dies)
- Politeness: websites publish a robots.txt that limits how they may be crawled, and we must respect it
- Scalability: one full crawl round of 10B pages within 5 days
Core entities
- URL
- URL metadata
- text data
API Interface
Input: a seed URL
Output: crawled results stored in the DB for LLM analysis
Data Flow
- A seed URL enters the frontier, and we resolve its IP address via DNS
- We fetch the URL to download the webpage and store the raw HTML (blob storage; see the S3 deep dive below)
- We extract the text from the webpage
- We extract new URLs from the webpage and push them back into the frontier (a minimal sketch of this loop follows)
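A minimal single-threaded sketch of this loop, assuming `requests` and `BeautifulSoup` are available; an in-memory deque stands in for the distributed frontier queue used in the real design:

```python
# Sketch of the data flow: fetch page -> extract text -> extract URLs -> re-enqueue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 100) -> dict:
    frontier = deque([seed_url])   # stand-in for the URL frontier queue
    seen = {seed_url}              # URLs already enqueued
    pages = {}                     # url -> extracted text
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)   # DNS resolution + fetch
            resp.raise_for_status()
        except requests.RequestException:
            continue   # in the real design the queue's retry logic handles this
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)  # text for LLM analysis
        for a in soup.find_all("a", href=True):                # extract new URLs
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```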
Raw high-level design
Deep dive: high-level design
How do we satisfy the non-functional requirements?
Fault tolerance
- If a webpage goes down, the queue needs a retry mechanism, and it should still perform well.
Why use Amazon SQS?
SQS is a managed queue with configurable retry logic; retries can be spaced with exponential backoff, e.g. starting at 30s-1min and growing to 15min. If all retries fail, the message falls into a DLQ (dead-letter queue) for future analysis.
With Kafka you would likely have to write that relatively complex retry logic yourself.
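A sketch of wiring up the DLQ with boto3; the queue names and the maxReceiveCount retry budget of 5 are illustrative assumptions:

```python
# Sketch: create a frontier queue whose repeatedly-failed messages fall into a DLQ.
import json
import boto3

sqs = boto3.client("sqs")

dlq = sqs.create_queue(QueueName="crawl-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After maxReceiveCount failed receives, SQS moves the message to the DLQ.
sqs.create_queue(
    QueueName="crawl-frontier",
    Attributes={
        "VisibilityTimeout": "60",  # seconds a message stays invisible per receive
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```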
Some deep-dive notes on SQS:
When SQS delivers a message to one consumer, it becomes invisible to all other consumers (the visibility timeout). Once the message is processed, it is deleted; if processing does not complete in time, the message becomes visible again and is delivered to the next consumer.
Conclusion: use SQS.
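A sketch of a consumer that relies on this visibility behavior; extending the visibility timeout after each failure approximates exponential backoff. The queue name and the `fetch_and_store` crawl step are assumptions:

```python
# Sketch: consume URLs from the frontier; delete on success, and on failure
# push the next retry further out by raising the visibility timeout.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="crawl-frontier")["QueueUrl"]

def fetch_and_store(url: str) -> None:
    """Hypothetical crawl step: fetch the page, store HTML and text."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
        AttributeNames=["ApproximateReceiveCount"],
    )
    for msg in resp.get("Messages", []):
        try:
            fetch_and_store(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
        except Exception:
            attempt = int(msg["Attributes"]["ApproximateReceiveCount"])
            # Roughly double the delay each attempt, capped at 15 minutes.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=min(30 * 2 ** attempt, 900),
            )
```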
Why use S3 instead of storing the HTML directly in the DB?
HTML pages are fairly large, so blob storage is a better fit; the DB should only hold metadata and a pointer to the object.
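A sketch of that split, assuming a bucket named `crawl-pages` and a dict standing in for the URL metadata table; the DB row keeps only the S3 key:

```python
# Sketch: raw HTML goes to blob storage; the DB keeps metadata + a pointer.
import hashlib
import boto3

s3 = boto3.client("s3")
url_metadata = {}  # stand-in for the URL metadata table

def store_page(url: str, html: str) -> None:
    key = hashlib.sha256(url.encode()).hexdigest() + ".html"
    s3.put_object(Bucket="crawl-pages", Key=key, Body=html.encode("utf-8"))
    url_metadata[url] = {"s3_key": key, "size_bytes": len(html)}
```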
Politeness
Websites publish a robots.txt file.
robots.txt may include a Crawl-delay directive, and we need to respect that delay.
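Python's standard library can already parse robots.txt, including Crawl-delay; a minimal sketch, with an illustrative user agent and URLs:

```python
# Sketch: check robots.txt rules and crawl-delay before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "http://example.com/some/page"):
    delay = rp.crawl_delay("MyCrawler") or 1.0  # fall back to an assumed 1s
    # ...wait `delay` seconds between requests to this domain, then fetch
```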
To obey these rules we also need rate limiting. We can use Redis to store the number of requests sent by each crawler server, using a sliding-window algorithm to bound the requests sent per domain, as sketched below.
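A sliding-window sketch using one Redis sorted set per domain; the limit and window length are assumed parameters:

```python
# Sketch: sliding-window rate limit per domain, shared by all crawler
# servers through Redis.
import time
import redis

r = redis.Redis()

def allow_request(domain: str, limit: int = 10, window_s: int = 60) -> bool:
    key = f"rate:{domain}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop requests outside window
    pipe.zadd(key, {str(now): now})                # record this request
    pipe.zcard(key)                                # count requests in window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    return count <= limit
```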
Scalability
Some back-of-the-envelope math (partly omitted):
Assume one machine has 400 Gbps of network bandwidth:
400 Gbps / 8 bits/byte / 2 MB/page = 25,000 pages/second
Assume only 30% of that bandwidth is usable for downloading:
25,000 * 0.3 = 7,500 pages/second
10B pages / 7,500 pages/second ≈ 1.33M seconds ≈ 15.4 days on one machine
So we need roughly 4 machines to meet the 5-day requirement (15.4 / 5 ≈ 3.1, rounded up).
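The same estimate as executable arithmetic:

```python
# Back-of-the-envelope check of the numbers above.
bandwidth_bytes_s = 400e9 / 8                       # 400 Gbps -> 50 GB/s
pages_s = bandwidth_bytes_s / 2e6                   # 2 MB/page -> 25,000 pages/s
usable_pages_s = pages_s * 0.3                      # 30% utilization -> 7,500 pages/s
days_one_machine = 10e9 / usable_pages_s / 86400    # ~15.4 days
machines_needed = days_one_machine / 5              # ~3.1 -> round up to 4
print(days_one_machine, machines_needed)
```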
DNS constraints
Multiple DNS providers: We can use multiple DNS providers and round-robin between them. This can help distribute the load across multiple providers and reduce the risk of hitting rate limits.
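A sketch of round-robin resolution across providers, assuming dnspython; the provider IPs are public resolvers used purely for illustration:

```python
# Sketch: rotate DNS queries across multiple providers to spread load
# and avoid per-provider rate limits.
import itertools
import dns.resolver  # dnspython

providers = itertools.cycle(["8.8.8.8", "1.1.1.1", "9.9.9.9"])

def resolve(hostname: str) -> str:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [next(providers)]  # round-robin provider pick
    answer = resolver.resolve(hostname, "A")
    return answer[0].to_text()
```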
Efficiency of crawling
For example, http://example.com and http://www.example.com should not both be crawled. One approach is to add a hash index in the DB over page content: each time we fetch a page, we look its content hash up in the DB to see whether it has already been visited, and skip it if so.
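A sketch of the content-hash check, with a set standing in for the hash-indexed DB column:

```python
# Sketch: skip pages whose content hash has already been seen.
import hashlib

seen_hashes = set()

def is_duplicate(html: str) -> bool:
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True   # already crawled (e.g. example.com vs www.example.com)
    seen_hashes.add(digest)
    return False
```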
Corner case
A page's links can include the page itself (or form a cycle), so the DB needs a depth field per URL to keep it from being re-enqueued into the queue forever.
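A sketch of the depth guard at enqueue time; MAX_DEPTH is an assumed cutoff:

```python
# Sketch: carry a depth counter with each URL so self-links and cycles
# cannot loop forever.
MAX_DEPTH = 20

def enqueue(frontier, url: str, depth: int) -> None:
    if depth > MAX_DEPTH:
        return  # drop: too deep, likely a cycle
    frontier.append((url, depth))

# When extracting links from a page fetched at `depth`,
# enqueue each child URL with depth + 1.
```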