Firecrawl Source Code: A Walkthrough of the getCachedDocuments Method and Its Practical Use

Latest recommended article published on 2024-09-21 22:20:35

黑金IT

Tags: node.js, big data

Article link: https://blog.csdn.net/ylong52/article/details/141225624

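"A deep dive into the Firecrawl source: building an efficient web scraping tool." This article examines the getCachedDocuments method from the Firecrawl source. Firecrawl scrapes pages in several modes (single URL, sitemap, and crawl); getCachedDocuments is the step that reads previously scraped documents back from the cache so they do not have to be fetched again. We go through its logic step by step, then cover what it is for and what to watch out for when using it.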

  async getCachedDocuments(urls: string[]): Promise<Document[]> {
    let documents: Document[] = [];
    for (const url of urls) {
      const normalizedUrl = this.normalizeUrl(url);
      Logger.debug(
        "Getting cached document for web-scraper-cache:" + normalizedUrl
      );
      const cachedDocumentString = await getValue(
        "web-scraper-cache:" + normalizedUrl
      );
      if (cachedDocumentString) {
        const cachedDocument = JSON.parse(cachedDocumentString);
        documents.push(cachedDocument);

        // get children documents
        for (const childUrl of cachedDocument.childrenLinks || []) {
          const normalizedChildUrl = this.normalizeUrl(childUrl);
          const childCachedDocumentString = await getValue(
            "web-scraper-cache:" + normalizedChildUrl
          );
          if (childCachedDocumentString) {
            const childCachedDocument = JSON.parse(childCachedDocumentString);
            if (
              !documents.find(
                (doc) =>
                  doc.metadata.sourceURL ===
                  childCachedDocument.metadata.sourceURL
              )
            ) {
              documents.push(childCachedDocument);
            }
          }
        }
      }
    }
    return documents;
  }

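How it works

Initialize the document array: the function starts with an empty array, documents, that collects the documents read from the cache.
Iterate over the URLs: a for...of loop processes each entry of the urls argument in turn.
Normalize the URL: each URL is passed through this.normalizeUrl(url), so that URLs are in a consistent format before they are used as cache keys.
Read the cached document: await getValue("web-scraper-cache:" + normalizedUrl) fetches the serialized document stored under the normalized URL. If an entry exists, it is parsed with JSON.parse and appended to documents; if not, the URL is skipped silently.
Read child documents: if the cached document has child links (cachedDocument.childrenLinks), each link goes through the same steps: normalize, read from the cache, parse. A child is appended only if no document already in documents has the same metadata.sourceURL. Only one level of children is followed; the children's own childrenLinks are not expanded.
Return the result: after the loop finishes, the function returns documents, which holds every document found in the cache.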
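
To make the steps above concrete, here is a minimal sketch that runs the same control flow against an in-memory cache. The Map-based getValue, the normalizeUrl stand-in (it only drops "www." and a trailing slash), the CachedDoc type, and the seed helper are all assumptions made for this illustration; they are not taken from the Firecrawl source.

```typescript
// Minimal document shape: only the fields the method reads (assumed for this sketch).
interface CachedDoc {
  content: string;
  metadata: { sourceURL: string };
  childrenLinks?: string[];
}

// Hypothetical in-memory stand-in for the real cache behind getValue.
const cache = new Map<string, string>();

async function getValue(key: string): Promise<string | null> {
  return cache.get(key) ?? null;
}

// Hypothetical normalizer: drops "www." after the scheme and a trailing slash.
function normalizeUrl(url: string): string {
  return url.replace("//www.", "//").replace(/\/$/, "");
}

const KEY_PREFIX = "web-scraper-cache:";

// Same control flow as the method above, written as a free function.
async function getCachedDocuments(urls: string[]): Promise<CachedDoc[]> {
  const documents: CachedDoc[] = [];
  for (const url of urls) {
    const raw = await getValue(KEY_PREFIX + normalizeUrl(url));
    if (!raw) continue; // cache miss: skip this URL
    const doc: CachedDoc = JSON.parse(raw);
    documents.push(doc);

    for (const childUrl of doc.childrenLinks ?? []) {
      const childRaw = await getValue(KEY_PREFIX + normalizeUrl(childUrl));
      if (!childRaw) continue; // child not cached: skip it
      const child: CachedDoc = JSON.parse(childRaw);
      const alreadyCollected = documents.some(
        (d) => d.metadata.sourceURL === child.metadata.sourceURL
      );
      if (!alreadyCollected) documents.push(child);
    }
  }
  return documents;
}

// Demo helper: stores a document under its normalized sourceURL.
function seed(doc: CachedDoc): void {
  cache.set(
    KEY_PREFIX + normalizeUrl(doc.metadata.sourceURL),
    JSON.stringify(doc)
  );
}

seed({
  content: "home",
  metadata: { sourceURL: "https://example.com" },
  childrenLinks: ["https://example.com/a", "https://example.com/missing"],
});
seed({ content: "page a", metadata: { sourceURL: "https://example.com/a" } });

// "https://www.example.com/" normalizes to the same key as "https://example.com".
getCachedDocuments(["https://www.example.com/"]).then((docs) => {
  console.log(docs.map((d) => d.metadata.sourceURL));
});
```

In this run the parent and the one cached child come back, while the uncached child link is dropped without any error, which is exactly how the original method treats a cache miss.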

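Purpose

The function reads documents from the cache in bulk, returning each requested document together with its cached child documents. This is useful when processing many pages: documents that were already scraped are served from the cache rather than requested again, which saves a network request and the scraping work for every cache hit.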
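
One way to picture how a cache lookup saves requests is to split the incoming URLs into hits and misses, so that only the misses need to be scraped. The sketch below is hypothetical: getCachedDocuments itself returns only the hits and does not report misses, and partitionByCache, LookupResult, and the Map-based getValue are names invented for this example. URL normalization is left out to keep it short.

```typescript
// Hypothetical in-memory cache; the real code reads from an external store via getValue.
const cache = new Map<string, string>([
  ["web-scraper-cache:https://example.com/a", JSON.stringify({ content: "A" })],
]);

async function getValue(key: string): Promise<string | null> {
  return cache.get(key) ?? null;
}

interface LookupResult {
  hits: Array<{ url: string; document: unknown }>;
  misses: string[]; // URLs that would still need a network request
}

// Splits the input into cached documents and URLs that are not cached yet.
async function partitionByCache(urls: string[]): Promise<LookupResult> {
  const result: LookupResult = { hits: [], misses: [] };
  for (const url of urls) {
    const raw = await getValue("web-scraper-cache:" + url);
    if (raw) {
      result.hits.push({ url, document: JSON.parse(raw) });
    } else {
      result.misses.push(url);
    }
  }
  return result;
}

partitionByCache(["https://example.com/a", "https://example.com/b"]).then(
  ({ hits, misses }) => {
    console.log("served from cache:", hits.length, "still to fetch:", misses);
  }
);
```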

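Notes

Cache access: the code relies on a getValue function to read from the cache. Its implementation is not shown here, so make sure it returns the stored string for a key and a falsy value on a miss.
URL normalization: this.normalizeUrl(url) is likewise not shown. Cache keys are built from its output, so it must match the normalization applied when the documents were written; otherwise lookups will miss.
Deduplication: a child document is skipped when a document with the same metadata.sourceURL is already in documents. The check does not cover top-level documents, which are pushed unconditionally, so the result can still contain duplicates if urls repeats an entry or lists a page that was already added as a child of an earlier URL. The check also uses Array.prototype.find, which is a linear scan for every child.
Error handling: the code has none. A rejected getValue call or a JSON.parse failure on a corrupted entry aborts the whole call, and a child entry without metadata throws inside the dedup check. Production code should handle these cases, for example by logging and skipping the bad entry.
Performance: every cache read is awaited in sequence, so the total time grows with the number of URLs and child links. Reads can be issued concurrently to speed this up, but the concurrency should be capped to avoid putting too much load on the cache.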
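
The last three notes can be addressed together. The following is a sketch of one possible variant, not the Firecrawl implementation: it skips unreadable or malformed entries instead of throwing, deduplicates every document (parents included) with a Set, and caps the number of concurrent cache reads. The names GetValue, readDoc, mapWithLimit, and getCachedDocumentsSafe, as well as the default limit of 5, are assumptions made for this example, and URL normalization is omitted for brevity.

```typescript
interface CachedDoc {
  metadata: { sourceURL: string };
  childrenLinks?: string[];
}

// Hypothetical signature for the cache reader; the real getValue is not shown.
type GetValue = (key: string) => Promise<string | null>;

const KEY_PREFIX = "web-scraper-cache:";

// Reads one entry. Returns null on a miss, a failed read, invalid JSON,
// or a parsed value that lacks metadata.sourceURL.
async function readDoc(getValue: GetValue, url: string): Promise<CachedDoc | null> {
  try {
    const raw = await getValue(KEY_PREFIX + url);
    if (!raw) return null;
    const doc = JSON.parse(raw);
    return typeof doc?.metadata?.sourceURL === "string" ? (doc as CachedDoc) : null;
  } catch {
    return null; // real code would likely log the failure before skipping
  }
}

// Runs fn over items with at most `limit` calls in flight; results keep input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i]);
    }
  };
  const workerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

async function getCachedDocumentsSafe(
  getValue: GetValue,
  urls: string[],
  limit = 5
): Promise<CachedDoc[]> {
  const seen = new Set<string>(); // sourceURLs already collected
  const documents: CachedDoc[] = [];
  const collect = (doc: CachedDoc | null): void => {
    if (doc && !seen.has(doc.metadata.sourceURL)) {
      seen.add(doc.metadata.sourceURL);
      documents.push(doc);
    }
  };

  // Parents are read concurrently, then handled in input order so that each
  // parent is still followed by its own children, as in the original method.
  const parents = await mapWithLimit(urls, limit, (u) => readDoc(getValue, u));
  for (const parent of parents) {
    if (!parent) continue;
    collect(parent);
    const children = await mapWithLimit(parent.childrenLinks ?? [], limit, (u) =>
      readDoc(getValue, u)
    );
    children.forEach(collect);
  }
  return documents;
}

// Demo with a hypothetical in-memory store, including one corrupted entry.
const store = new Map<string, string>([
  [KEY_PREFIX + "https://example.com", JSON.stringify({
    metadata: { sourceURL: "https://example.com" },
    childrenLinks: ["https://example.com/a", "https://example.com/a"],
  })],
  [KEY_PREFIX + "https://example.com/a", JSON.stringify({
    metadata: { sourceURL: "https://example.com/a" },
  })],
  [KEY_PREFIX + "https://example.com/bad", "{not json"],
]);
const demoGetValue: GetValue = async (key) => store.get(key) ?? null;

getCachedDocumentsSafe(demoGetValue, [
  "https://example.com",
  "https://example.com/bad",
  "https://example.com",
]).then((docs) => console.log(docs.map((d) => d.metadata.sourceURL)));
```

Swallowing errors in readDoc is a deliberate simplification: it keeps one bad entry from failing the whole batch, but it also hides cache outages, so a real implementation would probably log or count the failures. The Set lookup replaces the per-child linear scan with a constant-time check.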