Crawly: A Guide to the Elixir Web Crawling Framework - CSDN Blog

Article link: https://blog.csdn.net/gitblog_00311/article/details/141085286

Crawly, a high-level web crawling & scraping framework for Elixir. Project repository: https://gitcode.com/gh_mirrors/cr/crawly

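1. Project Overview

Crawly is a web scraping framework built on Elixir. It aims to be simple to use while offering an efficient, flexible way to crawl pages and extract data. The project follows common web scraping practices, including request retries, handling of anti-crawling measures, and extensibility through middlewares. Crawly is added to an Elixir application as an ordinary dependency and can serve as the basis for more complex data collection systems.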

2. Quick Start

Installing dependencies

First make sure Elixir and Erlang/OTP are installed, then add crawly to the dependency list in your mix.exs file:

defp deps do
  [
    {:crawly, "~> 0.0.0"} # placeholder: replace with the actual version number
  ]
end

Then fetch the dependencies:

mix deps.get

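Configuring the crawler

Create (or edit) config/config.exs and add the following configuration:

import Config

config :crawly,
  # Maximum number of parallel requests per crawled domain
  concurrent_requests_per_domain: 5,
  # Item pipelines are modules, applied in order to every scraped item
  pipelines: [MyPipeline]

In this example, MyPipeline stands for your own pipeline module; adjust the concurrency limit and the pipeline list to suit your project. As far as Crawly's documented settings go, spiders are not registered in the configuration: a spider is started by its module name.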
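
The configuration refers to a custom pipeline that this guide does not define. The following is a minimal sketch of what such a module might look like, assuming Crawly's pipeline contract: a run function that receives the item and the pipeline state and returns {item, state}, or {false, state} to drop the item. The module name MyPipeline and the :title field are hypothetical and used only for illustration.

```elixir
defmodule MyPipeline do
  # Hypothetical pipeline module. In a real project you would also declare
  # `@behaviour Crawly.Pipeline`; it is left out so the sketch compiles on its own.

  # Trims the :title field and drops items whose title is missing or blank.
  def run(item, state, _opts \\ []) do
    case item |> Map.get(:title, "") |> to_string() |> String.trim() do
      "" -> {false, state}
      title -> {Map.put(item, :title, title), state}
    end
  end
end
```

Because pipelines run in the order listed in the configuration, a cleaning step like this one would normally come before any pipeline that encodes or stores the item.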

Writing the Spider

Create a file for the MySpider module, for example in the lib/spiders directory:

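defmodule MySpider do
  use Crawly.Spider

  # Base URL of the site; the DomainFilter middleware uses it to drop off-site requests
  @impl Crawly.Spider
  def base_url(), do: "http://example.com"

  # URLs the crawl starts from
  @impl Crawly.Spider
  def init(), do: [start_urls: ["http://example.com"]]

  # Called for each fetched page; returns the scraped items and any follow-up requests
  @impl Crawly.Spider
  def parse_item(response) do
    data = scrape_data(response.body)
    # Leave requests empty when there are no further pages to visit
    %Crawly.ParsedItem{items: [data], requests: []}
  end

  # Illustrative extraction only (assumes :floki is in your deps): grabs the page title.
  # Replace it with the selectors your target pages need.
  defp scrape_data(html) do
    {:ok, document} = Floki.parse_document(html)
    %{title: document |> Floki.find("title") |> Floki.text()}
  end
end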
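
Besides scraped data, a spider usually hands back the next URLs to visit. Links found in a page are often relative, so they need to be resolved against the page URL and limited to the site being crawled. A minimal sketch using only Elixir's standard URI module follows; LinkHelper and follow_up_urls/2 are hypothetical names, not part of Crawly.

```elixir
defmodule LinkHelper do
  # Hypothetical helper, not part of Crawly.
  # Resolves each href against the page URL, keeps only links on the same
  # host as the page, and removes duplicates.
  def follow_up_urls(page_url, hrefs) do
    host = URI.parse(page_url).host

    hrefs
    |> Enum.map(fn href -> page_url |> URI.merge(href) |> URI.to_string() end)
    |> Enum.filter(fn url -> URI.parse(url).host == host end)
    |> Enum.uniq()
  end
end
```

Inside a spider's parse callback, each resulting URL could then be turned into a request (Crawly provides Crawly.Utils.request_from_url/1 for this) and returned alongside the scraped items.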