SeimiCrawler Open-Source Project Tutorial - CSDN Blog

Article link: https://blog.csdn.net/gitblog_00589/article/details/141043236


SeimiCrawler is a simple, agile, distributed Java crawler framework with Spring Boot support. Project address: https://gitcode.com/gh_mirrors/se/SeimiCrawler

Project Introduction

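SeimiCrawler is an agile Java crawler framework that can be deployed standalone and supports distributed operation. Its goal is to lower the barrier to building a crawler system that is highly usable and performs reasonably well, and to make development faster. SeimiCrawler is inspired by Scrapy, the Python crawler framework, and combines the characteristics of the Java language with the features of Spring. By default it uses JsoupXpath as its HTML parser, so HTML data is parsed and extracted with XPath.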
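To make the idea of XPath-based extraction concrete, here is a minimal sketch that uses only the JDK's built-in `javax.xml.xpath` API as a stand-in. It is not JsoupXpath and not SeimiCrawler code: the class name `XPathSketch`, the helper `select`, and the sample markup are made up for illustration, and unlike JsoupXpath the JDK parser only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; it only shows what an XPath query selects.
class XPathSketch {

    // Parses well-formed markup and returns the text of every node
    // matched by the XPath expression, in document order.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() works for text, attribute, and element nodes.
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page.
        String page = "<html><head><title>Example Domain</title></head>"
                + "<body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        System.out.println(select(page, "//title/text()")); // page title
        System.out.println(select(page, "//a/@href"));      // all link targets
    }
}
```

The same two expressions, `//title/text()` and `//a/@href`, are the kind of query a crawler callback typically runs against a fetched page: one pulls a single field, the other collects links to follow.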

Quick Start

Environment

Java 8 or later
Maven 3.x

Quick Start Steps

Clone the project

```
git clone https://github.com/zhegexiaohuozi/SeimiCrawler.git
```

Build the project
```
cd SeimiCrawler
mvn clean install
```
Run the example

Change into the seimicrawler-spring-boot-example directory and run the Spring Boot example:
```
cd seimicrawler-spring-boot-example
mvn spring-boot:run
```

Example Code

The following is a simple crawler example:

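```
import java.util.List;

import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Response;

@Crawler(name = "basic")
public class Basic extends BaseSeimiCrawler {
    // Seed URLs that the crawler requests first.
    @Override
    public String[] startUrls() {
        return new String[]{"http://example.com"};
    }

    // Default callback, invoked with the response for each start URL.
    @Override
    public void start(Response response) {
        logger.info("Got response from: {}", response.getUrl());
        try {
            // Response exposes the parsed page through document(); XPath queries
            // go through JsoupXpath's sel(), which returns every match.
            List<Object> titles = response.document().sel("//title/text()");
            logger.info("Page title: {}", titles.isEmpty() ? "" : titles.get(0));
        } catch (Exception e) {
            // sel() can fail on a malformed XPath expression.
            logger.error("Failed to extract the page title", e);
        }
    }
}
```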