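For a recent small project I needed a web crawler, so today I'll walk through how to set up and use Spider-Flow.
Spider-Flow is a crawler platform that defines crawl flows graphically, so a crawler can be built without writing code. It is similar to 八爪鱼 (Octoparse), which is also configured through a UI. Let's see how well it works in practice. The open-source repository is below.
Project repository: https://gitee.com/ssssssss-team/spider-flow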
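1. Project background
Spider-Flow lets users define crawl flows graphically, so no code is needed to complete a crawling task. The platform is flexible and configurable, supports several data extraction methods and data sources, and fits a wide range of crawling needs.
Main programming language
Spider-Flow is written mainly in Java. Personally, I still find Python a bit stronger for this kind of work.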
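2. Key technologies and features
JSON/XML/binary formats: handles multiple data formats.
XPath/JsonPath/CSS selectors/regular expressions: used for data extraction.
Multiple data sources: SQL operations such as select/selectInt/selectOne/insert/update/delete.
JS-rendered pages: can crawl pages that are rendered dynamically by JavaScript.
Proxy support: crawling through a proxy.
Plugin extensions: custom plugins such as the Selenium, Redis, and OSS plugins.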
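To make the extraction features above a little more concrete, here is a minimal Python sketch of the same idea outside Spider-Flow: it strips a JSONP-style wrapper with a regular expression and then picks fields out of the parsed JSON. The sample payload is invented for illustration, and the helper names (strip_jsonp, extract_rows) are hypothetical; this is not how Spider-Flow implements extraction internally.

```python
import json
import re

# Hypothetical JSONP-style response body; the payload is made up for illustration.
SAMPLE_BODY = '1([{"data": [{"ItemName": "cabbage", "Price04": "1.20", "ItemUnit": "kg"}]}])'


def strip_jsonp(body: str) -> str:
    """Return the JSON text inside a callback(...) wrapper."""
    match = re.fullmatch(r"\s*\w+\((.*)\)\s*;?\s*", body, flags=re.DOTALL)
    if match is None:
        raise ValueError("body is not wrapped in a callback")
    return match.group(1)


def extract_rows(body: str) -> list:
    """Parse the payload and return (name, price, unit) tuples."""
    payload = json.loads(strip_jsonp(body))
    return [
        (row["ItemName"], row["Price04"], row["ItemUnit"])
        for row in payload[0]["data"]
    ]


if __name__ == "__main__":
    for name, price, unit in extract_rows(SAMPLE_BODY):
        print(name, price, unit)
```

In Spider-Flow the equivalent steps are configured as expressions on nodes in the flow editor rather than written as code.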
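3. Installation and configuration
Prerequisites. These should be routine for a developer, so I won't go into detail:
- Java: make sure Java 8 or later is installed.
- Database: Spider-Flow needs a database to store crawler configuration and data; MySQL, PostgreSQL, and others are supported.
- Git: used to clone the project source from Gitee.
- Maven: used in step 3.2 to build the project.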
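Before going further, it can help to confirm that the command-line tools are actually on the PATH. The sketch below is one possible check, not part of Spider-Flow; the tool list is my own assumption based on the commands used in this post (java, git, mvn), and the function names are hypothetical.

```python
import shutil

# Command-line tools this walkthrough relies on (assumed from the steps in this post).
REQUIRED_TOOLS = ("java", "git", "mvn")


def find_tools(names=REQUIRED_TOOLS) -> dict:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found: dict) -> list:
    """Return the sorted names of tools that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    missing = missing_tools(find_tools())
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

This only checks that the executables exist; it does not verify versions, so Java 8 or later still has to be confirmed separately.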
3.1 Clone the code
git clone https://gitee.com/ssssssss-team/spider-flow.git
3.2 Build the project with Maven
Run this from the cloned spider-flow directory:
mvn install

3.3 Prepare the database
I'm using MySQL here. In the source tree, find the db directory and run the spiderflow.sql file in it:
SET FOREIGN_KEY_CHECKS=0;
CREATE DATABASE spiderflow;
USE spiderflow;
DROP TABLE IF EXISTS `sp_flow`;
CREATE TABLE `sp_flow` (
`id` varchar(32) NOT NULL,
`name` varchar(64) DEFAULT NULL COMMENT '任务名字',
`xml` longtext DEFAULT NULL COMMENT 'xml表达式',
`cron` varchar(255) DEFAULT NULL COMMENT 'cron表达式',
`enabled` char(1) DEFAULT '0' COMMENT '任务是否启动,默认未启动',
`create_date` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
`last_execute_time` datetime DEFAULT NULL COMMENT '上一次执行时间',
`next_execute_time` datetime DEFAULT NULL COMMENT '下一次执行时间',
`execute_count` int(8) DEFAULT NULL COMMENT '定时执行的已执行次数',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT '爬虫任务表';
INSERT INTO `sp_flow` VALUES ('b45fb98d2a564c23ba623a377d5e12e9', '爬取码云GVP', '<mxGraphModel>\n <root>\n <mxCell id=\"0\">\n <JsonProperty as=\"data\">\n {"spiderName":"爬取码云GVP","threadCount":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"1\" parent=\"0\"/>\n <mxCell id=\"2\" value=\"开始\" style=\"start\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"80\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"shape":"start"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"5\" value=\"抓取首页\" style=\"request\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"180\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"抓取首页","loopVariableName":"","sleep":"","timeout":"","response-charset":"","method":"GET","body-type":"none","body-content-type":"text/plain","loopCount":"","url":"https://gitee.com/gvp/all","proxy":"","request-body":[""],"follow-redirect":"1","shape":"request"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"6\" value=\"\" parent=\"1\" source=\"2\" target=\"5\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"7\" value=\"提取项目名、地址\" style=\"variable\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"330\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"提取项目名、地址","loopVariableName":"","variable-name":["projectUrls","projectNames"],"loopCount":"","variable-value":["${extract.selectors(resp.html,'.categorical-project-card a','attr','href')}","${extract.selectors(resp.html,'.project-name')}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"8\" value=\"\" parent=\"1\" source=\"5\" target=\"7\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"9\" value=\"抓取详情页\" style=\"request\" parent=\"1\" vertex=\"1\">\n 
<mxGeometry x=\"450.16668701171875\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"抓取详情页","loopVariableName":"projectIndex","sleep":"","timeout":"","response-charset":"","method":"GET","body-type":"none","body-content-type":"text/plain","loopCount":"10","url":"https://gitee.com/${projectUrls[projectIndex]}","proxy":"","request-body":[""],"follow-redirect":"1","shape":"request"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"10\" value=\"\" parent=\"1\" source=\"7\" target=\"9\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"12\" value=\"提取项目描述\" style=\"variable\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"550\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"提取项目描述","loopVariableName":"","variable-name":["projectDesc"],"loopCount":"","variable-value":["${extract.selector(resp.html,'.git-project-desc-text')}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"13\" value=\"\" parent=\"1\" source=\"9\" target=\"12\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"14\" value=\"输出\" style=\"output\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"660.1666870117188\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"输出","output-name":["项目名","项目地址","项目描述"],"output-value":["${projectNames[projectIndex]}","https://gitee.com${projectUrls[projectIndex]}","${projectDesc}"],"shape":"output"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"15\" value=\"\" parent=\"1\" source=\"12\" target=\"14\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n </root>\n</mxGraphModel>\n', null, '0', '2019-08-22 13:46:54', 
null, null, null);
INSERT INTO `sp_flow` VALUES ('f0a67f17ee1a498a9b2f4ca30556f3c3', '抓取每日菜价', '<mxGraphModel>\n <root>\n <mxCell id=\"0\">\n <JsonProperty as=\"data\">\n {"spiderName":"抓取每日菜价","threadCount":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"1\" parent=\"0\"/>\n <mxCell id=\"2\" value=\"开始\" style=\"start\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"80\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"shape":"start"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"3\" value=\"开始抓取\" style=\"request\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"219.83334350585938\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"开始抓取","loopVariableName":"","sleep":"","timeout":"","response-charset":"","method":"GET","body-type":"none","body-content-type":"text/plain","loopCount":"","url":"http://www.beijingprice.cn:8086/price/priceToday/PageLoad/LoadPrice?jsoncallback=1","proxy":"","request-body":[""],"follow-redirect":"1","shape":"request"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"4\" value=\"\" parent=\"1\" source=\"2\" target=\"3\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"5\" value=\"解析JSON\" style=\"variable\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"350\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"解析JSON","loopVariableName":"","variable-name":["jsonstr","jsondata","data"],"loopCount":"","variable-value":["${string.substring(resp.html,2,resp.html.length()-1)}","${json.parse(jsonstr)}","${extract.jsonpath(jsondata[0],'data')}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"6\" value=\"\" parent=\"1\" source=\"3\" target=\"5\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"7\" value=\"输出\" 
style=\"output\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"480.16668701171875\" y=\"80\" width=\"24\" height=\"24\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"输出","loopVariableName":"i","output-name":["菜名","菜价","单位"],"loopCount":"${list.length(data)}","output-value":["${data[i].ItemName}","${data[i].Price04}","${data[i].ItemUnit}"],"shape":"output"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"8\" value=\"\" parent=\"1\" source=\"5\" target=\"7\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n </root>\n</mxGraphModel>\n', null, '0', '2019-08-22 13:48:22', null, null, null);
INSERT INTO `sp_flow` VALUES ('b4430885ba8349588d1220d37eac831d', '爬取开源中国动弹', '<mxGraphModel>\n <root>\n <mxCell id=\"0\">\n <JsonProperty as=\"data\">\n {"spiderName":"爬取开源中国动弹","threadCount":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"1\" parent=\"0\"/>\n <mxCell id=\"2\" value=\"开始\" style=\"start\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"80\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"shape":"start"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"3\" value=\"爬取动弹\" style=\"request\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"220\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"爬取动弹","loopVariableName":"","sleep":"","timeout":"","response-charset":"","method":"GET","parameter-name":["type","lastLogId"],"body-type":"none","body-content-type":"text/plain","loopCount":"","url":"https://www.oschina.net/tweets/widgets/_tweet_index_list ","proxy":"","parameter-value":["ajax","${lastLogId}"],"request-body":"","follow-redirect":"1","tls-validate":"1","shape":"request"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"4\" value=\"\" edge=\"1\" parent=\"1\" source=\"2\" target=\"3\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"5\" value=\"提取lastLogId以及tweets\" style=\"variable\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"340\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"提取lastLogId以及tweets","loopVariableName":"","variable-name":["lastLogId","tweets","fetchCount"],"loopCount":"","variable-value":["${resp.selector('.tweet-item:last-child').attr('data-tweet-id')}","${resp.selectors('.tweet-item[data-tweet-id]')}","${fetchCount == null ? 
0 : fetchCount + 1}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"6\" value=\"\" edge=\"1\" parent=\"1\" source=\"3\" target=\"5\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"7\" value=\"循环\" style=\"loop\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"340\" y=\"250\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"循环","loopVariableName":"index","loopCount":"${list.length(tweets)}","shape":"loop"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"8\" value=\"\" edge=\"1\" parent=\"1\" source=\"5\" target=\"7\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"9\" value=\"提取详细信息\" style=\"variable\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"340\" y=\"340\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"提取详细信息","loopVariableName":"","variable-name":["content","author","like","reply","publishTime"],"loopCount":"","variable-value":["${tweets[index].selector('.text').text()}","${tweets[index].selector('.user').text()}","${tweets[index].selector('.like span').text()}","${tweets[index].selector('.reply span').text()}","${tweets[index].selector('.date').regx('(.*?)&nbsp')}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"10\" value=\"\" edge=\"1\" parent=\"1\" source=\"7\" target=\"9\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"11\" value=\"输出\" style=\"output\" vertex=\"1\" parent=\"1\">\n <mxGeometry x=\"340\" y=\"430\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n 
{"value":"输出","loopVariableName":"","output-name":["作者","内容","点赞数","评论数","发布时间"],"loopCount":"","output-value":["${author}","${content}","${like}","${reply}","${publishTime}"],"shape":"output"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"12\" value=\"\" edge=\"1\" parent=\"1\" source=\"9\" target=\"11\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"13\" value=\"爬取3页\" edge=\"1\" parent=\"1\" source=\"5\" target=\"3\">\n <mxGeometry x=\"-0.0312\" y=\"-20\" relative=\"1\" as=\"geometry\">\n <Array as=\"points\">\n <mxPoint x=\"356\" y=\"180\"/>\n <mxPoint x=\"236\" y=\"180\"/>\n </Array>\n <mxPoint as=\"offset\"/>\n </mxGeometry>\n <JsonProperty as=\"data\">\n {"value":"爬取5页","condition":"${fetchCount < 3}"}\n </JsonProperty>\n </mxCell>\n </root>\n</mxGraphModel>\n', '', '0', '2019-11-03 17:02:49', '2019-11-04 10:11:31', '2019-11-03 17:30:56', '3');
INSERT INTO `sp_flow` VALUES ('663aaa5e36a84c9594ef3cfd6738e9a7', '百度热点', '<mxGraphModel>\n <root>\n <mxCell id=\"0\">\n <JsonProperty as=\"data\">\n {"spiderName":"百度热点","threadCount":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"1\" parent=\"0\"/>\n <mxCell id=\"2\" value=\"开始\" style=\"start\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"80\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"shape":"start"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"3\" value=\"开始抓取\" style=\"request\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"220\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"开始抓取","loopVariableName":"","sleep":"","timeout":"","response-charset":"gbk","method":"GET","body-type":"none","body-content-type":"text/plain","loopCount":"","url":"https://top.baidu.com/buzz?b=1&fr=topindex","proxy":"","request-body":"","follow-redirect":"1","tls-validate":"1","shape":"request"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"4\" value=\"定义变量\" style=\"variable\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"360\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"定义变量","loopVariableName":"","variable-name":["elementbd"],"loopCount":"","variable-value":["${resp.xpaths('//*[@id=\\"main\\"]/div[2]/div/table/tbody/tr')}"],"shape":"variable"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"5\" value=\"输出\" style=\"output\" parent=\"1\" vertex=\"1\">\n <mxGeometry x=\"480\" y=\"80\" width=\"32\" height=\"32\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"输出","loopVariableName":"i","output-name":["名称","地址","百度指数","2"],"loopCount":"${elementbd.size()-1}","output-value":["${elementbd[i+1].xpath('//td[2]/a[1]/text()')}","${elementbd[i+1].xpath('//td[2]/a[1]/@href')}","${elementbd[i+1].xpath('//td[4]/span/text()')}","${elementbd[i+1].xpath('//td[3]/a[2]/text()')}"],"shape":"output"}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"6\" 
value=\"\" parent=\"1\" source=\"2\" target=\"3\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"7\" value=\"\" parent=\"1\" source=\"3\" target=\"4\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n <mxCell id=\"8\" value=\"\" parent=\"1\" source=\"4\" target=\"5\" edge=\"1\">\n <mxGeometry relative=\"1\" as=\"geometry\"/>\n <JsonProperty as=\"data\">\n {"value":"","condition":""}\n </JsonProperty>\n </mxCell>\n </root>\n</mxGraphModel>\n', '0 0/30 * * * ? *', '1', '2019-10-20 17:24:21', '2019-11-04 08:52:05', '2019-10-30 14:52:39', '45');
DROP TABLE IF EXISTS `sp_datasource`;
CREATE TABLE `sp_datasource` (
`id` varchar(32) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`driver_class_name` varchar(255) DEFAULT NULL,
`jdbc_url` varchar(255) DEFAULT NULL,
`username` varchar(64) DEFAULT NULL,
`password` varchar(32) DEFAULT NULL,
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
DROP TABLE IF EXISTS `sp_variable`;
CREATE TABLE `sp_variable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(32) DEFAULT NULL COMMENT '变量名',
`value` varchar(512) DEFAULT NULL COMMENT '变量值',
`description` varchar(255) DEFAULT NULL COMMENT '变量描述',
`create_date` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8mb4;
/* v0.3.0 新增 */
DROP TABLE IF EXISTS `sp_task`;
CREATE TABLE `sp_task` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`flow_id` varchar(32) NOT NULL,
`begin_time` datetime DEFAULT NULL,
`end_time` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4;
/* v0.4.0 新增 */
DROP TABLE IF EXISTS `sp_function`;
CREATE TABLE `sp_function` (
`id` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
`name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '函数名',
`parameter` varchar(512) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '参数',
`script` text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL COMMENT 'js脚本',
`create_date` datetime(0) NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
/* v0.5.0 新增 */
DROP TABLE IF EXISTS `sp_flow_notice`;
CREATE TABLE `sp_flow_notice` (
`id` varchar(32) NOT NULL,
`recipients` varchar(200) DEFAULT NULL COMMENT '收件人',
`notice_way` char(10) DEFAULT NULL COMMENT '通知方式',
`start_notice` char(1) DEFAULT '0' COMMENT '流程开始通知:1:开启通知,0:关闭通知',
`exception_notice` char(1) DEFAULT '0' COMMENT '流程异常通知:1:开启通知,0:关闭通知',
`end_notice` char(1) DEFAULT '0' COMMENT '流程结束通知:1:开启通知,0:关闭通知',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT '爬虫任务通知表';
Note: the INSERT statements in the script above only load sample flows as test data. If they raise errors, you can skip them.
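The xml column of sp_flow is worth a closer look: each flow is stored as an mxGraphModel document in which every mxCell is either a node (vertex="1") or a connection (edge="1"), with its settings held as JSON inside a JsonProperty element. The sketch below reads that structure with Python's standard library. The sample flow is a shortened, made-up document in the same shape as the rows above, and summarize_flow is a hypothetical helper, not a Spider-Flow API.

```python
import json
import xml.etree.ElementTree as ET

# Shortened, made-up flow in the same shape as the sp_flow.xml column.
SAMPLE_FLOW = """
<mxGraphModel>
  <root>
    <mxCell id="0"><JsonProperty as="data">{"spiderName": "demo"}</JsonProperty></mxCell>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="start" style="start" parent="1" vertex="1">
      <JsonProperty as="data">{"shape": "start"}</JsonProperty>
    </mxCell>
    <mxCell id="3" value="fetch" style="request" parent="1" vertex="1">
      <JsonProperty as="data">{"shape": "request", "method": "GET", "url": "https://example.com"}</JsonProperty>
    </mxCell>
    <mxCell id="4" value="" parent="1" source="2" target="3" edge="1">
      <JsonProperty as="data">{"value": "", "condition": ""}</JsonProperty>
    </mxCell>
  </root>
</mxGraphModel>
"""


def summarize_flow(xml_text: str):
    """Return ({node_id: shape}, [(source, target, condition), ...])."""
    root = ET.fromstring(xml_text.strip())
    nodes, edges = {}, []
    for cell in root.iter("mxCell"):
        prop = cell.find("JsonProperty")
        has_json = prop is not None and prop.text and prop.text.strip()
        data = json.loads(prop.text) if has_json else {}
        if cell.get("vertex") == "1":
            nodes[cell.get("id")] = data.get("shape")
        elif cell.get("edge") == "1":
            edges.append((cell.get("source"), cell.get("target"), data.get("condition", "")))
    return nodes, edges


if __name__ == "__main__":
    print(summarize_flow(SAMPLE_FLOW))
```

Reading a flow this way is handy for checking which nodes a saved crawler contains without opening the editor.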
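3.4 Edit the project configuration
Open the spider-flow-web module.
- In that module, go to the src/main/resources folder and edit application.properties.
- Set the database connection details: URL, username, and password.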
spring.datasource.url=jdbc:mysql://localhost:3306/spiderflow
spring.datasource.username=xxx
spring.datasource.password=xxx
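A quick way to double-check these three settings before starting the application is to read them back and split the JDBC URL into its parts. The following sketch is only an illustration under simplifying assumptions: parse_properties handles plain key=value lines and comments, not the full .properties syntax, and both function names are hypothetical.

```python
from urllib.parse import urlparse

# Same three settings as above, with placeholder credentials.
SAMPLE_PROPERTIES = """\
# datasource settings
spring.datasource.url=jdbc:mysql://localhost:3306/spiderflow
spring.datasource.username=xxx
spring.datasource.password=xxx
"""


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def describe_jdbc_url(jdbc_url: str) -> dict:
    """Split a jdbc:<engine>://host:port/database URL into its parts."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parsed = urlparse(jdbc_url[len("jdbc:"):])
    return {
        "engine": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    props = parse_properties(SAMPLE_PROPERTIES)
    print(describe_jdbc_url(props["spring.datasource.url"]))
```

The database name in the URL should match the one created by spiderflow.sql (spiderflow in this post).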
3.5 Run the project
From the project root, run:
mvn spring-boot:run
It failed. Don't panic; let's work out why.

spider-flow ........................................ FAILURE [ 3.317 s]
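To see what went wrong, look further up the build log for the specific error:
Unable to find a suitable main class, please add a 'mainClass' property
This message appears to come from the Spring Boot Maven plugin rather than from a unit test: the command was run against the root spider-flow module, which has no main class of its own. Rather than chase it, let's just package the project and run the jar directly:
mvn package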

Once the build succeeds, go to the spider-flow\spider-flow-web\target directory and start the application:
java -jar spider-flow.jar

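When you see this familiar startup message, the embedded Tomcat server has started successfully on port 8088.
Open http://localhost:8088/ in a browser and the following page appears:

At this point the project is deployed and running.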
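If you start the jar from a script, a small check that the port is accepting connections can replace watching the console by eye. This is a minimal sketch using plain sockets; the host and port come from this post (localhost, 8088), while the function names and the retry settings are my own assumptions.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll the port until it opens or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # Single probe; use wait_for_port("localhost", 8088) to keep retrying.
    print(port_is_open("localhost", 8088))
```

An open port only shows that the server is listening; loading the page in a browser is still the real confirmation.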
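4. Configure a test crawler task
4.1 In the web UI, open the crawler list menu and click "Add crawler" (“添加爬虫”), then drag tools from the left-hand panel onto the canvas to build the crawler.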

4.2 Configure the crawler's URL, selectors, data sources, and so on as needed.


4.3 Save and start the crawler task
Here I tested crawling job listings from BOSS直聘. A schedule can also be set from the list page.
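Schedules are stored in the cron column of sp_flow; the 百度热点 sample row uses 0 0/30 * * * ? *. Assuming this is a Quartz-style expression (seconds first, optional year last), which is my reading of the format and not something this post confirms, the fields can be labeled like this. split_cron is a hypothetical helper and does no validation beyond counting fields.

```python
# Field order of a Quartz-style cron expression (assumed format, see lead-in).
QUARTZ_FIELDS = (
    "second",
    "minute",
    "hour",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)


def split_cron(expression: str) -> dict:
    """Label the space-separated fields of a 6- or 7-field cron expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 space-separated fields")
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for name, value in split_cron("0 0/30 * * * ? *").items():
        print(f"{name}: {value}")
```

Under that reading, the minute field 0/30 means every 30 minutes starting at minute 0.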


In the list, use the run button in the actions column to execute the task.

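That's it for this introduction. Next time I'll cover how to use Spider-Flow in more detail. And a reminder to everyone: use the tool legally and in compliance with the rules.
Summary
Thanks to everyone who read this far 😉
That's all for this post. 猫头鹰数据 is dedicated to sharing practical technical content 😎
If you spot any mistakes or omissions above, please point them out 😅
If you found this useful or are interested in the topic, a like and a follow are appreciated 🙏
You can also search for and follow my WeChat official account 【终极量化数据】 and leave a message there 🙏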




