A Simple Java Crawler Example

Latest recommended article published on 2024-07-12 09:13:43

weixin_39553458

Article link: https://blog.csdn.net/weixin_39553458/article/details/114043239

This post walks through the workflow of a Java crawler for beginners who already know the basics. Corrections are welcome!

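I. Flowchart

II. Writing the program from the flowchart (example: scraping notices, announcements, and news from the Tianjin Health Department)

1. Directory structure

2. Main function

Screenshot 1

3. How to locate the information you need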

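For reference, here is the Chinese jsoup documentation: http://www.open-open.com/jsoup/

However, this only locates a single URL, whereas the goal is to get every needed URL on the page.

To do that, we can modify the path obtained a moment ago.

The path just copied with Copy selector: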

body > table:nth-child(12) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table:nth-child(2) > tbody > tr > td:nth-child(2) > table:nth-child(1) > tbody > tr > td:nth-child(1) > a

Next, analyze where each of the other URLs sits in the page.

For example:

body > table:nth-child(12) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table:nth-child(2) > tbody > tr > td:nth-child(2) > table:nth-child(2) > tbody > tr > td:nth-child(1) > a

Another example:

body > table:nth-child(12) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table:nth-child(2) > tbody > tr > td:nth-child(2) > table:nth-child(3) > tbody > tr > td:nth-child(1) > a

A pattern emerges: only the nth-child index on the last table step changes, so dropping it gives:

body > table:nth-child(12) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table:nth-child(2) > tbody > tr > td:nth-child(2) > table > tbody > tr > td:nth-child(1) > a

This path matches every needed URL on the page.
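
As a rough illustration of how such a pattern can be derived mechanically, the sketch below compares the copied selectors step by step and drops the `:nth-child(n)` index wherever they disagree. The class and method names (`SelectorGeneralizer`, `generalize`, `stripNthChild`) are hypothetical and not part of the original program, and the sketch assumes every selector has the same number of steps separated by ` > `.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not from the original post.
final class SelectorGeneralizer {

    /** Removes a trailing ":nth-child(n)" from one selector step, if present. */
    static String stripNthChild(String step) {
        return step.replaceAll(":nth-child\\(\\d+\\)$", "");
    }

    /** Merges selectors that differ only in nth-child indexes into one selector. */
    static String generalize(List<String> selectors) {
        if (selectors.isEmpty()) {
            throw new IllegalArgumentException("need at least one selector");
        }
        // Split every selector into its steps ("body", "table:nth-child(12)", ...).
        List<String[]> split = new ArrayList<>();
        for (String s : selectors) {
            split.add(s.trim().split("\\s*>\\s*"));
        }
        String[] first = split.get(0);
        StringJoiner out = new StringJoiner(" > ");
        for (int i = 0; i < first.length; i++) {
            String step = first[i];
            for (String[] steps : split) {
                if (steps.length != first.length) {
                    throw new IllegalArgumentException("selectors differ in depth");
                }
                if (!steps[i].equals(first[i])) {
                    // The steps may differ only in their nth-child index.
                    if (!stripNthChild(steps[i]).equals(stripNthChild(first[i]))) {
                        throw new IllegalArgumentException("step " + i + " differs in more than its index");
                    }
                    step = stripNthChild(first[i]);
                }
            }
            out.add(step);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three selectors copied in this section share this prefix and suffix.
        String prefix = "body > table:nth-child(12) > tbody > tr > td:nth-child(3) > table > tbody > tr > td"
                + " > table:nth-child(2) > tbody > tr > td:nth-child(2) > ";
        String suffix = " > tbody > tr > td:nth-child(1) > a";
        System.out.println(generalize(List.of(
                prefix + "table:nth-child(1)" + suffix,
                prefix + "table:nth-child(2)" + suffix,
                prefix + "table:nth-child(3)" + suffix)));
    }
}
```

Note that the generalized step matches every table directly under that cell, so it is worth checking on the real page that it does not also pick up unrelated tables.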

You can paste this selector into the browser's developer tools and press Enter to try it.
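
The same check can be done from Java. The following is a minimal sketch, assuming jsoup is on the classpath. The class name `LinkExtractor`, the sample HTML, the base URL `http://example.com/`, and the shortened selector are made up for illustration and are not taken from the original program. It relies on jsoup's `Jsoup.parse`, `Document.select`, and `Element.absUrl`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not from the original post.
final class LinkExtractor {

    /** Returns the absolute href of every element matched by cssQuery. */
    static List<String> extractLinks(String html, String baseUri, String cssQuery) {
        // baseUri lets jsoup resolve relative hrefs such as "/a.html".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select(cssQuery)) {
            String url = link.absUrl("href");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up fragment: one outer cell holding one small table per link.
        String html = "<table><tr><td>"
                + "<table><tr><td><a href='/a.html'>A</a></td></tr></table>"
                + "<table><tr><td><a href='/b.html'>B</a></td></tr></table>"
                + "</td></tr></table>";
        // Shortened stand-in for the long generalized selector in this section.
        String selector = "td > table > tbody > tr > td:nth-child(1) > a";
        for (String url : extractLinks(html, "http://example.com/", selector)) {
            System.out.println(url);
        }
    }
}
```

For a live page, the document could instead be fetched with `Jsoup.connect(url).get()` and queried with the full selector, though whether the selector still matches depends on how the site's markup is parsed.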