Crawler Basics (Real-Time News Collector), Part 2

Article link: https://blog.csdn.net/weixin_42061487/article/details/88909773

A rare free weekend, so I'm continuing my crawler studies.

Implementing the crawler's manager class

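In the parser package created earlier, add a manager class that parses the HTML source of the downloaded web pages.
Then add a POJO class (ParserResultEntity) to hold the parsed results.
1) Outer code: handle the nested list structure by grabbing the ul block first and then matching the li blocks inside it, all with regular expressions.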

	List<ParserResultEntity> resultList = new ArrayList<ParserResultEntity>();
	// First grab the whole ul block
	String regexUl = "<ul[\\s\\S]*?</ul>";
	String ulBlockHtmlSource = RegexUtil.getText(htmlSource, regexUl, 0);
	// Match each li block inside the ul
	String regexLi = "<li[\\s\\S]*?</li>";
	Matcher matcher = RegexUtil.getMatcher(ulBlockHtmlSource, regexLi);
	while (matcher.find()) {
		String liHtmlSource = matcher.group();
		ParserResultEntity resultEntity = new ParserResultEntity();
		...................
	}

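(resultList holds the title, link, and time parsed from the page.)
Inside the while loop, read the fields needed from each li block, add the entity to the list, and later store the list in the database.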
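The post does not show RegexUtil itself, so here is a minimal sketch of what such a helper might look like, built only on java.util.regex. The class name RegexSketch, the method extractListItems, and the sample HTML are my own assumptions for illustration; only the getText and getMatcher call shapes come from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the post's RegexUtil; the real class is not shown.
class RegexSketch {

    // Returns the requested group of the first match, or null when nothing matches.
    static String getText(String input, String regex, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(group) : null;
    }

    // Returns a Matcher so the caller can walk over every match with find().
    static Matcher getMatcher(String input, String regex) {
        return Pattern.compile(regex).matcher(input);
    }

    // Collects every <li>...</li> block inside the first <ul>...</ul> block.
    static List<String> extractListItems(String htmlSource) {
        List<String> items = new ArrayList<>();
        String ulBlock = getText(htmlSource, "<ul[\\s\\S]*?</ul>", 0);
        if (ulBlock == null) {
            return items; // no list on the page
        }
        Matcher matcher = getMatcher(ulBlock, "<li[\\s\\S]*?</li>");
        while (matcher.find()) {
            items.add(matcher.group());
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample page, not the real news site.
        String html = "<div><ul><li><a href=\"./a.html\">First</a></li>"
                + "<li><a href=\"./b.html\">Second</a></li></ul></div>";
        for (String li : extractListItems(html)) {
            System.out.println(li);
        }
    }
}
```

One thing to keep in mind: the lazy pattern stops at the first closing ul tag it meets, so if a page really nests one ul inside another, the captured block ends at the inner list's closing tag.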
2) Parsing the title, time, and URL
Parsing postDate (the news publication time)

	// Parse postDate
	String regexPostdate = "<font>([\\s\\S]*?)</font>";
	String dateString = RegexUtil.getText(liHtmlSource, regexPostdate, 1);
	Date postdate = DateUtil.parseStringToDate(dateString);
	resultEntity.setPostDate(postdate);
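DateUtil is not shown in the post either. As a rough sketch of the string-to-Date step, assuming the site prints dates as yyyy-MM-dd (an assumption on my part, so adjust the pattern to whatever the page actually shows), something like this would do. The class name DateSketch is hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for the post's DateUtil; the real class is not shown.
class DateSketch {

    // Assumed pattern; change it to match the date text on the target site.
    private static final String PATTERN = "yyyy-MM-dd";

    // Returns null instead of throwing when the text is missing or malformed.
    static Date parseStringToDate(String text) {
        if (text == null) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setLenient(false); // reject impossible dates such as month 13
        try {
            return format.parse(text.trim());
        } catch (ParseException e) {
            return null;
        }
    }

    // Current time, used for the database insert timestamp.
    static Date getCurrentDate() {
        return new Date();
    }

    public static void main(String[] args) {
        System.out.println(parseStringToDate("2019-03-30"));
        System.out.println(parseStringToDate("not a date"));
    }
}
```

Returning null keeps one bad li block from aborting the whole loop; the caller can then decide whether to skip that entity.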

Parsing the URL (the link to the news item)

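	// Parse the URL
	String regexURL = "href=\"([\\s\\S]*?)\"";
	String href = RegexUtil.getText(liHtmlSource, regexURL, 1);
	// Drop the first two characters of the relative href (presumably "./") and prepend the site root
	href = StaticValue.rootURL + href.substring(2);
	resultEntity.setSourceUrl(href);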
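The substring(2) call only works when every href starts with the same two-character prefix. As an alternative sketch (not what the post does), java.net.URI can resolve a relative link against the page URL. The class name UrlSketch and the example.com URLs are made up for illustration.

```java
import java.net.URI;

// Alternative to rootURL + href.substring(2): let java.net.URI resolve the link.
class UrlSketch {

    // Resolves href against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException for malformed input such as raw spaces.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/news/index.html"; // hypothetical page URL
        System.out.println(toAbsolute(page, "./2019/a.html"));
        System.out.println(toAbsolute(page, "../x.html"));
        System.out.println(toAbsolute(page, "/top.html"));
    }
}
```

This handles "./", "../", root-relative, and already absolute links in one place, at the cost of needing the page URL rather than a fixed site root.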

Parsing the title (the news headline)

	// Parse the title
	String regexTitle = "<a[\\s\\S]*?>([\\s\\S]*?)</a>";
	String title = RegexUtil.getText(liHtmlSource, regexTitle, 1);
	resultEntity.setTitle(title);

Finally, set the database insert time and add the entity to the list

	// Finally, set the database insert time
	resultEntity.setInsertDate(DateUtil.getCurrentDate());
	resultList.add(resultEntity);
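For reference, the setters called above suggest an entity roughly like the one below. This is only a guess at its shape: the field names are inferred from setTitle, setSourceUrl, setPostDate, and setInsertDate, and the class is named ParserResultEntitySketch to make clear it is not the author's actual ParserResultEntity.

```java
import java.util.Date;

// Hypothetical shape of the result POJO, inferred only from the setters used in the post.
class ParserResultEntitySketch {
    private String title;     // news headline
    private String sourceUrl; // absolute link to the news item
    private Date postDate;    // publication time parsed from the page
    private Date insertDate;  // time the row is written to the database

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getSourceUrl() { return sourceUrl; }
    public void setSourceUrl(String sourceUrl) { this.sourceUrl = sourceUrl; }

    public Date getPostDate() { return postDate; }
    public void setPostDate(Date postDate) { this.postDate = postDate; }

    public Date getInsertDate() { return insertDate; }
    public void setInsertDate(Date insertDate) { this.insertDate = insertDate; }

    @Override
    public String toString() {
        return "ParserResultEntitySketch{title=" + title + ", sourceUrl=" + sourceUrl
                + ", postDate=" + postDate + ", insertDate=" + insertDate + "}";
    }
}
```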

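Note that the date text scraped from the page has to be converted into a Date, so its format must match what the parser expects.
Next comes the database layer, which involves the following:
1. Setting up the database environment
2. Basic operations and testing
3. Database design
4. Connecting to the database and operating on its tables from Java
5. Data validation
These database operations are all basics, so I won't go into detail here. A few things to keep in mind, though: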

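Catch exceptions rather than simply rethrowing them
When inserting multiple rows, use Batch to process them in bulk
Keep the database username and password in a file so they are easier to manage
Watch the character encoding to avoid garbled text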
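The reminders above can be sketched with plain JDBC as follows. Everything specific here is an assumption for illustration: the class name PersistenceSketch, the Row holder, the news table and its column names, and the db.url, db.user, and db.password property keys. No particular database or driver is implied.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.sql.Types;
import java.util.Date;
import java.util.List;
import java.util.Properties;

class PersistenceSketch {

    // Hypothetical table and column names.
    private static final String INSERT_SQL =
            "INSERT INTO news (title, source_url, post_date, insert_date) VALUES (?, ?, ?, ?)";

    // Minimal row holder so the sketch stands alone.
    static class Row {
        final String title;
        final String sourceUrl;
        final Date postDate; // may be null when the date could not be parsed

        Row(String title, String sourceUrl, Date postDate) {
            this.title = title;
            this.sourceUrl = sourceUrl;
            this.postDate = postDate;
        }
    }

    // Reads credentials from a properties source instead of hard-coding them.
    static Properties loadConfig(Reader reader) {
        Properties config = new Properties();
        try {
            config.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read database config", e);
        }
        return config;
    }

    // Opens a connection using the assumed property keys.
    static Connection open(Properties config) throws SQLException {
        return DriverManager.getConnection(config.getProperty("db.url"),
                config.getProperty("db.user"), config.getProperty("db.password"));
    }

    // Queues every row and sends them in one batch; returns -1 if the batch fails.
    static int insertBatch(Connection connection, List<Row> rows) {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            Timestamp now = new Timestamp(System.currentTimeMillis());
            for (Row row : rows) {
                statement.setString(1, row.title);
                statement.setString(2, row.sourceUrl);
                if (row.postDate == null) {
                    statement.setNull(3, Types.TIMESTAMP);
                } else {
                    statement.setTimestamp(3, new Timestamp(row.postDate.getTime()));
                }
                statement.setTimestamp(4, now);
                statement.addBatch();
            }
            return statement.executeBatch().length;
        } catch (SQLException e) {
            // Caught here so one failed batch does not crash the crawler.
            System.err.println("Batch insert failed: " + e.getMessage());
            return -1;
        }
    }
}
```

For the encoding point, opening the properties file with an explicit charset, for example Files.newBufferedReader(path, StandardCharsets.UTF_8), avoids depending on the platform default. How the connection itself is told to use UTF-8 depends on the driver, so check its documentation.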

Next, wiring the pieces together

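Write a SystemControler class that ties all the pieces together.
It works in the following 5 steps:
1. Get the seed URL
2. Hand it to the task scheduler for later download
3. Download the URL taken from the task list; return null if there is none
4. Parse the downloaded htmlSource
5. Persist the parsed objects (store them in the database)
The code is as follows: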


import java.util.List;

import com.tl.spider.download.DownLoadManager;
import com.tl.spider.parser.HtmlParserManager;
import com.tl.spider.persistence.DataPersistenceManager;
import com.tl.spider.pojos.ParserResultEntity;
import com.tl.spider.schedule.ScheduleManager;
import com.tl.spider.ui.UIManager;

public class SystemControler {
	// Wires the layers together
	public static void main(String[] args) throws Exception {
		/*
		 * 1. Get the seed URL
		 */
		String seedUrl = UIManager.getSeedUrl();
		/*
		 * 2. Hand it to the task scheduler for later download
		 */
		ScheduleManager.addSeedUrlToTaskList(seedUrl);
		/*
		 * 3. Download the URL taken from the task list; returns null if there is none
		 */
		String htmlSource = DownLoadManager.download();
		/*
		 * 4. Parse the downloaded htmlSource
		 */
		if (htmlSource != null) {
			List<ParserResultEntity> resultEntityList = HtmlParserManager.parseHtml(htmlSource);
			/*
			 * 5. Persist the parsed objects
			 */
			DataPersistenceManager.persist(resultEntityList);
		} else {
			System.out.println("No htmlSource found to parse; this round is finished!");
		}
		
		System.out.println("done");
	}
}
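ScheduleManager is not shown in the post, so steps 2 and 3 can only be sketched under assumptions. Below is one minimal way a task list might behave: URLs go into a queue, and asking for the next task returns null when nothing is left, which is the case the null check in main guards against. The class name TaskQueueSketch and the duplicate filtering are my own additions, not something the post describes.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical task list; the post's ScheduleManager may work differently.
class TaskQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Queues a URL once; returns false for null, blank, or already seen URLs.
    boolean add(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }
        if (!seen.add(url)) {
            return false; // duplicate filtering is an added assumption
        }
        pending.add(url);
        return true;
    }

    // Returns the next URL to download, or null when no task is left.
    String next() {
        return pending.poll();
    }

    public static void main(String[] args) {
        TaskQueueSketch tasks = new TaskQueueSketch();
        tasks.add("http://example.com/news/index.html"); // made-up seed URL
        System.out.println(tasks.next());
        System.out.println(tasks.next());
    }
}
```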