Java Web Spider: A Java-Based Web Spider Program

不期风吟

Published 2021-02-17 00:59:06


Article tags: Java web spider

Article link: https://blog.csdn.net/weixin_33924688/article/details/114281092


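Content overview

Original document published by member 刘丽

Java-Based Web Spider Program

12,000 characters, 42 pages

Includes the thesis proposal, the task statement, the defense slides (PPT), and the thesis text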

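Summary

In the early days of the Internet, websites were relatively few and information was easy to find. With the explosive growth of the Internet, however, finding the material one needs became like looking for a needle in a haystack for ordinary users, and specialized search sites emerged to meet the public's demand for information retrieval. The web spider program is a key part of Web search engine technology.

Building on existing theory, this thesis implements a spider program that starts crawling from a given URL, uses database queue techniques to manage page links, and downloads the pages it visits to the local hard disk. The Lucene toolkit is used to index the downloaded resources. Classes in java.net are used for the Spider's communication with the outside world and for handling the URL links found in web pages. The thesis studies in detail the spider's core classes (the communication core and the spider working core) and the building and searching of the resource index.

Through design and analysis, the author completed a working spider crawler. The program implements the functions set out in the initial design and collects and organizes network resources. The functions passed testing, and the program runs normally and stably.

Finally, the thesis summarizes the work and looks ahead to directions for future development.

Keywords: HTTP, thread, Spider, Lucene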

Abstract

At the initial stage of Internet development, there were few websites, so searching for information was comparatively easy. However, with the explosive growth of the Internet, finding information became very hard for ordinary users, which called for professional search websites. The web spider program is a crucial part of web search engine technology.

This paper implements a spider program that starts searching from a given website address, uses database queue technology to manage page links, and downloads visited resources to the local hard drive. The Lucene toolkit is used to index the downloaded resources. The paper focuses on the following: the core classes of the spider program (communication core, spider working core), and the building and searching of the resource index.

Through design and analysis, I have finished my own spider crawler. The program was completed according to the initial design and implements the collection and organization of network resources. These functions passed testing, and the program runs normally.

Key words: HTTP, thread, spider, Lucene
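
The crawl procedure summarized in the abstract (start from a given address, keep the discovered links in a queue, visit each page once) can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `CrawlSketch`, its methods, and the `example.com` pages are hypothetical. An in-memory queue stands in for the database queue the thesis describes, a simple regular expression stands in for real HTML parsing, and an in-memory map replaces actual HTTP downloads so the example runs without a network. Threads and Lucene indexing are left out.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlSketch {
    // Simplified href matcher (assumption); a real spider would use an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // In-memory stand-ins for the database-backed queue described in the thesis.
    private final Queue<String> waiting = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Queues a URL unless it was queued before; returns true if it was added. */
    boolean enqueue(String url) {
        if (seen.add(url)) {
            waiting.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to visit, or null when the queue is empty. */
    String next() {
        return waiting.poll();
    }

    /** Extracts href values from html and resolves them against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                // Keep only web links; skip mailto:, javascript:, and similar.
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: ignore it and continue with the next match.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory "web" standing in for real HTTP downloads.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/",
            "<a href=\"a.html\">A</a> <a href=\"/b.html\">B</a>");
        pages.put("http://example.com/a.html",
            "<a href=\"/\">home</a> <a href=\"b.html\">B</a>");

        CrawlSketch spider = new CrawlSketch();
        spider.enqueue("http://example.com/");
        String url;
        while ((url = spider.next()) != null) {
            System.out.println("visiting " + url);
            String html = pages.getOrDefault(url, "");
            for (String link : extractLinks(url, html)) {
                spider.enqueue(link);
            }
        }
    }
}
```

The `seen` set is what keeps the loop finite: a page that links back to an already queued address is not queued again. In a multithreaded or database-backed version, that check and the enqueue step would have to happen atomically.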

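1 Introduction

1.1 Research background

1.2 State of research in China and abroad

1.3 Structure of this thesis

2 Program design goals and strategy

2.1 Program analysis

2.1.1 Multithreaded search

2.1.2 Database queue management

2.1.3 Search engine: Lucene

2.2 Technical analysis of functional points

2.2.1 How the Spider obtains URL links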