Web Search Engines: Principles and Analysis of Web Crawlers

Latest recommended article published on 2024-11-07 15:39:03

ztbzg

Reposted from: http://www.360doc.com/content/10/0519/09/1007797_28335641.shtml

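A search engine involves work at every layer, from
1. web page download,
2. text analysis,
3. index generation,
4. index storage, to
5. information retrieval.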

Discussion site: http://www.chengshibianyuan.cn

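Criteria for judging the quality of a search engine:
1. Relevance
2. Data volume (index size)
3. Recall
4. Response speed
5. Update speed (freshness)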
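
Recall, the third criterion above, is the fraction of all relevant documents that the engine actually returns. The sketch below makes that concrete. It assumes the full set of relevant documents is known in advance, which in practice is only true for a test collection; the class name RecallDemo and the document IDs are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RecallDemo {

    // Recall = (relevant documents returned) / (all relevant documents).
    static double recall(Set<Integer> returned, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch; recall is undefined here
        }
        Set<Integer> hits = new HashSet<Integer>(returned);
        hits.retainAll(relevant); // keep only the returned documents that are relevant
        return (double) hits.size() / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = new HashSet<Integer>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> relevant = new HashSet<Integer>(Arrays.asList(2, 4, 6, 8));
        // Two of the four relevant documents were returned.
        System.out.println("recall = " + recall(returned, relevant)); // prints "recall = 0.5"
    }
}
```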

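An Internet search engine has five main stages, which correspond to the system's main modules:
1. Fetching web page information;
   1. Depth-first collection strategy
   2. IP-range scanning strategy
   3. Breadth-first collection strategy
2. Analyzing web page content;
   1. Word segmentation
   2. Filtering
   3. Conversion
3. Building the web page index;
   1. The indexer is the software that builds the index;
   2. Index models: inverted file, vector space model, probabilistic model, and so on; index tables generally use an inverted index.
4. Ranking the retrieval results;
5. Retrieval tools and interfaces.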
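
The difference between the depth-first and breadth-first collection strategies listed above comes down to how the crawler picks the next page from its list of discovered links. The following is a minimal sketch over an in-memory link graph rather than real HTTP fetches; the class name CrawlOrderDemo, the page names A to E, and the sample graph are all hypothetical. IP-range scanning is not covered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlOrderDemo {

    // Returns the order in which pages are visited, starting from the seed page.
    static List<String> crawlOrder(Map<String, List<String>> links, String seed, boolean breadthFirst) {
        List<String> visited = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Deque<String> frontier = new ArrayDeque<String>();
        frontier.addLast(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            // FIFO frontier: oldest link first (breadth-first).
            // LIFO frontier: most recently discovered link first (depth-first style).
            String page = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            visited.add(page);
            List<String> outgoing = links.containsKey(page)
                    ? links.get(page) : Collections.<String>emptyList();
            for (String next : outgoing) {
                if (seen.add(next)) { // add() returns false for pages already queued
                    frontier.addLast(next);
                }
            }
        }
        return visited;
    }

    // Hypothetical link graph: A links to B and C, B links to D, C links to E.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("E"));
        return links;
    }

    public static void main(String[] args) {
        System.out.println(crawlOrder(sampleGraph(), "A", true));  // [A, B, C, D, E]
        System.out.println(crawlOrder(sampleGraph(), "A", false)); // [A, C, E, B, D]
    }
}
```

Breadth-first order reaches every page one link away from the seed before going deeper, whereas the stack-based order follows one chain of links to its end before backing up.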

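Java APIs used:
java.net.*;
java.net.www.html; provides methods for handling HTML,
and java.net.www.http; provides methods for handling the HTTP protocol.
(The last two package names are kept as given in the original; they are not part of the standard java.net API, and the crawler example below relies only on java.net and java.io.)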

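Worked example:

Step 1: Fetching the data

Rough design of the web collection program:
   1. Determine the URL of the page to download, specify the communication port, and create a
   Socket object for network communication; the default port for downloading web pages is 80.
   2. Create the stream objects: the HTTP download request is written to the Socket through its
   output stream, and the response is read back through its input stream.
   3. When the remote web server receives the request, it sends a response message. The local Socket object receives the message, buffers it, and outputs it, which completes the page download.
Practice:
   Write a simple program that downloads a web page and saves it locally. WebCrawlerMe.java is as follows: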

package chapter2;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.UnknownHostException;

/**
* @author gf
*
*/
public class WebCrawlerMe {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String host = "www.bnu.edu.cn";

        // try-with-resources closes the socket and the output file even on failure
        try (Socket webclient = new Socket(host, 80); // the download connection
             PrintWriter pw = new PrintWriter(new FileOutputStream("D:\\aa.jsp"))) {

            // Output stream: carries the request to the server
            PrintWriter result = new PrintWriter(webclient.getOutputStream());
            // Input stream: carries the server's response back
            BufferedReader receiver = new BufferedReader(new InputStreamReader(
                    webclient.getInputStream()));

            // Send the HTTP request; HTTP requires CRLF line endings,
            // and the Host header must match the host being contacted
            result.print("GET / HTTP/1.1\r\n");
            result.print("Host: " + host + "\r\n");
            result.print("Connection: close\r\n");
            result.print("\r\n");
            result.flush();

            // Receive the HTTP response: read() blocks until data arrives
            // and returns -1 once the server closes the connection
            StringBuilder sb = new StringBuilder(8096);
            int ch;
            while ((ch = receiver.read()) != -1) {
                sb.append((char) ch);
            }

            // Print the response (headers and page body) to the console
            System.out.println(sb.toString());

            // Save the same content to the local file
            pw.print(sb.toString());

        } catch (UnknownHostException e) {
            System.err.println("Cannot reach the specified host");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("Download failed; please check that the address is correct");
            e.printStackTrace();
        }
    }
}
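
The text captured by the crawler above is the raw HTTP response, so it contains the status line and headers as well as the page itself. A possible next step is to separate the two before analysis. The sketch below assumes a well-formed response whose header block ends at the first empty line, and it does not decode chunked transfer encoding; the class name HttpResponseSplitter and the sample response are illustrative only.

```java
class HttpResponseSplitter {

    // Returns {headers, body}. The header block ends at the first CRLF CRLF.
    static String[] split(String response) {
        int sep = response.indexOf("\r\n\r\n");
        if (sep < 0) {
            return new String[] { response, "" }; // no body found
        }
        return new String[] { response.substring(0, sep), response.substring(sep + 4) };
    }

    // Extracts the numeric code from a status line such as "HTTP/1.1 200 OK".
    static int statusCode(String response) {
        String statusLine = response.split("\r\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n"
                + "\r\n"
                + "<html>hi</html>";
        System.out.println(statusCode(response)); // 200
        System.out.println(split(response)[1]);   // <html>hi</html>
    }
}
```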

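Step 2: Implementing the web page analysis program

How it works:
   Analysis covers basic tasks such as removing formatting characters, segmenting the body text, and filtering information.
   1. Simple markup removal relies on the characteristics of the HTML language;
   2. Regular-expression extraction uses templates to pull out the useful information;
   3. DOM-tree content extraction exploits the fact that every web page has a hierarchical structure.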
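
As an illustration of the regular-expression approach in item 2, the sketch below pulls link targets out of anchor tags with java.util.regex. The pattern is a simple template chosen for this example, not a full HTML parser, and it will miss unquoted attribute values; the class name LinkExtractor and the sample HTML are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Template: an <a ...> tag containing href="..." or href='...'.
    // Group 1 captures the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://a.example/\">A</a> and "
                + "<A class='x' HREF='/b.html'>B</A></p>";
        System.out.println(extractLinks(html)); // [http://a.example/, /b.html]
    }
}
```

Links extracted this way are what a crawler would feed back into its download queue.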
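System structure:
Original HTML document ----> structural analysis   ----> page structure information
                       ----> visual block analysis ----> visual feature information ----> information filtering
                       ----> HTML tag removal      ----> page text information
Practice:
   1. Write a simple program that removes the <> tags from a web page. WebParser.java is as follows:
   2. Write a simple character-level segmenter that filters out punctuation and separates individual Chinese characters and English words with spaces.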
package chapter2;

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

/**
* @author gf
*
*/
public class WebParser {

    private static String src_File_Path= "D:\\workshop\\htmlsrc.html";
    private static String dst_File_Path= "D:\\workshop\\htmlsrc.txt";
    /**
     * @param args
     */
    public static void main(String[] args) {

        try {
            parser();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public static void parser() throws IOException{

        // Flag: false while inside a <...> tag,
        // true while in the text outside tags
        boolean bContent = true;

        // Holds the whole source file; initial capacity is 8096*2 characters
        StringBuilder source = new StringBuilder(8096*2);
        // Holds the extracted text
        StringBuffer sBuffer = new StringBuffer(8096*2);

        char [] cBuffer = new char[8096*2];
        int nCount = 0;

        // Open both files; try-with-resources closes them even if an exception is thrown
        try (FileReader fileReader = new FileReader(src_File_Path);
             FileWriter fileWriter = new FileWriter(dst_File_Path)) {

            // read() returns the number of characters read, or -1 at end of file;
            // loop so that files larger than the buffer are read completely
            while ((nCount = fileReader.read(cBuffer)) != -1) {
                source.append(cBuffer, 0, nCount);
            }
            String html = source.toString();

            // Process the text
            for (int i = 0; i < html.length(); i++) {
                char c = html.charAt(i);
                if (!bContent) {
                    if (c == '>') {
                        bContent = true; // the tag ends here
                    }
                    continue;
                }
                if (c == '<') {
                    bContent = false; // a tag starts here
                    continue;
                }
                if (c == '\n' || c == '\r' || c == ' ') { // filter whitespace
                    continue;
                }
                if (html.startsWith("&nbsp", i)) { // filter &nbsp without reading past the end
                    i += 4; // now on the 'p'
                    if (i + 1 < html.length() && html.charAt(i + 1) == ';') {
                        i++; // also skip the trailing ';'
                    }
                    continue;
                }
                sBuffer.append(c);
            }
            System.out.println(sBuffer.toString());
            fileWriter.write(sBuffer.toString());
        }
    }
}
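
The listing above covers only the first practice task, tag removal. For the second task, a minimal sketch of character-level segmentation might look as follows: it drops punctuation and whitespace, keeps each Chinese character as its own token, keeps runs of ASCII letters and digits together as one word, and joins the tokens with single spaces. The class name SimpleSegmenter is hypothetical, and treating only ASCII letters and digits as "English" is an assumption of this sketch; characters outside the Basic Multilingual Plane are not handled.

```java
class SimpleSegmenter {

    static String segment(String text) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder(); // current run of ASCII letters/digits
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean asciiWordChar = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (asciiWordChar) {
                word.append(c);
                continue;
            }
            flush(word, out); // a non-word character ends the current English word
            if (Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN) {
                append(out, String.valueOf(c)); // each Chinese character is one token
            }
            // Everything else (punctuation, whitespace, symbols) is dropped.
        }
        flush(word, out);
        return out.toString();
    }

    private static void flush(StringBuilder word, StringBuilder out) {
        if (word.length() > 0) {
            append(out, word.toString());
            word.setLength(0);
        }
    }

    private static void append(StringBuilder out, String token) {
        if (out.length() > 0) {
            out.append(' ');
        }
        out.append(token);
    }

    public static void main(String[] args) {
        // Two Chinese characters, the word "Java", two more Chinese characters,
        // an ASCII comma and space, "engine2", and a Chinese full stop.
        String text = "\u5317\u4EACJava\u641C\u7D22, engine2\u3002";
        System.out.println(segment(text));
    }
}
```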

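Step 3: Implementing the web page indexing program

How it works:
   The program uses keyword matching, and the core algorithm is built on an inverted index structure.
   An inverted index is an index structure in which each keyword serves as both the index key and the entry point to a linked list. It is usually kept in memory to speed up access:
   the index key leads directly to a document list, from which the documents being sought are finally determined. Documents in a search engine are usually not stored as separate files
   but in one huge document repository. To save memory, a document is usually represented in the index by its repository number and its file offset.
   When a keyword is looked up, the result can be computed directly from this position information, and the content can be read from memory or from a disk file when needed.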
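
To illustrate how a document can be located by its position in one large repository file, here is a minimal sketch built on java.io.RandomAccessFile. Storing a byte length alongside the offset is an assumption of this sketch (the text above mentions only a repository number and an offset), and the class name DocumentStore and the temporary file are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class DocumentStore {

    private final RandomAccessFile file;

    DocumentStore(File path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Appends a document to the end of the repository file
    // and returns {offset, length}, both in bytes.
    long[] append(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        long offset = file.length();
        file.seek(offset);
        file.write(bytes);
        return new long[] { offset, bytes.length };
    }

    // Reads a document back using only its position information.
    String read(long offset, int length) throws IOException {
        byte[] bytes = new byte[length];
        file.seek(offset);
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("docstore", ".bin");
        tmp.deleteOnExit();
        DocumentStore store = new DocumentStore(tmp);
        store.append("first page");
        long[] second = store.append("second page");
        // The index only needs to remember the two numbers in 'second'.
        System.out.println(store.read(second[0], (int) second[1])); // second page
        store.close();
    }
}
```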
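Design of the web page indexing program
   The basic idea is to use document keywords as index keys and build one linked list per keyword; each list is the set of documents that contain that keyword.
   At search time, the keyword's hash value (or some other mapping) locates the keyword's list quickly. Sorting and filtering the resulting document set then yields
   the documents needed.

   The program is organized around two nested loops that process every morpheme of every document. Each morpheme is used as a keyword to generate a hash value, and the associated document information is saved as an index entry.
   Every keyword's index entry has to be added to the hash table, but the program must first check whether an entry for the same keyword already exists.
   If it does, the existing entry is retrieved and linked after the new entry, forming the linked list of documents associated with that keyword.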
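
The same idea can be sketched with the standard collections before looking at the hand-built linked list used in the practice code. In the sketch below, a HashMap maps each term to a sorted set of document IDs, and a query with several terms filters the result by intersecting their sets. Splitting on whitespace and the class name TinyInvertedIndex are assumptions made for this example; they are not the author's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class TinyInvertedIndex {

    // term -> sorted IDs of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            // Create the posting set on first use, then record this document.
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Returns the documents that contain every one of the given terms.
    TreeSet<Integer> search(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs = postings.get(term.toLowerCase());
            if (docs == null) {
                return new TreeSet<Integer>(); // an unknown term matches nothing
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs); // copy so the index is not modified
            } else {
                result.retainAll(docs); // filter: keep only documents in both sets
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, "beijing normal university");
        index.add(1, "normal university primary school");
        index.add(2, "university middle school");
        System.out.println(index.search("normal"));               // [0, 1]
        System.out.println(index.search("university", "school")); // [1, 2]
    }
}
```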

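Practice:
1. The indexing program has two parts: an index entry class, which maintains an index entry and its attributes (InfoItem.java),
and an index creation class, which builds the index (WordIndex.java).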

package chapter2.index;

/**
* @author gf
* @date 2008-10-16
*/
// Function: index entry class holding the storage location and a pointer to the next entry,
// plus a set of getters and setters
public class InfoItem {

    public int fileID; // file number
    public int offset; // offset within the file
    InfoItem next; // pointer to the next node

    /**
     * Default constructor
     */
    public InfoItem() {
        super();
    }

    /**
     * @param fileID
     * @param offset
     * Initializes the fields
     */
    public InfoItem(int fileID, int offset) {
        super();
        this.fileID = fileID;
        this.offset = offset;
        this.next = null;
    }

    /**
     * @return the fileID
     */
    public final int getFileID() {
        return fileID;
    }

    /**
     * @param fileID the fileID to set
     */
    public final void setFileID(int fileID) {
        this.fileID = fileID;
    }

    /**
     * @return the next
     */
    public final InfoItem getNext() {
        return next;
    }

    /**
     * @param next the next to set
     */
    public final void setNext(InfoItem next) {
        this.next = next;
    }

    /**
     * @return the offset
     */
    public final int getOffset() {
        return offset;
    }

    /**
     * @param offset the offset to set
     */
    public final void setOffset(int offset) {
        this.offset = offset;
    }
}

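package chapter2.index;

import java.util.Hashtable;

/**
* @author gf
*
*/
public class WordIndex {

    // The fields and index() were missing from this listing; they are reconstructed
    // here from the matching members of SearchWordIndex shown in step 4.
    private static Hashtable<String, InfoItem> keyWordIdx; // keyword hash table
    // Sample file contents; each array element stands for one file
    private static String [] FileList = {"北京师范大学",
        "北师大附属试验小学","北师大第二附属中学"};

    // Build a hash index with one key per Chinese character
    public static void index() throws Exception{
        keyWordIdx = new Hashtable<String, InfoItem>();
        for(int i=0;i<FileList.length;i++){
            for(int j=0;j<FileList[i].length();j++){
                InfoItem item = new InfoItem(i,j); // file number and character offset
                String key = FileList[i].substring(j,j+1);
                item.setNext(keyWordIdx.get(key)); // null when the key is new
                keyWordIdx.put(key, item); // the new node becomes the list head
            }
        }
        System.out.println("index:Hashtable finish Size:"+keyWordIdx.size());
    }

    public static void main(String[] args) {
        try {
            index();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}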

Step 4: Implementing the search program

package search;

import java.util.Hashtable;

import chapter2.index.InfoItem;

/**
* @author gf
*
*/
public class SearchWordIndex {

    private static Hashtable<String, InfoItem> keyWordIdx; // keyword hash table
    // Original file contents; each array element stands for one file
    private static String [] FileList = {"北京师范大学",
        "北师大附属试验小学","北师大第二附属中学"};


    // Build a hash index over the given contents, one key per Chinese character
    public static void index() throws Exception{
        InfoItem item1,item2;
        System.out.println("index:===========begin==========");
        keyWordIdx = new Hashtable<String, InfoItem>(); // create the keyword hash table

        // Initial size of the keyword table
        System.out.println("index:Hashtable initial Size:"+keyWordIdx.size());

        for(int i=0;i<FileList.length;i++){
            int len = FileList[i].length();
            for(int j =0;j<len;j++){
                // Record where this occurrence lives: file number i, character offset j
                item1 = new InfoItem(i,j);
                String key = FileList[i].substring(j,j+1);
                System.out.print(key);
                if(!keyWordIdx.containsKey(key)){
                    keyWordIdx.put(key, item1);
                }else{
                    // Fetch the entry already stored for this keyword
                    item2 = keyWordIdx.get(key);
                    item1.setNext(item2); // the new node goes at the head of the list
                    keyWordIdx.put(key, item1); // store the new list head in the hash table
                }
            }
        }
        System.out.println("\nindex:Hashtable finish Size:"+keyWordIdx.size());
        System.out.println("index:===========end==========");
    }

    // Look up the given keyword in the hash index
    public static void search(String keyword) throws Exception{
        InfoItem item;
        System.out.println("Search:=======begin=======");
        if(null == keyWordIdx){
            return;
        }
        // Get the list of index entries for the keyword
        item = keyWordIdx.get(keyword);
        // Walk the list and print each entry
        while(null != item){
            // File number
            System.out.println("Search:File number :"+item.getFileID());
            // Offset within the file
            System.out.println("Search:File offset :"+item.getOffset());
            // File content, looked up by file number
            System.out.println("Search:File Content :"+FileList[item.getFileID()]);

            // Move on to the next index entry
            item = item.getNext();
        }
        System.out.println("Search:=======end=======");
    }

    /**
     * @param args
     */
    public static void main(String[] args) {