A web crawler, also called a web spider (some projects call it a "walker"), is defined by Wikipedia as a program that systematically scans the Internet for the purpose of indexing. There are many open-source crawler projects on the web, the best known being Heritrix and Apache Nutch.
Sometimes you need to collect information from the web. When that information can be fetched in a single, uniform way but is tedious to gather by hand — counting how many posts a site publishes each month and which tags it uses, collecting corpora for a natural language processing project, or collecting images for a pattern recognition project — a crawler is the right tool for the job. A web crawler is also an essential component of any search engine.
Many web crawlers are written in Python, Java, or C#. The program given here is a Java crawler. To save time and space, it is restricted to pages under this blog's address (that is, http://johnhany.net/, excluding everything under http://johnhany.net/wp-content/), and it collects all the tags used across those pages. With the restriction removed, the same code can scan the wider web; with small changes to the output format, it can serve as a tool for generating the blog's sitemap.
The code can also be downloaded here: johnhany/WPCrawler.
Requirements
My development environment is Windows 7 + Eclipse.
XAMPP is needed to provide the port through which the MySQL database is reached by URL.
Three open-source Java libraries are also used:
Apache HttpComponents 4.3 provides the HTTP interface, used to send HTTP requests to a target URL and fetch the page content;
HTML Parser 2.0 parses the pages and extracts links from the DOM nodes;
MySQL Connector/J 5.1.27 connects the Java program to MySQL, so the database can be driven from Java code.
Code
The code lives in three files: crawler.java, httpGet.java and parsePage.java, all in the package net.johnhany.wpcrawler.
crawler.java
```java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if we reach the bottom of the list
                        break;
                    }

                    //set a limit on the crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;
            System.out.println("Done.");
            System.out.println(count);
        }
    }
}
```
httpGet.java
```java
package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);

            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */

            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}
```
parsePage.java
```java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save pages from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {
                                }
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {
                                }
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {
                                }
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
```
How the Program Works
The "Internet" is, as the name suggests, a net: between any two nodes there may be a path. From a graph-theory point of view, a crawler's scan of the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; for background, see the tree traversal article at https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_algorithms/Tree_traversal. My program uses breadth-first traversal.
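To make the breadth-first idea concrete, here is a minimal sketch of the same strategy using an in-memory queue and visited set. It is illustrative only: the actual program below keeps its frontier in the record table instead, with the crawled flag playing the role of the visited set, and fetchLinks() is a hypothetical stand-in for the fetch-and-parse step.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    //hypothetical stand-in: fetch a page and return the links found on it
    static List<String> fetchLinks(String url) { return Collections.emptyList(); }

    public static void crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<String>(); //pages waiting to be crawled
        Set<String> visited = new HashSet<String>();       //pages already seen
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();            //oldest link first: breadth-first order
            for (String next : fetchLinks(url)) {
                if (visited.add(next)) {             //add() returns false for duplicates
                    frontier.add(next);
                }
            }
        }
    }
}
```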
Execution starts from main() in crawler.java.
```java
Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");
```
First, DriverManager connects to the MySQL service. The program uses XAMPP's default MySQL port, 3306; the port value is shown on the XAMPP main panel.
Once Apache and MySQL are both running, entering "http://localhost/phpmyadmin/" in the browser's address bar opens the database. After the program finishes, you can check there whether it ran correctly.
```java
sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
```
With the connection established, the program creates a database named "crawler" holding two tables. The first, "record", has the fields "recordID", "URL" and "crawled", which store each address's ID, the link itself, and whether the address has been scanned yet. The second, "tags", has the fields "tagnum" and "tagname", which store each tag's ID and name.
```java
while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}
```
A while loop then works through the addresses in the record table one by one. On each pass, the current address url is handed to httpGet.getByString(), after which the record's crawled flag is set to 1 to mark the page as processed. The program then looks for the next address whose crawled flag is 0 and continues until it reaches the end of the table.
One detail worth noting: executeQuery() returns a ResultSet rs that holds every row the SQL query matched, together with a cursor that initially sits before the first row. You must call rs.next() once to move the cursor onto the first result (the call returns true); each further rs.next() advances the cursor one row and returns true, until no results remain, at which point rs.next() returns false.
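Because of this cursor behavior, the standard JDBC idiom for walking an entire result set is a while loop on rs.next(). A minimal sketch, reusing this program's record table:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResultSetDemo {
    static void printAll(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT recordID, URL, crawled FROM record");
        while (rs.next()) {                  //false once the cursor passes the last row
            int id = rs.getInt("recordID");  //columns can be read by name...
            String link = rs.getString(2);   //...or by 1-based index, as the crawler does
            System.out.println(id + "\t" + link);
        }
        rs.close();
        stmt.close();
    }
}
```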
Another detail: creating databases and tables, INSERT and UPDATE all go through executeUpdate(), while SELECT goes through executeQuery(). executeQuery() always returns a ResultSet; executeUpdate() returns the number of rows affected by the statement.
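That return value is exactly what the crawler's main loop tests with if (stmt.executeUpdate(sql) > 0): the update "succeeded" only if some row actually matched the URL. A small sketch of the same pattern:

```java
import java.sql.Connection;
import java.sql.Statement;

public class UpdateDemo {
    //returns true only if a row with this URL existed and was updated
    static boolean markCrawled(Connection conn, String url) throws Exception {
        Statement stmt = conn.createStatement();
        //executeUpdate() returns the number of rows affected, not a ResultSet
        int affected = stmt.executeUpdate(
                "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'");
        stmt.close();
        return affected > 0;
    }
}
```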
The getByString() method in httpGet.java sends a request to the given URL and downloads the page content.
```java
HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);
```
This code comes from the sample shipped with the HttpClient component of Apache HttpComponents and can be used as-is in many situations. It produces a string, responseBody, holding the complete text of the page.
responseBody is then passed to the parseFromString() method in parsePage.java to extract the links.
```java
Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);
        if (node instanceof LinkTag) {
```
In an HTML document, links usually live in the href attribute of an a tag, so the code creates an attribute filter. The NodeList holds all DOM nodes of the HTML document; looping over them and keeping only the matching tags extracts every link on the page.
nextlink.startsWith() then narrows the set further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.
```java
sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
```
The code looks the link up in the record table. If it is already there (rs.next() returns true), nothing more is done; if it is not (rs.next() returns false), the address is inserted with crawled set to 0. Because recordID was declared AUTO_INCREMENT, Statement.RETURN_GENERATED_KEYS is passed so the database assigns the proper ID.
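The program requests the generated keys but never reads them back. If you did want the new recordID (to log it, say), the standard JDBC route is getGeneratedKeys(). A sketch under that assumption, which also uses a ? placeholder rather than string concatenation, sidestepping quoting problems with unusual URLs:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InsertDemo {
    //insert a link and return the recordID the database assigned to it
    static int insertLink(Connection conn, String nextlink) throws Exception {
        PreparedStatement pstmt = conn.prepareStatement(
                "INSERT INTO record (URL, crawled) VALUES (?, 0)",
                Statement.RETURN_GENERATED_KEYS);
        pstmt.setString(1, nextlink);
        pstmt.execute();
        ResultSet keys = pstmt.getGeneratedKeys(); //holds the auto_increment value
        int recordID = keys.next() ? keys.getInt(1) : -1;
        pstmt.close();
        return recordID;
    }
}
```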
```java
nextlink = nextlink.substring(mainurl.length());
if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}
```
Stripping the leading "http://johnhany.net/" from the link speeds up the string comparisons. If the remainder starts with "tag/", the characters after it are a tag's name; the name is extracted, URL-decoded as UTF-8 so that Chinese characters come through intact, and stored in the tags table. In the same way you could test for "article/", "author/", or "2013/11/" to classify the other links.
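A sketch of that classification idea (the category labels here are invented for illustration):

```java
public class ClassifyDemo {
    //classify a link that has already had "http://johnhany.net/" stripped off
    static String classify(String path) {
        if (path.startsWith("tag/")) {
            return "tag";
        } else if (path.startsWith("article/")) {
            return "article";
        } else if (path.startsWith("author/")) {
            return "author";
        } else if (path.matches("\\d{4}/\\d{2}/.*")) { //e.g. "2013/11/..."
            return "archive";
        }
        return "other";
    }
}
```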
Results
Two database screenshots show part of the program's results.
The complete output can be obtained here. Compare it with this blog's sitemap to see what changes would still be needed to turn the crawler into a sitemap generator.
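As a starting point, here is a minimal sketch of that sitemap idea: read each crawled URL back out of the record table and print it in sitemap XML. The element layout follows the sitemaps.org protocol; optional fields such as lastmod and changefreq are omitted:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class SitemapDemo {
    //print a bare-bones sitemap built from every page the crawler visited
    static void printSitemap(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 1");
        System.out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        System.out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        while (rs.next()) {
            System.out.println("  <url><loc>" + rs.getString(1) + "</loc></url>");
        }
        System.out.println("</urlset>");
        rs.close();
        stmt.close();
    }
}
```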