Crawling the Chinese Library Classification (中图分类法)

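Many library websites host a fairly complete web version of the Chinese Library Classification (《中图分类法》), with each category listing placed on its own linked page.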

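I wrote a Java program to traverse sites of this kind. This post uses the "Chinese Library Classification website" (中国图书馆分类法网站, www.ztflh.com in the code) as the example.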


Database table creation script:

CREATE TABLE [zhtclass] (
       [type] [nvarchar] (100) COLLATE Chinese_PRC_CI_AS NULL ,
       [detail] [nvarchar] (400) COLLATE Chinese_PRC_CI_AS NULL
) ON [PRIMARY]
GO

 


Java code:

import java.net.URL;
import java.net.URLConnection;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Scanner;

public class Exec {
        public static void main(String args[]) {
                PreparedStatement pstm = null;

                // The six marker strings below can be adjusted to match the target site's page format
                String nullFlag = "没有下级分类";
                String firstFlagAlph = "<ul id=\"list\" class=\"cent\" style=\"list-style:none;\"><li>";
                String startFlagAlph = "<span class=\"code\">";
                String endFlagAlph = "</span>";
                String startFlagName = "\">";
                String endFlagName = "</a>";

                try {
                        Class.forName("com.microsoft.jdbc.sqlserver.SQLServerDriver");
                        Connection con = DriverManager
                                        .getConnection(
                                                        "jdbc:microsoft:sqlserver://localhost:1433;DatabaseName=LIB",
                                                        "sa", "");
                        pstm = con.prepareStatement("insert into zhtclass values(?,?)");
                } catch (Exception e) {
                        // Without a database connection pstm stays null, so there is nothing to crawl into
                        System.out.println(e.getLocalizedMessage());
                        return;
                }
                for (int i = 1; i < 45837; i++) {
                        System.out.println(i);
                        try {
                                StringBuffer sb = new StringBuffer();
                                URL url = new URL("http://www.ztflh.com/?c=" + i);
                                URLConnection urlConn = url.openConnection();
                                urlConn.setReadTimeout(10000);
                                urlConn.setConnectTimeout(10000);
                                // A plain GET is enough here; setDoOutput(true) is only needed when sending a request body
                                urlConn.connect();
                                // Read the whole page into sb; try-with-resources closes the stream afterwards
                                try (Scanner in = new Scanner(urlConn.getInputStream(), "UTF-8")) {
                                        while (in.hasNextLine())
                                                sb.append(in.nextLine());
                                }

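                                // Pages containing nullFlag have no sub-categories to extract
                                if (sb.indexOf(nullFlag) < 0) {
                                        // Start scanning just after the list header when it is present
                                        int first = sb.indexOf(firstFlagAlph);
                                        int end = first < 0 ? 0 : first + firstFlagAlph.length();
                                        while (true) {
                                                // Locate the next category code; stop when none is left
                                                int codeAt = sb.indexOf(startFlagAlph, end);
                                                if (codeAt < 0)
                                                        break;
                                                int start = codeAt + startFlagAlph.length();
                                                end = sb.indexOf(endFlagAlph, start);
                                                if (end < 0)
                                                        break;
                                                String alph = sb.substring(start, end).trim();
                                                // The category name is the link text that follows the code
                                                int nameAt = sb.indexOf(startFlagName, end);
                                                if (nameAt < 0)
                                                        break;
                                                start = nameAt + startFlagName.length();
                                                end = sb.indexOf(endFlagName, start);
                                                if (end < 0)
                                                        break;
                                                String name = sb.substring(start, end).trim();
                                                // Store one (code, name) row per category
                                                pstm.setString(1, alph);
                                                pstm.setString(2, name);
                                                pstm.executeUpdate();
                                        }
                                }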
                        } catch (Exception e) {
                                // Log the failing id and move on; retrying via i-- would loop forever on a permanent error
                                System.out.println("c=" + i + ": " + e.getLocalizedMessage());
                        }
                }
        }
}
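
The marker-based extraction at the heart of the program can be tried without a network or database. The following is a minimal sketch, not the original program: the class name MarkerExtract, the method extract, and the sample HTML fragment are hypothetical, and only the four marker strings are taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

class MarkerExtract {
    /** Returns {code, name} pairs found between the marker strings. */
    static List<String[]> extract(String html) {
        // Marker strings copied from the crawler above
        String startCode = "<span class=\"code\">";
        String endCode = "</span>";
        String startName = "\">";
        String endName = "</a>";

        List<String[]> rows = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Find the next category code; stop when no more are present
            int codeAt = html.indexOf(startCode, pos);
            if (codeAt < 0) break;
            int codeStart = codeAt + startCode.length();
            int codeEnd = html.indexOf(endCode, codeStart);
            if (codeEnd < 0) break;

            // The category name is the link text following the code
            int nameAt = html.indexOf(startName, codeEnd);
            if (nameAt < 0) break;
            int nameStart = nameAt + startName.length();
            int nameEnd = html.indexOf(endName, nameStart);
            if (nameEnd < 0) break;

            rows.add(new String[] {
                html.substring(codeStart, codeEnd).trim(),
                html.substring(nameStart, nameEnd).trim()
            });
            pos = nameEnd;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the markers expect (not real site output)
        String html = "<ul id=\"list\"><li><span class=\"code\">A</span>"
                + "<a href=\"?c=1\">First</a></li>"
                + "<li><span class=\"code\">B</span>"
                + "<a href=\"?c=2\">Second</a></li></ul>";
        for (String[] row : extract(html)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Returning the rows instead of inserting them directly keeps the parsing testable on its own; the database insert can then be a separate loop over the result.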

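Because requests to the site can time out, some form of retry is useful, but an unbounded retry never terminates when a page fails permanently. One possible approach, sketched here under assumptions, is to cap the number of attempts per page. The class BoundedRetry and the method retry are hypothetical names, and the failing task in main only simulates a timeout.

```java
import java.util.concurrent.Callable;

class BoundedRetry {
    /** Runs the task up to maxAttempts times and rethrows the last failure. */
    static <T> T retry(int maxAttempts, Callable<T> task) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Remember the failure and try again if attempts remain
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: fails twice, then succeeds
        String result = retry(3, () -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated timeout");
            }
            return "page body";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the crawler, the page download for one value of i would be the task; when retry finally throws, the loop can log the id and continue with the next page.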
 
