A Simple Java Crawler: Scraping Data with HttpClient + jsoup

lizhipengg

Article link: https://blog.csdn.net/lizhiopeng/article/details/103969493

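When people think of web crawlers, Python usually comes to mind first: it needs little code and is very capable, so I won't go into Python here.
Instead, this post describes an approach I have used during Java development to scrape data from a web page.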

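What it does: fetch the latest region information from the national web page that lists the latest administrative division codes: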
http://www.mca.gov.cn/article/sj/xzqh/2019/2019/201912251506.html

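All of the latest province, city, and district names and codes are scraped from the page and analyzed, and the parent-child relationships are determined as the rows are written to the database.
First, create the database table. Its main fields are as follows:
[Image: main fields of the area table]
Because I only needed the latest region information, I wrote a test class to update the area table in the database. Once the update succeeded I did not tidy the code further, so it contains some redundancy and serves only as a proof of concept.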
The interface files for the table can be generated from the database with the mybatis-generator code generation tool
(https://blog.csdn.net/lizhiopeng/article/details/103872942).
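
The test class further down relies on an entity and a DAO for the area table, neither of which is shown in this post. As a rough sketch, inferred only from the setters and DAO calls that the test uses, they might look like the following. The field types, the return type of `insert`, and the interface shape are assumptions; the files that mybatis-generator produces for your own table will differ.

```java
import java.util.Date;

// Hypothetical entity: the property names are taken from the setters called in the test,
// but the String/Date types are assumptions.
class AreaBean {
    private String id;
    private String orgId;
    private String name;
    private String levelType;   // "1" province, "2" city, "3" district
    private String parentCode;
    private String parentName;
    private Date createTime;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getOrgId() { return orgId; }
    public void setOrgId(String orgId) { this.orgId = orgId; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getLevelType() { return levelType; }
    public void setLevelType(String levelType) { this.levelType = levelType; }

    public String getParentCode() { return parentCode; }
    public void setParentCode(String parentCode) { this.parentCode = parentCode; }

    public String getParentName() { return parentName; }
    public void setParentName(String parentName) { this.parentName = parentName; }

    public Date getCreateTime() { return createTime; }
    public void setCreateTime(Date createTime) { this.createTime = createTime; }
}

// Hypothetical DAO: only the two calls used by the test are listed.
interface AreaBeanDao {
    int insert(AreaBean area);          // assumed to return the affected row count
    AreaBean getAreaById(String id);    // assumed to return null when no row matches
}
```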
Add the HttpClient and jsoup dependencies, with their versions, to pom.xml:

    <!-- HttpClient support -->
    <dependency>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
      <version>3.1</version>
    </dependency>
    <!-- jsoup -->
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.11.3</version>
    </dependency>

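The test class is as follows.

Main steps:

1. Use HttpClient to fetch the HTML content of the URL
2. Parse the HTML content with jsoup
3. Evaluate the extracted information and apply the processing logic
4. Store the results in the database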
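
Step 3 depends entirely on the shape of the six-digit codes. As a minimal sketch of that rule on its own, using only the JDK, the helper below derives the level and the shortened id that the test stores for each level. The class and method names (`AreaCodeRules`, `isAreaCode`, `levelOf`, `storedId`) are made up for illustration and do not appear in the test class.

```java
// Hypothetical helper that isolates the code rules used in step 3.
final class AreaCodeRules {
    private AreaCodeRules() {
    }

    // True only for text that is exactly six digits, which filters out header rows.
    static boolean isAreaCode(String text) {
        return text != null && text.matches("\\d{6}");
    }

    // Codes ending in "0000" are level 1, codes ending in "00" are level 2, others level 3.
    static int levelOf(String code) {
        if (code.endsWith("0000")) {
            return 1;
        }
        if (code.endsWith("00")) {
            return 2;
        }
        return 3;
    }

    // The test stores 2 digits for a province, 4 for a city, and all 6 for a district.
    static String storedId(String code) {
        switch (levelOf(code)) {
            case 1:
                return code.substring(0, 2);
            case 2:
                return code.substring(0, 4);
            default:
                return code;
        }
    }

    public static void main(String[] args) {
        for (String code : new String[] {"130000", "130100", "130102"}) {
            System.out.println(code + " level=" + levelOf(code) + " id=" + storedId(code));
        }
    }
}
```

Keeping the rule in a pure function like this makes it possible to check it without a database or a network connection.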

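// Imports used by this test method (they belong at the top of the test class file)
import java.io.IOException;
import java.util.Date;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

    /**
     * Scrapes region data from the web page and writes it to the database.
     *
     * @throws IOException if the page cannot be fetched
     */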
    @Test
    public void insertAreaByUrl() throws IOException {
        HttpClient client = new HttpClient();
        // URL of the page to parse
        GetMethod getMethod = new GetMethod("http://www.mca.gov.cn/article/sj/xzqh/2019/2019/201912251506.html");
        getMethod.getParams().setParameter(HttpClientParams.HTTP_CONTENT_CHARSET, "UTF-8");
        String html;
        try {
            client.executeMethod(getMethod);
            html = getMethod.getResponseBodyAsString();
        } finally {
            // Always release the connection, even when the request fails
            getMethod.releaseConnection();
        }
        // Parse the HTML content with jsoup
        Document doc = Jsoup.parse(html);
        // Select every <tr> inside every <table> on the page
        Elements trs = doc.select("table").select("tr");
        // Iterate over all rows
        for (int i = 0; i < trs.size(); ++i) {
            // All <td> cells of the current row
            Elements tds = trs.get(i).select("td");
            // Skip rows that lack the code and name cells
            if (tds.size() < 3) {
                continue;
            }
            // The second <td> holds the code (used as id and org_id)
            String id = tds.get(1).text().replace('\u00A0', ' ').trim();
            // The third <td> holds the region name
            String name = tds.get(2).text();
            // Skip title and header rows whose code cell is not a six-digit code
            if (!id.matches("\\d{6}")) {
                continue;
            }
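            // Set the area level from the code: codes ending in "0000" are level 1,
            // codes ending in "00" are level 2, and all others are level 3.
            // The level also determines parent_code and parent_name.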
            try {
                if (id.endsWith("0000")) { // province
                    AreaBean area = new AreaBean();
                    area.setId(id.substring(0, 2));
                    area.setOrgId(id.substring(0, 2));
                    area.setName(name);
                    area.setLevelType("1");
                    area.setParentCode("0");
                    area.setParentName("中国");
                    area.setCreateTime(new Date());
                    AreaBeanDao.insert(area);
                } else if (id.endsWith("00")) { // city
                    AreaBean area = new AreaBean();
                    area.setId(id.substring(0, 4));
                    area.setOrgId(id.substring(0, 4));
                    area.setName(name);
                    area.setLevelType("2");
                    area.setParentCode(id.substring(0, 2));
                    AreaBean temp = AreaBeanDao.getAreaById(id.substring(0, 2));
                    area.setParentName(temp.getName());
                    area.setCreateTime(new Date());
                    AreaBeanDao.insert(area);
                } else { // district or county
                    AreaBean area = new AreaBean();
                    area.setId(id);
                    area.setOrgId(id);
                    area.setName(name);
                    area.setLevelType("3");
                    // Prefer the city as the parent; fall back to the province
                    // when no city row exists for the first four digits
                    AreaBean temp = AreaBeanDao.getAreaById(id.substring(0, 4));
                    if (temp == null) {
                        temp = AreaBeanDao.getAreaById(id.substring(0, 2));
                    }
                    area.setParentCode(temp.getId());
                    area.setParentName(temp.getName());
                    area.setCreateTime(new Date());
                    AreaBeanDao.insert(area);
                }
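            } catch (Exception e) {
                // Report the failed row instead of silently swallowing the error
                System.err.println("Failed to import " + id + " " + name + ": " + e);
            }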
        }
    }
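
The parent lookup in the test only works because the page lists each province before its cities and each city before its districts, so the parent row already exists when a child is inserted. The sketch below replays that logic against an in-memory map instead of a database to make the ordering and the province fallback visible. Everything in it (`AreaHierarchyDemo`, the nested `Area` class, the `add` and `require` methods, and the use of "China" as the root name) is hypothetical and stands in for the real entity and DAO.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory replay of the parent lookup; the map stands in for the area table.
final class AreaHierarchyDemo {

    // Minimal stand-in for one row of the area table.
    static final class Area {
        final String id;
        final String name;
        final String parentCode;
        final String parentName;

        Area(String id, String name, String parentCode, String parentName) {
            this.id = id;
            this.name = name;
            this.parentCode = parentCode;
            this.parentName = parentName;
        }
    }

    private final Map<String, Area> table = new LinkedHashMap<>();

    // Classifies a six-digit code, resolves its parent, and stores the row.
    Area add(String code, String name) {
        Area area;
        if (code.endsWith("0000")) {
            // Province: the root parent is "0" (the original test stores the Chinese name)
            area = new Area(code.substring(0, 2), name, "0", "China");
        } else if (code.endsWith("00")) {
            // City: the parent is the province with the same first two digits
            Area province = require(code.substring(0, 2));
            area = new Area(code.substring(0, 4), name, province.id, province.name);
        } else {
            // District: prefer the city, fall back to the province when no city row exists
            Area parent = table.get(code.substring(0, 4));
            if (parent == null) {
                parent = require(code.substring(0, 2));
            }
            area = new Area(code, name, parent.id, parent.name);
        }
        table.put(area.id, area);
        return area;
    }

    // Fails loudly when a child arrives before its parent.
    private Area require(String id) {
        Area area = table.get(id);
        if (area == null) {
            throw new IllegalStateException("Parent " + id + " must be inserted first");
        }
        return area;
    }

    public static void main(String[] args) {
        AreaHierarchyDemo demo = new AreaHierarchyDemo();
        demo.add("110000", "Beijing");
        Area dongcheng = demo.add("110101", "Dongcheng");   // no 1101 row, so the parent is 11
        demo.add("130000", "Hebei");
        demo.add("130100", "Shijiazhuang");
        Area changan = demo.add("130102", "Chang'an");      // the parent is city 1301
        System.out.println(dongcheng.name + " parent=" + dongcheng.parentCode);
        System.out.println(changan.name + " parent=" + changan.parentCode);
    }
}
```

If the rows were processed in a different order, the `require` call would throw, whereas in the test above the same situation surfaces as a `NullPointerException` inside the `try` block.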