Java Learning Path: Web Crawler Techniques

Latest recommended article published on 2022-05-27 18:55:12

haki

Category: Java Learning. Tags: Java crawler, Jsoup, simulated form submission

Article link: https://blog.csdn.net/qq741058114/article/details/88930129

Java Web Crawler Techniques

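A web crawler is a program or script that automatically fetches web pages according to a set of rules. You point it at the pages you need, then apply extraction rules to pull the target data out of them.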

Libraries (JARs) used

Required: jsoup.jar
Optional: HttpClient

1. Crawl a page with a GET request
2. Crawl a page by simulating a form submission

Crawling a page with a GET request

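Steps: first, use Jsoup with the target URL to create a Connection, then set the request properties, such as the request headers. Executing the Connection returns a response whose body is the page's HTML. Set the charset used to decode that body to match the charset of the page being crawled; otherwise the text may come out garbled. Finally, apply your selection rules to extract the data you need. The main code is as follows: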

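import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test; // JUnit 4 assumed; the original does not show its imports
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SpiderDemo {

    private static final Logger log = LoggerFactory.getLogger(SpiderDemo.class);

    // URL of the page to crawl
    private static final String URL = "https://item.m.jd.com/product/41633722698.html";

    @Test
    public void spiderForGet() {
        // Create a Connection for the target URL
        Connection connect = Jsoup.connect(URL);
        // Set the User-Agent header (here, a mobile Safari browser)
        connect.userAgent("Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1");
        // Use the GET method
        connect.method(Connection.Method.GET);
        try {
            // Execute the request and set the charset used to decode the body
            Connection.Response response = connect.execute().charset("UTF-8");
            // Parse the body string into a DOM tree
            Document doc = Jsoup.parse(response.body());
            // Look up the nodes by id
            Element nameNode = doc.getElementById("itemName");
            Element priceNode = doc.getElementById("priceSale");
            // getElementById returns null when the id is absent, e.g. after a layout change
            if (nameNode == null || priceNode == null) {
                log.warn("Expected elements not found on {}", URL);
                return;
            }
            log.info("Product name: {}\nPrice: {}", nameNode.text(), priceNode.text());
        } catch (IOException e) {
            log.error("Crawl failed: {}", e.getMessage());
        }
    }
}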
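
The charset step is easy to get wrong, so here is a small JDK-only sketch of what happens when the same bytes are decoded with the right and the wrong charset. It does not use Jsoup; `CharsetDemo` and `decode` are hypothetical names, and the byte array merely stands in for a raw response body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** JDK-only sketch; the class and method names are hypothetical. */
final class CharsetDemo {

    /** Decodes raw response bytes using the given charset. */
    static String decode(byte[] body, Charset pageCharset) {
        return new String(body, pageCharset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Two CJK characters (U+4E2D, U+6587), encoded the way a GBK page would send them.
        byte[] body = "\u4e2d\u6587".getBytes(gbk);

        // Decoding with the page's own charset recovers the original text.
        System.out.println(decode(body, gbk));

        // Decoding the same bytes as UTF-8 does not: the byte sequences are
        // malformed UTF-8, so the result contains replacement characters.
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```

With Jsoup, choosing `pageCharset` corresponds to the `charset(...)` call on the response before `body()` is read.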

Crawling a page by simulating a form submission

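Steps: first, use Jsoup with the target URL to create a Connection and fetch the page that contains the form, so you can find the form's action URL and the fields it expects. Fill in those fields and set the request headers; because this is a form submission, the headers usually need to mimic a browser. Then set the request method to POST, execute the request, and read the response.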
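
On the wire, a form POST carries its fields as URL-encoded key=value pairs, encoded in the charset the page expects. The JDK-only sketch below illustrates that encoding step, which is roughly what Jsoup takes care of when you pass it form data and a POST data charset. `FormBodyDemo`, `encode`, and `encodeForm` are hypothetical names; the field names and keyword are borrowed from the search example in this section.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** JDK-only sketch of form encoding; class and method names are hypothetical. */
final class FormBodyDemo {

    /** URL-encodes one value in the given charset, e.g. "GBK" or "UTF-8". */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    /** Joins the fields as key=value pairs separated by '&'. */
    static String encodeForm(Map<String, String> fields, String charset) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(encode(field.getKey(), charset) + "=" + encode(field.getValue(), charset));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        // The four-character author name from the original test, as Unicode escapes.
        fields.put("searchkey", "\u5929\u8695\u571f\u8c46");
        fields.put("searchtype", "author");

        // Each of these characters takes two bytes in GBK and three in UTF-8,
        // so the two bodies differ; the server must decode with the same charset.
        System.out.println(encodeForm(fields, "GBK"));
        System.out.println(encodeForm(fields, "UTF-8"));
    }
}
```

If the charset used for the POST data does not match what the server expects, the search keyword arrives garbled and the search is unlikely to match anything.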

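import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test; // JUnit 4 assumed; the original does not show its imports
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Novel is a plain POJO (setters plus toString) that is not shown here
public class Example {

    private static final String URL = "https://www.kanshuzhong.com/modules/article/search.php";

    private static final Logger log = LoggerFactory.getLogger(Example.class);

    private static final Map<Integer, String> SEL_MAP = new HashMap<>();
    // Allowed values of the form's search-type <select>
    static {
        SEL_MAP.put(1, "articlename"); // article title
        SEL_MAP.put(2, "author");      // author name
        SEL_MAP.put(3, "keywords");    // keywords
    }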

    public void spider(String keyword, Integer id) throws IOException {
        Map<String, String> datas = new HashMap<>();
        // Default to a keyword search when no search type is given
        if (id == null) {
            id = 3;
        }
        // Fill in the form fields
        datas.put("searchkey", keyword);
        datas.put("searchtype", SEL_MAP.get(id));

        // Create the connection; the URL passed in is the form's action URL
        Connection con = Jsoup.connect(URL);

        log.info("Request data: {}", datas);

        // Charset for the POST data; it should match the charset of the page
        con.postDataCharset("GBK");
        // Set a User-Agent header to mimic a browser
        con.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
        Connection.Response execute = con.ignoreContentType(true)
                .followRedirects(true)
                .method(Connection.Method.POST)
                .data(datas)
                .execute();
        // Important: set the response charset before reading the body
        String result = execute.charset("GBK").body();
        Document docs = Jsoup.parse(result);
        // Selection rule: pick the elements with class "info"
        Elements elements = docs.select(".info");
        List<Novel> list = new ArrayList<>();
        // Walk each element's DOM by position and copy the values into a POJO.
        // The indexes below depend on the page layout and will throw if it changes.
        elements.forEach(e -> {
            Novel novel = new Novel();
            // All <a> tags under this element
            Elements links = e.getElementsByTag("a");
            // First <a> tag: read its href attribute
            Element a = links.first();
            String href = a.attr("href");
            novel.setHref(href);
            // First <img> tag inside that <a>: read its src attribute
            String image = a.getElementsByTag("img").first().attr("src");
            novel.setImage(image);
            // Text of the second <a> tag
            String name = links.get(1).text();
            novel.setName(name);
            // Text of the third <a> tag
            String updateChapter = links.get(2).text();
            novel.setUpdateChapter(updateChapter);
            // All <li> tags under this element
            Elements li = e.getElementsByTag("li");
            String introduction = li.first().text();
            novel.setIntroduction(introduction);
            // <font> tags inside the second <li>
            Elements font = li.get(1).getElementsByTag("font");
            // Text of the first <font> tag
            String author = font.get(0).text();
            novel.setAuthor(author);
            // Text of the second <font> tag
            String type = font.get(1).text();
            novel.setType(type);
            // Text of the third <font> tag
            String status = font.get(2).text();
            novel.setStatus(status);
            // Text of the fourth <font> tag
            String updateTime = font.get(3).text();
            novel.setUpdateTime(updateTime);
            list.add(novel);
            log.info("Parsed item: {}", novel);
        });

    }

    @Test
    public void test() throws IOException {
        // Search by author name (search type 2)
        spider("天蚕土豆", 2);
    }


}
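
The `Novel` class used above is not shown in the original. A minimal sketch, with field names inferred from the setters that `Example` calls and every field assumed to be a plain `String`, might look like this:

```java
/**
 * Hypothetical sketch of the Novel POJO. The field names are inferred from the
 * setters called in Example; treating every field as a String is an assumption.
 */
class Novel {
    private String href;          // link taken from the first <a> tag
    private String image;         // src of the cover image
    private String name;          // title text
    private String updateChapter; // text of the latest-chapter link
    private String introduction;  // text of the first <li>
    private String author;
    private String type;
    private String status;
    private String updateTime;    // kept as the page's text, not parsed as a date

    public void setHref(String href) { this.href = href; }
    public void setImage(String image) { this.image = image; }
    public void setName(String name) { this.name = name; }
    public void setUpdateChapter(String updateChapter) { this.updateChapter = updateChapter; }
    public void setIntroduction(String introduction) { this.introduction = introduction; }
    public void setAuthor(String author) { this.author = author; }
    public void setType(String type) { this.type = type; }
    public void setStatus(String status) { this.status = status; }
    public void setUpdateTime(String updateTime) { this.updateTime = updateTime; }

    public String getHref() { return href; }
    public String getImage() { return image; }
    public String getName() { return name; }
    public String getUpdateChapter() { return updateChapter; }
    public String getIntroduction() { return introduction; }
    public String getAuthor() { return author; }
    public String getType() { return type; }
    public String getStatus() { return status; }
    public String getUpdateTime() { return updateTime; }

    // toString makes the log.info("...{}", novel) call print the field values.
    @Override
    public String toString() {
        return "Novel{name='" + name + "', author='" + author + "', type='" + type
                + "', status='" + status + "', updateTime='" + updateTime
                + "', updateChapter='" + updateChapter + "', href='" + href
                + "', image='" + image + "', introduction='" + introduction + "'}";
    }
}
```

Without a `toString` override, the log line in `spider` would print only the default object identity rather than the scraped values.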