crawler

学无止路

Last modified 2022-03-17 09:01:58

First published 2022-03-16 15:46:30

Article link: https://blog.csdn.net/weixin_40055163/article/details/123493123

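1 Web Crawlers

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

1.1 A First Crawler Program

1.1.1 Environment Setup

1. JDK 1.8
2. IntelliJ IDEA
3. The Maven bundled with IDEA

1. Open IDEA, select Projects, and click Create New Project.
2. Select Maven and click Next.
3. Enter the project name and click Finish.
4. The pom.xml is as follows: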

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.txw</groupId>
    <artifactId>crawler-first</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!--HttpClient-->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!--Logging-->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
    </dependencies>
</project>

5. Add a log4j.properties file with the following content:

log4j.rootLogger=DEBUG,A1
log4j.logger.com.txw=DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

6. Write the CrawlerFirst class as follows:

package com.txw.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
/**
 * Introductory web crawler example
 * @author Adair
 * @date 2022/3/15 8:49 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerFirst {
    public static void main(String[] args) throws Exception{
        // 1. "Open the browser": create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. "Enter the address": create an HttpGet object for the GET request
        HttpGet httpGet = new HttpGet("https://blog.csdn.net/weixin_40055163?spm=1010.2135.3001.5343");
        // 3. "Press Enter": send the request with httpClient and receive the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Parse the response and extract the data
        // Check whether the status code is 200
        if (response.getStatusLine().getStatusCode() == 200){
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println(content);
        }
        // Release the response and the client
        response.close();
        httpClient.close();
    }
}

Running the program prints the HTML of the page, which confirms that the page data can be fetched.

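2 Web Crawlers

2.1 Introduction to Web Crawlers

In the era of big data, collecting information is important work, and the volume of data on the Internet is enormous. Collecting it by hand is slow, tedious, and costly. How to automatically and efficiently obtain the information we care about and put it to use is therefore an important problem, and crawler technology exists to solve it.
A web crawler, also called a web robot, collects and organizes data on the Internet automatically in place of a person. It is a program or script that fetches World Wide Web content according to a set of rules, and it can gather the content of every page it is able to reach in order to extract the relevant data.
Functionally, a crawler generally has three parts: data collection, processing, and storage. The crawler starts from the URLs of one or more seed pages and extracts the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition is met.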
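The queue-driven loop described above can be sketched as follows. This is a minimal illustration, not code from the original project: the class name CrawlLoopSketch, the in-memory links map (which stands in for "download the page and extract its links"), and the maxPages stop condition are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {

    /**
     * Breadth-first crawl over a link graph.
     * links maps a URL to the URLs found on that page (hypothetical stand-in
     * for fetching and parsing the page).
     */
    public static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxPages) {
        Queue<String> queue = new ArrayDeque<>(seeds);  // URLs waiting to be fetched
        Set<String> seen = new HashSet<>(seeds);        // URLs already queued, to avoid repeats
        List<String> visited = new ArrayList<>();       // URLs in the order they were fetched

        // Stop condition: the queue is empty or the page budget is used up
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            visited.add(url);
            for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(next)) {   // add() returns false for a URL seen before
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("/a", Arrays.asList("/b", "/c"));
        links.put("/b", Arrays.asList("/a", "/d"));
        links.put("/c", Arrays.asList("/d"));
        System.out.println(crawl(Arrays.asList("/a"), links, 10)); // prints [/a, /b, /c, /d]
    }
}
```

The seen set is what keeps the crawler from looping when pages link back to each other, as /b does to /a here.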

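2.2 Why Learn Web Crawling

We now have a first impression of web crawlers, but why learn them? Knowing the goal makes the learning more effective. Here are four common reasons for learning to write crawlers:
1. To build a search engine
Once we can write crawlers, we can use them to collect information from the Internet automatically and then store or process it. When we need to look something up, we search the collected data, which amounts to a private search engine.
2. To obtain more data sources in the era of big data
Big data analysis and data mining need source data. It can come from websites that publish statistics, or from literature and internal documents, but these channels often cannot meet our needs, and searching the Internet by hand takes too much effort. A crawler can fetch the content we are interested in automatically and bring it back as a data source for deeper analysis that yields more valuable information.
3. To do search engine optimization (SEO) better
To do their work well, SEO practitioners must understand clearly how search engines work, including how search engine crawlers work.
Learning to write crawlers gives a deeper understanding of how those crawlers work, so that optimization is done with full knowledge of both sides.
4. To improve job prospects
From an employment point of view, crawler engineering is a good direction. Demand for crawler engineers keeps growing while relatively few people are qualified, so the skill is in short supply. With the arrival of big data and artificial intelligence, crawler technology will be applied ever more widely and has good prospects.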

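3 HttpClient

A web crawler is a program that accesses resources on the network for us. We have always used the HTTP protocol to visit web pages, and a crawler is a program that visits pages over the same HTTP protocol.
Here we use HttpClient, a Java HTTP client, to fetch page data.

3.1 GET Requests

Access CSDN. The request URL is: https://blog.csdn.net
The code is as follows: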

package com.txw.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
/**
 * GET request
 * @author Adair
 * @date 2022/3/15 9:13 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerSecond {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet request
        HttpGet httpGet = new HttpGet("https://blog.csdn.net");
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpGet);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

3.2 GET Requests with Parameters

Search CSDN for learning videos. The address is: https://blog.csdn.net/search?keys=Java
The code is as follows:

package com.txw.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
/**
 * GET request with parameters
 * @author Adair
 * @date 2022/3/15 9:29 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerThird {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet request; the address carries the parameter: https://blog.csdn.net/search?keys=Java
        String uri = "https://blog.csdn.net/search?keys=Java";
        HttpGet httpGet = new HttpGet(uri);
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpGet);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

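The address in the example is written out by hand, which works for a plain value such as Java. When a parameter value contains spaces, an ampersand, or non-ASCII text, it has to be encoded before it goes into the query string. The sketch below shows one way to do that with the JDK alone. The class name QueryStringSketch and the helper withParams are made up for illustration. HttpClient also ships org.apache.http.client.utils.URIBuilder, whose setParameter method can serve the same purpose.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {

    /** Appends form-encoded query parameters to a base address (hypothetical helper). */
    public static String withParams(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';   // every parameter after the first is joined with '&'
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should never happen
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keys", "Java crawler");
        // prints https://blog.csdn.net/search?keys=Java+crawler
        System.out.println(withParams("https://blog.csdn.net/search", params));
    }
}
```

URLEncoder applies form encoding, so a space becomes a plus sign. The resulting string can be passed to new HttpGet(...) exactly like the hand-written address.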

3.3 POST Requests

Access CSDN with a POST request. The request URL is: https://blog.csdn.net
The code is as follows:

package com.txw.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * POST request
 * @author Adair
 * @date 2022/3/15 9:37 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerFourth {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost request
        HttpPost httpPost = new HttpPost("https://blog.csdn.net");
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpPost);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

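3.4 POST Requests with Parameters

Search CSDN for learning videos with a POST request. The URL is: https://blog.csdn.net/search
The URL carries no parameters; the parameter keys=java is submitted in the form body.
The code is as follows: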

package com.txw.crawler.test;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import java.util.ArrayList;
import java.util.List;
/**
 * POST request with parameters
 * @author Adair
 * @date 2022/3/15 9:44 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerFifth {
    public static void main(String[] args) throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost request; the parameters go in the form body, not in the URL
        HttpPost httpPost = new HttpPost("https://blog.csdn.net/search");
        // Declare the list that holds the form parameters
        List<NameValuePair>  params = new ArrayList<NameValuePair>();
        params.add(new BasicNameValuePair("keys", "java"));
        // Create the form entity
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params,"UTF-8");
        // Attach the form entity to the httpPost request
        httpPost.setEntity(formEntity);
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpPost);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

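3.5 Connection Pool

Creating an HttpClient for every request means creating and destroying clients and their connections over and over. A connection pool solves this.
Run the code below with a breakpoint and observe that each call obtains a different HttpClient object, while the underlying connections come from the shared pool.
The code is as follows: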

package com.txw.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
/**
 * Connection pool
 * @author Adair
 * @date 2022/3/15 9:54 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerSixth {
    public static void main(String[] args) {
        // Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        cm.setMaxTotal(200);
        // Set the maximum number of concurrent connections per host
        cm.setDefaultMaxPerRoute(20);
        doGet(cm);
        doGet(cm);
    }
    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // Build an HttpClient backed by the shared connection manager,
        // so connections are leased from the pool instead of being opened anew
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("https://blog.csdn.net");
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpGet);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content.length());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                // Do not close the HttpClient: closing it would shut down the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

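The idea behind pooling is that a finished connection is handed back and reused rather than closed and reopened. The toy model below illustrates only that idea. It is not how PoolingHttpClientConnectionManager is implemented, and the names TinyPoolSketch, Conn, borrow, and release are invented for the example. A real pool also tracks limits per route and makes callers wait for a free connection, whereas this sketch simply throws when the limit is reached.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TinyPoolSketch {

    /** Stand-in for a network connection; a real one would wrap a socket. */
    public static final class Conn {
        public final int id;
        Conn(int id) { this.id = id; }
    }

    private final Deque<Conn> idle = new ArrayDeque<>();  // connections waiting to be reused
    private final int maxTotal;                           // plays the role of setMaxTotal
    private int created = 0;

    public TinyPoolSketch(int maxTotal) {
        this.maxTotal = maxTotal;
    }

    /** Reuses an idle connection if there is one, otherwise opens a new one up to maxTotal. */
    public synchronized Conn borrow() {
        if (!idle.isEmpty()) {
            return idle.pop();
        }
        if (created >= maxTotal) {
            throw new IllegalStateException("pool exhausted");
        }
        return new Conn(++created);
    }

    /** Hands a connection back so the next borrow() can reuse it. */
    public synchronized void release(Conn c) {
        idle.push(c);
    }

    public synchronized int created() {
        return created;
    }

    public static void main(String[] args) {
        TinyPoolSketch pool = new TinyPoolSketch(2);
        Conn first = pool.borrow();
        pool.release(first);
        Conn second = pool.borrow();
        System.out.println(first == second);  // prints true: the connection was reused
        System.out.println(pool.created());   // prints 1: only one connection was ever opened
    }
}
```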

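3.6 Request Configuration

Sometimes, because of the network or the target server, a request needs more time to complete, so we customize the timeouts.
The code is as follows: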

package com.txw.crawler.test;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
/**
 * Request configuration
 * @author Adair
 * @date 2022/3/15 10:10 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public class CrawlerSeventh {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet request
        HttpGet httpGet = new HttpGet("https://blog.csdn.net");
        // Set the request parameters (all timeouts are in milliseconds)
        RequestConfig requestConfig = RequestConfig.custom()
                // Maximum time to establish the connection
                .setConnectTimeout(1000)
                // Maximum time to obtain a connection from the connection manager
                .setConnectionRequestTimeout(500)
                // Maximum time to wait for data to be transferred
                .setSocketTimeout(10 * 1000)
                .build();
        httpGet.setConfig(requestConfig);
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpGet);
            // Check whether the status code is 200
            if (response.getStatusLine().getStatusCode() == 200){
                // 200 means the request succeeded; read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                // Print the content
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources
            try {
                if (response != null){
                    response.close();
                }
                httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}


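4 Jsoup

After fetching a page we still need to parse it. String utilities or regular expressions can do the job, but both carry a high development cost, so we use a technology built specifically for parsing HTML pages.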
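To make that cost concrete, here is a small sketch of the regular-expression approach for a single field, the page title. The class name RegexTitleSketch and the pattern are illustrative assumptions rather than part of the original project. Every additional field would need its own hand-tuned pattern, and patterns like this break easily on nested or malformed markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitleSketch {

    // Matches <title ...>text</title> in any letter case, across line breaks
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or null when the page has no title element. */
    public static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(title("<html><head><TITLE> demo </TITLE></head></html>")); // prints demo
        System.out.println(title("<html><body>no title</body></html>"));             // prints null
    }
}
```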

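4.1 Introduction to jsoup

jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
The main features of jsoup are:
1. Parse HTML from a URL, a file, or a string;
2. Find and extract data with DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.
Add the following dependencies to pom.xml: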

        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
        <!--Testing-->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <!--Utilities-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>

4.2 Parsing with jsoup

4.2.1 Parsing a URL

jsoup can take a URL directly. It sends the request, fetches the data, and wraps the result in a Document object.
The code is as follows:

package com.txw.jsoup.test;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;
import java.net.URL;
/**
 * Jsoup test class {@link JsuopFirstTest}
 * @author Adair
 * @date 2022/3/15 11:19 AM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")     // Suppress all compiler warnings
public class JsuopFirstTest {

    /**
     * Parse a URL
     * @throws Exception
     */
    @Test
    public void testJsoupUrl() throws Exception{
        // Parse the URL: the first argument is the address to visit, the second is the timeout in milliseconds
        Document document = Jsoup.parse(new URL("https://blog.csdn.net"), 10000);
        // Get the title element
        Element title = document.getElementsByTag("title").first();
        System.out.println(title);
    }
}

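PS: Although jsoup can replace HttpClient and issue the request itself, it is rarely used that way. Real crawlers need multithreading, connection pools, proxies, and so on, and jsoup's support for these is limited, so jsoup is generally used only as an HTML parsing tool.

4.2.2 Parsing a String

First prepare an HTML file at D:/jsoup.html for the tests that follow.
The code is as follows: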

    /**
     * Parse a string
     * (requires the imports java.io.File and org.apache.commons.io.FileUtils)
     * @throws Exception
     */
    @Test
    public void testJsoupString() throws Exception {
        // Read the file into a string
        String html = FileUtils.readFileToString(new File("D:/jsoup.html"), "UTF-8");
        // Parse the string
        Document document = Jsoup.parse(html);
        // Get the text of the body element
        Element body = document.getElementsByTag("body").first();
        System.out.println(body.text());
    }

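The tests in this chapter read D:/jsoup.html, but the contents of that file are not reproduced in the text. The sketch below writes a stand-in file so the tests have something to run against. Only the ids (jsoup, city_bj), classes (city_con, class_a, s_name), the abc attribute, and the tag names are taken from the test code in this article; the page structure and the text inside it are assumptions, and the class name SampleHtmlSketch is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SampleHtmlSketch {

    // Hypothetical page: structure and text are invented, identifiers follow the tests
    public static final String HTML = String.join("\n",
            "<html>",
            "<head><title>jsoup demo</title></head>",
            "<body>",
            "<div id='jsoup' class='city'>",
            "  <h3 id='city_bj'>Beijing</h3>",
            "  <div class='city_con'>",
            "    <ul>",
            "      <li class='class_a'><span abc='123' class='s_name'>Haidian</span></li>",
            "      <li><span class='s_name'>Chaoyang</span></li>",
            "    </ul>",
            "  </div>",
            "</div>",
            "</body>",
            "</html>");

    /** Writes the sample page to the given path as UTF-8 and returns that path. */
    public static Path write(Path target) throws IOException {
        return Files.write(target, HTML.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // The tests read D:/jsoup.html; pass another path as the first argument if needed
        Path target = Paths.get(args.length > 0 ? args[0] : "D:/jsoup.html");
        System.out.println("Wrote " + write(target));
    }
}
```

If the file is written somewhere other than D:/jsoup.html, the path in each test method has to be changed to match.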

4.2.3 Parsing a File

jsoup can parse a file directly and wrap it in a Document object.
The code is as follows:

    /**
     * Parse a file
     * @throws Exception
     */
    @Test
    public void testJsoupHtml() throws Exception {
        // Parse the file
        Document document = Jsoup.parse(new File("D:/jsoup.html"),"UTF-8");
        // Get the text of the body element
        Element body = document.getElementsByTag("body").first();
        System.out.println(body.text());
    }

4.2.4 Traversing the Document with the DOM

Getting elements
1. By id: getElementById
2. By tag: getElementsByTag
3. By class: getElementsByClass
4. By attribute: getElementsByAttribute
The code is as follows:

    /**
     * Getting elements
     * @throws Exception
     */
    @Test
    public void testDom() throws Exception{
        // Parse the file to get the Document object
        Document document = Jsoup.parse(new File("D:/jsoup.html"),"UTF-8");
        // 1. By id: getElementById
        Element element = document.getElementById("city_bj");
        // 2. By tag: getElementsByTag
        element = document.getElementsByTag("title").first();
        // 3. By class: getElementsByClass
        element = document.getElementsByClass("s_name").last();
        // 4. By attribute: getElementsByAttribute
        element = document.getElementsByAttribute("abc").first();
        element = document.getElementsByAttributeValue("class", "city_con").first();
    }

Getting data from an element
1. The id: id
2. The class name: className
3. An attribute value: attr
4. All attributes: attributes
5. The text content: text
The code is as follows:

    /**
     * Getting data from an element
     * @throws Exception
     */
    @Test
    public void testDate() throws Exception {
        // Parse the file to get the Document object
        Document document = Jsoup.parse(new File("D:/jsoup.html"), "UTF-8");
        // Get the element
        Element element = document.getElementById("jsoup");
        // 1. Get the id of the element
        String str = element.id();
        // 2. Get the class name of the element
        str = element.className();
        // 3. Get the value of an attribute with attr
        str = element.attr("id");
        // 4. Get all attributes with attributes
        str = element.attributes().toString();
        // 5. Get the text content with text
        str = element.text();
    }

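4.2.5 Finding Elements with Selector Syntax

jsoup's Elements object supports a selector syntax similar to CSS (or jQuery), which makes lookups powerful and flexible. The select method is available on Document, Element, and Elements objects. It is context-sensitive, so it can filter within a given element or be chained.
The select method returns an Elements collection, which offers a set of methods for extracting and processing the results.

4.2.6 Selector Overview

tagname: find elements by tag, for example: span
#id: find elements by id, for example: #city_bj
.class: find elements by class name, for example: .class_a
[attribute]: find elements that have an attribute, for example: [abc]
[attr=value]: find elements by attribute value, for example: [class=s_name]
The code is as follows: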

    /**
     * Selector overview
     * (requires the import org.jsoup.select.Elements)
     * @throws Exception
     */
    @Test
    public void  testSelector() throws Exception {
        // Parse the file to get the Document object
        Document document = Jsoup.parse(new File("D:/jsoup.html"), "UTF-8");
        // tagname: find elements by tag, for example: span
        Elements span = document.select("span");
        for (Element element : span) {
            System.out.println(element.text());
        }
        // #id: find elements by id, for example: #city_bj
        String str = document.select("#city_bj").text();
        // .class: find elements by class name, for example: .class_a
        str = document.select(".class_a").text();
        // [attribute]: find elements that have an attribute, for example: [abc]
        str = document.select("[abc]").text();
        // [attr=value]: find elements by attribute value, for example: [class=s_name]
        str = document.select("[class=s_name]").text();
    }

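4.2.7 Combining Selectors

el#id: element + id, for example: h3#city_bj
el.class: element + class, for example: li.class_a
el[attr]: element + attribute name, for example: span[abc]
Any combination, for example: span[abc].s_name
ancestor child: find descendants of an element, for example: .city_con li finds every li under "city_con"
parent > child: find direct children of a parent, for example:
.city_con > ul > li finds the ul elements that are direct children of city_con, then the li elements that are direct children of those ul elements
parent > *: find all direct children of a parent
The code is as follows: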

    /**
     * Combining selectors
     * @throws Exception
     */
    @Test
    public void testSelectorGroup()throws Exception{
        // Parse the file to get the Document object
        Document document = Jsoup.parse(new File("D:/jsoup.html"), "UTF-8");
        // el#id: element + id, for example: h3#city_bj
        String str = document.select("h3#city_bj").text();
        // el.class: element + class, for example: li.class_a
        str = document.select("li.class_a").text();
        // el[attr]: element + attribute name, for example: span[abc]
        str = document.select("span[abc]").text();
        // Any combination, for example: span[abc].s_name
        str = document.select("span[abc].s_name").text();
        // ancestor child: find descendants of an element, for example: .city_con li finds every li under "city_con"
        str = document.select(".city_con li").text();
        // parent > child: find direct children of a parent,
        // for example: .city_con > ul > li finds the ul direct children of city_con, then their li direct children
        str = document.select(".city_con > ul > li").text();
        // parent > *: find all direct children of a parent, here .city_con > *
        str = document.select(".city_con > *").text();
    }

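5 Crawler Case Study

With HttpClient and jsoup we know how to fetch data and how to parse it. As a small exercise, we will now crawl mobile phone data from JD (京东).
The main purpose is to practice HttpClient and jsoup.

5.1 Requirements Analysis

First visit JD, search for mobile phones, and analyze the page. We will capture the following product data:
product image, price, title, and product detail page URL.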

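5.1.1 SPU and SKU

Besides the four attributes above, the Apple phone on the search page comes in four variants, and each of them should be captured. This requires understanding the concepts of SPU and SKU.
SPU = Standard Product Unit
An SPU is the smallest unit of aggregated product information: a reusable, easily searchable set of standardized information that describes the characteristics of a product. Put simply, products with the same attribute values and characteristics belong to one SPU.
For example, the Apple phone is an SPU that covers the red, dark gray, gold, and silver variants.
SKU = Stock Keeping Unit
An SKU is the unit for counting stock in and out, such as a piece, a box, or a pallet. It is the smallest stock unit that cannot be physically divided. How it is used depends on the line of business and the management model; it is most common for clothing and footwear.
For example, the Apple phone comes in several variants, and the red Apple phone is one SKU.
The page source shows the same distinction.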
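The relationship is one SPU to many SKUs, which can be pictured as a map from an SPU id to the list of SKU ids that belong to it. The sketch below groups a few rows that way. The ids are made up, and the class name SpuSkuSketch and the row layout {spu, sku} are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpuSkuSketch {

    /** Groups SKU ids under their SPU id; each row is {spu, sku}. */
    public static Map<Long, List<Long>> groupBySpu(long[][] rows) {
        Map<Long, List<Long>> bySpu = new LinkedHashMap<>();
        for (long[] row : rows) {
            // Create the list the first time an SPU is seen, then append the SKU
            bySpu.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return bySpu;
    }

    public static void main(String[] args) {
        // Made-up ids: one phone model (SPU 100) in three colours, plus a second model
        long[][] rows = { {100L, 1001L}, {100L, 1002L}, {100L, 1003L}, {200L, 2001L} };
        System.out.println(groupBySpu(rows)); // prints {100=[1001, 1002, 1003], 200=[2001]}
    }
}
```

For the crawler this means one search result (an SPU) can yield several rows, one per SKU, each sharing the same spu value.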

5.2 Development Preparation

5.2.1 Database Table Analysis

Based on the requirements analysis, we create the following table:

DROP TABLE IF EXISTS `jd_item`;
CREATE TABLE `jd_item`  (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `spu` bigint(15) NULL DEFAULT NULL COMMENT 'product set id (SPU)',
  `sku` bigint(15) NULL DEFAULT NULL COMMENT 'smallest product variant id (SKU)',
  `title` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'product title',
  `price` bigint(10) NULL DEFAULT NULL COMMENT 'product price',
  `pic` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'product image',
  `url` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'product detail URL',
  `created` datetime(0) NULL DEFAULT NULL COMMENT 'creation time',
  `updated` datetime(0) NULL DEFAULT NULL COMMENT 'update time',
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `sku`(`sku`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'JD product table' ROW_FORMAT = Dynamic;

5.2.2 Adding Dependencies

Development uses Spring Boot with Spring Data JPA and scheduled tasks.
Create a Maven project and add the following dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>
    <groupId>com.txw</groupId>
    <artifactId>crawler-first</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <!--MySQL connector-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!--HttpClient-->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!--Logging-->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>
        <!--Testing-->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <!--Utilities-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
    </dependencies>
</project>

5.2.3 Adding the Configuration File

Add an application.properties file with the following content:

# Database configuration
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.56.10:3306/crawler?characterEncoding=utf-8&useSSL=false
spring.datasource.username=root
spring.datasource.password=123456
# JPA configuration
spring.jpa.database=MySQL
spring.jpa.show-sql=true
spring.jpa.open-in-view=false

5.3 Implementation

5.3.1 Writing the POJO

Write the POJO according to the database table.
The code is as follows:

package com.txw.crawler.pojo;

import javax.persistence.*;
import java.util.Date;
/**
 * Product information
 * @author Adair
 * @date 2022/3/15 3:28 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
@Entity
@Table(name = "jd_item")
public class Item {
    // Primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // Standard product unit (product set)
    private Long spu;
    // Stock keeping unit (smallest product variant)
    private Long sku;
    // Product title
    private String title;
    // Product price
    private Double price;
    // Product image
    private String pic;
    // Product detail URL
    private String url;
    // Creation time
    private Date created;
    // Update time
    private Date updated;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public Long getSpu() {
        return spu;
    }

    public void setSpu(Long spu) {
        this.spu = spu;
    }

    public Long getSku() {
        return sku;
    }

    public void setSku(Long sku) {
        this.sku = sku;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    public String getPic() {
        return pic;
    }

    public void setPic(String pic) {
        this.pic = pic;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }
}

5.3.2 Writing the DAO

The code is as follows:

package com.txw.crawler.dao;

import com.txw.crawler.pojo.Item;
import org.springframework.data.jpa.repository.JpaRepository;
/**
 * Product persistence layer
 * @author Adair
 * @date 2022/3/15 3:32 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public interface ItemDao extends JpaRepository<Item,Long> {
}

5.3.3 Writing the Service

1. The service interface is as follows:

package com.txw.crawler.service;

import com.txw.crawler.pojo.Item;
import java.util.List;
/**
 * Product service interface
 * @author Adair
 * @date 2022/3/15 3:34 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
public interface ItemService {
    /**
     * Query data by example
     * @param item
     * @return
     */
    public List<Item> findAll(Item item);

    /**
     * Save data
     * @param item
     */
    public void save(Item item);
}

2. The implementation class is as follows:

package com.txw.crawler.service.impl;

import com.txw.crawler.dao.ItemDao;
import com.txw.crawler.pojo.Item;
import com.txw.crawler.service.ItemService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;
import javax.transaction.Transactional;
import java.util.List;
/**
 * Product service implementation
 * @author Adair
 * @date 2022/3/15 3:37 PM
 * @email 1578533828@qq.com
 */
@Service
@SuppressWarnings("all")  // Suppress all compiler warnings
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    /**
     * Query data by example
     * @param item the probe whose non-null fields become the query conditions
     * @return the matching items
     */
    @Override
    public List<Item> findAll(Item item) {
        // Build the query condition from the probe object
        Example<Item> example = Example.of(item);
        // Run the query with that condition
        List<Item> list = this.itemDao.findAll(example);
        return list;
    }

    /**
     * Save data
     * @param item
     */
    @Override
    @Transactional
    public void save(Item item) {
        this.itemDao.save(item);
    }
}

5.3.4 Writing the Bootstrap Class

The code is as follows:

package com.txw.crawler;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
/**
 * Bootstrap class
 * @author Adair
 * @date 2022/3/15 3:55 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
@SpringBootApplication
// Enable scheduled tasks
@EnableScheduling
public class CrawlerApplication {
    public static void main(String[] args) {
        SpringApplication.run(CrawlerApplication.class,args);
    }
}

5.3.5 Wrapping HttpClient

HttpClient is used often, so we wrap it in a utility class for convenience.
The code is as follows:

package com.txw.crawler.utils;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.UUID;
/**
 * HTTP utility class
 * @author Adair
 * @date 2022/3/15 4:00 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // Suppress all compiler warnings
@Component
public class HttpUtils {
    private PoolingHttpClientConnectionManager cm = null;
    private CloseableHttpClient httpClient = null;
    public HttpUtils() {
        cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        cm.setMaxTotal(200);
        // Set the maximum number of concurrent connections per host
        cm.setDefaultMaxPerRoute(20);
    }

    /**
     * Fetch page content
     * @param url
     * @return the page HTML, or null if the request fails
     */
    public String getHtml(String url) {
        // Get an HttpClient backed by the connection pool
        httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the HttpGet request object
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.222 Safari/537.36");
        // Apply the RequestConfig request parameters
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient and receive the response
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String html = "";
                // EntityUtils.toString throws if response.getEntity() returns null,
                // so check the entity for null first
                if (response.getEntity() != null) {
                    html = EntityUtils.toString(response.getEntity(), "UTF-8");
                }
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response
                    response.close();
                }
                // Do not close the client: it uses the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }

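    /**
     * Download an image.
     * @param url image address; may be absolute, protocol-relative ("//host/path") or a bare "host/path"
     * @return the generated file name, or null if the download failed
     */
    public String getImage(String url) {
        if (url == null || url.isEmpty()) {
            return null;
        }
        // Create the HttpClient on first use, backed by the shared connection manager
        if (httpClient == null) {
            httpClient = HttpClients.custom().setConnectionManager(cm).build();
        }
        // Build an absolute https URL from whichever form was passed in
        String fullUrl = url.startsWith("http") ? url
                : (url.startsWith("//") ? "https:" + url : "https://" + url);
        HttpGet httpGet = new HttpGet(fullUrl);
        // Apply the request settings (timeouts)
        httpGet.setConfig(this.getConfig());
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.222 Safari/537.36");
        CloseableHttpResponse response = null;
        try {
            // Execute the request
            response = httpClient.execute(httpGet);
            // On a 200 response with a body, save the image
            if (response.getStatusLine().getStatusCode() == 200 && response.getEntity() != null) {
                // File extension taken from the URL
                String extName = url.substring(url.lastIndexOf("."));
                // Use a UUID as the file name
                String imageName = UUID.randomUUID().toString() + extName;
                // Write the response body to the target file and close the stream afterwards
                try (OutputStream outstream = new FileOutputStream(new File("D:/images/" + imageName))) {
                    response.getEntity().writeTo(outstream);
                }
                // Return the generated file name
                return imageName;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response to release its connection
                    response.close();
                }
                // Do not close the client: its connections belong to the connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }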

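    /**
     * Build the request configuration.
     * @return a RequestConfig with the timeouts below (all in milliseconds)
     */
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                // Timeout for establishing a connection
                .setConnectTimeout(1000)
                // Timeout for obtaining a connection from the pool
                .setConnectionRequestTimeout(500)
                // Socket timeout while waiting for data
                .setSocketTimeout(10000)
                .build();

        return config;
    }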
}
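
The image download above derives two strings from the address it receives: the URL to request and the name of the file to write. The sketch below isolates that string handling so it can be tried without any network access. `ImageNames` and its methods are hypothetical helpers, not part of the original project, and it is an assumption here that image addresses may arrive protocol-relative (starting with `//`) or without a scheme.

```java
import java.util.UUID;

class ImageNames {

    /** Turns a protocol-relative or scheme-less address into an absolute https URL. */
    static String toAbsoluteUrl(String url) {
        if (url.startsWith("http://") || url.startsWith("https://")) {
            return url;
        }
        if (url.startsWith("//")) {
            return "https:" + url;
        }
        return "https://" + url;
    }

    /** Returns the extension of the last path segment (with the dot), or "" if there is none. */
    static String extensionOf(String url) {
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url;
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        return dot > slash ? path.substring(dot) : "";
    }

    /** Builds a collision-resistant file name that keeps the original extension. */
    static String uniqueName(String url) {
        return UUID.randomUUID().toString() + extensionOf(url);
    }

    public static void main(String[] args) {
        String pic = "//img.example.com/n1/phone.jpg"; // placeholder address
        System.out.println(toAbsoluteUrl(pic));
        System.out.println(uniqueName(pic));
    }
}
```

Checking that the dot comes after the last slash avoids treating the dot in a host name as a file extension when the path has none.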


5.3.6 Implementing the data crawl

A scheduled task lets the crawler fetch the latest data at regular intervals.
The code is as follows:

package com.txw.crawler.task;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.txw.crawler.pojo.Item;
import com.txw.crawler.service.ItemService;
import com.txw.crawler.utils.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import java.util.Date;
import java.util.List;
/**
 * Scheduled crawling task
 * @author Adair
 * @date 2022/3/15 4:18 PM
 * @email 1578533828@qq.com
 */
@SuppressWarnings("all")  // suppress compiler warnings
@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;

    @Autowired
    private ItemService itemService;

    public static final ObjectMapper MAPPER = new ObjectMapper();

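    // Fixed delay: the next run starts 100 seconds after the previous run finishes
    @Scheduled(fixedDelay = 100 * 1000)
    public void process() throws Exception {
        // Search URL found by inspecting the page (keyword is URL-encoded "手机", mobile phone);
        // page starts at 1 and increases by 2 for each following page
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&pvid=e9610a5e637248a7a96a6d2c82756007&s=176&click=0&page=";
        // Loop over the result pages: page = 1, 3, 5, 7, 9
        for (int i = 1; i < 10; i = i + 2) {
            // Request the page and get its HTML
            String html = this.httpUtils.getHtml(url + i);
            // Parse the page and save its items; skip pages that could not be fetched
            if (html != null) {
                this.parseHtml(html);
            }
        }
        System.out.println("Crawl finished");
    }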

    /**
     * Parse a search result page and save its items to the database.
     * @param html HTML of the search result page
     */
    private void parseHtml(String html) throws Exception {
        // Parse the page with jsoup
        Document document = Jsoup.parse(html);
        // Select the product entries
        Elements spus = document.select("div#J_goodsList > ul > li");
        // Iterate over the SPU entries
        for (Element spuEle : spus) {
            // SPU id of the product
            Long spuId = Long.parseLong(spuEle.attr("data-sku"));
            // SKU entries of the product
            Elements skus = spuEle.select("li[class='gl-item']");
            for (Element skuEle : skus) {
                // SKU id of the product
                Long skuId = Long.parseLong(skuEle.attr("data-sku"));
                // Check by SKU whether this item has already been crawled
                Item param = new Item();
                param.setSku(skuId);
                List<Item> list = this.itemService.findAll(param);
                if (list.size() > 0) {
                    // Already saved: move on to the next SKU
                    continue;
                }
                // Build the item to save
                Item item = new Item();
                item.setSpu(spuId);
                item.setSku(skuId);
                // Item detail page URL
                item.setUrl("https://item.jd.com/" + skuId + ".html");
                // Creation and update times
                item.setCreated(new Date());
                item.setUpdated(item.getCreated());
                // Item title, read from the detail page when it could be fetched
                String itemHtml = this.httpUtils.getHtml(item.getUrl());
                if (itemHtml != null) {
                    String title = Jsoup.parse(itemHtml).select("div.sku-name").text();
                    item.setTitle(title);
                }
                // Item price, returned as JSON by the price endpoint
                String priceUrl = "https://p.3.cn/prices/mgets?skuIds=J_" + skuId;
                String priceJson = this.httpUtils.getHtml(priceUrl);
                if (priceJson != null && !priceJson.isEmpty()) {
                    // Read the "p" field of the first array element
                    double price = MAPPER.readTree(priceJson).get(0).get("p").asDouble();
                    item.setPrice(price);
                }
                // Image address: the attribute may sit on the entry itself or on a nested element
                String pic = skuEle.select("[data-lazy-img]").attr("data-lazy-img").replace("/n9/","/n1/");
                System.out.println(pic);
                // Download the image and store the generated file name
                String picName = this.httpUtils.getImage(pic);
                item.setPic(picName);
                // Save the item
                this.itemService.save(item);
            }
        }
    }
}
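
The paging rule used by the task (the `page` parameter starts at 1 and grows by 2 for each results page) can be checked on its own. The following is a minimal sketch under that rule only; `SearchPages` and `pageUrls` are hypothetical names, and the base URL in `main` is a placeholder, not the real search address.

```java
import java.util.ArrayList;
import java.util.List;

class SearchPages {

    /** Builds the URLs of the first pageCount result pages: page = 1, 3, 5, ... */
    static List<String> pageUrls(String baseUrl, int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int n = 0; n < pageCount; n++) {
            int page = 2 * n + 1; // n-th results page maps to an odd page parameter
            urls.add(baseUrl + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Five result pages correspond to the loop "for (int i = 1; i < 10; i = i + 2)"
        for (String u : pageUrls("https://example.com/Search?page=", 5)) {
            System.out.println(u);
        }
    }
}
```

Generating the list up front makes it easy to confirm how many pages a given loop bound actually covers before any request is sent.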

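After running the application, check the database for the saved results.

Notes on this example:
1. Scraping JD data requires being logged in to your own account; otherwise nothing can be fetched.
2. After logging in to JD, press F12 to open the browser developer tools, look at the request headers of the logged-in request, and add that content to the request headers sent by the crawler.
This resolves the problem of the crawl being answered with a login prompt.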
