Web Crawlers
(Note: these are my notes from the Heima (itheima) course; problems encountered along the way, and their solutions, are collected at the end.)
A web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, auto-indexer, emulator, and worm.
1. A first crawler program
1.1 Environment setup
- Create a Maven project crawler_demo1 and add the dependencies below to pom.xml: httpclient fetches the pages, and slf4j-log4j12 provides logging.
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
- Create a log4j.properties file under resources:
log4j.rootLogger=DEBUG,A1
log4j.logger.com.xmx=DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
1.2 Example code
A minimal crawler that fetches the Xi'an Shiyou University homepage: http://www.xsyu.edu.cn/
package com.xmx.crawler.test;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:04
 * @Version 1.0
 */
public class Crawler1 {
    public static void main(String[] args) throws Exception {
        //1. "Open the browser": create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. "Type the address": create an HttpGet for the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        //3. "Press Enter": execute the request and receive the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        //4. Parse the response and extract the data
        //Only proceed when the status code is 200 (OK)
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "UTF-8");
            System.out.println(content);
        }
    }
}
2. HttpClient
A crawler is just a program that accesses web resources on our behalf. Browsers fetch pages over HTTP, and our crawler program uses the same protocol.
Here we use HttpClient, a Java HTTP client library, to fetch page data.
2.1 GET request
Fetch the Xi'an Shiyou University homepage; request URL: http://www.xsyu.edu.cn/
Core code:
//Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//Create the HttpGet and set the target URL
HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
//Execute the request with HttpClient and obtain the response
try {
    CloseableHttpResponse response = httpClient.execute(httpGet);
    //Parse the response
    if (response.getStatusLine().getStatusCode() == 200) { //status 200 means success
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpGet and set the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.2 GET request with parameters
Search the university site (the graduate-school search page); the address is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//Target address: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
//Build the URI with URIBuilder
URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
//Set the query parameter
uriBuilder.setParameter("wbtreeid", "1032");
//Multiple parameters can be chained, each with its own key:
//uriBuilder.setParameter("key1", "value1").setParameter("key2", "value2");
//2. Create the HttpGet from the built URI
HttpGet httpGet = new HttpGet(uriBuilder.build());
//3. Execute the request with HttpClient and obtain the response
try {
    CloseableHttpResponse response = httpClient.execute(httpGet);
    //Parse the response
    if (response.getStatusLine().getStatusCode() == 200) { //status 200 means success
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //Target address: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        //Build the URI with URIBuilder
        URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
        //Set the query parameter
        uriBuilder.setParameter("wbtreeid", "1032");
        //Multiple parameters can be chained, each with its own key:
        //uriBuilder.setParameter("key1", "value1").setParameter("key2", "value2");
        //2. Create the HttpGet from the built URI
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        System.out.println("Request line: " + httpGet);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.3 POST request
Fetch the Xi'an Shiyou University homepage with POST; request URL: http://www.xsyu.edu.cn/
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//2. Create the HttpPost and set the target URL
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");
CloseableHttpResponse response = null;
//3. Execute the request with HttpClient and obtain the response
try {
    response = httpClient.execute(httpPost);
    //4. Parse the response
    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpPost and set the target URL
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpPost);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.4 POST request with parameters
Search the Xi'an Shiyou University site with a POST request; the URL is: http://www.xsyu.edu.cn/search.jsp
The URL itself carries no parameters; the parameter wbtreeid=1032 is submitted in the form body.
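For context, a POST form body is just URL-encoded key=value pairs: UrlEncodedFormEntity produces the same application/x-www-form-urlencoded encoding that the JDK's java.net.URLEncoder applies. A minimal stdlib sketch (the class and method names here are hypothetical, not part of the course code):

```java
import java.net.URLEncoder;

public class FormEncodingDemo {
    //Encode one form field the way UrlEncodedFormEntity would
    public static String encodePair(String name, String value) throws Exception {
        return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        //ASCII values pass through unchanged
        System.out.println(encodePair("wbtreeid", "1032"));
        //Non-ASCII values become percent-encoded UTF-8 bytes
        System.out.println(encodePair("q", "手机"));
    }
}
```

Running this prints `wbtreeid=1032` and `q=%E6%89%8B%E6%9C%BA`, which is exactly what ends up in the POST body below.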
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//2. Create the HttpPost and set the target URL
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");
//Declare a List to hold the form parameters
List<NameValuePair> params = new ArrayList<NameValuePair>();
//Equivalent to requesting: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
params.add(new BasicNameValuePair("wbtreeid", "1032"));
//Create the form entity: first argument is the parameter list, second is the encoding
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
//Attach the form entity to the POST request
httpPost.setEntity(formEntity);
CloseableHttpResponse response = null;
//3. Execute the request with HttpClient and obtain the response
try {
    response = httpClient.execute(httpPost);
    //4. Parse the response
    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpPost and set the target URL
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");
        //Declare a List to hold the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        //Equivalent to requesting: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        params.add(new BasicNameValuePair("wbtreeid", "1032"));
        //Create the form entity: first argument is the parameter list, second is the encoding
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
        //Attach the form entity to the POST request
        httpPost.setEntity(formEntity);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpPost);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.5 Using a connection pool
Creating a new HttpClient for every request means constant construction and teardown, which wastes resources. A connection pool solves this.
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 15:22
 * @Version 1.0
 */
public class HttpClientPoolTest {
    public static void main(String[] args) {
        //Create the pooling connection manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        //Maximum total connections
        cm.setMaxTotal(100);
        //Maximum connections per route (per host)
        cm.setDefaultMaxPerRoute(10);
        //Issue requests through the pool
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        //Obtain the HttpClient from the pool instead of creating a new one each time
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            //Do not close the HttpClient here; its connections are managed by the pool
            //httpClient.close();
        }
    }
}
2.6 HttpClient request configuration
Custom connection timeouts guard against crawls failing because the network or the target server is slow.
Core code:
//Configure the request
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(1000)            //max time to establish a connection, in ms
        .setConnectionRequestTimeout(500)   //max time to obtain a connection, in ms
        .setSocketTimeout(10 * 1000)        //max time for data transfer, in ms
        .build();
//Apply the configuration to the request
httpGet.setConfig(config);
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpConfigTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpGet and set the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        //Configure the request
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            //max time to establish a connection, in ms
                .setConnectionRequestTimeout(500)   //max time to obtain a connection, in ms
                .setSocketTimeout(10 * 1000)        //max time for data transfer, in ms
                .build();
        //Apply the configuration to the request
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
3. Jsoup
3.1 Jsoup overview
jsoup is a Java HTML parser that can parse a URL or raw HTML text directly. It offers a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
jsoup's main capabilities:
- parse HTML from a URL, file, or string;
- find and extract data using DOM traversal or CSS selectors;
- manipulate HTML elements, attributes, and text.
3.2 Environment for Jsoup
Dependencies required for jsoup:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.7</version>
</dependency>
3.3 Parsing with jsoup
3.3.1 Parsing a URL
Jsoup can take a URL directly; it issues the request itself and wraps the result in a Document object.
@Test
public void testUrl() throws Exception {
    //Parse a URL: first argument is the address, second is the timeout in ms
    Document doc = Jsoup.parse(new URL("http://www.xsyu.edu.cn/"), 1000);
    //Use a tag selector to read the content of the title tag
    String title = doc.getElementsByTag("title").first().text();
    //Print it
    System.out.println(title);
}
3.3.2 Parsing a string
Jsoup can parse raw HTML from a string into a Document object.
@Test
public void testString() throws Exception {
    //Read the file into a string with commons-io's FileUtils
    String content = FileUtils.readFileToString(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Parse the string
    Document document = Jsoup.parse(content);
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}
3.3.3 Parsing a file
Jsoup can parse an HTML file directly into a Document object.
@Test
public void testFile() throws Exception {
    //Parse the file directly
    Document document = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}
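The file-based tests in this and the following sections read a local test.html (the long Windows path above), which is not reproduced in these notes. A minimal hypothetical version, consistent with the ids, classes, and attributes the later selector tests use (city_bj, class_a, s_name, the abc attribute, city_con), might look like:

```html
<!DOCTYPE html>
<html>
<head>
    <title>jsoup test page</title>
</head>
<body>
<div class="city_con">
    <h3 id="city_bj">北京</h3>
    <ul>
        <li id="test" class="class_a class_b">城市介绍</li>
        <li class="class_a">上海</li>
    </ul>
    <span abc="123" class="s_name">传智播客</span>
</div>
<a href="http://sh.itcast.cn">上海校区</a>
</body>
</html>
```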
3.3.4 Traversing the document with the DOM
3.3.4.1 Getting elements
Ways to get elements:
- by id: getElementById
- by tag: getElementsByTag
- by class: getElementsByClass
- by attribute: getElementsByAttribute
@Test
public void testDOM() throws Exception {
    //Parse the file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Get elements
    //1. by id: getElementById
    //Element element = doc.getElementById("city_bj");
    //2. by tag: getElementsByTag
    //Element element = doc.getElementsByTag("span").first();
    //3. by class: getElementsByClass
    //Element element = doc.getElementsByClass("class_a class_b").first();
    //Element element = doc.getElementsByClass("class_a").first();
    //Element element = doc.getElementsByClass("class_b").first();
    //4. by attribute: getElementsByAttribute
    //Element element = doc.getElementsByAttribute("abc").first();
    Element element = doc.getElementsByAttributeValue("href", "http://sh.itcast.cn").first();
    //Print the element's text
    System.out.println("Element content: " + element.text());
}
3.3.4.2 Getting data from an element
Data that can be read from an element:
- its id
- its className
- the value of an attribute: attr
- all of its attributes: attributes
- its text content: text
@Test
public void testData() throws Exception {
    //Parse the file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Get the element by id
    Element element = doc.getElementById("test");
    String str = "";
    //Read data from the element
    //1. id of the element
    //str = element.id();
    //2. class name(s) of the element
    //str = element.className();
    //Set<String> classSet = element.classNames();
    //for (String str1 : classSet) {
    //    System.out.println(str1);
    //}
    //3. value of an attribute: attr
    //str = element.attr("id");
    //str = element.attr("class");
    //4. all attributes: attributes
    Attributes attributes = element.attributes();
    System.out.println(attributes.toString());
    //5. text content: text
    str = element.text();
    //Print what was read
    System.out.println("Data: " + str);
}
3.3.5 Basic selectors
- tagname: find by tag, e.g. span
- #id: find by ID, e.g. #city_bj
- .class: find by class name, e.g. .class_a
- [attribute]: find by attribute, e.g. [abc]
- [attr=value]: find by attribute value, e.g. [class=s_name]
@Test
public void testSelector() throws Exception {
    //Parse the HTML file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //tagname: find by tag, e.g. span
    Elements elements = doc.select("span");
    for (Element element : elements) {
        System.out.println(element.text());
    }
    System.out.println("===================");
    //#id: find by ID, e.g. #city_bj
    Element element = doc.select("#city_bj").first();
    System.out.println(element.text());
    System.out.println("===================");
    //.class: find by class name, e.g. .class_a
    Element element1 = doc.select(".class_a").first();
    System.out.println(element1.text());
    System.out.println("===================");
    //[attribute]: find by attribute, e.g. [abc]
    Element element2 = doc.select("[abc]").first();
    System.out.println(element2.text());
    System.out.println("===================");
    //[attr=value]: find by attribute value, e.g. [class=s_name]
    Elements elements1 = doc.select("[class=s_name]");
    for (Element element3 : elements1) {
        System.out.println(element3.text());
    }
}
3.3.6 Combining selectors
- el#id: element + ID, e.g. h3#city_bj
- el.class: element + class, e.g. li.class_a
- el[attr]: element + attribute name, e.g. span[abc]
- any combination, e.g. span[abc].s_name
- ancestor child: descendants of an element, e.g. .city_con li finds all li under .city_con
- parent > child: direct children only, e.g.:
- .city_con > ul > li finds the direct ul children of .city_con, then the direct li children of those ul
- parent > *: all direct children of an element
@Test
public void testSelector2() throws Exception {
    //Get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //el#id: element + ID, e.g. h3#city_bj
    Element element = doc.select("h3#city_bj").first();
    System.out.println(element.text());
    System.out.println("===================");
    //el.class: element + class, e.g. li.class_a
    Element element1 = doc.select("li.class_a").first();
    System.out.println(element1.text());
    System.out.println("===================");
    //el[attr]: element + attribute name, e.g. span[abc]
    Element element2 = doc.select("span[abc]").first();
    System.out.println(element2.text());
    System.out.println("===================");
    //any combination, e.g. span[abc].s_name
    Element element3 = doc.select("span[abc].s_name").first();
    System.out.println(element3.text());
    System.out.println("===================");
    //ancestor child: e.g. .city_con li finds all li under .city_con
    Elements elements = doc.select(".city_con li");
    for (Element element4 : elements) {
        System.out.println(element4.text());
    }
    System.out.println("===================");
    //parent > child: direct children only, e.g.
    //.city_con > ul > li finds the direct ul children of .city_con, then the direct li children of those ul
    Elements elements1 = doc.select(".city_con > ul > li");
    for (Element element4 : elements1) {
        System.out.println(element4.text());
    }
    System.out.println("===================");
    //parent > *: all direct children of an element
    Elements elements2 = doc.select(".city_con > ul > *");
    for (Element element4 : elements2) {
        System.out.println(element4.text());
    }
}
The test code above has been pushed to Gitee (码云) under crawler_demo1.
4. Crawler case study
4.1 Requirements
Crawl mobile-phone listings from jd.com using Spring Boot, Spring Data JPA, and a scheduled task.
4.2 Setup
4.2.1 Create the database table
Create a MySQL database named crawler and, inside it, a jd_item table with the following structure:
CREATE TABLE `jd_item` (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `spu` bigint(15) DEFAULT NULL COMMENT 'product family (SPU) id',
  `sku` bigint(15) DEFAULT NULL COMMENT 'stock keeping unit (SKU) id',
  `title` varchar(100) DEFAULT NULL COMMENT 'item title',
  `price` double DEFAULT NULL COMMENT 'item price',
  `pic` varchar(200) DEFAULT NULL COMMENT 'item image',
  `url` varchar(200) DEFAULT NULL COMMENT 'item detail URL',
  `created` datetime DEFAULT NULL COMMENT 'created time',
  `updated` datetime DEFAULT NULL COMMENT 'updated time',
  PRIMARY KEY (`id`),
  KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='JD item table';
4.2.2 Add dependencies
Create a Maven project and add the required dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.2.RELEASE</version>
</parent>
<groupId>com.xmx</groupId>
<artifactId>crawler_jd</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>9</maven.compiler.source>
<maven.compiler.target>9</maven.compiler.target>
</properties>
<dependencies>
<!--SpringMVC-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!--SpringData Jpa-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!--MySQL connector-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
<!--Jsoup-->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<!--Utility library (commons-lang3)-->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
</dependencies>
</project>
4.2.3 Add the configuration file
Create application.properties:
#DB Configuration:
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler
spring.datasource.username=root
spring.datasource.password=root
#JPA Configuration:
spring.jpa.database=MySQL
spring.jpa.show-sql=true
4.3 Implementation
4.3.1 Write the pojo
Map the jd_item table to a pojo:
package com.xmx.jd.pojo;

import javax.persistence.*;
import java.util.Date;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:21
 * @Version 1.0
 */
@Entity
@Table(name = "jd_item")
public class Item {
    //primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    //standard product unit (product family)
    private Long spu;
    //stock keeping unit (smallest sellable unit)
    private Long sku;
    //item title
    private String title;
    //item price
    private Double price;
    //item image
    private String pic;
    //item detail URL
    private String url;
    //created time
    private Date created;
    //updated time
    private Date updated;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public Long getSpu() {
        return spu;
    }

    public void setSpu(Long spu) {
        this.spu = spu;
    }

    public Long getSku() {
        return sku;
    }

    public void setSku(Long sku) {
        this.sku = sku;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    public String getPic() {
        return pic;
    }

    public void setPic(String pic) {
        this.pic = pic;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }
}
4.3.2 Write the dao
package com.xmx.jd.dao;
import com.xmx.jd.pojo.Item;
import org.springframework.data.jpa.repository.JpaRepository;
/**
* @Author Xumx
* @Date 2021/3/17 14:25
* @Version 1.0
*/
public interface ItemDao extends JpaRepository<Item, Long> {
}
4.3.3 Write the Service
The ItemService interface:
package com.xmx.jd.service;
import com.xmx.jd.pojo.Item;
import java.util.List;
/**
* @Author Xumx
* @Date 2021/3/17 14:27
* @Version 1.0
*/
public interface ItemService {
    /*
     * Save an item
     */
    public void save(Item item);

    /*
     * Find all items matching the given example
     */
    public List<Item> findAll(Item item);
}
The ItemServiceImpl implementation:
package com.xmx.jd.service.impl;

import com.xmx.jd.dao.ItemDao;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;

import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:30
 * @Version 1.0
 */
@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Item> findAll(Item item) {
        //Build the query-by-example condition
        Example<Item> example = Example.of(item);
        //Query with the example condition
        List<Item> list = this.itemDao.findAll(example);
        return list;
    }
}
4.3.4 Write the bootstrap class
package com.xmx.jd;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:35
 * @Version 1.0
 */
@SpringBootApplication
//Scheduled tasks must be enabled explicitly with this annotation
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
4.3.5 Wrap HttpClient
Wrap HttpClient for convenience. (This version already includes the fix for Problem 2 below: the request simulates a desktop browser.)
package com.xmx.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:44
 * @Version 1.0
 */
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        //Maximum total connections
        this.cm.setMaxTotal(100);
        //Maximum connections per route (per host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /*
     * Download the page at the given URL
     */
    public String doGetHtml(String url) {
        //Obtain an HttpClient backed by the pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        //Create the HttpGet for the URL
        HttpGet httpGet = new HttpGet(url);
        //Set a User-Agent header to simulate a desktop browser (see Problem 2 below)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        //Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        //Execute the request and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //Parse the response and return the page
            if (response.getStatusLine().getStatusCode() == 200) {
                //Only use EntityUtils when the response entity is non-null
                if (response.getEntity() != null) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return the empty string on failure
        return "";
    }

    /*
     * Download an image
     */
    public String doGetImage(String url) {
        //Obtain an HttpClient backed by the pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        //Create the HttpGet for the URL
        HttpGet httpGet = new HttpGet(url);
        //Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        //Execute the request and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //Parse the response and save the image
            if (response.getStatusLine().getStatusCode() == 200) {
                if (response.getEntity() != null) {
                    //Keep the original file extension
                    String extName = url.substring(url.lastIndexOf("."));
                    //Rename the image with a random UUID
                    String picName = UUID.randomUUID().toString() + extName;
                    //Write the image into the images directory (the directory must already exist)
                    OutputStream outputStream = new FileOutputStream(new File("images/" + picName));
                    response.getEntity().writeTo(outputStream);
                    outputStream.close();
                    //Return the image file name
                    return picName;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return the empty string if the download failed
        return "";
    }

    //Request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            //max time to establish a connection
                .setConnectionRequestTimeout(500)   //max time to obtain a connection from the pool
                .setSocketTimeout(10000)            //max time for data transfer
                .build();
        return config;
    }
}
4.3.6 Implement the crawl
A scheduled task fetches the latest data periodically.
package com.xmx.jd.task;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;
import com.xmx.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 15:28
 * @Version 1.0
 */
@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;

    @Autowired
    private ItemService itemService;

    private static final ObjectMapper MAPPER = new ObjectMapper();

    int i = 1;

    @Scheduled(fixedDelay = 100 * 1000) //wait this long after one run finishes before starting the next
    public void itemTask() throws Exception {
        //Initial search URL; the page number is appended on each iteration
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq%22%20+%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&s=113&click=0&page=";
        //Walk the phone search-result pages (JD's page parameter advances in steps of 2)
        for (int i = 0; i < 50; i = i + 2) {
            String html = httpUtils.doGetHtml(url + i);
            //System.out.println(html);
            //Parse the page; extract and store the items
            this.parse(html);
        }
        System.out.println("Phone data crawl finished! " + i++);
    }

    //Parse the page; extract and store the items
    private void parse(String html) throws Exception {
        //Parse the HTML into a Document
        Document doc = Jsoup.parse(html);
        //Get the SPU entries
        Elements spuEles = doc.select("div#J_goodsList > ul > li");
        for (Element spuEle : spuEles) {
            //Get the SKU wrappers
            Elements skuEles = spuEle.select("div.gl-i-wrap");
            //System.out.println(skuEles);
            for (Element skuEle : skuEles) {
                Item item = new Item();
                Elements select = skuEle.select("div.p-img");
                //System.out.println(select);
                for (Element element : select) {
                    //Get the item image
                    String picUrl = "https:" + element.select("img[data-lazy-img]").first().attr("data-lazy-img");
                    picUrl = picUrl.replace("/n9/", "/n1/");
                    String picName = this.httpUtils.doGetImage(picUrl);
                    item.setPic(picName);
                    String attr = element.select("a").attr("title");
                    item.setTitle(attr);
                    long spu = Long.parseLong(element.select("div").attr("data-venid"));
                    //Set the item's SPU
                    item.setSpu(spu);
                }
                Elements select1 = skuEle.select("div.p-operate > a");
                for (int i = 0; i < 1; i++) {
                    long sku = Long.parseLong(select1.get(1).attr("data-sku"));
                    item.setSku(sku);
                    //Build the item detail URL from the SKU
                    String itemUrl = "https://item.jd.com/" + sku + ".html";
                    item.setUrl(itemUrl);
                }
                //Query for an existing item with the same SKU
                List<Item> list = this.itemService.findAll(item);
                if (list.size() > 0) {
                    //The item already exists; skip it
                    continue;
                }
                //Get the item price
                Elements priEle = skuEle.select("div.p-price");
                //System.out.println(priEle);
                for (Element element : priEle) {
                    String pri = element.select("i").first().text();
                    //System.out.println(pri.length());
                    double price = Double.parseDouble(pri);
                    item.setPrice(price);
                }
                item.setCreated(new Date());
                item.setUpdated(item.getCreated());
                //Save the item to the database
                this.itemService.save(item);
            }
        }
    }
}
The case-study code above has been pushed to Gitee (码云) under crawler_jd.
5. Problems encountered and solutions
5.1 Problem 1
5.1.1 Description
While writing the pojo entity class, IDEA reports: Cannot resolve table 'jd_item'.
5.1.2 Solution
IDEA is not yet linked to the MySQL database; add the database as a data source in IDEA's Database tool window.
Note that clicking Test Connection may fail with a timezone mismatch between MySQL and IDEA.
The simplest fix is to append ?serverTimezone=GMT to the JDBC URL; the connection then succeeds.
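For reference, the same timezone parameter can also be applied to the project's own datasource URL (a hypothetical variant of the application.properties shown earlier; the database name and port are the ones assumed above):

```properties
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler?serverTimezone=GMT
```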
5.2 Problem 2
5.2.1 Description
The program ran but fetched no data; printing the fetched content showed a page that redirects to the JD login page.
5.2.2 Solution
Set a User-Agent header on the request in HttpUtils to simulate a desktop browser:
httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");