Java Crawler Notes 1

Web Crawlers

(Note: these are my notes from studying the Heima (itheima) course; problems encountered along the way and their solutions appear at the end of the article.)
A web crawler (also known as a web spider or web robot, and, in the FOAF community, more often as a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

1. A First Crawler Program

1.1 Environment Preparation
  1. Create the Maven project crawler_demo1 and add the dependencies to pom.xml: httpclient for fetching pages, slf4j for logging.
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
            <scope>test</scope>
        </dependency>
  2. Create the log4j.properties file under the resources directory
log4j.rootLogger=DEBUG,A1
log4j.logger.cn.itcast = DEBUG

log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

1.2 Example Code

Let's write the simplest possible crawler: fetch the home page of Xi'an Shiyou University at http://www.xsyu.edu.cn/

package com.xmx.crawler.test;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:04
 * @Version 1.0
 */
public class Crawler1 {

    public static void main(String[] args) throws Exception {
        //1. "Open the browser": create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2. "Enter the address": create an HttpGet object for the URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");

        //3. "Hit enter": send the request with the HttpClient object and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);

        //4. Parse the response and extract the data
        //Check whether the status code is 200
        if (response.getStatusLine().getStatusCode() == 200){
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "UTF-8");
            System.out.println(content);
        }
    }
}

2. HttpClient

A web crawler is simply a program that accesses resources on the network for us. We have always used the HTTP protocol to visit web pages on the internet, and a crawler program does the same thing: it accesses pages over HTTP.
Here we use HttpClient, a Java HTTP client library, to fetch web page data.

2.1 GET Request

Access the Xi'an Shiyou University website; the request URL is: http://www.xsyu.edu.cn/
Core code:

//Create the HttpClient object
CloseableHttpClient httpClient = HttpClients.createDefault();
//Create the HttpGet object and set the URL to access
HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
//Send the request with HttpClient and get the response
try {
	CloseableHttpResponse response = httpClient.execute(httpGet);
	//Parse the response
	if (response.getStatusLine().getStatusCode()==200){    // status code 200 means the request succeeded
    	String content = EntityUtils.toString(response.getEntity(), "utf8");
    	System.out.println(content.length());
    }
}catch(IOException e) {
    e.printStackTrace();
}


Example code:

package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetTest {
    public static void main(String[] args){
        //1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2. Create the HttpGet request and set the URL to access
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");

        CloseableHttpResponse response = null;
        //3. Send the request with HttpClient and get the response
        try {
            response = httpClient.execute(httpGet);

            //4. Parse the response
            if (response.getStatusLine().getStatusCode()==200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response (guard against a failed request leaving it null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

2.2 GET Request with Parameters

Search for "graduate student" (研究生) on the Xi'an Shiyou University site; the address is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
Core code:

//1. Create the HttpClient object
CloseableHttpClient httpClient = HttpClients.createDefault();

//The target address is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
//Create a URIBuilder
URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
//Set the parameter
uriBuilder.setParameter("wbtreeid","1032");
//For multiple parameters, chain further setParameter calls
//uriBuilder.setParameter("wbtreeid","1032").setParameter("wbtreeid","1032");

//2. Create the HttpGet request and set the URL to access
HttpGet httpGet = new HttpGet(uriBuilder.build());

//3. Send the request with HttpClient and get the response
try {
	CloseableHttpResponse response = httpClient.execute(httpGet);
	//Parse the response
	if (response.getStatusLine().getStatusCode()==200){    // status code 200 means the request succeeded
    	String content = EntityUtils.toString(response.getEntity(), "utf8");
    	System.out.println(content.length());
    }
}catch(IOException e) {
    e.printStackTrace();
}
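What URIBuilder.build() assembles here can be sketched using only the JDK's java.net.URI class. The helper below is hypothetical (UriSketch and buildUrl are names invented for illustration); real code should keep URIBuilder, which also handles parameter encoding and repeated parameters:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSketch {
    // Append a single query parameter to a base URL, roughly what URIBuilder does here.
    static String buildUrl(String base, String name, String value) {
        try {
            URI u = new URI(base);
            // Reassemble the URI with the query string attached.
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(), name + "=" + value, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Produces the same address the crawler requests above.
        System.out.println(buildUrl("http://www.xsyu.edu.cn/search.jsp", "wbtreeid", "1032"));
        // prints http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
    }
}
```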

Example code:

package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.net.URISyntaxException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //The target address is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        //Create a URIBuilder
        URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
        //Set the parameter
        uriBuilder.setParameter("wbtreeid","1032");
        //For multiple parameters, chain further setParameter calls
        //uriBuilder.setParameter("wbtreeid","1032").setParameter("wbtreeid","1032");

        //2. Create the HttpGet request and set the URL to access
        HttpGet httpGet = new HttpGet(uriBuilder.build());

        System.out.println("Request info: "+httpGet);

        CloseableHttpResponse response = null;
        //3. Send the request with HttpClient and get the response
        try {
            response = httpClient.execute(httpGet);

            //4. Parse the response
            if (response.getStatusLine().getStatusCode()==200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response (guard against a failed request leaving it null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

2.3 POST Request

Access the Xi'an Shiyou University website with POST; the request URL is: http://www.xsyu.edu.cn/
Core code:

//1. Create the HttpClient object
CloseableHttpClient httpClient = HttpClients.createDefault();

//2. Create the HttpPost request and set the URL to access
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");

CloseableHttpResponse response = null;
//3. Send the request with HttpClient and get the response
try {
    response = httpClient.execute(httpPost);

    //4. Parse the response
    if (response.getStatusLine().getStatusCode()==200){
          String content = EntityUtils.toString(response.getEntity(), "utf8");
          System.out.println(content.length());
    }
}catch(IOException e) {
    e.printStackTrace();
}

Example code:

package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostTest {
    public static void main(String[] args){
        //1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2. Create the HttpPost request and set the URL to access
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");

        CloseableHttpResponse response = null;
        //3. Send the request with HttpClient and get the response
        try {
            response = httpClient.execute(httpPost);

            //4. Parse the response
            if (response.getStatusLine().getStatusCode()==200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response (guard against a failed request leaving it null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

2.4 POST Request with Parameters

Search for "graduate student" (研究生) on the Xi'an Shiyou University site using a POST request; the URL is: http://www.xsyu.edu.cn/search.jsp
The URL itself carries no parameters; the parameter wbtreeid=1032 is submitted in the form body
Core code:

//1. Create the HttpClient object
CloseableHttpClient httpClient = HttpClients.createDefault();

//2. Create the HttpPost request and set the URL to access
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");

//Declare a List to hold the form parameters
List<NameValuePair> params = new ArrayList<NameValuePair>();
//The effective request is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
params.add(new BasicNameValuePair("wbtreeid","1032"));

//Create the form Entity; the first argument is the form data, the second the encoding
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params,"utf8");

//Attach the form Entity to the POST request
httpPost.setEntity(formEntity);
CloseableHttpResponse response = null;
//3. Send the request with HttpClient and get the response
try {
	response = httpClient.execute(httpPost);

	//4. Parse the response
	if (response.getStatusLine().getStatusCode()==200){
		String content = EntityUtils.toString(response.getEntity(), "utf8");
		System.out.println(content.length());
	}
}catch(IOException e) {
	e.printStackTrace();
}
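For intuition, the request body that UrlEncodedFormEntity builds from the parameter list can be sketched with the JDK's URLEncoder alone. FormBodySketch and encode are hypothetical names for illustration; real code should keep UrlEncodedFormEntity, which also sets the Content-Type header:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {
    // Encode name/value pairs as application/x-www-form-urlencoded, e.g. "a=1&b=2".
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                body.add(URLEncoder.encode(e.getKey(), "UTF-8") + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new RuntimeException(ex); // UTF-8 is always supported
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("wbtreeid", "1032");
        System.out.println(encode(params)); // prints wbtreeid=1032
    }
}
```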

Example code:

package com.xmx.crawler.test;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2. Create the HttpPost request and set the URL to access
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");

        //Declare a List to hold the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        //The effective request is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        params.add(new BasicNameValuePair("wbtreeid","1032"));

        //Create the form Entity; the first argument is the form data, the second the encoding
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params,"utf8");

        //Attach the form Entity to the POST request
        httpPost.setEntity(formEntity);

        CloseableHttpResponse response = null;
        //3. Send the request with HttpClient and get the response
        try {
            response = httpClient.execute(httpPost);

            //4. Parse the response
            if (response.getStatusLine().getStatusCode()==200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response (guard against a failed request leaving it null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

2.5 Using a Connection Pool

Creating a new HttpClient for every request means frequent creation and destruction, which consumes and wastes resources. A connection pool solves this problem.
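The idea behind pooling can be illustrated with a minimal, hypothetical object pool (PoolSketch is invented for illustration and is not HttpClient's actual implementation): released objects go back onto an idle queue and are reused, so repeated requests do not trigger repeated creation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PoolSketch {
    // A minimal pool: reuse released objects instead of creating a new one each time.
    static class SimplePool {
        private final Deque<Object> idle = new ArrayDeque<>();
        int created = 0;

        Object acquire() {
            Object o = idle.poll();       // reuse an idle object if one is available
            if (o == null) {
                o = new Object();         // only create when the pool is empty
                created++;
            }
            return o;
        }

        void release(Object o) {
            idle.push(o);                 // return the object for later reuse
        }
    }

    public static void main(String[] args) {
        SimplePool pool = new SimplePool();
        for (int i = 0; i < 100; i++) {
            Object conn = pool.acquire(); // stands in for borrowing a connection
            pool.release(conn);
        }
        System.out.println(pool.created); // prints 1: a hundred requests, one object created
    }
}
```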

package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 15:22
 * @Version 1.0
 */
public class HttpClientPoolTest {
    public static void main(String[] args) {
        //Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

        //Set the maximum total number of connections
        cm.setMaxTotal(100);

        //Set the maximum number of connections per host
        cm.setDefaultMaxPerRoute(10);

        //Send requests through the pool manager
        doGet(cm);
        doGet(cm);

    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        //Instead of creating a new HttpClient each time, obtain one backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();

        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");

        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);

            if (response.getStatusLine().getStatusCode() == 200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            if (response!=null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                //Do not close the HttpClient here; the connection pool manages it
                //httpClient.close();
            }
        }
    }
}

2.6 HttpClient Request Configuration

We can customize a request's connection timeouts, to avoid crawl failures caused by slow networks or slow target servers.
Core code:

//Configure the request
RequestConfig config = RequestConfig.custom().setConnectTimeout(1000)  //maximum time to establish the connection, in milliseconds
	.setConnectionRequestTimeout(500)   //maximum time to obtain a connection from the pool, in milliseconds
	.setSocketTimeout(10*1000)   //maximum time for data transfer, in milliseconds
	.build();

//Apply the configuration to the request
httpGet.setConfig(config);

Example code:

package com.xmx.crawler.test;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpConfigTest {
    public static void main(String[] args){
        //1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2. Create the HttpGet request and set the URL to access
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");

        //Configure the request
        RequestConfig config = RequestConfig.custom().setConnectTimeout(1000)  //maximum time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500)   //maximum time to obtain a connection from the pool, in milliseconds
                .setSocketTimeout(10*1000)   //maximum time for data transfer, in milliseconds
                .build();

        //Apply the configuration to the request
        httpGet.setConfig(config);

        CloseableHttpResponse response = null;
        //3. Send the request with HttpClient and get the response
        try {
            response = httpClient.execute(httpGet);

            //4. Parse the response
            if (response.getStatusLine().getStatusCode()==200){
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response (guard against a failed request leaving it null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

3. Jsoup

3.1 Introduction to Jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.

jsoup's main features are:

  1. Parse HTML from a URL, file, or string;
  2. Find and extract data using DOM traversal or CSS selectors;
  3. Manipulate HTML elements, attributes, and text;
3.2 Environment Preparation for Jsoup

The dependencies required for jsoup are as follows:

      	<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>

3.3 Parsing with jsoup
3.3.1 Parsing a URL with Jsoup

Jsoup can take a URL directly: it sends the request, fetches the data, and wraps the result in a Document object

@Test
    public void testUrl() throws Exception{
        //Parse the URL; the first argument is the URL to access, the second is the request timeout
        Document doc = Jsoup.parse(new URL("http://www.xsyu.edu.cn/"), 1000);

        //Use the tag selector to get the content of the title tag
        String title = doc.getElementsByTag("title").first().text();

        //Print it
        System.out.println(title);
    }
3.3.2 Parsing a String with Jsoup

Jsoup can take a string directly and wrap it in a Document object

    @Test
    public void testString() throws Exception{
        //Read the file into a string with the commons-io utility class
        String content = FileUtils.readFileToString(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");

        //Parse the string
        Document document = Jsoup.parse(content);

        String title = document.getElementsByTag("title").first().text();
        System.out.println(title);
    }
3.3.3 Parsing a File with jsoup

Jsoup can parse a file directly and wrap it in a Document object

    @Test
    public void testFile() throws Exception{
        //Parse the file
        Document document = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"),"utf8");

        String title = document.getElementsByTag("title").first().text();

        System.out.println(title);

    }
3.3.4 Traversing the Document with the DOM
3.3.4.1 Getting Elements

Getting elements:

  1. By id, with getElementById
  2. By tag, with getElementsByTag
  3. By class, with getElementsByClass
  4. By attribute, with getElementsByAttribute
    @Test
    public void testDOM() throws Exception{

        //Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");

        //Get elements
        //1. By id, with getElementById
        //Element element = doc.getElementById("city_bj");

        //2. By tag, with getElementsByTag
        //Element element = doc.getElementsByTag("span").first();

        //3. By class, with getElementsByClass
        //Element element = doc.getElementsByClass("class_a class_b").first();
        //Element element = doc.getElementsByClass("class_a").first();
        //Element element = doc.getElementsByClass("class_b").first();

        //4. By attribute, with getElementsByAttribute
        //Element element = doc.getElementsByAttribute("abc").first();
        Element element = doc.getElementsByAttributeValue("href", "http://sh.itcast.cn").first();

        //Print the element's content
        System.out.println("Element content: "+element.text());

    }
3.3.4.2 Getting Data from an Element

Getting data from an element:

  1. The id, with id
  2. The className
  3. An attribute value, with attr
  4. All attributes, with attributes
  5. The text content, with text
    @Test
    public void testData() throws Exception{
        //Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");

        //Get the element by id
        Element element = doc.getElementById("test");

        String str = "";

        //Get data from the element
        //1. The id
        //str = element.id();

        //2. The className
//        str = element.className();
//        Set<String> classSet = element.classNames();
//        for (String str1: classSet) {
//            System.out.println(str1);
//        }

        //3. An attribute value, with attr
        //str = element.attr("id");

        //4. All attributes, with attributes
        Attributes attributes = element.attributes();
        System.out.println(attributes.toString());

        //5. The text content, with text
        str = element.text();

        //Print what was retrieved
        System.out.println("Retrieved data: "+str);

    }
3.3.5 Selectors
  1. tagname: find elements by tag name, e.g. span
  2. #id: find elements by ID, e.g. #city_bj
  3. .class: find elements by class name, e.g. .class_a
@Test
    public void testSelector() throws Exception{

        //Parse the HTML file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");

        //tagname: find elements by tag name, e.g. span
        Elements elements = doc.select("span");
        for (Element element : elements) {
            System.out.println(element.text());
        }
        System.out.println("===================");
        //#id: find elements by ID, e.g. #city_bj
        Element element = doc.select("#city_bj").first();
        System.out.println(element.text());
        System.out.println("===================");

        //.class: find elements by class name, e.g. .class_a
        Element element1 = doc.select(".class_a").first();
        System.out.println(element1.text());
        System.out.println("===================");


        //[attribute]: find elements by attribute, e.g. [abc]
        Element element2 = doc.select("[abc]").first();
        System.out.println(element2.text());
        System.out.println("===================");

        //[attr=value]: find elements by attribute value, e.g. [class=s_name]
        Elements elements1 = doc.select("[class=s_name]");
        for (Element element3 : elements1) {
            System.out.println(element3.text());
        }
    }

3.3.6 Combining Selectors
  1. el#id: element + ID, e.g. h3#city_bj
  2. el.class: element + class, e.g. li.class_a
  3. el[attr]: element + attribute name, e.g. span[abc]
  4. Any combination, e.g. span[abc].s_name
  5. ancestor child: find descendant elements, e.g. .city_con li finds all li under "city_con"
  6. parent > child: find direct children of a parent, e.g. .city_con > ul > li finds the ul that is a direct child of city_con, then all li directly under that ul
  7. parent > *: find all direct children of a parent
    @Test
    public void testSelector2() throws Exception{
        //Get the Document
        Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");

        //el#id: element + ID, e.g. h3#city_bj
        Element element = doc.select("h3#city_bj").first();
        System.out.println(element.text());
        System.out.println("===================");

        //el.class: element + class, e.g. li.class_a
        Element element1 = doc.select("li.class_a").first();
        System.out.println(element1.text());
        System.out.println("===================");

        //el[attr]: element + attribute name, e.g. span[abc]
        Element element2 = doc.select("span[abc]").first();
        System.out.println(element2.text());
        System.out.println("===================");

        //Any combination, e.g. span[abc].s_name
        Element element3 = doc.select("span[abc].s_name").first();
        System.out.println(element3.text());
        System.out.println("===================");

        //ancestor child: find descendant elements, e.g. .city_con li finds all li under "city_con"
        Elements elements = doc.select(".city_con li");
        for (Element element4 : elements) {
            System.out.println(element4.text());
        }
        System.out.println("===================");

        //parent > child: find direct children of a parent
        //.city_con > ul > li finds the ul directly under city_con, then all li directly under that ul
        Elements elements1 = doc.select(".city_con > ul > li");
        for (Element element4 : elements1) {
            System.out.println(element4.text());
        }
        System.out.println("===================");

        //parent > *: find all direct children of a parent
        Elements elements2 = doc.select(".city_con > ul > *");
        for (Element element4 : elements2) {
            System.out.println(element4.text());
        }
    }

The test code above has been uploaded to Gitee: crawler_demo1

4. A Crawler Case Study

4.1 Requirements Analysis

Use Spring Boot + Spring Data JPA and a scheduled task to crawl mobile phone listings from the JD.com mall.

4.2 Development Setup
4.2.1 Create the Database Table

In MySQL, create a database named crawler and, inside it, a table jd_item with the following structure:

CREATE TABLE `jd_item` (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `spu` bigint(15) DEFAULT NULL COMMENT 'product family id (SPU)',
  `sku` bigint(15) DEFAULT NULL COMMENT 'stock keeping unit id (SKU)',
  `title` varchar(100) DEFAULT NULL COMMENT 'product title',
  `price` bigint(10) DEFAULT NULL COMMENT 'product price',
  `pic` varchar(200) DEFAULT NULL COMMENT 'product image',
  `url` varchar(200) DEFAULT NULL COMMENT 'product detail URL',
  `created` datetime DEFAULT NULL COMMENT 'created time',
  `updated` datetime DEFAULT NULL COMMENT 'updated time',
  PRIMARY KEY (`id`),
  KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='JD product table';

4.2.2 Add Dependencies

Create a Maven project and add the required dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <groupId>com.xmx</groupId>
    <artifactId>crawler_jd</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>9</maven.compiler.source>
        <maven.compiler.target>9</maven.compiler.target>
    </properties>

    <dependencies>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!--MySQL connector-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>

        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>

        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

        <!--Utility library-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
    </dependencies>
</project>
4.2.3 Add the Configuration File

Add the application.properties configuration file

#DB Configuration:
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler
spring.datasource.username=root
spring.datasource.password=root

#JPA Configuration:
spring.jpa.database=MySQL
spring.jpa.show-sql=true

4.3 Implementation
4.3.1 Write the POJO

Write the POJO based on the database table

package com.xmx.jd.pojo;

import javax.persistence.*;
import java.util.Date;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:21
 * @Version 1.0
 */
@Entity
@Table(name = "jd_item")
public class Item {
    //Primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    //Standard product unit (product family)
    private Long spu;
    //Stock keeping unit (smallest sellable unit)
    private Long sku;
    //Product title
    private String title;
    //Product price
    private Double price;
    //Product image
    private String pic;
    //Product detail URL
    private String url;
    //Created time
    private Date created;
    //Updated time
    private Date updated;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public Long getSpu() {
        return spu;
    }

    public void setSpu(Long spu) {
        this.spu = spu;
    }

    public Long getSku() {
        return sku;
    }

    public void setSku(Long sku) {
        this.sku = sku;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    public String getPic() {
        return pic;
    }

    public void setPic(String pic) {
        this.pic = pic;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }
}
4.3.2 Write the DAO
package com.xmx.jd.dao;

import com.xmx.jd.pojo.Item;

import org.springframework.data.jpa.repository.JpaRepository;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:25
 * @Version 1.0
 */
public interface ItemDao extends JpaRepository<Item, Long> {
}

4.3.3 Write the Service

The ItemService interface

package com.xmx.jd.service;

import com.xmx.jd.pojo.Item;

import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:27
 * @Version 1.0
 */
public interface ItemService {

    /*
    *   Save an item
    * */
    public void save(Item item);

    /*
    *   Find items matching the given conditions
    * */
    public List<Item> findAll(Item item);

}

The ItemServiceImpl implementation class

package com.xmx.jd.service.impl;

import com.xmx.jd.dao.ItemDao;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;

import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:30
 * @Version 1.0
 */
@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Item> findAll(Item item) {
        //Declare the query conditions
        Example<Item> example = Example.of(item);

        //Query the data by the given conditions
        List<Item> list = this.itemDao.findAll(example);

        return list;
    }
}

4.3.4 Write the Bootstrap Class
package com.xmx.jd;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:35
 * @Version 1.0
 */
@SpringBootApplication
//Scheduled tasks must be enabled before use, which requires this annotation
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class,args);
    }
}

4.3.5 Encapsulate HttpClient

Wrap HttpClient to make it easier to use. (This version already incorporates the fix for Problem 2 described later: it sets a User-Agent header to simulate a desktop browser.)

package com.xmx.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:44
 * @Version 1.0
 */
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();

		
        //Set the maximum total number of connections
        this.cm.setMaxTotal(100);

        //Set the maximum number of connections per route (host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /*
     * Download the page at the given URL and return its HTML
     */
    public String doGetHtml(String url){
        //Get an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();

        //Create the HttpGet request object with the URL
        HttpGet httpGet = new HttpGet(url);

        //Set the User-Agent header to imitate a desktop browser
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        //Set the request configuration (timeouts)
        httpGet.setConfig(this.getConfig());


        CloseableHttpResponse response = null;
        //Execute the request and get the response
        try {
            response = httpClient.execute(httpGet);

            //Parse the response and return the result
            if (response.getStatusLine().getStatusCode() == 200){
                //Only use EntityUtils if the response entity is non-null
                if (response.getEntity() != null){
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            //Close the response
            if (response!=null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return an empty string on failure
        return "";
    }



    /*
     * Download an image and return the generated file name
     */
    public String doGetImage(String url){
        //Get an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();

        //Create the HttpGet request object with the URL
        HttpGet httpGet = new HttpGet(url);

        //Set the request configuration (timeouts)
        httpGet.setConfig(this.getConfig());


        CloseableHttpResponse response = null;
        //Execute the request and get the response
        try {
            response = httpClient.execute(httpGet);

            //Parse the response and return the result
            if (response.getStatusLine().getStatusCode() == 200){
                //Only proceed if the response entity is non-null
                if (response.getEntity() != null){
                    //Get the file extension from the URL
                    String extName = url.substring(url.lastIndexOf("."));
                    //Rename the image with a random UUID to avoid collisions
                    String picName = UUID.randomUUID().toString()+extName;

                    //Write the image into the images directory
                    //(the two-argument File constructor inserts the path separator)
                    OutputStream outputStream = new FileOutputStream(new File("images", picName));
                    response.getEntity().writeTo(outputStream);
                    outputStream.close();

                    //Return the image file name
                    return picName;

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            //Close the response
            if (response!=null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return an empty string if the download failed
        return "";
    }

    //Build the request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)   //maximum time to establish a connection
                .setConnectionRequestTimeout(500)   //maximum time to obtain a connection from the pool
                .setSocketTimeout(10000)       //maximum time for data transfer
                .build();
        return config;
    }

}
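The image-naming logic in doGetImage can be isolated as a small sketch: take the extension from the URL and prefix a random UUID so that concurrent downloads never overwrite each other. The URL below is a hypothetical example; note that building the path with the two-argument File constructor (or Paths.get) inserts the directory separator that a bare "images"+picName concatenation would omit.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class PicName {

    //Extension of the resource in the URL, including the dot
    static String extOf(String url) {
        return url.substring(url.lastIndexOf('.'));
    }

    public static void main(String[] args) {
        String url = "https://img.example.com/n1/phone.jpg"; //hypothetical URL
        String picName = UUID.randomUUID() + extOf(url);      //e.g. "<uuid>.jpg"
        Path target = Paths.get("images", picName);           //separator added for us
        System.out.println(target);
    }
}
```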

4.3.6 Implementing the data crawl

Use a scheduled task to fetch the latest data periodically.

package com.xmx.jd.task;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;
import com.xmx.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 15:28
 * @Version 1.0
 */
@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;

    private static final ObjectMapper MAPPER = new ObjectMapper();
    int i = 1;
    @Scheduled(fixedDelay = 100*1000)   //wait this long after a crawl finishes before starting the next one
    public void itemTask() throws Exception{
        //Initial search URL to parse
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq%22%20+%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&s=113&click=0&page=";

        //Iterate over the result pages of the phone search
        for (int i = 0; i < 50; i=i+2) {
            String html = httpUtils.doGetHtml(url + i);
            //Parse the page, extract the items and store them
            this.parse(html);

        }

        System.out.println("Phone data crawl finished! "+i++);
    }


    //Parse the page, extract the items and store them
    private void parse(String html) throws Exception {
        //Parse the HTML into a Document
        Document doc = Jsoup.parse(html);

        //Get the spu elements (one <li> per product group)
        Elements spuEles = doc.select("div#J_goodsList > ul > li");

        for (Element spuEle : spuEles) {

            Elements skuEles = spuEle.select("div.gl-i-wrap");
            for (Element skuEle : skuEles) {

                Item item = new Item();

                Elements select = skuEle.select("div.p-img");
                for (Element element : select) {
                    //Get the product image (lazy-loaded, so read data-lazy-img)
                    String picUrl ="https:"+ element.select("img[data-lazy-img]").first().attr("data-lazy-img");
                    picUrl = picUrl.replace("/n9/","/n1/");  //swap the thumbnail path for the larger size
                    String picName = this.httpUtils.doGetImage(picUrl);
                    item.setPic(picName);

                    String attr = element.select("a").attr("title");
                    item.setTitle(attr);
                    //Set the spu of the item
                    long spu =Long.parseLong(element.select("div").attr("data-venid"));
                    item.setSpu(spu);
                }

                //Get the sku from the operate bar and build the detail-page URL
                Elements select1 = skuEle.select("div.p-operate > a");
                long sku = Long.parseLong(select1.get(1).attr("data-sku"));
                item.setSku(sku);
                String itemUrl = "https://item.jd.com/" + sku + ".html";
                item.setUrl(itemUrl);

                //Look the item up by the fields set so far (query by example)
                List<Item> list = this.itemService.findAll(item);

                if(list.size()>0) {
                    //The item already exists, so skip it
                    continue;
                }

                //Get the item price
                Elements priEle = skuEle.select("div.p-price");
                for (Element element : priEle) {
                    String pri = element.select("i").first().text();
                    double price = Double.parseDouble(pri);
                    item.setPrice(price);
                }
                item.setCreated(new Date());
                item.setUpdated(item.getCreated());

                //Save the item to the database
                this.itemService.save(item);

            }
        }
    }
}
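The scheduled task above pages through the search results by appending a counter to the base URL. That URL construction can be sketched on its own (the base URL is shortened here for readability; the real one carries the full JD search query string):

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {

    //Build the list of search URLs the task will fetch, mirroring the
    //"url + i" loop in ItemTask.itemTask(): start at 0, step by 2.
    static List<String> buildPageUrls(String base, int maxExclusive, int step) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < maxExclusive; i += step) {
            urls.add(base + i);
        }
        return urls;
    }

    public static void main(String[] args) {
        //Shortened stand-in for the real search URL ending in "&page="
        List<String> urls = buildPageUrls("https://search.jd.com/Search?page=", 6, 2);
        System.out.println(urls);
    }
}
```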

The example code above has been uploaded to Gitee: crawler_jd

5. Problems encountered and solutions

5.1 Problem 1
5.1.1 Description

While writing the pojo entity class, IDEA reports the error Cannot resolve table 'jd_item'.
(screenshot of the error)

5.1.2 Solution

The cause is that IDEA is not linked to the MySQL database; link the MySQL database as a data source in IDEA (screenshots of the individual steps omitted).
Note that clicking Test Connection may fail because of a time-zone mismatch between MySQL and IDEA.
The simplest fix is to append ?serverTimezone=GMT to the JDBC URL, after which the connection test succeeds and the error disappears.
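For reference, the same serverTimezone fix applied to a Spring Boot datasource configuration might look like the fragment below (host, database name, and credentials are placeholders, not taken from the project):

```properties
# application.properties -- placeholder host/database/credentials
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler?useUnicode=true&characterEncoding=utf8&serverTimezone=GMT
spring.datasource.username=root
spring.datasource.password=root
```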

5.2 Problem 2
5.2.1 Description

The program ran but fetched no data. Printing the fetched page showed that JD was redirecting the request to its login page.
(screenshot of the redirect response)

5.2.2 Solution

Set a User-Agent header on the request in HttpClient so it imitates a desktop browser:

httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");