Web Crawlers
(Note: these are my notes from the Heima (itheima) course; problems encountered along the way, and their solutions, are collected at the end.)
A web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, auto-indexer, emulator, and worm.
1. A first crawler program
1.1 Environment setup
- Create a Maven project crawler_demo1 and add the dependencies below to pom.xml: httpclient fetches the pages, and slf4j-log4j12 provides logging.
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
- Create a log4j.properties file under resources:
log4j.rootLogger=DEBUG,A1
log4j.logger.com.xmx=DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
1.2 Example code
A minimal crawler that fetches the Xi'an Shiyou University homepage: http://www.xsyu.edu.cn/
package com.xmx.crawler.test;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:04
 * @Version 1.0
 */
public class Crawler1 {
    public static void main(String[] args) throws Exception {
        //1. "Open the browser": create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. "Type the address": create an HttpGet for the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        //3. "Press Enter": execute the request and receive the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        //4. Parse the response and extract the data
        //Only proceed when the status code is 200 (OK)
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "UTF-8");
            System.out.println(content);
        }
    }
}
2. HttpClient
A crawler is just a program that accesses web resources on our behalf. Browsers fetch pages over HTTP, and our crawler program uses the same protocol.
Here we use HttpClient, a Java HTTP client library, to fetch page data.
2.1 GET request
Fetch the Xi'an Shiyou University homepage; request URL: http://www.xsyu.edu.cn/
Core code:
//Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//Create the HttpGet and set the target URL
HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
//Execute the request with HttpClient and obtain the response
try {
    CloseableHttpResponse response = httpClient.execute(httpGet);
    //Parse the response
    if (response.getStatusLine().getStatusCode() == 200) { //status 200 means success
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpGet and set the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.2 GET request with parameters
Search the university site (the graduate-school search page); the address is: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//Target address: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
//Build the URI with URIBuilder
URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
//Set the query parameter
uriBuilder.setParameter("wbtreeid", "1032");
//Multiple parameters can be chained, each with its own key:
//uriBuilder.setParameter("key1", "value1").setParameter("key2", "value2");
//2. Create the HttpGet from the built URI
HttpGet httpGet = new HttpGet(uriBuilder.build());
//3. Execute the request with HttpClient and obtain the response
try {
    CloseableHttpResponse response = httpClient.execute(httpGet);
    //Parse the response
    if (response.getStatusLine().getStatusCode() == 200) { //status 200 means success
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpGetParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //Target address: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        //Build the URI with URIBuilder
        URIBuilder uriBuilder = new URIBuilder("http://www.xsyu.edu.cn/search.jsp");
        //Set the query parameter
        uriBuilder.setParameter("wbtreeid", "1032");
        //Multiple parameters can be chained, each with its own key:
        //uriBuilder.setParameter("key1", "value1").setParameter("key2", "value2");
        //2. Create the HttpGet from the built URI
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        System.out.println("Request line: " + httpGet);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.3 POST request
Fetch the Xi'an Shiyou University homepage with POST; request URL: http://www.xsyu.edu.cn/
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//2. Create the HttpPost and set the target URL
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");
CloseableHttpResponse response = null;
//3. Execute the request with HttpClient and obtain the response
try {
    response = httpClient.execute(httpPost);
    //4. Parse the response
    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpPost and set the target URL
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpPost);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.4 POST request with parameters
Search the Xi'an Shiyou University site with a POST request; the URL is: http://www.xsyu.edu.cn/search.jsp
The URL itself carries no parameters; the parameter wbtreeid=1032 is submitted in the form body.
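For context, a POST form body is just URL-encoded key=value pairs: UrlEncodedFormEntity produces the same application/x-www-form-urlencoded encoding that the JDK's java.net.URLEncoder applies. A minimal stdlib sketch (the class and method names here are hypothetical, not part of the course code):

```java
import java.net.URLEncoder;

public class FormEncodingDemo {
    //Encode one form field the way UrlEncodedFormEntity would
    public static String encodePair(String name, String value) throws Exception {
        return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        //ASCII values pass through unchanged
        System.out.println(encodePair("wbtreeid", "1032"));
        //Non-ASCII values become percent-encoded UTF-8 bytes
        System.out.println(encodePair("q", "手机"));
    }
}
```

Running this prints `wbtreeid=1032` and `q=%E6%89%8B%E6%9C%BA`, which is exactly what ends up in the POST body below.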
Core code:
//1. Create the HttpClient instance
CloseableHttpClient httpClient = HttpClients.createDefault();
//2. Create the HttpPost and set the target URL
HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");
//Declare a List to hold the form parameters
List<NameValuePair> params = new ArrayList<NameValuePair>();
//Equivalent to requesting: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
params.add(new BasicNameValuePair("wbtreeid", "1032"));
//Create the form entity: first argument is the parameter list, second is the encoding
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
//Attach the form entity to the POST request
httpPost.setEntity(formEntity);
CloseableHttpResponse response = null;
//3. Execute the request with HttpClient and obtain the response
try {
    response = httpClient.execute(httpPost);
    //4. Parse the response
    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "utf8");
        System.out.println(content.length());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Full example:
package com.xmx.crawler.test;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpPostParamTest {
    public static void main(String[] args) throws Exception {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpPost and set the target URL
        HttpPost httpPost = new HttpPost("http://www.xsyu.edu.cn/search.jsp");
        //Declare a List to hold the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        //Equivalent to requesting: http://www.xsyu.edu.cn/search.jsp?wbtreeid=1032
        params.add(new BasicNameValuePair("wbtreeid", "1032"));
        //Create the form entity: first argument is the parameter list, second is the encoding
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");
        //Attach the form entity to the POST request
        httpPost.setEntity(formEntity);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpPost);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
2.5 Using a connection pool
Creating a new HttpClient for every request means constant construction and teardown, which wastes resources. A connection pool solves this.
package com.xmx.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 15:22
 * @Version 1.0
 */
public class HttpClientPoolTest {
    public static void main(String[] args) {
        //Create the pooling connection manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        //Maximum total connections
        cm.setMaxTotal(100);
        //Maximum connections per route (per host)
        cm.setDefaultMaxPerRoute(10);
        //Issue requests through the pool
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        //Obtain the HttpClient from the pool instead of creating a new one each time
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            //Do not close the HttpClient here; its connections are managed by the pool
            //httpClient.close();
        }
    }
}
2.6 HttpClient request configuration
Custom connection timeouts guard against crawls failing because the network or the target server is slow.
Core code:
//Configure the request
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(1000)            //max time to establish a connection, in ms
        .setConnectionRequestTimeout(500)   //max time to obtain a connection, in ms
        .setSocketTimeout(10 * 1000)        //max time for data transfer, in ms
        .build();
//Apply the configuration to the request
httpGet.setConfig(config);
Full example:
package com.xmx.crawler.test;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

/**
 * @Author Xumx
 * @Date 2021/3/16 14:39
 * @Version 1.0
 */
public class HttpConfigTest {
    public static void main(String[] args) {
        //1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2. Create the HttpGet and set the target URL
        HttpGet httpGet = new HttpGet("http://www.xsyu.edu.cn/");
        //Configure the request
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            //max time to establish a connection, in ms
                .setConnectionRequestTimeout(500)   //max time to obtain a connection, in ms
                .setSocketTimeout(10 * 1000)        //max time for data transfer, in ms
                .build();
        //Apply the configuration to the request
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        //3. Execute the request with HttpClient and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //4. Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response first, then the client (guard against a failed request leaving response null)
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
3. Jsoup
3.1 Jsoup overview
jsoup is a Java HTML parser that can parse a URL or raw HTML text directly. It offers a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
jsoup's main capabilities:
- parse HTML from a URL, file, or string;
- find and extract data using DOM traversal or CSS selectors;
- manipulate HTML elements, attributes, and text.
3.2 Environment for Jsoup
Dependencies required for jsoup:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.7</version>
</dependency>
3.3 Parsing with jsoup
3.3.1 Parsing a URL
Jsoup can take a URL directly; it issues the request itself and wraps the result in a Document object.
@Test
public void testUrl() throws Exception {
    //Parse a URL: first argument is the address, second is the timeout in ms
    Document doc = Jsoup.parse(new URL("http://www.xsyu.edu.cn/"), 1000);
    //Use a tag selector to read the content of the title tag
    String title = doc.getElementsByTag("title").first().text();
    //Print it
    System.out.println(title);
}
3.3.2 Parsing a string
Jsoup can parse raw HTML from a string into a Document object.
@Test
public void testString() throws Exception {
    //Read the file into a string with commons-io's FileUtils
    String content = FileUtils.readFileToString(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Parse the string
    Document document = Jsoup.parse(content);
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}
3.3.3 Parsing a file
Jsoup can parse an HTML file directly into a Document object.
@Test
public void testFile() throws Exception {
    //Parse the file directly
    Document document = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    String title = document.getElementsByTag("title").first().text();
    System.out.println(title);
}
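The file-based tests in this and the following sections read a local test.html (the long Windows path above), which is not reproduced in these notes. A minimal hypothetical version, consistent with the ids, classes, and attributes the later selector tests use (city_bj, class_a, s_name, the abc attribute, city_con), might look like:

```html
<!DOCTYPE html>
<html>
<head>
    <title>jsoup test page</title>
</head>
<body>
<div class="city_con">
    <h3 id="city_bj">北京</h3>
    <ul>
        <li id="test" class="class_a class_b">城市介绍</li>
        <li class="class_a">上海</li>
    </ul>
    <span abc="123" class="s_name">传智播客</span>
</div>
<a href="http://sh.itcast.cn">上海校区</a>
</body>
</html>
```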
3.3.4 Traversing the document with the DOM
3.3.4.1 Getting elements
Ways to get elements:
- by id: getElementById
- by tag: getElementsByTag
- by class: getElementsByClass
- by attribute: getElementsByAttribute
@Test
public void testDOM() throws Exception {
    //Parse the file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Get elements
    //1. by id: getElementById
    //Element element = doc.getElementById("city_bj");
    //2. by tag: getElementsByTag
    //Element element = doc.getElementsByTag("span").first();
    //3. by class: getElementsByClass
    //Element element = doc.getElementsByClass("class_a class_b").first();
    //Element element = doc.getElementsByClass("class_a").first();
    //Element element = doc.getElementsByClass("class_b").first();
    //4. by attribute: getElementsByAttribute
    //Element element = doc.getElementsByAttribute("abc").first();
    Element element = doc.getElementsByAttributeValue("href", "http://sh.itcast.cn").first();
    //Print the element's text
    System.out.println("Element content: " + element.text());
}
3.3.4.2 Getting data from an element
Data that can be read from an element:
- its id
- its className
- the value of an attribute: attr
- all of its attributes: attributes
- its text content: text
@Test
public void testData() throws Exception {
    //Parse the file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //Get the element by id
    Element element = doc.getElementById("test");
    String str = "";
    //Read data from the element
    //1. id of the element
    //str = element.id();
    //2. class name(s) of the element
    //str = element.className();
    //Set<String> classSet = element.classNames();
    //for (String str1 : classSet) {
    //    System.out.println(str1);
    //}
    //3. value of an attribute: attr
    //str = element.attr("id");
    //str = element.attr("class");
    //4. all attributes: attributes
    Attributes attributes = element.attributes();
    System.out.println(attributes.toString());
    //5. text content: text
    str = element.text();
    //Print what was read
    System.out.println("Data: " + str);
}
3.3.5 Basic selectors
- tagname: find by tag, e.g. span
- #id: find by ID, e.g. #city_bj
- .class: find by class name, e.g. .class_a
- [attribute]: find by attribute, e.g. [abc]
- [attr=value]: find by attribute value, e.g. [class=s_name]
@Test
public void testSelector() throws Exception {
    //Parse the HTML file and get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //tagname: find by tag, e.g. span
    Elements elements = doc.select("span");
    for (Element element : elements) {
        System.out.println(element.text());
    }
    System.out.println("===================");
    //#id: find by ID, e.g. #city_bj
    Element element = doc.select("#city_bj").first();
    System.out.println(element.text());
    System.out.println("===================");
    //.class: find by class name, e.g. .class_a
    Element element1 = doc.select(".class_a").first();
    System.out.println(element1.text());
    System.out.println("===================");
    //[attribute]: find by attribute, e.g. [abc]
    Element element2 = doc.select("[abc]").first();
    System.out.println(element2.text());
    System.out.println("===================");
    //[attr=value]: find by attribute value, e.g. [class=s_name]
    Elements elements1 = doc.select("[class=s_name]");
    for (Element element3 : elements1) {
        System.out.println(element3.text());
    }
}
3.3.6 Combining selectors
- el#id: element + ID, e.g. h3#city_bj
- el.class: element + class, e.g. li.class_a
- el[attr]: element + attribute name, e.g. span[abc]
- any combination, e.g. span[abc].s_name
- ancestor child: descendants of an element, e.g. .city_con li finds all li under .city_con
- parent > child: direct children only, e.g.:
- .city_con > ul > li finds the direct ul children of .city_con, then the direct li children of those ul
- parent > *: all direct children of an element
@Test
public void testSelector2() throws Exception {
    //Get the Document
    Document doc = Jsoup.parse(new File("D:\\code\\java\\code\\heima\\crawler\\crawler_demo1\\src\\test\\java\\jsoup\\test.html"), "utf8");
    //el#id: element + ID, e.g. h3#city_bj
    Element element = doc.select("h3#city_bj").first();
    System.out.println(element.text());
    System.out.println("===================");
    //el.class: element + class, e.g. li.class_a
    Element element1 = doc.select("li.class_a").first();
    System.out.println(element1.text());
    System.out.println("===================");
    //el[attr]: element + attribute name, e.g. span[abc]
    Element element2 = doc.select("span[abc]").first();
    System.out.println(element2.text());
    System.out.println("===================");
    //any combination, e.g. span[abc].s_name
    Element element3 = doc.select("span[abc].s_name").first();
    System.out.println(element3.text());
    System.out.println("===================");
    //ancestor child: e.g. .city_con li finds all li under .city_con
    Elements elements = doc.select(".city_con li");
    for (Element element4 : elements) {
        System.out.println(element4.text());
    }
    System.out.println("===================");
    //parent > child: direct children only, e.g.
    //.city_con > ul > li finds the direct ul children of .city_con, then the direct li children of those ul
    Elements elements1 = doc.select(".city_con > ul > li");
    for (Element element4 : elements1) {
        System.out.println(element4.text());
    }
    System.out.println("===================");
    //parent > *: all direct children of an element
    Elements elements2 = doc.select(".city_con > ul > *");
    for (Element element4 : elements2) {
        System.out.println(element4.text());
    }
}
The test code above has been pushed to Gitee (码云) under crawler_demo1.
4. Crawler case study
4.1 Requirements
Crawl mobile-phone listings from jd.com using Spring Boot, Spring Data JPA, and a scheduled task.
4.2 Setup
4.2.1 Create the database table
Create a MySQL database named crawler and, inside it, a jd_item table with the following structure:
CREATE TABLE `jd_item` (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `spu` bigint(15) DEFAULT NULL COMMENT 'product family (SPU) id',
  `sku` bigint(15) DEFAULT NULL COMMENT 'stock keeping unit (SKU) id',
  `title` varchar(100) DEFAULT NULL COMMENT 'item title',
  `price` double DEFAULT NULL COMMENT 'item price',
  `pic` varchar(200) DEFAULT NULL COMMENT 'item image',
  `url` varchar(200) DEFAULT NULL COMMENT 'item detail URL',
  `created` datetime DEFAULT NULL COMMENT 'created time',
  `updated` datetime DEFAULT NULL COMMENT 'updated time',
  PRIMARY KEY (`id`),
  KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='JD item table';
4.2.2 Add dependencies
Create a Maven project and add the required dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.2.RELEASE</version>
</parent>
<groupId>com.xmx</groupId>
<artifactId>crawler_jd</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>9</maven.compiler.source>
<maven.compiler.target>9</maven.compiler.target>
</properties>
<dependencies>
<!--SpringMVC-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!--SpringData Jpa-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!--MySQL connector-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
<!-- HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
<!--Jsoup-->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.3</version>
</dependency>
<!--Utility library (commons-lang3)-->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
</dependencies>
</project>
4.2.3 Add the configuration file
Create application.properties:
#DB Configuration:
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler
spring.datasource.username=root
spring.datasource.password=root
#JPA Configuration:
spring.jpa.database=MySQL
spring.jpa.show-sql=true
4.3 Implementation
4.3.1 Write the pojo
Map the jd_item table to a pojo:
package com.xmx.jd.pojo;

import javax.persistence.*;
import java.util.Date;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:21
 * @Version 1.0
 */
@Entity
@Table(name = "jd_item")
public class Item {
    //primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    //standard product unit (product family)
    private Long spu;
    //stock keeping unit (smallest sellable unit)
    private Long sku;
    //item title
    private String title;
    //item price
    private Double price;
    //item image
    private String pic;
    //item detail URL
    private String url;
    //created time
    private Date created;
    //updated time
    private Date updated;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public Long getSpu() {
        return spu;
    }

    public void setSpu(Long spu) {
        this.spu = spu;
    }

    public Long getSku() {
        return sku;
    }

    public void setSku(Long sku) {
        this.sku = sku;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    public String getPic() {
        return pic;
    }

    public void setPic(String pic) {
        this.pic = pic;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }
}
4.3.2 Write the dao
package com.xmx.jd.dao;
import com.xmx.jd.pojo.Item;
import org.springframework.data.jpa.repository.JpaRepository;
/**
* @Author Xumx
* @Date 2021/3/17 14:25
* @Version 1.0
*/
public interface ItemDao extends JpaRepository<Item, Long> {
}
4.3.3 Write the Service
The ItemService interface:
package com.xmx.jd.service;
import com.xmx.jd.pojo.Item;
import java.util.List;
/**
* @Author Xumx
* @Date 2021/3/17 14:27
* @Version 1.0
*/
public interface ItemService {
    /*
     * Save an item
     */
    public void save(Item item);

    /*
     * Find all items matching the given example
     */
    public List<Item> findAll(Item item);
}
The ItemServiceImpl implementation:
package com.xmx.jd.service.impl;

import com.xmx.jd.dao.ItemDao;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;

import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:30
 * @Version 1.0
 */
@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public List<Item> findAll(Item item) {
        //Build the query-by-example condition
        Example<Item> example = Example.of(item);
        //Query with the example condition
        List<Item> list = this.itemDao.findAll(example);
        return list;
    }
}
4.3.4 Write the bootstrap class
package com.xmx.jd;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:35
 * @Version 1.0
 */
@SpringBootApplication
//Scheduled tasks must be enabled explicitly with this annotation
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
4.3.5 Wrap HttpClient
Wrap HttpClient for convenience. (This version already includes the fix for Problem 2 below: the request simulates a desktop browser.)
package com.xmx.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/**
 * @Author Xumx
 * @Date 2021/3/17 14:44
 * @Version 1.0
 */
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        //Maximum total connections
        this.cm.setMaxTotal(100);
        //Maximum connections per route (per host)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /*
     * Download the page at the given URL
     */
    public String doGetHtml(String url) {
        //Obtain an HttpClient backed by the pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        //Create the HttpGet for the URL
        HttpGet httpGet = new HttpGet(url);
        //Set a User-Agent header to simulate a desktop browser (see Problem 2 below)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        //Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        //Execute the request and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //Parse the response and return the page
            if (response.getStatusLine().getStatusCode() == 200) {
                //Only use EntityUtils when the response entity is non-null
                if (response.getEntity() != null) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    return content;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return the empty string on failure
        return "";
    }

    /*
     * Download an image
     */
    public String doGetImage(String url) {
        //Obtain an HttpClient backed by the pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        //Create the HttpGet for the URL
        HttpGet httpGet = new HttpGet(url);
        //Apply the request configuration
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        //Execute the request and obtain the response
        try {
            response = httpClient.execute(httpGet);
            //Parse the response and save the image
            if (response.getStatusLine().getStatusCode() == 200) {
                if (response.getEntity() != null) {
                    //Keep the original file extension
                    String extName = url.substring(url.lastIndexOf("."));
                    //Rename the image with a random UUID
                    String picName = UUID.randomUUID().toString() + extName;
                    //Write the image into the images directory (the directory must already exist)
                    OutputStream outputStream = new FileOutputStream(new File("images/" + picName));
                    response.getEntity().writeTo(outputStream);
                    outputStream.close();
                    //Return the image file name
                    return picName;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Return the empty string if the download failed
        return "";
    }

    //Request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            //max time to establish a connection
                .setConnectionRequestTimeout(500)   //max time to obtain a connection from the pool
                .setSocketTimeout(10000)            //max time for data transfer
                .build();
        return config;
    }
}
4.3.6 Implement the crawl
A scheduled task fetches the latest data periodically.
package com.xmx.jd.task;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.xmx.jd.pojo.Item;
import com.xmx.jd.service.ItemService;
import com.xmx.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;
import java.util.List;

/**
 * @Author Xumx
 * @Date 2021/3/17 15:28
 * @Version 1.0
 */
@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;

    @Autowired
    private ItemService itemService;

    private static final ObjectMapper MAPPER = new ObjectMapper();

    int i = 1;

    @Scheduled(fixedDelay = 100 * 1000) //wait this long after one run finishes before starting the next
    public void itemTask() throws Exception {
        //Initial search URL; the page number is appended on each iteration
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq%22%20+%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%22=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&s=113&click=0&page=";
        //Walk the phone search-result pages (JD's page parameter advances in steps of 2)
        for (int i = 0; i < 50; i = i + 2) {
            String html = httpUtils.doGetHtml(url + i);
            //System.out.println(html);
            //Parse the page; extract and store the items
            this.parse(html);
        }
        System.out.println("Phone data crawl finished! " + i++);
    }

    //Parse the page; extract and store the items
    private void parse(String html) throws Exception {
        //Parse the HTML into a Document
        Document doc = Jsoup.parse(html);
        //Get the SPU entries
        Elements spuEles = doc.select("div#J_goodsList > ul > li");
        for (Element spuEle : spuEles) {
            //Get the SKU wrappers
            Elements skuEles = spuEle.select("div.gl-i-wrap");
            //System.out.println(skuEles);
            for (Element skuEle : skuEles) {
                Item item = new Item();
                Elements select = skuEle.select("div.p-img");
                //System.out.println(select);
                for (Element element : select) {
                    //Get the item image
                    String picUrl = "https:" + element.select("img[data-lazy-img]").first().attr("data-lazy-img");
                    picUrl = picUrl.replace("/n9/", "/n1/");
                    String picName = this.httpUtils.doGetImage(picUrl);
                    item.setPic(picName);
                    String attr = element.select("a").attr("title");
                    item.setTitle(attr);
                    long spu = Long.parseLong(element.select("div").attr("data-venid"));
                    //Set the item's SPU
                    item.setSpu(spu);
                }
                Elements select1 = skuEle.select("div.p-operate > a");
                for (int i = 0; i < 1; i++) {
                    long sku = Long.parseLong(select1.get(1).attr("data-sku"));
                    item.setSku(sku);
                    //Build the item detail URL from the SKU
                    String itemUrl = "https://item.jd.com/" + sku + ".html";
                    item.setUrl(itemUrl);
                }
                //Query for an existing item with the same SKU
                List<Item> list = this.itemService.findAll(item);
                if (list.size() > 0) {
                    //The item already exists; skip it
                    continue;
                }
                //Get the item price
                Elements priEle = skuEle.select("div.p-price");
                //System.out.println(priEle);
                for (Element element : priEle) {
                    String pri = element.select("i").first().text();
                    //System.out.println(pri.length());
                    double price = Double.parseDouble(pri);
                    item.setPrice(price);
                }
                item.setCreated(new Date());
                item.setUpdated(item.getCreated());
                //Save the item to the database
                this.itemService.save(item);
            }
        }
    }
}
The case-study code above has been pushed to Gitee (码云) under crawler_jd.
5. Problems encountered and solutions
5.1 Problem 1
5.1.1 Description
While writing the pojo entity class, IDEA reports: Cannot resolve table 'jd_item'.
5.1.2 Solution
IDEA is not yet linked to the MySQL database; add the database as a data source in IDEA's Database tool window.
Note that clicking Test Connection may fail with a timezone mismatch between MySQL and IDEA.
The simplest fix is to append ?serverTimezone=GMT to the JDBC URL; the connection then succeeds.
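For reference, the same timezone parameter can also be applied to the project's own datasource URL (a hypothetical variant of the application.properties shown earlier; the database name and port are the ones assumed above):

```properties
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/crawler?serverTimezone=GMT
```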
5.2 Problem 2
5.2.1 Description
The program ran but fetched no data; printing the fetched content showed a page that redirects to the JD login page.
5.2.2 Solution
Set a User-Agent header on the request in HttpUtils to simulate a desktop browser:
httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");