Java Web Crawler
Preface: Quite a few readers have asked about web crawling. Python is the usual choice for crawlers, so how do you do the same thing in Java? Below is a complete walkthrough for reference, using Douban movies as the example.
Finding the Crawl Entry Point
Chrome is recommended for this step.
First, open the official Douban movie site: https://movie.douban.com
Open the developer tools and watch the network requests fired while the movie list loads: one of them returns the list data as JSON, an object whose data array carries the fields id, title, rate, cover, directors, and casts. Checking that request's headers confirms it is a plain GET request, so the crawl entry point is:
https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0
Crawling the Data
1. Create a Maven project
2. Add the required dependencies to pom.xml: the MySQL JDBC driver, org.json and fastjson for JSON handling, and MyBatis for persistence
<dependencies>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20140107</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.47</version>
    </dependency>
    <dependency>
        <groupId>org.mybatis</groupId>
        <artifactId>mybatis</artifactId>
        <version>3.5.1</version>
    </dependency>
</dependencies>
3. Once the project is created, the directory layout looks roughly as shown below
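A plausible layout, reconstructed from the package names used in the following steps (com.wxy.model, com.wxy.mapper) and the resource file names; the project name douban-crawler and the placement of App and GetJson directly under com.wxy are assumptions:

douban-crawler/
├── pom.xml
└── src/
    └── main/
        ├── java/
        │   └── com/wxy/
        │       ├── App.java
        │       ├── GetJson.java
        │       ├── mapper/
        │       │   └── MovieMapper.java
        │       └── model/
        │           └── Movie.java
        └── resources/
            ├── jdbc.properties
            ├── mybatis-config.xml
            └── MovieMapper.xml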
4. In the model package, create an entity class matching the Douban movie fields
package com.wxy.model;

public class Movie {
    private String id;
    private String directors;
    private String title;
    private String cover;
    private String rate;
    private String casts;

    public String getId() {
        return id;
    }
    public void setId(String id) {
        this.id = id;
    }
    public String getDirectors() {
        return directors;
    }
    public void setDirectors(String directors) {
        this.directors = directors;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getCover() {
        return cover;
    }
    public void setCover(String cover) {
        this.cover = cover;
    }
    public String getRate() {
        return rate;
    }
    public void setRate(String rate) {
        this.rate = rate;
    }
    public String getCasts() {
        return casts;
    }
    public void setCasts(String casts) {
        this.casts = casts;
    }
}
5. Create the mapper interface
package com.wxy.mapper;

import java.util.List;

import com.wxy.model.Movie;

public interface MovieMapper {
    void insert(Movie movie);
    List<Movie> findAll();
}
6. Under resources, create the database connection config file jdbc.properties
driver=com.mysql.jdbc.Driver
# useUnicode/characterEncoding make sure Chinese titles are stored intact
url=jdbc:mysql://localhost:3306/movie?useUnicode=true&characterEncoding=utf8
username=root
password=XXXXXX
7. Create the MyBatis config file mybatis-config.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <properties resource="jdbc.properties"></properties>
    <environments default="development">
        <environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="${driver}"/>
                <property name="url" value="${url}"/>
                <property name="username" value="${username}"/>
                <property name="password" value="${password}"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <mapper resource="MovieMapper.xml"/>
    </mappers>
</configuration>
8. Create the mapping file MovieMapper.xml (its namespace must be the mapper interface's fully qualified name)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
        PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.wxy.mapper.MovieMapper">
    <select id="findAll" resultType="com.wxy.model.Movie">
        select * from movie
    </select>
    <insert id="insert" parameterType="com.wxy.model.Movie">
        insert into movie (id, directors, title, cover, rate, casts)
        values (#{id}, #{directors}, #{title}, #{cover}, #{rate}, #{casts})
    </insert>
</mapper>
9. Fetch pages with plain-JDK HTTP (HttpURLConnection); the utility class is as follows
package com.wxy;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

import org.json.JSONObject;

import com.alibaba.fastjson.JSON;

public class GetJson {

    /**
     * Fetches the given URL and parses the response body as JSON.
     * comefrom = 1 means the body is plain JSON; comefrom = 2 means the JSON
     * is wrapped in a pair of parentheses, JSONP-style, e.g. callback({...}).
     */
    public JSONObject getHttpJson(String url, int comefrom) throws Exception {
        try {
            URL realUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) realUrl.openConnection();
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
            // open the connection
            connection.connect();
            // request succeeded
            if (connection.getResponseCode() == 200) {
                InputStream is = connection.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                // read the response in 8 KB chunks
                byte[] buffer = new byte[8192];
                int len;
                while ((len = is.read(buffer)) != -1) {
                    baos.write(buffer, 0, len);
                }
                String jsonString = baos.toString("UTF-8"); // decode explicitly as UTF-8
                baos.close();
                is.close();
                // turn the raw string into a JSON object
                return getJsonString(jsonString, comefrom);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return null;
    }

    public JSONObject getJsonString(String str, int comefrom) throws Exception {
        if (comefrom == 1) {
            // plain JSON: parse with fastjson, then wrap as an org.json object
            return new JSONObject(JSON.parseObject(str));
        } else if (comefrom == 2) {
            // JSONP-style: keep only the part between the outermost parentheses
            String strNew = str.substring(str.indexOf('(') + 1, str.lastIndexOf(')'));
            return new JSONObject(JSON.parseObject(strNew));
        }
        return null;
    }
}
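Before wiring the utility into the crawler, it is worth a quick sanity check. GetJsonDemo is just a throwaway class name placed in the same package as GetJson, and comefrom is 1 because the entry point returns plain JSON:

package com.wxy;

import org.json.JSONObject;

public class GetJsonDemo {
    public static void main(String[] args) throws Exception {
        JSONObject page = new GetJson().getHttpJson(
                "https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0", 1);
        // the endpoint serves 20 records per page, so this should print 20
        System.out.println(page.getJSONArray("data").length());
    }
}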
10. The launcher class that crawls the Douban movie data
package com.wxy;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.json.JSONArray;
import org.json.JSONObject;

import com.alibaba.fastjson.JSON;
import com.wxy.mapper.MovieMapper;
import com.wxy.model.Movie;

public class App {
    public static void main(String[] args) {
        String resource = "mybatis-config.xml"; // path of the MyBatis config file
        InputStream inputStream = null;
        try {
            inputStream = Resources.getResourceAsStream(resource); // load the config file
        } catch (IOException e) {
            e.printStackTrace();
        }
        SqlSessionFactory sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream); // build the session factory
        SqlSession sqlSession = sqlSessionFactory.openSession(); // open a session
        MovieMapper movieMapper = sqlSession.getMapper(MovieMapper.class); // get the mapper (DAO) from MyBatis
        int start;       // paging offset; the endpoint serves 20 records per page
        int total = 0;   // running count of inserted records
        int end = 10000; // stop after 10,000 records in total
        for (start = 0; start <= end; start += 20) {
            try {
                String address = "https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=" + start;
                JSONObject dayLine = new GetJson().getHttpJson(address, 1);
                System.out.println(dayLine + "-----------------");
                System.out.println("start:" + start);
                JSONArray json = dayLine.getJSONArray("data");
                System.out.println(json + "===================");
                List<Movie> list = JSON.parseArray(json.toString(), Movie.class);
                for (Movie movie : list) {
                    movieMapper.insert(movie);
                }
                sqlSession.commit(); // commit once per page instead of per row
                total += list.size();
                System.out.println("Crawling... " + total + " records fetched so far");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Reached the end of the crawl");
        sqlSession.close(); // close the session only after the whole loop is done
    }
}
11. Create the database
We use MySQL here. Create a database and a matching table with the fields listed above; run the project and the table fills up with the crawled movies.
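For reference, here is one possible DDL for that table. Since the entity stores every field as a String, the column types and lengths below are assumptions; adjust them to taste:

CREATE DATABASE IF NOT EXISTS movie DEFAULT CHARACTER SET utf8;

USE movie;

CREATE TABLE movie (
    id        VARCHAR(16)  NOT NULL PRIMARY KEY, -- Douban subject id
    directors VARCHAR(255),                      -- director names
    title     VARCHAR(255),                      -- movie title
    cover     VARCHAR(512),                      -- poster image URL
    rate      VARCHAR(8),                        -- rating, kept as text
    casts     VARCHAR(512)                       -- main cast
) DEFAULT CHARACTER SET utf8;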
Wrapping Up
With the crawl done and a database full of data, it all feels great. But try clearing the table and running the project a second time: this time the console may blow up with errors.

What happened? Is the project broken? Don't panic: open the entry URL in a browser again and you will see the problem.

No data comes back. Many sites now fight crawlers by temporarily blocking the requesting IP. How long the block lasts varies, and it may be random or fixed; in my case the second crawl only worked about half an hour after the first. Waiting it out is hardly a real fix, though. There are workarounds, such as changing the HTTP request headers to slip past the check, or routing requests through proxy IPs; a web search will turn up plenty of detail on both.
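To make the proxy idea concrete, here is a minimal sketch built on the same HttpURLConnection flow as GetJson. ProxyDemo is a hypothetical class, and the proxy host and port are placeholders you would swap for a working proxy:

import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

public class ProxyDemo {
    public static void main(String[] args) throws Exception {
        // placeholder address; substitute a real proxy host and port
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8888));
        URL url = new URL("https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0");
        // route the request through the proxy instead of connecting directly
        HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
        // vary the user-agent too, since some blocks key on request headers
        connection.setRequestProperty("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        connection.connect();
        System.out.println("response code: " + connection.getResponseCode());
        connection.disconnect();
    }
}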