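Today I got a request: a friend needed help scraping Ctrip's homestay listing data. Everyone knows Ctrip is hard to scrape, and I promptly fell into a pit.
In fact, the hard-to-scrape part of Ctrip is the hotel data. The homestay section seems to be a newly launched business, so it has hardly any anti-scraping measures. Too bad I didn't know that, so from the phone call at noon onward I was in for a long slog.
Stage one: life is short, I use Python
As soon as I heard the request I wanted to do it in Python, so I installed Python, then PyCharm, and found a few scripts. Almost none of them ran: either a library would not install or the syntax was wrong. Given my poor Python skills, I gave up after an hour or two of fiddling. The main pitfall was mismatched libraries: the packages I installed from the cmd window were not shared with PyCharm (presumably the PyCharm project was using a different interpreter). Everything installed fine from cmd, but PyCharm's own package search could not find a few of them, so, you know how it goes...
Stage two: searching for Java solutions
There are plenty of Java approaches. This stage was mostly searching the web for demos. I found five or six, and even spent CSDN points to download two. Unfortunately the code was mostly from last year and requested .aspx paths, which the current Ctrip site no longer uses. Every tutorial I found was based on that approach, so none of them worked.
PS: I also tried a ChromeDriver approach along the way, but the tutorial was written for scraping eLong, so it was no use here and I dropped it.
Many other tutorials had no complete code, and what I copied would not run. A few were ad posts pushing some "Ctrip domestic and overseas hotel crawler system" that only scrapes ten records unless you pay. Its sign-up page would not even accept a phone number, so I dropped that too.
By this point most of the afternoon was gone. Then my friend called: he had already collected everything by hand. So I thoroughly lost face in front of him.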
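The truth is I had been stuck in a misconception all along. Up to this point I still assumed Ctrip was hard to scrape, so after chatting with a colleague I went straight for the request URL and tested it with apidebug. The POST request carried a dozen or so headers and a big pile of parameters, which was exhausting. In the end the test passed, proving the data can be scraped through that URL and that it returns JSON. And then it was time to leave work...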
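For what it's worth, the same check can be sketched in code instead of apidebug. This is only an illustration, not what I ran that afternoon: it assumes Java 11+ (`java.net.http`), the class and method names are made up, and the endpoint and the four parameters are the ones used in the full listing later in this post. It only builds the request; nothing is sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class SearchRequestSketch {

    // Endpoint copied from the listing later in this post.
    static final String SEARCH_URL =
            "https://m.ctrip.com/restapi/soa2/12455/prod/json/searchProduct?_fxpcqlniredt=09031133110201642129";

    /** Encodes parameters as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) the POST request for one result page. */
    static HttpRequest pageRequest(int pageIndex) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("cityid", "105");      // city id
        params.put("districtId", "822");  // district id
        params.put("pindex", String.valueOf(pageIndex)); // page index
        params.put("pSize", "15");        // records per page
        return HttpRequest.newBuilder(URI.create(SEARCH_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = pageRequest(1);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Passing the built request to `HttpClient.newHttpClient().send(...)` would actually fire it; I left that out so the sketch stays harmless to run.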
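Stage three: jsoup and WebMagic
Back home I kept at it. I had used jsoup before to scrape pharmaceutical-industry data, so I followed the same idea: set up the environment, hunt for demo code. I tried WebMagic along the way too, and it was much the same. Halfway through it hit me: both tools are for parsing static pages, but I didn't need to parse a page at all. Ctrip was already kind enough to return JSON from an API, so why on earth was I messing with HTML parsing? I changed tack immediately.
Stage four: HttpClient, the ultimate move
Once that clicked, the job was simple: send an HTTP request, parse the JSON response into objects, and store them in the database. So the last stage is just sending the HTTP POST request directly. Code below:
pom dependencies: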
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.3.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.34</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.31</version>
</dependency>
Java entity class
package cn.wanghaomiao;
import java.util.Date;
/**
* @Author: szz
* @Date: 2019/5/29 9:50 PM
* @Version 1.0
*/
public class Hotel {
String id;
String pid;
String pname;
String zone;
String lng;
String lat;
String zoneId;
String rname;
Date createTime;
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
public String getZone() {
return zone;
}
public void setZone(String zone) {
this.zone = zone;
}
public String getLng() {
return lng;
}
public void setLng(String lng) {
this.lng = lng;
}
public String getLat() {
return lat;
}
public void setLat(String lat) {
this.lat = lat;
}
public String getZoneId() {
return zoneId;
}
public void setZoneId(String zoneId) {
this.zoneId = zoneId;
}
public String getRname() {
return rname;
}
public void setRname(String rname) {
this.rname = rname;
}
public Date getCreateTime() {
return createTime;
}
public void setCreateTime(Date createTime) {
this.createTime = createTime;
}
@Override
public String toString() {
return "Hotel{" +
"pid='" + pid + '\'' +
", pname='" + pname + '\'' +
", zone='" + zone + '\'' +
", lng='" + lng + '\'' +
", lat='" + lat + '\'' +
", rname='" + rname + '\'' +
", zoneId='" + zoneId + '\'' +
'}';
}
}
Worker class
package cn.wanghaomiao;
import com.alibaba.fastjson.JSONObject;
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.*;
/**
* Simple HttpClient example
*
* @author szz
* @date 2019-05-29 6:36:49 PM
* @version 1.0
*/
public class SimpleHttpClientDemo {
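/**
* Builds an SSLContext that skips certificate verification.
* It trusts every certificate, so use it for experiments only.
*
* @return an SSLContext that accepts any certificate
* @throws NoSuchAlgorithmException
* @throws KeyManagementException
*/
public static SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
// "TLS" rather than "SSLv3": SSLv3 is obsolete and insecure
SSLContext sc = SSLContext.getInstance("TLS");
// an X509TrustManager whose checks are empty, so verification is skipped; the methods need no changes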
X509TrustManager trustManager = new X509TrustManager() {
@Override
public void checkClientTrusted(
java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
String paramString) throws CertificateException {
}
@Override
public void checkServerTrusted(
java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
String paramString) throws CertificateException {
}
@Override
public java.security.cert.X509Certificate[] getAcceptedIssuers() {
return new java.security.cert.X509Certificate[0];
}
};
sc.init(null, new TrustManager[] { trustManager }, null);
return sc;
}
public static void main(String[] args) throws Exception{
for (int i = 2; i < 70; i++) {
Map<String, String> map = new HashMap<>();
map.put("cityid","105");// city id
map.put("districtId","822");// district id
map.put("pindex",i+"");// page index, advanced by the loop
map.put("pSize","15");// records per page
String result = sendHttps("https://m.ctrip.com/restapi/soa2/12455/prod/json/searchProduct?_fxpcqlniredt=09031133110201642129", map, "utf-8");// request URL, parameters, charset
Map resultMap = JSONObject.parseObject(result, Map.class);
System.out.println(resultMap);// print the raw response to inspect the JSON structure
List<Map> list= (List<Map>) resultMap.get("product");// the node we need
if (list == null || list.isEmpty()) {
break;// no products on this page, so stop paging
}
Date time=new Date();// one timestamp for the whole batch
for (Map map1 : list) {
Hotel hotel=new Hotel();
// null checks, because scraped fields may be missing
if (map1.get("pid")!=null) {
hotel.setPid(map1.get("pid").toString());
}
if (map1.get("pname")!=null) {
hotel.setPname(map1.get("pname").toString());
}
if (map1.get("zone")!=null) {
hotel.setZone(map1.get("zone").toString());
}
if (map1.get("rname")!=null) {
hotel.setRname(map1.get("rname").toString());
}
if (map1.get("zoneId")!=null) {
hotel.setZoneId(map1.get("zoneId").toString());
}
if (map1.get("pos")!=null) {// coordinates: a nested object, so unpack it in a second step
Map mapPos= (Map) map1.get("pos");
if (mapPos.get("lng")!=null) {
hotel.setLng(mapPos.get("lng").toString());
}
if (mapPos.get("lat")!=null) {
hotel.setLat(mapPos.get("lat").toString());
}
}
hotel.setCreateTime(time);
System.out.println(hotel);
insert(hotel);// save to the database
}
Thread.sleep(3000);// scrape politely: pause a few seconds between pages
}
}
// JDBC insert
private static int insert(Hotel hotel) {
int i = 0;
String sql = "insert into hotel (pid,pname,zone,lng,lat,rname,zoneId) values(?,?,?,?,?,?,?)";// write the SQL carefully
// try-with-resources closes the statement and the connection even when the insert fails
try (Connection conn = getConn();
PreparedStatement pstmt = (PreparedStatement) conn.prepareStatement(sql)) {
pstmt.setString(1, hotel.getPid());
pstmt.setString(2, hotel.getPname());
pstmt.setString(3, hotel.getZone());
pstmt.setString(4, hotel.getLng());
pstmt.setString(5, hotel.getLat());
pstmt.setString(6, hotel.getRname());
pstmt.setString(7, hotel.getZoneId());
i = pstmt.executeUpdate();
} catch (SQLException e) {
e.printStackTrace();
}
return i;
}
// JDBC connection settings
private static Connection getConn() {
String driver = "com.mysql.jdbc.Driver";
// characterEncoding prevents garbled Chinese text
// setting useSSL explicitly avoids the warning that newer MySQL versions print
String url = "jdbc:mysql://localhost:3306/pachong?useUnicode=true&characterEncoding=UTF-8&useSSL=true";
String username = "root";
String password = "root";
Connection conn = null;
try {
Class.forName(driver); // load the JDBC driver class
conn = (Connection) DriverManager.getConnection(url, username, password);
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
return conn;
}
/**
* Sends an HTTPS POST request.
*
* @param url target URL
* @param map form parameters
* @param encoding charset name
* @return the response body as a string
* @throws NoSuchAlgorithmException
* @throws KeyManagementException
* @throws IOException
* @throws ClientProtocolException
*/
public static String sendHttps(String url, Map<String,String> map,String encoding) throws KeyManagementException, NoSuchAlgorithmException, ClientProtocolException, IOException {
String body = "";
// handle HTTPS with the verification-skipping SSLContext
SSLContext sslcontext = createIgnoreVerifySSL();
// register the socket factories for the http and https schemes
Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
.register("http", PlainConnectionSocketFactory.INSTANCE)
.register("https", new SSLConnectionSocketFactory(sslcontext))
.build();
PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(socketFactoryRegistry);
// build the POST request
HttpPost httpPost = new HttpPost(url);
// collect the parameters
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
if(map!=null){
for (Map.Entry<String, String> entry : map.entrySet()) {
nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
}
}
// attach the parameters to the request as a form entity
httpPost.setEntity(new UrlEncodedFormEntity(nvps, encoding));
System.out.println("Request URL: "+url);
System.out.println("Request params: "+nvps.toString());
// set the Content-type and User-Agent headers
httpPost.setHeader("Content-type", "application/x-www-form-urlencoded");
httpPost.setHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
// create the custom client and execute the request (synchronous, blocking);
// try-with-resources closes the response and the client afterwards
try (CloseableHttpClient client = HttpClients.custom().setConnectionManager(connManager).build();
CloseableHttpResponse response = client.execute(httpPost)) {
HttpEntity entity = response.getEntity();
if (entity != null) {
// decode the response entity with the given charset
body = EntityUtils.toString(entity, encoding);
}
}
return body;
}
/**
* Sends a plain HTTP POST request.
*
* @param url target URL
* @param map form parameters
* @param encoding charset name
* @return the response body as a string
* @throws ParseException
* @throws IOException
*/
public static String sendHttp(String url, Map<String,String> map, String encoding) throws ParseException, IOException {
String body = "";
// build the POST request
HttpPost httpPost = new HttpPost(url);
// collect the parameters
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
if(map!=null){
for (Map.Entry<String, String> entry : map.entrySet()) {
nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
}
}
// attach the parameters to the request as a form entity
httpPost.setEntity(new UrlEncodedFormEntity(nvps, encoding));
System.out.println("Request URL: "+url);
System.out.println("Request params: "+nvps.toString());
// set the Content-type and User-Agent headers
httpPost.setHeader("Content-type", "application/x-www-form-urlencoded");
httpPost.setHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
// create the default client and execute the request (synchronous, blocking);
// try-with-resources closes the response and the client afterwards
try (CloseableHttpClient client = HttpClients.createDefault();
CloseableHttpResponse response = client.execute(httpPost)) {
HttpEntity entity = response.getEntity();
if (entity != null) {
// decode the response entity with the given charset
body = EntityUtils.toString(entity, encoding);
}
}
return body;
}
}
The comments should make things clear enough. The main point is that the request goes over HTTPS.
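The run of null checks in `main` could arguably be folded into two small helpers. The following is a minimal sketch under my own assumptions: `JsonMapSketch`, `text`, and `child` are hypothetical names, not part of the listing above, and plain `java.util.Map` objects stand in for the parsed fastjson result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class JsonMapSketch {

    /** Returns the value under key as a String, or null when the node or the value is missing. */
    static String text(Map<?, ?> node, String key) {
        if (node == null) {
            return null;
        }
        Object value = node.get(key);
        return value == null ? null : value.toString();
    }

    /** Returns the nested object under key, or null when it is missing or not an object. */
    static Map<?, ?> child(Map<?, ?> node, String key) {
        if (node == null) {
            return null;
        }
        Object value = node.get(key);
        return value instanceof Map ? (Map<?, ?>) value : null;
    }

    public static void main(String[] args) {
        // Made-up sample data shaped like one entry of the "product" list.
        Map<String, Object> pos = new HashMap<>();
        pos.put("lng", "120.15");
        Map<String, Object> product = new HashMap<>();
        product.put("pid", 12345);
        product.put("pos", pos);

        System.out.println(text(product, "pid"));               // present field
        System.out.println(text(child(product, "pos"), "lng")); // nested field
        System.out.println(text(child(product, "pos"), "lat")); // missing nested field
        System.out.println(text(product, "pname"));             // missing field
    }
}
```

With helpers like these, each setter call would shrink to something like `hotel.setLng(text(child(map1, "pos"), "lng"))`, since a missing field simply comes back as null.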
Note: sleep a few seconds on every loop iteration, otherwise you can easily get blocked. Scrape politely.
SQL script:
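One possible variation on the fixed three-second pause is to add a little random jitter so the requests are not perfectly periodic. To be clear, the jitter is my own untested assumption here; the run described in this post only used a flat `Thread.sleep(3000)`. `PoliteDelaySketch` and `nextDelayMillis` are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class, for illustration only.
final class PoliteDelaySketch {

    /** Returns a delay between baseMillis and baseMillis + jitterMillis, inclusive. */
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        // nextLong(bound) excludes the bound, hence the + 1
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int page = 1; page <= 3; page++) {
            System.out.println("fetch page " + page); // placeholder for the real request
            // Short values so the demo finishes quickly; a real crawl would use
            // something like nextDelayMillis(3000, 2000).
            long delay = nextDelayMillis(300, 200);
            System.out.println("sleeping " + delay + " ms");
            Thread.sleep(delay);
        }
    }
}
```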
CREATE TABLE `hotel` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`pid` varchar(20) COLLATE utf8mb4_bin DEFAULT NULL,
`pname` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL,
`zone` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL,
`lng` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL,
`lat` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL,
`rname` varchar(255) COLLATE utf8mb4_bin DEFAULT NULL,
`zoneId` varchar(20) COLLATE utf8mb4_bin DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1724 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
SET FOREIGN_KEY_CHECKS = 1;
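The nastiest trap was assuming that every request parameter had to be sent, so I spent ages wrestling with all of them. In the end it turned out that most of them are not needed at all. I was just being dumb.
There you go: a pretty complete scraping walkthrough, personally verified to still work on 2019-05-29. Use it while it lasts!!!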
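In hindsight, the trimming could have been done systematically: drop one captured parameter at a time and keep it out if the request still works. Below is a minimal sketch of that idea, not something I actually ran. `ParamPruningSketch` and `prune` are hypothetical names, `extraFlag` and `traceId` are made-up parameters, and the `stillWorks` check is a stand-in; in a real run it would send the request (with a pause between probes) and look for the `product` node in the response.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper class, for illustration only.
final class ParamPruningSketch {

    /**
     * Tries removing each captured parameter in turn and leaves it out
     * whenever the caller-supplied check still passes without it.
     * The captured map itself is not modified.
     */
    static Map<String, String> prune(Map<String, String> captured,
                                     Predicate<Map<String, String>> stillWorks) {
        Map<String, String> kept = new LinkedHashMap<>(captured);
        for (String key : captured.keySet()) {
            Map<String, String> trial = new LinkedHashMap<>(kept);
            trial.remove(key);
            if (stillWorks.test(trial)) {
                kept = trial; // the request survives without this parameter
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> captured = new LinkedHashMap<>();
        captured.put("cityid", "105");
        captured.put("districtId", "822");
        captured.put("pindex", "1");
        captured.put("pSize", "15");
        captured.put("extraFlag", "x");   // made-up parameter
        captured.put("traceId", "abc");   // made-up parameter

        // Stand-in check: pretends the four parameters used in this post are required.
        List<String> required = List.of("cityid", "districtId", "pindex", "pSize");
        Predicate<Map<String, String>> stillWorks = p -> p.keySet().containsAll(required);

        System.out.println(prune(captured, stillWorks).keySet());
    }
}
```

This one-pass approach only finds a set from which no single parameter can be removed; if two parameters are interchangeable it keeps whichever comes later, which is good enough for a quick crawl.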