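Recently I needed to build a crawler to collect website data, and I decided to start with JD (京东).
Since I'm a Java developer, the first tools that came to mind were HttpClient and HtmlUnit, so I started experimenting with them.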
(1) Open JD's login page and look at its source: the login is submitted through an AJAX call. The actual link is: https://passport.jd.com/uc/loginService?uuid=f5c0dd5a-762c-4230-b8c0-f70589b7dbdb&ReturnUrl=http://order.jd.com/center/list.action&r=0.66408410689742&loginname=username&nloginpwd=xxxxxx&loginpwd=xxxxxx&machineNet=&machineCpu=&machineDisk=&authcode=&saHrhnkIIX=GXgVo
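To make the shape of that link concrete, here is a minimal sketch that assembles the same query string from a parameter map. The class name LoginQuerySketch and the method buildQuery are made up for illustration, and all values are placeholders. Unlike the raw link above, the sketch URL-encodes every value (so ReturnUrl comes out percent-encoded); whether the server insists on that is an assumption I have not verified.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original code.
class LoginQuerySketch {

    // Joins the entries as key=value pairs separated by '&', URL-encoding both sides.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in the order seen in the link.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("uuid", "f5c0dd5a-762c-4230-b8c0-f70589b7dbdb"); // changes on every page load
        params.put("ReturnUrl", "http://order.jd.com/center/list.action");
        params.put("loginname", "username");   // placeholder
        params.put("nloginpwd", "xxxxxx");     // placeholder
        params.put("loginpwd", "xxxxxx");      // placeholder
        params.put("machineNet", "");
        params.put("authcode", "");
        System.out.println("https://passport.jd.com/uc/loginService?" + buildQuery(params));
    }
}
```

A LinkedHashMap is used only so the output order matches the link; the randomly named last parameter would be added the same way once it has been read from the page.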
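Every time the page is refreshed, the uuid and the last parameter (saHrhnkIIX=GXgVo in the link above) are different. I opened the login page in Firefox, assembled the parameters into the link, and visited it directly in Firefox: no problem, the login succeeded. But when I opened the login page in Firefox, assembled the parameters, and then opened the link in IE, it failed. OK, so it looks like something is stored in a cookie and checked later.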
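The Firefox/IE difference can be modelled as two separate cookie jars. The sketch below uses the JDK's java.net.CookieManager as a stand-in (the code in this post uses Apache HttpClient, which keeps its own cookie store per client instance). The cookie name sessionToken and its value are invented for illustration; I have not checked which cookies JD actually sets.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of the original code.
class CookieJarSketch {
    static final URI LOGIN_PAGE = URI.create("https://passport.jd.com/uc/login");
    static final URI LOGIN_SERVICE = URI.create("https://passport.jd.com/uc/loginService");

    // Stores a Set-Cookie header as if it had arrived with the login page response.
    static void receiveLoginPage(CookieManager jar, String setCookie) {
        Map<String, List<String>> headers = new HashMap<String, List<String>>();
        headers.put("Set-Cookie", Collections.singletonList(setCookie));
        try {
            jar.put(LOGIN_PAGE, headers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the Cookie header this jar would send to loginService, or "" if it has none.
    static String cookieHeaderForLogin(CookieManager jar) {
        try {
            Map<String, List<String>> out =
                    jar.get(LOGIN_SERVICE, new HashMap<String, List<String>>());
            List<String> values = out.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager firefox = new CookieManager(); // loaded the login page
        CookieManager ie = new CookieManager();      // never loaded it
        receiveLoginPage(firefox, "sessionToken=abc123; Path=/"); // invented cookie
        System.out.println("firefox sends: [" + cookieHeaderForLogin(firefox) + "]");
        System.out.println("ie sends:      [" + cookieHeaderForLogin(ie) + "]");
    }
}
```

Only the jar that received the login page has anything to send back, which matches what happened in the browser test. For the crawler this means the login page and the loginService request should go through the same client instance.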
Based on the analysis above, I wrote the first version of the code.
The core code is as follows:
package com.lkb.test;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.http.HttpResponse;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.message.BufferedHeader;
import org.apache.http.protocol.HTTP;

public class JD {
    // The configuration items
    private static String userName = "xxx";
    private static String password = "yyy";
    private static String redirectURL = "http://order.jd.com/center/list.action";
    private static String loginUrl = "http://passport.jd.com/uc/login";
    // Don't change the following URL
    private static String renRenLoginURL = "https://passport.jd.com/uc/loginService";
    // The HttpClient is used in one session
    private HttpResponse response;
    private DefaultHttpClient httpclient = new DefaultHttpClient();

    public Map<String,String> getParams(){
        Map<String,String> map = new HashMap<String,String>();
        String str = getText(loginUrl);
        // Cut out the uuid: the text between name="uuid" value=" and the next "/>
        String strs1[] = str.split("name=\"uuid\" value=\"");
        String strs2[] = strs1[1].split("\"/>");
        String uuid = strs2[0];
        map.put("uuid", uuid);
        System.out.println(strs2[0]);
        String str3s[] = strs1[1].split("<span class=