Originally I wanted to grab the OSChina logo, but I've apparently hit the oschina homepage with HttpClient too many times over the past few days — now every request comes back 403. The likely cause is that my requests carry no browser headers, so the site detected them and started rejecting me.
So let's switch targets and grab the Baidu logo instead.
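Incidentally, the 403 problem above can often be worked around by sending a browser-like User-Agent header. With Apache HttpClient that is a one-liner, `httpGet.setHeader("User-Agent", "...")`, before executing the request. As a self-contained illustration of the same idea (using only the JDK's HttpURLConnection so it runs without extra jars — the UA string below is just an example, not a magic value):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentDemo {
    // Returns a connection that presents itself as a desktop browser.
    // The User-Agent string is illustrative, not authoritative.
    static HttpURLConnection browserLike(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = browserLike("http://www.baidu.com/");
        // The header is set before connect(); nothing has gone over the wire yet.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Whether this actually restores access depends on what the server checks, of course; some sites look at more than the User-Agent.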
After warming up in the first three posts, this one finally puts the regex and HttpClient pieces together to capture a real target.
Let's review the steps:
1. Request the page resource with HttpClient
2. Analyze the resource
3. Match the target string with a regular expression
4. Capture and output the target data with the Java regex API
Step 1: request the resource — HttpGetUtils.java. I wrote this utility class in the previous post and paste it here again; if the request steps are unclear, see Simple Java Crawler Series (2): Using HttpClient.
package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Created by liuhj on 2016/1/4.
 */
public class HttpGetUtils {
    /**
     * Issue a GET request and return the response body as a string.
     * @param url the URL to request
     * @return the response body, or "" on failure
     */
    public static String get(String url) {
        String result = "";
        try {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(url);
            CloseableHttpResponse response = httpclient.execute(httpGet);
            try {
                if (response != null
                        && response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    System.out.println(response.getStatusLine());
                    HttpEntity entity = response.getEntity();
                    result = readResponse(entity, "utf-8");
                }
            } finally {
                // Close the response before the client; in the reverse order,
                // a failure in httpclient.close() would leak the response.
                response.close();
                httpclient.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * Read the entity's content stream line by line with the given charset.
     * @param resEntity the response entity
     * @param charset the character set to decode with
     * @return the accumulated content, or null when the entity is null
     */
    private static String readResponse(HttpEntity resEntity, String charset) {
        StringBuffer res = new StringBuffer();
        BufferedReader reader = null;
        try {
            if (resEntity == null) {
                return null;
            }
            reader = new BufferedReader(new InputStreamReader(
                    resEntity.getContent(), charset));
            String line = null;
            while ((line = reader.readLine()) != null) {
                res.append(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (IOException e) {
                // ignore close failure
            }
        }
        return res.toString();
    }
}
With this we can pull down the page source. It is only a utility class, though — it works together with the code below.
Step 2: analyze the resource
Analyzing the raw printed output would be a mess; it's much easier to inspect the target page directly.
Open www.baidu.com in Chrome and right-click → Inspect.
The highlighted line in the inspector is where the Baidu logo sits; looking at it, the image right after the hidefocus attribute is the logo, so now we can write the regular expression.
Step 3: the regular expression
String regex = "hidefocus.+?src=\"//(.+?)\"";
It starts at hidefocus and lazily skips any characters up to src.
The content inside the double quotes after src= is what we want, so we wrap it in parentheses as a capturing group —
that way it lines up with Java's group() method when we use the API.
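To see the group capture in isolation, here is a minimal, self-contained check of the pattern against a hand-written fragment of the logo tag (the fragment is simplified from Baidu's markup, not copied verbatim):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Extracts the logo URL captured by group 1, or "" when nothing matches.
    static String extractLogo(String html) {
        Matcher m = Pattern.compile("hidefocus.+?src=\"//(.+?)\"").matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String html = "<img hidefocus=true src=\"//www.baidu.com/img/bd_logo1.png\" width=270>";
        System.out.println(extractLogo(html));  // www.baidu.com/img/bd_logo1.png
    }
}
```

Note the reluctant quantifiers (`.+?`): a greedy `.+` would run past the first closing quote and swallow too much of the line.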
Step 4: RegexStringUtils.java — capture and output with the Java regex API. For the details, see Simple Java Crawler Series (3): Regular Expressions and the Java Regex API.
package com.hldh.river;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by liuhj on 2016/1/4.
 */
public class RegexStringUtils {
    public static String regexString(String targetStr, String patternStr) {
        Pattern pattern = Pattern.compile(patternStr);
        // Create a matcher to run the pattern over the target string
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found...
        if (matcher.find()) {
            // ...return the first capturing group
            // System.out.println(matcher.group(1));
            return matcher.group(1);
        }
        return "";
    }
}
Finally, here is the main class, App.java:
package com.hldh.river;

/**
 * Fetch the Baidu logo with a regular expression.
 */
public class App {
    public static void main(String[] args) {
        String url = "http://www.baidu.com/";
        String regex = "hidefocus.+?src=\"//(.+?)\"";
        System.out.println(regex);
        String result = HttpGetUtils.get(url);
        System.out.println(result);
        String src = RegexStringUtils.regexString(result, regex);
        System.out.println(src);
    }
}
Output:
hidefocus.+?src="//(.+?)"
HTTP/1.1 200 OK
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">.....middle omitted.....</script></body></html>
www.baidu.com/img/bd_logo1.png
The last line is the address of the Baidu logo image.
You might think a bare address is useless — if what I'd captured were pictures of pretty girls, I still couldn't look at them. No rush:
here is a utility class that downloads the image with HttpClient, DownloadUtils.java.
package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.*;

/**
 * Created by liuhj on 2016/1/8.
 * Downloads the captured image.
 */
public class DownloadUtils {
    public static String get(String url) {
        String filename = "";
        // The captured address has no scheme, so prepend one
        String targetUrl = "http://" + url;
        try {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(targetUrl);
            CloseableHttpResponse response = httpclient.execute(httpGet);
            try {
                if (response != null
                        && response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    System.out.println(response.getStatusLine());
                    HttpEntity entity = response.getEntity();
                    filename = download(entity);
                }
            } finally {
                response.close();
                httpclient.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return filename;
    }

    private static String download(HttpEntity resEntity) {
        // Directory to save the image into
        String dirPath = "d:\\img\\";
        // Image file name; could also be generated
        String fileName = "b_logo.png";
        // Create the directory and file if they don't exist yet
        // (mkdirs also creates missing parent directories)
        File dir = new File(dirPath);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        File filePath = new File(dirPath.concat(fileName));
        if (!filePath.exists()) {
            try {
                filePath.createNewFile();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Copy the entity's input stream through a buffered output stream
        // into the file, then flush and close everything
        BufferedOutputStream out = null;
        InputStream in = null;
        try {
            if (resEntity == null) {
                return null;
            }
            in = resEntity.getContent();
            out = new BufferedOutputStream(new FileOutputStream(filePath));
            byte[] bytes = new byte[1024];
            int len = -1;
            while ((len = in.read(bytes)) != -1) {
                out.write(bytes, 0, len);
            }
            out.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Close in finally so a write error can't skip the close
                if (out != null) {
                    out.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException e) {
                // ignore close failure
            }
        }
        return filePath.toString();
    }
}
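The hand-rolled close logic in download() is the fiddly part: in the original version out.close() sat inside the try block, so any write error would leak the stream. Since Java 7, try-with-resources closes streams automatically even when an exception is thrown. Here is a minimal sketch of the same copy loop, using in-memory streams in place of the HTTP entity and the file so it runs standalone:

```java
import java.io.*;

public class CopyDemo {
    // Copies the input stream in 1 KB chunks; both streams are
    // flushed and closed automatically by try-with-resources.
    static byte[] copy(InputStream in) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream src = in;
             BufferedOutputStream out = new BufferedOutputStream(sink)) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = src.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        } // streams closed here, even if read/write threw
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "logo bytes".getBytes("UTF-8");
        byte[] copied = copy(new ByteArrayInputStream(data));
        System.out.println(copied.length);  // 10
    }
}
```

In DownloadUtils the same pattern would wrap resEntity.getContent() and the FileOutputStream, removing the whole finally block.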
And here is the saved image.
That wraps up crawling the Baidu logo with regular expressions. As an extension, the next post will use Jsoup to fetch and save the image instead.