A Story of Fetching HTTP Resources with Java
Origin
While reading around I noticed that, as the saying goes, an old driver had started the bus on GitHub, so naturally I hopped on. The project itself is about detecting pornographic images with machine learning. That is not the point, though; the point is that it also ships the image URLs. So... well, you know.
Approach
The approach is equally clear: read the URLs from the file, fetch each image directly over HTTP, and save it.
Basics
Reading the file
A look at the file format showed one URL per line, which makes reading it trivial:
String urlsFilePath = "path/to/urls.txt";
try (FileReader reader = new FileReader(urlsFilePath);
     BufferedReader br = new BufferedReader(reader)) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the URL on this line
    }
} catch (IOException e) {
    e.printStackTrace();
}
HTTP requests
The open-source world offers plenty of HTTP libraries; I went with Apache HttpClient (the exact versions are listed at the end).
A simple wrapper
The factory below is borrowed from code on GitHub with only minor changes; I no longer remember the source, so my apologies to the original author.
public final class HttpClientFactory {

    /**
     * Default connect timeout
     */
    private static final int DEFAULT_CONNECT_TIMEOUT = 5000;
    /**
     * Default socket (read) timeout
     */
    private static final int DEFAULT_SOCKET_TIMEOUT = 60000;
    /**
     * Default timeout for leasing a connection from the pool
     */
    private static final int DEFAULT_CONN_REQUEST_TIMEOUT = 5000;
    /**
     * Maximum total connections
     */
    private static final int DEFAULT_MAX_CONN_TOTAL = 200;
    /**
     * Maximum connections per route (host)
     */
    private static final int DEFAULT_MAX_CONN_PER_ROUTE = 20;

    private static final String HTTP = "http";
    private static final String HTTPS = "https";

    /**
     * Build an HttpClient with the default configuration.
     *
     * @return HttpClient
     */
    public static HttpClient defaultClient() {
        return HttpClients.custom().setConnectionManager(getPoolingClientConnectionManager())
                .setDefaultRequestConfig(getRequestConfig(DEFAULT_CONNECT_TIMEOUT, DEFAULT_SOCKET_TIMEOUT, DEFAULT_CONN_REQUEST_TIMEOUT))
                .setMaxConnTotal(DEFAULT_MAX_CONN_TOTAL).setMaxConnPerRoute(DEFAULT_MAX_CONN_PER_ROUTE).build();
    }

    /**
     * Build an HttpClient with the default configuration, optionally behind a proxy.
     *
     * @return HttpClient
     */
    public static HttpClient defaultClient(HttpHost host) {
        HttpClientBuilder builder = HttpClients.custom()
                .setConnectionManager(getPoolingClientConnectionManager())
                .setDefaultRequestConfig(getRequestConfig(DEFAULT_CONNECT_TIMEOUT, DEFAULT_SOCKET_TIMEOUT, DEFAULT_CONN_REQUEST_TIMEOUT))
                .setMaxConnTotal(DEFAULT_MAX_CONN_TOTAL)
                .setMaxConnPerRoute(DEFAULT_MAX_CONN_PER_ROUTE);
        if (host != null) {
            builder.setProxy(host);
        }
        return builder.build();
    }

    /**
     * Default connection manager. Note that the trust-all X509TrustManager below
     * disables certificate validation; that is acceptable for a throwaway crawler
     * but should never be used in production.
     *
     * @return PoolingHttpClientConnectionManager
     */
    public static PoolingHttpClientConnectionManager getPoolingClientConnectionManager() {
        try {
            SSLContext sslContext = SSLContexts.custom().useTLS().build();
            sslContext.init(null, new TrustManager[] { new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                public void checkClientTrusted(X509Certificate[] certs, String authType) {
                }
                public void checkServerTrusted(X509Certificate[] certs, String authType) {
                }
            } }, null);
            Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, PlainConnectionSocketFactory.INSTANCE)
                    .register(HTTPS, new SSLConnectionSocketFactory(sslContext))
                    .build();
            PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(socketFactoryRegistry);
            SocketConfig socketConfig = SocketConfig.custom()
                    .setTcpNoDelay(true)
                    .build();
            connManager.setDefaultSocketConfig(socketConfig);
            ConnectionConfig connectionConfig = ConnectionConfig.custom()
                    .setMalformedInputAction(CodingErrorAction.IGNORE)
                    .setUnmappableInputAction(CodingErrorAction.IGNORE)
                    .setCharset(Consts.UTF_8)
                    .build();
            connManager.setDefaultConnectionConfig(connectionConfig);
            return connManager;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Build a RequestConfig.
     *
     * @param connectTimeout           connect timeout
     * @param socketTimeout            response read timeout
     * @param connectionRequestTimeout timeout for leasing a connection from the pool
     * @return RequestConfig
     */
    public static RequestConfig getRequestConfig(int connectTimeout, int socketTimeout, int connectionRequestTimeout) {
        return RequestConfig.custom().setConnectionRequestTimeout(connectionRequestTimeout).setConnectTimeout(connectTimeout).setSocketTimeout(socketTimeout)
                .build();
    }
}
On top of HttpClientFactory sits a thin wrapper around HttpClient. The code below is my own; any resemblance is purely coincidental. Since I only need GET, only GET is wrapped, and only minimally.
public class HttpClientHelper {

    private HttpClientHelper() {}

    private static final String DEFAULT_CHARSET = Consts.UTF_8.name();

    public static HttpResponse get(HttpHost host, String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpClient client = HttpClientFactory.defaultClient(host);
        HttpGet request = new HttpGet(requestURI);
        request.setConfig(RequestConfig.DEFAULT);
        dealHeader(request, header);
        return client.execute(request);
    }

    // Copy the caller-supplied headers, if any, onto the request.
    private static void dealHeader(HttpGet request, Map<String, String> header) {
        if (header == null) {
            return;
        }
        for (Map.Entry<String, String> entry : header.entrySet()) {
            request.setHeader(entry.getKey(), entry.getValue());
        }
    }

    public static HttpEntity getEntity(HttpHost host, String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpResponse response = get(host, requestURI, header, charset);
        return response.getEntity();
    }

    public static HttpEntity getEntity(String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpResponse response = get(null, requestURI, header, charset);
        return response.getEntity();
    }

    public static HttpEntity getEntity(String requestURI, Map<String, String> header) throws IOException {
        return HttpClientHelper.getEntity(null, requestURI, header, DEFAULT_CHARSET);
    }

    public static HttpEntity getEntity(HttpHost host, String requestURI) throws IOException {
        return HttpClientHelper.getEntity(host, requestURI, null, DEFAULT_CHARSET);
    }

    public static HttpEntity getEntity(String requestURI) throws IOException {
        return HttpClientHelper.getEntity(null, requestURI, null, DEFAULT_CHARSET);
    }
}
With this wrapper in place, requesting a resource becomes a one-liner:
// proxy: the proxy server; url: the resource to fetch
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
Downloading the images
With the groundwork above, all that remains is saving the fetched resource locally.
Saving a resource fetched through HttpClient:
byte[] buffer = new byte[1024];
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
File file = new File("path/to/target/file");
try (InputStream inputStream = entity.getContent();
     FileOutputStream out = new FileOutputStream(file)) {
    int n;
    while ((n = inputStream.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}
The code above is the most primitive way to save a stream; these days it is more common to lean on Apache commons-io instead:
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
FileUtils.writeByteArrayToFile(img, EntityUtils.toByteArray(entity));
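If you would rather avoid the commons-io dependency, the JDK's own java.nio.file.Files can do the same one-call write. A minimal sketch; ByteFileWriter is an illustrative name, not from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ByteFileWriter {
    // Write the fetched bytes to disk in one call, creating parent directories first.
    public static Path writeBytes(String target, byte[] data) throws IOException {
        Path path = Paths.get(target);
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        return Files.write(path, data);
    }
}
```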
A simple download utility
public class DownloadUtils {

    public static String URL_SPLIT = "/";

    public static void downloadImage(String url, String path) throws Exception {
        downloadImageByProxy(null, url, path);
    }

    public static void downloadImageByProxy(HttpHost proxy, String url, String path) throws Exception {
        HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
        if (entity == null) {
            throw new Exception("response carried no entity");
        }
        File img = generatorFile(path, getFileNameByURI(url));
        FileUtils.writeByteArrayToFile(img, EntityUtils.toByteArray(entity));
    }

    private static File generatorFile(String path, String name) throws IOException {
        String fileName = String.format("img-%d-%s", System.currentTimeMillis(), name);
        String filePath = path + File.separator + fileName;
        File file = new File(filePath);
        FileUtils.forceMkdir(file.getParentFile());
        FileUtils.touch(file);
        return file;
    }

    private static String getFileNameByURI(String url) throws Exception {
        // e.g. http://i.imgur.com/3152Dkx.jpg -> 3152Dkx.jpg
        String[] es = StringUtils.split(url, URL_SPLIT);
        if (es.length < 1) {
            throw new Exception("could not derive a file name from the URL");
        }
        return es[es.length - 1];
    }
}
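The file-name logic in getFileNameByURI simply keeps the last "/"-separated segment of the URL. The same idea in standalone form (FileNameExtractor is an illustrative name; note that a query string such as ?1 stays attached to the name, matching the behaviour above):

```java
public class FileNameExtractor {
    // Extract the last "/"-separated segment of a URL, e.g.
    // "http://i.imgur.com/3152Dkx.jpg" -> "3152Dkx.jpg".
    public static String fileNameOf(String url) {
        String[] parts = url.split("/");
        if (parts.length < 1 || parts[parts.length - 1].isEmpty()) {
            throw new IllegalArgumentException("cannot derive a file name from: " + url);
        }
        return parts[parts.length - 1];
    }
}
```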
Evolution
The single-threaded version
Source
public static void backup(String[] args) {
    int countPath = 0;
    String basePath = System.getProperty("user.dir") + File.separator + "download" + File.separator + "images";
    String urlsFilePath = Main.class.getResource("/").getPath() + File.separator + "imgur.txt";
    try (FileReader reader = new FileReader(urlsFilePath);
         BufferedReader br = new BufferedReader(reader)) {
        int count = 0;
        String line;
        while ((line = br.readLine()) != null) {
            count++;
            // Skip entries already fetched in an earlier run; resume from line 3044.
            if (count < 3044) {
                countPath = 3;
                continue;
            }
            // Roll over to a new subdirectory every 1000 images.
            if (count % 1000 == 0) {
                countPath++;
            }
            String path = basePath + File.separator + countPath;
            try {
                DownloadUtils.downloadImage(line, path);
                System.out.println(String.format("%d: %s\t downloaded", count, line));
            } catch (Exception e) {
                System.out.println(String.format("%d: %s\t failed", count, line));
            }
            try {
                TimeUnit.MILLISECONDS.sleep(300);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The goal is clear enough: read a URL, then download the resource. But both reading the file and downloading block, so throughput is poor; the roughly 4K records would likely take well over an hour, and that assumes a decent network. Moving to multiple threads, with URL reading separated from downloading, became necessary.
The logic is simple: one thread reads the file and appends each URL to a shared list; downloader threads pull URLs from that list and fetch them.
Since frequent requests from a single address risk an IP ban, I used proxy IPs, one thread per proxy. That in turn means making sure no resource is downloaded twice.
The other question is when the loop inside each thread should terminate.
The resulting code:
public static void read(String filePath, String basePath) {
    try (FileReader fileReader = new FileReader(filePath);
         BufferedReader br = new BufferedReader(fileReader)) {
        int count = 0;
        int countPath = 0;
        while (true) {
            String line;
            try {
                line = br.readLine();
            } catch (IOException e) {
                System.out.println("failed to read a line");
                e.printStackTrace();
                continue;
            }
            if (line == null) {
                // End of file: signal the downloader threads that no more URLs are coming.
                end = true;
                break;
            }
            count++;
            if (count % 1000 == 0) {
                countPath++;
            }
            String path = basePath + File.separator + countPath;
            // Store "path" + SPLIT + "url" so a downloader gets both the target directory and the URL.
            Main.urls.add(String.format("%s%s%s", path, SPLIT, line));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void down(HttpHost proxy) {
    String fileUrl;
    boolean can = true;
    int count = -1;
    // Keep going until the reader is finished AND every queued URL has been consumed.
    while (!end || count < Main.urls.size()) {
        try {
            TimeUnit.MILLISECONDS.sleep(100);
        } catch (InterruptedException e1) {
            System.out.println(e1.getMessage());
        }
        if (can) {
            // Claim the next index so no two threads download the same URL.
            count = getCount();
        }
        try {
            fileUrl = urls.get(count);
            can = true;
        } catch (IndexOutOfBoundsException e) {
            // The reader has not caught up yet; retry the same index next round.
            can = false;
            continue;
        }
        String[] es = StringUtils.split(fileUrl, SPLIT);
        if (es.length != 2) {
            System.out.println(String.format("malformed fileUrl: \t%s", fileUrl));
            return;
        }
        try {
            DownloadUtils.downloadImageByProxy(proxy, es[1], es[0]);
            System.out.println(String.format("%s\t downloaded", es[1]));
        } catch (Exception e) {
            System.out.println(String.format("%s\t failed", es[1]));
            System.out.println(e.getMessage());
        }
    }
}

private static Integer getCount() {
    synchronized (lock) {
        // Increment and read inside the lock, otherwise two threads may observe the same value.
        count++;
        return count;
    }
}
public static void main(String[] args) {
    CountDownLatch countDownLatch = new CountDownLatch(hosts.size() + 1);
    String basePath = System.getProperty("user.dir") + File.separator + "download" + File.separator + "images";
    String urlsFilePath = Main.class.getResource("/").getPath() + File.separator + "imgur.txt";
    // Start the reader thread.
    new Thread(() -> {
        read(urlsFilePath, basePath);
        countDownLatch.countDown();
    }).start();
    // One downloader thread per proxy.
    for (HttpHost proxy : hosts) {
        new Thread(() -> {
            down(proxy);
            countDownLatch.countDown();
        }).start();
    }
    try {
        countDownLatch.await();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("done");
}
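Incidentally, the hand-rolled synchronized counter can be replaced with java.util.concurrent.atomic.AtomicInteger, which performs the increment-and-read as one atomic operation without an explicit lock. A sketch (Counter and nextCount are illustrative names, not from the original code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Counter {
    // Start at -1 so the first claimed index is 0, matching the original logic.
    private static final AtomicInteger count = new AtomicInteger(-1);

    // Atomically advance and return the next index; safe to call from any number of threads.
    public static int nextCount() {
        return count.incrementAndGet();
    }
}
```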
Closing remarks
At this point the little bus basically runs, but there is still room for improvement, for instance separating the fetching of a resource from saving it to disk.
The project is tiny and did not take long, but one could imagine adding a message broker such as Kafka, a thread pool, and a shared pool of proxy IPs instead of one thread per proxy.
Those changes would mean rewriting a fair amount of code, so perhaps I will share that another time.
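The producer/consumer split and thread pool mentioned above map naturally onto java.util.concurrent.BlockingQueue plus an ExecutorService. A minimal sketch under stated assumptions: all names here are illustrative, the poison-pill marker is my own convention, and a real worker would call the download code where this one only counts items:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineSketch {
    private static final String POISON = "\u0000END\u0000"; // marker telling a consumer to stop

    // Drain `urls` with `workers` consumer threads; returns the number of items consumed.
    public static int run(java.util.List<String> urls, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger consumed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String url = queue.take();       // blocks until an item is available
                        if (POISON.equals(url)) break;   // stop marker: exit this worker's loop
                        consumed.incrementAndGet();      // a real worker would download `url` here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (String url : urls) queue.put(url);          // producer side: the file reader would feed this
        for (int i = 0; i < workers; i++) queue.put(POISON); // one stop marker per worker
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return consumed.get();
    }
}
```

Because take() blocks, this removes the sleep-and-retry polling and the manual end flag from the version above.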
Appendix
- Dependencies used
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.10</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.8.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
</dependencies>