A Story of Fetching HTTP Resources with Java
Origin
While reading around I noticed that, as the saying goes, an old driver had started the bus on GitHub, so naturally I hopped on. The project itself is about detecting pornographic images with machine learning. That is not the point, though; the point is that it also ships the image URLs. So... well, you know.
Approach
The approach is equally clear: read the URLs from the file, fetch each image directly over HTTP, and save it.
Basics
Reading the file
A look at the file format showed one URL per line, which makes reading it trivial:
String urlsFilePath = "path/to/urls.txt";
try (FileReader reader = new FileReader(urlsFilePath);
     BufferedReader br = new BufferedReader(reader)) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the URL on this line
    }
} catch (IOException e) {
    e.printStackTrace();
}
HTTP requests
The open-source world offers plenty of HTTP libraries; I went with Apache HttpClient (the exact versions are listed at the end).
A simple wrapper
The factory below is borrowed from code on GitHub with only minor changes; I no longer remember the source, so my apologies to the original author.
public final class HttpClientFactory {

    /**
     * Default connect timeout
     */
    private static final int DEFAULT_CONNECT_TIMEOUT = 5000;
    /**
     * Default socket (read) timeout
     */
    private static final int DEFAULT_SOCKET_TIMEOUT = 60000;
    /**
     * Default timeout for leasing a connection from the pool
     */
    private static final int DEFAULT_CONN_REQUEST_TIMEOUT = 5000;
    /**
     * Maximum total connections
     */
    private static final int DEFAULT_MAX_CONN_TOTAL = 200;
    /**
     * Maximum connections per route (host)
     */
    private static final int DEFAULT_MAX_CONN_PER_ROUTE = 20;

    private static final String HTTP = "http";
    private static final String HTTPS = "https";

    /**
     * Build an HttpClient with the default configuration.
     *
     * @return HttpClient
     */
    public static HttpClient defaultClient() {
        return HttpClients.custom().setConnectionManager(getPoolingClientConnectionManager())
                .setDefaultRequestConfig(getRequestConfig(DEFAULT_CONNECT_TIMEOUT, DEFAULT_SOCKET_TIMEOUT, DEFAULT_CONN_REQUEST_TIMEOUT))
                .setMaxConnTotal(DEFAULT_MAX_CONN_TOTAL).setMaxConnPerRoute(DEFAULT_MAX_CONN_PER_ROUTE).build();
    }

    /**
     * Build an HttpClient with the default configuration, optionally behind a proxy.
     *
     * @return HttpClient
     */
    public static HttpClient defaultClient(HttpHost host) {
        HttpClientBuilder builder = HttpClients.custom()
                .setConnectionManager(getPoolingClientConnectionManager())
                .setDefaultRequestConfig(getRequestConfig(DEFAULT_CONNECT_TIMEOUT, DEFAULT_SOCKET_TIMEOUT, DEFAULT_CONN_REQUEST_TIMEOUT))
                .setMaxConnTotal(DEFAULT_MAX_CONN_TOTAL)
                .setMaxConnPerRoute(DEFAULT_MAX_CONN_PER_ROUTE);
        if (host != null) {
            builder.setProxy(host);
        }
        return builder.build();
    }

    /**
     * Default connection manager. Note that the trust-all X509TrustManager below
     * disables certificate validation; that is acceptable for a throwaway crawler
     * but should never be used in production.
     *
     * @return PoolingHttpClientConnectionManager
     */
    public static PoolingHttpClientConnectionManager getPoolingClientConnectionManager() {
        try {
            SSLContext sslContext = SSLContexts.custom().useTLS().build();
            sslContext.init(null, new TrustManager[] { new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                public void checkClientTrusted(X509Certificate[] certs, String authType) {
                }
                public void checkServerTrusted(X509Certificate[] certs, String authType) {
                }
            } }, null);
            Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, PlainConnectionSocketFactory.INSTANCE)
                    .register(HTTPS, new SSLConnectionSocketFactory(sslContext))
                    .build();
            PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(socketFactoryRegistry);
            SocketConfig socketConfig = SocketConfig.custom()
                    .setTcpNoDelay(true)
                    .build();
            connManager.setDefaultSocketConfig(socketConfig);
            ConnectionConfig connectionConfig = ConnectionConfig.custom()
                    .setMalformedInputAction(CodingErrorAction.IGNORE)
                    .setUnmappableInputAction(CodingErrorAction.IGNORE)
                    .setCharset(Consts.UTF_8)
                    .build();
            connManager.setDefaultConnectionConfig(connectionConfig);
            return connManager;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Build a RequestConfig.
     *
     * @param connectTimeout           connect timeout
     * @param socketTimeout            response read timeout
     * @param connectionRequestTimeout timeout for leasing a connection from the pool
     * @return RequestConfig
     */
    public static RequestConfig getRequestConfig(int connectTimeout, int socketTimeout, int connectionRequestTimeout) {
        return RequestConfig.custom().setConnectionRequestTimeout(connectionRequestTimeout).setConnectTimeout(connectTimeout).setSocketTimeout(socketTimeout)
                .build();
    }
}
On top of HttpClientFactory sits a thin wrapper around HttpClient. The code below is my own; any resemblance is purely coincidental. Since I only need GET, only GET is wrapped, and only minimally.
public class HttpClientHelper {

    private HttpClientHelper() {}

    private static final String DEFAULT_CHARSET = Consts.UTF_8.name();

    public static HttpResponse get(HttpHost host, String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpClient client = HttpClientFactory.defaultClient(host);
        HttpGet request = new HttpGet(requestURI);
        request.setConfig(RequestConfig.DEFAULT);
        dealHeader(request, header);
        return client.execute(request);
    }

    // Copy the caller-supplied headers, if any, onto the request.
    private static void dealHeader(HttpGet request, Map<String, String> header) {
        if (header == null) {
            return;
        }
        for (Map.Entry<String, String> entry : header.entrySet()) {
            request.setHeader(entry.getKey(), entry.getValue());
        }
    }

    public static HttpEntity getEntity(HttpHost host, String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpResponse response = get(host, requestURI, header, charset);
        return response.getEntity();
    }

    public static HttpEntity getEntity(String requestURI, Map<String, String> header, String charset) throws IOException {
        HttpResponse response = get(null, requestURI, header, charset);
        return response.getEntity();
    }

    public static HttpEntity getEntity(String requestURI, Map<String, String> header) throws IOException {
        return HttpClientHelper.getEntity(null, requestURI, header, DEFAULT_CHARSET);
    }

    public static HttpEntity getEntity(HttpHost host, String requestURI) throws IOException {
        return HttpClientHelper.getEntity(host, requestURI, null, DEFAULT_CHARSET);
    }

    public static HttpEntity getEntity(String requestURI) throws IOException {
        return HttpClientHelper.getEntity(null, requestURI, null, DEFAULT_CHARSET);
    }
}
With this wrapper in place, requesting a resource becomes a one-liner:
// proxy: the proxy server; url: the resource to fetch
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
Downloading the images
With the groundwork above, all that remains is saving the fetched resource locally.
Saving a resource fetched through HttpClient:
byte[] buffer = new byte[1024];
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
File file = new File("path/to/target/file");
try (InputStream inputStream = entity.getContent();
     FileOutputStream out = new FileOutputStream(file)) {
    int n;
    while ((n = inputStream.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}
The code above is the most primitive way to save a stream; these days it is more common to lean on Apache commons-io instead:
HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
FileUtils.writeByteArrayToFile(img, EntityUtils.toByteArray(entity));
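If you would rather avoid the commons-io dependency, the JDK's own java.nio.file.Files can do the same one-call write. A minimal sketch; ByteFileWriter is an illustrative name, not from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ByteFileWriter {
    // Write the fetched bytes to disk in one call, creating parent directories first.
    public static Path writeBytes(String target, byte[] data) throws IOException {
        Path path = Paths.get(target);
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        return Files.write(path, data);
    }
}
```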
A simple download utility
public class DownloadUtils {

    public static String URL_SPLIT = "/";

    public static void downloadImage(String url, String path) throws Exception {
        downloadImageByProxy(null, url, path);
    }

    public static void downloadImageByProxy(HttpHost proxy, String url, String path) throws Exception {
        HttpEntity entity = HttpClientHelper.getEntity(proxy, url);
        if (entity == null) {
            throw new Exception("response carried no entity");
        }
        File img = generatorFile(path, getFileNameByURI(url));
        FileUtils.writeByteArrayToFile(img, EntityUtils.toByteArray(entity));
    }

    private static File generatorFile(String path, String name) throws IOException {
        String fileName = String.format("img-%d-%s", System.currentTimeMillis(), name);
        String filePath = path + File.separator + fileName;
        File file = new File(filePath);
        FileUtils.forceMkdir(file.getParentFile());
        FileUtils.touch(file);
        return file;
    }

    private static String getFileNameByURI(String url) throws Exception {
        // e.g. http://i.imgur.com/3152Dkx.jpg -> 3152Dkx.jpg
        String[] es = StringUtils.split(url, URL_SPLIT);
        if (es.length < 1) {
            throw new Exception("could not derive a file name from the URL");
        }
        return es[es.length - 1];
    }
}
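The file-name logic in getFileNameByURI simply keeps the last "/"-separated segment of the URL. The same idea in standalone form (FileNameExtractor is an illustrative name; note that a query string such as ?1 stays attached to the name, matching the behaviour above):

```java
public class FileNameExtractor {
    // Extract the last "/"-separated segment of a URL, e.g.
    // "http://i.imgur.com/3152Dkx.jpg" -> "3152Dkx.jpg".
    public static String fileNameOf(String url) {
        String[] parts = url.split("/");
        if (parts.length < 1 || parts[parts.length - 1].isEmpty()) {
            throw new IllegalArgumentException("cannot derive a file name from: " + url);
        }
        return parts[parts.length - 1];
    }
}
```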
Evolution
The single-threaded version
Source
public static void backup(String[] args) {
    int countPath = 0;
    String basePath = System.getProperty("user.dir") + File.separator + "download" + File.separator + "images";
    String urlsFilePath = Main.class.getResource("/").getPath() + File.separator + "imgur.txt";
    try (FileReader reader = new FileReader(urlsFilePath);
         BufferedReader br = new BufferedReader(reader)) {
        int count = 0;
        String line;
        while ((line = br.readLine()) != null) {
            count++;
            // Skip entries already fetched in an earlier run; resume from line 3044.
            if (count < 3044) {
                countPath = 3;
                continue;
            }
            // Roll over to a new subdirectory every 1000 images.
            if (count % 1000 == 0) {
                countPath++;
            }
            String path = basePath + File.separator + countPath;
            try {
                DownloadUtils.downloadImage(line, path);
                System.out.println(String.format("%d: %s\t downloaded", count, line));
            } catch (Exception e) {
                System.out.println(String.format("%d: %s\t failed", count, line));
            }
            try {
                TimeUnit.MILLISECONDS.sleep(300);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The goal is clear enough: read a URL, then download the resource. But both reading the file and downloading block, so throughput is poor; the roughly 4K records would likely take well over an hour, and that assumes a decent network. Moving to multiple threads, with URL reading separated from downloading, became necessary.
The logic is simple: one thread reads the file and appends each URL to a shared list; downloader threads pull URLs from that list and fetch them.
Since frequent requests from a single address risk an IP ban, I used proxy IPs, one thread per proxy. That in turn means making sure no resource is downloaded twice.
The other question is when the loop inside each thread should terminate.
The resulting code:
public static void read(String filePath, String basePath) {
    try (FileReader fileReader = new FileReader(filePath);
         BufferedReader br = new BufferedReader(fileReader)) {
        int count = 0;
        int countPath = 0;
        while (true) {
            String line;
            try {
                line = br.readLine();
            } catch (IOException e) {
                System.out.println("failed to read a line");
                e.printStackTrace();
                continue;
            }
            if (line == null) {
                // End of file: signal the downloader threads that no more URLs are coming.
                end = true;
                break;
            }
            count++;
            if (count % 1000 == 0) {
                countPath++;
            }
            String path = basePath + File.separator + countPath;
            // Store "path" + SPLIT + "url" so a downloader gets both the target directory and the URL.
            Main.urls.add(String.format("%s%s%s", path, SPLIT, line));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void down(HttpHost proxy) {
    String fileUrl;
    boolean can = true;
    int count = -1;
    // Keep going until the reader is finished AND every queued URL has been consumed.
    while (!end || count < Main.urls.size()) {
        try {
            TimeUnit.MILLISECONDS.sleep(100);
        } catch (InterruptedException e1) {
            System.out.println(e1.getMessage());
        }
        if (can) {
            // Claim the next index so no two threads download the same URL.
            count = getCount();
        }
        try {
            fileUrl = urls.get(count);
            can = true;
        } catch (IndexOutOfBoundsException e) {
            // The reader has not caught up yet; retry the same index next round.
            can = false;
            continue;
        }
        String[] es = StringUtils.split(fileUrl, SPLIT);
        if (es.length != 2) {
            System.out.println(String.format("malformed fileUrl: \t%s", fileUrl));
            return;
        }
        try {
            DownloadUtils.downloadImageByProxy(proxy, es[1], es[0]);
            System.out.println(String.format("%s\t downloaded", es[1]));
        } catch (Exception e) {
            System.out.println(String.format("%s\t failed", es[1]));
            System.out.println(e.getMessage());
        }
    }
}

private static Integer getCount() {
    synchronized (lock) {
        // Increment and read inside the lock, otherwise two threads may observe the same value.
        count++;
        return count;
    }
}
public static void main(String[] args) {
    CountDownLatch countDownLatch = new CountDownLatch(hosts.size() + 1);
    String basePath = System.getProperty("user.dir") + File.separator + "download" + File.separator + "images";
    String urlsFilePath = Main.class.getResource("/").getPath() + File.separator + "imgur.txt";
    // Start the reader thread.
    new Thread(() -> {
        read(urlsFilePath, basePath);
        countDownLatch.countDown();
    }).start();
    // One downloader thread per proxy.
    for (HttpHost proxy : hosts) {
        new Thread(() -> {
            down(proxy);
            countDownLatch.countDown();
        }).start();
    }
    try {
        countDownLatch.await();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("done");
}
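Incidentally, the hand-rolled synchronized counter can be replaced with java.util.concurrent.atomic.AtomicInteger, which performs the increment-and-read as one atomic operation without an explicit lock. A sketch (Counter and nextCount are illustrative names, not from the original code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Counter {
    // Start at -1 so the first claimed index is 0, matching the original logic.
    private static final AtomicInteger count = new AtomicInteger(-1);

    // Atomically advance and return the next index; safe to call from any number of threads.
    public static int nextCount() {
        return count.incrementAndGet();
    }
}
```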
Closing remarks
At this point the little bus basically runs, but there is still room for improvement, for instance separating the fetching of a resource from saving it to disk.
The project is tiny and did not take long, but one could imagine adding a message broker such as Kafka, a thread pool, and a shared pool of proxy IPs instead of one thread per proxy.
Those changes would mean rewriting a fair amount of code, so perhaps I will share that another time.
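The producer/consumer split and thread pool mentioned above map naturally onto java.util.concurrent.BlockingQueue plus an ExecutorService. A minimal sketch under stated assumptions: all names here are illustrative, the poison-pill marker is my own convention, and a real worker would call the download code where this one only counts items:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineSketch {
    private static final String POISON = "\u0000END\u0000"; // marker telling a consumer to stop

    // Drain `urls` with `workers` consumer threads; returns the number of items consumed.
    public static int run(java.util.List<String> urls, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger consumed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String url = queue.take();       // blocks until an item is available
                        if (POISON.equals(url)) break;   // stop marker: exit this worker's loop
                        consumed.incrementAndGet();      // a real worker would download `url` here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (String url : urls) queue.put(url);          // producer side: the file reader would feed this
        for (int i = 0; i < workers; i++) queue.put(POISON); // one stop marker per worker
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return consumed.get();
    }
}
```

Because take() blocks, this removes the sleep-and-retry polling and the manual end flag from the version above.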
Appendix
- Dependencies used
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.10</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.8.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
</dependencies>