Putting It into Practice: Batch-Downloading Images from Douban Online Activities

Article link: https://blog.csdn.net/mockingbirds/article/details/54565224

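Background: while browsing Douban today I came across an online activity called "来一句王家卫式的话" ("Say a line in the style of Wong Kar-wai"). I have watched quite a few films directed by Wong Kar-wai and have always liked their dialogue, but I am impatient and cannot sit through the whole album page by page, or perhaps I simply have mostly fragmented time. That led to the following idea:

Use a crawler to collect the URL of every image
Use Java to request each URL and save the image behind it to local disk
Let's get started. Here we use jsoup to scrape the data from the web page.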

Step 1: Get the "view all" address

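First open the Douban home page, find the online activities section further down, and click the activity "来一句王家卫式的话".

The activity page shows many images, but a single page cannot show all of them. Normally you would click "全部186张" ("All 186 photos") to keep browsing, so the first thing to do is get the link behind "全部186张".

Inspecting that link shows it carries the attribute id="pho-num", which is unique in the whole page. We can therefore look the tag up by that attribute and then read its href value.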

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  

Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);
System.out.println(element.attr("href"));

This prints the href of the link, which means we now have the address for browsing all the photos.

Step 2: Get the address of each image

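We can search the page source for keywords such as "来自 TZ" and "来自白良宴" (the "from <user>" captions under the photos) to quickly locate the tags we need.

What we want is the src attribute of each img tag. The page may contain other img tags that we do not need, so we first get the parent tag of each photo (class photo_wrap) and then get the img tag inside it.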

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/album/1638403254/")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  
Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
Element imageElement = null;
for (Element elementChild : elements) {
    imageElement = elementChild.getElementsByTag("img").get(0);
    System.out.println(imageElement.attr("src"));
}

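This prints the src of every photo on the page. The current page has 90 entries, which is too many to show, so the original screenshot captured only part of the output.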

Recursively crawling the next page

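We now have the src values of the img tags on the current page, but that is clearly not everything. To get all of them we imitate what a user would do: get the link behind "后页" ("Next"), then traverse that page the same way as before.

Once the tag that contains "后页" is located (an element with class next), the rest is simple: read the link that a click on "后页" would follow.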

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/album/1638403254/")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get(); 
Element nextPage = doc.getElementsByAttributeValue("class","next").get(0);
Element aTag = nextPage.getElementsByTag("a").get(0);
System.out.println(aTag.attr("href"));

This prints the link to the next page of photos.

Detecting the last page

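No matter how many images there are, there is always a last page, and on the last page the "后页" entry usually has no href. This album currently has only three pages, so we go straight to the last one.

On the last page, "后页" contains no <a> tag at all. We can use this to decide whether the current page is the last page.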
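The pagination logic described above (collect the images on a page, follow the next link, stop when there is none) can be sketched separately from the HTML parsing. This is only an illustration: the Page class, the collectAll method and the fake page names are hypothetical, and the fetch function stands in for the jsoup code that loads and parses a real page. A loop is used instead of recursion, and a visited set guards against a next link that points back to a page already seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class PaginationSketch {

    /** Hypothetical holder for what one album page yields. */
    static final class Page {
        final List<String> imageUrls;
        final String nextUrl; // null on the last page (no <a> inside "next")

        Page(List<String> imageUrls, String nextUrl) {
            this.imageUrls = imageUrls;
            this.nextUrl = nextUrl;
        }
    }

    /** Follows nextUrl from page to page and returns every image URL in order. */
    static List<String> collectAll(String startUrl, Function<String, Page> fetch) {
        List<String> all = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        String url = startUrl;
        // visited.add returns false for a URL seen before, which stops accidental loops
        while (url != null && visited.add(url)) {
            Page page = fetch.apply(url);
            all.addAll(page.imageUrls);
            url = page.nextUrl;
        }
        return all;
    }

    public static void main(String[] args) {
        // Two fake pages instead of real network requests
        Map<String, Page> fake = new HashMap<>();
        fake.put("p1", new Page(Arrays.asList("a.jpg", "b.jpg"), "p2"));
        fake.put("p2", new Page(Arrays.asList("c.jpg"), null));
        System.out.println(collectAll("p1", fake::get));
    }
}
```

In a real crawler the fetch function would wrap the jsoup calls from the steps above and return null as nextUrl when the next element holds no <a> tag.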

Getting all image addresses of the activity in one go

With the analysis above in place, collecting all image addresses of the activity in one pass is no longer difficult.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetPicLink {

    static List<String> urlLists = new ArrayList<>();

    public static void main(String[] args) {
        try {  

            // 1. From the activity link, get the "view all" link
            Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  
            Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);

            // Collect the image addresses page by page and store them in urlLists
            spideAPage(element.attr("href"));

            System.out.println(urlLists.size());
            for (String string : urlLists) {
                System.out.println(string);
            }


        } catch (IOException e) {  
            e.printStackTrace();  
        }         
    }

    private static void spideAPage(String pageUrl) {
        // 2. Given the "view all" link, collect every image address on the current page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image addresses
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            Element imageElement = null;
            for (Element elementChild : elements) {
                imageElement = elementChild.getElementsByTag("img").get(0);
                // Add the current image address to urlLists
                urlLists.add(imageElement.attr("src"));
            }

            // Look up the "后页" (next page) element on the current page
            Elements nextPages = doc.getElementsByAttributeValue("class","next");
            if (!nextPages.isEmpty()) {
                // On the last page there is no <a> tag inside "next"; checking for it
                // avoids java.lang.IndexOutOfBoundsException
                Elements aTags = nextPages.get(0).getElementsByTag("a");
                if (!aTags.isEmpty()) {
                    // 3. Recurse until the last page
                    spideAPage(aTags.get(0).attr("href"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

What remains is to fetch the bytes behind each of these image addresses and save them locally.

Adding the download code

public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the connect timeout to 3 seconds
        conn.setConnectTimeout(3 * 1000);
        // Send a browser User-Agent so the request is not rejected with a 403 error
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // Make sure the save directory exists (mkdirs also creates missing parent directories)
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);

        // try-with-resources closes both streams even when reading or writing fails
        try (InputStream inputStream = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(file)) {
            // Read the response into a byte array, then write it to the file
            byte[] getData = readInputStream(inputStream);
            fos.write(getData);
        }

        System.out.println("info:" + url + " download success");
}


private static  byte[] readInputStream(InputStream inputStream) throws IOException {    
        byte[] buffer = new byte[1024];    
        int len = 0;    
        ByteArrayOutputStream bos = new ByteArrayOutputStream();    
        while((len = inputStream.read(buffer)) != -1) {    
            bos.write(buffer, 0, len);    
        }    
        bos.close();    
        return bos.toByteArray();    
}
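As a variation on the download code above, the response can be streamed straight to disk instead of being buffered in a byte array first. The sketch below is not the article's method: StreamingDownload and download are hypothetical names, the read timeout is an added assumption, and only standard java.nio.file and java.net calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class StreamingDownload {

    /** Copies the content behind urlStr into targetDir/fileName and returns the byte count. */
    static long download(String urlStr, Path targetDir, String fileName) throws IOException {
        // Creates the directory and any missing parents; does nothing if it already exists
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);

        URLConnection conn = new URL(urlStr).openConnection();
        conn.setConnectTimeout(3_000);
        conn.setReadTimeout(10_000); // assumption: a read timeout, which the code above does not set
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // The stream is closed automatically, even if the copy fails halfway
        try (InputStream in = conn.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A call would look like StreamingDownload.download(imageUrl, Paths.get("/home/liuhang/Desktop/test"), "0.jpg"). Memory use then stays flat regardless of image size.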

Start downloading

for (int i = 0; i < urlLists.size(); i++) {
        downLoadFromUrl(urlLists.get(i),i+"","/home/liuhang/Desktop/test");
}

The result: the images are downloaded into the target directory.
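The loop above names the files 0, 1, 2 and so on, without an extension. One possible alternative, sketched below, is to reuse the last path segment of the image URL as the file name. FileNames and fromUrl are hypothetical names, the example URL is made up, and the ".jpg" fallback assumes the images are JPEGs.

```java
import java.net.URI;

final class FileNames {

    /** Returns the last path segment of url, or "<index>.jpg" when the path has none. */
    static String fromUrl(String url, int index) {
        String path = URI.create(url).getPath();
        String last = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        // Assumption: the images are JPEGs, so ".jpg" is used as the fallback extension
        return last.isEmpty() ? index + ".jpg" : last;
    }

    public static void main(String[] args) {
        // Hypothetical image URL, used only to illustrate the call
        System.out.println(fromUrl("https://img.example.com/view/photo/thumb/public/p1234.jpg", 0));
    }
}
```

Because getPath() drops the query string, a URL such as .../x.jpg?size=1 still yields x.jpg.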

Let's test it again on another activity, "午後的一張相片".

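Also, when looping over several activities, each activity needs its own directory. Here I use the trailing part of the activity URL (the numeric ID) as the directory name.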

// https://www.douban.com/online/123077659/
System.out.println("https://www.douban.com/online/123077659/".substring("https://www.douban.com/online/".length(),"https://www.douban.com/online/123077659/".length() -1));

This prints the directory name "123077659". In addition, we add a description file to that directory whose content is the activity title.
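The substring call above assumes the URL starts with exactly that prefix and ends with a trailing slash. As a hedged alternative, the ID can be pulled out with a regular expression, which also copes with album URLs such as the one used in step 2. ActivityId and fromUrl are hypothetical names, and the pattern assumes activity IDs are purely numeric, as in the examples in this article.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ActivityId {

    // Assumption: the ID is the run of digits directly after ".../online/"
    private static final Pattern ONLINE_URL =
            Pattern.compile("^https://www\\.douban\\.com/online/(\\d+)(?:/.*)?$");

    /** Returns the numeric activity ID, or an empty Optional if the URL has a different shape. */
    static Optional<String> fromUrl(String url) {
        Matcher m = ONLINE_URL.matcher(url);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("https://www.douban.com/online/123077659/"));
    }
}
```

Returning an Optional makes the "not an activity URL" case explicit, where the substring version would return a wrong directory name without any warning.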

Adding the description file

private static void writeActivityTitle(String title , String folderName) {
        File file = new File(folderName);
        if (!file.exists()) {
            file.mkdirs();
        }
        // try-with-resources flushes and closes the writer; flush() alone left the file handle open
        try (OutputStreamWriter osw = new OutputStreamWriter(
                new FileOutputStream(new File(file, "filename.txt")), "UTF-8")) {
            osw.write(title);
        } catch (Exception e) {
            e.printStackTrace();
        }
}

Summary: getting all images of a single online activity

Below is the complete code for getting all images of a single online activity.

package doubanpic;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class test {

    static List<String> urlLists = new ArrayList<>();

    public static void main(String[] args) {
        try {  

            // 1. From the activity link, get the "view all" link
            Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  
            Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);

            // Collect the image addresses page by page and store them in urlLists
            spideAPage(element.attr("href"));

            System.out.println(urlLists.size());
            for (String string : urlLists) {
                System.out.println(string);
            }

            for (int i = 0; i < urlLists.size(); i++) {
                // Replace "thumb" with "photo"; otherwise only the thumbnail is downloaded
                downLoadFromUrl(urlLists.get(i).replace("thumb", "photo"),i+".jpg","/home/liuhang/Desktop/test");
            }

        } catch (IOException e) {  
            e.printStackTrace();  
        }         
    }
    public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the connect timeout to 3 seconds
        conn.setConnectTimeout(3 * 1000);
        // Send a browser User-Agent so the request is not rejected with a 403 error
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // Make sure the save directory exists (mkdirs also creates missing parent directories)
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);

        // try-with-resources closes both streams even when reading or writing fails
        try (InputStream inputStream = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(file)) {
            // Read the response into a byte array, then write it to the file
            byte[] getData = readInputStream(inputStream);
            fos.write(getData);
        }

        System.out.println("info:" + url + " download success" + "    " + file.getAbsolutePath());
    }

    public static  byte[] readInputStream(InputStream inputStream) throws IOException {    
        byte[] buffer = new byte[1024];    
        int len = 0;    
        ByteArrayOutputStream bos = new ByteArrayOutputStream();    
        while((len = inputStream.read(buffer)) != -1) {    
            bos.write(buffer, 0, len);    
        }    
        bos.close();    
        return bos.toByteArray();    
    }    


    private static void writeActivityTitle(String title , String folderName) {
        File file = new File(folderName);
        if (!file.exists()) {
            file.mkdirs();
        }
        // try-with-resources flushes and closes the writer; flush() alone left the file handle open
        try (OutputStreamWriter osw = new OutputStreamWriter(
                new FileOutputStream(new File(file, "filename.txt")), "UTF-8")) {
            osw.write(title);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void spideAPage(String pageUrl) {
        // 2. Given the "view all" link, collect every image address on the current page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image addresses
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            Element imageElement = null;
            for (Element elementChild : elements) {
                imageElement = elementChild.getElementsByTag("img").get(0);
                // Add the current image address to urlLists
                urlLists.add(imageElement.attr("src"));
            }

            // Look up the "后页" (next page) element on the current page
            Elements nextPages = doc.getElementsByAttributeValue("class","next");
            if (!nextPages.isEmpty()) {
                // On the last page there is no <a> tag inside "next"; checking for it
                // avoids java.lang.IndexOutOfBoundsException
                Elements aTags = nextPages.get(0).getElementsByTag("a");
                if (!aTags.isEmpty()) {
                    // 3. Recurse until the last page
                    spideAPage(aTags.get(0).attr("href"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Getting all images of all activities

Get the link of each activity, then pass it to the methods analyzed earlier.
The link to Douban's online activities looks like this:
https://www.douban.com/online/list?g=h

Getting the links of all activities

public static void main(String[] args) {
        try {  

            Document doc = Jsoup.connect("https://www.douban.com/online/?r=i")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  

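            // The page contains many <a> tags, so we match on the href value; the "线上活动" link itself must also be filtered out (see 17.png).
            // The pattern is a regular expression: dots are escaped and a numeric activity ID is required.
            Elements elements = doc.getElementsByAttributeValueMatching("href", "https://www\\.douban\\.com/online/\\d+");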
            for (Element element : elements) {
                if (!"线上活动".equals(element.text()) && !"".equals(element.text())) {
                    System.out.println(element.attr("href")+" === "+element.text());
                }
            }
        } catch (IOException e) {  
            e.printStackTrace();  
        }         
}

This prints each activity link together with its title, so all the activity links have now been collected.
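The filtering rule used here (keep links that point to an activity and have non-empty text) can also be written as a small standalone function, which makes it easy to try out without touching the network. This is only a sketch: ActivityLinkFilter and filter are hypothetical names, the input is a plain array of {href, text} pairs rather than jsoup elements, and the pattern assumes numeric activity IDs. A LinkedHashMap keeps the links in page order and drops duplicates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class ActivityLinkFilter {

    // Assumption: an activity page URL ends in a numeric ID with an optional trailing slash
    private static final Pattern ACTIVITY =
            Pattern.compile("^https://www\\.douban\\.com/online/\\d+/?$");

    /**
     * links holds {href, text} pairs in page order. Returns href -> title,
     * skipping navigation links, links with empty text, and duplicates.
     */
    static Map<String, String> filter(String[][] links) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] link : links) {
            String href = link[0];
            String text = link[1].trim();
            if (ACTIVITY.matcher(href).matches() && !text.isEmpty()) {
                result.putIfAbsent(href, text); // keep the first title seen for a URL
            }
        }
        return result;
    }
}
```

With jsoup, the pairs would come from element.attr("href") and element.text() of each <a> tag.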

Hmm, this is getting a little tangled, so let me spell out the approach.

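Getting all images of all online activities breaks down into getting all images of each single activity, and then looping over the activities.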
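The decomposition just described can be sketched as a planning step that produces one download job per image. One detail worth watching in such a loop is state shared between activities: a static list like urlLists keeps the previous activity's entries unless it is cleared, so the sketch asks for a fresh list per activity. CrawlPlan, Job and plan are hypothetical names, and imagesOf stands in for the per-activity crawling shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class CrawlPlan {

    /** One planned download: which image URL is saved to which file. */
    static final class Job {
        final String imageUrl;
        final String targetPath;

        Job(String imageUrl, String targetPath) {
            this.imageUrl = imageUrl;
            this.targetPath = targetPath;
        }
    }

    /** Builds the jobs for every activity; file numbering restarts at 0 in each directory. */
    static List<Job> plan(List<String> activityIds,
                          Function<String, List<String>> imagesOf,
                          String baseDir) {
        List<Job> jobs = new ArrayList<>();
        for (String id : activityIds) {
            // A fresh list per activity: nothing carries over from the previous one
            List<String> urls = imagesOf.apply(id);
            for (int i = 0; i < urls.size(); i++) {
                jobs.add(new Job(urls.get(i), baseDir + "/" + id + "/" + i + ".jpg"));
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Fake image lists instead of real crawling
        Map<String, List<String>> fake = new HashMap<>();
        fake.put("123060577", Arrays.asList("u1", "u2"));
        fake.put("123077659", Arrays.asList("u3"));
        for (Job job : plan(Arrays.asList("123060577", "123077659"), fake::get, "/tmp/douban")) {
            System.out.println(job.imageUrl + " -> " + job.targetPath);
        }
    }
}
```

Separating planning from downloading also makes it possible to inspect or count the jobs before any file is written.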

Summary: getting all images of all online activities

The earlier sections explain things fairly clearly, so here is the code directly. I have tested it myself and it works.

package doubanpic;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetPicLink {

    static List<String> urlLists = new ArrayList<>();
    static Map<String,String> sMap = new HashMap<>();

    public static void main(String[] args) {
        try {  

            Document doc = Jsoup.connect("https://www.douban.com/online/?r=i")  
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  

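            // The page contains many <a> tags, so we match on the href value; the "线上活动" link itself must also be filtered out (see 17.png).
            // The pattern is a regular expression: dots are escaped and a numeric activity ID is required.
            Elements elements = doc.getElementsByAttributeValueMatching("href", "https://www\\.douban\\.com/online/\\d+");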
            Element activityElement = null;
            String folderName = "";
            for (Element element : elements) {
                if (!"线上活动".equals(element.text()) && !"".equals(element.text())) {
                    System.out.println(element.attr("href"));
                    sMap.put(element.attr("href"), element.text());
                }
            }

            Set<String> keys = sMap.keySet();
            for (String string : keys) {
                // 1. From the activity link, get the "view all" link
                doc = Jsoup.connect(string)  
                        .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();  
                try {
                    activityElement = doc.getElementsByAttributeValue("id","pho-num").get(0);
                } catch (Exception e) {
                    continue; // Reaching here means the page has no "view all" link
                }

                // Collect this activity's image addresses page by page into urlLists;
                // clear the list first so images from the previous activity are not downloaded again
                System.out.println(activityElement.attr("href"));
                urlLists.clear();
                try {
                    spideAPage(activityElement.attr("href"));
                } catch (Exception e) {
                    continue;
                }

                // Save a text file that stores the title of the current online activity
                folderName = activityElement.attr("href").substring("https://www.douban.com/online/".length(),activityElement.attr("href").length() - 1);
                writeActivityTitle(sMap.get(string),"/home/liuhang/Desktop/douban/"+folderName);

                System.out.println("urlLists.size() is :"+urlLists.size());
                for (int i = 0; i < urlLists.size(); i++) {
                    // Replace "thumb" with "photo"; otherwise only the thumbnail is downloaded
                    downLoadFromUrl(urlLists.get(i).replace("thumb", "photo"),i+".jpg","/home/liuhang/Desktop/douban/"+folderName);
                }
            }


        } catch (IOException e) {  
            e.printStackTrace();  
        }         

    }

    private static void writeActivityTitle(String title , String folderName) {
        File file = new File(folderName);
        if (!file.exists()) {
            file.mkdirs();
        }
        // try-with-resources flushes and closes the writer; flush() alone left the file handle open
        try (OutputStreamWriter osw = new OutputStreamWriter(
                new FileOutputStream(new File(file, "filename.txt")), "UTF-8")) {
            osw.write(title);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    private static void spideAPage(String pageUrl) throws Exception{
        // 2. Given the "view all" link, collect every image address on the current page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image addresses
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            Element imageElement = null;
            for (Element elementChild : elements) {
                imageElement = elementChild.getElementsByTag("img").get(0);
                // Add the current image address to urlLists
                urlLists.add(imageElement.attr("src"));
            }

            // Look up the "后页" (next page) element on the current page
            Elements nextPages = doc.getElementsByAttributeValue("class","next");
            if (!nextPages.isEmpty()) {
                // On the last page there is no <a> tag inside "next"; checking for it
                // avoids java.lang.IndexOutOfBoundsException
                Elements aTags = nextPages.get(0).getElementsByTag("a");
                if (!aTags.isEmpty()) {
                    // 3. Recurse until the last page
                    spideAPage(aTags.get(0).attr("href"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


    public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the connect timeout to 3 seconds
        conn.setConnectTimeout(3 * 1000);
        // Send a browser User-Agent so the request is not rejected with a 403 error
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // Make sure the save directory exists (mkdirs also creates missing parent directories)
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);

        // try-with-resources closes both streams even when reading or writing fails
        try (InputStream inputStream = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(file)) {
            // Read the response into a byte array, then write it to the file
            byte[] getData = readInputStream(inputStream);
            fos.write(getData);
        }

        System.out.println("info:" + url + " download success" + "    " + file.getAbsolutePath());
    }

    public static  byte[] readInputStream(InputStream inputStream) throws IOException {    
        byte[] buffer = new byte[1024];    
        int len = 0;    
        ByteArrayOutputStream bos = new ByteArrayOutputStream();    
        while((len = inputStream.read(buffer)) != -1) {    
            bos.write(buffer, 0, len);    
        }    
        bos.close();    
        return bos.toByteArray();    
    }    
}