A handy Java crawler worth writing down: it is fun, and the principle is simple enough to understand at a glance

Mr.madong

Original link: https://blog.csdn.net/ql_7256/article/details/107778023?utm_medium=distribute.pc_category.none-task-blog-hot-6.nonecase&depth_1-utm_source=distribute.pc_category.none-task-blog-hot-6.nonecase&request_id=

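**Main feature:** crawl images from Baidu Images and download them in one step.
That is all it does: depending on the keyword you enter, it automatically downloads the matching images, all of which are crawled from Baidu Images.
Approach
Enter any keyword and Baidu Images displays a large number of pictures.
Press F12 to open the developer tools, inspect the page source, and find the image addresses. Look at a few of them and you will see that they all follow the same pattern.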
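The pattern is easy to spot, so a simple regular expression such as https://.*?0\.jpg will do. It could be made more precise, but this one is good enough, so I did not bother.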
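
As a rough illustration of how that expression picks addresses out of page text, here is a minimal sketch. The class name `ImageUrlRegexDemo`, the helper `findImageUrls`, and the sample HTML with its `example.com` host are made up for the example; real Baidu markup may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageUrlRegexDemo {
    // The article's pattern, with the dot before "jpg" escaped so it only matches a literal dot.
    static final Pattern IMAGE_URL = Pattern.compile("https://.*?0\\.jpg");

    /** Returns every substring of the text that matches the pattern, in order of appearance. */
    static List<String> findImageUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMAGE_URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; example.com is a placeholder host.
        String html = "<img src=\"https://img.example.com/a/u=1,2&fm=26&gp=0.jpg\">"
                + "<img src=\"https://img.example.com/b/u=3,4&fm=26&gp=0.jpg\">";
        for (String url : findImageUrls(html)) {
            System.out.println(url);
        }
    }
}
```

The `.*?` part is non-greedy, so each match stops at the first `0.jpg` after an `https://`. If a page contains an `https://` link that is not an image, the match can run on until the next `0.jpg`, which is one reason a stricter pattern may be worth the effort.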

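The rest is straightforward. Use Java's URL class and IO streams to read the page that displays the images into a single string, search that string for image paths matching the regular expression above, and then download the image at each matched path.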
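
One detail worth getting right when reading the page into a string: if each 1024-byte chunk is decoded on its own, a multi-byte UTF-8 character that straddles two chunks can be corrupted. A possible way around this, sketched below, is to collect the raw bytes first and decode once at the end. `PageReaderDemo` and `readAll` are illustrative names, and an in-memory stream stands in for `url.openStream()` so the sketch runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PageReaderDemo {
    /** Reads the whole stream and decodes it as UTF-8 in a single step. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            bytes.write(buffer, 0, len); // keep raw bytes; do not decode per chunk
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Fake page content; "\u56fe\u7247" is the Chinese word for "picture".
        byte[] fakePage = "<html>\u56fe\u7247 https://img.example.com/gp=0.jpg</html>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(fakePage)) {
            System.out.println(readAll(in));
        }
    }
}
```

In the real crawler the same method could be handed the stream returned by `url.openStream()`.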

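But this locks in the topic of the downloaded images, because the keyword we entered never changes.
Look closely at the search page's URL and you will see that the keyword we entered is appended to the end of it. That is because the search uses a GET request, which carries the request data in the URL. So we can build this URL ourselves, and by concatenating a different keyword each time we can download different images.
There are two key points:
First, change the query parameter submitted in the search page URL
Second, use the regular expression to extract the URL of each image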
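
The first key point can be sketched as follows. The base URL is the one used in the listing; the class `SearchUrlDemo` and the method `buildSearchUrl` are names invented for this example. Percent-encoding the keyword with `URLEncoder` is an addition of mine rather than something the original code does, on the assumption that keywords may contain spaces or Chinese characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlDemo {
    // Search URL from the article's listing; the keyword goes after "word=".
    static final String BASE =
            "https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=";

    /** Appends the keyword to the base URL as a form-encoded query value. */
    static String buildSearchUrl(String keyword) {
        try {
            return BASE + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("cat dog"));  // keyword encoded as cat+dog
        System.out.println(buildSearchUrl("\u732b"));   // keyword encoded as %E7%8C%AB
    }
}
```

`URLEncoder` produces form encoding, so a space becomes `+`, which is the usual convention for query strings.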

package com.git.commons.service;

import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@SuppressWarnings("all")
public class InetAddressTest02
{
    public static void main(String[] args)
    {
        Scanner input = new Scanner(System.in);
        System.out.println("Welcome to this little program!");
        while (true)
        {
            System.out.println("Enter the name of the celebrity whose pictures you want to download (E or e to quit):");
            String name = input.next();
            if ("e".equals(name) || "E".equals(name))
            { break; }
            System.out.println("Downloading, please wait...");
            downBeautyPicture(name);
            System.out.println();
        }
        System.out.println("Exited successfully. See you next time!");
    }



    public static void downBeautyPicture(String name)
    {
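        // Hard-coded output directory from the original post; change it to a folder on your own machine.
        String targetPath = "C://Users//15517//Desktop//down//"+name+System.currentTimeMillis();
        new File(targetPath).mkdirs(); // mkdirs() also creates any missing parent directories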
        int count = 0;

        InputStream is = null;
        FileOutputStream fos = null;
        try
        {
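            // URL-encode the keyword so that Chinese or other non-ASCII input forms a valid query string.
            URL url = new URL("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word="
                    + java.net.URLEncoder.encode(name, "UTF-8"));
            is = url.openStream();

            int len;
            byte[] buffer = new byte[1024];
            // Collect the raw bytes first and decode once at the end, so a multi-byte
            // UTF-8 character split across two reads is not corrupted.
            ByteArrayOutputStream pageBytes = new ByteArrayOutputStream();
            while ((len = is.read(buffer)) != -1)
            { pageBytes.write(buffer, 0, len); }
            is.close(); // done with the search page; the variable is reused for each image below

            String pageText = new String(pageBytes.toByteArray(), StandardCharsets.UTF_8);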
            Pattern compile = Pattern.compile("https://.*?0\\.jpg");
            Matcher matcher = compile.matcher(pageText);
            ArrayList<String> URLs = new ArrayList<>();

            while (matcher.find())
            {
                String eachURLStr = matcher.group();

                if (URLs.contains(eachURLStr))
                { continue; }

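                //System.out.println("Downloading image " + (count + 1) + " ...");
                URL eachURL = new URL(eachURLStr);
                is = eachURL.openStream();
                // Put the counter in the file name: two images can finish within the same millisecond.
                fos = new FileOutputStream(new File(targetPath, (count + 1) + "_" + System.currentTimeMillis() + ".jpg"));
                while ((len = is.read(buffer)) != -1)
                { fos.write(buffer,0,len); }

                is.close();
                fos.flush();
                fos.close();
                URLs.add(eachURLStr);
                count ++; // count only images that were written completely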
            }
        }
        catch (IOException e)
        {
            System.out.println("Sorry, the download failed. Please try again.");
            e.printStackTrace();
        }
        finally
        {
            System.out.println("Finished. Downloaded " + count + " image(s); see the  " + targetPath + "  directory");
            if (is != null)
            {
                try
                { is.close(); }
                catch (IOException e)
                { e.printStackTrace(); }
            }
            if (fos != null)
            {
                try
                { fos.close(); }
                catch (IOException e)
                { e.printStackTrace(); }
            }
        }
    }
}
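
The listing above opens and closes its streams by hand. As an alternative, here is a minimal sketch of the save step using try-with-resources and `Files.copy` from `java.nio.file`. This is a swapped-in technique, not what the original code does. `SaveImageDemo` and `save` are illustrative names, and a small byte array stands in for the image stream so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveImageDemo {
    /** Copies the stream into targetDir/fileName and returns the number of bytes written. */
    static long save(InputStream in, Path targetDir, String fileName) throws IOException {
        Files.createDirectories(targetDir); // no error if the directory already exists
        Path target = targetDir.resolve(fileName);
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("down");
        byte[] fakeImage = {1, 2, 3, 4}; // placeholder bytes, not a real JPEG
        // try-with-resources closes the stream even if save() throws.
        try (InputStream in = new ByteArrayInputStream(fakeImage)) {
            long written = save(in, dir, "1.jpg");
            System.out.println("wrote " + written + " bytes to " + dir.resolve("1.jpg"));
        }
    }
}
```

In the crawler, the in-memory stream would be replaced by `new URL(eachURLStr).openStream()`. Wrapping each image in its own try block would also let one failed download be skipped instead of aborting the whole run.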