Java Web Crawling with HttpClient + jsoup

山河Y

Category: Crawlers. Tags: java

Article link: https://blog.csdn.net/qq_40167715/article/details/105978447

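Preface

I first ran into Java crawlers because I needed one at work, so there is still a lot about how they work that I don't understand; for now I only know how to use them. I'll skip crawler theory and internals here. If you need one too, this post should give you the general approach and show how to get started.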

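What a crawler is good for

In a project, both developers and testers need reasonably realistic data when testing a feature, and inserting rows into the database by hand one at a time is far too tedious. This is where a crawler helps: we find sites whose data fits our needs, scrape it, filter it, and keep what we want. Most sites do have anti-crawler measures, though, and scraping too often will get your IP temporarily blocked; in that case a proxy IP is recommended. Still, try to pick open sites without anti-crawler measures. This time my targets are https://www.tmall.com/ and https://movie.douban.com/ (the latter has anti-crawler measures).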

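Approach

Search for any product on the Tmall home page and you are redirected to a URL like this:
https://list.tmall.com/search_product.htm?q=%E5%89%83%E9%A1%BB%E5%88%80&type=p&vmarket=&spm=875.7931836%2FB.a2227oh.d100&from=mallfp…pc_1_searchbutton
While testing I found that the second half of this URL doesn't really matter (at least I haven't found a use for it yet), so we can keep just the first half: https://list.tmall.com/search_product.htm?q=%E5%89%83%E9%A1%BB%E5%88%80
The %E5%89%83%E9%A1%BB%E5%88%80 part is simply the URL-encoded (UTF-8 percent-encoded) form of 剃须刀 (razor), so we can just as well write the Chinese characters, like this: https://list.tmall.com/search_product.htm?q=剃须刀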
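To make the relationship between the keyword and its encoded form concrete, here is a minimal sketch that builds the shortened search URL using only the JDK. The class and method names (`TmallSearchUrl`, `encodeKeyword`, `buildSearchUrl`) are made up for this illustration and do not appear in the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper; the class and method names are assumptions, not part of the original post.
class TmallSearchUrl {
    static final String SEARCH_BASE = "https://list.tmall.com/search_product.htm";

    /** Percent-encodes a keyword as UTF-8 (form encoding, so a space becomes '+'). */
    static String encodeKeyword(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should never happen
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Builds the short search URL that keeps only the q parameter. */
    static String buildSearchUrl(String keyword) {
        return SEARCH_BASE + "?q=" + encodeKeyword(keyword);
    }

    public static void main(String[] args) {
        // prints https://list.tmall.com/search_product.htm?q=%E5%89%83%E9%A1%BB%E5%88%80
        System.out.println(buildSearchUrl("剃须刀"));
    }
}
```

Encoding the keyword explicitly keeps the request URL pure ASCII, so it does not depend on how a particular HTTP client treats raw Chinese characters.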
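Run the search in Chrome, right-click the page, choose Inspect, and find the following places: [screenshot]
Then look at the Preview tab to see the source or data the site returns (each site may return something different; the first screenshot is Tmall, the second is Douban): [screenshots]

Whichever kind of data we get back, we can extract what we want from it, although a response like Douban's is simpler and more convenient to process. Enough talk; see the code below for details.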

Java backend code

Main Maven dependencies

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.10-FINAL</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<!-- Gson (no version is given in the original; presumably it is managed by a parent POM) -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
</dependency>

The backend code comes in two parts, one for each kind of response (Tmall returns HTML, Douban returns JSON).
(1) Tmall

import java.io.IOException;
import java.net.URLEncoder;

import org.apache.http.Consts;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The statements below go inside a method; `log` is assumed to be an SLF4J logger field.
int total = 0;

// The search keyword can be anything you need, e.g. T恤 (T-shirt), 剃须刀 (razor), 手机 (phone)
String searchName = "剃须刀";

// try-with-resources closes the client and the response even when an exception is thrown
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    // Search URL; the keyword is URL-encoded so the request URL contains only ASCII characters
    String url = "https://list.tmall.com/search_product.htm?q=" + URLEncoder.encode(searchName, "UTF-8");
    HttpGet httpGet = new HttpGet(url);
    // Imitate a browser (copy the user-agent value from the request headers your own browser sends)
    httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");

    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        // Response status code
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode == 200) {
            // Read the HTML body and parse the whole page
            String html = EntityUtils.toString(response.getEntity(), Consts.UTF_8);
            Document doc = Jsoup.parse(html);
            // Print doc to inspect the page source that was actually fetched
            //System.out.println(doc);

            // Look at the page source in the browser, find the div that holds the results, then drill down step by step.
            // This selector value may differ between search results; check the returned source yourself.
            Elements ulList = doc.select("div[class='view  view-noCom']");
            Elements liList = ulList.select("div[class='product']");
            // Which attributes to read depends on the page source in doc
            for (Element item : liList) {
                // Product ID (item is already the div.product element)
                //String id = item.select("div[class='product']").select("p[class='productStatus']").select("span[class='ww-light ww-small m_wangwang J_WangWang']").attr("data-item");
                String id = item.attr("data-id");
                System.out.println("Product ID: " + id);
                // Product name
                String name = item.select("p[class='productTitle']").select("a").attr("title");
                System.out.println("Product name: " + name);
                // Product price
                String price = item.select("p[class='productPrice']").select("em").attr("title");
                System.out.println("Product price: " + price);
                // Product URL
                String goodsUrl = item.select("p[class='productTitle']").select("a").attr("href");
                System.out.println("Product URL: " + goodsUrl);
                // Product image URL
                String imgUrl = item.select("div[class='productImg-wrap']").select("a").select("img").attr("data-ks-lazyload");
                System.out.println("Product image URL: " + imgUrl);
                System.out.println("------------------------------------");
                total++;
            }
        } else {
            // Discard the body of a non-200 response
            EntityUtils.consume(response.getEntity());
        }
    }
} catch (IOException e) {
    log.error("Failed to scrape data", e);
}

System.out.println("Scraped " + total + " items in total");

Console output: [screenshot]
(2) Douban

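import java.io.IOException;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;

import com.alibaba.fastjson.JSON; // assumed: JSON appears to be fastjson's class, which the dependency list above omits
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// The statements below go inside a method that returns a String.
// `log` (assumed to be an SLF4J logger), `douBanService` and the `Movie` entity are defined elsewhere in the project.
int pageSize = 20; // step between requests; the endpoint appears to return 20 records per request
int end = 100;     // stop after 100 records: Douban has anti-crawler measures and temporarily blocks IPs that fetch too much
int total = 0;

Gson gson = new Gson();
// A TypeToken keeps the generic type information that map.getClass() would lose
Type responseType = new TypeToken<Map<String, List<Map<String, Object>>>>() {}.getType();

try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    for (int start = 0; start < end; start += pageSize) {
        String address = "https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=" + start;
        HttpGet httpGet = new HttpGet(address);
        // Imitate a browser (copy the user-agent value from the request headers your own browser sends)
        httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");

        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if (statusCode != 200) {
                // Discard the body and move on to the next page
                EntityUtils.consume(entity);
                continue;
            }
            // The response body is already JSON, so it is parsed directly instead of going through jsoup
            String json = EntityUtils.toString(entity, Consts.UTF_8);
            //System.out.println(json);
            Map<String, List<Map<String, Object>>> map = gson.fromJson(json, responseType);
            List<Map<String, Object>> data = (map == null) ? null : map.get("data");
            if (data == null) {
                continue;
            }
            for (Map<String, Object> da : data) {
                System.out.println(da);
                // Convert the map to the entity class
                Movie movie = JSON.parseObject(JSON.toJSONString(da), Movie.class);
                douBanService.insert(movie);
                total++;
            }
        } catch (Exception e) {
            log.error("Failed to scrape data at start=" + start, e);
        }
    }
} catch (IOException e) {
    log.error("Failed to close the HttpClient", e);
}

// Return only after every page has been processed
return "Scraped " + total + " records in total";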
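The paging loop is easy to get wrong by one request, so here is a small JDK-only sketch of how the `start` offsets work out for a 100-record target at 20 records per request. `DoubanPaging` and `startOffsets` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; not part of the original code.
class DoubanPaging {
    /** Returns the start offsets needed to cover `limit` records at `pageSize` records per request. */
    static List<Integer> startOffsets(int limit, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Integer> offsets = new ArrayList<>();
        // A strict '<' stops before the offset that would start record number limit + 1
        for (int start = 0; start < limit; start += pageSize) {
            offsets.add(start);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(startOffsets(100, 20)); // prints [0, 20, 40, 60, 80]
    }
}
```

A bound of `start <= 100` would add a sixth request at offset 100 and overshoot the 100-record target.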

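Database result: [screenshot]
Of course, before you can insert the data into the database, you need to create an entity class whose fields match the data you want to keep.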
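As a rough idea of what such an entity might look like, here is a minimal sketch. The field names (`id`, `title`, `rate`, `url`) are assumptions chosen for illustration; align them with the keys that actually appear in each entry of the returned `data` array and with your own table columns.

```java
// Hypothetical entity: the field names are assumptions, not taken from the original post.
class Movie {
    private String id;
    private String title;
    private String rate;
    private String url;

    // Getters and setters let a JSON library populate the object from matching keys
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    @Override
    public String toString() {
        return "Movie{id=" + id + ", title=" + title + ", rate=" + rate + ", url=" + url + "}";
    }
}
```

Keys in the response that have no matching field are simply not stored, so the class only needs the fields you intend to persist.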

Summary

The two pieces of code share one important detail, this snippet:

// Imitate a browser (copy the user-agent value from the request headers your own browser sends)
httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");

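The value in this key-value pair comes from Chrome: open DevTools, select the request in the Network panel, and copy the user-agent entry under Request Headers. [screenshot]
Finally, here is an HttpClient utility class: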

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpClientUtils {

    // Shared HttpClient connection pool
    private static final PoolingHttpClientConnectionManager connectionManager;
    // One client for the whole application, built once on top of the pool
    private static final CloseableHttpClient httpClient;

    static {
        connectionManager = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        connectionManager.setMaxTotal(200);
        // At most 20 connections per route (target host)
        connectionManager.setDefaultMaxPerRoute(20);
        httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
    }

    // Executes a request and returns the body as a string, or null on failure or a non-200 status
    private static String execute(HttpRequestBase httpRequestBase) {
        httpRequestBase.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36");

        // Timeouts in milliseconds
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(5000)
                .setConnectTimeout(5000)
                .setSocketTimeout(10 * 1000)
                .build();
        httpRequestBase.setConfig(config);

        // try-with-resources closes the response so its connection is returned to the pool
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            HttpEntity entity = response.getEntity();
            if (response.getStatusLine().getStatusCode() == 200) {
                return EntityUtils.toString(entity, Consts.UTF_8);
            }
            // Discard the body of a non-200 response
            EntityUtils.consume(entity);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    // GET request
    public static String doGet(String url) {
        return execute(new HttpGet(url));
    }

    // POST request with form parameters
    public static String doPost(String url, Map<String, String> params) {
        HttpPost httpPost = new HttpPost(url);

        List<BasicNameValuePair> list = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            list.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }

        // Encode the form as UTF-8; the single-argument constructor defaults to ISO-8859-1
        httpPost.setEntity(new UrlEncodedFormEntity(list, Consts.UTF_8));

        return execute(httpPost);
    }

}

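OK, anyway, that's all for today. Overall this is only a small introduction to Java crawling, aimed mainly at sites without anti-crawler measures; for sites that do have them, a proxy IP is recommended, and the experts among you can dig into that yourselves ❤~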
