Spring Boot: A Simple Crawler for Building an IP Proxy Pool


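Overview

Once a crawler grows beyond the basics, an IP proxy pool becomes a basic requirement: a single proxy that sends requests too frequently gets banned, so we keep a pool of proxies and spread the requests across them.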
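To make the idea of spreading requests across a pool concrete, here is a minimal sketch of round-robin rotation using only the JDK. The class name `ProxyRotator` and its methods are hypothetical and not part of the original project; proxies are represented as plain `"ip:port"` strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not in the original project): hands out proxies in round-robin order. */
final class ProxyRotator {

    private final List<String> proxies; // entries such as "1.2.3.4:8080"
    private int cursor = 0;             // index of the next proxy to hand out

    ProxyRotator(List<String> initial) {
        this.proxies = new ArrayList<>(initial);
    }

    /** Returns the next proxy in rotation, or null when the pool is empty. */
    synchronized String next() {
        if (proxies.isEmpty()) {
            return null;
        }
        cursor = cursor % proxies.size();
        return proxies.get(cursor++);
    }

    /** Removes a proxy that has been banned or stopped responding. */
    synchronized void ban(String proxy) {
        int idx = proxies.indexOf(proxy);
        if (idx >= 0) {
            proxies.remove(idx);
            // Keep the rotation order stable when an earlier entry disappears
            if (idx < cursor) {
                cursor--;
            }
        }
    }

    synchronized int size() {
        return proxies.size();
    }
}
```

A crawler would call `next()` before each request and `ban(...)` when a proxy starts failing, so one dead proxy does not stall the whole crawl.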

Tech stack

  • HttpClient
  • Spring Boot 2.3.1
  • JDK 1.8

Quickly create a Spring Boot project

Visit https://start.spring.io/ to generate a starter project.

We will be working with HTTP endpoints, so add the Web dependency.

Click Generate to download the project as a zip archive.

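Import the Spring Boot project

After unzipping, note down the path of the extracted demo folder.

First open pom.xml in your IDE and add the snippet below inside the <project> element; in the complete pom.xml shown later it sits after </build>, just before </project>.

It switches the dependency download source to the Aliyun mirror in China (optional).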

    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/public</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>spring-milestones</id>
            <name>Spring Milestones</name>
            <url>https://maven.aliyun.com/repository/spring</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>spring-plugin</id>
            <name>spring-plugin</name>
            <url>https://maven.aliyun.com/repository/spring-plugin</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>

The import steps are:

In IDEA, click File -> Open -> paste the project folder path from earlier -> double-click pom.xml -> Open as Project, then wait for Maven to finish loading.

The pom.xml file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>
 <parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>2.3.1.RELEASE</version>
  <relativePath/>
 </parent>
 <groupId>com.github.gleans</groupId>
 <artifactId>SpringBoot-ProxyPool</artifactId>
 <version>0.0.1-SNAPSHOT</version>
 <name>SpringBoot-ProxyPool</name>
 <description>Demo project for Spring Boot</description>

 <properties>
  <java.version>1.8</java.version>
  <httpclient.version>4.5.12</httpclient.version>
  <jsonp.version>1.13.1</jsonp.version>
  <knife4j.version>2.0.3</knife4j.version>
  <lombok.version>1.18.12</lombok.version>
  <mysql.version>8.0.19</mysql.version>
 </properties>

 <dependencies>
  <dependency>
   <groupId>org.apache.httpcomponents</groupId>
   <artifactId>httpclient</artifactId>
   <version>${httpclient.version}</version>
  </dependency>
  <dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-web</artifactId>
  </dependency>
  <dependency>
   <groupId>org.projectlombok</groupId>
   <artifactId>lombok</artifactId>
   <version>${lombok.version}</version>
   <scope>provided</scope>
  </dependency>
  <dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-test</artifactId>
   <scope>test</scope>
   <exclusions>
    <exclusion>
     <groupId>org.junit.vintage</groupId>
     <artifactId>junit-vintage-engine</artifactId>
    </exclusion>
   </exclusions>
  </dependency>
  <dependency>
   <groupId>com.github.xiaoymin</groupId>
   <artifactId>knife4j-spring-boot-starter</artifactId>
   <version>${knife4j.version}</version>
  </dependency>
  <dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>${jsonp.version}</version>
  </dependency>
  <dependency>
   <groupId>mysql</groupId>
   <artifactId>mysql-connector-java</artifactId>
   <version>${mysql.version}</version>
  </dependency>
  <dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-data-jpa</artifactId>
  </dependency>
  <dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-thymeleaf</artifactId>
  </dependency>
 </dependencies>

 <build>
  <plugins>
   <plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
   </plugin>
  </plugins>
 </build>
 <repositories>
  <repository>
   <id>aliyun</id>
   <name>aliyun</name>
   <url>https://maven.aliyun.com/repository/public</url>
   <releases>
    <enabled>true</enabled>
   </releases>
   <snapshots>
    <enabled>false</enabled>
   </snapshots>
  </repository>
  <repository>
   <id>spring-milestones</id>
   <name>Spring Milestones</name>
   <url>https://maven.aliyun.com/repository/spring</url>
   <releases>
    <enabled>true</enabled>
   </releases>
   <snapshots>
    <enabled>false</enabled>
   </snapshots>
  </repository>
 </repositories>
 <pluginRepositories>
  <pluginRepository>
   <id>spring-plugin</id>
   <name>spring-plugin</name>
   <url>https://maven.aliyun.com/repository/spring-plugin</url>
   <releases>
    <enabled>true</enabled>
   </releases>
   <snapshots>
    <enabled>false</enabled>
   </snapshots>
  </pluginRepository>
 </pluginRepositories>
</project>

Create the IP entity class

package com.github.gleans.ekko.model;

import io.swagger.annotations.ApiModelProperty;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;

import javax.persistence.Entity;
import javax.persistence.Id;

@Data
@Entity(name = "ip_data")
@NoArgsConstructor
@Accessors(chain = true)
public class IPData {

    @Id
    @ApiModelProperty(value = "ID")
    private Long ipNo;

    @ApiModelProperty(value = "Country")
    private String country;

    @ApiModelProperty(value = "IP address")
    private String ipAddress;

    @ApiModelProperty(value = "Port")
    private Integer port;

    @ApiModelProperty(value = "Server address")
    private String serverAddress;

    @ApiModelProperty(value = "Anonymity")
    private String anonymous;

    @ApiModelProperty(value = "Type")
    private String type;

    @ApiModelProperty(value = "Speed")
    private String speed;

    @ApiModelProperty(value = "Connection time")
    private String connTime;

    @ApiModelProperty(value = "Survival time")
    private String survivalTime;

    @ApiModelProperty(value = "Verification time")
    private String postTime;
}

Main service class

IPServiceImpl.java

package com.github.gleans.ekko.service.impl;

import com.github.gleans.ekko.model.IPData;
import com.github.gleans.ekko.service.IPService;
import com.github.gleans.ekko.utils.HttpCustom;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

@Slf4j
@Service
public class IPServiceImpl implements IPService {

    @Override
    public List<IPData> getIpList() {
        String html = HttpCustom.getIpStore("https://www.xicidaili.com/nn/1", null, null);
        // getIpStore returns null on an I/O error and "" on a non-200 response
        if (html == null || html.isEmpty()) {
            log.warn("No HTML returned from the proxy list page");
            return new ArrayList<>();
        }

        // Parse the HTML into a DOM
        Document document = Jsoup.parse(html);

        // Select every row of the proxy table
        Elements trs = document.select("table[id=ip_list]").select("tbody").select("tr");

        if (trs.isEmpty()) {
            return new ArrayList<>();
        }

        return trs.stream()
                .map(tr -> {
                    Elements tds = tr.select("td");
                    // Skip rows without the expected <td> cells (for example a header row of <th> cells)
                    if (tds.size() < 7) {
                        return null;
                    }
                    String country = tds.get(0).text();
                    String ipAddress = tds.get(1).text();
                    Integer port = Integer.valueOf(tds.get(2).text());
                    String serverAddress = tds.get(3).text();
                    String anonymous = tds.get(4).text();
                    String ipType = tds.get(5).text();
                    String speed = tds.get(6).select("div[class=bar]").attr("title");

                    return new IPData().setIpAddress(ipAddress)
                            .setPort(port).setType(ipType)
                            .setCountry(country).setSpeed(speed)
                            .setAnonymous(anonymous).setServerAddress(serverAddress);
                }).filter(Objects::nonNull).collect(Collectors.toList());
    }
}

The core of the code above is adapted from: https://github.com/dhengyi/ip-proxy-pools-regularly
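One fragile spot in the service above is `Integer.valueOf` on the scraped port cell: a single malformed cell throws `NumberFormatException` and aborts the whole stream. The following is a minimal sketch of a more forgiving parser. `PortParser` and `parsePort` are hypothetical names, not part of the original project; a row whose port comes back as null could then be skipped by the existing null filter.

```java
/** Hypothetical helper (not in the original project): tolerant parsing of a scraped port cell. */
final class PortParser {

    private PortParser() {
    }

    /** Returns the port as an Integer, or null when the text is not a valid TCP port (1-65535). */
    static Integer parsePort(String text) {
        if (text == null) {
            return null;
        }
        try {
            int port = Integer.parseInt(text.trim());
            if (port < 1 || port > 65535) {
                return null;
            }
            return Integer.valueOf(port);
        } catch (NumberFormatException e) {
            // Non-numeric cell, e.g. an empty string or stray markup text
            return null;
        }
    }
}
```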

Request wrapper class

package com.github.gleans.ekko.utils;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HttpCustom {

    private final static int CONNECT_TIMEOUT = 3000;
    private final static int SOCKET_TIMEOUT = 3000;

    /**
     * Fetches a page, optionally through an HTTP proxy.
     *
     * @param url  page to fetch
     * @param ip   proxy host, or null for a direct connection
     * @param port proxy port, or null for a direct connection
     * @return the response body, "" for a non-200 status, or null on an I/O error
     */
    public static String getIpStore(String url, String ip, Integer port) {

        String resBody = "";

        RequestConfig.Builder configBuilder = RequestConfig
                .custom()
                .setConnectTimeout(CONNECT_TIMEOUT)
                .setSocketTimeout(SOCKET_TIMEOUT);
        // Route the request through the proxy only when both host and port are given
        if (ip != null && port != null) {
            HttpHost proxy = new HttpHost(ip, port);
            configBuilder.setProxy(proxy);
        }
        RequestConfig config = configBuilder.build();

        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);

        // Browser-like headers; the Host value is specific to the site crawled in this article
        httpGet.setHeader("Pragma", "no-cache");
        httpGet.setHeader("Connection", "keep-alive");
        httpGet.setHeader("Host", "www.xicidaili.com");
        httpGet.setHeader("Cache-Control", "no-cache");
        httpGet.setHeader("Upgrade-Insecure-Requests", "1");
        httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        // Advertise only encodings that HttpClient decompresses by default
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36");

        // try-with-resources closes the client and the response even when an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {

            // Read the body only when the server answers 200
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                resBody = EntityUtils.toString(httpResponse.getEntity(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            resBody = null;
        }

        return resBody;
    }

}
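Note the return contract of `getIpStore`: null after an `IOException`, an empty string for a non-200 status, and the page body otherwise. A caller holding several proxies could use that contract to fall back from one proxy to the next. The sketch below illustrates the idea with JDK classes only. `ProxyFallback`, `firstSuccessful` and the `fetcher` parameter are hypothetical; `fetcher` stands in for a call such as `HttpCustom.getIpStore(url, ip, port)`.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper (not in the original project): try proxies in order until one returns a page. */
final class ProxyFallback {

    private ProxyFallback() {
    }

    /**
     * Applies fetcher to each proxy in turn and returns the first non-empty body.
     * Both null (I/O error) and "" (non-200 status) count as failures.
     * Returns null when every proxy fails.
     */
    static String firstSuccessful(List<String> proxies, Function<String, String> fetcher) {
        for (String proxy : proxies) {
            String body = fetcher.apply(proxy);
            if (body != null && !body.isEmpty()) {
                return body;
            }
        }
        return null;
    }
}
```

In a real caller the lambda would split the `"ip:port"` string and delegate to the request wrapper; failed proxies are natural candidates for removal from the pool.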

application.yml configuration file

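spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/big-data?characterEncoding=utf-8
    username: root
    password: root
  jpa:
    open-in-view: true
    # Dialect matching the MySQL datasource above (assumes MySQL 8; pick the one for your server version)
    database-platform: org.hibernate.dialect.MySQL8Dialect
    # show-sql prints the executed SQL statements to the log.
    show-sql: true
    # ddl-auto controls what Hibernate does to the schema at startup.
    # create is dangerous because it drops and recreates the tables, so never use it in production; use it at most once in a test environment to initialise the schema.
    # ddl-auto: create ---- drops and recreates the tables on every start, so existing data is lost
    # ddl-auto: create-drop ---- like create, and also drops the tables when the application shuts down
    # ddl-auto: update ---- creates missing tables and updates existing ones without clearing data (recommended)
    # ddl-auto: validate ---- checks the entity mappings against the existing schema and fails on a mismatch
    hibernate:
      ddl-auto: update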

Front-end display

Tech stack

  • vue
  • element-ui
  • html5

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<body>

<div id="app">
    <h1>{{ message }}</h1>
    Re-crawl
    <el-switch v-model="isRefresh">
    </el-switch>
    <br>

    <el-table :data="tableData" border style="width: 100%">
        <el-table-column fixed prop="ipAddress" label="IP address" width="150">
        </el-table-column>
        <el-table-column prop="port" label="Port" width="120">
        </el-table-column>
        <el-table-column prop="serverAddress" label="Server address" width="120">
        </el-table-column>
        <el-table-column prop="speed" label="Speed" width="120">
        </el-table-column>
        <el-table-column prop="type" label="Request type" width="300">
        </el-table-column>
        <el-table-column prop="anonymous" label="Anonymity" width="120">
        </el-table-column>
        <el-table-column label="Actions">
            <template slot-scope="scope">
                <el-button @click="handleClick(scope.row)" type="text" size="small">View</el-button>
                <el-button type="text" size="small">Edit</el-button>
            </template>
        </el-table-column>
    </el-table>

</div>

<script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>
<script src="https://unpkg.com/axios/dist/axios.min.js"></script>
<script src="https://unpkg.com/element-ui/lib/index.js"></script>
<script type="text/javascript">
    var app = new Vue({
        el: '#app',
        methods: {
            getTableData() {
                let _this = this;
                // Fetch the proxy list from the backend
                axios.get('ip/list')
                    .then(function (response) {
                        console.log(response);
                        _this.tableData = response.data.data
                    })
                    .catch(function (error) {
                        console.log(error);
                    });
            },
            // Handler for the "View" button; only logs the row for now
            handleClick(row) {
                console.log(row);
            }
        },
        created() {
            this.getTableData()
        },
        data: {
            message: 'IP proxy pool',
            tableData: [],
            isRefresh: true
        }
    });
</script>
</body>
</html>

Result

After starting the application, visit http://127.0.0.1:8080/index

TODO

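  • Persist the data to the database so the source site is not called on every request (not yet implemented)
  • Caching, to avoid querying the database every time (not yet implemented)
  • Deduplicate the database and remove invalid entries (not yet implemented)
  • Editable page and list queries (not yet implemented)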
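As one possible starting point for the deduplication item in the TODO list, the sketch below collapses scraped entries that share the same IP and port while keeping the first occurrence and the original order. It is not the author's implementation: `ProxyDedup` and its nested `Entry` class are hypothetical, and `Entry` is a stripped-down stand-in for the article's `IPData` entity so that the example needs nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not in the original project): removes duplicate ip:port pairs. */
final class ProxyDedup {

    /** Minimal stand-in for the IPData entity, holding only the two key fields. */
    static final class Entry {
        final String ipAddress;
        final int port;

        Entry(String ipAddress, int port) {
            this.ipAddress = ipAddress;
            this.port = port;
        }
    }

    private ProxyDedup() {
    }

    /** Keeps the first entry seen for each ip:port key and preserves the input order. */
    static List<Entry> dedupe(List<Entry> scraped) {
        Map<String, Entry> byKey = new LinkedHashMap<>();
        for (Entry entry : scraped) {
            byKey.putIfAbsent(entry.ipAddress + ":" + entry.port, entry);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

Once the data is persisted, a unique constraint on the IP and port columns would be the database-side counterpart of the same rule.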

Source code

https://github.com/Gleans/SpringBootLearn/tree/master/SpringBoot-ProxyPool
