Crawling High-Anonymity Proxy IPs from Xici Proxy with Java

0. Background

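Anyone who has written a crawler knows the problem: if you fetch pages quickly from a single IP, a server with anti-crawling measures will block that IP, and further requests fail. There are two main ways around this: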

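  • Lower the request rate. This is tolerable for personal use, but not for a company-level product.
  • Use high-anonymity IPs. When requests go through a high-anonymity proxy, the server sees and blocks the proxy's IP address rather than the IP address of the machine that runs the program.
    A small number of high-anonymity IP addresses can be found on the web: where there is demand, there are providers. Xici Proxy (西刺代理) and Cloud Proxy (云代理) are two such providers. This article shows how to use Java code to collect the high-anonymity IPs listed on the Xici Proxy page. The Xici Proxy home page is http://www.xicidaili.com/nn/ .
    [Screenshot: the Xici Proxy page]
    The Cloud Proxy home page is http://www.ip3366.net/ .
    [Screenshot: the Cloud Proxy page]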
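
To make the goal concrete, here is a minimal sketch of how one collected entry could be turned into a proxy for Java's standard networking classes. It assumes the "ip port" format with a single space that the code in section 2 uses; the class ProxySketch and the method toProxy are hypothetical names and are not part of the project.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, not part of the original project.
class ProxySketch {
    /** Turns an "ip port" entry (single space separator) into a java.net.Proxy. */
    static Proxy toProxy(String ipPort) {
        String[] parts = ipPort.trim().split(" ");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected \"ip port\" but got: " + ipPort);
        }
        int port = Integer.parseInt(parts[1]);
        // createUnresolved performs no DNS lookup at this point
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(parts[0], port));
    }

    public static void main(String[] args) {
        // entry taken from the sample output in section 3
        Proxy proxy = toProxy("183.172.131.4 8118");
        System.out.println(proxy.type() + " " + proxy.address());
        // A request could then be opened through the proxy, for example:
        // HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/").openConnection(proxy);
    }
}
```

Free proxies go offline often, so a real crawler would probably check each entry before using it and fall back to another one on failure.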
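1. Approach

The high-anonymity IPs are collected with multiple threads plus jsoup. The thread xiciThread fetches IPs from the Xici Proxy page, and the thread cloudThread fetches IPs from the Cloud Proxy page. The ip+port entries they produce are stored in a result class called IPPort. The class IpConsumer consumes the ip+port entries produced by these two threads. Because there is a lot of code, only the main parts are shown here; the full code is available in my GitHub project [zhihuCrawler](https://github.com/LittleLawson/zhihuCrawler).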

2. Code
  • Consumer
package crawler.utils.ip.Consume;

import crawler.result.IPPort;
import crawler.utils.ip.IpUtils;

public abstract class Consumer implements  Runnable{
    private volatile IPPort ipPort;// the IPPort object shared with the producers; its monitor is used as the lock
    private IpUtils ipUtils = new IpUtils();

    public IpUtils getIpUtils() {
        return ipUtils;
    }

    public void setIpUtils(IpUtils ipUtils) {
        this.ipUtils = ipUtils;
    }

    public Consumer(IPPort ipPort){
        this.ipPort = ipPort;
    }

    public IPPort getIpPort() {
        return ipPort;
    }

    public void setIpPort(IPPort ipPort) {
        this.ipPort = ipPort;
    }
}
  • IpConsumer
package crawler.utils.ip.Consume;

import crawler.result.IPPort;

public class IpConsumer extends Consumer {
    private String ip;
    private String port;

    public IpConsumer(IPPort ipPort) {
        super(ipPort);
    }

    public String getIp() {
        return ip;
    }

    public void setIp(String ip) {
        this.ip = ip;
    }

    public String getPort() {
        return port;
    }

    public void setPort(String port) {
        this.port = port;
    }


    public void consume(){
        while(true){// consume forever
            String ipPort = null;// one "ip port" entry, separated by a single space
            // lock the shared IPPort object
            synchronized (this.getIpPort()) {
                // while, not if: re-check the queue after every wake-up
                while (this.getIpPort().getIpPortQueue().size() == 0) {
                    try {
                        // the queue is empty, so wait for a producer
                        System.out.println("ipConsumer wait...");
                        this.getIpPort().wait();
                        System.out.println("ipConsumer waking...");
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                System.out.println("before consume: " + this.getIpPort().getIpPortQueue().size());
                // retrieve and remove the head of the queue
                ipPort = this.getIpPort().getIpPortQueue().poll();
                // wake the producers only when the queue is running low
                if(this.getIpPort().getIpPortQueue().size()<10) this.getIpPort().notifyAll();
                System.out.println("after consume: " + this.getIpPort().getIpPortQueue().size());

                String str[] = ipPort.split(" ");
                if (this.getIpUtils().isValidIpPort(ipPort)) {// use the entry only if it passes the validity check
                    synchronized (this) {// update ip and port together
                        this.setIp(str[0]);
                        this.setPort(str[1]);
                        System.out.println("current Ip : "+this.getIp()+" port : "+this.getPort()+"\n");
                    }
                }
            }
            try {
                Thread.sleep(100);// sleep outside the synchronized block so the producers can take the lock
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public void run() {
        this.consume();
    }
}
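
The consumer relies on IpUtils.isValidIpPort, whose source is not shown in this article. The following is only a sketch of what such a check might involve, under the assumption that an entry is valid when it has the "ip port" format and the proxy accepts a TCP connection. The class IpPortCheckSketch and both method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for IpUtils.isValidIpPort, whose source the article does not show.
class IpPortCheckSketch {
    // four dot-separated groups of 1-3 digits, one space, then 1-5 digits
    private static final Pattern FORMAT =
            Pattern.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3}) (\\d{1,5})");

    /** Checks only the text format and the numeric ranges; no network access. */
    static boolean hasValidFormat(String ipPort) {
        if (ipPort == null) return false;
        Matcher m = FORMAT.matcher(ipPort);
        if (!m.matches()) return false;
        for (int i = 1; i <= 4; i++) {
            if (Integer.parseInt(m.group(i)) > 255) return false;
        }
        int port = Integer.parseInt(m.group(5));
        return port >= 1 && port <= 65535;
    }

    /** Tries to open a TCP connection to the proxy within timeoutMillis. */
    static boolean isReachable(String ip, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A network check like isReachable can take as long as its timeout, so it is probably better to run it after leaving the synchronized block, for the same reason section 4.1 moves Thread.sleep out of it.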
  • Producer
package crawler.utils.ip.produce;


import crawler.result.IPPort;
import crawler.utils.ip.IpUtils;
import crawler.utils.ip.WebSite;

/**
 * 01. Abstract class that represents a producer
 */
public abstract class Producer implements Runnable{
    private volatile IPPort ipPort;// the shared IPPort object that the producer writes to
    private WebSite webSite;

    public Producer(WebSite webSite, IPPort ipPort) {
        this.webSite = webSite;
        this.ipPort = ipPort;
    }

    public IPPort getIpPort() {
        return ipPort;
    }

    public void setIpPort(IPPort ipPort) {
        this.ipPort = ipPort;
    }

    public WebSite getWebSite() {
        return webSite;
    }

    public void setWebSite(WebSite webSite) {
        this.webSite = webSite;
    }
}
  • IpProducer
package crawler.utils.ip.produce;

import crawler.result.IPPort;
import crawler.utils.ip.WebSite;

/**
 * 01. specific Ip producer
 */
public class IpProducer extends Producer {
    public IpProducer(WebSite webSite, IPPort ipPort) {
        super(webSite, ipPort);
    }

    /**
     * 01. The producer never stops, so make sure the queue does not grow until memory runs out
     * 02. Keep the consumer consuming in time
     *
     */
    public void run() {
        while (true) {// get the IP forever
            this.getWebSite().getFreeIpInQueue();
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
  • WebSite
package crawler.utils.ip;

/**
 * 01.abstract class
 */
public abstract class WebSite implements FreeIP{

    private String webName;
    private String url;
    private String keyword;//the keyword to get the ip
    private String charset;
    private int page = 1;// the first page to visit

    public int getPage() {
        return page;
    }

    public void setPage(int page) {
        this.page = page;
    }

    public String getCharset() {
        return charset;
    }

    public void setCharset(String charset) {
        this.charset = charset;
    }

    public String getWebName() {
        return webName;
    }

    public void setWebName(String webName) {
        this.webName = webName;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getKeyword() {
        return keyword;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }


    /**
     * 01. get the web's next Url
     */
    public String getNextUrl(String prefix){

        String nextUrl ;
        page ++;
        if(page > 10 ){// reset to 0
            page = 0;
        }
        nextUrl = prefix + page; // build the URL of the next listing page
        this.setUrl(nextUrl);// remember it as the current url
        return  nextUrl;
    }
}
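
The paging rule in getNextUrl can be isolated into a small pure function, which makes it easier to see which pages are visited. The sketch below assumes that the listing pages are numbered from 1, as the default value of the page field suggests; note that the listing above wraps to page 0 instead. PageCycleSketch, nextPage and nextUrl are hypothetical names.

```java
// Hypothetical helper that isolates the page-cycling rule of getNextUrl.
class PageCycleSketch {
    /** Returns the page after current, wrapping to first once last has been passed. */
    static int nextPage(int current, int first, int last) {
        int next = current + 1;
        return next > last ? first : next;
    }

    /** Appends the page number to the URL prefix, as getNextUrl does. */
    static String nextUrl(String prefix, int page) {
        return prefix + page;
    }

    public static void main(String[] args) {
        int page = 1;
        for (int i = 0; i < 12; i++) {
            page = nextPage(page, 1, 10);
            System.out.println(nextUrl("http://www.xicidaili.com/nn/", page));
        }
    }
}
```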
  • XiCiIP
package crawler.utils.ip;

import crawler.httpClient.HttpClientUtils;
import crawler.result.IPPort;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import crawler.utils.other.CustomedMethod;

import java.util.HashSet;
import java.util.Set;

public class XiCiIP extends  WebSite {
    private HttpClientUtils httpClientUtils = new HttpClientUtils();
    private volatile IPPort ipPort ;

    public XiCiIP(IPPort ipPort, String url, String keyWord, String webName,String charset) {
        this.ipPort = ipPort;
        this.setUrl(url);
        this.setWebName(webName);
        this.setKeyword(keyWord);
        this.setCharset(charset);
    }

    public Set<String> getFreeIpInSet() {
        //ipPort port,for example: 13.23.49.128 80
        Set<String> ipPortSet = new HashSet<String>();
        String content = httpClientUtils.getEntityContent(this.getUrl(),this.getCharset());
        System.out.println("content "+content);
        Document document = Jsoup.parse(content);

        Element ip_list = document.getElementById("ip_list");

        CustomedMethod.printDelimiter();
        System.out.println(ip_list);

        //get the table rows (tr)
        Elements classEmpty = ip_list.select("tr");
        System.out.println("tr.size is: "+classEmpty.size());

        //default value
        String ip="0.0.0.0";
        String port="8888";

        for (Element trEle : classEmpty) {
            System.out.println(trEle.toString());
            Elements tds = trEle.select("td");

            // skip rows (such as the header row) that do not have the ip and port cells
            if(tds.size()<3)    continue;
            ip = tds.get(1).text();
            port = tds.get(2).text();
            System.out.println("ipPort: " + ip + " port: "+port);
            ipPortSet.add(ip + " " + port);// add to set
        }
        return ipPortSet;
    }

    public void getFreeIpInQueue() {
        //ipPort port,for example: 13.23.49.128 80
        String content = httpClientUtils.getEntityContent(this.getUrl(),this.getCharset());
        //System.out.println("content "+content);
        Document document = Jsoup.parse(content);

        Element ip_list = document.getElementById("ip_list");

        //CustomedMethod.printDelimiter();
        //System.out.println(ip_list);

        //get the table rows (tr)
        Elements classEmpty = ip_list.select("tr");
        //System.out.println("tr.size is: "+classEmpty.size());

        //default value
        String ip="0.0.0.0";
        String port="8888";

        for (Element trEle : classEmpty) {
            //System.out.println(trEle.toString());
            Elements tds = trEle.select("td");

            // skip rows (such as the header row) that do not have the ip and port cells
            if(tds.size()<3)    continue;
            ip = tds.get(1).text();
            port = tds.get(2).text();
            String tempIP = ip + " " + port;
            //System.out.println("ipPort: " + ip + " port: "+port);

            // every access to the shared queue happens while holding the ipPort lock
            synchronized (ipPort) {
                // skip entries that are already in the queue
                if(ipPort.getIpPortQueue().contains(tempIP)){
                    System.out.println("xiciIp: check the repeating ipPort....");
                    continue;
                }
                // while, not if: re-check the size after every wake-up
                while (ipPort.getIpPortQueue().size() >= 20) {
                    System.out.println("xiciIP producer wait...");
                    try {
                        ipPort.wait(); // wait() must be called while holding the lock
                        System.out.println("xiciIP producer waking...");
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                System.out.println("xiciIP: before produce "+ ipPort.getIpPortQueue().size());
                if (ipPort.getIpPortQueue().size() < 20) ipPort.getIpPortQueue().add(tempIP);// add to queue
                System.out.println("xiciIP: after produce "+ ipPort.getIpPortQueue().size()+"\n");
                ipPort.notifyAll();
            }
        }
        System.out.println("xiciIP -> IpPortQueue's size is: "+ipPort.getIpPortQueue().size());
        //update the url
        this.setUrl(this.getNextUrl("http://www.xicidaili.com/nn/"));
        //System.out.println("after update: "+this.getUrl());
    }
}
  • CloudIP
package crawler.utils.ip;

import crawler.httpClient.HttpClientUtils;
import crawler.result.IPPort;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.Set;

public class CloudIP extends WebSite{
    private HttpClientUtils httpClientUtils = new HttpClientUtils();
    private volatile IPPort ipPort ;

    public CloudIP(IPPort ipPort, String url, String keyWord, String webName,String charset) {
        this.ipPort = ipPort;
        this.setUrl(url);
        this.setKeyword(keyWord);
        this.setWebName(webName);
        this.setCharset(charset);
    }

    public Set<String> getFreeIpInSet() {
        return null;
    }

    public void getFreeIpInQueue() {
        //ipPort port,for example: 13.23.49.128 80
        String content = httpClientUtils.getEntityContent(this.getUrl(),this.getCharset());
        //System.out.println("content "+content);
        Document document = Jsoup.parse(content);

        Element ip_list = document.getElementById(this.getKeyword());

        //get the table rows (tr)
        Elements classEmpty = ip_list.select("tr");
        //System.out.println("tr.size is: "+classEmpty.size());

        //default value
        String ip="0.0.0.0";
        String port="8888";

        for (Element trEle : classEmpty) {
            //System.out.println(trEle.toString());
            Elements tds = trEle.select("td");

            // skip rows (such as the header row) that do not have the ip and port cells
            if(tds.size()<2)    continue;
            ip = tds.get(0).text();
            port = tds.get(1).text();
            String tempIP = ip + " " + port;
            //System.out.println("ipPort: " + ip + " port: "+port);

            // every access to the shared queue happens while holding the ipPort lock
            synchronized (ipPort) {
                // skip entries that are already in the queue
                if(ipPort.getIpPortQueue().contains(tempIP)) {
                    System.out.println("CloudIp: check the repeating ipPort....");
                    continue;
                }
                // while, not if: re-check the size after every wake-up
                while (ipPort.getIpPortQueue().size() >= 20) {
                    try {
                        System.out.println("cloudIP producer wait...");
                        ipPort.wait(); // wait() must be called while holding the lock
                        System.out.println("cloudIP producer waking...");
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
                System.out.println("CloudIP: before produce "+ ipPort.getIpPortQueue().size());
                if (ipPort.getIpPortQueue().size() < 20)   {
                    ipPort.getIpPortQueue().add(tempIP);// add to queue
                }
                System.out.println("CloudIP: after produce " + ipPort.getIpPortQueue().size()+"\n");
                ipPort.notifyAll();
            }
        }
        System.out.println("CloudIP -> IpPortQueue's size is: "+ipPort.getIpPortQueue().size());
        this.setUrl(this.getNextUrl("http://www.ip3366.net/?stype=1&page="));
        //System.out.println("after update: "+this.getUrl());
    }
}
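
Both producers skip an entry when the queue already contains it, which means a linear scan of the queue for every table row. As an alternative sketch (not what the project does), a LinkedHashSet keeps insertion order and rejects duplicates in one step. DedupQueueSketch is a hypothetical class and is not the project's IPPort.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

// Hypothetical container, not the project's IPPort class.
class DedupQueueSketch {
    // LinkedHashSet iterates in insertion order, so the first element is the oldest one
    private final LinkedHashSet<String> entries = new LinkedHashSet<>();

    /** Adds an entry at the tail; returns false if it is already queued. */
    synchronized boolean offer(String ipPort) {
        return entries.add(ipPort);
    }

    /** Removes and returns the oldest entry, or null if the container is empty. */
    synchronized String poll() {
        Iterator<String> it = entries.iterator();
        if (!it.hasNext()) return null;
        String head = it.next();
        it.remove();
        return head;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

This only removes duplicates among entries that are still queued; an entry that was already consumed can be added again, which matches the behaviour of the contains check in the listings above.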
3. Sample output
before consume: 17
after consume: 16
current Ip : 183.172.131.4 port : 8118

···

before consume: 12
after consume: 11
current Ip : 115.46.65.236 port : 8123

CloudIP: before produce 11
CloudIP: after produce 12

···

CloudIP: before produce 19
CloudIP: after produce 20

CloudIp: check the repeating ipPort....
CloudIP -> IpPortQueue's size is: 20
xiciIP: before produce 20
xiciIP: after produce 21

xiciIP producer wait...
before consume: 21
after consume: 20
current Ip : 182.88.129.67 port : 8123

before consume: 20
after consume: 19
current Ip : 171.13.85.184 port : 8010

before consume: 19
after consume: 18
current Ip : 115.219.105.38 port : 8010

before consume: 18
after consume: 17
current Ip : 101.236.58.203 port : 8866
4. Technical difficulties
4.1 Production cannot keep up with consumption

First, look at the output of a run:

ipConsumer wait...
CloudIP: before produce 0
CloudIP: after produce 1

CloudIP: before produce 1
CloudIP: after produce 2

···

ipConsumer wait...
xiciIP: before produce 0
xiciIP: after produce 1

ipConsumer waking...
before consume: 1
after consume: 0
current Ip : 58.240.224.252current port is: 33035

ipConsumer waking...
before consume: 1
after consume: 0
current Ip : 221.214.180.122current port is: 33190

ipConsumer wait...
CloudIP: before produce 0
CloudIP: after produce 1

CloudIP: before produce 1
CloudIP: after produce 2

before consume: 2
after consume: 1
current Ip : 121.9.215.94current port is: 39545
···		

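The output above shows the problem: the producers do not supply enough ip+port entries for the consumer. The consumer drains the queue to 0 and waits, and only then do the producers get to run. The consumer code at that point looked like this: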

public void consume(){
        while(true){//loop start
            String ipPort = null;//ip + port
            //lock the ipPort
            synchronized (this.getIpPort()) {
                if (this.getIpPort().getIpPortQueue().size() == 0) {
                    try {
                        //if queue don't have record
                        System.out.println("ipConsumer wait...");
                        this.getIpPort().wait();
                        System.out.println("ipConsumer waking...");
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }

                if (this.getIpPort().getIpPortQueue().size() > 0) {//if ipPortQueue's size more than zero
                    System.out.println("before consume: " + this.getIpPort().getIpPortQueue().size());
                    //Retrieves and removes the head of this queue, or return the null if this queue is empty
                    ipPort = this.getIpPort().getIpPortQueue().poll();
                    this.getIpPort().notifyAll();// notify the all ip producer
                    System.out.println("after consume: " + this.getIpPort().getIpPortQueue().size());

                    String str[] = ipPort.split(" ");
                    synchronized (this) {//ensure the replace keep synchronized
                        this.setIp(str[0]);
                        this.setPort(str[1]);
                        System.out.println("current Ip : "+this.getIp()+"port : "+this.getPort()+"\n");
                    }
                }
                try {
                    Thread.sleep(2000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }

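This happens because the consumer holds the lock almost all the time. Thread.sleep sits inside the synchronized block, so the lock is held during the sleep, and when the block ends the loop re-acquires the lock right away. In practice the producers only get the lock after the consumer calls wait(), so the lock is not shared evenly between producers and consumer. Sleeping while holding the lock is unnecessary, so the code can be changed as follows: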

public void consume(){
        while(true){//loop start
            String ipPort = null;//ip + port
            //lock the ipPort
            synchronized (this.getIpPort()) {
                ···
            }
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }            
        }
    }

After this change the problem no longer appeared, but another one showed up.

4.2 The producers overfill the queue and the consumer cannot keep up

The problem is visible in the following log:

ipConsumer wait...
xiciIP: before produce 0
xiciIP: after produce 1

ipConsumer waking...
before consume: 1
after consume: 0
current Ip : 27.17.45.90port : 43411

xiciIP: before produce 0
xiciIP: after produce 1
···
xiciIP: before produce 26
xiciIP: after produce 27
···
xiciIP: before produce 50
xiciIP: after produce 51
···
cloudIP producer wait...
xiciIP producer wait...
before consume: 64
after consume: 63
current Ip : 115.210.68.94port : 8010

xiciIP: before produce 63
xiciIP: after produce 64

cloudIP producer waking...
CloudIP: before produce 64
CloudIP: after produce 65

xiciIP producer wait...
cloudIP producer wait...

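The question is: I limited the queue size to 50, so why has it reached 65 here?
The reason is that the consumer calls notifyAll after every single entry it consumes, which wakes all producer threads. A producer that wakes up continues right after its wait() call and adds an entry without checking the size again, so several producers add entries in turn and the queue exceeds its limit. The fix is:

  • Do not call notifyAll too early in the consumer thread. Use an if check: notify the producers only when few IPs are left in the queue (<10).
  • When a producer puts an ip+port entry into the queue, check queue.size again. The modified code looks like this: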
 if (this.getIpPort().getIpPortQueue().size() > 0) {//if ipPortQueue's size more than zero
                    System.out.println("before consume: " + this.getIpPort().getIpPortQueue().size());
                    //Retrieves and removes the head of this queue, or return the null if this queue is empty
                    ipPort = this.getIpPort().getIpPortQueue().poll();
                    if(this.getIpPort().getIpPortQueue().size()<10) this.getIpPort().notifyAll();// notify the all ip producer
                    System.out.println("after consume: " + this.getIpPort().getIpPortQueue().size());
						...
                    }
                }
public void getFreeIpInQueue() {
       ···
        for (Element trEle : classEmpty) {
			  ···
            //the synchronized to ipPort
            synchronized (ipPort) {
                if (ipPort.getIpPortQueue().size() >= 20) {
                    ···
                }
                System.out.println("xiciIP: before produce "+ ipPort.getIpPortQueue().size());
                if (ipPort.getIpPortQueue().size() < 20) ipPort.getIpPortQueue().add(ip + " " + port);// add to queue
                System.out.println("xiciIP: after produce "+ ipPort.getIpPortQueue().size()+"\n");
                ipPort.notifyAll();
            }
        }
        System.out.println("xiciIP -> IpPortQueue's size is: "+ipPort.getIpPortQueue().size());
        //update the url
        this.setUrl(this.getNextUrl("http://www.xicidaili.com/nn/"));
        //System.out.println("after update: "+this.getUrl());
    }
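
The same bound can be enforced with the standard guarded-wait idiom, in which the condition is re-checked in a while loop every time wait() returns. The following self-contained sketch uses two producers and one consumer on a hypothetical BoundedBufferSketch class (none of these names come from the project) and records the largest queue size it ever observes.

```java
import java.util.LinkedList;
import java.util.Queue;

// Hypothetical bounded buffer; the class and method names are not from the original project.
class BoundedBufferSketch {
    private final Queue<String> queue = new LinkedList<>();
    private final int capacity;
    private int maxSeen = 0; // largest size ever observed, used to check the bound

    BoundedBufferSketch(int capacity) { this.capacity = capacity; }

    synchronized void put(String item) throws InterruptedException {
        // while, not if: after a wake-up the queue may already be full again
        while (queue.size() >= capacity) {
            wait();
        }
        queue.add(item);
        maxSeen = Math.max(maxSeen, queue.size());
        notifyAll(); // wake a waiting consumer
    }

    synchronized String take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        String item = queue.poll();
        notifyAll(); // wake waiting producers
        return item;
    }

    synchronized int maxSeen() { return maxSeen; }

    /** Runs two producers and one consumer; returns the largest queue size observed. */
    static int demo(int capacity, int itemsPerProducer) {
        BoundedBufferSketch buffer = new BoundedBufferSketch(capacity);
        Runnable producer = () -> {
            try {
                for (int i = 0; i < itemsPerProducer; i++) buffer.put("item-" + i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread p1 = new Thread(producer);
        Thread p2 = new Thread(producer);
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 2 * itemsPerProducer; i++) buffer.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        p1.start(); p2.start(); consumer.start();
        try {
            p1.join(); p2.join(); consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return buffer.maxSeen();
    }

    public static void main(String[] args) {
        System.out.println("largest queue size: " + demo(20, 200));
    }
}
```

Because every put re-checks the size while holding the lock, the consumer can call notifyAll after each take without the queue growing past its capacity.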
4.3 The queue size still exceeds the limit
before consume: 10
after consume: 9
current Ip : 106.15.42.179port : 33543

cloudIP producer waking...
CloudIP: before produce 9
CloudIP: after produce 10

····

CloudIP: before produce 26
CloudIP: after produce 27

···

CloudIP -> IpPortQueue's size is: 30
xiciIP: before produce 30
xiciIP: after produce 31

···
xiciIP: before produce 49
xiciIP: after produce 50

xiciIP: before produce 50
xiciIP: after produce 51

xiciIP producer wait...
before consume: 51
after consume: 50
current Ip : 183.63.123.3port : 56489

CloudIP: before produce 50
CloudIP: after produce 51

xiciIP: before produce 51
xiciIP: after produce 52

cloudIP producer wait...
xiciIP producer wait...
before consume: 52
after consume: 51
current Ip : 183.47.2.201port : 30278

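As the log shows, the queue size still exceeds the limit. I traced this to the ipPortQueue field not being declared volatile. The volatile keyword guarantees visibility, that is, a write to the field by one thread is promptly seen by other threads. There are two producers, and although they cannot enter the synchronized block on ipPort at the same time, each of them decides whether to wait or to add based on an if() check of the queue size. (Strictly speaking, volatile covers only the reference held in the field, not the contents of the queue, and reads inside synchronized (ipPort) already see up-to-date data. A guard written as if rather than while is not re-evaluated after wait() returns, which by itself can let the size pass the limit.) The checks look like this: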

 //the synchronized to ipPort
            synchronized (ipPort) {
                if (ipPort.getIpPortQueue().size() >= 50) {
                   ···
                }
				 ···
				 if (ipPort.getIpPortQueue().size() < 20)   {
                    ipPort.getIpPortQueue().add(ip + " " + port);// add to queue
                }
               ···
            }

I will not go into the volatile keyword further here. The code change is:
private Queue<String> ipPortQueue = new LinkedList<String>(); -> private volatile Queue<String> ipPortQueue = new LinkedList<String>();
This way every thread sees the latest ipPortQueue.
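
As an alternative to hand-written wait/notify code (this is not what the project uses), java.util.concurrent.ArrayBlockingQueue enforces the capacity by itself, so the queue cannot grow past the limit however many producers there are. A minimal sketch, with the hypothetical class BlockingQueueSketch and placeholder entries:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Alternative sketch using java.util.concurrent instead of the article's wait/notify code.
class BlockingQueueSketch {
    /** Tries to add count placeholder entries; returns how many the queue accepted. */
    static int fill(BlockingQueue<String> queue, int count) {
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            // offer returns false instead of growing past the capacity
            if (queue.offer("entry-" + i)) accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(20);
        System.out.println("accepted: " + fill(queue, 25) + ", size: " + queue.size());
        // put() would block while the queue is full; take() blocks while it is empty
        String head = queue.poll(100, TimeUnit.MILLISECONDS);
        System.out.println("head: " + head);
    }
}
```

With this class a producer would call put() and the consumer take(), and the explicit synchronized, wait and notifyAll calls would no longer be needed around the queue.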
