Building a Rotating IP and User-Agent Web Scraping Script in PHP


Rotating User-Agent

“The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.” ― MDN web docs


To reach this goal, we are going to randomly select a valid User-Agent from a file containing a list of valid User-Agent strings.


Firstly, we need to get such a file. Secondly, we have to read it and extract a random line. This can be achieved with the following function:


<?php

function getRandomUserAgent() {
  // default User-Agent, used as a fallback if the list file cannot be read
  $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0";

  // reading a randomly chosen User-Agent string from the User-Agent list file
  if ($file = fopen("user_agents.txt", "r")) {
    $userAgents = array();

    while (($line = fgets($file)) !== false) {
      $userAgents[] = $line;
    }

    fclose($file);

    if (!empty($userAgents)) {
      $userAgent = $userAgents[array_rand($userAgents)];
    }
  }

  return trim($userAgent);
}

?>
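The user_agents.txt file is simply a plain-text list with one User-Agent string per line. As an illustration (these specific entries are just examples, not a list taken from the original article), it could look like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0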

Rotating the Exit IP

To implement the IP rotation, we are going to use a proxy server.


“A proxy server is basically another computer which serves as a hub through which internet requests are processed. By connecting through one of these servers, your computer sends your requests to the server which then processes your request and returns what you were wanting. Moreover, in this way it serves as an intermediary between your home machine and the rest of the computers on the internet.” ―What Is My IP?


When using a proxy, the website we are making the request to sees the IP address of the proxy server — not ours. This enables us to scrape the target website anonymously without the risk of being banned or blocked.


Using a single proxy means that its IP address can be banned, interrupting our script. To avoid this, we would need to build a pool of proxies to route our requests through. Instead, we are going to use the Tor proxy. If you are not familiar with Tor, reading the following article is greatly recommended: How Does Tor Really Work?


“Tor passes your traffic through at least 3 different servers before sending it on to the destination. Because there’s a separate layer of encryption for each of the three relays, somebody watching your Internet connection can’t modify, or read, what you are sending into the Tor network. Your traffic is encrypted between the Tor client (on your computer) and where it pops out somewhere else in the world.” — Tor’s official documentation


First of all, we need to set up the Tor proxy. Following these OS-based guides is highly recommended:


Now, we have a Tor service listening for SOCKS4 or SOCKS5 connections on port 9050. This service builds a circuit at start-up and whenever Tor thinks it needs more, either immediately or for anticipated future use.


When a circuit is used for the first time, it is marked as dirty. By default, dirty circuits continue to be used for ten minutes. To reduce this interval to the lowest allowed value (ten seconds), we need to add this additional line to our torrc configuration file:


MaxCircuitDirtiness 10
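To verify that requests actually go out through Tor, a quick check can be made against https://check.torproject.org. The snippet below is a minimal, illustrative sketch (not part of the original article) that sends a request through the local SOCKS5 proxy on 127.0.0.1:9050 with cURL; depending on your setup, you may also need to point CURLOPT_CAINFO at a CA bundle, as in the full example later on:

<?php

// minimal sketch, assuming Tor's SOCKS proxy is listening on 127.0.0.1:9050
$curl = curl_init();

curl_setopt($curl, CURLOPT_URL, "https://check.torproject.org/");
curl_setopt($curl, CURLOPT_PROXY, "127.0.0.1");
curl_setopt($curl, CURLOPT_PROXYPORT, 9050);
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($curl);
curl_close($curl);

// the check page congratulates you when the request comes from the Tor network
if ($response !== false && strpos($response, "Congratulations") !== false) {
  echo "Traffic is being routed through Tor\n";
} else {
  echo "Traffic is NOT going through Tor\n";
}

?>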

Making the Web Scraping Script Robust

We have just seen how to make our requests look random, but this may not be enough. Our requests might still be refused, and to keep that from interrupting the script, we need to implement retry logic. After each failed attempt, we call the sleep() function to wait a few seconds before trying again. In that interval, a new circuit may be created, so the same error is less likely to happen twice. This makes our script more robust and can be easily achieved as follows:


<?php

function getScrapedData($url, $maxAttempts = 3, $timeout = 60, $sleep = 5) {

  $attempt = 1;

  while ($attempt <= $maxAttempts) {

    // downloading the page to scrape
    $html = /* ... download the page here (e.g. with cURL, as in the full example below) ... */ false;

    // on failure
    if ($html == false) {
      $attempt += 1;

      // waiting $sleep seconds on failure before a new attempt
      sleep($sleep);
    } else {
      // scraping logic (builds $scrapedData)

      // returning scraped data
      return $scrapedData;
    }
  }

  return false;
}

?>

Putting It All Together

We are now going to show a working example whose goal is to scrape the "COVID-19 pandemic by country and territory" Wikipedia page, retrieve statistics on COVID-19, and save them in a .csv file. We have not yet mentioned how to parse an HTML page to retrieve the required data. This can be achieved by harnessing a dedicated library. We used the Simple HTML Dom Parser for PHP, which can be installed via Composer with the following command:


composer require voku/simple_html_dom
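Before looking at the complete script, here is a minimal, illustrative sketch (not from the original article) of how the parser is typically used: an HTML string is loaded with str_get_html(), and elements are then retrieved with methods such as getElementById(), just as in the full example below.

<?php

require_once 'vendor/autoload.php';

use voku\helper\HtmlDomParser;

// parsing an HTML string and extracting text from it
$htmlDomParser = HtmlDomParser::str_get_html('<div id="greeting"><p>Hello, <b>world</b>!</p></div>');

// getElementById returns the matching element; plaintext strips the tags
echo $htmlDomParser->getElementById('greeting')->plaintext . "\n"; // Hello, world!

?>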

This is the complete code of the working example:


<?php


require_once '../../vendor/autoload.php';


use voku\helper\HtmlDomParser;


$url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory'; // page to scrape
$maxAttempts = 5; // number of attempts before failure
$timeout = 30; // 30 seconds
$sleep = 3; // 3 seconds


$scrapedData = getScrapedData($url, $maxAttempts, $timeout, $sleep);


if ($scrapedData !== false) {
  // creating a csv file
  $csv = fopen('data.csv', 'w');


  // populating the csv file with the scraped data
  foreach ($scrapedData as $fields) {
    fputcsv($csv, $fields);
  }


  fclose($csv);


  echo "Script successfully completed!\n";
} else {
  echo "Script failed!\n";
}


/**
 * @param string $url page to scrape
 * @param int $maxAttempts number of attempts before failing
 * @param int $timeout request timeout (in seconds)
 * @param int $sleep time spent on failure before a new attempt (in seconds)
 * @return false|string[][] scraped data on success, or false on failure
 */
function getScrapedData($url, $maxAttempts = 3, $timeout = 60, $sleep = 5) {


  $attempt = 1;


  while ($attempt <= $maxAttempts) {


    $curl = curl_init();


    // setting a randomly chosen User-Agent
    curl_setopt($curl, CURLOPT_USERAGENT, getRandomUserAgent());
    curl_setopt($curl, CURLOPT_URL, $url);


    // configuring TOR proxy
    curl_setopt($curl, CURLOPT_PROXY, "127.0.0.1");
    curl_setopt($curl, CURLOPT_PROXYPORT, "9050");
    curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);


    // setting a timeout
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);


    // certification bundle downloaded here: https://curl.haxx.se/docs/caextract.html
    curl_setopt($curl, CURLOPT_CAINFO, __DIR__ . '/cacert.pem');


    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);


    $html = curl_exec($curl);


    // on failure
    if ($html == false) {
      // printing error message
      echo curl_error($curl) . "\n";


      $attempt += 1;


      // waiting $sleep seconds on failure before a new attempt
      sleep($sleep);
    } else {
      $htmlDomParser = HtmlDomParser::str_get_html($html);


      $dataTable = $htmlDomParser->getElementById("thetable");


      $tbodyDataTable = $dataTable->getElementByTagName("tbody");


      $theadDataTable = $tbodyDataTable->getElementByClass("covid-sticky")[0];


      $headerThs = $theadDataTable->getElementsByTagName("th");


      $scrapedData = array(
        // table header
        array(
          $headerThs[0]->find('text', 0)->html,
          $headerThs[1]->find('text', 0)->html,
          $headerThs[2]->find('text', 0)->html,
          $headerThs[3]->find('text', 0)->html
        )
      );


      foreach ($tbodyDataTable->children() as $row) {
        $countryTh = $row->find("th[scope=row]");


        $rowTds = $row->getElementsByTagName("td");


        // if countryTh and rowTds exists
        if ($countryTh->count() > 0 && $rowTds->count() > 0) {
          $country = $countryTh[1]->getElementByTagName("a")->plaintext;
          $cases = $rowTds[0]->plaintext;
          $deaths = $rowTds[1]->plaintext;
          $recoveries = $rowTds[2]->plaintext;


          $scrapedData[] = array($country, $cases, $deaths, $recoveries);
        }
      }


      return $scrapedData;
    }
  }


  return false;
}


/**
 * @return string a random User-Agent
 */
function getRandomUserAgent() {
  // default User-Agent, used as a fallback if the list file cannot be read
  $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0";

  // reading a randomly chosen User-Agent string from the User-Agent list file
  if ($file = fopen("user_agents.txt", "r")) {
    $userAgents = array();

    while (($line = fgets($file)) !== false) {
      $userAgents[] = $line;
    }

    fclose($file);

    if (!empty($userAgents)) {
      $userAgent = $userAgents[array_rand($userAgents)];
    }
  }

  return trim($userAgent);
}

Conclusion

The source code of this article can be found in my GitHub repository.


That’s all, folks! I hope this helps you build a rotating IP and User-Agent web scraping script in PHP.


Let me know if you have any comments or suggestions.


Translated from: https://medium.com/better-programming/building-a-rotating-ip-and-user-agent-web-scraping-script-in-php-277bde659d20
