How to scrape data in Node.js through a crawler HTTP proxy IP

Latest recommended article published on 2024-04-19 16:38:30

DATA5U


Category: Crawler Series. Tags: HTTP proxy, proxy IP, NodeJs, crawler, data scraping, Data5U (无忧代理) proxy IP

Article link: https://blog.csdn.net/u010978757/article/details/81901639


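Question: I am new to web crawling. I can already scrape simple data, but many of the articles I have read mention anti-crawler mechanisms, and they all say that sites use the client IP to block crawlers. The second article also suggests using a proxy pool to get around this, but I still don't quite understand: how do I actually use these proxy IPs from Node.js?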

Many vendors provide HTTP proxies. This example uses the crawler proxy IP service from Data5U (无忧代理): http://www.data5u.com/buy/dynamic.html

The Node.js integration code is as follows:

/**
 * Make sure the request and bluebird modules are installed.
 * To install them, open a terminal in the project directory and run: npm install request bluebird
 **/

var request = require("request");
var Promise = require("bluebird");

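// Fill in your Data5U (无忧代理) order number
var order = 'please-input-your-order-here';
// URL to test against
var targetURL = 'http://ip.chinaz.com/getip.aspx';
// Request timeout in milliseconds
var timeout = 8000;
// Number of test rounds
var testTime = 5;
// Interval between API calls in milliseconds
var sleepTime = 5000;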

var apiURL = 'http://api.ip.data5u.com/dynamic/get.html?order=' + order + '&sep=3';

console.log('>>>> start test dynamic ip');

function getProxyList() {
    return new Promise((resolve, reject) => {
        var options = {
            method: 'GET',
            url: apiURL,
            gzip: true,
            encoding: null,
            headers: {},
        };

        request(options, function (error, response, body) {
            try {
                if (error) throw error;
                // match() returns null when the response contains no ip:port pairs
                var ret = (body + '').match(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}/g);
                resolve(ret || []);
            } catch (e) {
                return reject(e);
            }
        });
    });
}

function execute(){
    getProxyList().then(function (proxyList) {
        var targetOptions = {
            method: 'GET',
            url: targetURL,
            timeout: timeout,
            encoding: null,
        };

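        proxyList.forEach(function (proxyurl) {
            console.log(`* testing ${proxyurl}`);
            var startTimestamp = (new Date()).valueOf();
            // Give each request its own options object instead of mutating the shared one
            var options = Object.assign({}, targetOptions, { proxy: 'http://' + proxyurl });
            request(options, function (error, response, body) {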
                try {
                    if (error) throw error;
                    body = body.toString();
                    var endTimestamp = (new Date()).valueOf();
                    console.log('  > time ' + (endTimestamp - startTimestamp) + 'ms ' + body);
                } catch (e) {
                    console.error(e);
                }
            });
        });
    }).catch(e => {
        console.log(e);
    })
}

// Run on a timer until testTime rounds have completed
var interval = setInterval(function(){
    if(testTime > 0){
        execute()
    } else {
        clearInterval(interval);
        console.log('<<<< end test dynamic ip');
    }
    testTime = testTime - 1;
}, sleepTime);
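
The `request` package used above has since been deprecated by its maintainers, so it may help to see the same two steps without it. The following is a minimal sketch, not the vendor's method: it assumes the API response is plain text containing `ip:port` pairs (as the regular expression in the script implies) and that the target is a plain `http://` URL. The function names `parseProxyList`, `buildProxiedOptions`, and `fetchViaProxy` are hypothetical. For an HTTP proxy, the request is sent to the proxy's host and port, with the full target URL as the request path.

```javascript
const http = require('http');

// Extract every "ip:port" pair from the API response text.
// Returns an empty array (never null) when nothing matches.
function parseProxyList(body) {
  const matches = String(body).match(/\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}/g);
  return matches || [];
}

// Build http.request options that route a plain-HTTP GET through a proxy.
// The socket connects to the proxy; the path carries the absolute target URL.
function buildProxiedOptions(proxy, targetURL, timeoutMs) {
  const [host, port] = proxy.split(':');
  const target = new URL(targetURL);
  return {
    host: host,
    port: Number(port),
    method: 'GET',
    path: target.href,
    headers: { Host: target.host },
    timeout: timeoutMs,
  };
}

// Fetch targetURL through the given proxy and resolve with the body text.
function fetchViaProxy(proxy, targetURL, timeoutMs) {
  return new Promise((resolve, reject) => {
    const options = buildProxiedOptions(proxy, targetURL, timeoutMs);
    const req = http.request(options, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks).toString()));
    });
    // The 'timeout' event does not abort the request by itself.
    req.on('timeout', () => req.destroy(new Error('proxy request timed out')));
    req.on('error', reject);
    req.end();
  });
}
```

A call such as `fetchViaProxy('1.2.3.4:8080', 'http://ip.chinaz.com/getip.aspx', 8000)` returns a promise for the response body, where the proxy address is a placeholder. HTTPS targets need a CONNECT tunnel, which this sketch does not cover.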

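Answer from a Zhihu user (https://www.zhihu.com/question/26804984)
From personal experience, the simple anti-crawler techniques are:
1. Checking request headers. For example, requests whose User-Agent is not a browser are rejected, as are requests whose Referer does not come from a specific domain (a common anti-hotlinking technique). This is the most common anti-crawler technique.

2. Checking the user's cookies. Sites that require login often use this, for example forums, Weibo, and Xueqiu.

Both of these can be handled by setting headers and cookies by hand; Python programmers can do this easily with requests.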
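
In Node.js the same idea can be sketched as follows. This is an illustrative helper, not part of the original script: `serializeCookies` and `buildBrowserLikeOptions` are hypothetical names, the User-Agent string is just an example browser value, and the cookie values would have to come from a real logged-in session. The returned object has the `method`, `url`, and `headers` shape that the `request` call in the script above accepts.

```javascript
// Turn { name: value } pairs into a Cookie header value: "a=1; b=2".
function serializeCookies(cookies) {
  return Object.keys(cookies)
    .map((name) => `${name}=${cookies[name]}`)
    .join('; ');
}

// Build request options that carry a browser-like User-Agent,
// a Referer from the expected domain, and the session cookies.
function buildBrowserLikeOptions(url, referer, cookies) {
  const headers = {
    // Example value only; any current browser User-Agent string would do.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Referer': referer,
  };
  const cookieHeader = serializeCookies(cookies || {});
  if (cookieHeader) {
    headers['Cookie'] = cookieHeader;
  }
  return { method: 'GET', url: url, headers: headers };
}
```

Keeping header construction in one function makes it easy to combine with a proxy: the caller can add a `proxy` field to the returned object before passing it to the HTTP client.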

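There are also some more complex techniques:
1. Data returned via Ajax is then processed by obfuscated JavaScript, and that processing can be made complicated enough that the crawler author cannot analyze it.
2. Data is exchanged with the server through Flash, for example the vessel information requests on shipxy.com (船讯网, www.shipxy.com).
3. Access is limited by the number of requests per unit of time from an IP or a specific account. This is basically unsolvable; try crawling Google Scholar and see.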
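
The per-IP limit in item 3 is the case a proxy pool is meant to soften: requests are spread across several proxy IPs and spaced out in time. The sketch below shows one simple way to do that, round-robin rotation with a fixed gap between requests. The names `createProxyRotator` and `planRequests` are hypothetical, the proxy addresses in any real use would come from the provider's API, and this does nothing against limits tied to an account.

```javascript
// Return a function that hands out proxies from the list in round-robin order.
function createProxyRotator(proxies) {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error('proxy list is empty');
  }
  let index = 0;
  return function nextProxy() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Assign each URL a proxy and a start delay so that consecutive requests
// use different IPs and are spaced gapMs milliseconds apart.
function planRequests(urls, nextProxy, gapMs) {
  return urls.map((url, i) => ({
    url: url,
    proxy: 'http://' + nextProxy(),
    delayMs: i * gapMs,
  }));
}
```

Each planned entry can then be scheduled with `setTimeout(() => request({ url: job.url, proxy: job.proxy }, callback), job.delayMs)`, using the `request` module from the script above.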

Original article by Data5U proxy IP (无忧代理IP, http://www.data5u.com). Please credit the source when reposting.
