[node.js] A simple HTTP crawler

weixin_34323858

Original link: https://segmentfault.com/a/1190000004364402

I built a simple HTTP crawler with Node and am sharing it here for others who, like me, are learning node.js.

Preparation
1. node.js itself
2. The cheerio module
It parses HTML into a DOM-like structure and provides jQuery-style selectors.
Install it with npm install cheerio

Analyzing the page to crawl
I chose segmentfault's unanswered-questions page as the target.
(The URL is 'http://segmentfault.com/quest...' + the page number.)
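
Since each page's address is just the base URL with the page number appended, the list of pages to crawl can be generated up front. The sketch below uses the full base URL that appears in this post's crawler code; `pageUrls` is a hypothetical helper name, not something from the original.

```javascript
// Base URL as it appears in the crawler code in this post.
var seg_url = 'http://segmentfault.com/questions/unanswered?page=';

// Hypothetical helper: build the URLs for pages first..last (inclusive).
// Returns an empty array when first > last.
function pageUrls(base, first, last) {
    var urls = [];
    for (var page = first; page <= last; page++) {
        urls.push(base + page);
    }
    return urls;
}

// Print the addresses of the first three pages.
console.log(pageUrls(seg_url, 1, 3));
```

Each resulting string can be passed to http.get in place of seg_url+i.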

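We want the title and code of each unanswered question. Inspecting the page shows that everything we need sits in h2.title under div.summary, so once we have the HTML we can extract the information from there.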
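
The "code" is the numeric ID at the end of each question link; the crawler in this post gets it by splitting the link's href on 'q/'. As a slightly more defensive variant, the sketch below uses a regular expression instead and returns null when there is no usable href. `extractCode` is a hypothetical helper, and the '/q/<digits>' link shape is an assumption inferred from that split and from the sample output.

```javascript
// Hypothetical helper: pull the question code out of a link's href.
// Assumes hrefs look like '/q/1010000004337912'.
function extractCode(href) {
    if (typeof href !== 'string') {
        return null; // the <a> element had no href attribute
    }
    var match = /\/q\/(\d+)/.exec(href);
    return match ? match[1] : null; // null when the href is not a question link
}

console.log(extractCode('/q/1010000004337912')); // 1010000004337912
console.log(extractCode(undefined));             // null
```

Returning null instead of throwing lets the caller skip a malformed entry and keep processing the rest of the page.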

Implementation

var http = require('http');
var cheerio = require('cheerio');
var seg_url = 'http://segmentfault.com/questions/unanswered?page=';

/*
 * Filters the HTML and extracts the information we need.
 * Uses the cheerio module.
 */
function filter_html(html){
    var $ = cheerio.load(html);
    var questions = $('div.summary');
    var questionDatas = [];
    questions.each(function(){
        var summary = $(this);
        var question = summary.find('h2.title>a');
        var href = question.attr('href');
        if (!href) {
            return; // no question link in this block; move on to the next one
        }
        var questionData = {
            title:question.text(),
            code:href.split('q/')[1]
        };
        questionDatas.push(questionData);
    });
    questionDatas.forEach(function(item){
        console.log('title:'+item.title+' '+'code:'+item.code);
    });
}

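var i = 1; // change i to crawl other pages of unanswered questions
/*
 * http.get takes two arguments. The first is either a URL string (parsed
 * into an options object automatically) or an object with hostname, port,
 * headers and so on. The second is a callback that handles the response.
 * It is almost identical to http.request; the differences are that the
 * method is fixed to GET and that req.end() is called automatically.
 */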
http.get(seg_url+i,function(res){
    var html = '';
    res.setEncoding('utf8'); // decode as UTF-8 so characters split across chunks are not garbled
    res.on('data',function(data){
        html += data;
    });
    res.on('end',function(){
        filter_html(html);
    });
}).on('error',function(e){
    console.log('Error:'+e.message);
});
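
A note on html += data: when the response chunks arrive as Buffers, each chunk is converted to a string on its own, so a multi-byte UTF-8 character (such as the Chinese in these question titles) that straddles two chunks can come out garbled. Calling res.setEncoding('utf8') on the response makes the stream decode across chunk boundaries. The sketch below reproduces the effect offline with Node's built-in string_decoder module; `naiveJoin` and `safeJoin` are hypothetical names used only for illustration, and the chunk split is contrived.

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

// Mimics `html += data` when every chunk is a Buffer: each chunk is
// decoded independently, so a split character turns into U+FFFD.
function naiveJoin(chunks) {
    var text = '';
    chunks.forEach(function (chunk) {
        text += chunk; // implicit chunk.toString('utf8')
    });
    return text;
}

// Decodes across chunk boundaries: StringDecoder holds back incomplete
// byte sequences until the rest of the character arrives.
function safeJoin(chunks) {
    var decoder = new StringDecoder('utf8');
    var text = '';
    chunks.forEach(function (chunk) {
        text += decoder.write(chunk);
    });
    return text + decoder.end();
}

// Contrived input: '爬' is three bytes in UTF-8, so cutting after two bytes splits it.
var bytes = Buffer.from('爬虫', 'utf8');
var chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

console.log(naiveJoin(chunks) === '爬虫'); // false
console.log(safeJoin(chunks) === '爬虫');  // true
```

In the crawler itself, a single res.setEncoding('utf8') call before attaching the 'data' handler is enough; the explicit decoder is shown only to make the behaviour visible.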

Result
The output should look something like this:

title:LLVM 中 CreatePHI 时，报错 code:1010000004337912
title:求教ss-redir的iptables设置 code:1010000004337693
title:summernote 的pre问题 code:1010000004337464
title:Scheme解释器中正则序是怎么回事？ code:1010000004336759
title:webpack编译handlebars的问题 code:1010000004336625
title:大家觉得html5中的canvas怎么使用呢？ code:1010000004336370
title:移动版Safari不能自动播放mp3怎么办？ code:1010000004336299
title:无线传感器网络的AODV协议有哪些不足之处？ code:1010000004336203
title:Typecho无法发表文章 code:1010000004335581
title:sphinx 以mysql为数据源 建索引失败 code:1010000004334330
………………
