Web Programming Assignment 1: Crawler (Part 4)

Latest recommended article published 2024-07-09 15:58:38

谢怜的fafa

Tags: node.js, crawler

Article link: https://blog.csdn.net/qq_43702633/article/details/115827525

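I tried several ways of scraping article titles and was not happy with any of them. The title inside the head tag on Xueqiu (雪球网) pages is a mess and does not look like an article title at all. I decided to respect the facts: if an author did not give the post a title, just use: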

if (fetch.title == "") fetch.title = "无标题";

Result:
[screenshot]
Since the posts keep being updated, crawling again after a while will give more results.

Below is the crawling process for the Eastmoney (东方财富) website:

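// Assumes request, myIconv (iconv-lite) and myCheerio (cheerio) are already set up elsewhere in the script.
var source_name = "东方财富网";
var myEncoding = "utf-8";
var seedURL = 'https://www.eastmoney.com/';
var url_reg = /\/(\w{1})\/(\d{18})\.html/; // dot escaped so it only matches a literal "."
var seedURL_format = "$('a')";
request(seedURL, function(err, res, body) { // fetch the seed page
    // try {
    // convert the encoding with iconv
    var html = myIconv.decode(body, myEncoding);
    //console.log(html);
    // parse the html with cheerio
    var $ = myCheerio.load(html, { decodeEntities: true });
    // } catch (e) { console.log('Failed to read and decode the seed page: ' + e) };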

    var seedurl_news;

    try {
        seedurl_news = eval(seedURL_format);
        //console.log(seedurl_news);
    } catch (e) { console.log('Failed to locate the html block containing the url list: ' + e) };

    seedurl_news.each(function(i, e) { // iterate over every <a> link on the seed page
        var myURL = "";
        try {
            // build the absolute news url
            var href = "";
            href = $(e).attr("href");
            if (typeof(href) == "undefined") {  // some <a> tags have no href
                return true;
            }
            if (href.toLowerCase().indexOf('http://') >= 0 || href.toLowerCase().indexOf('https://') >= 0) myURL = href; // already an http:// or https:// url
            else if (href.startsWith('//')) myURL = 'http:' + href; // starts with //
            else myURL = seedURL.substr(0, seedURL.lastIndexOf('/')) + href; // anything else

        } catch (e) { console.log('Failed to extract a news link from the seed page: ' + e) }

        if (!url_reg.test(myURL)) return; // skip urls that do not match the news url pattern
        console.log(myURL);
        //newsGet(myURL); // fetch the news page
    });
});
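The three-way branching on href above can also be written with Node's built-in URL class, which resolves absolute, protocol-relative and relative links against a base in one call. Below is a minimal sketch of that idea, not what the crawler in this post actually runs. The helper names resolveLink and isNewsUrl are made up for illustration, and the article id in the sample links is invented.

```javascript
// Sketch only: resolveLink and isNewsUrl are hypothetical helper names.
const NEWS_URL_REG = /\/\w\/\d{18}\.html/; // same idea as url_reg, with the dot escaped

// Resolve an href found on the seed page into an absolute http(s) url.
// Returns null for missing, unparsable, or non-http(s) links.
function resolveLink(href, baseURL) {
    if (typeof href !== 'string' || href.trim() === '') return null;
    try {
        const resolved = new URL(href.trim(), baseURL);
        if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
        return resolved.href;
    } catch (e) {
        return null; // new URL() throws a TypeError on input it cannot parse
    }
}

// True only for resolved links that look like a news article url.
function isNewsUrl(url) {
    return url !== null && NEWS_URL_REG.test(url);
}

const seed = 'https://www.eastmoney.com/';
const hrefs = [
    '//finance.eastmoney.com/a/202104181888888888.html', // invented article id
    '/a/202104181888888888.html',
    'javascript:void(0)',
    undefined
];
const links = hrefs.map(h => resolveLink(h, seed)).filter(isNewsUrl);
console.log(links);
```

One difference from the code above: a link starting with // inherits the scheme of the seed page (https here) instead of always getting http:.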

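For now I do not fetch the news pages and only print the links to the command line. If the links open without problems, I can move on to the next step:
[screenshot]
Scraping the title: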

fetch.title = $('div[class="newsContent"]').children("h1").text();
if (fetch.title == "") fetch.title = "无标题";

[screenshot]
Scraping the content:

fetch.content = $('div[id="ContentBody"]').children("p").text();

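Right after I finished the code it suddenly stopped working. This error apparently can be ignored; whether it shows up seems to be pure luck, although it does seem to appear more often between 10 and 11 pm.
[screenshot]
I will run it again when I have time. Eastmoney is more or less done.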

------------------------------ Divider: evening of April 18 ------------------------------

The content crawled from Eastmoney:
[screenshot]
Crawling the news from 中国金融网 (financeun.com) in the same way:

var source_name = "中国金融网";
var myEncoding = "utf-8";
var seedURL = 'http://www.financeun.com/';
var url_reg = /\/newsDetail\/(\d{5})\.shtml/; // dot escaped so it only matches a literal "."
var seedURL_format = "$('a')";
var fetch = {};
fetch.title = $('div[class="news-details-data"]').children("h2").text();
if (fetch.title == "") fetch.title = "无标题";
fetch.content = $('div[class="txt"]').text();
fetch.publish_date = $('div[class="news-details-data"]').children("h3").text().split(" ")[2]; //+ $('div[class="news-details-data"]').children("h3").text().split(" ")[3];
fetch.url = myURL;
fetch.source_name = $('div[class="news-details-data"]').children("h3").text().split(" ")[1];
fetch.source_encoding = myEncoding; // encoding
fetch.crawltime = new Date();
fetch.keywords = $('meta[name="keywords"]').attr("content");
fetch.author = "未知";
fetch.description = $('meta[name="description"]').attr("content");
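Picking the date out of the h3 text with split(" ")[2] depends on the exact spacing of that line. A regular expression that looks for a date-shaped token is a bit more forgiving. The sketch below assumes the date appears as year, month, day separated by -, /, . or 年/月; extractDate is a made-up helper name and the sample strings are invented, since the real h3 text is not shown here.

```javascript
// Sketch only: extractDate is a hypothetical helper, and the sample strings
// below are invented; the real h3 text on the site may look different.
function extractDate(text) {
    if (typeof text !== 'string') return null;
    const m = text.match(/(\d{4})[-\/.年](\d{1,2})[-\/.月](\d{1,2})/);
    if (!m) return null; // no date-shaped token found
    const pad = s => s.padStart(2, '0');
    return `${m[1]}-${pad(m[2])}-${pad(m[3])}`; // normalized to YYYY-MM-DD
}

console.log(extractDate(' 中国金融网 2021-04-18 21:30'));
console.log(extractDate('2021年4月8日'));
console.log(extractDate(undefined));
```

Returning null instead of undefined makes it easy to see which pages had no usable date.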

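中国金融网 is really strange: every field comes out as undefined, and the news links cannot be opened at all. I cannot click through to the pages either (I suspect I got blocked for testing too many times, because the site was reachable at first).
[screenshot]
Time to switch to another site...
I tried Zhihu, but it requires logging in...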

After trying to add a cookie:

-------------------------- Divider: morning of April 19 --------------------------
I did not give up and tried 中国金融网 again, and my IP got blocked again...
And nothing was crawled.
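If the block really is caused by sending too many requests in a short time (that is only my guess), one common precaution is to visit the pages one at a time with a pause in between. Here is a minimal sketch of that idea. crawlSequentially, fetchPage and delayMs are made-up names, and a stand-in fetcher is used so the snippet runs without touching the network. There is no guarantee this is enough to avoid being blocked.

```javascript
// Sketch only: crawlSequentially and its parameters are hypothetical names.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Visit urls one at a time, waiting delayMs between requests.
// fetchPage is any function that returns a promise for the page body.
async function crawlSequentially(urls, fetchPage, delayMs) {
    const results = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) await sleep(delayMs); // pause before every request except the first
        try {
            results.push({ url: urls[i], body: await fetchPage(urls[i]) });
        } catch (e) {
            results.push({ url: urls[i], error: String(e) }); // record the failure and keep going
        }
    }
    return results;
}

// Stand-in fetcher: pretends that urls ending in /bad are blocked.
const fakeFetch = async url => {
    if (url.endsWith('/bad')) throw new Error('blocked');
    return 'body of ' + url;
};
const crawlDone = crawlSequentially(['http://example.com/a', 'http://example.com/bad'], fakeFetch, 50);
crawlDone.then(results => console.log(results));
```

In the real crawler fetchPage would wrap the request call, and delayMs would probably need to be much larger than 50.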

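I did get some data, but the titles are still missing...
[screenshot]
小白财经网: http://www.xiaobaicj.cn/
This one can be crawled normally, but the meta tags of its pages do not include keywords or a description.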

var source_name = "小白财经";
var myEncoding = "utf-8";
var seedURL = 'http://www.xiaobaicj.cn/';
var url_reg = /\/(\w{1})(\d{5})\.html/; // dot escaped so it only matches a literal "."
var seedURL_format = "$('a')";
fetch.title = $('h1[class="entry-title"]').text();
if (fetch.title == "") fetch.title = "无标题";
fetch.content = $('div[class="entry-content"]').text();
fetch.publish_date = $('span[class="entry-date"]').text();
fetch.url = myURL;
fetch.source_name = source_name;
fetch.source_encoding = myEncoding; // encoding
fetch.crawltime = new Date();
fetch.keywords = $('meta[name="keywords"]').attr("content");
if (!fetch.keywords) fetch.keywords = "未知"; // attr() returns undefined when the meta tag is missing
fetch.author = $('span[class="entry-author"]').text();
if (fetch.author == "") fetch.author = "未知";
fetch.description = $('meta[name="description"]').attr("content");
if (!fetch.description) fetch.description = "未知"; // same undefined case as keywords
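The fallback checks above all follow the same pattern, with one catch: cheerio's text() gives an empty string when nothing matches, while attr() gives undefined, so comparing against "" alone does not cover a missing meta tag. A small helper can handle both cases plus whitespace-only values. This is just a sketch; withDefault is a made-up name and the sample title is invented.

```javascript
// Sketch only: withDefault is a hypothetical helper name.
function withDefault(value, fallback) {
    if (typeof value !== 'string') return fallback; // undefined, null, etc.
    const trimmed = value.trim();
    return trimmed === '' ? fallback : trimmed; // empty or whitespace-only counts as missing
}

console.log(withDefault(undefined, '未知'));
console.log(withDefault('  \n ', '无标题'));
console.log(withDefault(' 财经要闻 ', '无标题'));
```

With it, a field could be filled as fetch.keywords = withDefault($('meta[name="keywords"]').attr("content"), "未知").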

[screenshot]
That wraps up the data preparation. Next I will start building the website.
