Web Programming Course: A Node.js Crawler Project

PeterXu1209

Category: Web Programming. Tags: mysql, database, crawler

Article link: https://blog.csdn.net/qq_29438345/article/details/116176597

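Preface:

This semester we started learning web programming, and the midterm assignment is to build a crawler for a mainstream news site. Honestly, I was shocked at first: for someone like me who knew nothing about JavaScript, writing a crawler seemed next to impossible. As the DDL drew closer day by day, I grew more and more nervous (~~deadlines really are the number one productive force~~). Fortunately, our teacher gave us plenty of guidance, along with a sample crawler for China News (中国新闻网) that I could use as a template.

This post only records how an ordinary first-year student explored and learned Node.js crawling. I cannot claim a solid grasp of the more specialized and in-depth topics, so please bear with any mistakes.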

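Since the teacher published the most complete version of the requirements on April 26, 2021, the latest requirements are used below.

First, the requirements for the crawler part:

Analyze and design a crawler for at least one target site (any topic; the China News site from the sample may not be used directly).
Crawl at least 100 records (each with at least 3 fields: title, content, and time) and store them in a database.
Submit the source code; crawlers covering several sites earn extra credit as appropriate.

Following the suggestion in the first PPT, I chose NetEase News (网易新闻) as the target of this midterm crawler. The code analysis follows.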

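I. The crawler

part1_Preparation:

First, determine the name, encoding, and domain of the target site. Since the two sites are very similar in this respect, I won't go into detail.

part2_Deciding what to crawl (page content and elements):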

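(Black background: this assignment, the crawler for NetEase News.)

(White background: the crawler provided by the teacher, for China News.)

Because each site formats its title, body, and keywords differently, the selectors must be adjusted for the site being crawled. And since every page presents its elements in its own way, we need to inspect the page source to work out the format.

Take the article body as an example. On China News it is marked up as follows:

Its body begins with the element marked "left_zw", which is what we need to change. On NetEase News the article body is marked up as: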

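So we need to change the corresponding selector in the code to "post_body". Other formats need changing as well, such as the regular expression for article URLs. A China News article URL ends in something like "2021/04-29/9467059.shtml": four digits, a slash, two digits, a hyphen, two digits, a slash, then seven digits. A NetEase News article URL ends in something like "G8O53AEJ051481US.html", a 16-character ID that mixes digits and letters in no fixed positions. The \d used in the original crawler therefore no longer works, and \w is needed instead. The / at the start and end delimit the regular expression literal, and the dot before html is escaped so that it matches only a literal ".". The modified result is:

var url_reg = /\/(\w{16})\.html/;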
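To make the difference between the two URL patterns concrete, here is a small check, offered as a sketch. The China News pattern is reconstructed from the description above rather than copied from the teacher's crawler, and both full URLs are made-up examples built around the two fragments quoted in the text.

```javascript
// Reconstructed from the description: 4 digits / 2 digits - 2 digits / 7 digits
var chinanews_reg = /\/(\d{4})\/(\d{2})-(\d{2})\/(\d{7})\.shtml/;
// NetEase News: a 16-character ID made of letters and digits
var netease_reg = /\/(\w{16})\.html/;

// Hypothetical full URLs built around the fragments quoted in the text
var chinanewsUrl = 'https://www.chinanews.com/gn/2021/04-29/9467059.shtml';
var neteaseUrl = 'https://www.163.com/news/article/G8O53AEJ051481US.html';

// Return the article ID, or null when the url is not an article link
function articleId(url) {
    var m = netease_reg.exec(url);
    return m ? m[1] : null;
}

console.log(chinanews_reg.test(chinanewsUrl)); // the old pattern fits the old site
console.log(netease_reg.test(chinanewsUrl));   // but the new pattern does not
console.log(articleId(neteaseUrl));
console.log(articleId('https://news.163.com/'));
```

The capture group in parentheses is what makes it possible to pull out the article ID, which could later serve as a file name or a deduplication key.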

The other places that need matching changes are not covered here. The code up to line 13 is:

var source_name="网易新闻";
var myEncoding="utf-8";
var seedURL ='https://news.163.com/';

var seedURL_format = "$('a')";// correct
var keywords_format = " $('meta[name=\"keywords\"]').eq(0).attr(\"content\")";// correct
var title_format = "$('title').text()";// correct
var date_format = " $('meta[property=\"article:published_time\"]').eq(0).attr(\"content\")";// correct
//var author_format = "$('#editor_baidu').text()";
var content_format = "$('.post_body').text()";// correct
//var desc_format = " $('meta[name=\"description\"]').eq(0).attr(\"content\")";
//var source_format = "$('#source_baidu').text()";
var url_reg = /\/(\w{16})\.html/;// correct; the dot is escaped to match a literal "."

part3_The main part of the crawler

Next we pull in the packages the crawler needs:

var fs = require('fs');
var myRequest = require('request')
var myCheerio = require('cheerio')
var myIconv = require('iconv-lite')

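Apart from fs, which is built into Node.js, each of these packages must be installed by running npm install xxx in a terminal before use. From some searching we learn:

The request package is a toolkit for making HTTP requests from the server side.

The cheerio package is a server-side counterpart of jQuery; it only leaves out things such as jQuery's effects and request features.

Then we need to fake a browser header so that the site does not block our crawler's requests: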

var headers = {
    'User-Agent': 'Safari/14.0.3 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}

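(My machine runs macOS 10.15.7, so I changed the version in the string to 10_15_7.)

Next comes the request wrapper:

function request(url, callback)

This function takes two parameters. url is the address of the page and is required. callback is the function that runs once the response arrives; it receives the error, the response object, and the page source, and it is where the rest of the processing happens (I don't fully understand the details).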

// fetch the url asynchronously with the request module
function request(url, callback) {
    var options = {
        url: url,
        encoding: null,
        //proxy: 'http://x.x.x.x:8080',
        headers: headers,
        timeout: 10000 // milliseconds
    }
    myRequest(options, callback)
}
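The callback passed to request follows Node's error-first convention: the first argument is an error (or null), followed by the response and the body. As a sketch of how a callback might guard against failed requests before decoding, the helper below uses only Node built-ins. checkResponse is a made-up name, not part of the teacher's crawler, and the crawler itself decodes with iconv-lite rather than Buffer.toString.

```javascript
// Hypothetical helper: decide whether a response is usable before decoding it.
// err, res and body have the same meaning as in the request callback.
function checkResponse(err, res, body) {
    if (err) return { ok: false, reason: 'request failed: ' + err.message };
    if (!res || res.statusCode !== 200) {
        return { ok: false, reason: 'unexpected status: ' + (res ? res.statusCode : 'none') };
    }
    if (!Buffer.isBuffer(body)) return { ok: false, reason: 'body is not a Buffer' };
    // With encoding: null the body is a raw Buffer; utf-8 pages can be decoded directly
    return { ok: true, html: body.toString('utf8') };
}

// Simulated callback arguments, so the sketch runs without network access
var good = checkResponse(null, { statusCode: 200 }, Buffer.from('<title>新闻</title>'));
var bad = checkResponse(new Error('ETIMEDOUT'), undefined, undefined);
console.log(good.ok, good.html);
console.log(bad.ok, bad.reason);
```

Without such a guard, a timed-out request would hand an undefined body to the decoder and the crawler would stop with an exception.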

request(seedURL, function(err, res, body) { // read the seed page
    // try {
    // convert the encoding with iconv
    var html = myIconv.decode(body, myEncoding);
    //console.log(html);
    // parse the html with cheerio
    var $ = myCheerio.load(html, { decodeEntities: true });
    // } catch (e) { console.log('Failed to read and decode the seed page: ' + e) };

    var seedurl_news;

    try {
        seedurl_news = eval(seedURL_format);
        //console.log(seedurl_news);
    } catch (e) { console.log('Failed to locate the html block holding the url list: ' + e) };

    seedurl_news.each(function(i, e) { // iterate over every a link on the seed page
        var myURL = "";
        try {
            // get the url of the individual news article
            var href = "";
            href = $(e).attr("href");
            if (typeof(href) == "undefined") {  // some links have no href
                return true;
            }
            if (href.toLowerCase().indexOf('http://') >= 0 || href.toLowerCase().indexOf('https://') >= 0) myURL = href; // starts with http:// or https://
            else if (href.startsWith('//')) myURL = 'http:' + href; // starts with //
            else myURL = seedURL.substr(0, seedURL.lastIndexOf('/') + 1) + href; // everything else

        } catch (e) { console.log('Failed to identify a news link on the seed page: ' + e) }

        if (!url_reg.test(myURL)) return; // check against the news url regular expression
        //console.log(myURL);
        newsGet(myURL); // read the news page
    });
});
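The three branches in the link handling (absolute, protocol-relative, and relative links) can also be covered by Node's built-in URL class, which resolves an href against a base address. This is an alternative sketch, not the teacher's approach; resolveLink is a made-up name and the sample hrefs are illustrative.

```javascript
// Hypothetical helper: resolve an href from the seed page to an absolute http(s) URL.
function resolveLink(href, base) {
    if (typeof href !== 'string' || href === '') return null;
    try {
        var u = new URL(href, base); // handles absolute, //host/path and relative forms
        // skip javascript:, mailto: and other non-web schemes
        if (u.protocol !== 'http:' && u.protocol !== 'https:') return null;
        return u.href;
    } catch (e) {
        return null; // href could not be parsed as a URL
    }
}

var base = 'https://news.163.com/';
console.log(resolveLink('https://www.163.com/news/article/G8O53AEJ051481US.html', base));
console.log(resolveLink('//www.163.com/news/article/G8O53AEJ051481US.html', base));
console.log(resolveLink('special/G8O53AEJ051481US.html', base));
console.log(resolveLink('javascript:void(0)', base));
```

One difference from the hand-written branches: a protocol-relative link inherits the scheme of the base address (https here) instead of always getting http:.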

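The only remaining major piece of the crawler is the newsGet function. It contains several blocks, such as keywords, title, and content. Note that these blocks must match the formats defined at the top of the crawler: whatever is defined there must be handled here, and whatever is not defined there must be commented out here.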

function newsGet(myURL) { // read a news page
    request(myURL, function(err, res, body) { // read a news page
        //try {
        var html_news = myIconv.decode(body, myEncoding); // convert the encoding with iconv
        //console.log(html_news);
        // parse html_news with cheerio
        var $ = myCheerio.load(html_news, { decodeEntities: true });
        myhtml = html_news;
        //} catch (e) {    console.log('Failed to read and decode the news page: ' + e);};

        console.log("Decoded and read successfully: " + myURL);
        // evaluate the format strings and build a json object to write to a file or the database
        var fetch = {};
        fetch.title = "";
        fetch.content = "";
        fetch.publish_date = (new Date()).toFormat("YYYY-MM-DD"); // toFormat comes from date-utils
        //fetch.html = myhtml;
        fetch.url = myURL;
        fetch.source_name = source_name;
        fetch.source_encoding = myEncoding; // encoding
        fetch.crawltime = new Date();

        if (keywords_format == "") fetch.keywords = source_name; // eval(keywords_format);  // fall back to source_name when there are no keywords
        else fetch.keywords = eval(keywords_format);

        if (title_format == "") fetch.title = ""
        else fetch.title = eval(title_format); // title

        //if (date_format != "") fetch.publish_date = eval(date_format); // publication date
        //console.log('date: ' + fetch.publish_date);
        //console.log(myURL);
        //fetch.publish_date = url_reg.exec(fetch.publish_date)[0];
        //fetch.publish_date = fetch.publish_date.replace('年', '-')
        //fetch.publish_date = fetch.publish_date.replace('月', '-')
        //fetch.publish_date = fetch.publish_date.replace('日', '')
        //fetch.publish_date = new Date(fetch.publish_date).toFormat("YYYY-MM-DD");

        //if (author_format == "") fetch.author = source_name; //eval(author_format);  // author
        //else fetch.author = eval(author_format);

        if (content_format == "") fetch.content = "";
        else fetch.content = eval(content_format).replace("\r\n" + fetch.author, ""); // content; whether to strip the author line is up to you

        //if (source_format == "") fetch.source = fetch.source_name;
        //else fetch.source = eval(source_format).replace("\r\n", ""); // source

        //if (desc_format == "") fetch.desc = fetch.title;
        //else fetch.desc = eval(desc_format).replace("\r\n", ""); // summary

        // var filename = source_name + "_" + (new Date()).toFormat("YYYY-MM-DD") +
        //     "_" + myURL.substr(myURL.lastIndexOf('/') + 1) + ".json";
        // store as json
        // fs.writeFileSync(filename, JSON.stringify(fetch));

        var fetchAddSql = 'INSERT INTO fetches(url,source_name,source_encoding,title,keywords,author,publish_date,crawltime,content) VALUES(?,?,?,?,?,?,?,?,?)';
        var fetchAddSql_Params = [fetch.url, fetch.source_name, fetch.source_encoding,
            fetch.title, fetch.keywords, fetch.author, fetch.publish_date,
            fetch.crawltime.toFormat("YYYY-MM-DD HH24:MI:SS"), fetch.content
        ];

        // run the sql; the url column of the fetches table is unique, so duplicate urls are not written again
        mysql.query(fetchAddSql, fetchAddSql_Params, function(qerr, vals, fields) {
            if (qerr) {
                console.log(qerr);
            }
        }); // write to mysql
    });
}

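(The code above stores the crawled content in MySQL. To save JSON files into the folder containing newcrawler instead, comment out the last two statements and uncomment the JSON section above.)

A normal crawl looks like this:

Next comes the MySQL part, which means requiring a few more packages (date-utils, the local ./mysql.js, and node-schedule, as listed in the complete code further down). The other code changes are as shown above.

We also need to create mysql.js in the same folder and start mysql from the terminal (this time I yet again ran into mysql refusing to start, and in a fit of anger I simply re-initialized it). Then create a new database. Because I had already created a database called crawler in class and worried it might cause problems, I created a new one, mycrawler, and reused the teacher's fetches.sql for its table. The database was created without trouble, and together with the code above the crawled content can be written into it.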
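Note that in the newsGet code the block that reads date_format is commented out, so publish_date ends up holding the crawl date. If the real publication time is wanted, the value read from the page has to be normalized first. A rough sketch follows, assuming the meta tag holds something like an ISO timestamp or a Chinese-style date (I have not verified the exact format NetEase News uses); toYMD is a made-up helper.

```javascript
// Hypothetical helper: pull a YYYY-MM-DD date out of a loosely formatted string.
function toYMD(text) {
    if (typeof text !== 'string') return null;
    // year, month, day separated by -, / or the characters 年 and 月
    var m = /(\d{4})[-\/年](\d{1,2})[-\/月](\d{1,2})/.exec(text);
    if (!m) return null;
    var pad = function (s) { return s.length < 2 ? '0' + s : s; };
    return m[1] + '-' + pad(m[2]) + '-' + pad(m[3]);
}

console.log(toYMD('2021-04-29T10:15:00+08:00')); // 2021-04-29
console.log(toYMD('2021年4月9日 08:30'));          // 2021-04-09
console.log(toYMD('no date here'));               // null
```

Returning null for unparseable input lets the caller fall back to the crawl date, which is what the current code always does.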
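The teacher's mysql.js is not reproduced in this post. Judging only from how the crawler calls mysql.query(sql, params, callback), a module of that shape might look roughly like the sketch below. It is a guess, not the actual file: makeQuery is a made-up name, and a fake pool stands in for a real connection pool so that the example runs without a database.

```javascript
// Sketch: build a query(sql, params, callback) function on top of a connection pool.
// The pool only needs getConnection(cb); connections need query() and release().
function makeQuery(pool) {
    return function query(sql, params, callback) {
        pool.getConnection(function (err, conn) {
            if (err) return callback(err, null, null);
            conn.query(sql, params, function (qerr, vals, fields) {
                conn.release(); // always hand the connection back to the pool
                callback(qerr, vals, fields);
            });
        });
    };
}

// A fake pool that records what it was asked to run (for demonstration only)
var log = [];
var fakePool = {
    getConnection: function (cb) {
        cb(null, {
            query: function (sql, params, done) {
                log.push([sql, params]);
                done(null, { affectedRows: 1 }, undefined);
            },
            release: function () { log.push('released'); }
        });
    }
};

var query = makeQuery(fakePool);
var result = null;
query('INSERT INTO fetches(url) VALUES(?)', ['https://news.163.com/'], function (qerr, vals) {
    result = qerr ? qerr : vals;
});
console.log(result, log.length);
```

With the real mysql package, the pool would come from mysql.createPool with the host, user, password, and database (mycrawler here), and the module would export the resulting query function.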

That completes the core of the crawler. Here is the complete crawler code (Talk is cheap. Show me the code!):

var source_name="网易新闻";
var myEncoding="utf-8";
var seedURL ='https://news.163.com/';

var seedURL_format = "$('a')";// correct
var keywords_format = " $('meta[name=\"keywords\"]').eq(0).attr(\"content\")";// correct
var title_format = "$('title').text()";// correct
var date_format = " $('meta[property=\"article:published_time\"]').eq(0).attr(\"content\")";// correct
//var author_format = "$('#editor_baidu').text()";
var content_format = "$('.post_body').text()";// correct
//var desc_format = " $('meta[name=\"description\"]').eq(0).attr(\"content\")";
//var source_format = "$('#source_baidu').text()";
var url_reg = /\/(\w{16})\.html/;// correct; the dot is escaped to match a literal "."

var fs = require('fs');
var myRequest = require('request')
var myCheerio = require('cheerio')
var myIconv = require('iconv-lite')
require('date-utils');
var mysql = require('./mysql.js');
var schedule = require('node-schedule');

var headers = {
    'User-Agent': 'Safari/14.0.3 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}

// fetch the url asynchronously with the request module
function request(url, callback) {
    var options = {
        url: url,
        encoding: null,
        //proxy: 'http://x.x.x.x:8080',
        headers: headers,
        timeout: 10000 // milliseconds
    }
    myRequest(options, callback)
}

request(seedURL, function(err, res, body) { // read the seed page
    // try {
    // convert the encoding with iconv
    var html = myIconv.decode(body, myEncoding);
    //console.log(html);
    // parse the html with cheerio
    var $ = myCheerio.load(html, { decodeEntities: true });
    // } catch (e) { console.log('Failed to read and decode the seed page: ' + e) };

    var seedurl_news;

    try {
        seedurl_news = eval(seedURL_format);
        //console.log(seedurl_news);
    } catch (e) { console.log('Failed to locate the html block holding the url list: ' + e) };

    seedurl_news.each(function(i, e) { // iterate over every a link on the seed page
        var myURL = "";
        try {
            // get the url of the individual news article
            var href = "";
            href = $(e).attr("href");
            if (typeof(href) == "undefined") {  // some links have no href
                return true;
            }
            if (href.toLowerCase().indexOf('http://') >= 0 || href.toLowerCase().indexOf('https://') >= 0) myURL = href; // starts with http:// or https://
            else if (href.startsWith('//')) myURL = 'http:' + href; // starts with //
            else myURL = seedURL.substr(0, seedURL.lastIndexOf('/') + 1) + href; // everything else

        } catch (e) { console.log('Failed to identify a news link on the seed page: ' + e) }

        if (!url_reg.test(myURL)) return; // check against the news url regular expression
        //console.log(myURL);
        newsGet(myURL); // read the news page
    });
});

function newsGet(myURL) { // read a news page
    request(myURL, function(err, res, body) { // read a news page
        //try {
        var html_news = myIconv.decode(body, myEncoding); // convert the encoding with iconv
        //console.log(html_news);
        // parse html_news with cheerio
        var $ = myCheerio.load(html_news, { decodeEntities: true });
        myhtml = html_news;
        //} catch (e) {    console.log('Failed to read and decode the news page: ' + e);};

        console.log("Decoded and read successfully: " + myURL);
        // evaluate the format strings and build a json object to write to a file or the database
        var fetch = {};
        fetch.title = "";
        fetch.content = "";
        fetch.publish_date = (new Date()).toFormat("YYYY-MM-DD"); // toFormat comes from date-utils
        //fetch.html = myhtml;
        fetch.url = myURL;
        fetch.source_name = source_name;
        fetch.source_encoding = myEncoding; // encoding
        fetch.crawltime = new Date();

        if (keywords_format == "") fetch.keywords = source_name; // eval(keywords_format);  // fall back to source_name when there are no keywords
        else fetch.keywords = eval(keywords_format);

        if (title_format == "") fetch.title = ""
        else fetch.title = eval(title_format); // title

        //if (date_format != "") fetch.publish_date = eval(date_format); // publication date
        //console.log('date: ' + fetch.publish_date);
        //console.log(myURL);
        //fetch.publish_date = url_reg.exec(fetch.publish_date)[0];
        //fetch.publish_date = fetch.publish_date.replace('年', '-')
        //fetch.publish_date = fetch.publish_date.replace('月', '-')
        //fetch.publish_date = fetch.publish_date.replace('日', '')
        //fetch.publish_date = new Date(fetch.publish_date).toFormat("YYYY-MM-DD");

        //if (author_format == "") fetch.author = source_name; //eval(author_format);  // author
        //else fetch.author = eval(author_format);

        if (content_format == "") fetch.content = "";
        else fetch.content = eval(content_format).replace("\r\n" + fetch.author, ""); // content; whether to strip the author line is up to you

        //if (source_format == "") fetch.source = fetch.source_name;
        //else fetch.source = eval(source_format).replace("\r\n", ""); // source

        //if (desc_format == "") fetch.desc = fetch.title;
        //else fetch.desc = eval(desc_format).replace("\r\n", ""); // summary

        // var filename = source_name + "_" + (new Date()).toFormat("YYYY-MM-DD") +
        //     "_" + myURL.substr(myURL.lastIndexOf('/') + 1) + ".json";
        // store as json
        // fs.writeFileSync(filename, JSON.stringify(fetch));

        var fetchAddSql = 'INSERT INTO fetches(url,source_name,source_encoding,title,keywords,author,publish_date,crawltime,content) VALUES(?,?,?,?,?,?,?,?,?)';
        var fetchAddSql_Params = [fetch.url, fetch.source_name, fetch.source_encoding,
            fetch.title, fetch.keywords, fetch.author, fetch.publish_date,
            fetch.crawltime.toFormat("YYYY-MM-DD HH24:MI:SS"), fetch.content
        ];

        // run the sql; the url column of the fetches table is unique, so duplicate urls are not written again
        mysql.query(fetchAddSql, fetchAddSql_Params, function(qerr, vals, fields) {
            if (qerr) {
                console.log(qerr);
            }
        }); // write to mysql
    });
}

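II. The web page

Let's first look at the requirements for the web part:

1. Implement a search over the content or titles of the crawled data in the database, and display the results as a table on the front-end page.

2. Implement a time-based popularity analysis of the search term. For example, for the search "新冠" (COVID-19), show how many crawled articles per day contain the term. The presentation is free: text, a table, or a chart. (Optional)

3. Submit the source code; a clean and attractive page design earns extra credit as appropriate.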
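For the optional second requirement, the per-day counts can be computed either in SQL or in JavaScript after fetching the rows. Here is a minimal sketch of the JavaScript route; countByDay is a made-up name, and the sample rows are invented data shaped like the fetches table (publish_date, title, content).

```javascript
// Count, per publish_date, how many rows mention the keyword in title or content.
function countByDay(rows, keyword) {
    var counts = {};
    rows.forEach(function (row) {
        var text = (row.title || '') + (row.content || '');
        if (text.indexOf(keyword) === -1) return; // keyword not present, skip
        counts[row.publish_date] = (counts[row.publish_date] || 0) + 1;
    });
    // Return [{ day, count }] sorted by day so it can feed a table or a chart
    return Object.keys(counts).sort().map(function (day) {
        return { day: day, count: counts[day] };
    });
}

// Invented sample rows
var rows = [
    { publish_date: '2021-04-29', title: '新冠疫苗接种', content: '...' },
    { publish_date: '2021-04-28', title: '体育新闻', content: '今日无新冠相关' },
    { publish_date: '2021-04-29', title: '财经', content: '股市' },
    { publish_date: '2021-04-29', title: '国际', content: '新冠疫情通报' }
];
var heat = countByDay(rows, '新冠');
console.log(JSON.stringify(heat));
```

Keep in mind that with the crawler as written, publish_date holds the crawl date, so the counts reflect when articles were fetched rather than when they were published.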

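Continuing from the database part above: to view the data visually, I installed Navicat for MySQL, a database GUI tool.

The result is shown above. As for why the content column appears empty, I honestly don't know why it is not displayed in this view, but after switching to the grid view the content column is clearly visible. I'll treat it as a bug for now (though it may well be a feature).

Next is the core of this section: the web page.

The back end uses the 7.0.3.js provided by the teacher (code not repeated here). After running it in WebStorm, searches can be made from Safari. The front end is the page http://127.0.0.1:8080/7.03.html. Searching for "新冠" gives the following:

We can also display the query results as a table, for which we use the express scaffold to create a site skeleton. The express command would not run for me and kept reporting "command not found". After asking my lovely overachieving roommate, I learned that both express and express-generator need to be installed. With those two packages installed, I ran the command, which created a folder named "-e". Since the site has to read from the database, mysql.js must be copied into that folder, followed by running npm install mysql --save inside it to install the module we need. Finally, open search_site/routes/index.js and replace its contents with the index.js provided by the teacher: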

var express = require('express');
var router = express.Router();
var mysql = require('../mysql.js');

/* GET home page. */
router.get('/', function(req, res, next) {
    res.render('index', { title: 'Express' });
});

router.get('/process_get', function(request, response) {
    // SQL string and parameters
    var fetchSql = "select url,source_name,title,author,publish_date " +
        "from fetches where title like '%" + request.query.title + "%'";
    mysql.query(fetchSql, function(err, result, fields) {
        response.writeHead(200, {
            "Content-Type": "application/json"
        });
        response.write(JSON.stringify(result));
        response.end();
    });
});
module.exports = router;
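One thing worth noting about the route above: it splices request.query.title straight into the SQL string, so a title containing a quote character can break or alter the query. A common remedy is a ? placeholder with the value passed separately, the same style the crawler already uses for its INSERT. The sketch below only builds the SQL and the parameter list; buildTitleSearch is a made-up name, and whether the teacher's mysql.js accepts a parameter array in this route is an assumption based on the crawler's mysql.query(sql, params, callback) calls.

```javascript
// Hypothetical helper: build a parameterized title search instead of concatenating input.
function buildTitleSearch(title) {
    var keyword = String(title == null ? '' : title);
    // Escape LIKE wildcards (backslash is MySQL's default LIKE escape character)
    var escaped = keyword.replace(/[\\%_]/g, function (ch) { return '\\' + ch; });
    return {
        sql: 'select url,source_name,title,author,publish_date ' +
             'from fetches where title like ?',
        params: ['%' + escaped + '%']
    };
}

var q = buildTitleSearch("新冠");
console.log(q.sql);
console.log(q.params);
// Possible use inside the route, under the assumption stated above:
// mysql.query(q.sql, q.params, function (err, result, fields) { ... });
```

Because the user's text travels in params rather than in the SQL string, quotes in the search box are treated as data and never as part of the statement.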

With that, the preparation is complete, and we can now look at the actual result.

Start the connection from the terminal: