Web Crawler

Assignment Requirements

Core requirements:

1. Pick 3-5 representative news sites (e.g. Sina News or NetEase News; or authoritative vertical sites such as Xueqiu or East Money in finance, Tencent Sports or Hupu in sports) and build crawlers for them. Analyze each site's news pages and extract structured fields (encoding, title, author, time, keywords, abstract, content, source, etc.), storing them in a database.

2. Build a website offering per-field full-text search over the crawled content, with a time-based popularity analysis of the queried keywords.

Technical requirements:

1. The crawler must be implemented in Node.js.

2. The query site's back end must be implemented in Node.js, and the front end in HTML + JS (avoid front-end and back-end frameworks as much as possible).

The main content of the article follows.

Mysql.js

Goal: connect to MySQL from Node.js and operate on the database.

var mysql = require("mysql");
var pool = mysql.createPool({
    host: '127.0.0.1',
    user: 'root',
    password: 'root',
    database: 'crawler2'
});
var query = function(sql, sqlparam, callback) {
    pool.getConnection(function(err, conn) {
        if (err) {
            callback(err, null, null);
        } else {
            conn.query(sql, sqlparam, function(qerr, vals, fields) {
                conn.release(); // release the connection
                callback(qerr, vals, fields); // event-driven callback
            });
        }
    });
};
var query_noparam = function(sql, callback) {
    pool.getConnection(function(err, conn) {
        if (err) {
            callback(err, null, null);
        } else {
            conn.query(sql, function(qerr, vals, fields) {
                conn.release(); // release the connection
                callback(qerr, vals, fields); // event-driven callback
            });
        }
    });
};
exports.query = query;
exports.query_noparam = query_noparam;

Preparation

First, download Node.js and VSCode,
plus a heart that loves crawling (and being crawled).

Then the happy crawling can begin.

Example Walkthrough

1. Define the URL of the site to crawl

Simply put, pin down the address of the site you need. In some cases the domain also has to be adjusted:

The code is as follows:

var source_name = "中国新闻网";
var domain = 'http://www.chinanews.com/';
var myEncoding = "utf-8"; // prevents garbled characters
var seedURL = 'http://www.chinanews.com/';

2. Define how the news elements are read

var seedURL_format = "$('a')";
var keywords_format = " $('meta[name=\"keywords\"]').eq(0).attr(\"content\")";
var title_format = "$('title').text()";
var date_format = "$('#pubtime_baidu').text()";
var author_format = "$('#editor_baidu').text()";
var content_format = "$('.left_zw').text()";
var desc_format = " $('meta[name=\"description\"]').eq(0).attr(\"content\")";
var source_format = "$('#source_baidu').text()";
var url_reg = /\/(\d{4})\/(\d{2})-(\d{2})\/(\d{7})\.shtml/;


var regExp = /((\d{4}|\d{2})(\-|\/|\.)\d{1,2}\3\d{1,2})|(\d{4}年\d{1,2}月\d{1,2}日)/

Regular expressions
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract the substrings satisfying some condition.

Take the last two lines of code above: looking at the article URLs we need, it is easy to see that each address contains four numeric blocks of 4, 2, 2 and 7 digits respectively, which is why the regular expression also contains the numbers 4, 2, 2 and 7.
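Both patterns can be tried out directly in Node; the sample article path and date strings below are my own made-up illustrations of the 4-2-2-7 structure:

```javascript
// News-article URL pattern: /YYYY/MM-DD/NNNNNNN.shtml (4-2-2-7 digit blocks)
var url_reg = /\/(\d{4})\/(\d{2})-(\d{2})\/(\d{7})\.shtml/;
// Date pattern: 2021-04-25, 2021/04/25, 2021.04.25 or 2021年4月25日;
// the backreference \3 forces the same separator on both sides
var regExp = /((\d{4}|\d{2})(\-|\/|\.)\d{1,2}\3\d{1,2})|(\d{4}年\d{1,2}月\d{1,2}日)/;

var m = url_reg.exec('/2021/04-25/9462345.shtml'); // hypothetical article path
console.log(m[1], m[2], m[3], m[4]); // 2021 04 25 9462345

console.log(regExp.exec('2021-04-25')[0]);    // 2021-04-25
console.log(regExp.test('2021-04/25'));       // false: mismatched separators
console.log(regExp.exec('2021年4月25日')[0]); // 2021年4月25日
```

The capture groups of `url_reg` are exactly the year, month, day and article-id blocks, which is what `seedget` later uses to recognize real article links.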

3. Install the dependencies

var fs = require('fs');
var myRequest = require('request');
var myCheerio = require('cheerio');
var myIconv = require('iconv-lite');
require('date-utils');
var mysql = require('./mysql.js');

Note: these libraries need to be installed separately, so also run:

npm install request
npm install cheerio
npm install iconv-lite

cheerio is a page-scraping module for Node.js: a fast, flexible, lean implementation of core jQuery designed specifically for the server.

iconv-lite is an encoding-conversion module. Its main job here is to convert pages in whatever encoding into standard utf-8 (most sites nowadays already use utf-8, so honestly I don't feel it does that much).

date-utils is a date utility library that makes it easier to handle crawl timestamps.

Basically: whatever is missing, install it.
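A quick illustration of what "decoding" actually means, using only Node's built-in Buffer (no third-party module needed for utf-8; iconv-lite earns its keep on encodings Buffer does not handle, such as gbk):

```javascript
// An HTTP body arrives as raw bytes; it must be decoded with the
// page's actual charset before cheerio can parse it.
var bytes = Buffer.from('中国新闻网', 'utf-8'); // simulate a raw utf-8 response body
console.log(bytes.length);            // 15: each CJK character takes 3 bytes in utf-8
console.log(bytes.toString('utf-8')); // 中国新闻网
// Decoding with the wrong charset yields the familiar mojibake:
console.log(bytes.toString('latin1'));
```

This is why the crawler passes `encoding: null` to request: it wants the raw buffer so it can decode explicitly.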

4. Connect to the database and operate on it

Copying out... I mean, writing Mysql.js

  var mysql = require("mysql");
  var pool = mysql.createPool({
      host: '127.0.0.1',
      user: 'root',
      password: 'root',
      database: 'crawl'
  });
  var query = function(sql, sqlparam, callback) {
      pool.getConnection(function(err, conn) {
          if (err) {
              callback(err, null, null);
          } else {
              conn.query(sql, sqlparam, function(qerr, vals, fields) {
                  conn.release(); // release the connection
                  callback(qerr, vals, fields); // event-driven callback
              });
          }
      });
  };
  var query_noparam = function(sql, callback) {
      pool.getConnection(function(err, conn) {
          if (err) {
              callback(err, null, null);
          } else {
              conn.query(sql, function(qerr, vals, fields) {
                  conn.release(); // release the connection
                  callback(qerr, vals, fields); // event-driven callback
              });
          }
      });
  };
  exports.query = query;
  exports.query_noparam = query_noparam;

Copy this code into the corresponding folder to finish setting up the MySQL connection.

5. Source code: crawler.js

var fs = require('fs');
var myRequest = require('request');
var myCheerio = require('cheerio');
var myIconv = require('iconv-lite');
require('date-utils');
var mysql = require('./mysql.js');

var source_name = "中国新闻网";
var domain = 'http://www.chinanews.com/';
var myEncoding = "utf-8";
var seedURL = 'http://www.chinanews.com/';

var seedURL_format = "$('a')";
var keywords_format = " $('meta[name=\"keywords\"]').eq(0).attr(\"content\")";
var title_format = "$('title').text()";
var date_format = "$('#pubtime_baidu').text()";
var author_format = "$('#editor_baidu').text()";
var content_format = "$('.left_zw').text()";
var desc_format = " $('meta[name=\"description\"]').eq(0).attr(\"content\")";
var source_format = "$('#source_baidu').text()";
var url_reg = /\/(\d{4})\/(\d{2})-(\d{2})\/(\d{7})\.shtml/;


var regExp = /((\d{4}|\d{2})(\-|\/|\.)\d{1,2}\3\d{1,2})|(\d{4}年\d{1,2}月\d{1,2}日)/

//keep the site from blocking our crawler
var headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}

//fetch a url asynchronously via the request module
function request(url, callback) {
    var options = {
        url: url,
        encoding: null,
        //proxy: 'http://x.x.x.x:8080',
        headers: headers,
        timeout: 10000 // ms
    }
    myRequest(options, callback)
};

seedget();

function seedget() {
    request(seedURL, function(err, res, body) { //read the seed page
        // try {
        //convert the encoding with iconv
        var html = myIconv.decode(body, myEncoding);
        //console.log(html);
        //parse the html with cheerio
        var $ = myCheerio.load(html, { decodeEntities: true });
        // } catch (e) { console.log('error reading/transcoding the seed page: ' + e) };
        var seedurl_news;
        try {
            seedurl_news = eval(seedURL_format);
        } catch (e) { console.log('error locating the html block holding the url list: ' + e) };
        seedurl_news.each(function(i, e) { //iterate over every <a> link on the seed page
            var myURL = "";
            try {
                //derive the concrete news url

                var href = "";
                href = $(e).attr("href");
                if (href == undefined) return;
                if (href.toLowerCase().indexOf('http://') >= 0) myURL = href; //starts with http://
                else if (href.startsWith('//')) myURL = 'http:' + href; //protocol-relative, starts with //
                else myURL = seedURL.substr(0, seedURL.lastIndexOf('/') + 1) + href; //other (relative) cases

            } catch (e) { console.log('error extracting a news link from the seed page: ' + e) }

            if (!url_reg.test(myURL)) return; //check against the news-url regular expression
            //console.log(myURL);

            var fetch_url_Sql = 'select url from fetches where url=?';
            var fetch_url_Sql_Params = [myURL];
            mysql.query(fetch_url_Sql, fetch_url_Sql_Params, function(qerr, vals, fields) {
                //if (vals.length > 0) {
                    //console.log('URL duplicate!')
                //} else 
                newsGet(myURL); //read the news page
            });
        });
    });
};

function newsGet(myURL) { //read a news page
    request(myURL, function(err, res, body) { //fetch the news page
        //try {
        var html_news = myIconv.decode(body, myEncoding); //convert the encoding with iconv
        //console.log(html_news);
        //parse html_news with cheerio
        var $ = myCheerio.load(html_news, { decodeEntities: true });
        myhtml = html_news;
        //} catch (e) {    console.log('error reading/transcoding the news page: ' + e);};

        console.log("transcoded and read successfully: " + myURL);
        //dynamically eval the format strings and build a json object for the file or database
        var fetch = {};
        fetch.title = "";
        fetch.content = "";
        fetch.publish_date = (new Date()).toFormat("YYYY-MM-DD");
        //fetch.html = myhtml;
        fetch.url = myURL;
        fetch.source_name = source_name;
        fetch.source_encoding = myEncoding; //encoding
        fetch.crawltime = new Date();

        if (keywords_format == "") fetch.keywords = source_name; // eval(keywords_format);  //fall back to the source name when there are no keywords
        else fetch.keywords = eval(keywords_format);

        if (title_format == "") fetch.title = ""
        else fetch.title = eval(title_format); //title

        if (date_format != "") fetch.publish_date = eval(date_format); //publish date
        console.log('date: ' + fetch.publish_date);
        fetch.publish_date = regExp.exec(fetch.publish_date)[0];
        fetch.publish_date = fetch.publish_date.replace('年', '-')
        fetch.publish_date = fetch.publish_date.replace('月', '-')
        fetch.publish_date = fetch.publish_date.replace('日', '')
        fetch.publish_date = new Date(fetch.publish_date).toFormat("YYYY-MM-DD");

        if (author_format == "") fetch.author = source_name; //eval(author_format);  //author
        else fetch.author = eval(author_format);

        if (content_format == "") fetch.content = "";
        else fetch.content = eval(content_format).replace("\r\n" + fetch.author, ""); //content; whether to strip the author line is up to you

        if (source_format == "") fetch.source = fetch.source_name;
        else fetch.source = eval(source_format).replace("\r\n", ""); //source

        if (desc_format == "") fetch.desc = fetch.title;
        else fetch.desc = eval(desc_format).replace("\r\n", ""); //abstract

        // var filename = source_name + "_" + (new Date()).toFormat("YYYY-MM-DD") +
        //     "_" + myURL.substr(myURL.lastIndexOf('/') + 1) + ".json";
        // save as json
        // fs.writeFileSync(filename, JSON.stringify(fetch));

        var fetchAddSql = 'INSERT INTO fetches(url,source_name,source_encoding,title,' +
            'keywords,author,publish_date,crawltime,content) VALUES(?,?,?,?,?,?,?,?,?)';
        var fetchAddSql_Params = [fetch.url, fetch.source_name, fetch.source_encoding,
            fetch.title, fetch.keywords, fetch.author, fetch.publish_date,
            fetch.crawltime.toFormat("YYYY-MM-DD HH24:MI:SS"), fetch.content
        ];

        //execute the sql; the url column of the fetches table is unique, so duplicate urls are never written twice
        mysql.query(fetchAddSql, fetchAddSql_Params, function(qerr, vals, fields) {
            if (qerr) {
                console.log(qerr);
            }
        }); //write to mysql
    });
}
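The three-way link handling inside seedget (absolute http://, protocol-relative //, and relative hrefs) can be factored into a small standalone helper for testing; this is only a sketch, and the name resolveNewsUrl is my own:

```javascript
// Resolve an href found on the seed page into an absolute news URL,
// mirroring the three branches in seedget.
function resolveNewsUrl(seedURL, href) {
    if (!href) return "";
    if (href.toLowerCase().indexOf('http://') >= 0) return href;   // already absolute
    if (href.startsWith('//')) return 'http:' + href;              // protocol-relative
    return seedURL.substr(0, seedURL.lastIndexOf('/') + 1) + href; // relative to the seed
}

console.log(resolveNewsUrl('http://www.chinanews.com/', '//www.chinanews.com/gn/a.shtml'));
// http://www.chinanews.com/gn/a.shtml
console.log(resolveNewsUrl('http://www.chinanews.com/', '2021/04-25/9462345.shtml'));
// http://www.chinanews.com/2021/04-25/9462345.shtml
```

Resolved URLs that fail the url_reg test are then simply skipped, which is what filters navigation links out of the article links.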


The Website

With the data imported into MySQL, the next step is to set up the pages and the search √

1. HTML (front end)

HTML, short for HyperText Markup Language, is a markup language made up of a series of tags. Simply put, the basic structure of a site is built with HTML and then dressed up with CSS and JavaScript. But no, that's not happening here: getting a simple page built with plain HTML is already pretty good, and CSS will have to remain a nice thought (sigh).

As follows:

<!DOCTYPE html>
<html>

<body>
    <form action="http://127.0.0.1:8080/process_get" method="GET">
        <br> Title: <input type="text" name="title">
        <input type="submit" value="Submit">
    </form>
    <script>
    </script>
</body>

</html>

Pretty simple, right? Can't be helped; this is even the version the teacher fixed up in class. Please forgive a newbie's modest learning ability (flops over).

2. JavaScript (back end)

var http = require('http');
var fs = require('fs');
var url = require('url');
var mysql = require('./mysql.js');
http.createServer(function(request, response) {
    var pathname = url.parse(request.url).pathname;
    var params = url.parse(request.url, true).query;
    fs.readFile(pathname.substr(1), function(err, data) {
        response.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
        if ((params.title === undefined) && (data !== undefined))
            response.write(data.toString());
        else {
            response.write(JSON.stringify(params));
            var select_Sql = "select title,author,publish_date from fetches where title like ?";
            mysql.query(select_Sql, ['%' + params.title + '%'], function(qerr, vals, fields) {
                console.log(vals);
            });
        }
        response.end();
    });
}).listen(8080);
console.log('Server running at http://127.0.0.1:8080/');

A simple way of wiring the front end to the back end √
Run the js file first, then open the address (otherwise the connection is refused) to get the result.

The Actual Project

With my classmates' help, I finally managed to crawl the news pages. So tiring (flops over).

1.Sina

'use strict';
var cheerio = require('cheerio');
var request = require('request');
var fs = require('fs');
var iconv = require('iconv-lite');
var mysql = require("./mysql.js");
var Title = [];
var Time = [];
var Content = [];
var Url = [];

request({ url: "https://news.sina.com.cn/", encoding: null, headers: null }, function (err, res, body) {
    if (err || res.statusCode != 200) {
        console.error(err);
        console.error(res.statusCode);
        return;
    }
    let $ = cheerio.load(body);
    let cnt = 0;
    let ulArr = $("ul.list_14").eq(cnt);
    while (ulArr.text()) {
        let cnt1 = 0;
        let liArr = ulArr.children("li").eq(cnt1);
        while (liArr.text()) {
            let cnt2 = 0;
            let aArr = liArr.children("a").eq(cnt2);
            while (aArr.text()) {
                let urlstr = aArr.attr("href");
                let title = aArr.text();
                if (urlstr && title) {
                    request({ url: urlstr, encoding: null, headers: null, rejectUnauthorized: false }, function (err, res, body) {
                        if (err) {
                            console.error(err);
                            return;
                        }
                        let $ = cheerio.load(iconv.decode(body, 'utf-8'));
                        mysql.query('INSERT INTO myxinwen(Url, Title, Content, Time) VALUES(?, ?, ?, ?);', [
                            res.request.uri.href,
                            $("div.second-title").text(),
                            $("div.article").text(),
                            $("div.date-source").text()
                        ], function (qerr, vals, fields) {
                            if (qerr) { console.error(qerr); }
                            console.log(`Crawled ${$("div.second-title").text()}`);
                        });
                    });
                }
                cnt2++;
                aArr = liArr.children("a").eq(cnt2);
                urlstr = aArr.attr("href");
                title = aArr.text();
            }
            cnt1++;
            liArr = ulArr.children("li").eq(cnt1);
        }
        cnt++;
        ulArr = $("ul.list_14").eq(cnt);
    }
});

2. Penguin Sports (Tencent Sports):

'use strict';
var cheerio = require('cheerio');
var request = require('request');
var fs = require('fs');
var iconv = require('iconv-lite');
var mysql = require("./mysql.js");
var headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}
var Title = [];
var Content = [];
var Time = [];
var Url = [];

request({ url: "https://sports.qq.com/", encoding: null, headers: null }, function (err, res, body) {
    if (err || res.statusCode != 200) {
        console.error(err);
        console.error(res.statusCode);
        return;
    }
    let $ = cheerio.load(iconv.decode(body, 'gbk'));

    let ulArr = $("div.scr-news").children("ul");
    let cnt1 = 0;
    let liArr = ulArr.children("li").eq(cnt1);
    while (liArr.text()) {
        let urlstr = liArr.children("a").attr("href");
        let title = liArr.text();
        if (urlstr && title) {
            console.log(`Crawling ${title}, ${urlstr}`);
            request({ url: urlstr, encoding: null, headers: null }, function (err, res, body) {
                if (err || res.statusCode != 200) {
                    console.error(err);
                    console.error(res.statusCode);
                    return;
                }
                let $ = cheerio.load(iconv.decode(body, 'gbk'));

                mysql.query('INSERT INTO myxinwen(Url, Title, Content, Time) VALUES(?, ?, ?, ?);', [
                    res.request.uri.href,
                    $("div.LEFT").children("h1").text(),
                    $("p.one-p").text(),
                    ($("p.one-p").text().split("北京")[1] || "").split("日")[0]
                ], function(qerr, vals, fields) {
                    if(qerr) {
                        console.error(qerr);
                    }
                    console.log(`Crawled ${$("div.LEFT").children("h1").text()}`);
                });
            });
        }
        cnt1++;
        liArr = ulArr.children("li").eq(cnt1);
    }
});
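The Time field above is carved out of the body text with two consecutive splits, taking what falls between "北京" and the first following "日". With a made-up sample sentence, the intended extraction looks like this:

```javascript
// Take everything after "北京" and before the next "日";
// the sample sentence is made up for illustration.
var text = '据报道,北京时间4月25日晚,比赛正式开始。';
var time = (text.split("北京")[1] || "").split("日")[0];
console.log(time); // 时间4月25
```

The `|| ""` guard keeps the chain from throwing when the article text happens not to contain "北京" at all.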

3. NetEase:

'use strict';
var cheerio = require('cheerio');
var request = require('request');
var fs = require('fs');
var iconv = require('iconv-lite');
var mysql = require("./mysql.js");
var Title = [];
var Time = [];
//var Source = [];
var Content = [];
var Url = [];

request({ url: "https://news.163.com/", encoding: null, headers: null }, function (err, res, body) {
	if (err || res.statusCode != 200) {
		console.error(err);
		console.error(res.statusCode);
		return;
	}
	let $ = cheerio.load(body);

	let cnt = 0;
	let ulArr = $("ul.top_news_ul").eq(cnt);
	while (ulArr.text()) {
		let cnt1 = 0;
		let liArr = ulArr.children("li").eq(cnt1);
		while (liArr.text()) {
			let cnt2 = 0;
			let aArr = liArr.children("a").eq(cnt2);
			while (aArr.text()) {
				let urlstr = aArr.attr("href");
				let title = aArr.text();
				if (urlstr && title) {
					console.log(`Crawling ${title}`);
					request({ url: urlstr, encoding: null, headers: null }, function (err, res, body) {
						if (err || res.statusCode != 200) {
							console.error(err);
							console.error(res.statusCode);
							return;
						}
						let $ = cheerio.load(iconv.decode(body, 'utf-8'));
						mysql.query('INSERT INTO myxinwen(Url, Title, Content, Time) VALUES(?, ?, ?, ?);', [
							res.request.uri.href,
							$("h1.post_title").text(),
							$("div.post_body").text(),
							$("div.post_info").text().toString().split("来源:")[0]
						], function(qerr, vals, fields) {
							if(qerr) {
								console.error(qerr);
							}
							console.log(`Crawled ${$("h1.post_title").text()}`);
						});
					});
				}
				cnt2++;
				aArr = liArr.children("a").eq(cnt2);
				urlstr = aArr.attr("href");
				title = aArr.text();
			}
			cnt1++;
			liArr = ulArr.children("li").eq(cnt1);
		}
		cnt++;
		ulArr = $("ul.top_news_ul").eq(cnt);
	}
});

Crawl all three sites, and stuff everything into the database.

4. Start searching √

Now we can build the front and back ends and query by keyword.

First the back end, server.js:

const express = require("express");
const cheerio = require("cheerio");
const mysql = require("./mysql.js");
const fs = require("fs");

var server = express();

server.get('/', function(req, res) { //route for "/"
    res.end(fs.readFileSync("search.html"));
});

server.get('/search', function(req, res) {
    let kw = req.query.kw;
    let $ = cheerio.load(fs.readFileSync("search.html"));
    
    mysql.query("SELECT * FROM myxinwen WHERE Title LIKE ? OR Content LIKE ? OR Time LIKE ?", ['%' + kw + '%', '%' + kw + '%', '%' + kw + '%'], function(qerr, vals, fields) { //kw is short for keyword; ? placeholders keep it out of the SQL string
        if(qerr) {
            console.error(qerr);
            return;
        }
        for(let i in vals) {
            $("table").append(`<tr><td><a href="${vals[i].Url}">${vals[i].Title}</a></td><td>${vals[i].Content}</td><td>${vals[i].Time}</td></tr>`);
        }
        res.end($.html());
    });//where clause
});

server.listen(3000);
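One more caveat about LIKE searches: % and _ inside the user's keyword are themselves wildcards, so a search for "100%" would match everything starting with "100". A tiny escaping helper (a sketch; the name escapeLike is my own) fixes that:

```javascript
// Backslash-escape the LIKE metacharacters % and _ (and \ itself)
// before embedding a user keyword in a LIKE pattern.
function escapeLike(kw) {
    return kw.replace(/[\\%_]/g, function (m) { return '\\' + m; });
}

console.log(escapeLike('100%')); // 100\%
console.log(escapeLike('a_b'));  // a\_b
```

The escaped keyword would then be passed as `'%' + escapeLike(kw) + '%'` in a ? placeholder.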

Then the front end, search.html:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
</head>
<body>
    <form action="/search" method="GET">
        <br> Keyword: <input type="text" name="kw">
        <input type="submit" value="Submit">
    </form>
    <table border="1">
    </table>
</body>

</html>

Run the back end first, then open the front-end page at http://127.0.0.1:3000

Click Submit, and the results appear instantly.

Summary

Through this crawler exercise I picked up parts of HTML, JavaScript and other languages, though only at an entry level; more effort is needed from here on. It was mostly finished with the help of classmates and the teacher, and doing it all on my own is still some way off, but I will keep at it, having set foot on the one-way road of writing code.
Honestly, there is still a lot I have not figured out. I hope I can keep a curious mind and keep moving forward √
(My real inner thought: please assign fewer of these hair-loss-inducing things...)
