Front-End Design and Development: Part 2

季筱

Article link: https://blog.csdn.net/m0_38049110/article/details/105984657

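Project requirements:

1. Choose 3-5 representative news sites (for example Sina News or NetEase News, or an authoritative site in a vertical domain, such as Xueqiu or Eastmoney for finance, or Tencent Sports or Hupu for sports) and build crawlers for them. Analyze the news pages of each site and extract structured fields such as encoding, title, author, time, keywords, summary, content, and source, then store them in a database.

2. Build a website that offers per-field full-text search over the crawled content and shows a time-based popularity analysis for the searched keyword.

Technical requirements:

1. The crawler must be implemented in Node.js.

2. The back end of the query site must be implemented in Node.js and the front end in HTML + JS (avoid front-end and back-end frameworks as far as possible).

My own attempt follows.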

1. Download MySQL

2. Install MySQL

3. Grant access to the client

4. Create the database crawl

create database crawl;

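5. Create the table fetches

6. In the code, store the crawl results in the database. Before a page is fetched, its URL is looked up in fetches so that duplicates are skipped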

var fetch_url_Sql = 'select url from fetches where url=?';
var fetch_url_Sql_Params = [myURL];
mysql.query(fetch_url_Sql, fetch_url_Sql_Params, function(qerr, vals, fields) {
    if (qerr) {
        console.log(qerr); // the query failed, so skip this URL
        return;
    }
    if (vals.length > 0) {
        console.log('URL duplicate!');
    } else newsGet(myURL); // fetch the news page
});

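7. Query the database

It works.

8. The result as shown on the web page

9. Next, set up the crawler for a site of my own choosing.

Here I try crawling The Paper (澎湃新闻).

On the site's home page, the links all sit in <a href=> elements. A closer look shows that every news link has the form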

https://www.thepaper.cn/newsDetail_forward_7294039

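whereas on the home page the href holds only the relative part: newsDetail_forward_ followed by a string of digits.

So when reading links we need to check that the href starts with newsDetail_forward_; anything else is discarded.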

if (href.toLowerCase().indexOf('http://') >= 0) return; // skip absolute links that contain http://
else if (href.startsWith('newsDetail_forward_')) myURL = 'https://www.thepaper.cn/' + href;
else return; // every other href is not an article link
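The same filter can be written as a small standalone function, which makes it easier to try on individual href values. This is only a sketch: `toArticleUrl`, `BASE_URL`, and `ARTICLE_PREFIX` are hypothetical names that do not appear in the original crawler, and the guard for a missing href is an added assumption.

```javascript
// Hypothetical constants for illustration; the values come from the text above.
const BASE_URL = 'https://www.thepaper.cn/';
const ARTICLE_PREFIX = 'newsDetail_forward_';

// Returns the absolute article URL, or null when the href is not an article link.
function toArticleUrl(href) {
  if (typeof href !== 'string') return null; // <a> element without an href attribute
  const trimmed = href.trim();
  if (!trimmed.startsWith(ARTICLE_PREFIX)) return null; // absolute or unrelated link
  return BASE_URL + trimmed;
}

console.log(toArticleUrl('newsDetail_forward_7294039'));
console.log(toArticleUrl('http://example.com/'));
```

Returning null instead of using an early `return` lets the caller decide what to do with links that are skipped.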

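10. Fetching the individual news pages

Here we look at an article page on The Paper

Based on the page structure, cheerio extracts the fields we need with the following selectors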

var seedURL_format = "$('a')";
var keywords_format = " $('meta[name=\"keywords\"]').eq(0).attr(\"content\")";
var title_format = "$('title').text()";
var information_format = "$('.news_about').text()";
var content_format = "$('.news_txt').text()";
var desc_format = " $('meta[name=\"description\"]').eq(0).attr(\"content\")";

11. Finally, save the result as a JSON file

        var fetch = {}; // empty fetch object that will hold title, content, url, and so on
        fetch.title = "";
        fetch.content = "";
        //fetch.publish_date = (new Date()).toFormat("YYYY-MM-DD");
        //fetch.html = myhtml;
        fetch.url = myURL;
        fetch.source_name = source_name;
        fetch.source_encoding = myEncoding; // encoding
        fetch.crawltime = new Date(); // crawl time

        if (keywords_format == "") fetch.keywords = source_name;
        else fetch.keywords = eval(keywords_format);

        if (title_format == "") fetch.title = "";
        else fetch.title = eval(title_format); // title

        if (information_format == "") fetch.information = "";
        else fetch.information = eval(information_format).replace(/\n/g, ""); // strip every newline, not only the first

        if (content_format == "") fetch.content = "";
        else fetch.content = eval(content_format);

        // if (editor_format == "") fetch.editor = "";
        // else fetch.editor = eval(editor_format).replace("\n","");

        if (desc_format == "") fetch.desc = fetch.title;
        else fetch.desc = eval(desc_format); // summary
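The field extraction above relies on eval to run selector strings. One possible alternative, sketched here and not taken from the original crawler, stores each selector as a function instead. The names `extractors` and `extractFields` are hypothetical, and `$` is assumed to be the object returned by `cheerio.load(html)`. The sketch also simplifies the fallbacks: a missing value becomes an empty string, whereas the original falls back to source_name for keywords and to the title for the summary.

```javascript
// One function per field instead of a string passed to eval.
// Each function receives `$`, assumed to come from cheerio.load(html).
const extractors = {
  keywords: ($) => $('meta[name="keywords"]').eq(0).attr('content'),
  title: ($) => $('title').text(),
  information: ($) => $('.news_about').text().replace(/\n/g, ''),
  content: ($) => $('.news_txt').text(),
  desc: ($) => $('meta[name="description"]').eq(0).attr('content'),
};

// Runs every extractor against the page and collects the results.
function extractFields($, fieldExtractors) {
  const fetch = {};
  for (const [name, extract] of Object.entries(fieldExtractors)) {
    const value = extract($);
    fetch[name] = value == null ? '' : value; // missing attribute becomes ''
  }
  return fetch;
}
```

With cheerio installed, a call might look like `extractFields(cheerio.load(html), extractors)`. Keeping the selectors as functions avoids executing arbitrary strings and lets a typo surface as a normal syntax error when the file loads.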

The file name is the source name (澎湃新闻) plus the article's page number

var filename = source_name + "_"  +
            "_" + myURL.substr(myURL.lastIndexOf('/') + 20,7) + ".json";
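The fixed offset of 20 and length of 7 work for URLs such as the one shown earlier, but they assume the article number always has exactly seven digits. A regular expression is one way to drop that assumption. The sketch below is illustrative only: `articleId` and `jsonFilename` are hypothetical names, and it joins the parts with a single underscore where the original concatenates two.

```javascript
// Extracts the digits that follow "newsDetail_forward_", whatever their length.
function articleId(url) {
  const match = /newsDetail_forward_(\d+)/.exec(url);
  return match ? match[1] : null;
}

// Builds "<source name>_<article id>.json", or null if the URL has no article id.
function jsonFilename(sourceName, url) {
  const id = articleId(url);
  return id === null ? null : sourceName + '_' + id + '.json';
}

console.log(jsonFilename('澎湃新闻', 'https://www.thepaper.cn/newsDetail_forward_7294039'));
```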

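12. Run the crawler; the crawl succeeds

View the crawled content

14. Create the MySQL table. For convenience, we store only some of the fields for now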

CREATE TABLE `news` (
 `id_fetches` int(11) NOT NULL AUTO_INCREMENT,
 `url` varchar(200) DEFAULT NULL,
 `source_name` varchar(200) DEFAULT NULL,
 `source_encoding` varchar(45) DEFAULT NULL,
 `title` varchar(200) DEFAULT NULL,
 `crawltime` datetime DEFAULT NULL,
 `content` longtext,
 `desc` longtext,
 `createtime` datetime DEFAULT CURRENT_TIMESTAMP,
 PRIMARY KEY (`id_fetches`),
 UNIQUE KEY `id_fetches_UNIQUE` (`id_fetches`),
 UNIQUE KEY `url_UNIQUE` (`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Finally, insert the record into MySQL as follows

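// Seven columns, seven placeholders, seven parameters.
// desc is a reserved word in MySQL, so the column name is quoted with backticks.
var fetchAddSql = 'INSERT INTO news(url,source_name,source_encoding,title,' +
            'crawltime,content,`desc`) VALUES(?,?,?,?,?,?,?)';
var fetchAddSql_Params = [fetch.url, fetch.source_name, fetch.source_encoding,
            fetch.title, fetch.crawltime.toFormat("YYYY-MM-DD HH24:MI:SS"),
            fetch.content, fetch.desc
];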
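Because the column list, the `?` placeholders, and the parameter array all have to match in length, one option is to generate all three from a single object. The helpers below are a sketch under that assumption: `buildInsert` and `toMysqlDatetime` are hypothetical names, not part of the original code or of any MySQL client library. The table and column names are inserted into the SQL text directly, so they must come from the program itself and never from user input.

```javascript
// Builds an INSERT statement and its parameter array from one object,
// so columns, placeholders, and values always have the same length.
// Table and column names must be trusted; only the values are parameterized.
function buildInsert(table, row) {
  const columns = Object.keys(row);
  const sql = 'INSERT INTO ' + table + '(' +
    columns.map((c) => '`' + c + '`').join(',') +
    ') VALUES(' + columns.map(() => '?').join(',') + ')';
  return { sql: sql, params: columns.map((c) => row[c]) };
}

// Formats a Date as "YYYY-MM-DD HH:MM:SS" in local time using only built-ins.
function toMysqlDatetime(d) {
  const p = (n) => String(n).padStart(2, '0');
  return d.getFullYear() + '-' + p(d.getMonth() + 1) + '-' + p(d.getDate()) +
    ' ' + p(d.getHours()) + ':' + p(d.getMinutes()) + ':' + p(d.getSeconds());
}

const insert = buildInsert('news', {
  url: 'https://www.thepaper.cn/newsDetail_forward_7294039',
  crawltime: toMysqlDatetime(new Date()),
  desc: 'summary text',
});
console.log(insert.sql);
```

The resulting `insert.sql` and `insert.params` can be passed to the same query call that the crawler already uses.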

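15. View the first 10 stored rows in MySQL

16. Read the data from the web page

Click search

Success!

The basics now work. For a nicer look, the front end can be built with antd + React.

Because the framework does most of the work, little has to be changed by hand.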

                <Search placeholder="Enter a title"
                  style={{ width: 200 }}
                  onSearch={()=>{this.search2mysql(this)}} 
                  enterButton />                
                <div style={{ margin: "24px 0" }} />
                <Table columns={columns2} dataSource={data} />