高校新闻抓取分析之百度新闻篇---数据清洗解析

本文链接：https://blog.csdn.net/qq_22034353/article/details/107601308

高校新闻抓取分析之百度新闻篇—数据清洗解析

tips:

本文代码使用python3编写
代码仓库
使用re抓取解析数据

前言

在上一篇文章中，成功构建URL并获取到高校新闻数据。

现在将对请求回来的数据进行清洗解析，提取：新闻标题，新闻来源，新闻时间，更多新闻链接。

回顾一下新闻信息HTML片段：

<div class="result title" id="1">
•&nbsp;
<h3 class="c-title">
 <a href="http://www.thepaper.cn/newsDetail_forward_8438609" data-click="{
      'f0':'77A717EA',
      'f1':'9F63F1E4',
      'f2':'4CA6DE6E',
      'f3':'54E5243F',
      't':'1595752997'
      }" target="_blank">
      中国科协“老科学家学术成长资料采集工程”<em>北京大学</em>联合采集启动...
    </a>
</h3>
<div class="c-title-author">澎湃新闻&nbsp;&nbsp;				
						7小时前
			
&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:1782421700&amp;same=2&amp;cl=1&amp;tn=newstitle&amp;rn=30&amp;fm=sd" class="c-more_link" data-click="{'fm':'sd'}">查看更多相关新闻&gt;&gt;</a>
</div>
</div>
<div class="result title" id="2">
•&nbsp;
<h3 class="c-title">
 <a href="http://sc.people.com.cn/n2/2020/0726/c345509-34183157.html" data-click="{
      'f0':'77A717EA',
      'f1':'9F63F1E4',
      'f2':'4CA6DD6E',
      'f3':'54E5243F',
      't':'1595752997'
      }" target="_blank">
      <em>北京大学</em>、清华大学等17所高校在川招生行程安排和咨询地址、电话...
    </a>
</h3>
<div class="c-title-author">人民网四川站&nbsp;&nbsp;				
						9小时前
			
</div>
</div>

实现数据清洗

在上篇中将请求回来的数据保存为了html，方便我们进一步处理和分析。一般来说，我们提取html中有用的信息，是不需要样式和js的。所以可以预先删除这部分内容，减少干扰项。通过删除多余的空格和换行符，压缩文件。

删除script，style及空格换行符

import re

with open('test.html','r',encoding='utf-8') as f:
    data = f.read()
# 删除script
data = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', data, flags=re.I)
# 删除style
data = re.sub(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', '', data, flags=re.I)
# 删除多余的空格及换行符
data = re.sub(r'\s{2,}', '', data)

通过上两步可以删除大部分的样式和js，但是有一部分js无法删除。那么通过下面的方式来删除，找到未能删除的js头，然后找到结束的js尾巴，循环删除这一部分即可。

while data.find('<script>') != -1:
    start_index = data.find('<script>')
    end_index = data.find('</script>')
    if end_index == -1:
        break
    data = data[:start_index] + data[end_index + len('</script>'):]

删除样式及js后压缩的html片段：

<div><div class="result title" id="1">
&#8226;&nbsp;
<h3 class="c-title"><a href="https://3g.163.com/news/article/FIG1TC630514B1BP.html"data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DD6E','f3':'54E5243F','t':'1595773708'}"target="_blank">在宣汉揭牌!<em>北京大学</em>文化产业博士后创新实践基地成立</a>
</h3>
<div class="c-title-author">网易&nbsp;&nbsp;3小时前&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd" class="c-more_link" data-click="{'fm':'sd'}">查看更多相关新闻>></a>
</div>
</div><div class="result title" id="2">
&#8226;&nbsp;
<h3 class="c-title"><a href="https://3g.163.com/news/article/FIFOP7BP05501E5I.html"data-click="{'f0':'77A717EA','f1':'9F63F1E4','f2':'4CA6DD6E','f3':'54E5243F','t':'1595773708'}"target="_blank">衡阳市“人大讲坛”再次开讲!听<em>北京大学</em>法学博士说“贪”</a>
</h3>
<div class="c-title-author">网易&nbsp;&nbsp;5小时前</div>
</div>

实现数据解析

1. 提取Html新闻片段

通过前面的删除多余删除样式及js并压缩后，html相对规整，接下来需要做的：

正则提取出包含所有新闻的字符串
移除多余的字符串
字符操作分割出每条新闻html片段

代码如下：

# 取出解析的数据字符串
wait_parse_string = re.search(r'<div id="content_left">([\s\S]*)<div id="gotoPage">', data)

if wait_parse_string:
    wait_parse_string = wait_parse_string.group()
    # 移除最后多余字符串
    wait_parse_string = re.sub(r'</div></div><div id="gotoPage">', '', wait_parse_string)

    flag = '<div class="result title"'
    flag_length = len(flag)
    # 遍历字符串并分割出单条新闻数据
    news_list = []
    while wait_parse_string.find('<div class="result title"') != -1:

        start_index = wait_parse_string.find(flag)
        end_index = wait_parse_string[start_index + flag_length:].find(flag)
        if end_index > 0:
            end_index = start_index + end_index + flag_length
        else:
            end_index = len(wait_parse_string)
        news_list.append(wait_parse_string[start_index:end_index])
        wait_parse_string = wait_parse_string[end_index:]
    print(news_list)

部分结果：

['<div class="result title" id="1">\n&#8226;&nbsp;\n<h3 class="c-title"><a href="https://3g.163.com/news/article/FIG1TC630514B1BP.html"data-click="{\'f0\':\'77A717EA\',\'f1\':\'9F63F1E4\',\'f2\':\'4CA6DD6E\',\'f3\':\'54E5243F\',\'t\':\'1595773708\'}"target="_blank">在宣汉揭牌!<em>北京大学</em>文化产业博士后创新实践基地成立</a>\n</h3>\n<div class="c-title-author">网易&nbsp;&nbsp;3小时前&nbsp;&nbsp;<a href="/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd" class="c-more_link" data-click="{\'fm\':\'sd\'}">查看更多相关新闻>></a>\n</div>\n</div>']

2. 提取数据

通过上一步分割出了单条数据，将数据进一步解析，通过正则提取出需要的字段。代码如下：

from datetime import datetime, timedelta

def time_convert(time_string: str):
    """
    标准化日期字符串
    :param time_string:
    :return:
    """
    if not time_string:
        return ''

    if '分钟前' in time_string:
        minute = re.search(r'\d+', time_string)
        if minute:
            minute = minute.group()
            now = datetime.now() - timedelta(minutes=int(minute))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')

    elif '小时前' in time_string:
        hour = re.search(r'\d+', time_string)
        if hour:
            hour = hour.group()
            now = datetime.now() - timedelta(hours=int(hour))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    else:
        try:
            parse_time = datetime.strptime(time_string, '%Y年%m月%d日 %H:%M')
            return parse_time.strftime('%Y-%m-%d')
        except Exception as e:
            now = datetime.now()
            return now.strftime('%Y-%m-%d')


news_data = []
for news in news_list:
    temp = {
        "news_key": '北京大学',
        "news_title": '',
        'news_link': '',
        'news_author': '',
        'news_time': '',
        'more_link': '',
    }

    # 解析链接
    news_link = re.search(r'<a\s*href="(\S+)"\s*data-click', news, re.I)
    if news_link:
        temp["news_link"] = news_link.group(1)
    # 解析标题
    news_title = re.search(r'target="_blank">([\d\D]*)(</a>\s*</h3>)', news, re.I)
    if news_title:
        temp["news_title"] = news_title.group(1)
    # 解析发布者及时间
    author_time = re.search(
        r'<div class="c-title-author">(\S+)&nbsp;&nbsp;((\d+分钟前)|(\d+小时前)|(\d+年\d+月\d+日 \d+:\d+))', news, re.I)
    if author_time:
        temp["news_author"] = author_time.group(1)
        temp["news_time"] = time_convert(author_time.group(2))
    # 解析查询更多相同新闻
    more_link = re.search(r'<a\s*href="(\S+)"\s*class="c-more_link"', news, re.I)
    if more_link:
        temp["more_link"] = more_link.group(1)

    news_data.append(temp)

部分结果：

[{'news_key': '北京大学', 'news_title': '在宣汉揭牌!<em>北京大学</em>文化产业博士后创新实践基地成立', 'news_link': 'https://3g.163.com/news/article/FIG1TC630514B1BP.html', 'news_author': '网易', 'news_time': '2020-07-26', 'more_link': '/ns?word=title%3A%28%E5%8C%97%E4%BA%AC%E5%A4%A7%E5%AD%A6%29+cont:2675916080&same=2&cl=1&tn=newstitle&rn=30&fm=sd'},]

3. 数据存储

经过数据的从抓取和解析，已经能得到比较规整的数据。我们只需要按照一定的字段写入到数据库或者文本即可，接下来存储到非关系型数据库mongodb。需要使用到pymongo包，使用命令pip install pymongo。

import pymongo
# 构建数据库连接
mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % ("root", '123456', "127.0.0.1", "27017", "school_news_analysis")
conn = pymongo.MongoClient(mongo_uri)
db = conn.school_news_analysis
school_news = db.school_news
insert_data = list()
for item in news_data:
    # 根据链接去重，重复更新 
    insert_data.append(pymongo.ReplaceOne({'news_link': item['news_link']}, item, upsert=True))
school_news.bulk_write(insert_data)

4. 完整代码

import re
from datetime import datetime, timedelta

import pymongo

with open('test.html', 'r', encoding='utf-8') as f:
    data = f.read()
# 删除script
data = re.sub(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', '', data, flags=re.I)
# 删除style
data = re.sub(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', '', data, flags=re.I)
# 删除多余的空格及换行符
data = re.sub(r'\s{2,}', '', data)

while data.find('<script>') != -1:
    start_index = data.find('<script>')
    end_index = data.find('</script>')
    if end_index == -1:
        break
    data = data[:start_index] + data[end_index + len('</script>'):]

with open('test1.html', 'w', encoding='utf-8') as f:
    f.write(data)

# 取出解析的数据字符串
wait_parse_string = re.search(r'<div id="content_left">([\s\S]*)<div id="gotoPage">', data)

news_list = []
if wait_parse_string:
    wait_parse_string = wait_parse_string.group()
    # 移除最后多余字符串
    wait_parse_string = re.sub(r'</div></div><div id="gotoPage">', '', wait_parse_string)

    flag = '<div class="result title"'
    flag_length = len(flag)
    # 遍历字符串并分割出单条新闻数据
    while wait_parse_string.find('<div class="result title"') != -1:

        start_index = wait_parse_string.find(flag)
        end_index = wait_parse_string[start_index + flag_length:].find(flag)
        if end_index > 0:
            end_index = start_index + end_index + flag_length
        else:
            end_index = len(wait_parse_string)
        news_list.append(wait_parse_string[start_index:end_index])
        wait_parse_string = wait_parse_string[end_index:]
    print(news_list)


def time_convert(time_string: str):
    """
    标准化日期字符串
    :param time_string:
    :return:
    """
    if not time_string:
        return ''

    if '分钟前' in time_string:
        minute = re.search(r'\d+', time_string)
        if minute:
            minute = minute.group()
            now = datetime.now() - timedelta(minutes=int(minute))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')

    elif '小时前' in time_string:
        hour = re.search(r'\d+', time_string)
        if hour:
            hour = hour.group()
            now = datetime.now() - timedelta(hours=int(hour))
        else:
            now = datetime.now()
        return now.strftime('%Y-%m-%d')
    else:
        try:
            parse_time = datetime.strptime(time_string, '%Y年%m月%d日 %H:%M')
            return parse_time.strftime('%Y-%m-%d')
        except Exception as e:
            now = datetime.now()
            return now.strftime('%Y-%m-%d')


news_data = []
for news in news_list:
    temp = {
        "news_key": '北京大学',
        "news_title": '',
        'news_link': '',
        'news_author': '',
        'news_time': '',
        'more_link': '',
    }

    # 解析链接
    news_link = re.search(r'<a\s*href="(\S+)"\s*data-click', news, re.I)
    if news_link:
        temp["news_link"] = news_link.group(1)
    # 解析标题
    news_title = re.search(r'target="_blank">([\d\D]*)(</a>\s*</h3>)', news, re.I)
    if news_title:
        temp["news_title"] = news_title.group(1)
    # 解析发布者及时间
    author_time = re.search(
        r'<div class="c-title-author">(\S+)&nbsp;&nbsp;((\d+分钟前)|(\d+小时前)|(\d+年\d+月\d+日 \d+:\d+))', news, re.I)
    if author_time:
        temp["news_author"] = author_time.group(1)
        temp["news_time"] = time_convert(author_time.group(2))
    # 解析查询更多相同新闻
    more_link = re.search(r'<a\s*href="(\S+)"\s*class="c-more_link"', news, re.I)
    if more_link:
        temp["more_link"] = more_link.group(1)

    news_data.append(temp)

print(news_data)

# 构建数据库连接
mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % ("root", '123456', "127.0.0.1", "27017", "school_news_analysis")
conn = pymongo.MongoClient(mongo_uri)
db = conn.school_news_analysis
school_news = db.school_news
insert_data = list()
for item in news_data:
    # 根据链接去重，重复更新
    insert_data.append(pymongo.ReplaceOne({'news_link': item['news_link']}, item, upsert=True))
school_news.bulk_write(insert_data)