A Python-Based Retrieval System (2): The Crawler

Latest recommended article published on 2023-09-05 20:33:35

海底小星星

Category: Python. Tags: web crawler, regular expressions

Article link: https://blog.csdn.net/u013270326/article/details/76990641

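The goal is to crawl the titles (or the full text) of the news center of the University of Shanghai for Science and Technology (http://www.usst.edu.cn/s/1/t/517/p/2/i/411/list.htm) and store them in the file News.txt. Simple use of regular expressions (the re module) plus some string handling is enough.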

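Import the requests module and call requests.get() on the list page URL. The response contains everything we need: its text attribute holds the raw HTML source of the page.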
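As a minimal sketch of this fetching step: build_url and fetch_page below are hypothetical helper names, not part of the original post, and the 10-second timeout is an assumed value. The encoding line is only a guess that the server may not declare a charset; drop it if the titles already decode correctly.

```python
import requests

# URL prefix taken from the news center address; the page number is appended.
BASE = "http://www.usst.edu.cn/s/1/t/517/p/2/i/"


def build_url(page):
    """Return the list-page URL for the given page number."""
    return BASE + str(page) + "/list.htm"


def fetch_page(page, timeout=10):
    """Download one list page and return its HTML source as a string."""
    response = requests.get(build_url(page), timeout=timeout)
    # Fail loudly on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    # Assumption: if no charset is declared, let requests guess from the body
    response.encoding = response.apparent_encoding
    return response.text
```

Calling fetch_page(411) would request the same address as the one quoted at the top of the post.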

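From the page source we can see that the news titles we need sit inside <font color=''> tags. Some titles are displayed in bold and are additionally wrapped in <b> tags, so they need a little preprocessing. In the end, all recent news titles are written to News.txt.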
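The two steps just described (strip the bold tags, then match the titles) can be tried offline on a small piece of markup. The SAMPLE_HTML string and the extract_titles name below are made up for illustration; the real page source may differ in detail.

```python
import re

# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<a href="/a.htm"><font color=''>Campus news one</font></a>
<a href="/b.htm"><font color=''><b>Important notice</b></font></a>
"""


def extract_titles(html):
    """Return the text inside each <font color=''>...</font> pair, without <b> tags."""
    # Remove the bold tags first so bold and plain titles look the same
    cleaned = html.replace("<b>", "").replace("</b>", "")
    # Non-greedy match so each title stops at its own closing tag
    return re.findall(r"<font color=''>(.*?)</font>", cleaned)


titles = extract_titles(SAMPLE_HTML)
print(titles)  # ['Campus news one', 'Important notice']
```

The non-greedy (.*?) matters here: a greedy (.*) would run to the last </font> on the line and merge several titles into one match.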

Implementation:

import re

import requests


def Usst_News_Spider(page, out_file):
    """Fetch one list page and write its news titles to out_file, one per line."""
    url = "http://www.usst.edu.cn/s/1/t/517/p/2/i/" + str(page) + "/list.htm"
    response = requests.get(url, timeout=10)
    key_content = response.text
    # Strip the <b> tags that wrap the bold titles
    content = key_content.replace('<b>', '').replace('</b>', '')
    # Match the titles with a regular expression
    titles = re.findall(r"<font color=''>(.*?)</font>", content)

    print(titles)
    for title in titles:
        out_file.write(title + "\n")


# Crawl list pages 1 through 379 and collect every title in News.txt
with open("News.txt", "w", encoding='utf-8') as f:
    for page in range(1, 380):
        Usst_News_Spider(page, f)
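To check the crawl-and-write loop without hitting the site, one option is to pass the page fetcher in as a parameter and substitute a stand-in. This is only a sketch: crawl_titles, fake_fetch, the delay parameter and the News_demo.txt file name are all assumptions for illustration, not part of the original code.

```python
import re
import time


def crawl_titles(fetch, pages, out_path, delay=0.0):
    """Fetch each page with `fetch`, extract titles, and write them one per line.

    Returns the number of titles written.
    """
    pattern = re.compile(r"<font color=''>(.*?)</font>")
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            html = fetch(page).replace("<b>", "").replace("</b>", "")
            for title in pattern.findall(html):
                out.write(title + "\n")
                count += 1
            # Optional pause between requests to go easy on the server
            time.sleep(delay)
    return count


def fake_fetch(page):
    """Stand-in for a real downloader: returns one bold title per page."""
    return "<font color=''><b>Title %d</b></font>" % page


written = crawl_titles(fake_fetch, range(1, 4), "News_demo.txt")
print(written)  # 3
```

For a real run, fake_fetch would be replaced by a function that downloads the page with requests.get(), and delay could be set to something like one second.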