Python crawler: an article scraper (sensibly handling \n, \t, \r, ... in strings)

Latest recommended article published on 2022-06-12 16:30:12

weixin_39935654


Article tags: Python article scraping

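import re
import time
import urllib.request

num = input("Enter the date (20150101000): ")

# Output file, opened in append mode for every article.
load = "C:/Users/home/Desktop/新建文本文档.txt"


def openpage(url):
    # Download the page and decode it as gb2312.
    html = urllib.request.urlopen(url)
    page = html.read().decode('gb2312')
    return page


def getpassage(page):
    # First regex: capture the article body. The HTML tags that delimited this
    # pattern were lost when the post was extracted; the <div> pair below is a
    # placeholder to adjust to the page's real markup.
    passage = re.findall(r'<div[^>]*>([\s\S]*)</div>', str(page))
    # Second regex: strip the remaining HTML tags.
    passage1 = re.sub(r"</?\w+[^>]*>", "", str(passage))
    # str() of the list leaves literal \r, \n, \t escapes plus brackets; undo them.
    # '&nbsp;' is an assumption: the extracted post shows only a plain space there.
    passage2 = passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t', '\t').replace(']', '').replace('[', '').replace('&nbsp;', ' ')
    print(passage2)
    with open(load, 'a', encoding='utf-8') as f:
        f.write("-----------------------------" + "Date " + str(date) + "---------------------------------\n" + passage2 + "----------------------------------------------------\n")


for i in range(1, 32):
    date = int(num) + i
    print(date)
    url = "http://www.hbuas.edu.cn/news/xyxw/news_" + str(date) + ".htm"
    try:
        page = openpage(url)
        getpassage(page)
        print("Day " + str(i) + ": article found ---- downloaded")
    except Exception:
        print("Day " + str(i) + ": no article.")
    time.sleep(2)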

I wrote a crawler for my school's news site.

It mainly involves three things: regular expressions (re), urllib.request, and writing to a file.
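The loop in the script treats every exception as "no article that day", which also hides decoding problems and network failures. A minimal sketch of a narrower download step follows. The function name `fetch`, its `opener` parameter, the 10-second timeout, and the `errors="replace"` choice are all assumptions made for illustration, not part of the original script.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the decoded page, or None when the server answers with an HTTP error.

    `fetch` and `opener` are hypothetical names used only in this sketch.
    """
    try:
        with opener(url, timeout=10) as response:
            # errors="replace" keeps going if a byte is not valid gb2312.
            return response.read().decode("gb2312", errors="replace")
    except urllib.error.HTTPError as err:
        # A missing page (for example 404) ends up here; other errors propagate.
        print(f"{url}: HTTP {err.code}")
        return None
```

With this shape, the calling loop can check for `None` to decide that a day has no article, while unexpected errors still surface instead of being silently swallowed.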

Scraped articles usually come back littered with markup and escape sequences that make the text ugly,

like this:

[Image: scraped output before cleanup]
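The literal \r, \n and \t sequences come from calling str() on the list that re.findall returns: str() of a list is its repr, and in a repr the control characters inside each string are written as backslash escapes. A small sketch of that effect, using a made-up sample string in place of the real page:

```python
import re

# Hypothetical sample standing in for a downloaded page.
page = "<p>first line\r\n\tsecond line</p>"

matches = re.findall(r"<p>([\s\S]*)</p>", page)

# The matched string itself holds real control characters ...
raw = matches[0]

# ... but str() of the list is its repr, so each one becomes a
# two-character escape (backslash + letter) inside brackets and quotes.
as_text = str(matches)

print(repr(raw))
print(as_text)
```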

Optimization:

Two regex passes

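# First pass: capture the article body. The <div> delimiters are a placeholder;
# the original HTML tags in this pattern were lost when the post was extracted.
passage = re.findall(r'<div[^>]*>([\s\S]*)</div>', str(page))

# Second pass: strip the HTML tags.
passage1 = re.sub(r"</?\w+[^>]*>", "", str(passage))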
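The two passes can be tried on a small string without touching the network. In this sketch the page fragment and the `<div class="content">` delimiters are invented for the example, and the second pass runs on the matched string itself rather than on `str(passage)` as the post does:

```python
import re

# Hypothetical page fragment; the real site's markup is not shown in the post.
page = '<div class="content"><p>Hello <b>world</b></p><br/></div>'

# Pass 1: capture the article body between the assumed delimiters.
body = re.findall(r'<div class="content">([\s\S]*)</div>', page)

# Pass 2: remove every opening, closing, or self-closing tag.
text = re.sub(r"</?\w+[^>]*>", "", body[0])

print(text)
```

Note that `[\s\S]*` is greedy, so on a real page the capture runs to the last closing tag that matches, which may include more than the article.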

Replacement

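# Turn the literal \r, \n, \t escapes back into real whitespace and drop the list brackets.
# '&nbsp;' is an assumption: the extracted post shows only a plain space as the first argument.
passage2 = passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t', '\t').replace(']', '').replace('[', '').replace('&nbsp;', ' ')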
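For comparison, here is a sketch that puts the replacement chain next to an alternative that joins the matches instead of stringifying the list. The sample string is made up, and the join variant is a suggestion rather than what the post does:

```python
import re

# Hypothetical scraped fragment with real control characters in it.
page = "<p>line one\r\n\tline two</p>"
passage = re.findall(r"<p>([\s\S]*)</p>", page)

# The post's route: stringify the list, then undo the escapes and brackets.
via_replace = (
    str(passage)
    .replace("\\r", "\r")
    .replace("\\n", " \n")
    .replace("\\t", "\t")
    .replace("]", "")
    .replace("[", "")
)

# Alternative: join the matches, so nothing gets escaped in the first place.
via_join = "".join(passage)

print(repr(via_replace))
print(repr(via_join))
```

The replacement route leaves the quote characters from the list's repr around the text and also deletes any square brackets that belong to the article itself; joining the matches avoids both.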

The result:

[Image: scraped output after cleanup]

Done!

Date: 08-12
