First, a look at the result:
Import the modules:
import re
import xlwt
from bs4 import BeautifulSoup
from urllib import request, error
Fetch the HTML page:
Notes:
(1) Request() mainly takes four parameters: url, headers, data, method. It is best to pass them as keyword arguments.
def askurl(url):
    try:
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        req = request.Request(url=url, headers=headers)
        response = request.urlopen(req)
        html = response.read().decode('utf-8')
        return html
    except error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)    # HTTP status code, e.g. 404
        if hasattr(e, 'reason'):
            print(e.reason)  # human-readable failure reason
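As a quick check of the keyword-argument style recommended above, a Request object can be built and inspected without making any network call. The URL and header value here are placeholders:

```python
from urllib import request

# Build a Request with keyword arguments only; no network traffic happens here.
# 'https://example.com' and 'test-agent' are placeholder values.
req = request.Request(url='https://example.com',
                      headers={'user-agent': 'test-agent'})

print(req.get_method())              # GET, since no data= was supplied
print(req.get_header('User-agent'))  # urllib capitalizes stored header names
```

When data= is supplied, get_method() switches to POST, which is why the note above lists data among the four main parameters.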
Parse the data:
Notes:
(1) When building the regex pattern, write the whole tag into the pattern; the part inside () is what gets extracted.
(2) hasattr(tag, attr) takes a tag object and an attribute name and returns a boolean. However, on a BeautifulSoup Tag, plain attribute access is routed to a child-tag lookup, so hasattr() does not actually test HTML attributes; use tag.has_attr(attr) for that.
(3) This is why an attribute that "passes" hasattr() can still produce an empty regex match: hasattr() is effectively always true on a Tag, even when the HTML attribute is absent.
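Notes (2) and (3) can be demonstrated on a tiny snippet. This is a minimal sketch, assuming bs4 is installed; the two anchor tags are made-up examples:

```python
from bs4 import BeautifulSoup

# Two made-up anchors: one with a title attribute, one without.
with_title = BeautifulSoup('<a href="x" title="T">t</a>', 'html.parser').a
no_title = BeautifulSoup('<a href="x">t</a>', 'html.parser').a

print(with_title.has_attr('title'))  # True
print(no_title.has_attr('title'))    # False
print(hasattr(no_title, 'title'))    # True: attribute access on a Tag is a child-tag lookup
```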
FindLink = re.compile(r'<a href="(https?://.*?)" target=.*?</a>')
FindTitle = re.compile(r'<a href=.*?title="(.*?)">.*?</a>')
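To see what the two patterns capture, they can be run on a single hand-written anchor tag; the URL and title below are made up:

```python
import re

FindLink = re.compile(r'<a href="(https?://.*?)" target=.*?</a>')
FindTitle = re.compile(r'<a href=.*?title="(.*?)">.*?</a>')

# A made-up anchor in the shape the patterns expect.
sample = '<a href="https://blog.csdn.net/demo" target="_blank" title="Demo Post">Demo Post</a>'

print(FindLink.findall(sample))   # ['https://blog.csdn.net/demo']
print(FindTitle.findall(sample))  # ['Demo Post']
```

An anchor without a title attribute matches neither pattern, which is what the len(link) and len(title) check below guards against.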
def getdata():
    datalist = []
    html = askurl('https://blog.csdn.net/')
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('a', {'target': '_blank'}):
        if item.has_attr('title'):  # has_attr() tests the HTML attribute; hasattr() does not
            item = str(item)
            title = FindTitle.findall(item)
            link = FindLink.findall(item)
            if len(link) and len(title):
                datalist.append([title[0], link[0]])
    return datalist
Save the data:
(1) book.save(savepath) takes one argument, in the form 'path + filename.xls'
(2) Take note of the parameters passed after book and sheet below
def savedata(savepath):
    datalist = getdata()
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)  # encoding must be passed as a keyword argument
    sheet = book.add_sheet('博客视频链接', cell_overwrite_ok=True)
    col = ('视频名称', '视频链接')
    for i in range(2):
        sheet.write(0, i, col[i])  # header row
    for i in range(len(datalist)):
        for j in range(2):
            sheet.write(i + 1, j, datalist[i][j])  # data rows start at row 1
    book.save(savepath)
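The header offset in the loops above (data rows land at row i + 1) can be sketched without xlwt by building the same grid as nested lists; the two sample rows are hypothetical:

```python
# Hypothetical sample data standing in for getdata()'s result.
datalist = [['Title A', 'https://a.example'], ['Title B', 'https://b.example']]
col = ('视频名称', '视频链接')

# Same (row, col) layout as the sheet.write calls: row 0 is the header,
# data rows follow from row 1.
grid = [list(col)] + [list(row) for row in datalist]

print(grid[0])     # ['视频名称', '视频链接']
print(grid[1][0])  # 'Title A'
```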
Run the main function:
def main():
    savedata('d:\\博客视频链接.xls')

if __name__ == '__main__':
    main()