02-python: Scraping Poem Titles and Content from 中国诗词网 (a Chinese Poetry Website)

Latest recommended article published 2022-10-02 17:06:21

Y_principal


Category: 002-Crawlers

Article link: https://blog.csdn.net/Y_principal/article/details/96175248


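This post shows how to use BeautifulSoup to scrape poem titles and content from 中国诗词网, walks through the crawler with a worked example, and finishes by saving the data to a document.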


This mostly follows two articles and sums up what I have learned over the past few days. I'm a beginner O(∩_∩)O haha~ so please be kind.

https://blog.csdn.net/qq_40309183/article/details/80630910

https://blog.csdn.net/stormdony/article/details/79828842

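Goal: extract the titles and content of the poems on 中国诗词网.

Tool: BeautifulSoup. Personally, I feel it exists to replace regular expressions.

A quick recap of regular expressions: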

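import re

text = """
12o=所得税的所得税法水电费水电费是发送到发送到发顺丰.jpg
"""

# .*? is non-greedy (lazy) matching: it stops at the first ".jpg" it can reach.
# The dot before "jpg" is escaped so that it matches a literal ".".
r = r'12o(.*?\.jpg)'

# The pattern can also be precompiled: reg = re.compile(r)
print(re.findall(r, text))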
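To see what the `?` after `.*` changes, here is a small sketch that runs a greedy and a non-greedy pattern over the same string. The `sample` string is made up for illustration and is not from the original post.

```python
import re

# Hypothetical sample string containing two ".jpg" names
sample = "12o=a.jpg and 12o=b.jpg"

# Greedy: .* grabs as much as possible, so it runs to the LAST ".jpg"
greedy = re.findall(r"12o=(.*\.jpg)", sample)

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST ".jpg"
lazy = re.findall(r"12o=(.*?\.jpg)", sample)

print(greedy)  # ['a.jpg and 12o=b.jpg']
print(lazy)    # ['a.jpg', 'b.jpg']
```

When pulling several items out of one page, the non-greedy form is usually the one you want, because each match ends at the nearest closing marker.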

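That covers regular expressions. After that I looked into BeautifulSoup, which is the main tool used here.

[1] The code from the source blog; my modified version comes after it.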
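To make the comparison concrete, here is a minimal sketch that extracts the same title from an inline HTML snippet, once with a regex and once with BeautifulSoup. The markup and the class names `title` and `body` are hypothetical, since the real structure of the poetry site is not shown in this post. The built-in `html.parser` is used here so the sketch does not depend on `lxml`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real site's HTML may look quite different
html = """
<div class="poem">
  <h2 class="title">静夜思</h2>
  <p class="body">床前明月光，疑是地上霜。</p>
</div>
"""

# Regex approach: the pattern is tied to the exact tag and attribute text
title_re = re.findall(r'<h2 class="title">(.*?)</h2>', html)

# BeautifulSoup approach: select elements with CSS selectors instead
soup = BeautifulSoup(html, "html.parser")
title_bs = [h2.get_text(strip=True) for h2 in soup.select("h2.title")]
body_bs = [p.get_text(strip=True) for p in soup.select("p.body")]

print(title_re, title_bs, body_bs)
```

The regex breaks as soon as the tag gains another attribute or the quoting changes, while the selector keeps working, which is probably why BeautifulSoup feels like a replacement for regex in this kind of task.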

import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import sys


# URL of the page whose links we want to collect
url = 'https://blog.csdn.net/stormdony'

# Define a header tuple holding the User-Agent copied from the browser, to mimic a browser
headers = ('User-Agent',
           "Mozilla/5.0 (Windows NT 10.0; Win32; x32; rv:48.0) Gecko/20130101 Firefox/58.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install the opener globally
urllib.request.install_opener(opener)
html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
# print(html)
bs = BeautifulSoup(html, 'lxml')
# Use BeautifulSoup's select to find every <a> tag that is a direct child of an <h4>
linklist = bs.select('h4 > a')
# print(linklist)

# texts stores the article titles
texts = []
# links stores the article links
links = []
# Iterate over linklist, storing each title and link
for link in linklist:
    texts.append(link.t
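The listing is cut off at this point in the source, so the original loop body is not known. As a rough, self-contained sketch of the general pattern (not the source blog's actual code), the following collects the text and `href` of every `h4 > a` link from an inline HTML string and writes them to a text file. The HTML, the `example.com` URLs, and the output file name `titles.txt` are all assumptions made for illustration.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical listing page; stands in for the downloaded `html`
html = """
<h4><a href="https://example.com/post/1"> First post </a></h4>
<h4><a href="https://example.com/post/2">Second post</a></h4>
<p><a href="https://example.com/about">About</a></p>
"""

bs = BeautifulSoup(html, "html.parser")

texts = []  # link titles
links = []  # link targets
for link in bs.select("h4 > a"):
    texts.append(link.get_text(strip=True))  # strip surrounding whitespace
    links.append(link.get("href"))

# Write one "title<TAB>link" line per entry (file name is an assumption)
out_path = os.path.join(tempfile.gettempdir(), "titles.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for title, href in zip(texts, links):
        f.write(f"{title}\t{href}\n")

print(texts)
print(links)
```

The `<a>` inside the `<p>` is skipped because `h4 > a` only matches links that are direct children of an `<h4>`. The same loop shape should carry over to poem titles and bodies once the right selectors for the poetry site are known.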
