Data collection is carried out with a Python web crawler that harvests pages from the Gushiwen classical-poetry website. The Requests library issues the HTTP requests, combined with anti-scraping countermeasures and breadth-first / depth-first traversal strategies.
The site organizes its poems into many categories. Taking the "Three Hundred Tang Poems" page as an example, the poem titles are laid out as a sequence of sub-headings, and clicking a title opens the corresponding poem together with its annotations. The crawler therefore first collects every title link in the "table of contents" area of the page and stores the links in a list, then visits each link in turn and extracts the required content from the linked page.
The first step of the crawler is fetching the data. Using the third-party requests library, requests.get() is called with the URL parameter naming the target, which sends a GET request to the server and retrieves its data. The response is stored in a variable res, after which res.content yields the raw HTML. The overall processing flow is shown in the figure:
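This fetch step can be sketched as a small helper (the function name fetch_html and the timeout value are my own additions for illustration, not part of the original script):

```python
import requests

def fetch_html(url, headers=None):
    """Send a GET request and return the decoded HTML, or None on failure."""
    try:
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()   # surface 4xx/5xx responses as errors
        res.encoding = 'utf-8'   # the site serves UTF-8 pages
        return res.text          # decoded counterpart of res.content
    except requests.RequestException:
        return None
```

Returning None on any request error keeps the calling loop simple: a failed page can be skipped instead of crashing the whole crawl.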
The fetched page contains a great deal of redundant markup, so the second step is parsing the HTML and discarding everything we do not need. The BeautifulSoup library is used to extract the tags that hold the poems. Inspecting the target page shows which tags contain the data, so a corresponding BeautifulSoup object is created to hold the cleaned document, and find_all() is then used to pull out the required elements. The first argument of find_all() is the name of the tag to match and the second is the tag's attributes, which is enough to locate the data of interest.
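The parsing step can be illustrated on a static snippet (the HTML below is a simplified stand-in for the real page; the contson class is the one the script targets later):

```python
from bs4 import BeautifulSoup

html = '''
<div class="contson" id="contsonabc">床前明月光，疑是地上霜。</div>
<div class="other">unrelated markup</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# first argument: tag name; second: attribute filter
# (class_ avoids the clash with Python's `class` keyword)
divs = soup.find_all('div', class_='contson')
print(divs[0].get_text())
```

Only the div carrying the target class is returned; the unrelated sibling is filtered out by the attribute argument.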
Even after extraction, the text still contains stray English letters and other symbols, so regular expressions are applied to the strings from the previous step to pull out the poem title, poet, dynasty, poem body, and translation. This mainly relies on the findall() method of the re module, which returns every substring matching the constructed pattern.
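In isolation, the regex step works as follows (the href value below is a made-up placeholder, not a real page id):

```python
import re

fragment = ('<span><a href="/shiwenv_FAKE.aspx" target="_blank">春晓</a>'
            '(孟浩然)</span>')
# non-greedy groups capture the link, the title, and any trailing text
pattern = r'<span><a href="(.*?)" target="_blank">(.*?)</a>(.*?)</span>'
matches = re.findall(pattern, fragment)
# matches[0] -> ('/shiwenv_FAKE.aspx', '春晓', '(孟浩然)')
```

Because the pattern contains groups, findall() returns a list of tuples, one tuple per matching span element.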
Scraping the poems
import re
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36'
}
def get_link(url):
    # collect the (href, title, trailing text) tuples from the contents page
    res = requests.get(url=url, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.content, 'html.parser')
    new = soup.find_all('span')
    new = str(new)
    pattern = '<span><a href="(.*?)" target="_blank">(.*?)</a>(.*?)</span>'
    link = re.findall(pattern, new)
    return link
def get_intent(name, title, links):
    # write the file header: category title plus the CSV column names
    f = open(name, 'a', encoding='utf-8')
    f.write('\n\n\n' + title + '\n')
    col = '诗名' + ',' + '作者' + ',' + '朝代' + ',' + '古诗' + ',' + '译文'
    f.write(col + '\n')
    f.close()
    count = 1
    for link in links:
        url = 'https://so.gushiwen.cn' + str(link[0])
        print(count)
        count += 1
        print(url)
        s = requests.session()
        res = s.get(url, headers=headers)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.content, 'html.parser')
        new = soup.find_all('div', class_='contson')
        # clean up the poem body
        new = str(new[0])
        new = new.replace('\n', '')
        new = new.replace('<br/>', '')
        new = new.replace('<p>', '')
        new = new.replace('</p>', '')
        pattern = '<div class="contson" id="(.*?)">(.*?)</div>'
        poem = re.findall(pattern, new)
        for i in poem:
            poem = i[1]
        # extract the poet and the dynasty
        title = soup.find_all('p', class_='source')
        title = str(title[0])
        title = title.replace('\n', '')
        pattern = '<p class="source"><a href="(.*?)">(.*?)</a> <a href="(.*?)">〔(.*?)〕</a></p>'
        title = re.findall(pattern, title)
        for i in title:
            poet = i[1]
            dynasty = i[3]
        # extract the translation block
        new = soup.find_all('div', class_='contyishang')
        soup = BeautifulSoup(str(new), 'html.parser')
        new = soup.find_all('p')
        if len(new) == 0:
            yiwen = '无'
        else:
            yiwen = str(new[0])
            yiwen = yiwen.replace('<br/>', '')
            yiwen = yiwen.replace('</a>', '')
            yiwen = yiwen.replace('<strong>韵译</strong>', '')
        # the translation appears under several slightly different layouts,
        # so try the patterns from the most to the least specific
        pattern = '<p><strong>译文</strong>(.*?)</p>'
        yi = re.findall(pattern, yiwen)
        if len(yi) == 0:
            pattern = '<p>译文(.*?)</p>'
            yi = re.findall(pattern, yiwen)
            if len(yi) == 0:
                pattern = '<p>(.*?)</p>'
                yi = re.findall(pattern, yiwen)
        if len(yi) == 0:
            yi = '无'
        else:
            yi = str(yi[0])
        f = open(name, 'a', encoding='utf-8')
        f.write(link[1] + ',' + poet + ',' + dynasty + ',' + poem + ',' + yi + '\n')
        f.close()
url = 'https://so.gushiwen.cn/gushi/dushu.aspx'
name = "F:\\课程代码资料\\古诗词\\读书.csv"
# Other category pages that can be scraped the same way:
# Three Hundred Tang Poems
#url="https://so.gushiwen.cn/gushi/tangshi.aspx"
# Three Hundred Classical Poems
#url='https://so.gushiwen.cn/gushi/sanbai.aspx'
# Three Hundred Song Ci
#url='https://so.gushiwen.cn/gushi/songsan.aspx'
# junior-high poems
#url='https://so.gushiwen.cn/gushi/chuzhong.aspx'
# senior-high poems
#url='https://so.gushiwen.cn/gushi/gaozhong.aspx'
# primary-school classical prose
#url='https://so.gushiwen.cn/wenyan/xiaowen.aspx'
# junior-high classical prose
#url='https://so.gushiwen.cn/wenyan/chuwen.aspx'
# senior-high classical prose
#url='https://so.gushiwen.cn/wenyan/gaowen.aspx'
# landscape poems
#url='https://so.gushiwen.cn/gushi/xiejing.aspx'
link = get_link(url)
print(len(link))
get_intent(name, url, link)
Scraping poet information and images
import re
import requests
from bs4 import BeautifulSoup
headers = {
    'Referer': 'https://so.gushiwen.cn/gushi/tangshi.aspx',
    'If-Modified-Since': "Thu, 25 Feb 2021 09:04:01 GMT",
    'If-None-Match': 'W/"f0c8f32855bd71:0"',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54'
}  # browser-like headers to avoid basic anti-scraping checks
def get_urls(url):
    # collect (poet name, portrait URL) pairs from the author-list page
    res = requests.get(url=url, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.content, 'html.parser')
    new = soup.find_all('div', class_="divimg")
    soup = BeautifulSoup(str(new), 'html.parser')
    new = soup.find_all('img')
    new = str(new)
    pattern = '<img alt="(.*?)" height="150" src="(.*?)" width="105"/>'
    urls = re.findall(pattern, new)
    print(len(urls))
    print(urls)
    return urls
def get_intent(url):
    s = requests.session()
    res = s.get(url, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.content, 'html.parser')
    new = soup.find_all('p', style=' margin:0px;')
    # poet names sit inside <b> tags
    poem = soup.find_all('b')
    poem = str(poem)
    pattern = '<b>(.*?)</b>'
    poem = re.findall(pattern, poem)
    news = []        # biography text
    shiwen_num = []  # number of poems
    mingju_num = []  # number of famous lines
    print(len(new))
    for i in new:
        i = str(i)
        i = i.replace('<p style=" margin:0px;">', '')
        # try the layouts in turn: both counts, famous lines only, poems only
        pattern = '(.*?)<a href="(.*?)" target="_blank">► (.*?)篇诗文</a>\u3000<a href="(.*?)" target="_blank">► (.*?)条名句</a></p>'
        newi = re.findall(pattern, i)
        if len(newi) == 0:
            pattern = '(.*?)<a href="(.*?)" target="_blank">► (.*?)条名句</a></p>'
            newi = re.findall(pattern, i)
            if len(newi) == 0:
                pattern = '(.*?)<a href="(.*?)" target="_blank">► (.*?)篇诗文</a></p>'
                newi = re.findall(pattern, i)
                if len(newi) == 0:
                    # no counts at all: keep the raw biography text
                    newi = [(i,)]
        news.append(str(newi[0][0]))
        if len(newi[0]) == 5:
            shiwen_num.append(str(newi[0][2]))
            mingju_num.append(str(newi[0][4]))
        elif len(newi[0]) == 3:
            shiwen_num.append('0')
            mingju_num.append(str(newi[0][2]))
        else:
            shiwen_num.append('0')
            mingju_num.append('0')
    length = len(poem)
    print('Poets on this page: ' + str(length))
    f = open("F:\\课程代码资料\\古诗词\\清代诗人.csv", 'a', encoding='utf-8')
    for i in range(length):
        f.write(poem[i] + ',' + news[i] + ',' + shiwen_num[i] + ',' + mingju_num[i] + '\n')
    f.close()
def get_images(urls):
    # download each portrait and save it under the poet's name
    for url in urls:
        print(url[0], url[1])
        file_name = "F://课程代码资料//古诗词//诗词数据//新各朝代诗人图片//隋代//" + url[0] + '.jpg'
        print(file_name)
        response = requests.get(url[1], headers=headers)
        with open(file_name, 'wb') as f:
            f.write(response.content)
for i in range(1, 20):
    print('----------------------' + str(i) + '------------------------------')
    url = 'https://so.gushiwen.cn/authors/default.aspx?p={}&c=隋代'.format(i)
    #get_intent(url)
    urls = get_urls(url)
    get_images(urls)