Scraping Historical Article Titles and Abstracts from the CNKI Management Science Journal

Latest recommended article published on 2024-04-01 15:46:49

Kiss--The--Rain



Article link: https://blog.csdn.net/weixin_39287576/article/details/88524938


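This code implements a crawler that scrapes article titles, abstracts, authors, affiliations, and funding information for the Management Science journal on CNKI (中国知网) from 1994 to 2018, and saves the data to an Excel spreadsheet.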

(Summary generated automatically by CSDN.)

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
import time

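def getHTMLText(url):
    """Fetch a page and return its decoded text."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        # On a failed request, return a sentinel string
        # ("产生异常" means "an exception occurred").
        return "产生异常"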
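
One consequence of getHTMLText returning a sentinel string is that a caller cannot easily tell a failed request from real page content. The sketch below shows one possible alternative that returns None on failure. The name `fetch_html_or_none` and its `get` parameter are hypothetical and not part of the original post; `get` exists only so the function can be exercised without network access.

```python
from typing import Callable, Optional

import requests


def fetch_html_or_none(
    url: str,
    timeout: float = 30,
    get: Callable = requests.get,  # hypothetical hook, injectable for testing
) -> Optional[str]:
    """Return the decoded page text, or None if the request fails."""
    try:
        r = get(url, timeout=timeout)
        r.raise_for_status()  # HTTPError is a RequestException subclass
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None
```

A caller can then write `html = fetch_html_or_none(url)` and skip the page when `html is None`, instead of passing an error message to BeautifulSoup.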

def infoWrite(sheet, soup, count):  # shared code factored out
    # NOTE: raises AttributeError if the page has no <div class="orgn">.
    orgn = soup.find('div', {'class': 'orgn'}).text
    fund = soup.find('label', {'id': 'catalog_FUND'})
    if fund is None:  # no funding label on this page
        fund = ""
    else:
        fund = fund.parent.text
        fund = fund.replace(' ', '')  # strip stray spaces from the text
        fund = fund.replace('\n', '')
    keywd = soup.find('label', {'id': 'catalog_KEYWORD'})
    if keywd is None:  # no keyword label on this page
        keywd = ""
    else:
        keywd = keywd.parent.text
        keywd = keywd.replace(' ', '')  # strip stray spaces from the text
        keywd = keywd.replace('\n', '')
    print(keywd)
    # Columns 4-6 hold the affiliation, funding, and keywords.
    sheet.cell(row=count, column=4).value = orgn
    sheet.cell(row=count, column=5).value = fund
    sheet.cell(row=count, column=6).value = keywd
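
infoWrite removes only ASCII spaces and newlines, so tabs, carriage returns, and full-width spaces (U+3000) would remain in the cell text. If removing every kind of whitespace is acceptable, a minimal sketch could look like the following. The helper name `strip_all_whitespace` and the sample string are made up for illustration and do not come from the original post or from CNKI.

```python
def strip_all_whitespace(text: str) -> str:
    """Remove every whitespace character, including tabs and U+3000."""
    # str.split() with no arguments splits on any run of Unicode whitespace,
    # so joining the pieces drops all of it.
    return "".join(text.split())


if __name__ == "__main__":
    # Made-up sample text, only to show the effect.
    raw = "基金：\r\n\t国家自然科学基金\u3000资助项目;\n"
    print(strip_all_whitespace(raw))
```

This suits Chinese text, where spaces rarely carry meaning. For English affiliations, collapsing runs of whitespace to a single space with `" ".join(text.split())` is probably the better choice.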

def getJournalInfos(start_url, end_url, sheet, count, book, path):  # crawl 1994-2001
    for i in range(1994, 2002):
        if i in [1996, 1997, 1998]:  # 1996-1998 use a different page format
            for j in range(1, 5):  # 1994-2001 have only 4 months (issues) per year
                month = '0' + str(j) if len(str(j)) == 1 else str(j)  # zero-pad: 01, 02, ...
                for k in range(19):
                    num = '.00' + str(k) if len(str(k)) == 1 else '.0' + str(k)  # zero-pad to three digits: .000, .001, ...
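
The two zero-padding expressions above can also be written with Python's format specification mini-language, which may be easier to read. The sketch below assumes the same ranges as the listing (`j` from 1 to 4, `k` from 0 to 18); the function names `issue_code` and `article_suffix` are hypothetical.

```python
def issue_code(j: int) -> str:
    """Two-digit issue number, e.g. 1 -> '01'."""
    return f"{j:02d}"


def article_suffix(k: int) -> str:
    """Dot plus three-digit article index, e.g. 7 -> '.007', 18 -> '.018'."""
    return f".{k:03d}"


if __name__ == "__main__":
    print([issue_code(j) for j in range(1, 5)])     # ['01', '02', '03', '04']
    print([article_suffix(k) for k in (0, 9, 18)])  # ['.000', '.009', '.018']
```

Within those ranges the results match the original conditional expressions. The two would diverge only for `k` of 100 or more, which the listing never reaches.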
