1. Background
When studying or working, we sometimes need to find related papers. Usually we go straight to Baidu Scholar or Google Scholar, run a search, and then page through the results looking for papers that might be useful. But besides the titles, every results page contains a lot of distracting information that slows the search down. For this purpose the most useful piece of information is the paper title, so I wrote a small crawler that downloads all the paper titles from the first few result pages, making it quicker to spot useful papers.
2. Code
import requests
import re


def open_url(url):
    # Fetch one search-result page and return its HTML text
    req = requests.get(url)
    req.encoding = 'utf-8'
    return req.text


def findapp_url(url):
    # Extract the paper titles from the page: each title sits in an
    # <a ... target="_blank"> link, with the search term highlighted by <em> tags
    html = open_url(url)
    list_name = re.findall(r'.+title.+target="_blank">(.+em.+)?</a>', html)
    return list_name


def saveapp(url):
    # Fetch the titles, retrying a few times if the page comes back empty
    list_name = findapp_url(url)
    retry = 0
    while not list_name and retry <= 3:
        list_name = findapp_url(url)
        retry += 1
    for each in list_name:
        # Strip the <em> highlight tags, then print the title and append it to a text file
        each = each.replace('<em>', '')
        each = each.replace('</em>', '')
        print(each + '\n')
        with open(r'C:\Users\USER\Desktop\result.txt', 'a', encoding='utf-8') as f:
            f.write(each + '\n')


def downloadapp(search_content, num_page):
    '''
    :param search_content: search term
    :param num_page: number of result pages whose paper titles are printed and saved
    :return:
    '''
    for i in range(1, num_page + 1):
        print('Page ' + str(i))
        # Baidu Scholar paginates 10 results per page through the pn parameter
        url = 'http://xueshu.baidu.com/s?wd=%s&pn=%s' % (search_content, str((i - 1) * 10))
        saveapp(url)


if __name__ == '__main__':
    downloadapp(search_content='推荐算法', num_page=10)  # search term: "recommendation algorithms"
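To make the extraction step concrete, here is a minimal, self-contained sketch of what findapp_url and saveapp do to a single result item. The HTML fragment below is made up for illustration and only mimics the shape the regular expression targets; it is not a captured Baidu Scholar response.

import re

# Hypothetical result item: the title link is an <a ... target="_blank"> tag and
# the matched search term is wrapped in <em> highlight tags
sample_html = '<a href="/paper/abc" data-click="title" target="_blank">基于协同过滤的<em>推荐算法</em>研究</a>'

# Same pattern as findapp_url: capture whatever sits between target="_blank"> and </a>
titles = re.findall(r'.+title.+target="_blank">(.+em.+)?</a>', sample_html)
print(titles)   # ['基于协同过滤的<em>推荐算法</em>研究']

# saveapp then strips the highlight tags before printing and writing the title
clean = titles[0].replace('<em>', '').replace('</em>', '')
print(clean)    # 基于协同过滤的推荐算法研究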
It is very simple to use: just specify the search term search_content and how many result pages of paper titles to print (num_page).
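If a result page occasionally comes back empty (which is what the retry loop in saveapp guards against), it can help to send a browser-style User-Agent with each request. Below is a minimal sketch of an alternative open_url, assuming requests is used as above; the header string is just an example browser identifier, not something required by Baidu Scholar.

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) '
                  'AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
}

def open_url(url):
    # Same as the version above, but identifies itself as a regular browser
    req = requests.get(url, headers=HEADERS)
    req.encoding = 'utf-8'
    return req.text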