[酱浦菌 Crawler Project] Scraping the Full Text of Macroeconomics Papers from 学术堂

Latest recommended article published on 2024-05-21 00:00:00

学IC的酱浦菌

Category column: Python crawler projects
Tags: crawler, python

Article link: https://blog.csdn.net/Yuxin_007/article/details/138313865

Preface

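The complete script comes first, followed by the step-by-step version written in Jupyter. I actually developed the code in Jupyter, because the workflow feels smoother there and the output of every step is preserved, which makes the process easier to study and understand.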

Complete code:

import re

import parsel
import requests

url = 'http://www.xueshut.com/bijiaojj/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
}


def fetch_html(page_url):
    """Download a page and return its HTML decoded as GBK."""
    response = requests.get(url=page_url, headers=headers, timeout=10)
    # Assumes requests fell back to ISO-8859-1 for response.text: re-encoding
    # recovers the raw bytes, which are then decoded as GBK.
    return response.text.encode('iso-8859-1').decode('gbk')


selector = parsel.Selector(fetch_html(url))
# Each p.title element on the list page holds the link to one paper.
items = selector.css('div.wz_liebiao ul li p.title')
print('Starting import')
print('--------------------')
for item in items:
    title = item.css('a::attr(title)').get()
    href = item.css('a::attr(href)').get()
    if not title or not href:
        continue  # skip entries without a usable title or link
    selector_lunwen = parsel.Selector(fetch_html(href))
    # title::text returns the page title without the surrounding <title> tags;
    # get(default='') avoids a TypeError when an element is missing.
    title_lunwen = selector_lunwen.css('title::text').get(default='')
    keywords_lunwen = selector_lunwen.css('meta[name=keywords]::attr(content)').get(default='')
    content_lunwen = selector_lunwen.css('meta[name=description]::attr(content)').get(default='')
    print(f'Downloading: {title}')
    # Replace characters that are not allowed in file names.
    filename = re.sub(r'[\\/:*?"<>|]', '_', title)
    # Mode 'a' creates the file when it does not exist, so no os.path.exists
    # check is needed and the header is written only once per run.
    with open(filename, 'a', encoding='utf-8') as f:
        f.write('\n' + title_lunwen)
        f.write('\nPaper keywords: ' + keywords_lunwen)
        f.write('\nMain content of the paper: ' + content_lunwen)
        # Append the text of every <p> element as the body of the paper.
        for paragraph in selector_lunwen.css('p::text'):
            f.write('\n' + paragraph.get())
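
The least obvious step in the script is the expression `response.text.encode('iso-8859-1').decode('gbk')`. The sketch below reproduces it offline, under the assumption that the site sends GBK bytes and that requests decoded them as ISO-8859-1 because no charset was declared. The helper name `fix_mojibake` and the sample string are hypothetical and only serve the illustration.

```python
def fix_mojibake(text, wrong="iso-8859-1", actual="gbk"):
    """Undo a wrong decode: re-encode with the wrong codec, decode with the real one."""
    # ISO-8859-1 maps every byte 0-255 to exactly one character, so encoding
    # the garbled text with it recovers the original bytes without loss.
    return text.encode(wrong).decode(actual)


sample = "宏观经济学论文"
raw_bytes = sample.encode("gbk")          # bytes as a GBK page would deliver them
garbled = raw_bytes.decode("iso-8859-1")  # what a wrong ISO-8859-1 decode produces
repaired = fix_mojibake(garbled)

print(repaired == sample)                 # the round trip restores the text
print(repaired == raw_bytes.decode("gbk"))  # same result as decoding the bytes directly
```

If this assumption holds, decoding the raw bytes with `response.content.decode('gbk')`, or setting `response.encoding = 'gbk'` before reading `response.text`, should give the same result without the round trip.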