Scraping arxiv.org papers with Scrapy

A classmate and I want to build a website for searching arxiv.org papers; this is a demo of the crawler.
GitHub repo: https://github.com/Joovo/Arxiv

After putting this post off for far too long, here is the write-up. The Scrapy techniques it covers (a quick scrapy shell example follows the list):

  • scrapy shell to check that an XPath is correct
  • response.xpath().extract() to convert the selection into a list of strings
  • str.strip() to clean the data
  • getting all the text under an XPath node's children
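
For example, a quick session like this (the URL and title XPath are the same ones the spider below uses) lets you try selectors interactively before putting them into parse:

scrapy shell 'https://arxiv.org/list/cs.CV/1801?show=1000'
>>> # text nodes of the title block of the first listing entry
>>> titles = response.xpath('//*[@id="dlpage"]/dl/dd[1]/div/div[1]/text()').extract()
>>> [t.strip() for t in titles if t.strip()]   # cleaned, non-empty strings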

arxiv.org itself is fairly simple to crawl by constructing URLs: each listing URL is built from a year-month code and the number of entries to show per page.
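
As a minimal sketch of that URL scheme (the category, year-month code, and show count are the ones used by the spider below):

# listing URL = category + two-digit year and month + number of entries per page
yy, mm, show = 18, 1, 1000
url = 'https://arxiv.org/list/cs.CV/%02d%02d?show=%d' % (yy, mm, show)
# -> 'https://arxiv.org/list/cs.CV/1801?show=1000'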

python3 -m scrapy startproject Arxiv
cd Arxiv
# quick start a simple spider
scrapy genspider arxiv arxiv.org

# run the spider
scrapy crawl arxiv

With the basic project skeleton in place, edit items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ArxivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    authors = scrapy.Field()
    comments = scrapy.Field()
    subjects = scrapy.Field()

Edit pipelines.py, which saves the scraped items to disk:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ArxivPipeline(object):
    def __init__(self):
        # append items to items.json, one JSON object per line
        self.file = open('./items.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
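
For the pipeline to run it also has to be enabled in settings.py, as the comment above notes. A minimal snippet, assuming the default layout of this Arxiv project:

# settings.py
ITEM_PIPELINES = {
    'Arxiv.pipelines.ArxivPipeline': 300,
}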

Create ./spiders/Arxiv.py:

Arxiv.py subclasses scrapy.Spider; there are a few other base classes meant for subclassing that are worth looking up in the docs, but the only thing that has to be implemented here is the parse method.

  • Import the Item defined above and fill it while parsing the page; what "flows" inside the framework is the Item class.
  • parse works as a generator, yielding either an item or a new Request.
  • Yielded Requests are added to the scheduling queue and processed later.
# -*- coding: utf-8 -*-
import scrapy
from Arxiv.items import ArxivItem
import re


class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    start_urls = ['https://arxiv.org/list/cs.CV/1801?show=1000']

    def parse(self, response):
        self.logger.info('A response from %s just arrived' % response.url)
        # the line stating the total number of entries
        num = response.xpath('//*[@id="dlpage"]/small[1]/text()[1]').extract()[0]
        # get max_index
        max_index = int(re.search(r'\d+', num).group(0))
        for index in range(1, max_index + 1):
            item = ArxivItem()
            # get title and clean data
            title = response.xpath('//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[1]/text()').extract()
            # strip surrounding whitespace
            title = [i.strip() for i in title]
            # drop empty strings
            title = [i for i in title if i != '']
            # insert title
            item['title'] = title[0]

            # author names are the texts of the <a> links in the authors block
            xpath_authors = '//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[2]//a/text()'
            author_list = response.xpath(xpath_authors).getall()
            item['authors'] = ', '.join(author_list)

            # subjects: full text of the second <span> in div[5] of the current entry
            item['subjects'] = response.xpath(
                'string(//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[5]/span[2])').extract_first()
            
            yield item
        # the next url here points to 1802; turning this into a loop crawls all the data (see the sketch below)
        yield scrapy.Request('https://arxiv.org/list/cs.CV/1802?show=1000', callback=self.parse)
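
As that comment says, the hard-coded 1802 request can be replaced by a loop. One way, sketched here, is to override start_requests (a standard scrapy.Spider hook) inside the class and drop the final yield; the month range is only an example:

    def start_requests(self):
        # crawl every month of 2018: '18%02d' builds 1801, 1802, ..., 1812
        for month in range(1, 13):
            url = 'https://arxiv.org/list/cs.CV/18%02d?show=1000' % month
            yield scrapy.Request(url, callback=self.parse)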

items.json (sample output):

{"title": "Deep Reinforcement Learning for Unsupervised Video Summarization with  Diversity-Representativeness Reward", "authors": "Kaiyang Zhou, Kaiyang Zhou, Kaiyang Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Deformable GANs for Pose-based Human Image Generation", "authors": "Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Face Synthesis from Visual Attributes via Sketch using Conditional VAEs  and GANs", "authors": "Xing Di, Xing Di", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A PDE-based log-agnostic illumination correction algorithm", "authors": "U. A. Nnolim", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Real-time and Registration-free Framework for Dynamic Shape  Instantiation", "authors": "Xiao-Yun Zhou, Xiao-Yun Zhou, Xiao-Yun Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Fractional Local Neighborhood Intensity Pattern for Image Retrieval  using Genetic Algorithm", "authors": "Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Unified Method for First and Third Person Action Recognition", "authors": "Ali Javidani, Ali Javidani", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Integrating semi-supervised label propagation and random forests for  multi-atlas based hippocampus segmentation", "authors": "Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Transfer learning for diagnosis of congenital abnormalities of the  kidney and urinary tract in children based on Ultrasound imaging data", "authors": "Qiang Zheng, Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Context aware saliency map generation using semantic segmentation", "authors": "Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}

The site has been updated since and this code has aged. Reader @一念逍遥、 pointed out that the authors part needed a correction; that issue has been fixed, and the rest is left unchanged.
