Scraping arxiv.org papers with Scrapy

A classmate and I want to build a website for searching arxiv.org papers; this is a demo of the crawler.
GitHub repo: https://github.com/Joovo/Arxiv

After putting this post off for far too long, here is the write-up. The Scrapy techniques it covers (a quick scrapy shell example follows the list):

  • scrapy shell to check that an XPath is correct
  • response.xpath().extract() to convert the selection into a list of strings
  • str.strip() to clean the data
  • getting all the text under an XPath node's children
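
For example, a quick session like this (the URL and title XPath are the same ones the spider below uses) lets you try selectors interactively before putting them into parse:

scrapy shell 'https://arxiv.org/list/cs.CV/1801?show=1000'
>>> # text nodes of the title block of the first listing entry
>>> titles = response.xpath('//*[@id="dlpage"]/dl/dd[1]/div/div[1]/text()').extract()
>>> [t.strip() for t in titles if t.strip()]   # cleaned, non-empty strings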

arxiv.org itself is fairly simple to crawl by constructing URLs: each listing URL is built from a year-month code and the number of entries to show per page.
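
As a minimal sketch of that URL scheme (the category, year-month code, and show count are the ones used by the spider below):

# listing URL = category + two-digit year and month + number of entries per page
yy, mm, show = 18, 1, 1000
url = 'https://arxiv.org/list/cs.CV/%02d%02d?show=%d' % (yy, mm, show)
# -> 'https://arxiv.org/list/cs.CV/1801?show=1000'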

python3 -m scrapy startproject Arxiv
cd Arxiv
# quick start a simple spider
scrapy genspider arxiv arxiv.org

# run the spider
scrapy crawl arxiv

With the basic project skeleton in place, edit items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ArxivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    authors = scrapy.Field()
    comments = scrapy.Field()
    subjects = scrapy.Field()

Edit pipelines.py, which saves the scraped items to disk:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ArxivPipeline(object):
    def __init__(self):
        # append items to items.json, one JSON object per line
        self.file = open('./items.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
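
For the pipeline to run it also has to be enabled in settings.py, as the comment above notes. A minimal snippet, assuming the default layout of this Arxiv project:

# settings.py
ITEM_PIPELINES = {
    'Arxiv.pipelines.ArxivPipeline': 300,
}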

Create ./spiders/Arxiv.py:

Arxiv.py subclasses scrapy.Spider; there are a few other base classes meant for subclassing that are worth looking up in the docs, but the only thing that has to be implemented here is the parse method.

  • Import the Item defined above and fill it while parsing the page; what "flows" inside the framework is the Item class.
  • parse works as a generator, yielding either an item or a new Request.
  • Yielded Requests are added to the scheduling queue and processed later.
# -*- coding: utf-8 -*-
import scrapy
from Arxiv.items import ArxivItem
import re


class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    start_urls = ['https://arxiv.org/list/cs.CV/1801?show=1000']

    def parse(self, response):
        self.logger.info('A response from %s just arrived' % response.url)
        # the line stating the total number of entries
        num = response.xpath('//*[@id="dlpage"]/small[1]/text()[1]').extract()[0]
        # get max_index
        max_index = int(re.search(r'\d+', num).group(0))
        for index in range(1, max_index + 1):
            item = ArxivItem()
            # get title and clean data
            title = response.xpath('//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[1]/text()').extract()
            # strip surrounding whitespace
            title = [i.strip() for i in title]
            # drop empty strings
            title = [i for i in title if i != '']
            # insert title
            item['title'] = title[0]

            # author names are the texts of the <a> links in the authors block
            xpath_authors = '//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[2]//a/text()'
            author_list = response.xpath(xpath_authors).getall()
            item['authors'] = ', '.join(author_list)

            # subjects: full text of the second <span> in div[5] of the current entry
            item['subjects'] = response.xpath(
                'string(//*[@id="dlpage"]/dl/dd[' + str(index) + ']/div/div[5]/span[2])').extract_first()
            
            yield item
        # the next url here points to 1802; turning this into a loop crawls all the data (see the sketch below)
        yield scrapy.Request('https://arxiv.org/list/cs.CV/1802?show=1000', callback=self.parse)
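
As that comment says, the hard-coded 1802 request can be replaced by a loop. One way, sketched here, is to override start_requests (a standard scrapy.Spider hook) inside the class and drop the final yield; the month range is only an example:

    def start_requests(self):
        # crawl every month of 2018: '18%02d' builds 1801, 1802, ..., 1812
        for month in range(1, 13):
            url = 'https://arxiv.org/list/cs.CV/18%02d?show=1000' % month
            yield scrapy.Request(url, callback=self.parse)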

items.json (sample output):

{"title": "Deep Reinforcement Learning for Unsupervised Video Summarization with  Diversity-Representativeness Reward", "authors": "Kaiyang Zhou, Kaiyang Zhou, Kaiyang Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Deformable GANs for Pose-based Human Image Generation", "authors": "Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin, Aliaksandr Siarohin", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Face Synthesis from Visual Attributes via Sketch using Conditional VAEs  and GANs", "authors": "Xing Di, Xing Di", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A PDE-based log-agnostic illumination correction algorithm", "authors": "U. A. Nnolim", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Real-time and Registration-free Framework for Dynamic Shape  Instantiation", "authors": "Xiao-Yun Zhou, Xiao-Yun Zhou, Xiao-Yun Zhou", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Fractional Local Neighborhood Intensity Pattern for Image Retrieval  using Genetic Algorithm", "authors": "Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya, Avirup Bhattacharyya", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "A Unified Method for First and Third Person Action Recognition", "authors": "Ali Javidani, Ali Javidani", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Integrating semi-supervised label propagation and random forests for  multi-atlas based hippocampus segmentation", "authors": "Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Transfer learning for diagnosis of congenital abnormalities of the  kidney and urinary tract in children based on Ultrasound imaging data", "authors": "Qiang Zheng, Qiang Zheng, Qiang Zheng", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}
{"title": "Context aware saliency map generation using semantic segmentation", "authors": "Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi, Mahdi Ahmadi", "subjects": "Computer Vision and Pattern Recognition (cs.CV)"}

The site has been updated since and this code has aged. Reader @一念逍遥、 pointed out that the authors part needed a correction; that issue has been fixed, and the rest is left unchanged.
