arXiv paper crawler

README

Arxiv Interesting Papers Crawler

Description:

A customized personal paper downloader for arXiv:
users can download the papers they are interested in.

The download time range:

Users can choose to download the papers of interest daily, from the past week,
or from a given month of a given year (e.g. 2020.04). query_word: 'recent' (daily),
'pastweek' (past week), '2004' (2020.04).
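As a sketch of how query_word selects a listing page (mirroring the URL construction in the crawler's __init__ below; the helper name here is hypothetical), the value is simply appended to the domain listing path:

```python
# Sketch: build the arXiv listing URL from a domain and a query_word,
# mirroring the string concatenation used in the crawler's __init__.
def listing_url(domain: str = 'cs.CV/', query_word: str = 'recent') -> str:
    """Return the arXiv listing URL for a domain and time range."""
    return 'https://arxiv.org/' + 'list/' + domain + query_word

print(listing_url('cs.CV/', 'recent'))    # daily listing
print(listing_url('cs.CV/', 'pastweek'))  # past-week listing
print(listing_url('cs.LG/', '2004'))      # April 2020 listing
```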

The download mode:

If the user chooses to download the papers of interest daily, the mode must be 'daily';
otherwise, the mode must be 'all'. query_mode: 'all', 'daily'.

The download root directory:

download_root_dir: e.g. '/Users/zhangzilong/Desktop/arxiv/' (needs to be changed)

The download domain:

domain: Computer Vision and Pattern Recognition 'cs.CV/', Machine Learning 'cs.LG/'. Default: CV.

The customized keywords:

A paper is selected when any of these keywords appears in its title.
key_words: 'self-supervised', 'contrastive learning', 'anomaly detection',
'novelty detection', 'representation learning', 'out-of-distribution'. (Needs to be changed)
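A minimal sketch of this title filter (the function name is hypothetical; the source code for the matching step is not shown in full): a paper matches when any configured keyword appears in its title, compared case-insensitively.

```python
# Hypothetical helper sketching the title filter: a paper matches when
# any configured keyword appears in its title (case-insensitive).
def title_matches(title: str, key_words) -> bool:
    title_lower = title.lower()
    return any(kw.lower() in title_lower for kw in key_words)

key_words = ['self-supervised', 'contrastive learning', 'anomaly detection']
print(title_matches('A Self-Supervised Approach to X', key_words))  # True
print(title_matches('A Survey of Graph Networks', key_words))       # False
```

The same substring check can be reused for key_words_conference against a paper's comments field.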

The customized conference keywords:

A paper is selected when one of these conference names appears in the comments of the submission.
key_words_conference: 'ICLR', 'CVPR', 'ICML', 'ICCV'.


Usage: to download the papers of interest daily, run:

python3 main_arxiv.py recent daily

Code

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib.parse
import urllib.request
import lxml
from bs4 import BeautifulSoup
import re
import ssl
import time
import os
"""
1. add comments

2. add daily
"""


class main_arxiv(object):

    def __init__(self, query_word: str, domain='cs.CV/', query_mode='all',
                 key_words=['deep learning'],     # keywords (change as needed)
                 key_words_conference=['PST'],    # conferences or journals (change as needed)
                 download_root_dir=r'D:\data\Arxiv-paper-crawler-daily-master\results'):   # save path for downloaded files
        """query_word: month_year, recent, pastweek"""
        self.original_url = 'https://arxiv.org/'
        self.domain_url = self.original_url + 'list/' + domain + query_word
        assert query_mode in ('all', 'daily'), 'please input a correct query mode (all, daily)'
        self.query_mode = query_mode
        self.headers = {
            'User-Agent':
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
        }
        self.key_words = key_words
        self.key_words_conference = key_words_conference
        self.root_dir = download_root_dir
        current_time = time.strftime("%Y-%m-%d", time.localtime())  # date format assumed; the source listing is truncated at this point
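The rest of the class body is cut off in the listing above. As a hedged, self-contained sketch of the crawl step the README describes (the listing-page HTML structure and the filter_titles helper are assumptions, not the author's code), titles on an arXiv /list/ page can be extracted with BeautifulSoup and filtered against key_words:

```python
from bs4 import BeautifulSoup

# Sketch (assumed listing-page structure): extract paper titles from an
# arXiv /list/ page and keep those matching the configured keywords.
def filter_titles(html: str, key_words):
    soup = BeautifulSoup(html, 'html.parser')
    matched = []
    for div in soup.find_all('div', class_='list-title'):
        # arXiv prefixes each title with a "Title:" label; strip it off.
        title = div.get_text(strip=True).replace('Title:', '').strip()
        if any(kw.lower() in title.lower() for kw in key_words):
            matched.append(title)
    return matched

# Tiny hand-written fragment mimicking the assumed listing markup.
sample = '''
<dd><div class="list-title mathjax">Title: Contrastive Learning for X</div></dd>
<dd><div class="list-title mathjax">Title: A Survey of Y</div></dd>
'''
print(filter_titles(sample, ['contrastive learning']))
```

The real crawler would feed this the HTML fetched from the listing URL (via urllib.request with self.headers) and then download the matching PDFs under download_root_dir.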