readme
Arxiv Interesting Papers Crawler
Description:
A customized personal paper crawler for arXiv:
users can download the papers they are interested in.
The time range of the download:
The user can choose to download interesting papers daily, from the past week,
or from a certain month of a certain year (e.g. 2020.04). query_word: 'recent' (daily),
'pastweek' (past week), '2004' (2020.04).
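The monthly form of query_word follows arXiv's two-digit-year plus two-digit-month convention ('2004' for 2020.04). It can be derived from a date with a small helper; `monthly_query_word` is a name chosen here for illustration, not part of the crawler:

```python
from datetime import date

def monthly_query_word(year: int, month: int) -> str:
    """Encode a year/month as an arXiv listing slug, e.g. 2020.04 -> '2004'."""
    return date(year, month, 1).strftime("%y%m")

print(monthly_query_word(2020, 4))   # -> 2004
print(monthly_query_word(2021, 12))  # -> 2112
```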
The download mode:
If the user chooses to download the interesting papers daily, the mode must be 'daily'; otherwise,
the mode must be 'all'. query_mode: 'all', 'daily'.
The download root directory:
download_root_dir: e.g. '/Users/zhangzilong/Desktop/arxiv/' (needs to be reset)
The download domain:
domain: Computer Vision and Pattern Recognition 'cs.CV/', Machine Learning 'cs.LG/'; default: CV.
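The domain and query_word are simply concatenated onto the arXiv base URL, mirroring the `self.domain_url` construction in `main_arxiv.__init__` below:

```python
def listing_url(domain: str = "cs.CV/", query_word: str = "recent") -> str:
    """Build the arXiv listing URL the same way the constructor does."""
    return "https://arxiv.org/" + "list/" + domain + query_word

print(listing_url("cs.CV/", "recent"))  # -> https://arxiv.org/list/cs.CV/recent
print(listing_url("cs.LG/", "2004"))    # -> https://arxiv.org/list/cs.LG/2004
```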
The customized keywords:
Keywords of interest that appear in the title of the paper. key_words: 'self-supervised', 'contrastive learning', 'anomaly detection',
'novelty detection', 'representation learning', 'out-of-distribution'. (needs to be reset)
The customized conference keywords:
Conference keywords that appear in the comments of the submitted paper.
key_words_conference: 'ICLR', 'CVPR', 'ICML', 'ICCV'.
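The matching logic itself is not shown in the snippet below, but keyword filtering on titles presumably amounts to a case-insensitive substring check. A minimal sketch under that assumption; `title_matches` is a hypothetical helper, not a function of the crawler:

```python
def title_matches(title: str, key_words: list) -> bool:
    """Return True if any keyword occurs in the title (case-insensitive).
    Hypothetical helper illustrating the filtering described above."""
    lowered = title.lower()
    return any(kw.lower() in lowered for kw in key_words)

key_words = ["self-supervised", "contrastive learning", "anomaly detection"]
print(title_matches("A Contrastive Learning Approach to Detection", key_words))  # -> True
print(title_matches("Efficient Transformers: A Survey", key_words))              # -> False
```

The same check could be applied to the comments field against key_words_conference.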
Usage: to download the papers of interest daily, run:
python3 main_arxiv.py
recent
daily
Code
#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib.parse
import urllib.request
import lxml
from bs4 import BeautifulSoup
import re
import ssl
import time
import os
"""
1. add comments
2. add daily
"""
class main_arxiv(object):
def __init__(self, query_word: str, domain='cs.CV/', query_mode='all',
key_words=['deep learning'],  # keywords of interest (need to reset)
key_words_conference=['PST'],  # conference or journal keywords (need to reset)
download_root_dir=r'D:\data\Arxiv-paper-crawler-daily-master\results'):  # save path for crawled files
"""query_word: month_year, recent, pastweek"""
self.original_url = 'https://arxiv.org/'
self.domain_url = self.original_url + 'list/' + domain + query_word
assert query_mode in ('all', 'daily'), 'please input a correct query mode (all, daily)'
self.query_mode = query_mode
self.headers = {
'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}
self.key_words = key_words
self.key_words_conference = key_words_conference
self.root_dir = download_root_dir
current_time = time.strftime("