A Python script for bulk-downloading arXiv paper data

arXiv-tools

Prerequisites

arXiv provides bulk data access through Amazon S3. You need an Amazon AWS account to download the data. These instructions were written for Python 2; s3cmd 2.0 and later also runs on Python 3.

Downloading arXiv documents

1- Install s3cmd, a command-line tool for interacting with S3:

pip install s3cmd (releases before 2.0 only work on Python 2)

2- Configure s3cmd with the credentials (access key and secret key) found in the account management tab of the Amazon AWS website:

s3cmd --configure

3- Get the manifest files:

The complete set of arXiv files is available from Amazon S3 in requester-pays buckets, so the AWS account that downloads the data pays the transfer cost. The files are .tar archives of about 500 MB each. To download these chunks you need their keys, which are listed in the manifest files. Download the manifests first:

For PDF documents:

s3cmd get --requester-pays s3://arxiv/pdf/arXiv_pdf_manifest.xml local-directory/arXiv_pdf_manifest.xml

For source documents:

s3cmd get --requester-pays s3://arxiv/src/arXiv_src_manifest.xml local-directory/arXiv_src_manifest.xml
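
Before starting a paid download it can help to see how many chunks a manifest lists. The following is a minimal sketch, not part of arXiv-tools: it reads the same `filename` tag that download.py uses, and additionally sums a `size` tag, which is an assumption about the manifest layout (it is skipped when absent). The function name `summarize_manifest` and the sample manifest are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def summarize_manifest(source):
    """Return (list of S3 keys, total size in bytes) for a manifest.

    `source` is a path or a binary file-like object. The `filename` tag is
    the one download.py reads; the `size` tag is an assumption about the
    manifest layout and is ignored when missing or non-numeric.
    """
    keys = []
    total_bytes = 0
    for _, elem in ET.iterparse(source):  # default: 'end' events only
        text = (elem.text or '').strip()
        if elem.tag == 'filename' and text:
            keys.append(text)
        elif elem.tag == 'size' and text.isdigit():
            total_bytes += int(text)
    return keys, total_bytes


if __name__ == '__main__':
    # Hypothetical two-entry manifest, used only to exercise the function.
    sample = b"""<manifest>
      <file><filename>pdf/arXiv_pdf_0001_001.tar</filename><size>100</size></file>
      <file><filename>pdf/arXiv_pdf_0001_002.tar</filename><size>250</size></file>
    </manifest>"""
    keys, total_bytes = summarize_manifest(io.BytesIO(sample))
    print('%d chunks, %d bytes' % (len(keys), total_bytes))
```

For a real manifest, pass the local path instead, for example `summarize_manifest('local-directory/arXiv_pdf_manifest.xml')`.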

4- Download the actual pdf and source files using the download.py script

Download pdf files:

python download.py --manifest_file /path/to/pdf-manifest --mode pdf --output_dir /path/to/output

Download source files:

python download.py --manifest_file /path/to/src-manifest --mode src --output_dir /path/to/output

This downloads every chunk listed in the manifest into a pdf or src subdirectory of the output directory, and appends each key to the log file (processed.txt by default).
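
Since each chunk is a .tar archive, a quick way to check a finished download is to list a few of its members. This is a minimal sketch using the standard `tarfile` module; the helper name `list_chunk` and the example path are hypothetical, and the internal layout of the archives is not assumed.

```python
import tarfile


def list_chunk(tar_path, limit=10):
    """Return up to `limit` regular-file member names from a .tar chunk."""
    names = []
    with tarfile.open(tar_path, 'r') as tar:
        for member in tar:
            if member.isfile():  # skip directory entries
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names
```

For example, `list_chunk('/path/to/output/pdf/some_chunk.tar')` returns the first ten file names in that archive, or raises `tarfile.ReadError` if the download is truncated or is not a tar file.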

If you also need the metadata, use metha to bulk download it.

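import os
import subprocess
import traceback
import xml.etree.ElementTree as ET


def get_file(fname, out_dir):
  """Download one key from the requester-pays arXiv bucket with s3cmd."""
  cmd = ['s3cmd', 'get', '--requester-pays',
         's3://arxiv/%s' % fname, out_dir]
  print(' '.join(cmd))
  # Pass the argument list directly (no shell) so that paths containing
  # spaces or shell metacharacters are handled safely.
  return subprocess.call(cmd)


def main(**args):
  manifest_file = args['manifest_file']
  mode = args['mode']
  out_dir = args['output_dir']
  log_file_path = args['log_file']
  if mode not in ('pdf', 'src'):
    print('mode should be "pdf" or "src"')
    return

  # Make sure the target directory exists before calling s3cmd; keep the
  # trailing separator, as the original command did.
  target_dir = os.path.join(out_dir, mode, '')
  if not os.path.isdir(target_dir):
    os.makedirs(target_dir)

  with open(log_file_path, 'a') as log_file:
    try:
      # iterparse yields only 'end' events by default.
      for _, elem in ET.iterparse(manifest_file):
        if elem.tag == 'filename':
          fname = elem.text
          # Log a key only when s3cmd exits successfully.
          if get_file(fname, target_dir) == 0:
            log_file.write(fname + '\n')
    except Exception:
      traceback.print_exc()

  print('Finished')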


if __name__ == '__main__':
  from argparse import ArgumentParser
  ap = ArgumentParser()
  ap.add_argument('--manifest_file', '-m', type=str, required=True,
                  help='The manifest file for downloading from arxiv. Obtain it with '
                       '`s3cmd get --requester-pays s3://arxiv/pdf/arXiv_pdf_manifest.xml`')
  ap.add_argument('--output_dir', '-o', type=str, default='data', help='the output directory')
  ap.add_argument('--mode', type=str, default='src', choices=('pdf', 'src'),
                  help='can be "pdf" or "src"')
  ap.add_argument('--log_file', default='processed.txt', help='a file that logs the processed tar files')
  args = ap.parse_args()
  main(**vars(args))
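
download.py appends each key to the log file but never reads it back, so an interrupted run starts again from the first chunk. One possible way to resume is to filter the manifest keys against the log before downloading. This is a sketch under that assumption, not part of the original script; `pending_keys` is a hypothetical helper and expects the log to hold one key per line, as download.py writes it.

```python
def pending_keys(manifest_keys, log_path):
    """Return the manifest keys that are not yet listed in the log file."""
    try:
        with open(log_path) as log_file:
            done = set(line.strip() for line in log_file if line.strip())
    except IOError:
        # No log file yet, so nothing has been processed.
        done = set()
    # Preserve manifest order so chunks are still fetched sequentially.
    return [key for key in manifest_keys if key not in done]
```

Calling `pending_keys(keys, 'processed.txt')` with the keys read from the manifest gives the list that still needs to be passed to s3cmd.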