Exploit WRDS Cloud via Python-CSDN博客

本文链接：https://blog.csdn.net/weixin_38421869/article/details/120749113

因为有了自建博客，所以以后博文都是英文，不过我的 Chinglish 应该很好理解 = =。
个人博客地址在这里 https://mengjiexu.com/。

文章目录

Motivation

The WRDS-SAS Studio seems to be suspended soon. As a big fan of this cloud platform, I have to find something alternative to make my data extraction from WRDS replicable. Compiled from a series of WRDS guidances (you can find them in the Reference part), this blog will introduce how to exploit WRDS cloud via Python.

The biggest advantage of WRDS Cloud is that you don’t have to download everything to your computer but just deal with the massive data using the computing source of WRDS Cloud. Following this blog, one can quickly establish a python-wrds cloud workflow without checking various manuals.

Figure 1: Suspension of WRDS-SAS Studio

Connect to WRDS using SSH

Mac OS

MacOS computers come with a utility called Terminal that has SSH functionality built-in that can be used to connect to the WRDS Cloud. You can initiate an SSH connection to the WRDS Cloud as follows:

Type ssh wrdsusername@wrds-cloud.wharton.upenn.edu where wrdsusername is your WRDS username: the same username you use to log onto the WRDS website.
When prompted, enter your WRDS password.
Once you have connected to the WRDS Cloud, you will be given a prompt – which indicates that the server is ready for your commands.
You can disconnect from the WRDS Cloud at anytime by typing logout or using the key combination CTL-D.

Figure 2: Connect to WRDS Cloud via SSH

Windows

If you are on Windows, you will need to download and install SSH client software (e.g., PuTTY, WinSCP) to your computer.

Once you have downloaded and installed one of the above, you can initiate an SSH connection to the WRDS Cloud. Suppose you use PuTTY and want to configure PuTTY for SSH:

In Host Name, enter wrds-cloud.wharton.upenn.edu
In Port (skip if there is no Port field), enter 22
When prompted, enter your WRDS username.
When prompted, enter your WRDS password.

Open IPython in the Cloud

Interactive Python jobs allow you to run code in serial, and receive a response to each command as you enter it. To run interactive Python jobs, you will need to schedule an interactive job with the WRDS Cloud Grid Engine by entering qrsh and then enter ipython3 to enter an interactive python enviornment, which is exactly what you get in Jupyter. Don’t forget to insert pip install wrds before you connect to the WRDS library list.

[xumj2019@wrds-cloud-login2-w ~]$ qrsh
Last login: Tue Oct 12 14:53:16 2021 from wrds-cloud-login2-w.wharton.private
[xumj2019@wrds-sas7-w ~]$ pip install wrds
Requirement already satisfied: wrds in /usr/local/sas/grid/python2-2.7.17/lib/python2.7/site-packages (3.0.8)
[xumj2019@wrds-sas7-w ~]$ ipython
Python 3.9.5 (default, May  6 2021, 14:32:00) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.23.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:

Initiating a WRDS connection in Python

First, as with every Python program that intends to connect to WRDS, you must import the wrds module. Then you can make the connection and set up a pgpass file.

In [4]: db = wrds.Connection()
Enter your WRDS username [xumj2019]:xumj2019
Enter your password:
WRDS recommends setting up a .pgpass file.
Create .pgpass file now [y/n]?: y
Created .pgpass file successfully.
Loading library list...
Done

Explore WRDS Libraries and Tables

You can analyze the structure of the data through its metadata using the wrds module, as outlined in the following steps:

List all available libraries at WRDS using list_libraries()
Select a library to work with, and list all available datasets within that library using list_tables()
Select a dataset, and list all available variables (column headers) within that dataset using describe_table()

In [5]: db.list_libraries()
Out[5]: 
['aha',
 'aha_sample',
 'ahasamp',
 'audit',
 'auditsmp',
 'auditsmp_all',
 'bank',
 'blab',
 'block',
 'block_all',
 'boardex',
 'boardex_trial',
 'boardsmp',
 'bvd',
	...]
In [7]: db.list_tables(library='taqmsec')
Out[7]: 
['cqm_20030910',
 'cqm_20030911',
 'cqm_20030912',
 'cqm_20030915',
 'cqm_20030916',
 'cqm_20030917',
 'cqm_20030918',
 'cqm_20030919',
 'cqm_20030922',
 'cqm_20030923',
 'cqm_20030924',
 'cqm_20030925',
 'cqm_20030926',
 'cqm_20030929',
 'cqm_20070829',
 	...]
In [9]: db.describe_table(library='taqmsec', table='cqm_20070822')
The row count will return 0 due to the structure of TAQ
There was a problem with retrievingthe row count: 'NoneType' object is not subscriptable
Approximately 0 rows in taqmsec.cqm_20070822.
Out[9]: 
           name  nullable         type
0          date      True         DATE
1        time_m      True         TIME
2            ex      True   VARCHAR(1)
3      sym_root      True   VARCHAR(6)
4    sym_suffix      True  VARCHAR(10)
5           bid      True      NUMERIC
6        bidsiz      True      NUMERIC
7           ask      True      NUMERIC
8        asksiz      True      NUMERIC
9       qu_cond      True   VARCHAR(1)
10        bidex      True   VARCHAR(1)
11        askex      True   VARCHAR(1)
12    qu_seqnum      True      NUMERIC
13   natbbo_ind      True   VARCHAR(1)
14  nasdbbo_ind      True   VARCHAR(1)
15    qu_cancel      True   VARCHAR(1)
16    qu_source      True   VARCHAR(1)

Submit Queries

Now that you know how to query the metadata and understand the structure of the data, you are ready to query WRDS data directly. The wrds module provides several methods that are useful in gathering data:

get_table() - fetches data by matching library and dataset, with the ability to filter using different parameters. This is the easiest method of accessing data.
raw_sql() - executes a SQL query against the specified library and dataset, allowing for highly-granular data queries.
get_row_count() - returns the number of rows in a given dataset.

For parameters, and further explanation of each, use the built-in help: e.g., help(db.get_table).

Here are some examples.

In [10]: data = db.get_table(library='djones', table='djdaily', columns=['date', 'dji'], obs=10)

In [11]: data
Out[