因为有了自建博客,所以以后博文都是英文,不过我的 Chinglish 应该很好理解 = =。
个人博客地址在这里 https://mengjiexu.com/。
文章目录
Motivation
The WRDS-SAS Studio seems to be suspended soon. As a big fan of this cloud platform, I have to find something alternative to make my data extraction from WRDS replicable. Compiled from a series of WRDS guidances (you can find them in the Reference part), this blog will introduce how to exploit WRDS cloud via Python.
The biggest advantage of WRDS Cloud is that you don’t have to download everything to your computer but just deal with the massive data using the computing source of WRDS Cloud. Following this blog, one can quickly establish a python-wrds cloud workflow without checking various manuals.

Connect to WRDS using SSH
Mac OS
MacOS computers come with a utility called Terminal that has SSH functionality built-in that can be used to connect to the WRDS Cloud. You can initiate an SSH connection to the WRDS Cloud as follows:
- Type
ssh wrdsusername@wrds-cloud.wharton.upenn.edu
where wrdsusername is your WRDS username: the same username you use to log onto the WRDS website. - When prompted, enter your WRDS password.
- Once you have connected to the WRDS Cloud, you will be given a prompt – which indicates that the server is ready for your commands.
- You can disconnect from the WRDS Cloud at anytime by typing
logout
or using the key combination CTL-D.

Windows
If you are on Windows, you will need to download and install SSH client software (e.g., PuTTY, WinSCP) to your computer.
Once you have downloaded and installed one of the above, you can initiate an SSH connection to the WRDS Cloud. Suppose you use PuTTY and want to configure PuTTY for SSH:
- In Host Name, enter
wrds-cloud.wharton.upenn.edu
- In Port (skip if there is no Port field), enter
22
- When prompted, enter your WRDS username.
- When prompted, enter your WRDS password.
Open IPython in the Cloud
Interactive Python jobs allow you to run code in serial, and receive a response to each command as you enter it. To run interactive Python jobs, you will need to schedule an interactive job with the WRDS Cloud Grid Engine by entering qrsh
and then enter ipython3
to enter an interactive python enviornment, which is exactly what you get in Jupyter
. Don’t forget to insert pip install wrds
before you connect to the WRDS library list.
[xumj2019@wrds-cloud-login2-w ~]$ qrsh
Last login: Tue Oct 12 14:53:16 2021 from wrds-cloud-login2-w.wharton.private
[xumj2019@wrds-sas7-w ~]$ pip install wrds
Requirement already satisfied: wrds in /usr/local/sas/grid/python2-2.7.17/lib/python2.7/site-packages (3.0.8)
[xumj2019@wrds-sas7-w ~]$ ipython
Python 3.9.5 (default, May 6 2021, 14:32:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.23.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
Initiating a WRDS connection in Python
First, as with every Python program that intends to connect to WRDS, you must import the wrds module. Then you can make the connection and set up a pgpass file.
In [4]: db = wrds.Connection()
Enter your WRDS username [xumj2019]:xumj2019
Enter your password:
WRDS recommends setting up a .pgpass file.
Create .pgpass file now [y/n]?: y
Created .pgpass file successfully.
Loading library list...
Done
Explore WRDS Libraries and Tables
You can analyze the structure of the data through its metadata using the wrds module, as outlined in the following steps:
-
List all available libraries at WRDS using
list_libraries()
-
Select a library to work with, and list all available datasets within that library using
list_tables()
-
Select a dataset, and list all available variables (column headers) within that dataset using
describe_table()
In [5]: db.list_libraries()
Out[5]:
['aha',
'aha_sample',
'ahasamp',
'audit',
'auditsmp',
'auditsmp_all',
'bank',
'blab',
'block',
'block_all',
'boardex',
'boardex_trial',
'boardsmp',
'bvd',
...]
In [7]: db.list_tables(library='taqmsec')
Out[7]:
['cqm_20030910',
'cqm_20030911',
'cqm_20030912',
'cqm_20030915',
'cqm_20030916',
'cqm_20030917',
'cqm_20030918',
'cqm_20030919',
'cqm_20030922',
'cqm_20030923',
'cqm_20030924',
'cqm_20030925',
'cqm_20030926',
'cqm_20030929',
'cqm_20070829',
...]
In [9]: db.describe_table(library='taqmsec', table='cqm_20070822')
The row count will return 0 due to the structure of TAQ
There was a problem with retrievingthe row count: 'NoneType' object is not subscriptable
Approximately 0 rows in taqmsec.cqm_20070822.
Out[9]:
name nullable type
0 date True DATE
1 time_m True TIME
2 ex True VARCHAR(1)
3 sym_root True VARCHAR(6)
4 sym_suffix True VARCHAR(10)
5 bid True NUMERIC
6 bidsiz True NUMERIC
7 ask True NUMERIC
8 asksiz True NUMERIC
9 qu_cond True VARCHAR(1)
10 bidex True VARCHAR(1)
11 askex True VARCHAR(1)
12 qu_seqnum True NUMERIC
13 natbbo_ind True VARCHAR(1)
14 nasdbbo_ind True VARCHAR(1)
15 qu_cancel True VARCHAR(1)
16 qu_source True VARCHAR(1)
Submit Queries
Now that you know how to query the metadata and understand the structure of the data, you are ready to query WRDS data directly. The wrds module provides several methods that are useful in gathering data:
get_table()
- fetches data by matching library and dataset, with the ability to filter using different parameters. This is the easiest method of accessing data.raw_sql()
- executes a SQL query against the specified library and dataset, allowing for highly-granular data queries.get_row_count()
- returns the number of rows in a given dataset.
For parameters, and further explanation of each, use the built-in help: e.g., help(db.get_table)
.
Here are some examples.
In [10]: data = db.get_table(library='djones', table='djdaily', columns=['date', 'dji'], obs=10)
In [11]: data
Out[