Regression with Python extensions


http://www.stuartreid.co.za/regression-analysis-using-python-statsmodels-and-quandl/


Regression analysis using Python StatsModels and Quandl

My upgrade to Ubuntu 14.04 was an opportunity to finally move away from Java and onto a scientific Python stack. I'll use this stack to accelerate my data-mining projects and prototype new investment strategies. I'll also be writing computational finance tutorials covering various Python packages and their financial applications.

This tutorial covers regression analysis using the Python StatsModels package with Quandl integration. For motivational purposes, here is what we are working towards: a regression analysis program which receives multiple data-set names from Quandl.com, automatically downloads the data, analyses it, and plots the results in a new window. Actual outputs,

Perform a regression analysis of the past 350 weekly prices of YHOO and GOOG

Regression Analysis Trading

Perform a regression analysis of 5 years of GDP values for the BRICS nations

Regression Analysis Economics


Types of Regression Analysis

Linear regression analysis uses a straight line to capture the linear relationship between two sets X and Y. The regression is constructed by optimizing the parameters of the straight-line function such that the line best fits a sample of (x, y) observations, where y is a variable dependent on the value of x. Regressions are used extensively in economics, risk management, and trading.

Linear Regression Analysis

Non-Linear Regression Analysis

Non-linear regression analysis uses a curved function (not a straight line) to capture the non-linear relationship between two sets X and Y. The regression is often constructed by optimizing the parameters of a higher-order polynomial such that the curve best fits a sample of (x, y) observations. In my previous article, Ten Misconceptions about Neural Networks in Finance and Trading, I explained that a neural network is basically an approximation of a multiple non-linear regression.

The argument between linear vs. non-linear regression analysis in finance still rages on today. The issue with linear models is that they often under-fit and they also assert assumptions onto the variables. The issue with non-linear models is that they often over-fit. That said, training and data-preparation techniques can be used to minimize over-fitting.

Multiple linear regression analysis is used for predicting the values of a set of dependent variables, Y, using two or more sets of independent variables, e.g. X1, X2, ..., Xn. For example, you could try to forecast share prices using one fundamental indicator like the PE ratio, or you could use multiple indicators together like the PE, DY, and DE ratios and the share's EPS.

At this point I would like to digress and point out that there is almost no difference between a multiple linear regression and a perceptron with multiple inputs. Both are calculated as the weighted sum of the input vector plus some constant or bias which is used to shift the function. The only difference is that the input signal into the perceptron is fed into an activation function (normally non-linear) whose output determines to which class the input vector belongs.
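The comparison above can be made concrete with a few lines of Python. This is only a minimal sketch: the input vector, weights, and bias are made-up numbers, and a simple step function stands in for the activation function.

```python
def weighted_sum(inputs, weights, bias):
    """Linear part shared by both models: the weighted sum of the inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def linear_regression_output(inputs, weights, bias):
    """A multiple linear regression returns the weighted sum directly."""
    return weighted_sum(inputs, weights, bias)


def perceptron_output(inputs, weights, bias):
    """A perceptron feeds the same sum into an activation (here a step function)."""
    return 1 if weighted_sum(inputs, weights, bias) >= 0 else 0


x = [2.0, -1.0, 0.5]   # illustrative input vector
w = [0.4, 0.3, -0.2]   # illustrative weights
b = -0.1               # constant / bias term

print(linear_regression_output(x, w, b))  # a continuous value
print(perceptron_output(x, w, b))         # a class label, 0 or 1
```

The only line that differs between the two models is the final comparison against zero, which is the activation function.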

Another approach is to construct multiple regression analyses using clustering. This is particularly useful if the set contains more than one linear relationship. The approach clusters or partitions the data-set into subsets and then constructs independent multiple linear regressions for each cluster. Some useful clustering algorithms are the k-means algorithm, or for a powerful Computational Intelligence approach, you can use an Ant Colony Optimization for cemetery formation.

Multiple Regression Analysis

Logistic Regression Analysis - linear regressions deal with continuous-valued series whereas a logistic regression deals with categorical (discrete) values. Discrete outcomes are not differentiable, so a logistic regression does not fit them directly; instead it models the probability of each category with a smooth logistic function, which keeps gradient-based optimization techniques applicable.

Stepwise Regression Analysis - this is the name given to the iterative construction of a multiple regression model. It works by automatically selecting statistically significant independent variables to include in the regression analysis. This is achieved by either growing or pruning the regression analysis, which is similar to the approach used in adaptive neural networks.

Many other regression analyses exist, and in particular, mixed models are worth mentioning here. Mixed models are an extension of the generalized linear model in which the linear predictor contains random effects in addition to the usual fixed effects. This decision tree can be used to help determine the right variance components for a model.

Generalized Linear Mixed Model


Approximating a Regression Analysis

After deciding on a regression model you must select a technique for approximating the regression analysis. This involves optimizing the free parameters of the regression model such that some objective function which measures how well the model 'fits' the data-set is optimized. The most commonly used approach is called the "least squares" method.

The least squares method aims to minimize the sum of the squared errors, where the errors are the residuals (distances) between the fitted curve and the set of datapoints. The residual can be calculated using perpendicular distances or vertical distances. The errors are squared so that the residuals form a continuous differentiable quantity (explained further below).

Regression Analysis Vertical Offsets

$$R^2 = \sum_{i=1}^{n} \left[ y_i - f(x_i, a_1, a_2, \ldots, a_n) \right]^2$$

Regression Analysis Perpendicular Offsets

$$R^2 = \sum_{i=1}^{n} d_i^2$$

In the case of vertical offsets the error is equal to the difference between the $y_i$ value from the data-set and the computed $y_r$ value from the regression line, where $y_r = f(x_i, a_1, a_2, \ldots, a_n)$ and $a_1, a_2, \ldots, a_n$ are the free parameters of the regression function, such as the coefficients of the explanatory variables in a multiple regression.

In the case of perpendicular offsets the error is built from the distances, $d_1, d_2, \ldots, d_n$, between the data points, $(x_i, y_i)$, and the closest points along the regression curve, $(x_r, y_r)$, measured perpendicular to the curve. For a straight line, this can be calculated by solving a quadratic equation. However, for non-linear functions, such as polynomials, this becomes more complex and computationally expensive. As such, vertical offsets are almost always used in practice.
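To make the two kinds of offset tangible, here is a small sketch that measures both for a candidate straight line. The three observations and the line y = x are invented for illustration; the code only evaluates the errors and does not fit anything.

```python
import math


def vertical_offset(x, y, m, c):
    """Vertical distance from the point (x, y) to the line y = m*x + c."""
    return abs(y - (m * x + c))


def perpendicular_offset(x, y, m, c):
    """Shortest (perpendicular) distance from the point (x, y) to the line y = m*x + c."""
    return abs(m * x - y + c) / math.sqrt(m * m + 1.0)


points = [(0.0, 1.0), (1.0, 1.0), (2.0, 4.0)]  # made-up observations
m, c = 1.0, 0.0                                 # candidate line y = x

# Sum of squared errors under each definition of the residual
vertical_sse = sum(vertical_offset(x, y, m, c) ** 2 for x, y in points)
perpendicular_sse = sum(perpendicular_offset(x, y, m, c) ** 2 for x, y in points)

print(vertical_sse, perpendicular_sse)
```

For a straight line the perpendicular offset is just the vertical offset scaled by a constant factor that depends on the slope, which is why the extra cost only really bites for curved functions.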

These concepts should be familiar to statisticians as well as machine learning enthusiasts because the sum-squared error is the objective function used in neural networks as well. In my next article I will demonstrate how a differential evolution algorithm can be used to very efficiently approximate a linear or non-linear regression analysis of a big data set.

Simple linear regression analysis

This optimization problem for straight lines was further simplified by Kenney and Keeping who introduced the concept of the center of mass of the dataset, $(\bar{x}, \bar{y})$, and related this to the y-intercept of the fitted line. This optimization problem is mathematically modelled as,

Sum of x squares

$$ss_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$ss_{xx} = \left( \sum_{i=1}^{n} x_i^2 \right) - n\bar{x}^2$$

Sum of y squares

$$ss_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$ss_{yy} = \left( \sum_{i=1}^{n} y_i^2 \right) - n\bar{y}^2$$

These two measurements are complemented by the sum of cross products, which captures how x and y vary together,

Sum of cross products

$$ss_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$ss_{xy} = \left( \sum_{i=1}^{n} x_i y_i \right) - n\bar{x}\bar{y}$$

Using just these three variables, $ss_{xx}$, $ss_{yy}$, $ss_{xy}$, and the center of mass it is possible to construct the straight line (linear regression) of the form $y = mx + c$ which minimizes the sum squared error, $R^2$, of the residuals between the line and the datapoints.

$$m = \frac{ss_{xy}}{ss_{xx}}$$

$$c = \bar{y} - m\bar{x}$$

These parameters are all that is needed to draw the linear regression analysis which fits a set of observed data points. Lastly, the overall quality of the regression analysis is measured using the squared correlation coefficient,

$$r^2 = \frac{ss_{xy}^2}{ss_{xx}\, ss_{yy}}$$
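The closed-form solution translates almost line for line into Python. The sketch below uses only the standard library; the function name and the sample data are illustrative and are not part of the program presented later in this article.

```python
def simple_linear_regression(xs, ys):
    """Fit y = m*x + c using the sums of squares and the center of mass."""
    n = len(xs)
    x_bar = sum(xs) / n                      # center of mass, x coordinate
    y_bar = sum(ys) / n                      # center of mass, y coordinate
    ss_xx = sum(x * x for x in xs) - n * x_bar ** 2
    ss_yy = sum(y * y for y in ys) - n * y_bar ** 2
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    m = ss_xy / ss_xx                        # slope
    c = y_bar - m * x_bar                    # intercept
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
    return m, c, r_squared


xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up observations
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c, r_squared = simple_linear_regression(xs, ys)
print(m, c, r_squared)
```

Because everything reduces to a handful of sums, no iterative optimizer is needed for the straight-line case.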

Iterative methods - for harder problems (fun)

For more complex functions iterative methods need to be applied. An iterative procedure is one which generates a sequence of improving approximate solutions to a particular problem. In computer science terms, an iterative method is the equivalent of a search algorithm.

There are two classes of optimization algorithms, exhaustive or heuristic. Exhaustive techniques are referred to as "brute force" methods because they deterministically try every combination. Heuristic methods on the other hand use knowledge about the optimization problem to locate good solutions. Which brings us back to differentiable objective functions ...

One widely used heuristic is the gradient of a function because when this is equal to zero, that point in the function is either a local minimum or maximum. Gradient methods such as gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt algorithm adjust the solution in the direction indicated by the derivative such that the objective function is either minimized or maximized.
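As a rough illustration of a gradient method, the following sketch fits a straight line by plain gradient descent on the mean squared error. The learning rate, the number of steps, and the data are assumptions chosen so that this toy problem converges; they are not tuned values from any library.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Minimize the mean squared error of y = m*x + c by following the negative gradient."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to m and c
        grad_m = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_c = -2.0 / n * sum(residuals)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
m, c = fit_line_gradient_descent(xs, ys)
print(m, c)
```

The updates stop changing the solution once both partial derivatives are close to zero, which is exactly the zero-gradient condition described above.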

Another widely used heuristic is line of sight, a.k.a. direct methods. These methods don't use gradients but instead generate points within the search space and "look for" the optima. Examples of such algorithms include random search, pattern search, grid search, hill climbers, simulated annealing, and even the particle swarm optimization algorithm.
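To show what a direct method looks like in practice, here is a bare-bones pattern search applied to the same kind of line-fitting problem. It is a simplified sketch (probe each axis, halve the step when nothing improves), not a reference implementation of any particular published variant.

```python
def sum_squared_error(params, xs, ys):
    """Objective function: sum of squared vertical offsets for the line y = m*x + c."""
    m, c = params
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


def pattern_search(objective, start, step=1.0, tolerance=1e-8):
    """Derivative-free search: probe +/- step along each axis, halve the step when stuck."""
    best = list(start)
    best_value = objective(best)
    while step > tolerance:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] += delta
                value = objective(candidate)
                if value < best_value:
                    best, best_value = candidate, value
                    improved = True
        if not improved:
            step /= 2.0   # nothing better nearby, so look closer
    return best, best_value


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
(m, c), error = pattern_search(lambda p: sum_squared_error(p, xs, ys), [0.0, 0.0])
print(m, c, error)
```

Note that the objective is only ever evaluated, never differentiated, so the same routine would work for an objective that has no usable gradient.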

Lastly, evolutionary computation is a popular metaheuristic for solving complex optimization problems. These methods are difficult to relate to mathematics as they are inspired by the processes found in natural evolution. In this category we find such algorithms as genetic algorithms, grammatical evolution, and the differential evolution algorithm.


Data considerations

Because the error is the squared distance between the data point and the regression line, large distances produce disproportionately large errors, which can cause the regression analysis to converge on a solution with a poor correlation coefficient. As such, outliers should be removed from the data-set.

Outlier Removal
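One possible way to screen for outliers before fitting is sketched below. It uses the modified z-score based on the median absolute deviation, with the commonly quoted constant 0.6745 and cut-off 3.5; this rule is my assumption for the sake of the example and is not used by the program in this article.

```python
import statistics


def remove_outliers(values, threshold=3.5):
    """Drop values whose modified z-score (median absolute deviation based) exceeds the threshold."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        return list(values)  # no spread to judge against, so keep everything
    return [v for v in values if 0.6745 * abs(v - median) / mad <= threshold]


prices = [10.1, 10.3, 9.9, 10.0, 10.2, 55.0, 10.4]  # made-up series with one bad tick
cleaned = remove_outliers(prices)
print(cleaned)
```

The median is used instead of the mean because a single extreme value barely moves it, so the outlier cannot hide itself by inflating the scale it is measured against.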

One might also consider applying weights to different points in the data set. As an example consider an investor who is analyzing a multi-year time series. He might decide to place a greater weight (importance) on recent years because he assumes that to be an accurate reflection of the future prices. The technique used in this instance is weighted least squares.
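For a straight line, weighted least squares also has a closed form: the sums of squares are simply weighted. The sketch below assumes made-up yearly prices and weights that grow with recency; the function name is hypothetical.

```python
def weighted_linear_regression(xs, ys, weights):
    """Fit y = m*x + c while letting each observation count in proportion to its weight."""
    total = sum(weights)
    # Weighted center of mass
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    ss_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    ss_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    m = ss_xy / ss_xx
    c = y_bar - m * x_bar
    return m, c


years = [1.0, 2.0, 3.0, 4.0]
prices = [1.0, 1.0, 3.0, 4.0]                                        # made-up yearly prices
equal = weighted_linear_regression(years, prices, [1, 1, 1, 1])      # ordinary least squares
recent = weighted_linear_regression(years, prices, [1, 2, 3, 4])     # recent years count more
print(equal, recent)
```

With equal weights the result matches ordinary least squares, so the weights can be seen as a dial between "treat all years alike" and "trust only the latest data".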


Quandl Integration

A recurring challenge with any quantitative analysis is the availability of good quality data. Luckily for us, Quandl.com has taken on the data challenge and indexed millions of economic, financial, societal, and country-specific datasets. That data is also available through a free API (Application Programming Interface) supported by the Quandl Python package.

Quandl front page

Downloading Quandl data

If you need more than 50 API calls per day, you need to sign up with Quandl.com to get a free authentication token. This can then be used to download datasets through Quandl for Python. For instructions on installing Quandl for Python check out PyPi or the Github page.

To get a data-set from Quandl, e.g. quandl.com/WIKI/AAPL-Apple-Inc-AAPL-Prices-Dividends-Splits-and-Trading-Volume, paste its name (WIKI/AAPL) into the Quandl.get() function,

import Quandl
data_set = Quandl.get("WIKI/AAPL", authtoken="your token here")

The Quandl.get() function also supports a number of data transformations and manipulations which allow you to specify how you would like the data to be returned, including,

  • sort_order:String - ("asc"|"desc")
  • rows:int - the amount of historical data to extract
  • collapse:String - the sampling frequency ("daily"|"weekly"|"monthly"|"quarterly"|"annual")
  • transformation:String - ("diff"|"rdiff"|"normalize"|"cumul")
  • returns:String - ("numpy"|"pandas")

Here is an example of a detailed API call using multiple data transformations,

import Quandl
data_set = Quandl.get("WIKI/AAPL", rows=50, sort_order="desc", collapse="weekly", transformation="normalize", returns="numpy", authtoken="your token here")

Abstraction

For flexibility and re-usability I abstracted the Quandl API with a class called QuandlSettings. A QuandlSettings object contains the parameters required to construct any Quandl API call. I also added an additional column parameter which allows the user to specify which column of the dataset to include in the regression analysis.

class QuandlSettings():
    """
    This class contains settings for the quandl integration package, settings include,
    * rows:int - specifies the amount of historical data to extract in [frequency]
    * column:int - specifies the column in the data-set to use for the regression analysis
    * frequency:String - select between ("daily"|"weekly"|"monthly"|"quarterly"|"annual")
    * transformation:String - select the numerical transformation ("diff"|"rdiff"|"normalize"|"cumul")
    * order:String - select order of data between ("asc"|"desc")
    """
    rows = 0
    column = 1
    frequency = "weekly"
    transformation = "normalize"
    order = "desc"
 
    def __init__(self, rows, column, frequency="weekly", transformation="normalize", order="desc"):
        """
        This initialization method constructs a new QuandlSettings object
        """
        self.rows = rows
        self.column = column
        self.frequency = frequency
        self.transformation = transformation
        self.order = order
        pass

The dataset name is decoupled from the QuandlSettings class to improve re-usability of QuandlSettings objects. Consider how the quandl_args_prices object is reused for each dataset in the economic regression analysis example below,

def economics_example():
    """
    This method creates a set of regression analyses based on economics GDP's of the BRICS nations, 
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args = StatsModelsSettings(1, False)
    quandl_args_prices = QuandlSettings(15, 1, "yearly")
 
    # South Africa, China, Brazil, India, Russia
    regressions = [RegressionAnalysis("WORLDBANK/ZAF_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'b'),
                   RegressionAnalysis("WORLDBANK/CHN_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'g'),
                   RegressionAnalysis("WORLDBANK/BRA_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'k'),
                   RegressionAnalysis("WORLDBANK/IND_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'm'),
                   RegressionAnalysis("WORLDBANK/RUS_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'c')]
    plot_regression_line(regressions)

Custom download method

A custom download method was created in the RegressionAnalysis class which receives a QuandlSettings object and the name of the dataset to be downloaded. In order to extract the correct column for the regression analysis a for loop was used,

    def get_quandl_data(quandl_data_set_name, quandl_settings):
        """
        This method retrieves the quandl data set given the settings specified in the quandl_settings object. For more
        information about these settings see documentation from the QuandlSettings class
        """
        quandl_data_set = get(quandl_data_set_name, rows=quandl_settings.rows, returns="numpy",
                              transformation=quandl_settings.transformation,
                              sort_order=quandl_settings.order, collapse=quandl_settings.frequency)
        print(quandl_data_set)
        quandl_dates = np.arange(1, quandl_settings.rows + 1, 1)
        quandl_prices = []
 
        # TODO: find a better way to extract some column, X, from numpy matrix of tuples (w, x, y, z)
        for i in range(quandl_data_set.size):
            quandl_prices.append(quandl_data_set[quandl_settings.rows - (i + 1)][quandl_settings.column] / 100)
        return quandl_dates, quandl_prices

There must be a more efficient method than the for loop, but I don't know it yet. Please also take note that np.arange(1, quandl_settings.rows + 1, 1) creates an array of numbers increasing from 1 to quandl_settings.rows. This is used because the StatsModels regression analysis model does not support dates (yet), so these values represent time.
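One candidate for replacing the for loop is NumPy's field indexing. The sketch below assumes that the data comes back as a structured (record) array, i.e. an array of tuples with named fields; the stand-in array, its field names, and the helper name extract_column are all invented for this example.

```python
import numpy as np

# Stand-in for the array of tuples returned by the download; names and values are made up.
data = np.array(
    [("2014-03-07", 3.0, 30.0), ("2014-02-28", 2.0, 20.0), ("2014-02-21", 1.0, 10.0)],
    dtype=[("Date", "U10"), ("Open", "f8"), ("Close", "f8")],
)


def extract_column(record_array, column):
    """Return one field of a structured array, with the row order reversed and scaled by 1/100."""
    field_name = record_array.dtype.names[column]   # look the field up by position
    return record_array[field_name][::-1] / 100     # reverse rows, as the for loop does


prices = extract_column(data, 2)
print(prices)
```

Selecting the field by name returns the whole column at once, and the `[::-1]` slice reproduces the reversal that the loop performs by counting down from the last row.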


Python StatsModels

StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

I recently started using StatsModels and I've been very impressed. It provides efficient implementations of many statistical tools including simple linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis tools including ARMA, non-parametric estimators, datasets, statistical tests, and more.

Abstraction

As with the QuandlSettings class, a StatsModelsSettings class was created to improve the re-usability of configurations for the regression analysis. At this point in time, these settings are restricted to changing the power of the fitted curve, and specifying whether or not confidence lines around the regression should be computed and plotted,

class StatsModelsSettings():
    """
    This class contains settings for the statsmodels package, settings include,
    * exponent:int - when equal to one this is a straight line, when >1 this is a curve
    * confidence:boolean - specifies whether confidence lines should be calculated and plotted
    """
    exponent = 1
    confidence = False
 
    def __init__(self, exponent=1, confidence=False):
        """
        This initialization method constructs a new StatsModelSettings object
        """
        self.exponent = exponent
        self.confidence = confidence
        pass

Note that when the exponent is equal to 1.0 the fitted curve is a straight line. When this is greater than one, the curve begins to take on non-linearities.

Ordinary Least Squares

StatsModels includes an ordinary least squares method. Our run_ordinary_least_squares() method wraps it with Quandl data and a StatsModelsSettings object.

In this wrapper method the data from Quandl is treated as the dependent variable, the array of values from 1 to rows is treated as the independent variable (time / dates), and a StatsModelsSettings object is used to store values for the parameters used to compute the regression analysis. The method is implemented as follows,

    def run_ordinary_least_squares(ols_dates, ols_data, statsmodels_settings):
        """
        This method receives the dates and prices of a Quandl data-set as well as settings for the StatsModels package,
        it then calculates the regression lines and / or the confidence lines and returns the objects
        """
        # Design matrix: the time index and the time index raised to the configured exponent
        intercept = np.column_stack((ols_dates, ols_dates ** statsmodels_settings.exponent))
        # Prepend a column of ones so that the fit includes a constant term
        constant = sm.add_constant(intercept)
        statsmodel_regression = sm.OLS(ols_data, constant).fit()
        print(statsmodel_regression.summary())
        if statsmodels_settings.confidence:
            # Lower and upper bounds around the fitted values (the "confidence lines")
            prstd, lower, upper = wls_prediction_std(statsmodel_regression)
            return statsmodel_regression, lower, upper
        else:
            return statsmodel_regression

To print the results of the regression analysis from StatsModels you can add the following command, print(statsmodel_regression.summary()). The output should look something like this,

Regression Results
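To see what the matrix built by the np.column_stack call in run_ordinary_least_squares looks like, the sketch below constructs the same kind of matrix for a synthetic quadratic and solves it with NumPy's own least squares routine. NumPy is used here instead of StatsModels purely to keep the example self-contained; the data is noise-free and invented.

```python
import numpy as np

time_index = np.arange(1, 11, dtype=float)                 # stands in for the 1..rows "dates"
values = 0.5 * time_index ** 2 - 2.0 * time_index + 3.0    # synthetic, noise-free quadratic

exponent = 2
# Columns of the design matrix: constant, t, t ** exponent
design = np.column_stack((np.ones_like(time_index), time_index, time_index ** exponent))

# Solve for the coefficients that minimize the sum of squared vertical offsets
coefficients, _, rank, _ = np.linalg.lstsq(design, values, rcond=None)
fitted = design @ coefficients
print(coefficients, rank)
```

One detail worth knowing: with an exponent of 1 the last two columns are identical, so the matrix loses a rank. As far as I can tell the StatsModels fit still succeeds because it solves the problem with a pseudo-inverse by default, but the individual coefficients of the two duplicate columns are then not meaningful on their own.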

The Regression Analysis Class

A RegressionAnalysis class was created so that it would be easy to create and store multiple regressions. The RegressionAnalysis class encapsulates the run_ordinary_least_squares() and the get_quandl_data() methods. This class is shown at the end of this article.


Matplotlib

The final piece of the puzzle is to plot the results. Because we want to be able to plot multiple regressions on one canvas, plotting functionality and the RegressionAnalysis class are decoupled. For this Matplotlib was used. Matplotlib is a 2D plotting library which produces figures in a variety of hardcopy formats and environments across platforms.

A plot_regression_line() function was defined at module level. It receives a list of RegressionAnalysis objects as an argument and plots each one, one by one.

def plot_regression_line(regression_analyses):
    """
    This global method is a front-end to the MatplotLib library which receives a set of regression analyses and plots
    each one of them onto the canvas.
    """
    title = ""
    fig, ax = plot.subplots()
    # Plot each regression analysis in the set
    for regression_i in regression_analyses:
        ax.plot(regression_i.dates, regression_i.prices, regression_i.color, label="Values " + regression_i.data_set)
        ax.plot(regression_i.dates, regression_i.regression.fittedvalues, regression_i.color + '--',
                label="Regression line " + regression_i.data_set)
        if regression_i.lower is not None:
            ax.plot(regression_i.dates, regression_i.lower, regression_i.color + '--')
        if regression_i.upper is not None:
            ax.plot(regression_i.dates, regression_i.upper, regression_i.color + '--')
        plot.xlabel('Time')
        plot.ylabel('Normalized Values')
        title += regression_i.data_set + ", "

    plot.title('Regression Analysis of ' + title)
    ax.legend(loc='best')
    plot.grid(True)
    plot.show()


Example usage

By combining object-oriented programming, the Quandl API, and existing Python packages we have created a program which can do simple regression analysis on any Quandl dataset! The remainder of this article will show some simple example applications of the program, and in future articles and tutorials I will construct ever more sophisticated analysis tools.


Fundamental Analysis: Google vs. Yahoo vs. Apple revenues

Quandl.com contains historical fundamental indicators as well as company data for many US companies. This is the code we would need to type if we wanted to compare the revenues of Google, Yahoo, and Apple over the past five years,

def investing_example():
    """
    This method creates a set of regression analyses based on fundamental trading (revenues)
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args_inv = StatsModelsSettings(2, False)
    quandl_args_inv = QuandlSettings(5, 1, "yearly")
 
    regressions_inv = [RegressionAnalysis("DMDRN/GOOG_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'b'),
                       RegressionAnalysis("DMDRN/YHOO_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'g'),
                       RegressionAnalysis("DMDRN/AAPL_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'k')]
    plot_regression_line(regressions_inv)

And here are the results,

Regression Analysis Investing Example


Technical Analysis: Trade entry and exit positions

Regression analysis is used extensively in trading. Technical analysts use the "regression channel" to calculate entry and exit positions into a particular stock.

Another application is pairs trading which monitors the performance of two historically correlated securities. When the correlation temporarily weakens, i.e. one stock moves up while the other moves down, the pairs trade shorts the outperforming stock and buys the under-performing one, betting that the "spread" between the two would eventually converge.
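The pairs-trading rule described above can be sketched as a simple z-score test on the spread between the two price series. The entry threshold of two standard deviations, the function name, and the prices are all assumptions for illustration; this is not trading advice and is not part of the program in this article.

```python
import statistics


def pairs_signal(prices_a, prices_b, entry_z=2.0):
    """Compare the latest spread between A and B with its own history and suggest a trade."""
    spread = [a - b for a, b in zip(prices_a, prices_b)]
    history, latest = spread[:-1], spread[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return "no trade"
    z_score = (latest - mean) / stdev
    if z_score > entry_z:
        return "short A, buy B"    # A has outperformed B by an unusual margin
    if z_score < -entry_z:
        return "buy A, short B"    # B has outperformed A by an unusual margin
    return "no trade"


stock_a = [10.0, 10.5, 10.2, 10.4, 10.3, 13.0]   # made-up prices; A jumps at the end
stock_b = [9.0, 9.4, 9.3, 9.4, 9.2, 9.3]
print(pairs_signal(stock_a, stock_b))
```

The bet is that the spread reverts towards its historical mean, so the trade is closed once the z-score drifts back towards zero.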

If we wanted to compare the past 350 weeks' worth of prices for Google and Yahoo with the regression channel (confidence intervals), we would use the following code,

def trading_example():
    """
    This method creates a set of regression analyses based on technical trading details (price)
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args_trade = StatsModelsSettings(1, True)
    quandl_args_trade = QuandlSettings(350, 4, "weekly")
 
    regressions_trade = [RegressionAnalysis("GOOG/NASDAQ_GOOG", quandl_args_trade, statsmodels_args_trade, 'b'),
                         RegressionAnalysis("GOOG/NASDAQ_YHOO", quandl_args_trade, statsmodels_args_trade, 'g')]
    plot_regression_line(regressions_trade)

And here are the results,

Regression Analysis Example Pairs Trading


Economics: GDP comparison of BRICS nations

Another area in finance where regression analysis is often used is econometrics. If we wanted to compare the past 15 years of GDP values for the BRICS nations (Brazil, Russia, India, China, and South Africa), we would just need to write the following code,

def economics_example():
    """
    This method creates a set of regression analyses based on economics GDP's of the BRICS nations,
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args = StatsModelsSettings(1, False)
    quandl_args_prices = QuandlSettings(15, 1, "yearly")
 
    # South Africa, China, Brazil, India, Russia
    regressions = [RegressionAnalysis("WORLDBANK/ZAF_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'b'),
                   RegressionAnalysis("WORLDBANK/CHN_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'g'),
                   RegressionAnalysis("WORLDBANK/BRA_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'k'),
                   RegressionAnalysis("WORLDBANK/IND_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'm'),
                   RegressionAnalysis("WORLDBANK/RUS_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'c')]
    plot_regression_line(regressions)

And here are the results (as you all guessed, China was #1)

Regression Analysis Example Economics


Conclusion and Source Code

In conclusion, regression analysis is a simple and yet useful tool. It can be used to help explain and compare various data-sets and is used extensively in finance, trading, risk management, and econometrics. That having been said, regression analysis is not immune to fault and asserts strong requirements on the data being analysed. For a great discussion on the risks and problems with using regression analysis click here.

__author__ = 'Stuart Gordon Reid'
 
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plot
from Quandl import get
from statsmodels.sandbox.regression.predstd import wls_prediction_std
 
 
class StatsModelsSettings():
    """
    This class contains settings for the statsmodels package, settings include,
    * exponent:int - when equal to one this is a straight line, when >1 this is a curve
    * confidence:boolean - specifies whether confidence lines should be calculated and plotted
    """
    exponent = 1
    confidence = False
 
    def __init__(self, exponent=1, confidence=False):
        """
        This initialization method constructs a new StatsModelSettings object
        """
        self.exponent = exponent
        self.confidence = confidence
        pass
 
 
class QuandlSettings():
    """
    This class contains settings for the quandl integration package, settings include,
    * rows:int - specifies the amount of historical data to extract in [frequency]
    * column:int - specifies the column in the data-set to use for the regression analysis
    * frequency:String - select between ("daily"|"weekly"|"monthly"|"quarterly"|"annual")
    * transformation:String - select the numerical transformation ("diff"|"rdiff"|"normalize"|"cumul")
    * order:String - select order of data between ("asc"|"desc")
    """
    rows = 0
    column = 1
    frequency = "weekly"
    transformation = "normalize"
    order = "desc"
 
    def __init__(self, rows, column, frequency="weekly", transformation="normalize", order="desc"):
        """
        This initialization method constructs a new QuandlSettings object
        """
        self.rows = rows
        self.column = column
        self.frequency = frequency
        self.transformation = transformation
        self.order = order
        pass
 
 
class RegressionAnalysis():
    """
    This class contains the logic for calculating the regression analysis given a Quandl data-set name, a QuandlSettings
    object, and a StatsModelsSettings object. The resulting regression analysis is returned.
    """
    color = 'r'
    dates = []
    prices = []
    data_set = ""
    regression = None
    upper = None
    lower = None
 
    def __init__(self, quandl_data_set_name, quandl_settings, statsmodels_settings, color='r'):
        """
        This initialization method constructs a new RegressionAnalysis object
        """
        self.color = color
        self.data_set = quandl_data_set_name
        self.dates, self.prices = self.get_quandl_data(self.data_set, quandl_settings)
 
        # Only calculate and return confidence lines if setting = True
        if statsmodels_settings.confidence:
            self.regression, self.lower, self.upper = self.run_ordinary_least_squares(self.dates, self.prices,
                                                                                      statsmodels_settings)
        else:
            self.regression = self.run_ordinary_least_squares(self.dates, self.prices, statsmodels_settings)
        pass
 
    @staticmethod
    def get_quandl_data(quandl_data_set_name, quandl_settings):
        """
        This method retrieves the quandl data set given the settings specified in the quandl_settings object. For more
        information about these settings see documentation from the QuandlSettings class
        """
        quandl_data_set = get(quandl_data_set_name, rows=quandl_settings.rows, returns="numpy",
                              transformation=quandl_settings.transformation,
                              sort_order=quandl_settings.order, collapse=quandl_settings.frequency)
        print(quandl_data_set)
        quandl_dates = np.arange(1, quandl_settings.rows + 1, 1)
        quandl_prices = []
 
        # TODO: find a better way to extract some column, X, from numpy matrix of tuples (w, x, y, z)
        for i in range(quandl_data_set.size):
            quandl_prices.append(quandl_data_set[quandl_settings.rows - (i + 1)][quandl_settings.column] / 100)
        return quandl_dates, quandl_prices
 
    @staticmethod
    def run_ordinary_least_squares(ols_dates, ols_data, statsmodels_settings):
        """
        This method receives the dates and prices of a Quandl data-set as well as settings for the StatsModels package,
        it then calculates the regression lines and / or the confidence lines and returns the objects
        """
        intercept = np.column_stack((ols_dates, ols_dates ** statsmodels_settings.exponent))
        constant = sm.add_constant(intercept)
        statsmodel_regression = sm.OLS(ols_data, constant).fit()
        print(statsmodel_regression.summary())
        if statsmodels_settings.confidence:
            prstd, lower, upper = wls_prediction_std(statsmodel_regression)
            return statsmodel_regression, lower, upper
        else:
            return statsmodel_regression
 
 
def plot_regression_line(regression_analyses):
    """
    This global method is a front-end to the MatplotLib library which receives a set of regression analyses and plots
    each one of them onto the canvas.
    """
    title = ""
    fig, ax = plot.subplots()
    # Plot each regression analysis in the set
    for regression_i in regression_analyses:
        ax.plot(regression_i.dates, regression_i.prices, regression_i.color, label="Values " + regression_i.data_set)
        ax.plot(regression_i.dates, regression_i.regression.fittedvalues, regression_i.color + '.',
                label="Regression line " + regression_i.data_set)
        if regression_i.lower is not None:
            ax.plot(regression_i.dates, regression_i.lower, regression_i.color + '--')
        if regression_i.upper is not None:
            ax.plot(regression_i.dates, regression_i.upper, regression_i.color + '--')
        plot.xlabel('Time')
        plot.ylabel('Normalized Values')
        title += regression_i.data_set + ", "
 
    plot.title('Regression Analysis of ' + title)
    ax.legend(loc='best')
    plot.grid(True)
    plot.show()
 
 
def investing_example():
    """
    This method creates a set of regression analyses based on fundamental trading (revenues)
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args_inv = StatsModelsSettings(2, False)
    quandl_args_inv = QuandlSettings(5, 1, "yearly")
 
    regressions_inv = [RegressionAnalysis("DMDRN/GOOG_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'b'),
                       RegressionAnalysis("DMDRN/YHOO_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'g'),
                       RegressionAnalysis("DMDRN/AAPL_REV_LAST", quandl_args_inv, statsmodels_args_inv, 'k')]
    plot_regression_line(regressions_inv)
 
 
def trading_example():
    """
    This method creates a set of regression analyses based on technical trading details (price)
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args_trade = StatsModelsSettings(1, True)
    quandl_args_trade = QuandlSettings(350, 4, "weekly")
 
    regressions_trade = [RegressionAnalysis("GOOG/NASDAQ_GOOG", quandl_args_trade, statsmodels_args_trade, 'b'),
                         RegressionAnalysis("GOOG/NASDAQ_YHOO", quandl_args_trade, statsmodels_args_trade, 'g')]
    plot_regression_line(regressions_trade)
 
 
def economics_example():
    """
    This method creates a set of regression analyses based on economics GDP's of the BRICS nations,
    """
    # b: blue, g: green, r: red, c: cyan, m: magenta, y: yellow, k: black, w: white
    statsmodels_args = StatsModelsSettings(1, False)
    quandl_args_prices = QuandlSettings(15, 1, "yearly")
 
    # South Africa, China, Brazil, India, Russia
    regressions = [RegressionAnalysis("WORLDBANK/ZAF_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'b'),
                   RegressionAnalysis("WORLDBANK/CHN_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'g'),
                   RegressionAnalysis("WORLDBANK/BRA_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'k'),
                   RegressionAnalysis("WORLDBANK/IND_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'm'),
                   RegressionAnalysis("WORLDBANK/RUS_NY_GDP_MKTP_KN", quandl_args_prices, statsmodels_args, 'c')]
    plot_regression_line(regressions)
 
 
if __name__ == "__main__":
    # This main block runs the regression analysis program
    trading_example()