When the observed values of the variable x contain measurement error, a regression of (x, y) should use geometric mean regression. The following is a Python conversion of the MATLAB function:
Trujillo-Ortiz, A. and R. Hernandez-Walls. (2010). gmregress:
Geometric Mean Regression (Reduced Major Axis Regression).
A MATLAB file. [WWW document]. URL: http://www.mathworks.com/matlabcentral/fileexchange/27918-gmregress
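As a quick sanity check on the idea (an illustration added here, not part of the original file): the reduced major axis slope can be computed two equivalent ways, as sign(r)·s_y/s_x, or as the geometric mean of the OLS slope of y on x and the reciprocal of the OLS slope of x on y. A minimal sketch using the Sokal and Rohlf Box 14.12 data that also appears in the example below:

```python
import numpy as np

# Sokal and Rohlf (1995), Box 14.12 data (California fish cabezon)
x = np.array([14, 17, 24, 25, 27, 33, 34, 37, 40, 41, 42], dtype=float)
y = np.array([61, 37, 65, 69, 54, 93, 87, 89, 100, 90, 97], dtype=float)

sxy = np.cov(x, y)[0, 1]
b_yx = sxy / np.var(x, ddof=1)   # OLS slope of y on x
b_xy = sxy / np.var(y, ddof=1)   # OLS slope of x on y
r = np.corrcoef(x, y)[0, 1]

# geometric mean of b_yx and 1/b_xy, with the sign of r
slope_gm = np.sign(r) * np.sqrt(b_yx / b_xy)
# equivalent form: ratio of standard deviations, with the sign of r
slope_sd = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
print(slope_gm, slope_sd)  # both ≈ 2.1194
```

The two forms agree because sqrt(b_yx · (1/b_xy)) algebraically reduces to s_y/s_x.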
# This function is converted from the MATLAB function by Trujillo-Ortiz, A.
# and R. Hernandez-Walls. (2010). gmregress:
# Geometric Mean Regression (Reduced Major Axis Regression).
# A MATLAB file. [WWW document]. URL http://www.mathworks.com/
# matlabcentral/fileexchange/27918-gmregress
# GMREGRESS Geometric Mean Regression (Reduced Major Axis Regression).
# Model II regression should be used when the two variables in the
# regression equation are random and subject to error, i.e. not
# controlled by the researcher. Model I regression using ordinary least
# squares underestimates the slope of the linear relationship between the
# variables when they both contain error. According to Sokal and Rohlf
# (1995), the subject of Model II regression is one on which research and
# controversy are continuing and definitive recommendations are difficult
# to make.
#
# GMREGRESS is a Model II procedure. It standardizes the variables before
# the slope is computed: each of the two variables is transformed to have
# a mean of zero and a standard deviation of one. The resulting slope is
# the geometric mean of the linear regression coefficient of Y on X and
# the reciprocal of the regression coefficient of X on Y. Ricker (1973)
# coined this term and gave an extensive review of Model II regression.
# It is also known as Standard Major Axis regression.
#
# B,BINTR,BINTJM = GMREGRESS(X,Y,ALPHA) returns the vector (ndarray) B of
# regression coefficients in the linear Model II, and the arrays BINTR and
# BINTJM of the given confidence intervals for B, computed by the Ricker
# (1973) procedure and by the Jolicoeur and Mosimann (1968)-McArdle (1988)
# procedure, respectively.
#
# GMREGRESS treats NaNs in X or Y as missing values, and removes them.
#
# Syntax: b,bintr,bintjm = gmregress(x,y,alpha)
#
# Example. From the Box 14.12 (California fish cabezon [Scorpaenichthys
# marmoratus]) of Sokal and Rohlf (1995). The data are:
#
# x=np.array([14,17,24,25,27,33,34,37,40,41,42],dtype='float')
# y=np.array([61,37,65,69,54,93,87,89,100,90,97],dtype='float')
#
# Calling the function in Python:
# b,bintr,bintjm = gmregress(x,y)
#
# Answer is:
#
# b =
# array([12.19378495, 2.11936636])
#
# bintr =
# array([[-10.64446401,  35.03203392],
#        [  1.36720846,   2.87152426]])
#
# bintjm =
# array([[-14.57692871,  31.09956922],
#        [  1.49672077,   3.00103657]])
#
# Matlab Code Created by A. Trujillo-Ortiz and R. Hernandez-Walls
# Facultad de Ciencias Marinas
# Universidad Autonoma de Baja California
# Apdo. Postal 453
# Ensenada, Baja California
# Mexico.
# atrujo@uabc.edu.mx
#
# Copyright (C) June 15, 2010.
#
# To cite the MATLAB file, an appropriate format would be:
# Trujillo-Ortiz, A. and R. Hernandez-Walls. (2010). gmregress:
# Geometric Mean Regression (Reduced Major Axis Regression).
# A MATLAB file. [WWW document]. URL http://www.mathworks.com/
# matlabcentral/fileexchange/27918-gmregress
#
# References:
# Jolicoeur, P. and Mosimann, J. E. (1968), Intervalles de confiance pour
# la pente de l'axe majeur d'une distribution normale bidimensionnelle.
# Biométrie-Praximétrie, 9:121-140.
# McArdle, B. (1988), The structural relationship: regression in biology.
# Can. Jour. Zool. 66:2329-2339.
# Ricker, W. E. (1973), Linear regression in fishery research. J. Fish.
# Res. Board Can., 30:409-434.
# Sokal, R. R. and Rohlf, F. J. (1995), Biometry. The principles and
# practice of statistics in biological research. 3rd ed.
# New York: W. H. Freeman. [Sections 14.13 and 15.7]
#
import numpy as np
from scipy.stats import f as f_dist
from scipy.stats import t as t_dist

def gmregress(x, y, alpha=0.05):
    # Check that x and y are numpy arrays of equal length
    if type(x) is not np.ndarray or type(y) is not np.ndarray:
        raise ValueError('X and Y must be np.ndarray.')
    n = x.size
    if y.size != n:
        raise ValueError('The number of elements in Y must equal the number of elements in X.')
    # Treat NaNs in X or Y as missing values and remove them
    wasnan = np.isnan(y) | np.isnan(x)
    if np.any(wasnan):
        y = y[~wasnan]
        x = x[~wasnan]
        n = x.size
    R = np.corrcoef(x, y)
    r = R[0, 1]    # correlation coefficient
    s = r/abs(r)   # sign of the correlation coefficient; the slope must
                   # carry this sign (a negative relationship always gives
                   # a negative slope). This former bug was corrected thanks
                   # to the valuable suggestions of Holger Goerlitz and
                   # Joel E. Cohen.
    S = np.cov(x, y)
    SCX = S[0, 0]*(n - 1)  # sum of squares of x
    SCY = S[1, 1]*(n - 1)  # sum of squares of y
    SCP = S[0, 1]*(n - 1)  # sum of cross products
    v = s*np.sqrt(SCY/SCX)               # slope
    u = np.average(y) - np.average(x)*v  # intercept
    b = np.array([u, v])

    # Ricker procedure
    SCv = SCY - (SCP**2)/SCX
    N = SCv/(n - 2)
    sv = np.sqrt(N/SCX)
    t = t_dist.isf(alpha/2, n - 2)
    vi = v - t*sv  # lower confidence limit of the slope
    vs = v + t*sv  # upper confidence limit of the slope
    ui = np.average(y) - np.average(x)*vs  # lower confidence limit of the intercept
    us = np.average(y) - np.average(x)*vi  # upper confidence limit of the intercept
    uint = np.array([ui, us])
    vint = np.array([vi, vs])
    if ui > us:
        uint = uint[::-1]
    if vi > vs:
        vint = vint[::-1]
    bintr = np.array([uint, vint])

    # Jolicoeur and Mosimann procedure
    F = f_dist.isf(alpha, 1, n - 2)
    B = F*(1 - r**2)/(n - 2)
    a = np.sqrt(B + 1)
    c = np.sqrt(B)
    qi = v*(a - c)  # lower confidence limit of the slope
    qs = v*(a + c)  # upper confidence limit of the slope
    pi = np.average(y) - np.average(x)*qs  # lower confidence limit of the intercept
    ps = np.average(y) - np.average(x)*qi  # upper confidence limit of the intercept
    pint = np.array([pi, ps])
    qint = np.array([qi, qs])
    if pi > ps:
        pint = pint[::-1]
    if qi > qs:
        qint = qint[::-1]
    bintjm = np.array([pint, qint])
    return b, bintr, bintjm
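The confidence-interval arithmetic can also be checked independently of the function. The following self-contained sketch (a re-derivation added for illustration, not part of the original file) recomputes the slope limits for the Box 14.12 example by both procedures; the values match the slope rows of bintr and bintjm quoted in the header:

```python
import numpy as np
from scipy.stats import f, t

x = np.array([14, 17, 24, 25, 27, 33, 34, 37, 40, 41, 42], dtype=float)
y = np.array([61, 37, 65, 69, 54, 93, 87, 89, 100, 90, 97], dtype=float)
n, alpha = x.size, 0.05

SCX = np.var(x, ddof=1) * (n - 1)   # sum of squares of x
SCY = np.var(y, ddof=1) * (n - 1)   # sum of squares of y
SCP = np.cov(x, y)[0, 1] * (n - 1)  # sum of cross products
r = SCP / np.sqrt(SCX * SCY)
v = np.sign(r) * np.sqrt(SCY / SCX)  # RMA slope, ≈ 2.1194

# Ricker procedure: symmetric t-interval with standard error sqrt(MSE/SCX)
sv = np.sqrt((SCY - SCP**2 / SCX) / ((n - 2) * SCX))
tc = t.isf(alpha / 2, n - 2)
ricker = (v - tc * sv, v + tc * sv)  # ≈ (1.3672, 2.8715)

# Jolicoeur-Mosimann procedure: F-based multiplicative interval
B = f.isf(alpha, 1, n - 2) * (1 - r**2) / (n - 2)
jm = (v * (np.sqrt(B + 1) - np.sqrt(B)),
      v * (np.sqrt(B + 1) + np.sqrt(B)))  # ≈ (1.4967, 3.0010)
```

Note the design difference: the Ricker interval is symmetric about the slope, while the Jolicoeur-Mosimann interval scales the slope multiplicatively and is therefore asymmetric.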