You can use pd.get_dummies:
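import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
     'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')}
df = pd.DataFrame(d)
# One indicator column per color; dtype=int gives 0/1 rather than booleans
dummies = pd.get_dummies(df['d'], dtype=int)
df = pd.concat([df, dummies], axis=1)
# Drop the original column and one dummy (green), which becomes the baseline
df = df.drop(['d', 'green'], axis=1)
print(df)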
which yields:
   a  b   c  blue  orange  red
0  1  5   9     0       0    1
1  2  6  10     1       0    0
2  3  7  11     0       0    0
3  4  8  12     0       0    1
4  3  4   3     0       1    0
5  3  4   3     1       0    0
6  3  4   3     0       0    1
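If you would rather not drop the baseline column by hand, one possible variation (a sketch, not part of the original answer) is to order the categories so that the baseline comes first and pass drop_first=True. The variable names colors, ordered and dummies are only illustrative, and the category order shown is an assumption chosen so that green is the level that gets dropped.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'])

# Assumption: green is the intended baseline, so it is listed first.
ordered = colors.astype(
    pd.CategoricalDtype(categories=['green', 'blue', 'orange', 'red']))

# drop_first=True removes the first category (green here), leaving one
# indicator column for each remaining color.
dummies = pd.get_dummies(ordered, drop_first=True, dtype=int)
print(dummies)
```

Rows whose color is green end up with zeros in every remaining column, which is what makes green the reference level.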
import statsmodels.formula.api as smf

# Regress a on b, c and the dummy columns; green is the omitted baseline
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())
which yields:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.149e+25
Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.64e-13
Time:                        05:57:33   Log-Likelihood:                 200.74
No. Observations:                   7   AIC:                            -389.5
Df Residuals:                       1   BIC:                            -389.8
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.6000   6.11e-13  -2.62e+12      0.000        -1.600    -1.600
b              1.6000   1.59e-13   1.01e+13      0.000         1.600     1.600
c             -0.6000   6.36e-14  -9.44e+12      0.000        -0.600    -0.600
blue         1.11e-16   3.08e-13      0.000      1.000     -3.91e-12  3.91e-12
orange      7.994e-15   3.87e-13      0.021      0.987     -4.91e-12  4.93e-12
red         4.829e-15   2.75e-13      0.018      0.989     -3.49e-12   3.5e-12
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.203
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.752
Skew:                           0.200   Prob(JB):                        0.687
Kurtosis:                       1.445   Cond. No.                         85.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
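Alternatively, you can let the formula encode the categorical column and name the reference level directly:
import pandas as pd
import statsmodels.formula.api as smf

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
     'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)
# C(...) treats d as categorical; Treatment(reference="green") sets the baseline
model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())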
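As a quick sanity check, the two encodings can be compared on the same data. The following is a minimal sketch, assuming pandas, NumPy and statsmodels are installed. The names manual, via_formula and same_fit are illustrative and do not come from the answer itself. It compares fitted values rather than coefficient labels, because the formula-based model names its dummy terms differently.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 3, 3],
    'b': [5, 6, 7, 8, 4, 4, 4],
    'c': [9, 10, 11, 12, 3, 3, 3],
    'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
})

# Approach 1: hand-built dummy columns, with green left out as the baseline.
dummies = pd.get_dummies(df['d'], dtype=int).drop(columns='green')
manual = smf.ols('a ~ b + c + blue + orange + red',
                 pd.concat([df, dummies], axis=1)).fit()

# Approach 2: the formula encodes d itself, with green as the reference level.
via_formula = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))',
                      df).fit()

# Both models should describe the same fit, so their predictions should agree.
same_fit = np.allclose(manual.fittedvalues, via_formula.fittedvalues)
print(same_fit)
```

Both models estimate six parameters (the intercept, b, c, and three color indicators), so agreement in fitted values suggests the two encodings are interchangeable for this data.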
References: