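For variables that represent categories (categorical variables), we should not simply assign numbers to the labels. Numeric codes imply an ordering and a distance between categories that do not exist, so they are meaningless to a regression. Instead, we should use one-hot encoding. (I don't know which other approaches might be more reasonable.)
Straight to an example: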
>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
...     'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
...     'debt_ratio': np.random.randn(5),
...     'cash_flow': np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> data
         industry  debt_ratio  cash_flow
0          mining    0.357440  88.856850
1  transportation    0.377538  89.457560
2     hospitality    1.382338  89.451292
3         finance    1.175549  90.208520
4   entertainment   -0.939276  90.212690
>>> data = pd.concat((
...     data,
...     pd.get_dummies(data['industry'], drop_first=True)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
         industry  debt_ratio  cash_flow  finance  hospitality  mining  transportation
0          mining    0.357440  88.856850        0            0       1               0
1  transportation    0.377538  89.457560        0            0       0               1
2     hospitality    1.382338  89.451292        0            1       0               0
3         finance    1.175549  90.208520        1            0       0               0
4   entertainment   -0.939276  90.212690        0            0       0               0
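The comment in the session above mentions dropping the original 'industry' column inside pd.concat(). Here is a minimal sketch of that variant as a plain script. The name model_data and the dtype=int argument are my own additions, not part of the original example. In pandas 2.0 and later, get_dummies returns boolean columns by default, so dtype=int is one way to keep the 0/1 output shown above.

```python
import numpy as np
import pandas as pd

np.random.seed(444)
data = pd.DataFrame({
    'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
    'debt_ratio': np.random.randn(5),
    'cash_flow': np.random.randn(5) + 90,
})

# dtype=int requests 0/1 integers instead of the boolean default of newer pandas.
dummies = pd.get_dummies(data['industry'], drop_first=True, dtype=int)

# Drop the text column so that only numeric columns are left for a regression.
model_data = pd.concat((data.drop('industry', axis=1), dummies), axis=1)
print(model_data.columns.tolist())
```

The resulting frame keeps debt_ratio and cash_flow and adds one indicator column per industry except the dropped first one.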
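The key piece is really the pd.get_dummies() function. Why pass drop_first=True? It prevents perfect linear dependence among the regressors (the dummy variable trap): if every category gets its own 0/1 column, those columns always sum to 1, which duplicates the intercept column of the regression. Dropping the first category (here 'entertainment', the first in alphabetical order) turns it into the baseline against which the other coefficients are measured. Details: http://facweb.cs.depaul.edu/sjost/csc423/documents/dummy-variable-trap.htm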
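To make the dummy trap concrete, the sketch below compares the rank of a design matrix with and without drop_first=True. It assumes the model includes an intercept column of ones (for example one added with sm.add_constant). The helper design_rank and the doubled industry list are hypothetical, made up only for this illustration.

```python
import numpy as np
import pandas as pd

# Each of the five industries appears twice (illustrative data only).
industry = pd.Series(
    ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'] * 2
)

def design_rank(drop_first):
    """Return (number of columns, rank) of [intercept | dummy columns]."""
    dummies = pd.get_dummies(industry, drop_first=drop_first, dtype=float)
    X = np.column_stack([np.ones(len(industry)), dummies.to_numpy()])
    return X.shape[1], int(np.linalg.matrix_rank(X))

print(design_rank(False))  # all dummies kept: more columns than rank
print(design_rank(True))   # first dummy dropped: columns equal rank
```

When all five dummy columns are kept, the matrix has six columns but only rank five, so the coefficients are not uniquely determined. With drop_first=True the column count and the rank agree.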
References:
https://stackoverflow.com/questions/55738056/using-categorical-variables-in-statsmodels-ols-class