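While using pandas to work with a dataset of business establishments, we ran into a problem. The dataset is a wide-format panel containing the employment count for each year (e.g., 2005, 2006, 2007, and so on). There is also a variable recording the year in which the establishment moved to a new location (e.g., 2006). We want to create a variable holding the employment count in the move year: that is, if the move year is x, look up the employment value for year x.
Ideally we would like to vectorize this. At the moment we use the code below, but we worry that the indexing is not general enough, or may be unsafe, and could give unexpected results on the real data.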
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
move = np.arange(2006,2010)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
moveyr = np.random.choice(move, N)
## place it in dataframe
jobs06 = np.random.randint(low=1,high=250,size=N)
jobs06 = np.random.randint(low=1,high=250,size=N)
jobs07 = np.random.randint(low=1,high=250,size=N)
jobs08 = np.random.randint(low=1,high=250,size=N)
jobs09 = np.random.randint(low=1,high=250,size=N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'moveyear': moveyr,
                        'jobs06': jobs06, 'jobs07': jobs07,
                        'jobs08': jobs08, 'jobs09': jobs09})
df_city.head()
+---+------------+------------+--------+--------+--------+--------+----------+
| | city | industry | jobs06 | jobs07 | jobs08 | jobs09 | moveyear |
+---+------------+------------+--------+--------+--------+--------+----------+
| 0 | sf | utilities | 206 | 82 | 192 | 236 | 2009 |
| 1 | oakland | utilities | 10 | 244 | 2 | 7 | 2007 |
| 2 | san mateo | finance | 182 | 164 | 49 | 66 | 2006 |
| 3 | oakland | sales | 27 | 228 | 33 | 169 | 2007 |
| 4 | san mateo | sales | 24 | 24 | 127 | 165 | 2007 |
+---+------------+------------+--------+--------+--------+--------+----------+
df_city['moveyearemp'] = 0  ## seemingly must declare first
for count, row in df_city.head(5).iterrows():
    get_moveyear_emp = 'jobs' + str(row['moveyear'])[2:]
    ## is this 'proper' indexing?
    ## (.loc is used here because .ix was removed in pandas 1.0)
    df_city.loc[count, 'moveyearemp'] = df_city.loc[count, get_moveyear_emp]
print(df_city['moveyearemp'].head())
0 236
1 244
2 182
3 228
4 24
Name: moveyearemp, dtype: int64
This code appears to give the expected result: for the first row (establishment), 236 is indeed the 2009 employment count; for the second row, 244 is indeed the 2007 count; and so on.
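One way to probe the indexing concern beyond eyeballing the first few rows is to run the same row-wise lookup on a tiny frame whose answers are obvious and whose index is deliberately not 0, 1, 2. The sketch below is only an illustration: the frame `check` and its values are made up, and it uses label-based `.loc` for both the read and the write.

```python
import pandas as pd

# Hypothetical hand-made data: each jobs column uses a different magnitude,
# so the correct lookup result is easy to read off.
check = pd.DataFrame(
    {
        "moveyear": [2009, 2007, 2006],
        "jobs06": [1, 2, 3],
        "jobs07": [10, 20, 30],
        "jobs08": [100, 200, 300],
        "jobs09": [1000, 2000, 3000],
    },
    index=[7, 3, 5],  # non-default labels, to expose label/position mix-ups
)

check["moveyearemp"] = 0
for label, row in check.iterrows():
    # Build the column name from the last two digits of the move year.
    col = "jobs" + str(row["moveyear"])[2:]
    # .loc is label-based, so `label` from iterrows() is the right key.
    check.loc[label, "moveyearemp"] = check.loc[label, col]

print(check["moveyearemp"].tolist())  # [1000, 20, 3]
```

Because `iterrows()` yields index labels and `.loc` consumes labels, the lookup stays correct even when the index is shuffled or non-contiguous.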
2. Solution
We can solve this by iterating over the years instead of the rows, since there are fewer distinct years than rows:
In [11]: df_city.moveyear.unique()
Out[11]: array([2009, 2007, 2006, 2008])
Here is one way to do it, though we do not think it is the best:
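g = df_city.groupby('moveyear')
df_city['moveyearemp'] = 0
## g.groups maps each year to row labels, which is what .loc expects
## (g.indices holds positions, which match labels only on a default index)
for year, ind in g.groups.items():
    year_abbr = str(year)[2:]
    df_city.loc[ind, 'moveyearemp'] = df_city.loc[ind, 'jobs%s' % year_abbr]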
This gives the following result:
In [21]: df_city.head()
Out[21]:
city industry jobs06 jobs07 jobs08 jobs09 moveyear moveyearemp
0 sf utilities 206 82 192 236 2009 236
1 oakland utilities 10 244 2 7 2007 244
2 san mateo finance 182 164 49 66 2006 182
3 oakland sales 27 228 33 169 2007 228
4 san mateo sales 24 24 127 165 2007 24
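If a fully vectorized version is wanted, one possible alternative (not the approach used above) is NumPy integer-array indexing: convert each row's move year into a column position, then pick one value per row from the jobs matrix. The sketch below is self-contained and uses the first four rows of the output above as data; the names `df`, `job_cols`, `wanted` and `col_pos` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# First four rows of the example output, jobs columns and move year only.
df = pd.DataFrame(
    {
        "moveyear": [2009, 2007, 2006, 2007],
        "jobs06": [206, 10, 182, 27],
        "jobs07": [82, 244, 164, 228],
        "jobs08": [192, 2, 49, 33],
        "jobs09": [236, 7, 66, 169],
    }
)

job_cols = ["jobs06", "jobs07", "jobs08", "jobs09"]

# Column name each row needs, e.g. 2009 -> "jobs09".
wanted = "jobs" + df["moveyear"].astype(str).str[2:]

# Position of that name within job_cols; get_indexer returns -1 for no match.
col_pos = pd.Index(job_cols).get_indexer(pd.Index(wanted))
if (col_pos < 0).any():
    raise KeyError("some move years have no matching jobs column")

# Pick one element per row: row i, column col_pos[i].
values = df[job_cols].to_numpy()
df["moveyearemp"] = values[np.arange(len(df)), col_pos]

print(df["moveyearemp"].tolist())  # [236, 244, 182, 228]
```

The explicit check for `-1` matters on real data: a move year outside the observed jobs columns would otherwise silently select the last column.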