Pandas: Getting a Column's Value Based on the Value of Another Variable

Article link: https://blog.csdn.net/D0126_/article/details/142560036

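While using pandas to process a dataset of business establishments, we ran into a problem. The dataset is a wide-format panel with one column of job counts per year (for example 2005, 2006, 2007, and so on). There is also a variable giving the year in which the establishment moved to a new location (for example 2006). We want to create a variable that holds the job count in the move year: if the move year is x, look up the employment value in the column for year x.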

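Ideally we would like to vectorize this. At the moment we use the code below, but we are concerned that the indexing is not general enough, or possibly unsafe, and could give unexpected results on the real data.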

import pandas as pd
import numpy as np
np.random.seed(43)

## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
move = np.arange(2006,2010)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
moveyr = np.random.choice(move, N)

## place it in dataframe
jobs06 = np.random.randint(low=1,high=250,size=N)
jobs06 = np.random.randint(low=1,high=250,size=N)
jobs07 = np.random.randint(low=1,high=250,size=N)
jobs08 = np.random.randint(low=1,high=250,size=N)
jobs09 = np.random.randint(low=1,high=250,size=N)


df_city = pd.DataFrame({'industry': ind, 'city': cty, 'moveyear': moveyr,
                        'jobs06': jobs06, 'jobs07': jobs07,
                        'jobs08': jobs08, 'jobs09': jobs09})

df_city.head()

+---+------------+------------+--------+--------+--------+--------+----------+
|   |    city    |  industry  | jobs06 | jobs07 | jobs08 | jobs09 | moveyear |
+---+------------+------------+--------+--------+--------+--------+----------+
| 0 |  sf        |  utilities |    206 |     82 |    192 |    236 |     2009 |
| 1 |  oakland   |  utilities |     10 |    244 |      2 |      7 |     2007 |
| 2 |  san mateo |  finance   |    182 |    164 |     49 |     66 |     2006 |
| 3 |  oakland   |  sales     |     27 |    228 |     33 |    169 |     2007 |
| 4 |  san mateo |  sales     |     24 |     24 |    127 |    165 |     2007 |
+---+------------+------------+--------+--------+--------+--------+----------+

df_city['moveyearemp'] = 0  ## seemingly must declare first
for count, row in df_city.head(5).iterrows():
    # column name for this row's move year, e.g. 2009 -> 'jobs09'
    get_moveyear_emp = 'jobs' + str(row['moveyear'])[2:]
    ## is this 'proper' indexing?
    # .loc (label-based) replaces .ix, which has been removed from pandas
    df_city.loc[count, 'moveyearemp'] = df_city.loc[count, get_moveyear_emp]
print(df_city['moveyearemp'].head())

0    236
1    244
2    182
3    228
4     24
Name: moveyearemp, dtype: int64

This code appears to give the expected result: for the first row (establishment), 236 is indeed the 2009 job count; for the second row, 244 is indeed the 2007 job count, and so on.
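One way to check the result beyond eyeballing a few rows is to recompute it with a row-wise `apply` and compare. The sketch below is not part of the original code: it assumes a small hand-built frame copied from the first five rows shown above, and it gives that frame a deliberately non-sequential index (the labels are made up) so that any confusion between label-based and position-based indexing would show up.

```python
import pandas as pd

# First five rows of the sample output above. The index labels are
# hypothetical and deliberately out of order.
df = pd.DataFrame(
    {
        "moveyear": [2009, 2007, 2006, 2007, 2007],
        "jobs06": [206, 10, 182, 27, 24],
        "jobs07": [82, 244, 164, 228, 24],
        "jobs08": [192, 2, 49, 33, 127],
        "jobs09": [236, 7, 66, 169, 165],
    },
    index=[10, 3, 7, 42, 5],
)


def jobs_in_move_year(row):
    # Build the column name for this row's move year, e.g. 2009 -> 'jobs09',
    # then read that column's value from the same row.
    return row["jobs" + str(row["moveyear"])[2:]]


# axis=1 calls the function once per row; slow on large data, but simple
# enough to serve as a reference result.
check = df.apply(jobs_in_move_year, axis=1)
print(check.tolist())
```

Because `apply` works row by row, it is not a fast option for large data; it is only meant as a reference to compare a faster implementation against.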

2. Solution

We can solve this by iterating over the years instead of the rows, since there are far fewer distinct years than rows:

In [11]: df_city.moveyear.unique()
Out[11]: array([2009, 2007, 2006, 2008])

Here is one way to do it, although we do not think it is the best way:

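g = df_city.groupby('moveyear')
df_city['moveyearemp'] = 0
# g.groups maps each move year to the index labels of its rows, which is
# what .loc expects (g.indices holds positions and only matches the labels
# when the index is the default 0..N-1 range)
for year, labels in g.groups.items():
    year_abbr = str(year)[2:]
    df_city.loc[labels, 'moveyearemp'] = df_city.loc[labels, 'jobs%s' % year_abbr]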

This gives the following result:

In [21]: df_city.head()
Out[21]: 
        city   industry  jobs06  jobs07  jobs08  jobs09  moveyear  moveyearemp
0         sf  utilities     206      82     192     236      2009          236
1    oakland  utilities      10     244       2       7      2007          244
2  san mateo    finance     182     164      49      66      2006          182
3    oakland      sales      27     228      33     169      2007          228
4  san mateo      sales      24      24     127     165      2007           24
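Since the stated goal was a vectorized lookup, here is a minimal sketch of one way to avoid the Python loop entirely by using NumPy integer indexing. It is an alternative technique, not the approach from the answer above, and it assumes the job columns are named `jobs06` through `jobs09` as in the mock data. The frame is rebuilt by hand from the five rows shown, and the names `job_cols`, `target_cols`, and `col_pos` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The five rows shown above, rebuilt by hand so the example is self-contained.
df = pd.DataFrame(
    {
        "moveyear": [2009, 2007, 2006, 2007, 2007],
        "jobs06": [206, 10, 182, 27, 24],
        "jobs07": [82, 244, 164, 228, 24],
        "jobs08": [192, 2, 49, 33, 127],
        "jobs09": [236, 7, 66, 169, 165],
    }
)

job_cols = ["jobs06", "jobs07", "jobs08", "jobs09"]

# Column name each row should read from, e.g. 2009 -> 'jobs09'.
target_cols = "jobs" + df["moveyear"].astype(str).str[2:]

# Position of each target column within job_cols; -1 marks "not found".
col_pos = pd.Index(job_cols).get_indexer(target_cols)
if (col_pos < 0).any():
    raise ValueError("some move years have no matching jobs column")

# Pick one value per row: row i, column col_pos[i].
values = df[job_cols].to_numpy()
df["moveyearemp"] = values[np.arange(len(df)), col_pos]

print(df[["moveyear", "moveyearemp"]])
```

The row positions come from `np.arange(len(df))` rather than from the index labels, so this sketch does not depend on the frame having a default index. The explicit check on `-1` is there because a move year without a matching column would otherwise silently read from the last column.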