Data Analysis Engineer — Lecture 03: Pandas Data Processing Skills (Part 2)

Contents
  • Grouping/groupby
  • Aggregation/agg
  • Concatenation/concat
  • Merging/merge/join
  • Mini projects/projects
Grouping/groupby

As an example, suppose we have a table of income records for each employee in a company:

import pandas as pd
import numpy as np
salaries = pd.DataFrame({
    'Name':['BOSS','HanMeimei','HanMeimei','Han','BOSS','BOSS','HanMeimei','BOSS'],
    'Year':[2016,2016,2016,2016,2017,2017,2017,2017],
    'Salary':[40000,5000,4000,3000,38000,42000,6000,39000],
    'Bonus':[12000,3000,3500,1200,16000,18000,7000,21000]
})
salaries
   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017

groupby does exactly what the name says: it partitions the rows into groups according to the values of a given column.

group_by_name = salaries.groupby('Name')
type(group_by_name)
pandas.core.groupby.DataFrameGroupBy
# iterate over the groups
for item in group_by_name:
    print(item)
    print(item[0])  # the group key (the Name value)
    print(item[1])  # the group's rows as a DataFrame
    print("\n")
('BOSS',    Bonus  Name  Salary  Year
0  12000  BOSS   40000  2016
4  16000  BOSS   38000  2017
5  18000  BOSS   42000  2017
7  21000  BOSS   39000  2017)
BOSS
   Bonus  Name  Salary  Year
0  12000  BOSS   40000  2016
4  16000  BOSS   38000  2017
5  18000  BOSS   42000  2017
7  21000  BOSS   39000  2017


('Han',    Bonus Name  Salary  Year
3   1200  Han    3000  2016)
Han
   Bonus Name  Salary  Year
3   1200  Han    3000  2016


('HanMeimei',    Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017)
HanMeimei
   Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017
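Besides iterating, a single group can also be pulled out directly with get_group; a minimal sketch using the group_by_name object from above:

group_by_name.get_group('BOSS')  # the BOSS rows as a DataFrame, no loop needed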


After grouping with groupby, you can run statistical aggregation operations on each group.

Group-wise sums
group_by_name.sum()  # sum per group
           Bonus  Salary  Year
Name                          
BOSS       67000  159000  8067
Han         1200    3000  2016
HanMeimei  13500   15000  6049
group_by_name.mean()  # mean per group
             Bonus   Salary         Year
Name                                    
BOSS       16750.0  39750.0  2016.750000
Han         1200.0   3000.0  2016.000000
HanMeimei   4500.0   5000.0  2016.333333
Selecting specific columns for statistics
group_by_name[['Bonus','Salary']].sum() 
           Bonus  Salary
Name                    
BOSS       67000  159000
Han         1200    3000
HanMeimei  13500   15000
Group order: sort=False keeps groups in order of first appearance instead of sorting them by Name
salaries.groupby('Name', sort=False).agg(sum)
           Bonus  Salary  Year
Name                          
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016
salaries.groupby('Name', sort=False).sum()
           Bonus  Salary  Year
Name                          
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016
# sum: total    mean: average    median: middle value
salaries.groupby('Name').median()
           Bonus  Salary  Year
Name                          
BOSS       17000   39500  2017
Han         1200    3000  2016
HanMeimei   3500    5000  2016
Frequency: how many times each value appears
salaries.groupby('Name').size()  # number of rows per group
Name
BOSS         4
Han          1
HanMeimei    3
dtype: int64
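The same counts can be read straight off the column with value_counts(); note that it orders by count (descending), while groupby().size() orders by key. A sketch:

salaries['Name'].value_counts()  # same frequencies, ordered by count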
salaries.info()  # basic info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Bonus     8 non-null int64
Name      8 non-null object
Salary    8 non-null int64
Year      8 non-null int64
dtypes: int64(3), object(1)
memory usage: 336.0+ bytes
salaries.describe()  # descriptive statistics
              Bonus        Salary         Year
count      8.000000      8.000000     8.000000
mean   10212.500000  22125.000000  2016.500000
std     7581.262145  18893.971072     0.534522
min     1200.000000   3000.000000  2016.000000
25%     3375.000000   4750.000000  2016.000000
50%     9500.000000  22000.000000  2016.500000
75%    16500.000000  39250.000000  2017.000000
max    21000.000000  42000.000000  2017.000000
Descriptive statistics per group
salaries.groupby('Name').describe()  # describe() for each Name group
          Bonus                                                               Salary           ...    Year
          count     mean          std      min      25%      50%      75%      max  count     mean  ...     75%      max  count         mean      std     min      25%     50%     75%     max
Name                                                                                                 ...
BOSS        4.0  16750.0  3774.917218  12000.0  15000.0  17000.0  18750.0  21000.0    4.0  39750.0  ... 40500.0  42000.0    4.0  2016.750000  0.50000  2016.0  2016.75  2017.0  2017.0  2017.0
Han         1.0   1200.0          NaN   1200.0   1200.0   1200.0   1200.0   1200.0    1.0   3000.0  ...  3000.0   3000.0    1.0  2016.000000      NaN  2016.0  2016.00  2016.0  2016.0  2016.0
HanMeimei   3.0   4500.0  2179.449472   3000.0   3250.0   3500.0   5250.0   7000.0    3.0   5000.0  ...  5500.0   6000.0    3.0  2016.333333  0.57735  2016.0  2016.00  2016.0  2016.5  2017.0

3 rows × 24 columns

salaries.groupby('Name')[['Bonus', 'Salary']].agg(['sum', 'mean', 'std', 'median'])
           Bonus                                 Salary                             
             sum   mean          std  median        sum   mean          std  median
Name                                                                               
BOSS       67000  16750  3774.917218   17000     159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200       3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500      15000   5000  1000.000000    5000
salaries.groupby('Name')[['Bonus','Salary']].agg([np.sum, np.mean, np.std, np.median])
           Bonus                                 Salary                             
             sum   mean          std  median        sum   mean          std  median
Name                                                                               
BOSS       67000  16750  3774.917218   17000     159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200       3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500      15000   5000  1000.000000    5000
salaries.groupby('Name')[['Bonus']].agg([np.sum, np.mean, np.std, np.median])
           Bonus                            
             sum   mean          std  median
Name                                        
BOSS       67000  16750  3774.917218   17000
Han         1200   1200          NaN    1200
HanMeimei  13500   4500  2179.449472    3500
type(salaries['Bonus'])
pandas.core.series.Series
type(salaries[['Bonus']])
pandas.core.frame.DataFrame
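The same single-vs-double-bracket distinction carries over to groupby column selection; a small sketch with the group_by_name object from above:

group_by_name['Bonus'].sum()    # single brackets -> result is a Series
group_by_name[['Bonus']].sum()  # double brackets -> result is a one-column DataFrame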
Transformation/transform
nvda = pd.read_csv('1_pandas_part2_data/pandas_part2_data/NVDA.csv', index_col=0, parse_dates=['Date'])
# use the first column as the index; parse the Date column as datetimes
nvda.head()  # first five rows
                Open      High       Low     Close  Adj Close    Volume
Date                                                                   
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000
nvda.loc[:,'year'] = nvda.index.year
nvda.head()
                Open      High       Low     Close  Adj Close    Volume  year
Date                                                                         
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999
Calling unique() on a Series returns the distinct values in a column.
Calling value_counts() on a Series returns the distinct values in a column together with how many times each one appears.
How many distinct values does a column have => unique()
# distinct values of the year column
nvda['year'].unique()
array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], dtype=int64)
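If only the number of distinct values is needed, nunique() returns it directly:

nvda['year'].nunique()  # 19 distinct years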
How many distinct values does a column have, and how often does each one appear => value_counts()
# counts of each distinct value in the year column
nvda['year'].value_counts()
2008    253
2015    252
2011    252
2004    252
2016    252
2005    252
2009    252
2013    252
2000    252
2002    252
2010    252
2014    252
2003    252
2006    251
2007    251
2012    250
2001    248
1999    239
2017    138
Name: year, dtype: int64
tmp = nvda['year'].value_counts().to_frame()  # convert the Series to a DataFrame
tmp = tmp.reset_index()
tmp.columns = ['year','count']
tmp.head()
   year  count
0  2008    253
1  2015    252
2  2011    252
3  2004    252
4  2016    252
tmp.sort_values(by='year').head()
    year  count
17  1999    239
8   2000    252
16  2001    248
9   2002    252
12  2003    252
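The to_frame / reset_index / sort_values dance above can be shortened: value_counts() returns a Series indexed by the values themselves, so sorting that index gives the same ordering. A sketch:

nvda['year'].value_counts().sort_index().head()  # counts ordered by year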
nvda.head()
                Open      High       Low     Close  Adj Close    Volume  year
Date                                                                         
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999
nvda.groupby('year').agg(['mean', 'std'])
            Open                  High                   Low                 Close              Adj Close                   Volume              
            mean        std       mean        std        mean        std        mean        std        mean        std          mean           std
year                                                                                                                                              
1999    1.950782   0.588882   2.007317   0.614302    1.883559   0.571658    1.947230   0.601041    1.808134   0.558107  6.433220e+06  8.142949e+06
2000    8.781084   2.999908   9.222697   3.114186    8.360522   2.904761    8.778826   3.013104    8.151729   2.797869  1.104182e+07  7.985374e+06
2001   13.091254   3.839777  13.600750   3.829838   12.680548   3.830944   13.181552   3.833637   12.239956   3.559789  2.782387e+07  1.384318e+07
2002    9.690344   6.561287   9.955093   6.664226    9.344391   6.375212    9.614749   6.519053    8.927940   6.053379  3.168655e+07  1.558742e+07
2003    5.902434   1.461862   6.042659   1.491260    5.764960   1.423422    5.900344   1.459852    5.478865   1.355570  2.430220e+07  1.899657e+07
2004    6.484735   1.467445   6.608810   1.482036    6.353558   1.444797    6.465913   1.456575    6.004034   1.352528  1.706331e+07  1.191968e+07
2005    9.512381   1.580061   9.659656   1.591274    9.353175   1.571138    9.513823   1.589762    8.834223   1.476201  1.542825e+07  9.623837e+06
2006   18.057902   3.675092  18.425126   3.718616   17.720279   3.657584   18.095963   3.700960   16.803316   3.436590  1.534446e+07  6.616879e+06
2007   27.762045   6.111437  28.251673   6.225662   27.206056   5.902620   27.724542   6.087681   25.744098   5.652820  1.514562e+07  5.818216e+06
2008   16.004308   6.862760  16.426245   6.964528   15.521462   6.696381   15.945613   6.811527   14.806572   6.324960  2.022721e+07  8.552974e+06
2009   11.825119   2.638097  12.114762   2.612499   11.565952   2.640537   11.850873   2.631664   11.004331   2.443677  1.919821e+07  8.291987e+06
2010   13.576349   2.888884  13.802659   2.904905   13.318532   2.843065   13.563175   2.884261   12.594318   2.678230  1.853295e+07  8.434693e+06
2011   16.912540   3.404884  17.267540   3.490368   16.512143   3.305507   16.887540   3.404032   15.681214   3.160872  2.289352e+07  1.270114e+07
2012   13.526200   1.176957  13.717400   1.191775   13.319800   1.165419   13.507880   1.185139   12.551166   1.091736  1.207757e+07  5.050116e+06
2013   14.173571   1.251508  14.329802   1.253287   14.035278   1.253372   14.189127   1.250883   13.412278   1.260152  8.843986e+06  4.202323e+06
2014   18.543056   1.293283  18.745476   1.283480   18.348214   1.276038   18.547064   1.284932   17.875053   1.312833  7.098902e+06  3.140560e+06
2015   23.680595   4.106327  23.979524   4.152229   23.411071   4.079351   23.718254   4.128879   23.262283   4.154678  7.756520e+06  3.933075e+06
2016   53.630833  21.714540  54.415397  22.182621   52.895119  21.263517   53.761190  21.803927   53.475737  21.824367  1.107062e+07  7.547056e+06
2017  120.481305  22.027821 122.300725  22.510244  118.402754  21.281863  120.547971  21.991898  120.436863  22.056290  1.907742e+07  1.073342e+07
# standardize: subtract the mean and divide by the standard deviation (z-score)
def my_transform(x):
    return (x - x.mean()) / x.std()
tmp_arr = np.array(range(10))
tmp_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
my_transform(tmp_arr)
array([-1.5666989 , -1.21854359, -0.87038828, -0.52223297, -0.17407766,
        0.17407766,  0.52223297,  0.87038828,  1.21854359,  1.5666989 ])
transformed = nvda.groupby('year').transform(my_transform)
transformed.head()
                Open      High       Low     Close  Adj Close    Volume
Date                                                                   
1999-01-22 -0.340955 -0.088217 -0.579850 -0.510124  -0.510124  7.544438
1999-01-25 -0.305578 -0.283222 -0.424964 -0.224161  -0.224161  0.777210
1999-01-26 -0.199444 -0.223871 -0.415854 -0.458130  -0.458131  0.263637
1999-01-27 -0.464778 -0.469747 -0.525185 -0.466795  -0.466798 -0.039791
1999-01-28 -0.482465 -0.537575 -0.406741 -0.475462  -0.475461 -0.091517
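As a sanity check (a sketch, not part of the original notebook): since my_transform standardizes each group, every column should now have mean ≈ 0 and std ≈ 1 within each year. Index alignment lets us group the transformed frame by the original year column:

transformed['Close'].groupby(nvda['year']).agg(['mean', 'std']).head()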
%matplotlib inline
compare_df = pd.DataFrame({'Origin': nvda['Adj Close'], 'Transformed': transformed['Adj Close']})
compare_df.head()
              Origin  Transformed
Date                             
1999-01-22  1.523430    -0.510124
1999-01-25  1.683028    -0.224161
1999-01-26  1.552448    -0.458131
1999-01-27  1.547611    -0.466798
1999-01-28  1.542776    -0.475461
compare_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2354e970c18>

[Figure: line plot of the original vs. transformed 'Adj Close' series]

compare_df['Transformed'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x235517b0978>

[Figure: line plot of the transformed 'Adj Close' series]

Transforming columns with apply
salaries
   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017
salaries.loc[:,'tmp_col'] = (salaries['Salary']*2-1500)/0.8
salaries
   Bonus       Name  Salary  Year   tmp_col
0  12000       BOSS   40000  2016   98125.0
1   3000  HanMeimei    5000  2016   10625.0
2   3500  HanMeimei    4000  2016    8125.0
3   1200        Han    3000  2016    5625.0
4  16000       BOSS   38000  2017   93125.0
5  18000       BOSS   42000  2017  103125.0
6   7000  HanMeimei    6000  2017   13125.0
7  21000       BOSS   39000  2017   95625.0
def trans(x):
    return (x*2-1500)/0.8
salaries.loc[:,'tmp_col2'] = salaries['Salary'].apply(trans)
salaries
   Bonus       Name  Salary  Year   tmp_col  tmp_col2
0  12000       BOSS   40000  2016   98125.0   98125.0
1   3000  HanMeimei    5000  2016   10625.0   10625.0
2   3500  HanMeimei    4000  2016    8125.0    8125.0
3   1200        Han    3000  2016    5625.0    5625.0
4  16000       BOSS   38000  2017   93125.0   93125.0
5  18000       BOSS   42000  2017  103125.0  103125.0
6   7000  HanMeimei    6000  2017   13125.0   13125.0
7  21000       BOSS   39000  2017   95625.0   95625.0
def trans2(x):
    if x=='Han':
        return 'HanXiaoyang'
    else:
        return x
salaries.loc[:,'full_name'] = salaries['Name'].apply(trans2)
salaries
   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS
help(pd.Series.apply)
Help on function apply in module pandas.core.series:

apply(self, func, convert_dtype=True, args=(), **kwds)
    Invoke function on values of Series. Can be ufunc (a NumPy function
    that applies to the entire Series) or a Python function that only works
    on single values
    
    Parameters
    ----------
    func : function
    convert_dtype : boolean, default True
        Try to find better dtype for elementwise function results. If
        False, leave as dtype=object
    args : tuple
        Positional arguments to pass to function in addition to the value
    Additional keyword arguments will be passed as keywords to the function
    
    Returns
    -------
    y : Series or DataFrame if func returns a Series
    
    See also
    --------
    Series.map: For element-wise operations
    Series.agg: only perform aggregating type operations
    Series.transform: only perform transformating type operations
    
    Examples
    --------
    
    Create a series with typical summer temperatures for each city.
    
    >>> import pandas as pd
    >>> import numpy as np
    >>> series = pd.Series([20, 21, 12], index=['London',
    ... 'New York','Helsinki'])
    >>> series
    London      20
    New York    21
    Helsinki    12
    dtype: int64
    
    Square the values by defining a function and passing it as an
    argument to ``apply()``.
    
    >>> def square(x):
    ...     return x**2
    >>> series.apply(square)
    London      400
    New York    441
    Helsinki    144
    dtype: int64
    
    Square the values by passing an anonymous function as an
    argument to ``apply()``.
    
    >>> series.apply(lambda x: x**2)
    London      400
    New York    441
    Helsinki    144
    dtype: int64
    
    Define a custom function that needs additional positional
    arguments and pass these additional arguments using the
    ``args`` keyword.
    
    >>> def subtract_custom_value(x, custom_value):
    ...     return x-custom_value
    
    >>> series.apply(subtract_custom_value, args=(5,))
    London      15
    New York    16
    Helsinki     7
    dtype: int64
    
    Define a custom function that takes keyword arguments
    and pass these arguments to ``apply``.
    
    >>> def add_custom_values(x, **kwargs):
    ...     for month in kwargs:
    ...         x+=kwargs[month]
    ...         return x
    
    >>> series.apply(add_custom_values, june=30, july=20, august=25)
    London      95
    New York    96
    Helsinki    87
    dtype: int64
    
    Use a function from the Numpy library.
    
    >>> series.apply(np.log)
    London      2.995732
    New York    3.044522
    Helsinki    2.484907
    dtype: float64

salaries
   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS
A quick aside: Python's conditional expression, which we will use inside a lambda next.
gender = 'male'
sex = '男' if gender=='male' else '女'
sex
'男'
Anonymous functions/lambda
salaries.loc[:,'new_name'] = salaries['Name'].apply(lambda x: 'HanXiaoyang' if x=='Han' else x)
Operating on several columns at once
# for the BOSS, return salary + bonus; for everyone else, just the salary
def my_fun(name, salary, bonus):
    if name=='BOSS':
        return salary+bonus
    else:
        return salary
salaries.loc[:,'my_s_result'] = list(map(lambda x,y,z:my_fun(x,y,z), \
                                         salaries['Name'],\
                                         salaries['Salary'],\
                                         salaries['Bonus']))
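An equivalent route (a sketch, not from the original) is a row-wise apply with axis=1, which hands each row to the function as a Series; the column name my_s_result2 is just an illustrative choice:

# hypothetical column name; same values as my_s_result above
salaries.loc[:, 'my_s_result2'] = salaries.apply(
    lambda row: my_fun(row['Name'], row['Salary'], row['Bonus']), axis=1)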
Summary
  • groupby: retrieve group contents
  • after groupby, compute statistics: agg([np.sum, 'median', 'std'])
  • after groupby: describe, transform
  • apply: transform a column with a custom function
  • extra: to transform several columns at once, map(lambda x,y,z,a: my_fun(x,y,z,a), df['x'], df['y'], ...) — see the recap sketch below
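A compact recap sketch over the salaries frame from this section (all names as defined above):

grouped = salaries.groupby('Name')
grouped['Salary'].agg([np.sum, 'median', 'std'])       # per-group aggregation
grouped[['Salary']].transform(lambda s: s - s.mean())  # group-wise transform, keeps the original shape
salaries['Salary'].apply(lambda x: x * 12)             # element-wise column transform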

Concatenating and merging data

  • concat
  • merge
  • join
df1 = pd.DataFrame({'apts':[55000,60000], 'cars':[200000,300000]}, index=['Shanghai','Beijing'])
df1
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
df2 = pd.DataFrame({'apts':[35000, 45000], 'cars':[150000, 180000]}, index=['Hangzhou','Guangzhou'])
df2
            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000
df3 = pd.DataFrame({'apts':[30000, 10000], 'cars':[120000, 100000]}, index=['Nanjing','Chongqing'])
df3
            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000
# concat
result = pd.concat([df1,df2,df3])
result
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000
# axis=1: concatenate side by side, aligning on the row index
pd.concat([df1,df2,df3], axis=1)
              apts      cars     apts      cars     apts      cars
Beijing    60000.0  300000.0      NaN       NaN      NaN       NaN
Chongqing      NaN       NaN      NaN       NaN  10000.0  100000.0
Guangzhou      NaN       NaN  45000.0  180000.0      NaN       NaN
Hangzhou       NaN       NaN  35000.0  150000.0      NaN       NaN
Nanjing        NaN       NaN      NaN       NaN  30000.0  120000.0
Shanghai   55000.0  200000.0      NaN       NaN      NaN       NaN
# axis=0 (the default): stack vertically, aligning on column labels
pd.concat([df1,df2,df3], axis=0)
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000
# append: row-wise shortcut for concat
df1.append(df1)
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shanghai  55000  200000
Beijing   60000  300000
result
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000
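concat can also tag each row with the frame it came from via the keys parameter; a sketch (the tier labels are made up for illustration):

pd.concat([df1, df2, df3], keys=['tier1', 'tier2', 'tier3'])  # adds an outer index level per source frame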
Merging/merge
# bucket apartment prices into two tiers (45000 and 60000)
def my_trans_apts(x):
    if x<45000:
        return 45000
    else:
        return 60000
result.loc[:,'new_apts'] = result['apts'].apply(my_trans_apts)
result
            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000
new_df = pd.DataFrame({'new_apts':[60000,45000], 'bonus':[100000, 50000]})
new_df
    bonus  new_apts
0  100000     60000
1   50000     45000
pd.merge(result, new_df, on=['new_apts'], how='inner')
    apts    cars  new_apts   bonus
0  55000  200000     60000  100000
1  60000  300000     60000  100000
2  45000  180000     60000  100000
3  35000  150000     45000   50000
4  30000  120000     45000   50000
5  10000  100000     45000   50000
new_df2 = pd.DataFrame({'new_apts':[65000,45000], 'bonus':[100000, 50000]})
new_df2
    bonus  new_apts
0  100000     65000
1   50000     45000
result
            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000
pd.merge(result, new_df2, on=['new_apts'], how='left')
    apts    cars  new_apts    bonus
0  55000  200000     60000      NaN
1  60000  300000     60000      NaN
2  35000  150000     45000  50000.0
3  45000  180000     60000      NaN
4  30000  120000     45000  50000.0
5  10000  100000     45000  50000.0
pd.merge(result, new_df2, on=['new_apts'], how='right')
      apts      cars  new_apts   bonus
0  35000.0  150000.0     45000   50000
1  30000.0  120000.0     45000   50000
2  10000.0  100000.0     45000   50000
3      NaN       NaN     65000  100000
df1
           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
df2
            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000
df3
            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000
df4 = pd.DataFrame({'salaries':[10000,30000,30000,20000,15000]}, index=['Suzhou', 'Beijing','Shanghai','Guangzhou','Tianjin'])
df4
           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000
join: merges data based on the index
df1.join(df4)
           apts    cars  salaries
Shanghai  55000  200000     30000
Beijing   60000  300000     30000
df2.join(df4)
            apts    cars  salaries
Hangzhou   35000  150000       NaN
Guangzhou  45000  180000   20000.0
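Like merge, join accepts a how parameter (default 'left'); a sketch:

df2.join(df4, how='inner')  # keep only cities present in both frames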
help(pd.DataFrame.join)
Help on function join in module pandas.core.frame:

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
    Join columns with other DataFrame either on index or on a key
    column. Efficiently Join multiple DataFrame objects by index at once by
    passing a list.
    
    Parameters
    ----------
    other : DataFrame, Series with name field set, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame
    on : column name, tuple/list of column names, or array-like
        Column(s) in the caller to join on the index in other,
        otherwise joins index-on-index. If multiples
        columns given, the passed DataFrame must have a MultiIndex. Can
        pass an array as the join key if not already contained in the
        calling DataFrame. Like an Excel VLOOKUP operation
    how : {'left', 'right', 'outer', 'inner'}, default: 'left'
        How to handle the operation of the two objects.
    
        * left: use calling frame's index (or column if on is specified)
        * right: use other frame's index
        * outer: form union of calling frame's index (or column if on is
          specified) with other frame's index, and sort it
          lexicographically
        * inner: form intersection of calling frame's index (or column if
          on is specified) with other frame's index, preserving the order
          of the calling's one
    lsuffix : string
        Suffix to use from left frame's overlapping columns
    rsuffix : string
        Suffix to use from right frame's overlapping columns
    sort : boolean, default False
        Order result DataFrame lexicographically by the join key. If False,
        the order of the join key depends on the join type (how keyword)
    
    Notes
    -----
    on, lsuffix, and rsuffix options are not supported when passing a list
    of DataFrame objects
    
    Examples
    --------
    >>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
    ...                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
    
    >>> caller
        A key
    0  A0  K0
    1  A1  K1
    2  A2  K2
    3  A3  K3
    4  A4  K4
    5  A5  K5
    
    >>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
    ...                       'B': ['B0', 'B1', 'B2']})
    
    >>> other
        B key
    0  B0  K0
    1  B1  K1
    2  B2  K2
    
    Join DataFrames using their indexes.
    
    >>> caller.join(other, lsuffix='_caller', rsuffix='_other')
    
    >>>     A key_caller    B key_other
        0  A0         K0   B0        K0
        1  A1         K1   B1        K1
        2  A2         K2   B2        K2
        3  A3         K3  NaN       NaN
        4  A4         K4  NaN       NaN
        5  A5         K5  NaN       NaN


    If we want to join using the key columns, we need to set key to be
    the index in both caller and other. The joined DataFrame will have
    key as its index.

    >>> caller.set_index('key').join(other.set_index('key'))
    
    >>>      A    B
        key
        K0   A0   B0
        K1   A1   B1
        K2   A2   B2
        K3   A3  NaN
        K4   A4  NaN
        K5   A5  NaN
    
    Another option to join using the key columns is to use the on
    parameter. DataFrame.join always uses other's index but we can use any
    column in the caller. This method preserves the original caller's
    index in the result.
    
    >>> caller.join(other.set_index('key'), on='key')
    
    >>>     A key    B
        0  A0  K0   B0
        1  A1  K1   B1
        2  A2  K2   B2
        3  A3  K3  NaN
        4  A4  K4  NaN
        5  A5  K5  NaN


    See also
    --------
    DataFrame.merge : For column(s)-on-columns(s) operations

    Returns
    -------
    joined : DataFrame

Summary
  • concat: concatenation; axis selects the dimension to concatenate along
  • merge: relational join on one or more key columns
  • join: merges data on the index — see the sketch below for how the last two relate
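An index-based join can be spelled as a merge on both indexes; a sketch of the equivalence:

df1.join(df4)  # index-on-index, how='left' by default
pd.merge(df1, df4, left_index=True, right_index=True, how='left')  # same result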

Mini case study: bike rental analysis

bikes = pd.read_csv('1_pandas_part2_data/pandas_part2_data/bikes.csv', sep=';', encoding='latin1', \
                    parse_dates=['Date'], index_col='Date')
bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 to 2012-05-11
Data columns (total 9 columns):
Berri 1                                310 non-null int64
Brébeuf (données non disponibles)      0 non-null float64
Côte-Sainte-Catherine                  310 non-null int64
Maisonneuve 1                          310 non-null int64
Maisonneuve 2                          310 non-null int64
du Parc                                310 non-null int64
Pierre-Dupuy                           310 non-null int64
Rachel1                                310 non-null int64
St-Urbain (données non disponibles)    0 non-null float64
dtypes: float64(2), int64(7)
memory usage: 24.2 KB
bikes.head()
            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  St-Urbain (données non disponibles)
Date                                                                                                                                                                            
2012-01-01       35                                NaN                      0             38             51       26            10       16                                  NaN
2012-02-01       83                                NaN                      1             68            153       53             6       43                                  NaN
2012-03-01      135                                NaN                      2            104            248       89             3       58                                  NaN
2012-04-01      144                                NaN                      1            116            318      111             8       61                                  NaN
2012-05-01      197                                NaN                      2            124            330       97            13       95                                  NaN
bikes.shape
(310, 9)
#dropna removes missing data; by default it drops every row containing a NaN (here the two all-empty columns put a NaN in every row, so nothing survives)
bikes.dropna()
Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []
bikes.dropna(axis=1, how='all').head()  # drop columns whose values are all missing
            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
Date                                                                                                    
2012-01-01       35                      0             38             51       26            10       16
2012-02-01       83                      1             68            153       53             6       43
2012-03-01      135                      2            104            248       89             3       58
2012-04-01      144                      1            116            318      111             8       61
2012-05-01      197                      2            124            330       97            13       95
bikes.shape
(310, 9)
bikes.dropna(axis=1, how='all', inplace=True)
bikes.shape
(310, 7)
bikes.loc[:,'weekday'] = bikes.index.weekday
bikes.head()
            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  weekday
Date                                                                                                             
2012-01-01       35                      0             38             51       26            10       16        6
2012-02-01       83                      1             68            153       53             6       43        2
2012-03-01      135                      2            104            248       89             3       58        3
2012-04-01      144                      1            116            318      111             8       61        6
2012-05-01      197                      2            124            330       97            13       95        1
weekday_counts = bikes.groupby('weekday').agg(sum)
weekday_counts.head()
         Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
weekday                                                                                              
0         132446                  57940          90828         163042    89338         41524   126215
1         119895                  52113          80865         145389    79585         35967   114622
2         146785                  64189          99674         177105    96340         45103   130796
3         147630                  61855         102801         177285    93386         46600   135268
4         150183                  61432         102317         181651    95731         47272   143115
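For readability (a sketch, not in the original), the numeric index can be replaced with day names before plotting; pandas weekday 0 is Monday:

weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                        'Friday', 'Saturday', 'Sunday']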
%matplotlib inline
weekday_counts['Berri 1'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f157e27c4e0>

[Figure: bar chart of 'Berri 1' counts by weekday]
