Data Analysis Engineer, Lecture 03: Pandas Data Processing Skills (Part 2)
Contents
Grouping / groupby
Aggregation / agg
Concatenation / concat
Merging / merge, join
Mini project
Grouping / groupby
For example, suppose we have a record of every employee's income at a company:
import pandas as pd
import numpy as np

salaries = pd.DataFrame({
    'Name': ['BOSS', 'HanMeimei', 'HanMeimei', 'Han', 'BOSS', 'BOSS', 'HanMeimei', 'BOSS'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [40000, 5000, 4000, 3000, 38000, 42000, 6000, 39000],
    'Bonus': [12000, 3000, 3500, 1200, 16000, 18000, 7000, 21000]
})
salaries

   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017
groupby splits the data into groups according to the distinct values of a column.
group_by_name = salaries.groupby('Name')
type(group_by_name)
pandas.core.groupby.DataFrameGroupBy
for item in group_by_name:
    print(item)
    print(item[0])
    print(item[1])
    print("\n")
('BOSS', Bonus Name Salary Year
0 12000 BOSS 40000 2016
4 16000 BOSS 38000 2017
5 18000 BOSS 42000 2017
7 21000 BOSS 39000 2017)
BOSS
Bonus Name Salary Year
0 12000 BOSS 40000 2016
4 16000 BOSS 38000 2017
5 18000 BOSS 42000 2017
7 21000 BOSS 39000 2017
('Han',    Bonus Name  Salary  Year
3   1200  Han    3000  2016)
Han
   Bonus Name  Salary  Year
3   1200  Han    3000  2016

('HanMeimei',    Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017)
HanMeimei
   Bonus       Name  Salary  Year
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
6   7000  HanMeimei    6000  2017
After grouping with groupby, you can run aggregate statistics on each group.
Sum within each group
group_by_name.sum()

           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
Han         1200    3000  2016
HanMeimei  13500   15000  6049

group_by_name.mean()

             Bonus   Salary         Year
Name
BOSS       16750.0  39750.0  2016.750000
Han         1200.0   3000.0  2016.000000
HanMeimei   4500.0   5000.0  2016.333333
Selecting a few columns to aggregate
group_by_name[['Bonus', 'Salary']].sum()

           Bonus  Salary
Name
BOSS       67000  159000
Han         1200    3000
HanMeimei  13500   15000
Controlling the order of groups: with sort=False, groups keep the order in which they first appear instead of being sorted by Name.
salaries.groupby('Name', sort=False).agg(sum)

           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016

salaries.groupby('Name', sort=False).sum()

           Bonus  Salary  Year
Name
BOSS       67000  159000  8067
HanMeimei  13500   15000  6049
Han         1200    3000  2016

salaries.groupby('Name').median()

           Bonus  Salary  Year
Name
BOSS       17000   39500  2017
Han         1200    3000  2016
HanMeimei   3500    5000  2016
Frequency: how many times each value appears
salaries.groupby('Name').size()
Name
BOSS 4
Han 1
HanMeimei 3
dtype: int64
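Beyond a single function, `agg` also accepts a dict mapping column names to aggregations, so each column can get a different statistic. A small sketch reusing the salaries table from above:

```python
import pandas as pd

salaries = pd.DataFrame({
    'Name': ['BOSS', 'HanMeimei', 'HanMeimei', 'Han', 'BOSS', 'BOSS', 'HanMeimei', 'BOSS'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [40000, 5000, 4000, 3000, 38000, 42000, 6000, 39000],
    'Bonus': [12000, 3000, 3500, 1200, 16000, 18000, 7000, 21000]
})

# different aggregation per column: total salary, but maximum bonus
stats = salaries.groupby('Name').agg({'Salary': 'sum', 'Bonus': 'max'})
print(stats)
```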
salaries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Bonus 8 non-null int64
Name 8 non-null object
Salary 8 non-null int64
Year 8 non-null int64
dtypes: int64(3), object(1)
memory usage: 336.0+ bytes
salaries.describe()

              Bonus        Salary         Year
count      8.000000      8.000000     8.000000
mean   10212.500000  22125.000000  2016.500000
std     7581.262145  18893.971072     0.534522
min     1200.000000   3000.000000  2016.000000
25%     3375.000000   4750.000000  2016.000000
50%     9500.000000  22000.000000  2016.500000
75%    16500.000000  39250.000000  2017.000000
max    21000.000000  42000.000000  2017.000000
Describe the statistics within each group
salaries.groupby('Name').describe()

           Bonus                                                              ...
           count     mean          std      min      25%      50%      75%      max  ...
Name                                                                          ...
BOSS         4.0  16750.0  3774.917218  12000.0  15000.0  17000.0  18750.0  21000.0  ...
Han          1.0   1200.0          NaN   1200.0   1200.0   1200.0   1200.0   1200.0  ...
HanMeimei    3.0   4500.0  2179.449472   3000.0   3250.0   3500.0   5250.0   7000.0  ...

3 rows × 24 columns (Salary and Year statistics truncated)
salaries.groupby('Name')[['Bonus', 'Salary']].agg(['sum', 'mean', 'std', 'median'])

           Bonus                               Salary
             sum   mean          std  median      sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000   159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200     3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500    15000   5000  1000.000000    5000

salaries.groupby('Name')[['Bonus', 'Salary']].agg([np.sum, np.mean, np.std, np.median])

           Bonus                               Salary
             sum   mean          std  median      sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000   159000  39750  1707.825128   39500
Han         1200   1200          NaN    1200     3000   3000          NaN    3000
HanMeimei  13500   4500  2179.449472    3500    15000   5000  1000.000000    5000

salaries.groupby('Name')[['Bonus']].agg([np.sum, np.mean, np.std, np.median])

           Bonus
             sum   mean          std  median
Name
BOSS       67000  16750  3774.917218   17000
Han         1200   1200          NaN    1200
HanMeimei  13500   4500  2179.449472    3500
type(salaries['Bonus'])
pandas.core.series.Series
type(salaries[['Bonus']])
pandas.core.frame.DataFrame
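The distinction above matters whenever you select columns: single brackets return a 1-D Series, while a list of column names (double brackets) returns a DataFrame. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Bonus': [12000, 3000], 'Salary': [40000, 5000]})

s = df['Bonus']       # single brackets -> 1-D Series
sub = df[['Bonus']]   # list of columns -> DataFrame (even with one column)

print(type(s).__name__)    # Series
print(type(sub).__name__)  # DataFrame
```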
Transformation / transform
nvda = pd.read_csv('1_pandas_part2_data/pandas_part2_data/NVDA.csv', index_col=0, parse_dates=['Date'])
nvda.head()

                Open      High       Low     Close  Adj Close    Volume
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000

nvda.loc[:, 'year'] = nvda.index.year
nvda.head()

                Open      High       Low     Close  Adj Close    Volume  year
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999
Calling unique() on a Series lists the distinct values in a column.
Calling value_counts() on a Series lists the distinct values together with how many times each appears.
How many distinct values does a column have => unique
nvda['year'].unique()
array([1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], dtype=int64)
How many distinct values does a column have, and how many times does each appear => value_counts()
nvda['year'].value_counts()
2008 253
2015 252
2011 252
2004 252
2016 252
2005 252
2009 252
2013 252
2000 252
2002 252
2010 252
2014 252
2003 252
2006 251
2007 251
2012 250
2001 248
1999 239
2017 138
Name: year, dtype: int64
tmp = nvda['year'].value_counts().to_frame()
tmp = tmp.reset_index()
tmp.columns = ['year', 'count']
tmp.head()

   year  count
0  2008    253
1  2015    252
2  2011    252
3  2004    252
4  2016    252

tmp.sort_values(by='year').head()

    year  count
17  1999    239
8   2000    252
16  2001    248
9   2002    252
12  2003    252
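The to_frame()/reset_index() round trip above works, but value_counts() already returns a Series indexed by the distinct values, so sort_index() gives the year order in one step. A sketch on a small synthetic Series:

```python
import pandas as pd

years = pd.Series([1999, 2000, 2000, 2001, 2001, 2001], name='year')

# value_counts() is a Series indexed by the distinct values, sorted by
# count; sort_index() re-orders it by year instead
counts = years.value_counts().sort_index()
print(counts)
```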
nvda.head()

                Open      High       Low     Close  Adj Close    Volume  year
Date
1999-01-22  1.750000  1.953125  1.552083  1.640625   1.523430  67867200  1999
1999-01-25  1.770833  1.833333  1.640625  1.812500   1.683028  12762000  1999
1999-01-26  1.833333  1.869792  1.645833  1.671875   1.552448   8580000  1999
1999-01-27  1.677083  1.718750  1.583333  1.666667   1.547611   6109200  1999
1999-01-28  1.666667  1.677083  1.651042  1.661458   1.542776   5688000  1999

nvda.groupby('year').agg(['mean', 'std'])

            Open                     High                    ...        Volume
            mean        std          mean        std         ...          mean           std
year                                                         ...
1999        1.950782    0.588882     2.007317    0.614302    ...  6.433220e+06  8.142949e+06
2000        8.781084    2.999908     9.222697    3.114186    ...  1.104182e+07  7.985374e+06
...
2016       53.630833   21.714540    54.415397   22.182621    ...  1.107062e+07  7.547056e+06
2017      120.481305   22.027821   122.300725   22.510244    ...  1.907742e+07  1.073342e+07

19 rows × 12 columns (Low, Close, and Adj Close statistics truncated)
def my_transform(x):
    return (x - x.mean()) / x.std()

tmp_arr = np.array(range(10))
tmp_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
my_transform(tmp_arr)
array([-1.5666989 , -1.21854359, -0.87038828, -0.52223297, -0.17407766,
0.17407766, 0.52223297, 0.87038828, 1.21854359, 1.5666989 ])
transformed = nvda.groupby('year').transform(my_transform)
transformed.head()

                Open      High       Low     Close  Adj Close    Volume
Date
1999-01-22 -0.340955 -0.088217 -0.579850 -0.510124  -0.510124  7.544438
1999-01-25 -0.305578 -0.283222 -0.424964 -0.224161  -0.224161  0.777210
1999-01-26 -0.199444 -0.223871 -0.415854 -0.458130  -0.458131  0.263637
1999-01-27 -0.464778 -0.469747 -0.525185 -0.466795  -0.466798 -0.039791
1999-01-28 -0.482465 -0.537575 -0.406741 -0.475462  -0.475461 -0.091517

%matplotlib inline
compare_df = pd.DataFrame({'Origin': nvda['Adj Close'], 'Transformed': transformed['Adj Close']})
compare_df.head()

              Origin  Transformed
Date
1999-01-22  1.523430    -0.510124
1999-01-25  1.683028    -0.224161
1999-01-26  1.552448    -0.458131
1999-01-27  1.547611    -0.466798
1999-01-28  1.542776    -0.475461

compare_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2354e970c18>
compare_df['Transformed'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x235517b0978>
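The key property of transform, shown here on a small synthetic frame: the result keeps the same length and index as the input, with the statistics computed inside each group:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [1.0, 3.0, 10.0, 30.0]})

def standardize(x):
    # z-score computed within a single group
    return (x - x.mean()) / x.std()

# one output row per input row, unlike agg which returns one row per group
z = df.groupby('g')['v'].transform(standardize)
print(z.round(4).tolist())  # [-0.7071, 0.7071, -0.7071, 0.7071]
```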
Transforming columns with apply
salaries

   Bonus       Name  Salary  Year
0  12000       BOSS   40000  2016
1   3000  HanMeimei    5000  2016
2   3500  HanMeimei    4000  2016
3   1200        Han    3000  2016
4  16000       BOSS   38000  2017
5  18000       BOSS   42000  2017
6   7000  HanMeimei    6000  2017
7  21000       BOSS   39000  2017

salaries.loc[:, 'tmp_col'] = (salaries['Salary'] * 2 - 1500) / 0.8
salaries

   Bonus       Name  Salary  Year   tmp_col
0  12000       BOSS   40000  2016   98125.0
1   3000  HanMeimei    5000  2016   10625.0
2   3500  HanMeimei    4000  2016    8125.0
3   1200        Han    3000  2016    5625.0
4  16000       BOSS   38000  2017   93125.0
5  18000       BOSS   42000  2017  103125.0
6   7000  HanMeimei    6000  2017   13125.0
7  21000       BOSS   39000  2017   95625.0

def trans(x):
    return (x * 2 - 1500) / 0.8

salaries.loc[:, 'tmp_col2'] = salaries['Salary'].apply(trans)
salaries

   Bonus       Name  Salary  Year   tmp_col  tmp_col2
0  12000       BOSS   40000  2016   98125.0   98125.0
1   3000  HanMeimei    5000  2016   10625.0   10625.0
2   3500  HanMeimei    4000  2016    8125.0    8125.0
3   1200        Han    3000  2016    5625.0    5625.0
4  16000       BOSS   38000  2017   93125.0   93125.0
5  18000       BOSS   42000  2017  103125.0  103125.0
6   7000  HanMeimei    6000  2017   13125.0   13125.0
7  21000       BOSS   39000  2017   95625.0   95625.0
def trans2(x):
    if x == 'Han':
        return 'HanXiaoyang'
    else:
        return x

salaries.loc[:, 'full_name'] = salaries['Name'].apply(trans2)
salaries

   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS
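For simple value substitutions like trans2, `Series.replace` with a dict is a common shortcut: exact matches are swapped and everything else passes through unchanged. A small sketch:

```python
import pandas as pd

names = pd.Series(['BOSS', 'Han', 'HanMeimei'])

# replace() only matches whole values, so 'HanMeimei' is left untouched
full = names.replace({'Han': 'HanXiaoyang'})
print(full.tolist())  # ['BOSS', 'HanXiaoyang', 'HanMeimei']
```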
help(pd.Series.apply)
Help on function apply in module pandas.core.series:
apply(self, func, convert_dtype=True, args=(), **kwds)
Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only works
on single values
Parameters
----------
func : function
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If
False, leave as dtype=object
args : tuple
Positional arguments to pass to function in addition to the value
Additional keyword arguments will be passed as keywords to the function
Returns
-------
y : Series or DataFrame if func returns a Series
See also
--------
Series.map: For element-wise operations
Series.agg: only perform aggregating type operations
Series.transform: only perform transformating type operations
Examples
--------
Create a series with typical summer temperatures for each city.
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London 20
New York 21
Helsinki 12
dtype: int64
Square the values by defining a function and passing it as an
argument to ``apply()``.
>>> def square(x):
... return x**2
>>> series.apply(square)
London 400
New York 441
Helsinki 144
dtype: int64
Square the values by passing an anonymous function as an
argument to ``apply()``.
>>> series.apply(lambda x: x**2)
London 400
New York 441
Helsinki 144
dtype: int64
Define a custom function that needs additional positional
arguments and pass these additional arguments using the
``args`` keyword.
>>> def subtract_custom_value(x, custom_value):
... return x-custom_value
>>> series.apply(subtract_custom_value, args=(5,))
London 15
New York 16
Helsinki 7
dtype: int64
Define a custom function that takes keyword arguments
and pass these arguments to ``apply``.
>>> def add_custom_values(x, **kwargs):
... for month in kwargs:
... x+=kwargs[month]
... return x
>>> series.apply(add_custom_values, june=30, july=20, august=25)
London 95
New York 96
Helsinki 87
dtype: int64
Use a function from the Numpy library.
>>> series.apply(np.log)
London 2.995732
New York 3.044522
Helsinki 2.484907
dtype: float64
salaries

   Bonus       Name  Salary  Year   tmp_col  tmp_col2    full_name
0  12000       BOSS   40000  2016   98125.0   98125.0         BOSS
1   3000  HanMeimei    5000  2016   10625.0   10625.0    HanMeimei
2   3500  HanMeimei    4000  2016    8125.0    8125.0    HanMeimei
3   1200        Han    3000  2016    5625.0    5625.0  HanXiaoyang
4  16000       BOSS   38000  2017   93125.0   93125.0         BOSS
5  18000       BOSS   42000  2017  103125.0  103125.0         BOSS
6   7000  HanMeimei    6000  2017   13125.0   13125.0    HanMeimei
7  21000       BOSS   39000  2017   95625.0   95625.0         BOSS
gender = 'male'
sex = '男' if gender == 'male' else '女'
sex
'男'
Anonymous functions with lambda
salaries.loc[:, 'new_name'] = salaries['Name'].apply(lambda x: 'HanXiaoyang' if x == 'Han' else x)
Operating on several columns at once

def my_fun(name, salary, bonus):
    if name == 'BOSS':
        return salary + bonus
    else:
        return salary

salaries.loc[:, 'my_s_result'] = list(map(lambda x, y, z: my_fun(x, y, z),
                                          salaries['Name'],
                                          salaries['Salary'],
                                          salaries['Bonus']))
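The same multi-column logic can also be written with `DataFrame.apply(axis=1)`, which hands each row to the function as a Series; this is often slower than mapping over columns, but easier to read. A sketch on a two-row slice of the data:

```python
import pandas as pd

salaries = pd.DataFrame({
    'Name': ['BOSS', 'Han'],
    'Salary': [40000, 3000],
    'Bonus': [12000, 1200],
})

def my_fun(row):
    # total compensation for the boss, base salary for everyone else
    if row['Name'] == 'BOSS':
        return row['Salary'] + row['Bonus']
    return row['Salary']

# axis=1 passes rows (as Series) instead of columns to the function
salaries['my_s_result'] = salaries.apply(my_fun, axis=1)
print(salaries['my_s_result'].tolist())  # [52000, 3000]
```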
Summary
Iterate over the contents of each group produced by groupby
Run aggregations after grouping: agg([np.sum, 'median', 'std'])
Use describe and transform on a groupby result
Use apply to transform a column (by defining a function)
Extra: to transform several columns at once, map(lambda x, y, z, a: my_fun(x, y, z, a), df['x'], df['y'], ...)
Concatenating and merging data
df1 = pd.DataFrame({'apts': [55000, 60000], 'cars': [200000, 300000]}, index=['Shanghai', 'Beijing'])
df1

           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
df2 = pd.DataFrame({'apts': [35000, 45000], 'cars': [150000, 180000]}, index=['Hangzhou', 'Guangzhou'])
df2

            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000
df3 = pd.DataFrame({'apts': [30000, 10000], 'cars': [120000, 100000]}, index=['Nanjing', 'Chongqing'])
df3

            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000
result = pd.concat([df1, df2, df3])
result

            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000

pd.concat([df1, df2, df3], axis=1)

              apts      cars     apts      cars     apts      cars
Beijing    60000.0  300000.0      NaN       NaN      NaN       NaN
Chongqing      NaN       NaN      NaN       NaN  10000.0  100000.0
Guangzhou      NaN       NaN  45000.0  180000.0      NaN       NaN
Hangzhou       NaN       NaN  35000.0  150000.0      NaN       NaN
Nanjing        NaN       NaN      NaN       NaN  30000.0  120000.0
Shanghai   55000.0  200000.0      NaN       NaN      NaN       NaN

pd.concat([df1, df2, df3], axis=0)

            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000

df1.append(df1)

           apts    cars
Shanghai  55000  200000
Beijing   60000  300000
Shanghai  55000  200000
Beijing   60000  300000

(Note: DataFrame.append was removed in pandas 2.0; pd.concat([df1, df1]) produces the same result.)
result
            apts    cars
Shanghai   55000  200000
Beijing    60000  300000
Hangzhou   35000  150000
Guangzhou  45000  180000
Nanjing    30000  120000
Chongqing  10000  100000
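One more concat option worth knowing: `keys=` labels each source frame, producing a two-level row index so you can still tell which frame a row came from. A sketch with the same df1 and df2 as above:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000], 'cars': [200000, 300000]},
                   index=['Shanghai', 'Beijing'])
df2 = pd.DataFrame({'apts': [35000, 45000], 'cars': [150000, 180000]},
                   index=['Hangzhou', 'Guangzhou'])

# keys= tags every source frame; the result has a MultiIndex on the rows
labeled = pd.concat([df1, df2], keys=['first', 'second'])
print(labeled.loc['first'])  # just the rows that came from df1
```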
Merging / merge

def my_trans_apts(x):
    if x < 45000:
        return 45000
    else:
        return 60000

result.loc[:, 'new_apts'] = result['apts'].apply(my_trans_apts)
result

            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000
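For threshold-based binning like my_trans_apts, pandas also offers `pd.cut`; a sketch reproducing the same 45000/60000 split with explicit bin edges:

```python
import pandas as pd

apts = pd.Series([55000, 60000, 35000, 45000, 30000, 10000])

# with right=False the bins are [0, 45000) and [45000, 10**9),
# which matches the "x < 45000 -> 45000, else 60000" rule above
new_apts = pd.cut(apts, bins=[0, 45000, 10**9], labels=[45000, 60000], right=False)
print(new_apts.tolist())
```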
new_df = pd.DataFrame({'new_apts': [60000, 45000], 'bonus': [100000, 50000]})
new_df

    bonus  new_apts
0  100000     60000
1   50000     45000

pd.merge(result, new_df, on=['new_apts'], how='inner')

    apts    cars  new_apts   bonus
0  55000  200000     60000  100000
1  60000  300000     60000  100000
2  45000  180000     60000  100000
3  35000  150000     45000   50000
4  30000  120000     45000   50000
5  10000  100000     45000   50000
new_df2 = pd.DataFrame({'new_apts': [65000, 45000], 'bonus': [100000, 50000]})
new_df2

    bonus  new_apts
0  100000     65000
1   50000     45000

result

            apts    cars  new_apts
Shanghai   55000  200000     60000
Beijing    60000  300000     60000
Hangzhou   35000  150000     45000
Guangzhou  45000  180000     60000
Nanjing    30000  120000     45000
Chongqing  10000  100000     45000

pd.merge(result, new_df2, on=['new_apts'], how='left')

    apts    cars  new_apts    bonus
0  55000  200000     60000      NaN
1  60000  300000     60000      NaN
2  35000  150000     45000  50000.0
3  45000  180000     60000      NaN
4  30000  120000     45000  50000.0
5  10000  100000     45000  50000.0

pd.merge(result, new_df2, on=['new_apts'], how='right')

      apts      cars  new_apts   bonus
0  35000.0  150000.0     45000   50000
1  30000.0  120000.0     45000   50000
2  10000.0  100000.0     45000   50000
3      NaN       NaN     65000  100000
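Besides 'inner', 'left', and 'right', merge also supports how='outer', and indicator=True adds a _merge column showing which side each row came from. A sketch with the same key values as above:

```python
import pandas as pd

left = pd.DataFrame({'new_apts': [60000, 45000], 'cars': [200000, 150000]})
right = pd.DataFrame({'new_apts': [65000, 45000], 'bonus': [100000, 50000]})

# outer join keeps keys from both sides; unmatched cells become NaN,
# and _merge records 'left_only', 'right_only', or 'both' per row
merged = pd.merge(left, right, on='new_apts', how='outer', indicator=True)
print(merged)
```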
df1

           apts    cars
Shanghai  55000  200000
Beijing   60000  300000

df2

            apts    cars
Hangzhou   35000  150000
Guangzhou  45000  180000

df3

            apts    cars
Nanjing    30000  120000
Chongqing  10000  100000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]}, index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4

           salaries
Suzhou        10000
Beijing       30000
Shanghai      30000
Guangzhou     20000
Tianjin       15000
join merges data based on the index
df1.join(df4)

           apts    cars  salaries
Shanghai  55000  200000     30000
Beijing   60000  300000     30000

df2.join(df4)

            apts    cars  salaries
Hangzhou   35000  150000       NaN
Guangzhou  45000  180000   20000.0
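join accepts the same how= options as merge; a sketch showing how='outer' on index-aligned frames (reduced versions of df2 and df4 recreated here):

```python
import pandas as pd

df2 = pd.DataFrame({'apts': [35000, 45000]}, index=['Hangzhou', 'Guangzhou'])
df4 = pd.DataFrame({'salaries': [20000, 15000]}, index=['Guangzhou', 'Tianjin'])

# join defaults to how='left'; how='outer' keeps every index label
# from both frames, filling the gaps with NaN
outer = df2.join(df4, how='outer')
print(outer)
```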
help(pd.DataFrame.join)
Help on function join in module pandas.core.frame:
join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Join columns with other DataFrame either on index or on a key
column. Efficiently Join multiple DataFrame objects by index at once by
passing a list.
Parameters
----------
other : DataFrame, Series with name field set, or list of DataFrame
Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame
on : column name, tuple/list of column names, or array-like
Column(s) in the caller to join on the index in other,
otherwise joins index-on-index. If multiples
columns given, the passed DataFrame must have a MultiIndex. Can
pass an array as the join key if not already contained in the
calling DataFrame. Like an Excel VLOOKUP operation
how : {'left', 'right', 'outer', 'inner'}, default: 'left'
How to handle the operation of the two objects.
* left: use calling frame's index (or column if on is specified)
* right: use other frame's index
* outer: form union of calling frame's index (or column if on is
specified) with other frame's index, and sort it
lexicographically
* inner: form intersection of calling frame's index (or column if
on is specified) with other frame's index, preserving the order
of the calling's one
lsuffix : string
Suffix to use from left frame's overlapping columns
rsuffix : string
Suffix to use from right frame's overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False,
the order of the join key depends on the join type (how keyword)
Notes
-----
on, lsuffix, and rsuffix options are not supported when passing a list
of DataFrame objects
Examples
--------
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller
A key
0 A0 K0
1 A1 K1
2 A2 K2
3 A3 K3
4 A4 K4
5 A5 K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
... 'B': ['B0', 'B1', 'B2']})
>>> other
B key
0 B0 K0
1 B1 K1
2 B2 K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
>>> A key_caller B key_other
0 A0 K0 B0 K0
1 A1 K1 B1 K1
2 A2 K2 B2 K2
3 A3 K3 NaN NaN
4 A4 K4 NaN NaN
5 A5 K5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.
>>> caller.set_index('key').join(other.set_index('key'))
>>> A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
Another option to join using the key columns is to use the on
parameter. DataFrame.join always uses other's index but we can use any
column in the caller. This method preserves the original caller's
index in the result.
>>> caller.join(other.set_index('key'), on='key')
>>> A key B
0 A0 K0 B0
1 A1 K1 B1
2 A2 K2 B2
3 A3 K3 NaN
4 A4 K4 NaN
5 A5 K5 NaN
See also -------- DataFrame.merge : For column(s)-on-columns(s) operations
Returns
-------
joined : DataFrame
Summary
concat: concatenation; the axis argument selects the dimension to concatenate along
merge: join tables on one or more key columns
join: merge data based on the index
Mini case study: analyzing bike rental data
bikes = pd.read_csv('1_pandas_part2_data/pandas_part2_data/bikes.csv', sep=';', encoding='latin1',
                    parse_dates=['Date'], index_col='Date')
bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 310 entries, 2012-01-01 to 2012-05-11
Data columns (total 9 columns):
Berri 1 310 non-null int64
Brébeuf (données non disponibles) 0 non-null float64
Côte-Sainte-Catherine 310 non-null int64
Maisonneuve 1 310 non-null int64
Maisonneuve 2 310 non-null int64
du Parc 310 non-null int64
Pierre-Dupuy 310 non-null int64
Rachel1 310 non-null int64
St-Urbain (données non disponibles) 0 non-null float64
dtypes: float64(2), int64(7)
memory usage: 24.2 KB
bikes.head()

            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  St-Urbain (données non disponibles)
Date
2012-01-01       35                                NaN                      0             38             51       26            10       16                                  NaN
2012-02-01       83                                NaN                      1             68            153       53             6       43                                  NaN
2012-03-01      135                                NaN                      2            104            248       89             3       58                                  NaN
2012-04-01      144                                NaN                      1            116            318      111             8       61                                  NaN
2012-05-01      197                                NaN                      2            124            330       97            13       95                                  NaN
bikes.shape
(310, 9)
bikes.dropna()

Empty DataFrame
Columns: [Berri 1, Brébeuf (données non disponibles), Côte-Sainte-Catherine, Maisonneuve 1, Maisonneuve 2, du Parc, Pierre-Dupuy, Rachel1, St-Urbain (données non disponibles)]
Index: []

(The two "données non disponibles" columns are entirely NaN, so every row contains at least one NaN and dropna() with default settings removes them all.)
bikes.dropna(axis=1, how='all').head()

            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
Date
2012-01-01       35                      0             38             51       26            10       16
2012-02-01       83                      1             68            153       53             6       43
2012-03-01      135                      2            104            248       89             3       58
2012-04-01      144                      1            116            318      111             8       61
2012-05-01      197                      2            124            330       97            13       95
bikes.shape
(310, 9)
bikes.dropna(axis=1, how='all', inplace=True)
bikes.shape
(310, 7)
bikes.loc[:, 'weekday'] = bikes.index.weekday
bikes.head()

            Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1  weekday
Date
2012-01-01       35                      0             38             51       26            10       16        6
2012-02-01       83                      1             68            153       53             6       43        2
2012-03-01      135                      2            104            248       89             3       58        3
2012-04-01      144                      1            116            318      111             8       61        6
2012-05-01      197                      2            124            330       97            13       95        1
weekday_counts = bikes.groupby('weekday').agg(sum)
weekday_counts.head()

         Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  Pierre-Dupuy  Rachel1
weekday
0         132446                  57940          90828         163042    89338         41524   126215
1         119895                  52113          80865         145389    79585         35967   114622
2         146785                  64189          99674         177105    96340         45103   130796
3         147630                  61855         102801         177285    93386         46600   135268
4         150183                  61432         102317         181651    95731         47272   143115
%matplotlib inline
weekday_counts['Berri 1'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f157e27c4e0>
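The weekday axis 0 to 6 is hard to read on the chart; relabeling the index with day names before plotting fixes that. A sketch on a synthetic stand-in (the real bikes.csv is not needed to show the pattern):

```python
import pandas as pd
import numpy as np

# synthetic daily counts: 14 days starting on a Sunday (2012-01-01)
rng = pd.date_range('2012-01-01', periods=14, freq='D')
bikes = pd.DataFrame({'Berri 1': np.arange(14)}, index=rng)

bikes['weekday'] = bikes.index.weekday
weekday_counts = bikes.groupby('weekday')['Berri 1'].sum()

# weekday 0..6 means Monday..Sunday; readable labels make the bar chart clearer
weekday_counts.index = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
print(weekday_counts)
# weekday_counts.plot(kind='bar') would now show named weekdays on the x-axis
```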